Neural Networks
Summary
This document provides an introduction to neural networks, from basic concepts to deep networks. The presentation covers activation functions, derivatives of activation functions, forward and backward propagation, and introduces the vanishing gradient problem. It also incorporates questions for further exploration.
Full Transcript
NEURAL NETWORKS: Basics to Deep Networks

A TWO LAYER NEURAL NET
▪ We will use an alternative representation: a[0] = X (a also stands for activation).
▪ [Figure: a two-layer network with an input layer, a hidden layer of four units a[1]_1 … a[1]_4, and an output layer.]

NN REPRESENTATION
▪ Each node in the hidden layer computes z = w^T x + b and then a = σ(z).
▪ Vectorizing, we get W[1], with all the w vectors stacked, as a 4*3 matrix and b[1] as a 4*1 matrix.

COMPUTING THE OUTPUT
▪ Hence, for the first layer of the NN, we can have the vectorized implementation. Given input x:
  z[1] = W[1] x + b[1], a[1] = σ(z[1])
  z[2] = W[2] a[1] + b[2], a[2] = σ(z[2])

VECTORIZING ACROSS MULTIPLE EXAMPLES
▪ For the two-layer NN, taking the notation a[l](i), where [] denotes the layer and () denotes the i-th example, we can vectorize across multiple examples: instead of looping "for i = 1 to m", stack the examples as columns so that Z[1] = W[1] X + b[1] and A[1] = g(Z[1]).
▪ The entry in row 1, column 1 of A[1] is the activation of the 1st hidden unit on the 1st training example, and so on (as sketched in the NumPy example at the end of this part).

OTHER ACTIVATION FUNCTIONS
▪ tanh (hyperbolic tangent) function: a shifted version of the sigmoid function.
▪ Works better than the sigmoid function because, with values between -1 and +1, the mean of the activations is closer to zero.
▪ This makes learning for the next layer a little bit easier.
▪ Exception?? The output-layer activation of a binary classifier is still a sigmoid function. WHY?
▪ Compare the sigmoid and tanh functions: their similarities and differences.

VANISHING GRADIENT PROBLEM
▪ In back propagation, the new weight of a node is calculated from the old weight and the product of the learning rate and the gradient of the loss function.
▪ With the chain rule of partial derivatives, the gradient of the loss function can be written as a product of the gradients of the activation functions of the nodes with respect to their weights.
▪ Hence, the updated weights of nodes in the network depend on the gradients of the activation functions of each node.
▪ When there are more layers in the network, this product of derivatives keeps shrinking, until at some point the partial derivative of the loss function approaches a value close to zero and effectively vanishes. This is the vanishing gradient problem.

OVERCOMING THE VANISHING GRADIENT PROBLEM
▪ The vanishing gradient problem is caused by the derivative of the activation function used in the network.
▪ A solution is to replace activation functions like sigmoid or tanh with ReLU.
▪ Rectified Linear Units (ReLU) are activation functions that return a positive linear output for positive input values. If the input is negative, the function returns zero.
▪ If ReLU is used for activation in place of a sigmoid, the derivative of the activation is either 0 or 1, so the terms in the chain-rule product do not keep shrinking and the gradient is prevented from vanishing.
▪ The problem with ReLU arises when the gradient has a value of 0. In such cases the node is considered a dead node, since the old and new values of the weights remain the same.
▪ Another technique to avoid the vanishing gradient problem is careful weight initialization.

OTHER ACTIVATION FUNCTIONS
▪ ReLU (Rectified Linear Unit) function: a = max(0, z).
▪ The derivative (slope) is 1 as long as z is positive and 0 when z is negative.

QUESTIONS
▪ What is the problem with the tanh and sigmoid functions, for which we rely on ReLU?
▪ When to use which activation function?
▪ Why do we need non-linear activation functions? How does a linear activation function, or not having any activation function, affect the layers of a Neural Network?
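A minimal NumPy sketch of the vectorized two-layer forward pass described above, assuming a toy network with 3 input features, 4 tanh hidden units and a single sigmoid output; the layer sizes, random data and variable names are illustrative only, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
m = 5                       # number of training examples
X = np.random.randn(3, m)   # A[0] = X, shape (n[0], m) with n[0] = 3 (assumed)

# Parameters: W[l] has shape (n[l], n[l-1]), b[l] has shape (n[l], 1)
W1 = np.random.randn(4, 3) * 0.01   # hidden layer with n[1] = 4 units
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.01   # output layer with n[2] = 1 unit
b2 = np.zeros((1, 1))

# Vectorized forward pass across all m examples at once
Z1 = W1 @ X + b1      # (4, m)
A1 = np.tanh(Z1)      # A1[0, 0] is the 1st hidden unit on the 1st example
Z2 = W2 @ A1 + b2     # (1, m)
A2 = sigmoid(Z2)      # sigmoid output, as used for binary classification
```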
DERIVATIVES OF ACTIVATION FUNCTIONS

DERIVATIVE OF THE SIGMOID FUNCTION
▪ For g(z) = 1 / (1 + e^-z): g'(z) = g(z)(1 - g(z)) = a(1 - a).

DERIVATIVE OF THE TANH FUNCTION
▪ For g(z) = tanh(z): g'(z) = 1 - tanh^2(z) = 1 - a^2.

DERIVATIVE OF THE RELU FUNCTION
▪ For g(z) = max(0, z): g'(z) = 1 if z > 0 and 0 if z < 0 (at z = 0 the derivative is undefined and is conventionally set to 0 or 1).

BACKPROPAGATION EQUATIONS FOR A TWO LAYERED NN
▪ Calculate dz[2], dW[2], db[2], dz[1], dW[1], db[1]. (Home Task)

HOW TO INITIALIZE WEIGHTS?
▪ For Logistic Regression, weights can be initialized to 0.
▪ What happens if we initialize weights to 0 for a NN?
▪ Solution: Random Initialization.
▪ Function: np.random.randn()
▪ Does initialization to 0 affect b?

▪ By now we have seen most of the ideas needed to implement a deep neural network:
▪ Forward and backward propagation in the context of a neural network with a single hidden layer.
▪ Logistic regression.
▪ Vectorization.
▪ Why it is important to initialize the weights randomly.

WHAT IS A DEEP NEURAL NETWORK
▪ [Figure: example architectures ranging from shallow (logistic regression, one hidden layer) to deep (many hidden layers).]

NEURAL NETWORKS NOTATIONS
▪ Layers are numbered 0 (the input layer) through L; the example network has layers 0, 1, 2, 3, 4.
▪ n[l] denotes the number of units in layer l and a[l] the activations of layer l.

FORWARD PROPAGATION IN DEEP NETS
▪ for l = 1…4: z[l] = W[l] a[l-1] + b[l], a[l] = g[l](z[l]).

MATRICES AND THEIR DIMENSIONS
▪ Example network with layers 0…5 and layer sizes n[0] = n_x = 2, n[1] = 3, n[2] = 5, n[3] = 4, n[4] = 2, n[5] = 1.
▪ Now let's discuss the dimensions of Z, W and X.
▪ z[1] is (3*1) or (n[1] * 1); x is (2*1) or (n[0] * 1). So what is the dimension of W[1]? W[1] is (3*2) or (n[1] * n[0]).
▪ So W[2]? W[2] is (5*3) or (n[2] * n[1]), since z[2] (5*1) = W[2] * a[1] (3*1) + b[2].
▪ W[3]: (4,5), W[4]: (2,4), W[5]: (1,2)… so W[l]: (n[l], n[l-1]).
▪ b[1]: (n[1], 1), b[2]: (n[2], 1), … so b[l]: (n[l], 1).

DIMENSIONS FOR VECTORIZED IMPLEMENTATIONS
▪ Vectorizing over m examples, Z[l] and A[l] become (n[l], m) matrices, while W[l] stays (n[l], n[l-1]) and b[l] stays (n[l], 1), broadcast across the m columns.

WHY DEEP NETWORKS ARE BETTER - INTUITIONS ON DEEP NETWORKS
▪ Informally: there are functions that can be computed with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

FORWARD AND BACKWARD FUNCTIONS
▪ For a certain layer l:
▪ Forward propagation – Input: a[l-1], Output: a[l].
▪ Backward propagation – Input: da[l], Output: da[l-1].

FORWARD AND BACKWARD FUNCTIONS
▪ [Figure: for layer l, the forward block takes a[l-1], uses w[l], b[l], outputs a[l], and needs to cache z[l]; the backward block takes da[l] together with the cached values and outputs da[l-1], dz[l], dW[l], db[l].]

FORWARD AND BACKWARD FUNCTIONS
▪ [Figure: chaining the forward blocks for layers 1…L (caching z[l] at each layer) to compute ŷ, then chaining the backward blocks in reverse to compute da[l], dz[l], dW[l], db[l] for every layer.]

SUMMARIZING
▪ Forward propagation for layer l
▪ Input: a[l-1]
▪ Output: a[l], cache (z[l], w[l], b[l])
▪ z[l] = w[l] a[l-1] + b[l], a[l] = g[l](z[l])

▪ Backward propagation for layer l
▪ Input: da[l]
▪ Output: da[l-1], dW[l], db[l]
▪ dz[l] = da[l] * g[l]'(z[l])
▪ dW[l] = dz[l] . a[l-1]T
▪ db[l] = dz[l]
▪ da[l-1] = W[l]T . dz[l]
▪ or, writing with respect to the earlier layer, dz[l] = W[l+1]T . dz[l+1] * g[l]'(z[l])

▪ Vectorized Equations (see the NumPy sketch at the end of the transcript):
▪ dZ[l] = dA[l] * g[l]'(Z[l])
▪ dW[l] = (1/m) dZ[l] . A[l-1]T
▪ db[l] = (1/m) np.sum(dZ[l], axis=1, keepdims=True)
▪ dA[l-1] = W[l]T . dZ[l]

SUMMARIZING
▪ [Figure: forward pass X → ReLU → ReLU → Sigmoid → ŷ, followed by backward propagation through the same layers in reverse.]

SUMMARIZING
▪ APPLIED DEEP LEARNING IS A VERY EMPIRICAL PROCESS.
▪ After experiments, we will find some value of the learning rate alpha that gives fast learning and allows convergence to a lower cost function J.

WHAT DOES ALL THIS HAVE TO DO WITH THE HUMAN BRAIN???
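The vectorized per-layer equations summarized above translate almost directly into NumPy. Below is a minimal sketch of one forward step and one backward step per layer, assuming ReLU hidden layers, a sigmoid output and a binary cross-entropy loss; the function names, cache layout and toy dimensions are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    # g'(z) = 1 for z > 0, 0 otherwise (slope at z = 0 set to 0 here)
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(A_prev, W, b, activation):
    """Forward step for one layer; caches what the backward step will need."""
    Z = W @ A_prev + b
    A = sigmoid(Z) if activation == "sigmoid" else relu(Z)
    return A, (A_prev, W, Z)

def layer_backward(dA, cache, activation):
    """Backward step for one layer, vectorized over m examples:
       dZ = dA * g'(Z), dW = (1/m) dZ A_prev.T,
       db = (1/m) sum(dZ), dA_prev = W.T dZ."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    if activation == "sigmoid":
        s = sigmoid(Z)
        dZ = dA * s * (1 - s)            # sigmoid: g'(z) = a(1 - a)
    else:
        dZ = dA * relu_derivative(Z)     # ReLU: g'(z) is 0 or 1
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

# Toy usage: one hidden ReLU layer feeding a sigmoid output
np.random.seed(1)
X = np.random.randn(2, 4)
Y = np.array([[0, 1, 0, 1]])
W1, b1 = np.random.randn(3, 2) * 0.01, np.zeros((3, 1))
W2, b2 = np.random.randn(1, 3) * 0.01, np.zeros((1, 1))

A1, cache1 = layer_forward(X, W1, b1, "relu")
A2, cache2 = layer_forward(A1, W2, b2, "sigmoid")

dA2 = -(Y / A2 - (1 - Y) / (1 - A2))     # derivative of cross-entropy loss wrt A2
dA1, dW2, db2 = layer_backward(dA2, cache2, "sigmoid")
dA0, dW1, db1 = layer_backward(dA1, cache1, "relu")
```

The cache stores A_prev, W and Z for each layer during the forward pass, mirroring the "need to cache z[l]" note on the slides, so that the backward pass can reuse them without recomputation.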