Deep Learning and Variants_Session 5_20240128.pdf
Document Details
Uploaded by PalatialRelativity
2024
Deep Learning & its variants
GGU DBA Deep Learning
Dr. Anand Jayaraman
Professor, upGrad; Chief Data Scientist, Agastya Data Solutions

2. Problems with Deep Networks
Issues while using more layers:
∙ Vanishing gradients: as we add more and more hidden layers, back-propagation becomes less and less useful in passing information to the lower layers. In effect, as information is passed back, the gradients begin to vanish and become small relative to the weights of the network.
∙ Difficulty in optimization: in deeper networks the parameter space is large, and the surface being optimized has a complex structure. Gradient descent becomes very inefficient.
∙ Overfitting: perhaps the central problem in machine learning. Briefly, overfitting describes the phenomenon of fitting the training data too closely, often with hypotheses that are too complex. In such a case, the learner fits the training data really well but performs much more poorly on real examples.

3. NURTURING A DEEP NEURAL NETWORK
Multiple approaches:
∙ Improve training: activation function, initialization, pre-processing, gradient descent
∙ Further accuracy: ensembles
∙ Minimize overfit: regularization

Nurturing a deep neural network
∙ Vanishing gradient solutions: better activation functions
∙ Improving upon gradient descent
∙ Weight initialization
∙ Regularization

To overcome vanishing gradients
∙ For shallow networks (1 or 2 hidden layers), random initialization is enough to learn well.
∙ We need better techniques to initialize the weights across the many layers of a deep network.
∙ Use better activation functions: ReLU and its variants.
∙ Initialize weights by random sampling from well-defined ranges.

ReLU
∙ Does not saturate in the positive region
∙ More biologically plausible
∙ Computationally faster in practice
∙ Built-in regularizer
∙ Most popular activation function for deep neural networks since 2012

Activation functions

Making gradient descent better
We noticed a couple of problems with gradient descent:
∙ A very small learning rate makes convergence too slow, and the search settles in the slightest of local minima.
∙ A very large learning rate makes the search bounce off the minimum.
∙ Even a suitable learning rate can settle in a local minimum. We need something like momentum or inertia that helps us break out of local minima.
∙ The solution is highly sensitive to the learning rate.

A QUICK REFRESHER ON GRADIENT DESCENT
Gradient or directional derivative
For $f : \mathbb{R}^n \to \mathbb{R}$:
$$\nabla f(x_1, \ldots, x_n) := \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
https://mathinsight.org/directional_derivative_gradient_introduction
Figure: the gradient of the function $f(x, y) = -(\cos^2 x + \cos^2 y)$. https://en.wikipedia.org/wiki/Gradient

Gradient descent algorithm:
Data: $x_0 \in \mathbb{R}^n$
Step 0: set $i = 0$
Step 1: if $\nabla f(x_i) = 0$, stop; else, compute the search direction $h_i = -\nabla f(x_i)$
Step 2: compute the learning rate $\alpha$
Step 3: set $x_{i+1} = x_i + \alpha h_i$, increment $i$, and go to step 1
ftp://ftp.unicauca.edu.co/Facultades/FIET/DEIC/Materias/computacion%20inteligente/parte%20II/semana11/gradient/sem_11_CI.ppt

Learning Rate
Source: https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0

During training: monitor the loss function
∙ The initial loss should be similar to what you expect the loss would be for a random guess.
∙ Then it should drop exponentially.

Learning rate decay
Each learning rate has some attractive properties. Decay is a way to use them all.
∙ Step decay: reduce the learning rate by half every few epochs
∙ Exponential decay: $\alpha = \alpha_0 e^{-kt}$
∙ 1/t decay: $\alpha = \dfrac{\alpha_0}{1 + kt}$
Here $k$ is the decay rate and $t$ is the epoch number; $k$ and $\alpha_0$ are tuned as hyperparameters.
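The gradient descent steps and the three decay schedules above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names, the `every` and `tol` parameters, and the toy objective f(x) = (x - 3)^2 are all assumptions chosen for the example, and the exact-zero stopping test of Step 1 is relaxed to a small tolerance.

```python
import math


def step_decay(alpha0, t, every=10, factor=0.5):
    """Step decay: multiply the rate by `factor` every `every` epochs (assumed names)."""
    return alpha0 * factor ** (t // every)


def exponential_decay(alpha0, t, k):
    """Exponential decay: alpha = alpha0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * t)


def inverse_time_decay(alpha0, t, k):
    """1/t decay: alpha = alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)


def gradient_descent(grad, x0, schedule, epochs=200, tol=1e-12):
    """Scalar gradient descent; `schedule(t)` returns the learning rate for epoch t."""
    x = x0
    for t in range(epochs):
        g = grad(x)
        if abs(g) < tol:          # Step 1: stop when the gradient is (numerically) zero
            break
        h = -g                    # search direction h_i = -grad f(x_i)
        alpha = schedule(t)       # Step 2: learning rate for this epoch
        x = x + alpha * h         # Step 3: x_{i+1} = x_i + alpha * h_i
    return x


# Toy objective (an assumption for illustration): f(x) = (x - 3)^2, so grad f(x) = 2 (x - 3).
x_min = gradient_descent(
    grad=lambda x: 2.0 * (x - 3.0),
    x0=0.0,
    schedule=lambda t: inverse_time_decay(alpha0=0.4, t=t, k=0.01),
)
print(x_min)
```

Swapping `inverse_time_decay` for `step_decay` or `exponential_decay` in the `schedule` argument is enough to compare the three schedules on the same objective.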
Visualizing Gradient Descent
Motivation for gradient descent modifications: https://www.youtube.com/watch?v=_e-LFe_igno

Gradient descent: building momentum
Instead of taking only the latest gradient, we take the average of the past few gradients. This ensures that we do not zig-zag or go off course because of one sample.
Image source: https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Momentum: EMA on gradient descent
∙ In an exponential moving average (EMA) that puts weight w on the newest value, we are more interested in how much importance is given to the past. So we care more about (1 - w), and it is called the momentum, $\rho$. (Here w is the EMA weight, not the network weight $w_t$ in the update rules.)
∙ Momentum indicates how many past values we take into account for averaging: the number of past values taken for averaging is about $1/(1-\rho)$.
Without momentum:
$$w_{t+1} = w_t - \alpha \, \mathrm{grad}(t)$$
With momentum:
$$v_{t+1} = \rho v_t + (1 - \rho)\, \mathrm{grad}(t)$$
$$w_{t+1} = w_t - \alpha v_{t+1}$$

Implementing SGD with Momentum

RMSProp: how about different learning rates for different w?
If the average update of a weight until now is high, change it less; if it is low, increase the update.
∙ The average update done until now is the EMA of the squared gradient. (Why not just the gradient? Positives might cancel negatives.)
∙ The square root of the EMA of the squared gradient is the root mean square (RMS) that RMSProp is named for.
∙ Divide the current gradient by this RMS value.

Momentum vs RMSProp: better saddle-point performance
https://towardsdatascience.com/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a

Implementing RMSprop

Adam (adaptive moment) update
The Adam update combines momentum and RMSProp. It has two hyperparameters: momentum (beta 1) and squared momentum (beta 2).
https://medium.com/100-days-of-algorithms/day-69-rmsprop-7a88d475003b
$$v_1 = 0$$
$$v_{t+1} = \beta_1 v_t + (1 - \beta_1)\, \mathrm{grad} \quad \text{(momentum)}$$
$$r_{t+1} = \beta_2 r_t + (1 - \beta_2)\, \mathrm{grad}^2 \quad \text{(RMSProp)}$$
$$w_{t+1} = w_t - \alpha \, \frac{v_{t+1}}{\sqrt{r_{t+1}} + \epsilon}$$
It is common to apply bias corrections (based on $\beta_1$ and $\beta_2$) to both $v_{t+1}$ and $r_{t+1}$ in the above equation.
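The three update rules above (momentum, RMSProp, Adam) can be written as small NumPy functions. This is a sketch under stated assumptions rather than the code shown on the "Implementing ..." slides, which is not part of this transcript: the function names, default hyperparameter values, and the toy quadratic objective are illustrative choices, `r` is assumed to start at zero like `v`, and the Adam step includes the commonly used bias corrections of dividing by 1 - beta^t.

```python
import numpy as np


def sgd_momentum_step(w, v, grad, alpha=0.1, rho=0.9):
    """v_{t+1} = rho * v_t + (1 - rho) * grad;  w_{t+1} = w_t - alpha * v_{t+1}."""
    v = rho * v + (1.0 - rho) * grad
    w = w - alpha * v
    return w, v


def rmsprop_step(w, r, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """Keep an EMA of grad^2 and divide the current gradient by its square root."""
    r = beta * r + (1.0 - beta) * grad ** 2
    w = w - alpha * grad / (np.sqrt(r) + eps)
    return w, r


def adam_step(w, v, r, grad, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum EMA (v) plus squared-gradient EMA (r); t is the 1-based step count."""
    v = beta1 * v + (1.0 - beta1) * grad
    r = beta2 * r + (1.0 - beta2) * grad ** 2
    v_hat = v / (1.0 - beta1 ** t)   # bias correction (assumed form: divide by 1 - beta^t)
    r_hat = r / (1.0 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(r_hat) + eps)
    return w, v, r


# Toy objective (an assumption): f(w) = 0.5 * (w0^2 + 10 * w1^2), gradient = [w0, 10 * w1].
scale = np.array([1.0, 10.0])
w = np.array([1.0, 1.0])
v = np.zeros_like(w)
r = np.zeros_like(w)
for t in range(1, 2001):
    grad = scale * w
    w, v, r = adam_step(w, v, r, grad, t)
w_adam = w
print(w_adam)
```

Because Adam divides each coordinate's step by that coordinate's own RMS gradient, the steep direction (w1) and the shallow direction (w0) advance at a similar pace here, which is the "different learning rates for different w" idea from the RMSProp slide.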
http://www.insofe.edu.in Data Science Education and Research

Implementing Adam

Weight initialization
Weight initialization is an active area of research.
∙ All zeros is bad: all outputs are the same, so all updates will be the same.
∙ A naïve Gaussian breaks down for large nets. Example setting: zero mean and 0.01 std; 10 layers, 500 neurons, and tanh activations.
Strategy: Xavier/Glorot initialization
Xavier initialization sets a layer's weights to values chosen from a random uniform distribution that's bounded between [bounds not captured in this transcript]

REGULARIZATIONS
Regularization for ANN
∙ L1 & L2 regularization
∙ Noise-based regularization: dropout
∙ Batch normalization

Overfitting problem
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \cdots$$
How many $\beta_i$ do we need for a reasonable fit, while at the same time minimizing the risk of overfitting?
In general, given a regression problem with several features, is there a way to determine how many $\beta_i$ we need for a reasonable fit, while at the same time minimizing the risk of overfitting?
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \cdots$$

Regularization
Standard least squares regression involves minimizing the SSE:
$$\mathrm{SSE} = \sum_{j=1}^{N} \left( y_j - \sum_{i=0}^{d} \beta_i x_{ji} \right)^2 = (y - X\beta)^T (y - X\beta)$$
where $y$ is the vector of all training responses $y_j$, $X$ is the matrix of all training samples $x_j$, and $x_{ji}$ is feature $i$ of sample $x_j$ (with $x_{j0} = 1$).
We can reduce overfitting by penalizing large coefficients:
$$\min_{\beta} \; \sum_{j=1}^{N} \left( y_j - \sum_{i=0}^{d} \beta_i x_{ji} \right)^2 + \lambda \sum_{i=1}^{d} \beta_i^2$$

Regularization functions
∙ Ridge: the simplest model is the one whose sum of squared coefficients is the least.
∙ Lasso: the simplest model is the one whose sum of absolute coefficients is the least.
∙ A high lambda gives the simplest models; at infinity, we have no model.

Architecture
http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Regularization is the key (figure: 20-node neural net)
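The penalized least-squares objective above (the ridge case) has a closed-form solution, which makes the effect of lambda easy to see numerically. The sketch below is illustrative only: the function names, the synthetic sine data, the polynomial degree, and the lambda values are assumptions, and the intercept $\beta_0$ is left unpenalized because the penalty sum starts at i = 1.

```python
import numpy as np


def polynomial_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree (column 0 is the intercept)."""
    return np.vander(x, degree + 1, increasing=True)


def ridge_fit(X, y, lam):
    """Minimize (y - X b)^T (y - X b) + lam * sum_{i>=1} b_i^2 via the normal equations."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                       # do not penalize the intercept beta_0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


# Synthetic data (an assumption for illustration): a noisy sine curve on [-1, 1].
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)
X = polynomial_features(x, degree=9)

lams = (1e-3, 1e-1, 10.0)
norms = [float(np.sum(ridge_fit(X, y, lam)[1:] ** 2)) for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lambda={lam:g}  sum of squared coefficients={norm:.4f}")

# With a very large lambda every penalized coefficient is pushed toward zero and
# only the intercept (close to the mean of y) survives: "at infinity, we have no model".
beta_big = ridge_fit(X, y, 1e8)
```

Lasso replaces the squared penalty with a sum of absolute values; it has no closed-form solution like this one and is usually fitted with an iterative solver.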