Neural Networks - Lecture 2 - Optimization Methods PDF

Summary

This document is a lecture on optimization methods for neural networks. It covers gradient descent and its derivation, second-order (Newton-type) optimization, stochastic and mini-batch gradient descent, and adaptive gradient methods such as Momentum, Nesterov Accelerated Gradient, Adagrad, Adadelta, RMSProp, and Adam, closing with practical advice and training tricks. The lecture includes examples and figures to illustrate the concepts.

Full Transcript

Neural Networks - Lecture 2: Optimization Methods
Alexandru Sorici, UPB

Table of contents
1. Introduction
2. Gradient Descent
3. Second Order Optimization
4. Stochastic Gradient Descent
5. Gradient Descent Optimization Algorithms
6. References

Introduction

Recap - partial derivatives and gradient

Consider a two-variable function f(θ) = f(θ1, θ2) = θ1^2 + θ2^2. The partial derivative with respect to θ1 is defined as

\frac{\partial f(\theta_1, \theta_2)}{\partial \theta_1} = \lim_{\Delta\theta_1 \to 0} \frac{f(\theta_1 + \Delta\theta_1, \theta_2) - f(\theta_1, \theta_2)}{\Delta\theta_1}

Here \partial f(\theta) / \partial \theta_1 = 2\theta_1 and \partial f(\theta) / \partial \theta_2 = 2\theta_2, so the gradient is

\nabla f(\theta) = \begin{bmatrix} 2\theta_1 \\ 2\theta_2 \end{bmatrix}
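As a quick sanity check of the definitions above, the analytic gradient can be compared against a finite-difference approximation of the limit definition. The following is a minimal NumPy sketch (an illustration, not part of the original slides), using the function from the recap:

```python
import numpy as np

def f(theta):
    # f(theta) = theta_1^2 + theta_2^2, the running example from the recap
    return theta[0] ** 2 + theta[1] ** 2

def analytic_grad(theta):
    # From the recap: [df/dtheta_1, df/dtheta_2] = [2*theta_1, 2*theta_2]
    return np.array([2 * theta[0], 2 * theta[1]])

def numeric_grad(f, theta, eps=1e-6):
    # Central finite differences approximate the limit definition above
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

theta = np.array([1.5, -2.0])
print(analytic_grad(theta))    # [ 3. -4.]
print(numeric_grad(f, theta))  # approximately the same
```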

Recap - second order derivatives - Hessian

The Hessian is the square matrix of second-order partial derivatives. For the function above:

\frac{\partial^2 f(\theta)}{\partial \theta_1^2} = \frac{\partial}{\partial \theta_1}\left(\frac{\partial f(\theta)}{\partial \theta_1}\right) = 2, \qquad \frac{\partial^2 f(\theta)}{\partial \theta_2^2} = 2, \qquad \frac{\partial^2 f(\theta)}{\partial \theta_1 \partial \theta_2} = \frac{\partial^2 f(\theta)}{\partial \theta_2 \partial \theta_1} = 0

H = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}

Recap - General Case - gradient vector

Let θ be a d-dimensional vector and f(θ) a scalar-valued function. The gradient vector of f(·) with respect to θ is

\nabla f(\theta) = \begin{bmatrix} \frac{\partial f(\theta)}{\partial \theta_1} \\ \frac{\partial f(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta)}{\partial \theta_d} \end{bmatrix}

[Figure 1: Gradient field. Source: Wikimedia Commons (https://bit.ly/2lOcCi9)]

Recap - General Case - Hessian matrix

The Hessian matrix of f(·) with respect to θ, written \nabla^2_\theta f(\theta) or H, is the d × d matrix of second-order partial derivatives:

H = \nabla^2_\theta f(\theta) = \begin{bmatrix}
\frac{\partial^2 f(\theta)}{\partial \theta_1^2} & \frac{\partial^2 f(\theta)}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_1 \partial \theta_d} \\
\frac{\partial^2 f(\theta)}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 f(\theta)}{\partial \theta_2^2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_2 \partial \theta_d} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(\theta)}{\partial \theta_d \partial \theta_1} & \frac{\partial^2 f(\theta)}{\partial \theta_d \partial \theta_2} & \cdots & \frac{\partial^2 f(\theta)}{\partial \theta_d^2}
\end{bmatrix}

Recap - Offline learning

When doing offline learning, we typically use a batch of data x_{1:n} = x_1, x_2, ..., x_n and optimize functions of the form

f(\theta) = f(\theta, x_{1:n}) = \frac{1}{n} \sum_{i=1}^{n} f(\theta, x_i) \approx \int f(\theta, x)\, p(x)\, dx

The gradient is then

g(\theta) = \nabla_\theta f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta f(\theta, x_i)

Recap - Optimization for linear regression

When f is the loss function for linear regression (MSE loss):

f(\theta) = f(\theta, X, y) = (y - X\theta)^T (y - X\theta) = \sum_{i=1}^{n} (y_i - x_i^T \theta)^2

\nabla_\theta f(\theta) = \frac{\partial}{\partial \theta}\left( y^T y - 2 y^T X \theta + \theta^T X^T X \theta \right) = -2 X^T y + 2 X^T X \theta

\nabla^2_\theta f(\theta) = 2 X^T X

Gradient Descent

Gradient Descent Algorithm

Gradient descent, or steepest descent, is an iterative optimization algorithm with updates of the form

\theta^{(k+1)} = \theta^{(k)} - \lambda^{(k)} g^{(k)} = \theta^{(k)} - \lambda^{(k)} \nabla f(\theta^{(k)})

where k is the step index, g^{(k)} = g(\theta^{(k)}) is the gradient at step k, and \lambda^{(k)} is the learning rate (step size).

Gradient Descent Derivation

Take f to be our loss function J(θ). Objective: find a small change Δθ in the parameters θ that minimizes the value of J:

\arg\min_{\Delta\theta} J(\theta + \Delta\theta) \approx \arg\min_{\Delta\theta} \left[ J(\theta) + \Delta\theta^T \frac{\partial J}{\partial \theta} \right] \quad \text{s.t.} \quad \|\Delta\theta\| \le \epsilon

This constrained optimization problem can be handled with Lagrange multipliers:

\arg\min_{\Delta\theta} \; J(\theta) + \Delta\theta^T \frac{\partial J}{\partial \theta} + \eta\, \Delta\theta^T \Delta\theta

Solving for Δθ in the optimization gives us

\Delta\theta = \theta^{(k+1)} - \theta^{(k)} = -\frac{1}{2\eta} \frac{\partial J}{\partial \theta} \implies \theta^{(k+1)} = \theta^{(k)} - \lambda \frac{\partial J}{\partial \theta}

How to choose the learning rate?

[Figure omitted. Source: https://www.jeremyjordan.me/nn-learning-rate/]
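To make the update rule concrete, here is a minimal sketch that runs gradient descent on the linear-regression MSE loss from the recap and checks the result against the closed-form least-squares solution. The synthetic data, learning rate, and iteration count are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=n)

def grad(theta):
    # Gradient of f(theta) = (y - X theta)^T (y - X theta), from the recap:
    # grad = -2 X^T y + 2 X^T X theta
    return -2 * X.T @ y + 2 * X.T @ X @ theta

lam = 1e-3        # learning rate (step size), chosen small enough to converge
theta = np.zeros(d)
for k in range(500):
    # theta^{(k+1)} = theta^{(k)} - lambda * g^{(k)}
    theta = theta - lam * grad(theta)

theta_closed = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta)         # close to the closed-form solution
print(theta_closed)
```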
Second Order Optimization

Newton's Algorithm

The most basic second-order optimization algorithm. It has updates of the form

\theta^{(k+1)} = \theta^{(k)} - H_k^{-1} g^{(k)}

The algorithm is derived from the second-order Taylor series approximation of J(θ) around θ^{(k)}:

J_{quad}(\theta) = J(\theta^{(k)}) + g^{(k)T} (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^T H_k (\theta - \theta^{(k)})

Differentiate, equate to 0, and solve for θ:

\nabla J(\theta) = g^{(k)} + H_k (\theta - \theta^{(k)}) = 0 \implies \theta = \theta^{(k)} - H_k^{-1} g^{(k)}

Second Order Optimization - issues

To compute the direction of descent d^{(k)} = -H_k^{-1} g^{(k)} we need to invert the Hessian matrix H_k at each iteration! An efficient and popular way to circumvent this issue is to use the Conjugate Gradient method to solve the linear system of equations H_k d^{(k)} = -g^{(k)} for d^{(k)}.

Stochastic Gradient Descent

The loss function that helps us compute the gradient with respect to the model parameters, ∇θ J(θ), also depends on the observed data instances. At an optimum we want the expected gradient to vanish:

\nabla_\theta J(\theta) = \int \nabla_\theta J(x, \theta)\, p(x)\, dx = \mathbb{E}\left[\nabla_\theta J(x, \theta)\right] = 0

\theta^{(k+1)} = \theta^{(k)} - \lambda\, \mathbb{E}\left[\nabla_\theta J(x, \theta)\right] \approx \theta^{(k)} - \lambda \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta J(x^{(i)}, \theta) \approx \theta^{(k)} - \lambda \nabla_\theta J(x^{(k)}, \theta^{(k)})

The algorithm is stochastic because it only "looks" at a subset (or even a single instance) of the data to estimate the true gradient:

\theta^{(k+1)} = \theta^{(k)} - \lambda\, \mathbb{E}[\nabla J] + \lambda \underbrace{\left( \mathbb{E}[\nabla J] - \nabla J(x^{(k)}, \theta^{(k)}) \right)}_{\text{noise}}

If the learning-rate sequence λ^{(k)} decays appropriately relative to the variance of the noise, the noise term stays bounded and the algorithm converges.

SGD and Batches

Batch: \theta^{(k+1)} = \theta^{(k)} - \lambda \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta J(x^{(i)}, \theta^{(k)})

Stochastic / Online: \theta^{(k+1)} = \theta^{(k)} - \lambda \nabla_\theta J(x^{(k)}, \theta^{(k)})

Mini-batch: \theta^{(k+1)} = \theta^{(k)} - \lambda \frac{1}{l} \sum_{i=1}^{l} \nabla_\theta J(x^{(i)}, \theta^{(k)})

[Figure: Convergence speed of GD variants. Source: stack-exchange]

SGD Concerns/Challenges

Good practices:
• Normalize the input space:
  • Shift inputs to have a zero mean over the whole dataset
  • Scale the inputs to have unit variance
  • Decorrelate the input components
  • For linear layers, we get a big win by decorrelating each component of the input from the other components (e.g. use PCA to remove components with small eigenvalues, and divide the remaining principal components by the square roots of their eigenvalues)
• Initialization of weights should break symmetry (e.g. Xavier initialization)
• Aim for class balance in your mini-batches

SGD Concerns/Challenges

• Choosing a proper learning rate
• Defining the rate of annealing, or a schedule for reducing the learning rate
• Having the same learning rate for all parameters (e.g. for rarer or more important features we want a bigger update in that direction)
• Plateaus and saddle points can leave SGD stuck in suboptimal local minima
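Before moving on to the dedicated optimization algorithms, here is a compact mini-batch SGD loop over the same kind of synthetic regression problem as before, including the per-epoch shuffling recommended above. The data and hyperparameters are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def minibatch_grad(theta, Xb, yb):
    # Mean gradient over a mini-batch of l examples:
    # (1/l) * sum_i grad J(x^(i), theta), matching the mini-batch update rule
    l = len(yb)
    return (-2 * Xb.T @ yb + 2 * Xb.T @ Xb @ theta) / l

lam, batch_size, epochs = 0.01, 32, 20
theta = np.zeros(d)
for epoch in range(epochs):
    perm = rng.permutation(n)  # shuffle each epoch to avoid biasing the updates
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        theta -= lam * minibatch_grad(theta, X[idx], y[idx])

print(theta)  # approaches theta_true
```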
Gradient Descent Optimization Algorithms

Momentum

[Figure: SGD without momentum (left) and with momentum (right). Source: DataScience Deep Dive Blog - Optimizations of Gradient Descent]

Intuition: momentum accelerates SGD in the relevant direction and dampens oscillations. Let Δθ_t be the update step of parameters θ at step t:

\Delta\theta_t = \gamma \Delta\theta_{t-1} + \lambda \nabla_\theta J(\theta)

\theta^{(t+1)} = \theta^{(t)} - \Delta\theta_t

γ is usually set to 0.9.

Nesterov Accelerated Gradient

Intuition: we want to give momentum a sense of "when to slow down" before the hill slopes up again. Compute the gradient with respect to the estimated future parameters θ - γΔθ_{t-1}:

\Delta\theta_t = \gamma \Delta\theta_{t-1} + \lambda \nabla_\theta J(\theta - \gamma \Delta\theta_{t-1})

\theta^{(t+1)} = \theta^{(t)} - \Delta\theta_t

[Figure: Momentum update (blue vectors) compared to NAG update (brown + red vectors). Source: G. Hinton, Lecture 6c, CSC321]

Adagrad

Intuition: adapt the learning rate of individual parameters, depending on parameter importance:
• smaller updates (lower learning rate) for frequently occurring features
• larger updates (higher learning rate) for infrequent ones
• suited for sparse data (e.g. object classification in video, word embeddings)

Notation: g_{t,i} = \nabla_\theta J(\theta_i^{(t)})

\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\lambda}{\sqrt{G_{t,ii} + \epsilon}}\, g_{t,i}

where G_t ∈ R^{d×d} is a diagonal matrix whose element (i, i) is the sum of squares of past gradients (up to step t) for parameter θ_i.

Adadelta

Intuition: Adagrad's weakness is the unbounded accumulation of squared gradients ⟹ the learning rate shrinks ⟹ no new knowledge updates. Adadelta's solution: restrict past gradients to a window w and compute a decaying running average at each time step:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2

where E[g^2]_t is the running average at step t.

\Delta\theta_t = \frac{\lambda}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t = \frac{\lambda}{RMS[g]_t} \odot g_t

\theta^{(t+1)} = \theta^{(t)} - \Delta\theta_t

Adadelta - contd

Note: theoretically, the update should have the same hypothetical units as the parameter. Adadelta's solution: define an exponential decay of squared parameter updates

E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta_t^2

RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}

\Delta\theta_t = \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t

where RMS[Δθ]_{t-1} replaces λ, and

\theta^{(t+1)} = \theta^{(t)} - \Delta\theta_t

RMSProp

Developed by Geoffrey Hinton in Lecture 6e of his Coursera class. Equivalent to Adadelta without the exponential decay of squared parameter updates:

E[g^2]_t = 0.9\, E[g^2]_{t-1} + 0.1\, g_t^2

\theta^{(t+1)} = \theta^{(t)} - \frac{\lambda}{\sqrt{E[g^2]_t + \epsilon}}\, g_t

where λ = 0.001 is considered a good default value.

Adam

Adaptive Moment Estimation (Adam) ≈ Adadelta with momentum:
• store an exponentially decaying average of squared gradients v_t (second moment, variance)
• also store an exponentially decaying average of past gradients m_t (first moment, mean), similar to momentum

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

m_t and v_t are biased towards 0 at initial time steps (especially if β_1, β_2 are close to 1) ⟹ bias correction:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Adam update rule:

\theta^{(t+1)} = \theta^{(t)} - \frac{\lambda}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t

where the default indicated values are β_1 = 0.9, β_2 = 0.999 and ε = 10^{-8}.

Visualization of GD optimization algorithms

[Figure: visualization of the GD optimization algorithms]
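As a concrete reference, the momentum and NAG update rules above fit in a few lines. The sketch below applies them to the running quadratic example f(θ) = θ1^2 + θ2^2; the learning rate, γ, starting point, and iteration count are illustrative assumptions:

```python
import numpy as np

def grad(theta):
    # Gradient of the running example f(theta) = theta_1^2 + theta_2^2
    return 2 * theta

def momentum_step(theta, delta, lam=0.1, gamma=0.9):
    # delta_t = gamma * delta_{t-1} + lambda * grad(theta)
    delta = gamma * delta + lam * grad(theta)
    return theta - delta, delta

def nesterov_step(theta, delta, lam=0.1, gamma=0.9):
    # Gradient is evaluated at the look-ahead point theta - gamma * delta_{t-1}
    delta = gamma * delta + lam * grad(theta - gamma * delta)
    return theta - delta, delta

theta_m = np.array([5.0, -3.0]); delta_m = np.zeros(2)
theta_n = np.array([5.0, -3.0]); delta_n = np.zeros(2)
for t in range(200):
    theta_m, delta_m = momentum_step(theta_m, delta_m)
    theta_n, delta_n = nesterov_step(theta_n, delta_n)
print(theta_m, theta_n)  # both approach the minimum at the origin
```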
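Similarly, here is a minimal Adam step with bias correction, using the default values indicated above; the demo loop, its starting point, and its length are assumptions for illustration:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lam=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction counteracts the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lam * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0, -3.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 20001):  # t starts at 1 so the bias corrections are defined
    g = 2 * theta          # gradient of the running quadratic example
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)  # close to the minimum at the origin
```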
So, how to go about optimizing?

• For small datasets (e.g. 10,000 cases) or big datasets without much redundancy: use a full-batch method
  • conjugate gradient, L-BFGS
  • adaptive learning rates
• For big, redundant datasets: use mini-batches
  • try RMSProp, Adam
  • try SGD with momentum (particularly when fine-tuning)
• If the input data is sparse, try adaptive learning-rate methods
• If fast convergence is required (especially for initial tests), use adaptive learning-rate methods

So, how to go about optimizing? (Tricks)

• Shuffling and Curriculum Learning
  • Shuffle the training data after each epoch, to avoid biasing the optimization algorithm
  • Curriculum Learning: for some problems (e.g. involving a temporal dimension), training on progressively harder examples helps
• Batch Normalization
  • Common practice: normalize inputs to zero mean and unit variance
  • For deep networks, intermediate layer inputs may vary significantly during training (covariate shift) ⟹ normalize intermediate layer inputs as well
• Early stopping
  • Monitor the error on a validation set and stop (with some patience, i.e. a number of gradient update steps) if the validation error does not improve enough
• Gradient Noise (see the sketch after this list)
  • Used to make training more robust to poor initialization, or when working with deep and complex networks (highly irregular error surfaces - helps escape local minima)
  • Gradient modification: g_{t,i} = g_{t,i} + N(0, \sigma_t^2), with \sigma_t^2 = \frac{\eta}{(1 + t)^\gamma}
• Stochastic Weight Averaging (SWA)
  • An effective, low-computational-overhead method that exploits SGD's behavior in "flattened" regions of the loss surface (when approaching convergence)
  • https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/
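The annealed gradient-noise schedule above is straightforward to implement. In the sketch below the values of η and γ are illustrative assumptions (the lecture gives the formula but fixes no values):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_gradient(g, t, eta=0.3, gamma=0.55):
    # sigma_t^2 = eta / (1 + t)^gamma: the noise variance decays with step t
    sigma2 = eta / (1 + t) ** gamma
    return g + rng.normal(0.0, np.sqrt(sigma2), size=g.shape)

g = np.array([0.5, -1.2])
for t in range(3):
    print(noisy_gradient(g, t))  # perturbed gradients; noise shrinks as t grows
```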
References

Credits for slides go to lectures by Nando de Freitas and Geoffrey Hinton, and to an article by Sebastian Ruder.

• Nando de Freitas. Machine Learning. Lecture 5 - Optimization.
• Sebastian Ruder (2016). An overview of gradient descent optimisation algorithms. arXiv preprint arXiv:1609.04747.
• Geoffrey Hinton. Neural Networks for Machine Learning. Lecture 6.