Gradient Descent PDF
Document Details
Uploaded by Deleted User
IIT Gandhinagar
2024
Nipun Batra
Tags
Summary
This presentation introduces gradient descent, an optimization algorithm used in machine learning. It details contour plots, cost functions, and examples, ultimately providing a detailed description with visual aids and explanations.
Full Transcript
Gradient Descent Nipun Batra February 4, 2024 IIT Gandhinagar Revision Contour Plot And Gradients z = f (x, y ) = x 2 + y 2 Surface Plot Contour Plot 5.0...
Gradient Descent Nipun Batra February 4, 2024 IIT Gandhinagar Revision Contour Plot And Gradients z = f (x, y ) = x 2 + y 2 Surface Plot Contour Plot 5.0 50 2.5 40 30 40 0.0 y z 20 20 −2.5 10 −5 −5 0 −5.0 0 0 5 x −5 0 5 5 x y 1 Contour Plot And Gradients z = f (x, y ) = x 2 + y 2 Surface Plot Contour Plot 5.0 50 2.5 40 30 40 0.0 y z 20 20 −2.5 10 −5 −5 0 −5.0 0 0 5 x −5 0 5 5 x y Gradient denotes the direction of steepest ascent or the direction in which there is a maximum increase in f(x,y) 1 Contour Plot And Gradients z = f (x, y ) = x 2 + y 2 Surface Plot Contour Plot 5.0 50 2.5 40 30 40 0.0 y z 20 20 −2.5 10 −5 −5 0 −5.0 0 0 5 x −5 0 5 5 x y Gradient denotes the direction of steepest ascent or the direction in which there " is a maximum # " #increase in f(x,y) ∂f (x,y ) 2x ∇f (x, y ) = ∂f ∂x (x,y ) = ∂y 2y 1 Introduction Optimization algorithms We often want to minimize/maximize a function 2 Optimization algorithms We often want to minimize/maximize a function We wanted to minimize the cost function: f (θ) = (y − X θ)T (y − X θ) (1) 2 Optimization algorithms We often want to minimize/maximize a function We wanted to minimize the cost function: f (θ) = (y − X θ)T (y − X θ) (1) Note, here θ is the parameter vector 2 Optimization algorithms In general, we have following components: 3 Optimization algorithms In general, we have following components: Maximize or Minimize a function subject to some constraints 3 Optimization algorithms In general, we have following components: Maximize or Minimize a function subject to some constraints Today, we will focus on unconstrained optimization (no constraints) 3 Optimization algorithms In general, we have following components: Maximize or Minimize a function subject to some constraints Today, we will focus on unconstrained optimization (no constraints) We will focus on minimization 3 Optimization algorithms In general, we have following components: Maximize or Minimize a function subject to some constraints Today, we will focus on unconstrained optimization (no constraints) We will focus on minimization Goal: θ∗ = arg minf (θ) (2) θ 3 Introduction Gradient descent is an optimization algorithm 4 Introduction Gradient descent is an optimization algorithm It is used to find the minimum of a function in unconstrained settings 4 Introduction Gradient descent is an optimization algorithm It is used to find the minimum of a function in unconstrained settings It is an iterative algorithm 4 Introduction Gradient descent is an optimization algorithm It is used to find the minimum of a function in unconstrained settings It is an iterative algorithm It is a first order optimization algorithm 4 Introduction Gradient descent is an optimization algorithm It is used to find the minimum of a function in unconstrained settings It is an iterative algorithm It is a first order optimization algorithm It is a local search algorithm/greedy 4 Gradient Descent Algorithm 1. Initialize θ to some random value 5 Gradient Descent Algorithm 1. Initialize θ to some random value 2. Compute the gradient of the cost function at θ, ∇f (θ) 5 Gradient Descent Algorithm 1. Initialize θ to some random value 2. Compute the gradient of the cost function at θ, ∇f (θ) 3. For Iteration i (i = 1, 2,...) or until convergence: 5 Gradient Descent Algorithm 1. Initialize θ to some random value 2. Compute the gradient of the cost function at θ, ∇f (θ) 3. For Iteration i (i = 1, 2,...) or until convergence: θi ← θi−1 − α∇f (θi−1 ) 5 Taylor’s Series Taylor’s Series Taylor’s series is a way to approximate a function f (x) around a point x0 using a polynomial 6 Taylor’s Series Taylor’s series is a way to approximate a function f (x) around a point x0 using a polynomial The polynomial is given by f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (3) 1! 2! 6 Taylor’s Series Taylor’s series is a way to approximate a function f (x) around a point x0 using a polynomial The polynomial is given by f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (3) 1! 2! The vector form of the above equation is given by: 1 f (~x ) = f (x~0 )+∇f (x~0 )T (~x −x~0 )+ (~x −x~0 )T ∇2 f (x~0 )(~x −x~0 )+... 2 (4) 6 Taylor’s Series Taylor’s series is a way to approximate a function f (x) around a point x0 using a polynomial The polynomial is given by f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (3) 1! 2! The vector form of the above equation is given by: 1 f (~x ) = f (x~0 )+∇f (x~0 )T (~x −x~0 )+ (~x −x~0 )T ∇2 f (x~0 )(~x −x~0 )+... 2 (4) where ∇2 f (x~0 ) is the Hessian matrix and ∇f (x~0 ) is the gradient vector 6 Taylor’s Series Let us consider f (x) = cos(x) and x0 = 0 7 Taylor’s Series Let us consider f (x) = cos(x) and x0 = 0 Then, we have: 7 Taylor’s Series Let us consider f (x) = cos(x) and x0 = 0 Then, we have: f (x0 ) = cos(0) = 1 7 Taylor’s Series Let us consider f (x) = cos(x) and x0 = 0 Then, we have: f (x0 ) = cos(0) = 1 f 0 (x0 ) = − sin(0) = 0 7 Taylor’s Series Let us consider f (x) = cos(x) and x0 = 0 Then, we have: f (x0 ) = cos(0) = 1 f 0 (x0 ) = − sin(0) = 0 f 00 (x0 ) = − cos(0) = −1 7 Taylor’s Series Let us consider f (x) = cos(x) and x0 = 0 Then, we have: f (x0 ) = cos(0) = 1 f 0 (x0 ) = − sin(0) = 0 f 00 (x0 ) = − cos(0) = −1 We can write the second order Taylor’s series as: 7 Taylor’s Series Let us consider f (x) = cos(x) and x0 = 0 Then, we have: f (x0 ) = cos(0) = 1 f 0 (x0 ) = − sin(0) = 0 f 00 (x0 ) = − cos(0) = −1 We can write the second order Taylor’s series as: x2 f (x) = 1 + 0(x − 0) + −1 2! (x − 0)2 = 1 − 2 7 Taylor’s series Let us consider another example: f (x) = x 2 + 2 and x0 = 2 8 Taylor’s series Let us consider another example: f (x) = x 2 + 2 and x0 = 2 Question: How does the first order Taylor’s series approximation look like? 8 Taylor’s series Let us consider another example: f (x) = x 2 + 2 and x0 = 2 Question: How does the first order Taylor’s series approximation look like? First order Taylor’s series approximation is given by: 8 Taylor’s series Let us consider another example: f (x) = x 2 + 2 and x0 = 2 Question: How does the first order Taylor’s series approximation look like? First order Taylor’s series approximation is given by: f (x) = f (x0 ) + f 0 (x0 )(x − x0 ) = 6 + 4(x − 2) = 4x − 2 8 Taylor’s Series (Alternative form) We have: f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (5) 1! 2! 9 Taylor’s Series (Alternative form) We have: f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (5) 1! 2! Let us consider x = x0 + ∆x where ∆x is a small quantity 9 Taylor’s Series (Alternative form) We have: f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (5) 1! 2! Let us consider x = x0 + ∆x where ∆x is a small quantity Then, we have: f 0 (x0 ) f 00 (x0 ) 2 f (x0 + ∆x) = f (x0 ) + ∆x + ∆x +... (6) 1! 2! 9 Taylor’s Series (Alternative form) We have: f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (5) 1! 2! Let us consider x = x0 + ∆x where ∆x is a small quantity Then, we have: f 0 (x0 ) f 00 (x0 ) 2 f (x0 + ∆x) = f (x0 ) + ∆x + ∆x +... (6) 1! 2! Let us assume ∆x is small enough such that ∆x 2 and higher order terms can be ignored 9 Taylor’s Series (Alternative form) We have: f 0 (x0 ) f 00 (x0 ) f (x) = f (x0 ) + (x − x0 ) + (x − x0 )2 +... (5) 1! 2! Let us consider x = x0 + ∆x where ∆x is a small quantity Then, we have: f 0 (x0 ) f 00 (x0 ) 2 f (x0 + ∆x) = f (x0 ) + ∆x + ∆x +... (6) 1! 2! Let us assume ∆x is small enough such that ∆x 2 and higher order terms can be ignored f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x 9 Taylor’s Series to Gradient Descent f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x 10 Taylor’s Series to Gradient Descent f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x Or, in vector form: f (x~0 + ∆~x ) ≈ f (x~0 ) + ∇f (x~0 )T ∆~x 10 Taylor’s Series to Gradient Descent f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x Or, in vector form: f (x~0 + ∆~x ) ≈ f (x~0 ) + ∇f (x~0 )T ∆~x Goal: Find ∆~x such that f (x~0 + ∆~x ) is minimized 10 Taylor’s Series to Gradient Descent f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x Or, in vector form: f (x~0 + ∆~x ) ≈ f (x~0 ) + ∇f (x~0 )T ∆~x Goal: Find ∆~x such that f (x~0 + ∆~x ) is minimized This is equivalent to minimizing f (x~0 ) + ∇f (x~0 )T ∆~x 10 Taylor’s Series to Gradient Descent f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x Or, in vector form: f (x~0 + ∆~x ) ≈ f (x~0 ) + ∇f (x~0 )T ∆~x Goal: Find ∆~x such that f (x~0 + ∆~x ) is minimized This is equivalent to minimizing f (x~0 ) + ∇f (x~0 )T ∆~x This happens when vectors ∇f (x~0 ) and ∆~x are at phase angle of 180◦ 10 Taylor’s Series to Gradient Descent f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x Or, in vector form: f (x~0 + ∆~x ) ≈ f (x~0 ) + ∇f (x~0 )T ∆~x Goal: Find ∆~x such that f (x~0 + ∆~x ) is minimized This is equivalent to minimizing f (x~0 ) + ∇f (x~0 )T ∆~x This happens when vectors ∇f (x~0 ) and ∆~x are at phase angle of 180◦ This happens when ∆~x = −α∇f (x~0 ) where α is a scalar 10 Taylor’s Series to Gradient Descent f 0 (x0 ) Then, we have: f (x0 + ∆x) ≈ f (x0 ) + 1! ∆x Or, in vector form: f (x~0 + ∆~x ) ≈ f (x~0 ) + ∇f (x~0 )T ∆~x Goal: Find ∆~x such that f (x~0 + ∆~x ) is minimized This is equivalent to minimizing f (x~0 ) + ∇f (x~0 )T ∆~x This happens when vectors ∇f (x~0 ) and ∆~x are at phase angle of 180◦ This happens when ∆~x = −α∇f (x~0 ) where α is a scalar This is the gradient descent algorithm: x~1 = x~0 − α∇f (x~0 ) 10 Effect of learning rate Low learning rate α = 0.01 : Converges slowly 8 f (x) = x2 + 2 x0 = 2.00 6 order 1 appx. at x = x0 x1 = 1.96 4 order 1 appx. at x = x1 x2 = 1.92 order 1 appx. at x = x2 2 x3 = 1.88 order 1 appx. at x = x3 0 −2 −1 0 1 2 11 Effect of learning rate High learning rate α = 0.8: Converges quickly, but might overshoot 8 f (x) = x2 + 2 x0 = 2.00 6 order 1 appx. at x = x0 x1 = -1.20 4 order 1 appx. at x = x1 x2 = 0.72 order 1 appx. at x = x2 2 x3 = -0.43 order 1 appx. at x = x3 0 −2 −1 0 1 2 12 Effect of learning rate Very high learning rate α = 1.01: Diverges 8 f (x) = x2 + 2 x0 = 2.00 6 order 1 appx. at x = x0 x1 = -2.04 4 order 1 appx. at x = x1 x2 = 2.08 order 1 appx. at x = x2 2 x3 = -2.12 order 1 appx. at x = x3 0 −2 −1 0 1 2 13 Effect of learning rate Appropriate learning rate α = 0.1 8 f (x) = x2 + 2 x0 = 2.00 order 1 appx. at x = x0 6 x1 = 1.60 order 1 appx. at x = x1 4 x2 = 1.28 order 1 appx. at x = x2 x3 = 1.02 2 order 1 appx. at x = x3 x4 = 0.82 0 order 1 appx. at x = x4 −2 −1 0 1 2 14 Gradient Descent for linear regression Some commonly confused terms Loss function is usually a function defined on a data point, prediction and label, and measures the penalty. 15 Some commonly confused terms Loss function is usually a function defined on a data point, prediction and label, and measures the penalty. square loss l (f (xi |θ) , yi ) = (f (xi |θ) − yi )2 , used in linear regression 15 Some commonly confused terms Loss function is usually a function defined on a data point, prediction and label, and measures the penalty. square loss l (f (xi |θ) , yi ) = (f (xi |θ) − yi )2 , used in linear regression Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example: 15 Some commonly confused terms Loss function is usually a function defined on a data point, prediction and label, and measures the penalty. square loss l (f (xi |θ) , yi ) = (f (xi |θ) − yi )2 , used in linear regression Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example: Mean Squared Error MSE (θ) = N1 N 2 P i=1 (f (xi |θ) − yi ) 15 Some commonly confused terms Loss function is usually a function defined on a data point, prediction and label, and measures the penalty. square loss l (f (xi |θ) , yi ) = (f (xi |θ) − yi )2 , used in linear regression Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example: Mean Squared Error MSE (θ) = N1 N 2 P i=1 (f (xi |θ) − yi ) Objective function is the most general term for any function that you optimize during training. 15 Gradient Descent : Example Learn y = θ0 + θ1 x on following dataset, using gradient descent where initially (θ0 , θ1 ) = (4, 0) and step-size, α = 0.1, for 2 iterations. x y 1 1 2 2 3 3 16 Gradient Descent : Example Our predictor, ŷ = θ0 + θ1 x Error for i th datapoint, i = yi − yˆi 1 = 1 − θ0 − θ1 2 = 2 − θ0 − 2θ1 3 = 3 − θ0 − 3θ1 21 + 22 + 23 14 + 3θ02 + 14θ12 − 12θ0 − 28θ1 + 12θ0 θ1 MSE = = 3 3 17 Difference between SSE and MSE X 2i increases as the number of examples increase So, we use MSE 1X 2 MSE = i n Here n denotes the number of samples 18 Gradient Descent : Example P P 2 (yi − θ0 − θ1 xi ) (−1) 2 i (−1) ∂MSE i i = = ∂θ0 N N P P 2 (yi − θ0 − θ1 xi ) (−xi ) 2 i (−xi ) ∂MSE i i = = ∂θ1 N N 19 Gradient Descent : Example Iteration 1 ∂MSE θ0 = θ0 − α ∂θ0 ∂MSE θ1 = θ1 − α ∂θ1 20 Gradient Descent : Example Iteration 1 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 4 − 0.2 ((1−(4+0))(−1)+(2−(4+0))(−1)+(3−(4+0))(−1)) 3 θ0 = 3.6 ∂MSE θ 1 = θ1 − α ∂θ1 20 Gradient Descent : Example Iteration 1 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 4 − 0.2 ((1−(4+0))(−1)+(2−(4+0))(−1)+(3−(4+0))(−1)) 3 θ0 = 3.6 ∂MSE θ 1 = θ1 − α ∂θ1 θ1 = 0 − 0.2 ((1−(4+0))(−1)+(2−(4+0))(−2)+(3−(4+0))(−3)) 3 θ1 = −0.67 20 Gradient Descent : Example Iteration 2 ∂MSE θ0 = θ0 − α ∂θ0 ∂MSE θ1 = θ1 − α ∂θ1 21 Gradient Descent : Example Iteration 2 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 3.6 − 0.2 ((1−(3.6−0.67))(−1)+(2−(3.6−0.67×2))(−1)+(3−(3.6−0.67×3))(−1)) 3 θ0 = 3.54 ∂MSE θ1 = θ1 − α ∂θ1 21 Gradient Descent : Example Iteration 2 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 3.6 − 0.2 ((1−(3.6−0.67))(−1)+(2−(3.6−0.67×2))(−1)+(3−(3.6−0.67×3))(−1)) 3 θ0 = 3.54 ∂MSE θ1 = θ1 − α ∂θ1 θ0 = 3.6 − 0.2 ((1−(3.6−0.67))(−1)+(2−(3.6−0.67×2))(−2)+(3−(3.6−0.67×3))(−3)) 3 θ0 = −0.55 21 Gradient Descent : Example (Iteraion 0) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 22 Gradient Descent : Example (Iteraion 2) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 23 Gradient Descent : Example (Iteraion 4) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 24 Gradient Descent : Example (Iteraion 6) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 25 Gradient Descent : Example (Iteraion 8) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 26 Gradient Descent : Example (Iteraion 10) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 27 Gradient Descent : Example (Iteraion 12) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 28 Gradient Descent : Example (Iteraion 14) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 29 Gradient Descent : Example (Iteraion 16) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 30 Gradient Descent : Example (Iteraion 18) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 31 Gradient Descent : Example (Iteraion 20) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 32 Gradient Descent : Example (Iteraion 22) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 33 Gradient Descent : Example (Iteraion 24) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 34 Gradient Descent : Example (Iteraion 26) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 35 Gradient Descent : Example (Iteraion 28) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 36 Gradient Descent : Example (Iteraion 30) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 37 Gradient Descent : Example (Iteraion 32) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 38 Gradient Descent : Example (Iteraion 34) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 39 Gradient Descent : Example (Iteraion 36) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 40 Gradient Descent : Example (Iteraion 38) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 41 Gradient Descent : Example (Iteraion 40) Contour Plot 400 4 Line fit 300 6 Actual line 2 200 4 100 0 y 50 2 −2 25 10 0 −4 0 −4 −2 0 2 4 0 1 2 3 4 5 x 42 Iteration v/s Epcohs for gradient descent Iteration: Each time you update the parameters of the model 43 Iteration v/s Epcohs for gradient descent Iteration: Each time you update the parameters of the model Epoch: Each time you have seen all the set of examples 43 Gradient Descent (GD) Dataset: D = {(X , y )} of size N Initialize θ For epoch e in [1, E ] Predict ŷ = pred(X , θ) Compute loss: J(θ) = loss(y , ŷ ) Compute gradient: ∇J(θ) = grad(J)(θ) Update: θ = θ − α∇J(θ) 44 Stochastic Gradient Descent (SGD) Dataset: D = {(X , y )} of size N Initialize θ For epoch e in [1, E ] Shuffle D For i in [1, N] Predict yˆi = pred(Xi , θ) Compute loss: J(θ) = loss(yi , yˆi ) Compute gradient: ∇J(θ) = grad(J)(θ) Update: θ = θ − α∇J(θ) 45 Mini-Batch Gradient Descent (MBGD) Dataset: D = {(X , y )} of size N Initialize θ For epoch e in [1, E ] Shuffle D Batches = make batches(D, B) For b in Batches X b, y b = b Predict yˆb = pred(X b, θ) Compute loss: J(θ) = loss(y b, yˆb) Compute gradient: ∇J(θ) = grad(J)(θ) Update: θ = θ − α∇J(θ) 46 Gradient Descent vs SGD Vanilla Gradient Descent in Vanilla (Batch) gradient descent: We update params after going through all the data 47 Gradient Descent vs SGD Vanilla Gradient Descent in Vanilla (Batch) gradient descent: We update params after going through all the data Smooth curve for Iteration vs Cost 47 Gradient Descent vs SGD Vanilla Gradient Descent in Vanilla (Batch) gradient descent: We update params after going through all the data Smooth curve for Iteration vs Cost For a single update, it needs to compute the gradient over all the samples. Hence takes more time 47 Gradient Descent vs SGD Vanilla Gradient Descent in Vanilla (Batch) gradient descent: We update params after going through all the data Smooth curve for Iteration vs Cost For a single update, it needs to compute the gradient over all the samples. Hence takes more time 47 Gradient Descent vs SGD Vanilla Gradient Descent in Vanilla (Batch) gradient descent: We update params after going through all the data Smooth curve for Iteration vs Cost For a single update, it needs to compute the gradient over all the samples. Hence takes more time Stochastic Gradient Descent In SGD, we update parameters after seeing each each point 47 Gradient Descent vs SGD Vanilla Gradient Descent in Vanilla (Batch) gradient descent: We update params after going through all the data Smooth curve for Iteration vs Cost For a single update, it needs to compute the gradient over all the samples. Hence takes more time Stochastic Gradient Descent In SGD, we update parameters after seeing each each point Noisier curve for iteration vs cost 47 Gradient Descent vs SGD Vanilla Gradient Descent in Vanilla (Batch) gradient descent: We update params after going through all the data Smooth curve for Iteration vs Cost For a single update, it needs to compute the gradient over all the samples. Hence takes more time Stochastic Gradient Descent In SGD, we update parameters after seeing each each point Noisier curve for iteration vs cost For a single update, it computes the gradient over one example. Hence lesser time 47 Stochastic Gradient Descent : Example Learn y = θ0 + θ1 x on following dataset, using SGD where initially (θ0 , θ1 ) = (4, 0) and step-size, α = 0.1, for 1 epoch (3 iterations). x y 2 2 3 3 1 1 48 Stochastic Gradient Descent : Example Our predictor, ŷ = θ0 + θ1 x Error for i th datapoint, ei = yi − yˆi 1 = 2 − θ0 − 2θ1 2 = 3 − θ0 − 3θ1 3 = 1 − θ0 − θ1 While using SGD, we compute the MSE using only 1 datapoint per iteration. So MSE is 21 for iteration 1 and 22 for iteration 2. 49 Stochastic Gradient Descent : Example Contour plot of the cost functions for the three datapoints Contour Plot Contour Plot Contour Plot 400 400 400 4 4 4 300 300 300 2 200 2 200 2 200 100 100 100 0 0 0 y y y 50 50 50 −2 25 −2 25 −2 25 10 10 10 −4 −4 −4 0 0 0 −5 0 5 −5 0 5 −5 0 5 x x x 50 Stochastic Gradient Descent : Example For Iteration i ∂MSE = 2 (yi − θ0 − θ1 xi ) (−1) = 2i (−1) ∂θ0 ∂MSE = 2 (yi − θ0 − θ1 xi ) (−xi ) = 2i (−xi ) ∂θ1 51 Stochastic Gradient Descent : Example Iteration 1 ∂MSE θ0 = θ0 − α ∂θ0 ∂MSE θ1 = θ1 − α ∂θ1 52 Stochastic Gradient Descent : Example Iteration 1 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 4 − 0.1 × 2 × (2 − (4 + 0)) (−1) θ0 = 3.6 ∂MSE θ 1 = θ1 − α ∂θ1 52 Stochastic Gradient Descent : Example Iteration 1 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 4 − 0.1 × 2 × (2 − (4 + 0)) (−1) θ0 = 3.6 ∂MSE θ 1 = θ1 − α ∂θ1 θ1 = 0 − 0.1 × 2 × (2 − (4 + 0)) (−2) θ1 = −0.8 52 Stochastic Gradient Descent : Example Iteration 2 ∂MSE θ0 = θ0 − α ∂θ0 ∂MSE θ1 = θ1 − α ∂θ1 53 Stochastic Gradient Descent : Example Iteration 2 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 3.6 − 0.1 × 2 × (3 − (3.6 − 0.8 × 3)) (−1) θ0 = 3.96 ∂MSE θ1 = θ 1 − α ∂θ1 53 Stochastic Gradient Descent : Example Iteration 2 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 3.6 − 0.1 × 2 × (3 − (3.6 − 0.8 × 3)) (−1) θ0 = 3.96 ∂MSE θ1 = θ 1 − α ∂θ1 θ0 = −0.8 − 0.1 × 2 × (3 − (3.6 − 0.8 × 3)) (−3) θ1 = 0.28 53 Stochastic Gradient Descent : Example Iteration 3 ∂MSE θ0 = θ0 − α ∂θ0 ∂MSE θ1 = θ1 − α ∂θ1 54 Stochastic Gradient Descent : Example Iteration 3 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 3.96 − 0.1 × 2 × (1 − (3.96 + 0.28 × 1)) (−1) θ0 = 3.312 ∂MSE θ 1 = θ1 − α ∂θ1 54 Stochastic Gradient Descent : Example Iteration 3 ∂MSE θ0 = θ0 − α ∂θ0 θ0 = 3.96 − 0.1 × 2 × (1 − (3.96 + 0.28 × 1)) (−1) θ0 = 3.312 ∂MSE θ 1 = θ1 − α ∂θ1 θ0 = 0.28 − 0.1 × 2 × (1 − (3.96 + 0.28 × 1)) (−1) θ1 = −0.368 54 Stochastic gradient is an unbiased estimator of the true gradient True Gradient Based on Estimation Theory and Machine Learning by Florian Hartmann Let us say we have a dataset D containing input output pairs {(x1 , y1 ), (x2 , y2 ),... , (xN , yN )} 55 True Gradient Based on Estimation Theory and Machine Learning by Florian Hartmann Let us say we have a dataset D containing input output pairs {(x1 , y1 ), (x2 , y2 ),... , (xN , yN )} We can define overall loss as: N 1 X L(θ) = loss(f (xi , θ), yi ) N i=1 55 True Gradient Based on Estimation Theory and Machine Learning by Florian Hartmann Let us say we have a dataset D containing input output pairs {(x1 , y1 ), (x2 , y2 ),... , (xN , yN )} We can define overall loss as: N 1 X L(θ) = loss(f (xi , θ), yi ) N i=1 loss can be any loss function such as squared loss, cross-entropy loss etc. loss(f (xi , θ), yi ) = (f (xi , θ) − yi )2 55 True Gradient The true gradient of the loss function is given by: n 1X ∇L = ∇ loss (f (xi ) , yi ) n i=1 n 1 X = ∇ loss (f (xi ) , yi ) n i=1 56 True Gradient The true gradient of the loss function is given by: n 1X ∇L = ∇ loss (f (xi ) , yi ) n i=1 n 1 X = ∇ loss (f (xi ) , yi ) n i=1 The above is a consequence of linearity of the gradient operator. 56 Estimator for the true gradient In practice, we do not have access to the true gradient 57 Estimator for the true gradient In practice, we do not have access to the true gradient We can only estimate the true gradient using a subset of the data 57 Estimator for the true gradient In practice, we do not have access to the true gradient We can only estimate the true gradient using a subset of the data For SGD, we use a single example to estimate the true gradient, for mini-batch gradient descent, we use a mini-batch of examples to estimate the true gradient 57 Estimator for the true gradient In practice, we do not have access to the true gradient We can only estimate the true gradient using a subset of the data For SGD, we use a single example to estimate the true gradient, for mini-batch gradient descent, we use a mini-batch of examples to estimate the true gradient Let us say we have a sample: (x, y) 57 Estimator for the true gradient In practice, we do not have access to the true gradient We can only estimate the true gradient using a subset of the data For SGD, we use a single example to estimate the true gradient, for mini-batch gradient descent, we use a mini-batch of examples to estimate the true gradient Let us say we have a sample: (x, y) The estimated gradient is given by: ∇L̃ = ∇ loss(f (x), y ) 57 Bias of the estimator One measure for the quality of an estimator X̃ is its bias or how far off its estimate is on average from the true value X : bias(X ) = E[X̃ ] − X 58 Bias of the estimator One measure for the quality of an estimator X̃ is its bias or how far off its estimate is on average from the true value X : bias(X ) = E[X̃ ] − X Using the rules of expectation, we can show that the expected value of the estimated gradient is the true gradient: n X 1 E[∇L̃] = ∇ loss (f (xi ) , yi ) n i=1 n 1 X = ∇ loss (f (xi ) , yi ) n i=1 = ∇L 58 Bias of the estimator One measure for the quality of an estimator X̃ is its bias or how far off its estimate is on average from the true value X : bias(X ) = E[X̃ ] − X Using the rules of expectation, we can show that the expected value of the estimated gradient is the true gradient: n X 1 E[∇L̃] = ∇ loss (f (xi ) , yi ) n i=1 n 1 X = ∇ loss (f (xi ) , yi ) n i=1 = ∇L Thus, the estimated gradient is an unbiased estimator of the true gradient 58 X y T - - y 1 ↑ I ↑ ↑ ↑ ↑ ↑ ↑ & - xn - YN X y y = f(x 0) , Y T - - y 1 ↑ ↑ I ↑ ↑ ↑ ↑ ↑ ↑ & ↑ ↑ N - ren YN YN ⑦l ~ ⑧ ⑦o SURFACE Ove R LOSS 6N" EXAMPLES ⑦l ~ & - & *< Ot : At tr some iteration ⑦o SURFACE Ove R LOSS 6N" EXAMPLES ~ & Gradient of -- Loss at ot ⑦l &F - -- *< Ot : At some tr iteration ⑦o SURFACE Ove R LOSS 6N" EXAMPLES X y y = f(n 0) , ⑰ty Y T y, 1 S -I : CONSIDER Individual = ↑ ↑ ↑ ↑ ↑ & data points ↑ ! to compute N - ren YN YN Loss I L 1- ⑧ a Ov ! ⑧.. , 0 S O Er 00 80 yi) (YN YN) y /yi Loss Cy loss , loss ,, , 1- ⑳ a I L ⑧ ⑧ ⑧ ⑧ ⑧. E, O Ov S 0 Er 00 80 yi) (YN YN) y /yi Loss Cy loss , loss ,, , 1- a 1? Ov ⑳ ⑧ ⑧ * -⑧ 80.. E, ⑳.. 00 a ⑱ ⑰ Er yi) (YN YN) y /yi Loss (y loss , loss ,, , ⑱ SURFACE OVER LOSS 6N" EXAMPLES ⑦l * ⑦o - 1 & 1- a - ⑧ ⑳ ⑧..a ⑧ * - ⑧ - E, ⑧ Ov. X ⑱ 7 Er 00 80 yi) (YN YN) y /yi Loss (y loss , loss ,, , * ⑧ YL - ⑧ * - ⑧ * ⑧ Tel * li Xlu Gradients for - losses wit ↑ different prints " - - ⑧ I - I - Gradients for losses wit ↑ different prints " S > : ⑧ - Expectation I - I over individual gradients - Gradients for losses wit ↑ different prints " - S > I - : - - - Expec tation ⑧ I - I over individual gradients - Gradient wrt whole data Time Complexity: Gradient Descent v/s Normal Equation for Linear Regression Normal Equation Consider X ∈ RN×D 59 Normal Equation Consider X ∈ RN×D N examples and D dimensions 59 Normal Equation Consider X ∈ RN×D N examples and D dimensions What is the time complexity of solving the normal equation θ̂ = (X T X )−1 X T y ? 59 Normal Equation X has dimensions N × D, X T has dimensions D × N 60 Normal Equation X has dimensions N × D, X T has dimensions D × N X T X is a matrix product of matrices of size: D × N and N × D, which is O(D 2 N) 60 Normal Equation X has dimensions N × D, X T has dimensions D × N X T X is a matrix product of matrices of size: D × N and N × D, which is O(D 2 N) Inversion of X T X is an inversion of a D × D matrix, which is O(D 3 ) 60 Normal Equation X has dimensions N × D, X T has dimensions D × N X T X is a matrix product of matrices of size: D × N and N × D, which is O(D 2 N) Inversion of X T X is an inversion of a D × D matrix, which is O(D 3 ) X T y is a matrix vector product of size D × N and N × 1, which is O(DN) 60 Normal Equation X has dimensions N × D, X T has dimensions D × N X T X is a matrix product of matrices of size: D × N and N × D, which is O(D 2 N) Inversion of X T X is an inversion of a D × D matrix, which is O(D 3 ) X T y is a matrix vector product of size D × N and N × 1, which is O(DN) (X T X )−1 X T y is a matrix product of a D × D matrix and D × 1 matrix, which is O(D 2 ) 60 Normal Equation X has dimensions N × D, X T has dimensions D × N X T X is a matrix product of matrices of size: D × N and N × D, which is O(D 2 N) Inversion of X T X is an inversion of a D × D matrix, which is O(D 3 ) X T y is a matrix vector product of size D × N and N × 1, which is O(DN) (X T X )−1 X T y is a matrix product of a D × D matrix and D × 1 matrix, which is O(D 2 ) Overall complexity: O(D 2 N) + O(D 3 ) + O(DN) + O(D 2 ) = O(D 2 N) + O(D 3 ) 60 Normal Equation X has dimensions N × D, X T has dimensions D × N X T X is a matrix product of matrices of size: D × N and N × D, which is O(D 2 N) Inversion of X T X is an inversion of a D × D matrix, which is O(D 3 ) X T y is a matrix vector product of size D × N and N × 1, which is O(DN) (X T X )−1 X T y is a matrix product of a D × D matrix and D × 1 matrix, which is O(D 2 ) Overall complexity: O(D 2 N) + O(D 3 ) + O(DN) + O(D 2 ) = O(D 2 N) + O(D 3 ) Scales cubic in the number of columns/features of X 60 Gradient Descent Start with random values of θ0 and θ1 Till convergence ∂ P 2 θ0 = θ0 − α ( i ) ∂θ0 61 Gradient Descent Start with random values of θ0 and θ1 Till convergence ∂ P 2 θ0 = θ0 − α ( i ) ∂θ0 ∂ P 2 θ1 = θ1 − α ( i ) ∂θ1 61 Gradient Descent Start with random values of θ0 and θ1 Till convergence ∂ P 2 θ0 = θ0 − α ( i ) ∂θ0 ∂ P 2 θ1 = θ1 − α ( i ) ∂θ1 Question: Can you write the above for D dimensional data in vectorised form? 61 Gradient Descent Start with random values of θ0 and θ1 Till convergence ∂ P 2 θ0 = θ0 − α ( i ) ∂θ0 ∂ P 2 θ1 = θ1 − α ( i ) ∂θ1 Question: Can you write the above for D dimensional data in vectorised form? θ0 = θ0 − α ∂θ∂ 0 (y − X θ)> (y − X θ) θ1 = θ1 − α ∂θ∂ 1 (y − X θ)> (y − X θ)... θD = θD − α ∂θ∂D (y − X θ)> (y − X θ) 61 Gradient Descent Start with random values of θ0 and θ1 Till convergence ∂ P 2 θ0 = θ0 − α ( i ) ∂θ0 ∂ P 2 θ1 = θ1 − α ( i ) ∂θ1 Question: Can you write the above for D dimensional data in vectorised form? θ0 = θ0 − α ∂θ∂ 0 (y − X θ)> (y − X θ) θ1 = θ1 − α ∂θ∂ 1 (y − X θ)> (y − X θ)... θD = θD − α ∂θ∂D (y − X θ)> (y − X θ) θ = θ − α ∂θ ∂ (y − X θ)> (y − X θ) 61 Gradient Descent ∂ − X θ)> (y − X θ) ∂θ (y ∂ y > − θ> X > (y − X θ) = ∂θ ∂ y > y − θ> X > y − y > xθ + θ> X > X θ = ∂θ = −2X > y + 2X > xθ = 2X > (X θ − y ) 62 Gradient Descent We can write the vectorised update equation as follows, for each iteration θ = θ − αX > (X θ − y ) 63 Gradient Descent We can write the vectorised update equation as follows, for each iteration θ = θ − αX > (X θ − y ) For t iterations, what is the computational complexity of our gradient descent solution? 63 Gradient Descent We can write the vectorised update equation as follows, for each iteration θ = θ − αX > (X θ − y ) For t iterations, what is the computational complexity of our gradient descent solution? Hint, rewrite the above as: θ = θ − αX > X θ + αX > y 63 Gradient Descent We can write the vectorised update equation as follows, for each iteration θ = θ − αX > (X θ − y ) For t iterations, what is the computational complexity of our gradient descent solution? Hint, rewrite the above as: θ = θ − αX > X θ + αX > y Complexity of computing X > y is O(DN) 63 Gradient Descent We can write the vectorised update equation as follows, for each iteration θ = θ − αX > (X θ − y ) For t iterations, what is the computational complexity of our gradient descent solution? Hint, rewrite the above as: θ = θ − αX > X θ + αX > y Complexity of computing X > y is O(DN) Complexity of computing αX > y once we have X > y is O(D) since X > y has D entries 63 Gradient Descent We can write the vectorised update equation as follows, for each iteration θ = θ − αX > (X θ − y ) For t iterations, what is the computational complexity of our gradient descent solution? Hint, rewrite the above as: θ = θ − αX > X θ + αX > y Complexity of computing X > y is O(DN) Complexity of computing αX > y once we have X > y is O(D) since X > y has D entries Complexity of computing X > X is O(D 2 N) and then multiplying with α is O(D 2 ) 63 Gradient Descent We can write the vectorised update equation as follows, for each iteration θ = θ − αX > (X θ − y ) For t iterations, what is the computational complexity of our gradient descent solution? Hint, rewrite the above as: θ = θ − αX > X θ + αX > y Complexity of computing X > y is O(DN) Complexity of computing αX > y once we have X > y is O(D) since X > y has D entries Complexity of computing X > X is O(D 2 N) and then multiplying with α is O(D 2 ) All of the above need only be calculated once! 63 Gradient Descent 64 Gradient Descent For each of the t iterations, we now need to first multiply αX > X with θ which is matrix multiplication of a D × D matrix with a D × 1, which is O(D 2 ) 64 Gradient Descent For each of the t iterations, we now need to first multiply αX > X with θ which is matrix multiplication of a D × D matrix with a D × 1, which is O(D 2 ) The remaining subtraction/addition can be done in O(D) for each iteration. 64 Gradient Descent For each of the t iterations, we now need to first multiply αX > X with θ which is matrix multiplication of a D × D matrix with a D × 1, which is O(D 2 ) The remaining subtraction/addition can be done in O(D) for each iteration. What is overall computational complexity? 64 Gradient Descent For each of the t iterations, we now need to first multiply αX > X with θ which is matrix multiplication of a D × D matrix with a D × 1, which is O(D 2 ) The remaining subtraction/addition can be done in O(D) for each iteration. What is overall computational complexity? O(tD 2 ) + O(D 2 N) = O((t + N)D 2 ) 64 Gradient Descent (Alternative) 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) Computing X θ − y is O(N) 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) Computing X θ − y is O(N) Computing αX > is O(ND) 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) Computing X θ − y is O(N) Computing αX > is O(ND) Computing αX > (X θ − y ) is O(ND) 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) Computing X θ − y is O(N) Computing αX > is O(ND) Computing αX > (X θ − y ) is O(ND) Computing θ = θ − αX > (X θ − y ) is O(N) 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) Computing X θ − y is O(N) Computing αX > is O(ND) Computing αX > (X θ − y ) is O(ND) Computing θ = θ − αX > (X θ − y ) is O(N) 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) Computing X θ − y is O(N) Computing αX > is O(ND) Computing αX > (X θ − y ) is O(ND) Computing θ = θ − αX > (X θ − y ) is O(N) What is overall computational complexity? 65 Gradient Descent (Alternative) If we do not rewrite the expression θ = θ − αX > (X θ − y ) For each iteration, we have: Computing X θ is O(ND) Computing X θ − y is O(N) Computing αX > is O(ND) Computing αX > (X θ − y ) is O(ND) Computing θ = θ − αX > (X θ − y ) is O(N) What is overall computational complexity? O(NDt) 65