# Lecture 14: October 26, 2023

## Review

### Gradient Descent

$\qquad x_{k+1} = x_k - \alpha \nabla f(x_k)$

### Stochastic Gradient Descent

$\qquad x_{k+1} = x_k - \alpha \nabla f_i(x_k)$

Where $i$ is sampled uniformly at random from $\{1, \cdots, n\}$.

$\qquad \mathbb{E} [\nabla f_i(x)] = \nabla f(x)$

### Mini-Batch

$\qquad x_{k+1} = x_k - \alpha \frac{1}{|B|}\sum_{i \in B} \nabla f_i(x_k)$

Where $B$ is a random subset of $\{1, \cdots, n\}$.

## Today

### Using momentum to accelerate gradient descent

## Momentum

The idea of momentum is to keep a running record of past updates and use it to influence the current update.

$\qquad v_{k+1} = \beta v_k + \nabla f(x_k)$

$\qquad x_{k+1} = x_k - \alpha v_{k+1}$

$\qquad v_0 = 0$

Where $\beta \in [0, 1)$ is a hyperparameter that controls how much of the past updates we keep. When $\beta = 0$, this reduces to plain gradient descent.

### Intuition

When $\beta > 0$, the algorithm "remembers" past updates and lets them influence the current step. This can help it avoid getting stuck in local optima and converge faster.

### Polyak's heavy ball method

$\qquad x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$

### Nesterov's accelerated gradient descent

$\qquad v_{k+1} = \beta v_k + \nabla f(x_k - \alpha \beta v_k)$

$\qquad x_{k+1} = x_k - \alpha v_{k+1}$

### Benefits of Momentum

* Faster convergence
* Avoidance of local optima
* Less oscillation

### Adaptive Gradient Methods

The idea behind adaptive gradient methods is to adapt the learning rate for each parameter individually. This is useful when the parameters have different scales, or when some parameters are more important than others.

### AdaGrad

$\qquad x_{k+1} = x_k - \alpha \frac{1}{\sqrt{G_k + \epsilon}} \odot \nabla f(x_k)$

Where $G_k = \sum_{i=1}^k \nabla f(x_i) \odot \nabla f(x_i)$ is the element-wise sum of squared gradients up to iteration $k$ (the diagonal of the accumulated squared-gradient matrix), and $\epsilon > 0$ is a small constant to avoid division by zero.

* Do not use this in practice

### RMSProp

$\qquad v_{k+1} = \beta v_k + (1 - \beta) \nabla f(x_k) \odot \nabla f(x_k)$

$\qquad x_{k+1} = x_k - \alpha \frac{1}{\sqrt{v_{k+1} + \epsilon}} \odot \nabla f(x_k)$

Where $\beta \in [0, 1)$ is a hyperparameter that controls how much of the past squared gradients we keep, and $\epsilon > 0$ is a small constant to avoid division by zero.

### Adam

$\qquad v_{k+1} = \beta_1 v_k + (1 - \beta_1) \nabla f(x_k)$

$\qquad s_{k+1} = \beta_2 s_k + (1 - \beta_2) \nabla f(x_k) \odot \nabla f(x_k)$

$\qquad \hat{v}_{k+1} = \frac{v_{k+1}}{1 - {\beta_1}^{k+1}}$

$\qquad \hat{s}_{k+1} = \frac{s_{k+1}}{1 - {\beta_2}^{k+1}}$

$\qquad x_{k+1} = x_k - \alpha \frac{\hat{v}_{k+1}}{\sqrt{\hat{s}_{k+1}} + \epsilon}$

Where $\beta_1, \beta_2 \in [0, 1)$ are hyperparameters that control how much of the past first and second moments we keep, and $\epsilon > 0$ is a small constant to avoid division by zero.

Typically, we use $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. Due to the bias correction, Adam is relatively insensitive to the choice of hyperparameters.
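The momentum and Nesterov updates above translate directly into code. Below is a minimal NumPy sketch on a small ill-conditioned quadratic; the matrix `A`, the step size `alpha`, and the iteration counts are illustrative choices of mine, not values from the lecture.

```python
import numpy as np

# Illustrative objective: f(x) = 0.5 * x^T A x with a poorly conditioned A.
# A, alpha, beta, and iters below are example choices, not from the lecture.
A = np.diag([1.0, 25.0])

def grad_f(x):
    return A @ x

def momentum_descent(x0, alpha=0.02, beta=0.9, iters=200):
    """Momentum: v_{k+1} = beta * v_k + grad f(x_k); x_{k+1} = x_k - alpha * v_{k+1}."""
    x, v = x0.copy(), np.zeros_like(x0)   # v_0 = 0, as in the notes
    for _ in range(iters):
        v = beta * v + grad_f(x)
        x = x - alpha * v
    return x

def nesterov_descent(x0, alpha=0.02, beta=0.9, iters=200):
    """Nesterov: evaluate the gradient at the look-ahead point x_k - alpha * beta * v_k."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        v = beta * v + grad_f(x - alpha * beta * v)
        x = x - alpha * v
    return x

x0 = np.array([5.0, 5.0])
print(momentum_descent(x0))   # both should approach the minimizer at the origin
print(nesterov_descent(x0))
```

Eliminating $v$ from the momentum sketch recovers Polyak's heavy-ball form above, since $x_{k+1} - x_k = \beta(x_k - x_{k-1}) - \alpha \nabla f(x_k)$.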
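The adaptive updates can be sketched the same way. The following is a hedged NumPy sketch of the RMSProp and Adam updates from these notes on the same illustrative quadratic; apart from the typical $\beta_1$, $\beta_2$, and $\epsilon$ values quoted above, the objective, step size, and iteration counts are assumptions for the example.

```python
import numpy as np

# Same illustrative quadratic as in the previous sketch.
A = np.diag([1.0, 25.0])

def grad_f(x):
    return A @ x

def rmsprop(x0, alpha=0.1, beta=0.9, eps=1e-8, iters=500):
    """RMSProp: scale each coordinate's step by a running RMS of its gradients."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        g = grad_f(x)
        v = beta * v + (1 - beta) * g * g          # element-wise squared gradients
        x = x - alpha * g / np.sqrt(v + eps)
    return x

def adam(x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, iters=500):
    """Adam: momentum on the gradient plus RMSProp-style scaling, with bias correction."""
    x = x0.copy()
    v, s = np.zeros_like(x0), np.zeros_like(x0)    # first and second moment estimates
    for k in range(1, iters + 1):
        g = grad_f(x)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g * g
        v_hat = v / (1 - beta1 ** k)               # bias correction for v_0 = s_0 = 0
        s_hat = s / (1 - beta2 ** k)
        x = x - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return x

x0 = np.array([5.0, 5.0])
print(rmsprop(x0))   # both should land near the origin, though see the note below
print(adam(x0))
```

Because neither sketch decays its step size, expect the iterates to settle within a neighborhood of the minimizer on the order of $\alpha$ rather than converge exactly.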