Full Transcript


Multi-Layer Perceptrons (MLP) & Stochastic Gradient Descent (SGD)
Kevin Bryson
Images acknowledged from: Deep Learning with PyTorch, Manning Publishers

Overview
We will first look at the idea of using gradient descent to improve a very simple network consisting of a single neuron. We will introduce the key components:
- the model f(x; w)
- the cost or loss function L(f(x; w), y)
- optimizing the weights and bias using stochastic gradient descent.
Then we will show how this applies to neural networks consisting of multiple layers of functions that map vectors to vectors in a non-linear manner. We will look at different types of activation function and why these are required. Finally, you should gain an overall understanding of how gradient descent can be used to optimize the weight and bias parameters in deep neural networks so as to minimize the loss over a given set of training data.

You should be familiar with this...
[Figure: example images labelled Dog / Cat]

A single perceptron neuron with 2 inputs (Rosenblatt, 1958)
[Figure: a perceptron with inputs x1 and x2, weights w^(1) and w^(2), bias b, and output y(x; w)]
https://en.wikipedia.org/wiki/Perceptron

Introduce a cost or loss function to be optimized... is MSE (mean squared error) appropriate for this?

So learning is optimisation
- If your function is convex, you may have efficient algorithms for finding the global minimum (quadratic optimisation: SVM).
- If not, you are left with iterative algorithms that are only guaranteed to give you a local optimum.
- One of the simplest and most popular is gradient descent.
https://en.wikipedia.org/wiki/Maxima_and_minima#/media/File:Extrema_example_original.svg

Optimizing the cost or loss based on a single parameter
Which way should we change the weight? For a single weight w the loss is
l(w) = L(f(x; w), y),   with gradient   ∇_w L = ∂L/∂w.
This requires the model f(x; w) and the loss function to be smooth.

Gradient descent
w = w − α ∇_w L, where α is the learning rate.

Step size or learning rate is important...

Gradient descent of a loss which depends on a vector of weight parameters
l(w) = L(f(x; w), y)
∇_w L = (∂L/∂w^(1), ∂L/∂w^(2)) = (∂L/∂f · ∂f/∂w^(1), ∂L/∂f · ∂f/∂w^(2))
w = w − α ∇_w L

How should the gradient be calculated when we have a thousand (x, y) training points?
In theory, all thousand (x, y) training points should be used to determine the gradient at each point on the loss(w) surface, by cycling through all of them and averaging the gradients. But this is very inefficient! Stochastic Gradient Descent (SGD) instead calculates the average gradient over a mini-batch of samples... say 100. This makes the downhill walk more stochastic... hence the name. (A small worked sketch of this mini-batch update appears after the slides below.)

SGD uses approximate gradients... but this can be beneficial for avoiding local minima! Also, the 2D view is misleading since, in high-dimensional spaces, 'stationary points' (gradient = 0) are likely to be saddle points, from which the search point can escape along one of the dimensions in which the function is decreasing.

Applying this to neural nets
Simple deep networks (Chollet, 2018)
Including the bias in the weight matrix
Overall parameter optimization (Chollet, 2018)
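To make the mini-batch update concrete, here is a minimal NumPy sketch of SGD applied to a single linear neuron with an MSE loss. It is not taken from the lecture: the synthetic data, the learning rate of 0.05, the 20 epochs and the batch size of 100 are illustrative assumptions. Each step averages the gradient over a mini-batch and then applies w = w − α ∇_w L.

```python
# Minimal sketch of mini-batch SGD on a single neuron with an MSE loss.
# The synthetic data, learning rate, epoch count and batch size are
# illustrative assumptions, not values from the lecture.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: 1000 points with 2 features and a noisy linear target.
X = rng.normal(size=(1000, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = X @ true_w + true_b + 0.1 * rng.normal(size=1000)

# Model f(x; w, b) = w . x + b  (a single neuron with identity activation).
w = np.zeros(2)
b = 0.0
alpha = 0.05        # learning rate
batch_size = 100    # mini-batch size (the slides mention "say 100")

for epoch in range(20):
    perm = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        pred = xb @ w + b                       # forward pass f(x; w, b)
        err = pred - yb                         # prediction error; dL/df_i = 2 * err_i / N for MSE

        # Average gradient over the mini-batch (chain rule: dL/dw = dL/df * df/dw).
        grad_w = 2 * xb.T @ err / len(idx)
        grad_b = 2 * err.mean()

        # Gradient-descent update: w = w - alpha * grad.
        w -= alpha * grad_w
        b -= alpha * grad_b

print("learned w:", w, "learned b:", b)         # should approach [2.0, -1.0] and 0.5
```

Shrinking the batch size makes the downhill walk noisier; using the whole data set for every step recovers plain (batch) gradient descent.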
Why do we have activation functions?

Activation functions – step function (Rosenblatt, 1950s)
A simple step function was used for the original perceptron developed in the 1950s:
f(s) = 1 if s = w·x + b > 0, and −1 otherwise.
The gradient is not defined at 0, and it is zero everywhere else... this makes gradient descent with this activation function difficult!

Activation functions – sigmoid (1980s...)

Sigmoids in multiple dimensions
The decision boundary can vary enormously depending on the weights.

Activation functions – tanh

Derivative of the activation function
Saturation issues: the response is approximately linear around the origin but has a very low gradient in the saturation areas. See Glorot & Bengio (2011) for details. (A small numerical sketch of these gradients appears after the reading list below.)

Activation functions – ReLU

Summary
You should have a clear idea of the architecture of a feed-forward neural network consisting of layers, neurons, weights, biases and activation functions. The network can be seen as acting like a non-linear function f(x; w) that predicts a value for a given input x (a toy forward-pass sketch is also included after the reading list). You should understand how stochastic gradient descent can be used to optimize the weight and bias parameters in the network to give minimal loss when predicting the training data.

READING: Chapter 2 of Drori et al. 2023.
READING: Chapter 6 of Goodfellow et al. 2018. https://www.deeplearningbook.org/contents/mlp.html (Sections 6.1-6.5.7 are examinable)
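The saturation issue above can be seen numerically. The following sketch is an illustrative assumption rather than lecture material (the sample points in s are arbitrary); it evaluates the derivatives of the sigmoid, tanh and ReLU activations, showing that the sigmoid and tanh gradients collapse towards zero for large |s| while the ReLU gradient stays at 1 for any positive input.

```python
# Sketch of the activation functions discussed above and their derivatives,
# used only to illustrate saturation: for large |s| the sigmoid and tanh
# gradients shrink towards zero, while the ReLU gradient is 1 for s > 0.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def d_sigmoid(s):
    p = sigmoid(s)
    return p * (1.0 - p)          # maximum 0.25 at s = 0, near 0 for large |s|

def d_tanh(s):
    return 1.0 - np.tanh(s) ** 2  # near 0 for large |s|

def relu(s):
    return np.maximum(0.0, s)

def d_relu(s):
    return (s > 0).astype(float)  # 1 for s > 0, 0 otherwise (undefined at 0, taken as 0 here)

s = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid'(s):", d_sigmoid(s))   # tiny at +-10: the saturation regions
print("tanh'(s):   ", d_tanh(s))
print("ReLU'(s):   ", d_relu(s))
```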
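Finally, a toy forward pass showing the network as a non-linear function f(x; W) built from layers that map vectors to vectors, with the bias folded into the weight matrix by appending a constant 1 to each layer's input. The layer sizes and random weights are arbitrary assumptions, not values from the slides.

```python
# Sketch of a two-layer feed-forward network treated as a function f(x; W),
# with the bias folded into each weight matrix as an extra column.
import numpy as np

rng = np.random.default_rng(1)

def relu(s):
    return np.maximum(0.0, s)

def with_bias_input(x):
    """Append a constant 1 so the bias becomes an extra column of the weight matrix."""
    return np.concatenate([x, np.ones(1)])

# Augmented weight matrices: the last column plays the role of the bias vector.
W1 = rng.normal(size=(4, 2 + 1))   # maps a 2-vector (plus bias input) to a 4-vector
W2 = rng.normal(size=(1, 4 + 1))   # maps a 4-vector (plus bias input) to a scalar

def f(x):
    h = relu(W1 @ with_bias_input(x))   # hidden layer: linear map followed by non-linear activation
    return W2 @ with_bias_input(h)      # output layer (linear here)

print(f(np.array([0.5, -1.2])))         # a single prediction for one input vector
```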
