AI305 Introduction to Machine Learning - Deep Learning PDF
These are lecture notes from a course on Introduction to Machine Learning, focusing on Deep Learning. The document covers key concepts, such as neural networks, and provides a basic overview of these topics.
Full Transcript
Introduction to Machine Learning AI 305 – Deep Learning

Introduction
Neural networks became popular in the 1980s. There were many successes, a lot of hype, and great conferences: NeurIPS, Snowbird. Then along came SVMs, random forests, and boosting in the 1990s, and neural networks took a back seat. They re-emerged around 2010 as deep learning, and by the 2020s they were very dominant and successful. Part of this success is due to vast improvements in computing power, larger training sets, and software such as TensorFlow and PyTorch. Much of the credit goes to three pioneers and their students: Yann LeCun, Geoffrey Hinton and Yoshua Bengio, who received the 2019 ACM Turing Award for their work on neural networks.

Machine Learning Basics
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. In the standard workflow, a machine learning algorithm is trained on labeled data to produce a learned model, which is then used to make predictions on new data. In short: methods that can learn from, and make predictions on, data.

ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations and input features; ML then becomes just optimizing weights to best make a final prediction.

What is Deep Learning (DL)?
Deep learning is a machine learning subfield concerned with learning representations of data. It is exceptionally effective at learning patterns. Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers. If you provide the system with tons of information, it begins to understand it and respond in useful ways.
Picture from: https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png

Why is DL useful?
o Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual, and linguistic information.
o It can learn in both unsupervised and supervised settings.
o It enables effective end-to-end joint system learning.
o It can utilize large amounts of training data.
Around 2010, DL started outperforming other ML techniques, first in speech and vision, then in NLP.

Representational Power
NNs with at least one hidden layer are universal approximators: given any continuous function h(x) and some ε > 0, there exists a NN with one hidden layer (and with a reasonable choice of non-linearity) computing a function f(x) such that ∀x, |h(x) − f(x)| < ε. In other words, a NN can approximate any arbitrarily complex continuous function. NNs use a nonlinear mapping of the inputs x to the outputs f(x) to compute complex decision boundaries.
But then, why use deeper NNs?
▪ The fact that deep NNs work better is an empirical observation.
▪ Mathematically, deep NNs have the same representational power as a one-layer NN.

Perceptron
The perceptron is the basic processing element. It has inputs that may come from the environment or may be the outputs of other perceptrons.
The perceptron computes a weighted sum of its inputs:

    y = Σ_{j=1}^{d} w_j x_j + w_0 = wᵀx,  where w = [w_0, w_1, ..., w_d]ᵀ and x = [1, x_1, ..., x_d]ᵀ   (Rosenblatt, 1962)

Single Layer Neural Network
A single neuron with weights w1, w2, w3 multiplies each input by its weight, sums the results, and passes the sum through an activation function f(x). For example, with weights (-0.06, 2.5, 1.4) and inputs (2.7, 8.6, 0.002):

    x = -0.06×2.7 + 2.5×8.6 + 1.4×0.002 = 21.34

A dataset
Features           class
1.4   2.7   1.9    0
3.8   3.4   3.2    0
6.4   2.8   1.7    1
4.1   0.1   0.2    0
etc …

Training the neural network
Using the training data above, training proceeds as follows:
1. Initialise the network with random weights.
2. Present a training instance, e.g. (1.4, 2.7, 1.9).
3. Feed it through the network to get an output, e.g. 0.8.
4. Compare with the target output (0 for this instance); the error is 0.8.
5. Adjust the weights based on the error.
6. Present the next training instance, e.g. (6.4, 2.8, 1.7); the network outputs 0.9, the target is 1, so the error is -0.1; adjust the weights again.
And so on. Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments. Algorithms for weight adjustment are designed to make changes that will reduce the error.

The decision boundary perspective…
Starting from the initial random weights, each presented training instance adjusts the weights and nudges the decision boundary; eventually the boundary separates the classes.

Some points
Weight-learning algorithms for NNs are "dumb": they work by making thousands and thousands of tiny adjustments, each making the network do better on the most recent instance, but perhaps a little worse on many others. But, by dumb luck, this eventually tends to be good enough to learn effective classifiers for many real applications.

Some other points
If f(x) is non-linear, a network with 1 hidden layer can, in theory, learn perfectly any classification problem: a set of weights exists that can produce the targets from the inputs. The problem is finding them.
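A rough sketch of this training loop in Python/NumPy is shown below. The sigmoid activation, the learning rate, and the delta-rule update are assumptions made only for illustration; the slides do not specify the exact update rule used.

    # A minimal sketch of the training loop described above (assumed details:
    # sigmoid activation and a simple delta-rule update).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # The small dataset from the slide: three features and a binary class label.
    X = np.array([[1.4, 2.7, 1.9],
                  [3.8, 3.4, 3.2],
                  [6.4, 2.8, 1.7],
                  [4.1, 0.1, 0.2]])
    y = np.array([0, 0, 1, 0])

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=3)   # initialise with random weights
    b = 0.0
    lr = 0.1                            # learning rate (illustrative value)

    for step in range(10000):           # thousands of tiny adjustments
        i = rng.integers(len(X))        # pick a random training instance
        out = sigmoid(X[i] @ w + b)     # feed it through to get an output
        error = out - y[i]              # compare with the target output
        grad = error * out * (1 - out)  # slope of the sigmoid times the error
        w -= lr * grad * X[i]           # adjust weights based on the error
        b -= lr * grad

    print(np.round(sigmoid(X @ w + b), 2))  # predictions after training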
Some other points
If f(x) is linear, the NN can only draw straight decision boundaries (even if there are many layers of units). NNs use a nonlinear f(x) so they can draw complex boundaries, but they keep the data unchanged. SVMs, by contrast, only draw straight lines, but they transform the data first in a way that makes that OK.

Example Application: Handwriting Digit Recognition
The input image is 16 × 16 = 256 pixels, with ink → 1 and no ink → 0, so the input is x1, ..., x256. The output is y1, ..., y10, where each dimension represents the confidence of a digit; e.g., y2 = 0.7 means "the image is a 2". The machine implements a function f: R^256 → R^10, and in deep learning this function f is represented by a neural network.

Elements of Neural Network
NNs consist of hidden layers with neurons (i.e., computational units). A single neuron maps a set of inputs into an output number, i.e., f: R^K → R:

    z = a1 w1 + a2 w2 + ... + aK wK + b,   output = σ(z)

where w1, ..., wK are the weights, b is the bias, and σ is the activation function.

Neural Network
A network is organised as an input layer, hidden layers (Layer 1, Layer 2, ..., Layer L), and an output layer, with inputs x1, ..., xN and outputs y1, ..., yM. "Deep" means many hidden layers.

Example of Neural Network
With the sigmoid activation function

    σ(z) = 1 / (1 + e^(-z)),

a small two-input, two-output network with fixed weights and biases computes, for example, f(1, -1) = (0.62, 0.83) and f(0, 0) = (0.51, 0.85). Different parameters define different functions.

Matrix Operation
Each layer can be written as a matrix operation, σ(W x + b) = a. For the first layer of the example network:

    σ( [ 1 -2 ; -1 1 ] [ 1 ; -1 ] + [ 1 ; 0 ] ) = σ( [ 4 ; -2 ] ) = [ 0.98 ; 0.12 ]

Neural Network as a composition of layers
With weight matrices W^1, ..., W^L and bias vectors b^1, ..., b^L:

    y = f(x) = σ( W^L ⋯ σ( W^2 σ( W^1 x + b^1 ) + b^2 ) ⋯ + b^L )

Parallel computing techniques are used to speed up the matrix operations.

Softmax Layer
In multi-class classification tasks, the output layer is typically a softmax layer, i.e., it employs a softmax activation function. If a layer with sigmoid activations is used as the output layer instead, the predictions of the NN may not be easy to interpret (note that an output layer with sigmoid activations can still be used for binary classification).
A layer with sigmoid activations maps each input independently, e.g. z1 = 3 → σ(z1) = 0.95, z2 = 1 → σ(z2) = 0.73, z3 = -3 → σ(z3) = 0.05. The softmax layer instead outputs a probability distribution: each yi is in the range [0, 1] and Σi yi = 1. The values z fed into the softmax layer are referred to as logits:

    yi = e^(zi) / Σj e^(zj)

For example, z = (3, 1, -3) gives e^z ≈ (20, 2.7, 0.05) and y ≈ (0.88, 0.12, ≈0).
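The layer-as-matrix-operation and softmax computations above can be sketched in a few lines of NumPy. The numbers reuse the first-layer and logits examples from the slides; the full multi-layer weights are not given in the notes, so only the first layer is reproduced here.

    # A minimal sketch of sigma(W x + b) for one layer plus a softmax output.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    W1 = np.array([[1.0, -2.0],
                   [-1.0, 1.0]])
    b1 = np.array([1.0, 0.0])
    x = np.array([1.0, -1.0])

    a1 = sigmoid(W1 @ x + b1)       # one layer: sigma(W x + b)
    print(np.round(a1, 2))          # -> [0.98 0.12]

    logits = np.array([3.0, 1.0, -3.0])
    print(np.round(softmax(logits), 2))   # -> [0.88 0.12 0.  ]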
Activation: Sigmoid
The sigmoid function σ takes a real-valued number and "squashes" it into the range between 0 and 1. The output can be interpreted as the firing rate of a biological neuron (not firing = 0; fully firing = 1). When the neuron's activations are near 0 or 1, sigmoid neurons saturate: gradients in these regions are almost zero, so almost no signal flows. Sigmoid activations are therefore less common in modern NNs, since they cause the vanishing gradient problem, which happens when the partial derivative of the loss function approaches a value close to zero.

Activation: Tanh
The tanh function takes a real-valued number and "squashes" it into the range between -1 and 1. Like sigmoid, tanh neurons saturate; unlike sigmoid, the output is zero-centered, so tanh is usually preferred over sigmoid. Tanh is a scaled sigmoid: tanh(x) = 2·σ(2x) − 1.

Activation: ReLU
ReLU (Rectified Linear Unit) takes a real-valued number and thresholds it at zero: f(x) = max(0, x). Most modern deep NNs use ReLU activations. ReLU is fast to compute compared to sigmoid and tanh (simply threshold a matrix at zero), it accelerates the convergence of gradient descent due to its linear, non-saturating form, and it prevents the vanishing gradient problem.

Activation: Leaky ReLU
The problem with ReLU activations is that they can "die": ReLU can cause weights to update in such a way that the gradients become zero and the neuron never activates again on any data (e.g., when a large learning rate is used). The leaky ReLU activation function is a variant of ReLU: instead of the function being 0 when x < 0, a leaky ReLU has a small negative slope,

    f(x) = αx for x < 0,  and  f(x) = x for x ≥ 0   (e.g., α = 0.01, or similar).

This resolves the dying ReLU problem. Still, most current works use plain ReLU: with a proper setting of the learning rate, the problem of dying ReLU can be avoided.

Activation: Linear Function
A linear activation means that the output signal is proportional to the input signal of the neuron, f(x) = cx. If the constant c is 1, it is also called the identity activation function. This activation type is used in regression problems: e.g., the last layer can have a linear activation function in order to output a real number (and not a class membership).
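For reference, here is a small NumPy sketch of the activation functions discussed above. It is illustrative only; deep learning libraries ship their own implementations.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

    def tanh(x):
        return np.tanh(x)                     # zero-centered, squashes to (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)             # thresholds at zero

    def leaky_relu(x, alpha=0.01):
        return np.where(x < 0, alpha * x, x)  # small negative slope below zero

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    for f in (sigmoid, tanh, relu, leaky_relu):
        print(f.__name__, np.round(f(x), 3))

    # Checking the identity from the notes: tanh(x) = 2*sigmoid(2x) - 1
    print(np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1))   # -> True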
Training NNs
The network parameters θ include the weight matrices and bias vectors from all layers:

    θ = { W^1, b^1, W^2, b^2, ..., W^L, b^L }

Often, the model parameters θ are simply referred to as weights. Training a model to learn a set of parameters θ that are optimal (according to some criterion) is one of the greatest challenges in ML.

Data preprocessing helps convergence during training:
▪ Mean subtraction, to obtain zero-centered data: subtract the mean for each individual data dimension (feature).
▪ Normalization: divide each feature by its standard deviation (to obtain a standard deviation of 1 for each data dimension), or scale the data to the range [0, 1] or [-1, 1] (e.g., image pixel intensities are divided by 255 to be scaled into the [0, 1] range).
Picture from: https://cs231n.github.io/neural-networks-2/

To train a NN for digit recognition, set the parameters θ such that, for a training subset of images, the corresponding element of the predicted output has the maximum value: for an image of a "1", y1 should have the maximum value; for an image of a "2", y2 should have the maximum value; and so on.

Define a loss function / objective function / cost function ℒ(θ) that measures the difference (error) between the model prediction and the true label; e.g., ℒ(θ) can be mean-squared error, cross-entropy, etc. For example, for an input image whose true label is "1", the target output is (1, 0, ..., 0); if the network predicts outputs such as y1 = 0.2, y2 = 0.3, ..., y10 = 0.5, the cost ℒ(θ) measures the mismatch between this prediction and the target.

For a training set of N images, calculate the total loss over all images:

    ℒ(θ) = Σ_{n=1}^{N} ℒ^n(θ)

and find the optimal parameters θ* that minimize the total loss ℒ(θ): each training input x^n is passed through the NN to obtain a prediction ŷ^n, which is compared against the label y^n to give the per-example loss ℒ^n(θ).
(Slide credit: Hung-yi Lee – Deep Learning Tutorial)

Loss Functions
Classification tasks: the training examples are pairs of N inputs x_i and ground-truth class labels y_i; the output layer uses softmax activations (which map to a probability distribution); the loss function is the cross-entropy

    ℒ(θ) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ]

where y^(i) are the ground-truth class labels and ŷ^(i) are the model's predicted class labels.

Regression tasks: the training examples are pairs of N inputs x_i and ground-truth output values y_i; the output layer uses a linear (identity) or sigmoid activation; the loss function is the mean squared error

    ℒ(θ) = (1/n) Σ_{i=1}^{n} ( y^(i) − ŷ^(i) )²

or the mean absolute error

    ℒ(θ) = (1/n) Σ_{i=1}^{n} | y^(i) − ŷ^(i) |

Training NNs: optimizing the loss function ℒ(θ)
▪ Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm.
▪ GD applies iterative refinement of the network parameters θ.
▪ GD updates θ in the opposite direction of the gradient of the loss with respect to the NN parameters, ∇ℒ(θ) = ∂ℒ/∂θ_i, because the gradient ∇ℒ(θ) gives the direction of fastest increase of the loss ℒ(θ) when the parameters θ are changed.

Gradient Descent Algorithm
Steps in the gradient descent algorithm:
1. Randomly initialize the model parameters, θ^0.
2. Compute the gradient of the loss function at the current parameters: ∇ℒ(θ^0).
3. Update the parameters: θ^new = θ^0 − α∇ℒ(θ^0), where α is the learning rate.
4. Go to step 2 and repeat (until a terminating criterion is reached).
Starting from the initial parameters, the updates move toward the global loss minimum ℒ_min.

Example: a NN with only 2 parameters w1 and w2, i.e., θ = (w1, w2), where the different colors of the loss surface represent the values of the loss and the minimum is at θ*:
1. Randomly pick a starting point θ^0.
2. Compute the gradient at θ^0: ∇ℒ(θ^0), i.e., (∂ℒ(θ^0)/∂w1, ∂ℒ(θ^0)/∂w2).
3. Multiply by the learning rate α and update θ: θ^1 = θ^0 − α∇ℒ(θ^0).
4. Go to step 2 and repeat.
Continuing the example, after repeated updates θ^new = θ^old − α∇ℒ(θ^old) we would eventually reach a minimum.
(Slide credit: Hung-yi Lee – Deep Learning Tutorial)
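A minimal sketch of the gradient descent loop above, applied to a made-up two-parameter loss; the real loss surface in the slides is only shown as a picture, so this surface is purely illustrative.

    import numpy as np

    def loss(theta):
        w1, w2 = theta
        return (w1 - 2.0) ** 2 + 0.5 * (w2 + 1.0) ** 2

    def grad(theta):
        w1, w2 = theta
        return np.array([2.0 * (w1 - 2.0), 1.0 * (w2 + 1.0)])

    theta = np.array([-4.0, 3.0])   # step 1: an arbitrary starting point
    alpha = 0.1                     # learning rate

    for step in range(200):         # steps 2-4: compute gradient, update, repeat
        theta = theta - alpha * grad(theta)

    print(np.round(theta, 3), round(loss(theta), 6))   # converges to (2, -1)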
Gradient descent stops when a local minimum of the loss surface is reached. GD does not guarantee reaching a global minimum; however, empirical evidence suggests that GD works well for NNs.
Picture from: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/

For most tasks, the loss surface ℒ(θ) is highly complex (and non-convex). Random initialization in NNs results in different initial parameters θ^0 every time the NN is trained, so gradient descent may reach a different minimum at every run and the NN will produce different predicted outputs. In addition, we currently have no algorithms that guarantee reaching a global minimum for an arbitrary loss function.
(Slide credit: Hung-yi Lee – Deep Learning Tutorial)

Backpropagation
Modern NNs employ the backpropagation method for calculating the gradients of the loss function, ∇ℒ(θ) = ∂ℒ/∂θ_i. Backpropagation is short for "backward propagation". For training NNs, forward propagation (the forward pass) refers to passing the inputs x through the hidden layers to obtain the model outputs (predictions) ŷ; the loss ℒ(y, ŷ) is then calculated. Backpropagation traverses the network in reverse order, from the outputs ŷ backward toward the inputs x, to calculate the gradients of the loss ∇ℒ(θ); the chain rule is used for calculating the partial derivatives of the loss function with respect to the parameters θ in the different layers of the network. Each update of the model parameters θ during training takes one forward and one backward pass (e.g., over a batch of inputs). Automatic calculation of the gradients (automatic differentiation) is available in all current deep learning libraries; it significantly simplifies the implementation of deep learning algorithms, since it avoids deriving the partial derivatives of the loss function by hand.

Mini-batch Gradient Descent
It is wasteful to compute the loss over the entire training dataset to perform a single parameter update for large datasets (e.g., ImageNet has 14M images). Therefore, GD (a.k.a. vanilla GD) is almost always replaced with mini-batch GD.
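To make the forward and backward passes concrete, here is a hand-written sketch of backpropagation for a tiny one-hidden-layer network with a sigmoid hidden layer, a linear output, and a squared-error loss. The network sizes and values are made up for illustration; in practice automatic differentiation does this for you.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    x, y = np.array([1.0, -1.0]), 0.5
    W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
    W2, b2 = rng.normal(size=2), 0.0

    # Forward pass: compute the prediction and the loss
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    y_hat = W2 @ a1 + b2
    loss = 0.5 * (y_hat - y) ** 2
    print(round(float(loss), 4))        # loss before the update

    # Backward pass: apply the chain rule layer by layer, output to input
    d_yhat = y_hat - y                  # dL/dy_hat
    dW2 = d_yhat * a1                   # dL/dW2
    db2 = d_yhat
    da1 = d_yhat * W2                   # back through the output layer
    dz1 = da1 * a1 * (1 - a1)           # back through the sigmoid nonlinearity
    dW1 = np.outer(dz1, x)              # dL/dW1
    db1 = dz1

    # One gradient descent update using the backpropagated gradients
    alpha = 0.1
    W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
    W2, b2 = W2 - alpha * dW2, b2 - alpha * db2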
Mini-batch gradient descent approach:
o Compute the loss ℒ(θ) on a mini-batch of images, update the parameters θ, and repeat until all images have been used.
o At the next epoch, shuffle the training data and repeat the above process.
Mini-batch GD results in much faster training. A typical mini-batch size is 32 to 256 images. It works because the gradient computed on a mini-batch is a good approximation of the gradient over the entire training set.
For example: randomly initialize θ^0; pick the 1st mini-batch, compute its loss C, and update θ^1 ← θ^0 − α∇C(θ^0); pick the 2nd mini-batch and update θ^2 ← θ^1 − α∇C(θ^1); continue until all mini-batches have been picked, which completes one epoch; then repeat the process.

Stochastic Gradient Descent
Stochastic gradient descent (SGD) uses mini-batches that consist of a single input example (e.g., a one-image mini-batch). Although this method is very fast, it may cause significant fluctuations in the loss function, so it is less commonly used and mini-batch GD is preferred. In most DL libraries, "SGD" typically means mini-batch GD (with an option to add momentum).

Problems with Gradient Descent
Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points: at a plateau ∇ℒ(θ) ≈ 0, while at a saddle point or a local minimum ∇ℒ(θ) = 0.
(Slide credit: Hung-yi Lee – Deep Learning Tutorial)

Gradient Descent with Momentum
Gradient descent with momentum uses the momentum of the gradient for parameter optimization: the movement at each step is the negative of the gradient plus the momentum carried over from previous steps, so the parameters can keep moving even where the gradient is 0.
The parameter update in GD with momentum at iteration t is:

    θ^t = θ^(t−1) − V^t,   where V^t = βV^(t−1) + α∇ℒ(θ^(t−1)),
    i.e., θ^t = θ^(t−1) − α∇ℒ(θ^(t−1)) − βV^(t−1).

Compare this to vanilla GD: θ^t = θ^(t−1) − α∇ℒ(θ^(t−1)), where θ^(t−1) are the parameters from the previous iteration t−1.
The term V^t is called momentum. It accumulates the gradients from the past several steps:

    V^t = βV^(t−1) + α∇ℒ(θ^(t−1))
        = β(βV^(t−2) + α∇ℒ(θ^(t−2))) + α∇ℒ(θ^(t−1))
        = β²V^(t−2) + βα∇ℒ(θ^(t−2)) + α∇ℒ(θ^(t−1))
        = β³V^(t−3) + β²α∇ℒ(θ^(t−3)) + βα∇ℒ(θ^(t−2)) + α∇ℒ(θ^(t−1))

This term is analogous to the momentum of a heavy ball rolling down the hill. The parameter β is referred to as the coefficient of momentum; a typical value is 0.9. This method updates the parameters θ in the direction of the weighted average of the past gradients.

Adam
Adaptive Moment Estimation (Adam) combines insights from the momentum optimizers that accumulate the values of past gradients, and it also introduces new terms based on the second moment of the gradient:
o Similar to GD with momentum, Adam computes a weighted average of past gradients (the first moment of the gradient): V^t = β1·V^(t−1) + (1 − β1)·∇ℒ(θ^(t−1)).
o Adam also computes a weighted average of past squared gradients (the second moment of the gradient): U^t = β2·U^(t−1) + (1 − β2)·(∇ℒ(θ^(t−1)))².
The parameter update is:

    θ^t = θ^(t−1) − α·V̂^t / ( √(Û^t) + ε ),   where V̂^t = V^t / (1 − β1^t) and Û^t = U^t / (1 − β2^t).

The proposed default values are β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸.
Other commonly used optimization methods include Adagrad, Adadelta, RMSprop, Nadam, etc. The most commonly used optimizers nowadays are Adam and SGD with momentum.
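The momentum and Adam update rules above can be sketched as small Python functions. The test problem (minimizing θ²) and the hyper-parameter values are illustrative only; real training code would use the optimizers built into PyTorch or TensorFlow.

    import numpy as np

    def momentum_step(theta, V, grad, alpha=0.01, beta=0.9):
        V = beta * V + alpha * grad                  # accumulate past gradients
        return theta - V, V

    def adam_step(theta, V, U, grad, t, alpha=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
        V = beta1 * V + (1 - beta1) * grad           # first moment
        U = beta2 * U + (1 - beta2) * grad ** 2      # second moment
        V_hat = V / (1 - beta1 ** t)                 # bias correction
        U_hat = U / (1 - beta2 ** t)
        return theta - alpha * V_hat / (np.sqrt(U_hat) + eps), V, U

    # Example: minimize L(theta) = theta^2 with Adam
    theta, V, U = np.array([5.0]), np.zeros(1), np.zeros(1)
    for t in range(1, 2001):
        grad = 2 * theta                             # dL/dtheta
        theta, V, U = adam_step(theta, V, U, grad, t, alpha=0.05)
    print(np.round(theta, 4))                        # close to 0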
Learning Rate
The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step. Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training: it can be too small (very slow progress) or too large (the updates overshoot).
Looking at the training loss for different learning rates: with a learning rate that is too high, the loss increases or plateaus too quickly; with a learning rate that is too low, the loss decreases too slowly (it takes many epochs to reach a solution).
Picture from: https://cs231n.github.io/neural-networks-3/

Learning Rate Scheduling
Learning rate scheduling is applied to change the value of the learning rate during training.
▪ Annealing is reducing the learning rate over time (a.k.a. learning rate decay).
o Approach 1: reduce the learning rate by some factor every few epochs. Typical values: reduce the learning rate by half every 5 epochs, or divide it by 10 every 20 epochs.
o Approach 2: exponential or cosine decay, which gradually reduces the learning rate over time.
o Approach 3: reduce the learning rate by a constant factor (e.g., by half) whenever the validation loss stops improving. In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau(), with monitor: validation loss, factor: 0.1 (i.e., divide by 10), patience: 10 (how many epochs to wait before applying it), and minimum learning rate: 1e-6 (when to stop).
▪ Warmup is gradually increasing the learning rate initially, and afterward letting it cool down until the end of the training.

Vanishing Gradient Problem
In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients). They result in very small or very large updates of the parameters; layers close to the input receive small gradients and learn very slowly. Solutions include changing the learning rate, ReLU activations, regularization, and LSTM units in RNNs.
(Slide credit: Hung-yi Lee – Deep Learning Tutorial)

Generalization
Underfitting:
▪ The model is too "simple" to represent all the relevant class characteristics, e.g., a model with too few parameters.
▪ It produces high error on the training set and high error on the validation set.
Overfitting:
▪ The model is too "complex" and fits irrelevant characteristics (noise) in the data, e.g., a model with too many parameters.
▪ It produces low error on the training set and high error on the validation set.
Overfitting means a model with high capacity fits the noise in the data instead of the underlying relationship: the model may fit the training data very well, but fails to generalize to new examples (test or validation data).
Picture from: http://cs231n.github.io/assets/nn1/layer_sizes.jpeg

Regularization: Weight Decay
ℓ2 weight decay: a regularization term that penalizes large weights is added to the loss function,

    ℒ_reg(θ) = ℒ(θ) + λ Σ_k θ_k²   (data loss + regularization loss)

For every weight in the network, we add the regularization term to the loss value; during the gradient descent parameter update, every weight is decayed linearly toward zero. The weight decay coefficient λ determines how dominant the regularization is during the gradient computation: a large weight decay coefficient means a strong penalty for weights with large values.
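A small sketch showing why ℓ2 regularization is called weight decay: the gradient of the regularization term shrinks every weight toward zero at each update. The gradient of the data loss is a placeholder here, not tied to any particular network.

    import numpy as np

    def l2_regularized_step(theta, grad_data_loss, alpha=0.1, lam=0.01):
        # Gradient of the term lambda * sum(theta^2) is 2 * lambda * theta
        grad = grad_data_loss + 2 * lam * theta
        return theta - alpha * grad
        # Equivalent form: theta * (1 - 2*alpha*lam) - alpha * grad_data_loss,
        # i.e., the weights are decayed linearly toward zero at every update.

    theta = np.array([3.0, -2.0, 0.5])
    zero_grad = np.zeros_like(theta)        # pretend the data loss is flat here
    for _ in range(50):
        theta = l2_regularized_step(theta, zero_grad, lam=0.1)
    print(np.round(theta, 3))               # all weights shrink toward zero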
ℓ1 weight decay: the regularization term is based on the ℓ1 norm of the weights,

    ℒ_reg(θ) = ℒ(θ) + λ Σ_k |θ_k|

ℓ1 weight decay is less common with NNs, as it often performs worse than ℓ2 weight decay. It is also possible to combine ℓ1 and ℓ2 regularization, called elastic net regularization:

    ℒ_reg(θ) = ℒ(θ) + λ1 Σ_k |θ_k| + λ2 Σ_k θ_k²

Regularization: Dropout
Dropout randomly drops units (along with their connections) during training. Each unit is retained with a fixed dropout rate p, independent of the other units. The hyper-parameter p needs to be chosen (tuned); often, between 20% and 50% of the units are dropped. Dropout can be seen as a kind of ensemble learning: each mini-batch trains one network with a slightly different architecture.
(Slide credit: Hung-yi Lee – Deep Learning Tutorial)

Regularization: Early Stopping
During model training, use a validation set (e.g., a validation/train ratio of about 25% to 75%), and stop training when the validation accuracy (or loss) has not improved after n epochs; the parameter n is called patience.

Batch Normalization
Batch normalization layers act similarly to the data preprocessing steps mentioned earlier: they calculate the mean μ and variance σ of a batch of input data, and normalize the data x to zero mean and unit variance, i.e., x̂ = (x − μ) / σ. BatchNorm layers alleviate the problems of proper initialization of the parameters and hyper-parameters; they result in faster training convergence, allow larger learning rates, and reduce the internal covariate shift. BatchNorm layers are inserted immediately after convolutional layers or fully-connected layers, and before activation layers; they are very common in convolutional NNs.

Hyper-parameter Tuning
Training NNs can involve setting many hyper-parameters. The most common hyper-parameters include:
▪ Number of layers, and number of neurons per layer
▪ Initial learning rate
▪ Learning rate decay schedule (e.g., decay constant)
▪ Optimizer type
Other hyper-parameters may include:
▪ Regularization parameters (ℓ2 penalty, dropout rate)
▪ Batch size
▪ Activation functions
▪ Loss function
Hyper-parameter tuning can be time-consuming for larger NNs. Common strategies:
▪ Grid search: check all values in a range with a step value.
▪ Random search: randomly sample values for the parameter; often preferred to grid search.
▪ Bayesian hyper-parameter optimization: an active area of research.

k-Fold Cross-Validation
Using k-fold cross-validation for hyper-parameter tuning is common when the size of the training data is small. It also leads to a better and less noisy estimate of the model performance, by averaging the results across several folds. E.g., 5-fold cross-validation (see the illustration and sketch below):
1. Split the training data into 5 equal folds.
2. First use folds 2-5 for training and fold 1 for validation.
3. Repeat by using fold 2 for validation, then fold 3, fold 4, and fold 5.
4. Average the results over the 5 runs (for reporting purposes).
5. Once the best hyper-parameters are determined, evaluate the model on the test data.
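A minimal sketch of this k-fold loop follows; the "model" (a dummy majority-class scorer) and the data are placeholders, purely for illustration.

    import numpy as np

    def cross_validate(X, y, train_and_score, k=5, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))          # shuffle once before splitting
        folds = np.array_split(idx, k)         # split into k roughly equal folds
        scores = []
        for i in range(k):
            val_idx = folds[i]                                    # fold i for validation
            train_idx = np.concatenate(folds[:i] + folds[i+1:])   # the rest for training
            scores.append(train_and_score(X[train_idx], y[train_idx],
                                          X[val_idx], y[val_idx]))
        return np.mean(scores)                 # average over the k runs

    # Placeholder "model": the score is the validation accuracy of always
    # predicting the majority class seen in the training folds.
    def dummy_train_and_score(X_tr, y_tr, X_val, y_val):
        majority = np.bincount(y_tr).argmax()
        return np.mean(y_val == majority)

    X = np.random.default_rng(1).normal(size=(100, 3))
    y = (X[:, 0] > 0).astype(int)
    print(round(cross_validate(X, y, dummy_train_and_score), 3))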
Illustration of a 5-fold cross-validation:
Picture from: https://scikit-learn.org/stable/modules/cross_validation.html

Ensemble Learning
Ensemble learning is training multiple classifiers separately and combining their predictions. Ensemble learning often outperforms individual classifiers, and better results are obtained with higher model variety in the ensemble.
▪ Bagging (bootstrap aggregating):
o Randomly draw subsets from the training set (i.e., bootstrap samples).
o Train a separate classifier on each subset of the training set.
o Perform classification based on the average vote of all classifiers.
▪ Boosting:
o Train a classifier, and apply weights on the training set (apply higher weights to misclassified examples, i.e., focus on "hard examples").
o Train a new classifier, and reweight the training set according to the prediction error.
o Repeat.
o Perform classification based on the weighted vote of the classifiers.

Deep vs Shallow Networks
Deeper networks perform better than shallow networks, but only up to some limit: after a certain number of layers, the performance of deeper networks plateaus.
(Slide credit: Hung-yi Lee – Deep Learning Tutorial)

Convolutional Neural Networks (CNNs)
Convolutional neural networks (CNNs) were primarily designed for image data. CNNs use the convolution operator for extracting data features, which allows parameter sharing, makes them efficient to train, and gives them fewer parameters than NNs with fully-connected layers. CNNs are robust to spatial translations of objects in images. A CNN mainly contains convolution and pooling layers. A convolutional filter (e.g., a 3x3 filter) slides, i.e., convolves, across the input matrix (the image).
Picture from: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

How CNNs Work
The CNN builds up an image in a hierarchical fashion. Edges and shapes (local features) are recognized and pieced together to form more complex shapes (compound features), such as an eye or an ear, eventually assembling the target image. This hierarchical construction is achieved using convolution and pooling layers.
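A minimal NumPy sketch of sliding a 3x3 filter over an image (a "valid" convolution with stride 1 and no padding); the vertical-edge test image is made up, and the filter is the edge-detection filter from the example that follows. Real CNNs use optimized library implementations of this operation.

    import numpy as np

    def convolve2d(image, kernel):
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + kH, j:j + kW]      # local receptive field
                out[i, j] = np.sum(patch * kernel)     # elementwise multiply and sum
        return out

    edge_filter = np.array([[0, 1, 0],
                            [1, -4, 1],
                            [0, 1, 0]], dtype=float)

    image = np.zeros((6, 6))
    image[:, 3:] = 1.0                # a simple vertical edge
    print(convolve2d(image, edge_filter))   # responds along the edge, 0 elsewhere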
Convolutional Neural Networks (CNNs)
Convolution layer: when the convolutional filters are scanned over the image, they capture useful features, e.g., edge detection by convolution with the filter

    0  1  0
    1 -4  1
    0  1  0

[The slide shows the grid of pixel intensities of an input image of a handwritten digit and the resulting convolved image, in which the edges are highlighted.]
(Slide credit: Param Vir Singh – Deep Learning)

In CNNs, hidden units in a layer are only connected to a small region of the layer before it (called the local receptive field). The depth of each feature map corresponds to the number of convolutional filters used at that layer (e.g., Filter 1 and Filter 2 applied to the input image produce the Layer 1 feature map, which in turn feeds the Layer 2 feature map).
(Slide credit: Param Vir Singh – Deep Learning)

Fully Connected Layer
A 32x32x3 image is stretched to a 3072 x 1 input vector. With a 10 x 3072 weight matrix W, the layer produces a 10-dimensional activation; each output is 1 number, the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
(Slide credit: Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 5, April 17, 2018)
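A quick sketch of the fully connected layer just described, with a random 32x32x3 image and random weights for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((32, 32, 3))      # a fake 32x32x3 input image
    x = image.reshape(-1)                # stretch to a 3072-vector
    print(x.shape)                       # -> (3072,)

    W = rng.normal(size=(10, 3072))      # 10 x 3072 weights
    b = np.zeros(10)
    activation = W @ x + b               # each entry is one 3072-dim dot product
    print(activation.shape)              # -> (10,)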
Convolution Layer
A convolution layer preserves the spatial structure of a 32x32x3 image (32 height, 32 width, 3 depth). A 5x5x3 filter is convolved with the image, i.e., it slides over the image spatially, computing dot products. Filters always extend the full depth of the input volume. Each output is 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e., a 5*5*3 = 75-dimensional dot product, plus a bias).
Convolving (sliding) the filter over all spatial locations produces a 28x28x1 activation map. A second (e.g., green) filter produces a second activation map. For example, if we had 6 5x5 filters, we would get 6 separate activation maps; we stack these up to get a "new image" of size 28x28x6.

Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions. For example: a 32x32x3 input, then CONV + ReLU with 6 5x5x3 filters gives 28x28x6, then CONV + ReLU with 10 5x5x6 filters gives 24x24x10, and so on. [Zeiler and Fergus 2013 visualize the features learned at different layers.]
One filter => one activation map (example: 5x5 filters, 32 in total). We call the layer convolutional because it is related to the convolution of two signals: the elementwise multiplication and sum of a filter and the signal (image).

The brain/neuron view of the CONV layer
Each output number is the result of taking a dot product between the filter and one part of the image (a 5*5*3 = 75-dimensional dot product); it is just a neuron with local connectivity. An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input.
2. All of them share parameters.
A "5x5 filter" thus corresponds to a "5x5 receptive field for each neuron".
(Slide credit: Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 5, April 17, 2018)
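A sketch of a convolution layer with several filters, reusing the 32x32x3 input, 5x5x3 filter, and 6-filter sizes from the example above (random weights, stride 1, no padding); real layers would use the optimized implementations in a DL library.

    import numpy as np

    def conv_layer(image, filters, biases):
        H, W, _ = image.shape
        n_filters, kH, kW, _ = filters.shape
        out = np.zeros((H - kH + 1, W - kW + 1, n_filters))
        for f in range(n_filters):
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    chunk = image[i:i + kH, j:j + kW, :]       # a 5x5x3 chunk
                    out[i, j, f] = np.sum(chunk * filters[f]) + biases[f]
        return out

    rng = np.random.default_rng(0)
    image = rng.random((32, 32, 3))              # 32x32x3 input
    filters = rng.normal(size=(6, 5, 5, 3))      # six 5x5x3 filters
    biases = np.zeros(6)
    activation_maps = conv_layer(image, filters, biases)
    print(activation_maps.shape)                 # -> (28, 28, 6)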
For example, with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5); there are 5 different neurons all looking at the same region in the input volume. Two more layers to go: POOL and FC.
(Slide credit: Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 5, April 17, 2018)

Pooling Layer
Max pooling reports the maximum output within a rectangular neighborhood; average pooling reports the average output of a rectangular neighborhood. Pooling layers reduce the spatial size of the feature maps, which reduces the number of parameters and helps prevent overfitting. The pooling layer makes the representations smaller and more manageable, and it operates over each activation map independently. For example, max pooling with a 2x2 filter and a stride of 2 takes the maximum of each 2x2 block of the input matrix; applied to the single depth slice

    1 1 2 4
    5 6 7 8
    3 2 1 0
    1 2 3 4

it produces the output

    6 8
    3 4

Fully Connected Layer (FC layer)
The FC layer contains neurons that connect to the entire input volume, as in ordinary neural networks.

CNN Summary
- ConvNets stack CONV, POOL, and FC layers.
- The trend is towards smaller filters and deeper architectures.
- There is a trend towards getting rid of POOL/FC layers (using just CONV).
(Slide credit: Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 5, April 17, 2018)

End
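A minimal sketch of 2x2 max pooling with stride 2, reproducing the single-depth-slice example above; library pooling layers do this far more efficiently.

    import numpy as np

    def max_pool_2x2(x):
        H, W = x.shape
        out = np.zeros((H // 2, W // 2))
        for i in range(0, H, 2):
            for j in range(0, W, 2):
                out[i // 2, j // 2] = x[i:i + 2, j:j + 2].max()  # max of each 2x2 block
        return out

    slice_ = np.array([[1, 1, 2, 4],
                       [5, 6, 7, 8],
                       [3, 2, 1, 0],
                       [1, 2, 3, 4]])
    print(max_pool_2x2(slice_))   # -> [[6. 8.]
                                  #     [3. 4.]]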