Deep Neural Networks I
Ajou University
Kyung-Ah Sohn
Summary
These lecture notes cover deep neural networks, including their architecture, training methods, and different learning approaches.
Contents
– Introduction
– Feed-forward neural network: Multi-Layer Perceptron (MLP)
– Training neural networks: Gradient Descent and Backpropagation
– Regularization

Supervised learning: predictive model
Input $x$ -> model $f_\theta$ -> output $y = f_\theta(x)$
– Regression model: the output is numeric (e.g., age = 5.5)
– Classification model: the output is categorical (e.g., Cat 0.7, Dog 0.2, Hamster 0.1)

2024 Nobel prize and AI
– [In Physics] Two scientists who laid the foundation for the core of artificial intelligence, "machine learning"
– [In Chemistry] Three scientists who uncovered the "secrets of proteins" using AI

Secrets of Proteins: Predicting 3D Structures
– Central dogma of molecular biology: DNA -> RNA -> Protein
– Protein 3D structure prediction problem [50-year grand challenge in life science]: $x$ -> $f(x)$ -> $y$

Secrets of Protein x AI: AlphaFold
AlphaFold2 (2020, Google DeepMind)
– [In the past] Determining the full structure of a single protein required years of labor-intensive experiments and millions of dollars in specialized equipment
– [AlphaFold] Can predict numerous protein structures with very high accuracy within days or even hours
Reshaping the paradigm of biological research
– Accelerating studies on drug development, disease mechanisms, the creation of enzymes, etc.
– (Introduced Evoformer, a deep learning architecture based on the Transformer)
AlphaFold3 announced on 8 May 2024 (Nature)
– (Introduced Pairformer, similar to but simpler than Evoformer. Combined with a diffusion model)

Machine learning example
Input $x$ -> ML algorithm $y = f(x)$ -> Output $y$
– Tabular input: Gender = 0, GPA = 3.7, Age = 22, NumInternship = 1, NumProject = 3, EngScore = 85; output: Pass (1) or Fail (0)
– Learn a function $f$ using data $D = \{(x_i, y_i)\}_{i=1}^{n}$
– Text input: "Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots."
– How to represent input (or output) data as a numeric vector? (representation learning)

Basics of Machine Learning
Predictive modeling (a.k.a. supervised learning)
– Given data $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$, find a function $y = f_\theta(x)$ that best fits $D$
– Modeling: which type of function to use for $f$ (e.g., linear, polynomial, tree-based, ...)
– Training a model means learning (finding) the optimal $\theta$ for the given data $D$
Given $D$, find a function $f_\theta(x)$ such that $\sum_i \| f_\theta(x^{(i)}) - y^{(i)} \|$ is small
$\theta$: "(trainable/learnable) model parameters" (e.g., coefficients in linear or logistic regression)

Deep learning
Represent $f_\theta$ by combining multiple (deep) layers
– Each layer transforms the given representation (feature vector) into another representation that is better suited for the final prediction
– Feature learning and prediction in the same framework
Popular deep learning architectures
– Multilayer perceptron: the base architecture; uses matrix multiplication
– Convolutional neural network: uses convolution instead of matrix multiplication
– Recurrent neural network: similar to MLP but with sequential processing
– Transformer: uses an attention mechanism for sequence processing
– Graph neural network: uses graph structural information

DEEP NEURAL NETWORK ARCHITECTURE
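The idea that a deep model is a composition of layers, each mapping one representation to another, can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names (`dense`, `relu`, `compose`) and all weight values are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
from functools import reduce


def dense(weights, bias):
    """Return a layer computing W x + b, with W given as a list of rows (hypothetical helper)."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return layer


def relu(x):
    """Element-wise non-linearity max(0, x)."""
    return [max(0.0, v) for v in x]


def compose(*layers):
    """Chain layers left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda h, layer: layer(h), layers, x)


# Made-up weights: 3 input features -> 2 hidden features -> 1 output.
hidden = dense([[1.0, 0.0, -1.0],
                [0.5, 0.5, 0.5]], [0.0, -1.0])
output = dense([[2.0, -1.0]], [0.1])
f = compose(hidden, relu, output)

x = [1.0, 2.0, 3.0]
h = relu(hidden(x))   # intermediate representation learned by the first layer
y = f(x)              # final prediction from the composed model
print(h, y)
```

Swapping or stacking layers only changes the argument list of `compose`, which is the sense in which such architectures are modular.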
Recap: linear regression
– Input features: $x_1, x_2$ (plus a constant input 1 for the bias weight $w_0$)
– Weighted sum: $s = w_0 + w_1 x_1 + w_2 x_2$
– Output: $y = I(s) = s$, where $I$ is the identity function
– Trainable parameters: $w_0, w_1, w_2$
What if the output $y$ is binary (or categorical)?

Logistic regression: binary classification
Suppose $y \in \{1, 0\}$
We can model $p_w(y = 1 \mid x)$ instead of modeling $y$ directly, by using a sigmoid function $\sigma$ as a non-linear activation function
– Linear combination: $s = w_0 + w_1 x_1 + w_2 x_2$
– Non-linear activation: $\sigma(s) = \frac{1}{1 + e^{-s}}$
– Output: $p(y = 1 \mid x) = \sigma(w_0 + w_1 x_1 + w_2 x_2)$
$\sigma(0) = 0.5$
– $\sigma(s) > 0.5$ if $s > 0$ ⇒ classify as 1
– $\sigma(s) < 0.5$ if $s < 0$ ⇒ classify as 0
– The decision boundary $s = 0$ is linear in $x$: logistic regression is a linear classifier

Softmax regression (classifier): multi-class classification
What if the label is not binary, but multi-class, e.g., $y \in \{1, 2, 3\}$?
If we have three target classes, introduce three output nodes and associated weight parameters
Think of the linear combination as a score for the corresponding class
– $s_1 = 3x_1 - 2x_2 + 1$
– $s_2 = -2x_1 + x_2 - 1$
– $s_3 = x_1 + x_2 - 0.5$
How can we obtain class probabilities from scores (so that we can compute the loss for training)? Use the softmax function:
$(s_1, s_2, \ldots, s_K) \rightarrow \left( \frac{e^{s_1}}{\sum_{k=1}^{K} e^{s_k}}, \ldots, \frac{e^{s_K}}{\sum_{k=1}^{K} e^{s_k}} \right)$
If $(x_1, x_2) = (1, 1)$:
– score for class 1 ⇒ $s_1 = 3 - 2 + 1 = 2$
– score for class 2 ⇒ $s_2 = -2 + 1 - 1 = -2$
– score for class 3 ⇒ $s_3 = 1 + 1 - 0.5 = 1.5$
– $p(y = 1 \mid x) = \frac{e^{2}}{e^{2} + e^{-2} + e^{1.5}} \approx 0.62$
– $p(y = 2 \mid x) = \frac{e^{-2}}{e^{2} + e^{-2} + e^{1.5}} \approx 0.01$
– $p(y = 3 \mid x) = \frac{e^{1.5}}{e^{2} + e^{-2} + e^{1.5}} \approx 0.37$
The softmax output is compared with the ground truth label to compute the loss during training

Softmax classifier: multi-class classification
What if the label is not binary, but multi-class, e.g., $y \in \{1, 2, 3\}$?
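The score-to-probability step for the three-class example can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the function names `class_scores` and `softmax` are hypothetical, the weights and biases are read off the three score functions $3x_1 - 2x_2 + 1$, $-2x_1 + x_2 - 1$ and $x_1 + x_2 - 0.5$, and subtracting the maximum score before exponentiating is a common numerical-stability trick that is assumed here rather than taken from the slides.

```python
import math


def class_scores(W_T, b, x):
    """Compute s = W^T x + b, with W^T given as one row of weights per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk
            for row, bk in zip(W_T, b)]


def softmax(scores):
    """Map real-valued scores to probabilities that sum to 1."""
    m = max(scores)                            # shifting by the max avoids overflow
    exps = [math.exp(s - m) for s in scores]   # the shift cancels in the ratio below
    total = sum(exps)
    return [e / total for e in exps]


# One row of weights per class, and one bias per class.
W_T = [[3.0, -2.0], [-2.0, 1.0], [1.0, 1.0]]
b = [1.0, -1.0, -0.5]

scores = class_scores(W_T, b, [1.0, 1.0])
probs = softmax(scores)
predicted_class = probs.index(max(probs)) + 1   # classes are numbered from 1
print(scores, [round(p, 2) for p in probs], predicted_class)
```

Because softmax is monotone in the scores, the predicted class is always the one with the largest score; the normalization matters for computing the training loss, not for the argmax.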
If we have three target classes, introduce three output nodes and associated weight parameters
Think of the linear combination as a score for the corresponding class
– $s_1 = 3x_1 - 2x_2 + 1$
– $s_2 = -2x_1 + x_2 - 1$
– $s_3 = x_1 + x_2 - 0.5$
In matrix form, with the constant 1 appended to the input:
$\begin{pmatrix} s_1 \\ s_2 \\ s_3 \end{pmatrix} = \begin{pmatrix} 3 & -2 & 1 \\ -2 & 1 & -1 \\ 1 & 1 & -0.5 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ 1 \end{pmatrix}$, i.e., $\mathbf{s} = W^T \mathbf{x}$
or equivalently,
$\begin{pmatrix} s_1 \\ s_2 \\ s_3 \end{pmatrix} = \begin{pmatrix} 3 & -2 \\ -2 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} 1 \\ -1 \\ -0.5 \end{pmatrix}$, i.e., $\mathbf{s} = W^T \mathbf{x} + b$
If $(x_1, x_2) = (1, 1)$: score for class 1 ⇒ $s_1 = 2$, for class 2 ⇒ $s_2 = -2$, for class 3 ⇒ $s_3 = 1.5$
Training goal: find the weight parameters (and biases) that best fit the training data

Artificial neuron: building block
– $s = \mathbf{w}^T \mathbf{x} + b$: dot product between weights and input features, plus a bias
– Model parameters: $\mathbf{w}$, $b$
– Example: linear regression, logistic regression

Layer: parallelized weighted sum and non-linearity (activation)
Also called a linear / dense / fully connected (fc) layer
– $s_j = \mathbf{w}_j^T \mathbf{x} + b_j \rightarrow \mathbf{s} = \mathbf{W}^T \mathbf{x} + \mathbf{b}$ (matrix multiplication + bias terms)
– $\mathbf{h} = \sigma(\mathbf{s})$
– Model parameters: $\mathbf{W}$, $\mathbf{b}$
– Example: multi-output linear regression, softmax regression

Network: sequence of parallelized weighted sums and non-linearities
Multi-layer perceptron (MLP)
– 1st layer: $\mathbf{s}^{(1)} = \mathbf{W}^{(1)T} \mathbf{x}^{(0)} + \mathbf{b}^{(1)}$, $\mathbf{x}^{(1)} = \sigma(\mathbf{s}^{(1)})$
– 2nd layer: $\mathbf{s}^{(2)} = \mathbf{W}^{(2)T} \mathbf{x}^{(1)} + \mathbf{b}^{(2)}$, $\mathbf{x}^{(2)} = \sigma(\mathbf{s}^{(2)})$
– Output = $\sigma(\mathbf{W}^{(2)T} \sigma(\mathbf{W}^{(1)T} \mathbf{x}^{(0)} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)})$: the 2nd weights are applied to the activated output of the 1st weights applied to the input

Activation functions
[Figure: activation functions used traditionally vs. more recently]

Pop quiz
How many parameters must be estimated (in a single layer) if we have 200 input features and a hidden layer with 100 hidden nodes?
– That is, we want to transform a 200-dim input feature vector into a 100-dim output feature vector. What will be the sizes of the weight matrix and the bias?
– Model size (= number of parameters): ?
[Figure: output (100-dim) = $\sigma$(weight matrix × input (200-dim) + bias)]

MLP Example: Keras
– Two-layer neural network: $y = W_2 \max(0, W_1 x + b_1) + b_2$
– Three-layer neural network: $y = W_3 \max(0, W_2 \max(0, W_1 x + b_1) + b_2) + b_3$
Ex) The number of parameters in each layer
– L1: 235,500
– L2: 30,100
– L3: 1,010
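The parameter counts of fully connected layers can be checked with a short calculation: a layer mapping `n_in` features to `n_out` features has an `n_in × n_out` weight matrix plus `n_out` biases. The sketch below makes one assumption that is not stated in the text: the layer widths 784 -> 300 -> 100 -> 10, where the hidden widths 300 and 100 are inferred only because they reproduce the listed per-layer counts. The function names are hypothetical.

```python
def dense_param_count(n_in: int, n_out: int) -> int:
    """Parameters of one fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def mlp_param_counts(widths):
    """Per-layer parameter counts for an MLP with the given layer widths."""
    return [dense_param_count(a, b) for a, b in zip(widths, widths[1:])]


# Pop-quiz setting: 200 input features, 100 hidden nodes.
quiz = dense_param_count(200, 100)

# Assumed widths (hidden sizes inferred from the listed counts, not given in the text).
counts = mlp_param_counts([784, 300, 100, 10])
total = sum(counts)
print(quiz, counts, total)
```

Under these assumptions the first layer dominates the model size, because its input dimension (784) is by far the largest.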
Total params: 266,610
[Figure: 784×1 input vector; 10×1 output vector with entries p0, ..., p9]

Activation at output layer
– For regression: identity function (do nothing more)
– For classification: softmax function, e.g., $(-1.5, 1.5, 0.2)$ -> exp -> $(0.22, 4.48, 1.2)$ -> normalize -> $(0.04, 0.76, 0.20)$
What should be the dimension of the final output layer?

GRADIENT DESCENT
Training deep neural networks

Training neural network parameters
How to learn the weight parameters?
Step 1. Define the loss function $L$ (or objective function, cost function) to minimize
– The mean of squared errors (MSE loss) between the calculated outcome and the true outcome (for regression)
– Cross-entropy loss for classification
Step 2. Find the parameters that minimize the loss function
– Gradient descent for optimization
– How to compute the gradient of the loss function? Back-propagation

Loss function for classification problems
How to measure the difference between two probability distributions?
– Sum of squared errors??
– Use cross-entropy (an information-theoretic measure): $L = -\sum_k t_k \log y_k$
How "bad" is the current estimation?
[Figure: for an input $x$, predicted $y = f_\theta(x)$ vs. ground truth $t$: Cat 0.7 vs. 1, Dog 0.2 vs. 0, Hamster 0.1 vs. 0]
For this example the cross-entropy is $-\log 0.7 \approx 0.36$

Learning as optimization: Gradient descent
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function
Used for training a machine learning model
– for finding the parameter values that minimize a cost function
Main idea
– Climbing down a hill until a local minimum is reached
– In each iteration, take a step in the opposite direction of the gradient
– Step size is determined by the learning rate and the gradient
$\mathbf{w}^{(\text{new})} \leftarrow \mathbf{w}^{(\text{old})} - \alpha \nabla_w L(\mathbf{w})$

Large-scale learning
Training data: pairs $(x^{(i)}, y^{(i)})$, e.g., $x^{(i)} = (0.1, -0.3, \ldots, 1.1)$, $y^{(i)} = 1$
$L(w) = \frac{1}{n} \sum_i l(x^{(i)}, y^{(i)}; w)$
(Batch) gradient descent
– Weight update is calculated from the whole training set
– Computationally very costly
– $w \leftarrow w - \alpha \frac{1}{n} \sum_i \nabla_w l(x^{(i)}, y^{(i)}; w)$
Stochastic gradient descent (SGD)
– Update the weight incrementally for each training sample $(x^{(i)}, y^{(i)})$
– SGD typically converges faster because of the more frequent updates
– $w \leftarrow w - \alpha \nabla_w l(x^{(i)}, y^{(i)}; w)$

Mini-batch gradient descent
Compromise between batch gradient descent and SGD
Apply batch gradient descent to smaller subsets of the training data, e.g., 64 training samples at a time
– The optimal size depends on the problem, the data, and the hardware (memory)
Advantages
– Faster convergence than (batch) GD
– Vectorized operations improve computational efficiency
One epoch = one pass in which the entire training data is processed once
[Figure source: https://ml-explained.com/blog/gradient-descent-explained]

Learning Rate (LR)
$\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} - \alpha \nabla_w J(\mathbf{w}^{(t)})$
[Figure: behavior of the updates when the LR is too large]

Adaptive learning rate
Constant learning rate often prevents convergence
Common learning rate schedules
– time-based / step / exponential decay
Typically, from large to small values

GD for neural networks
– Gradient computation: back-propagation algorithm
– The loss function is non-convex: a lot of local minima or saddle points
– Hard to find a good learning rate -> adaptive learning rate
– SGD can be too noisy and unstable -> momentum
– Gradients often vanish/explode -> normalization
[Figure source: https://www.telesens.co/2019/01/16/neural-network-loss-visualization/]

Parameter update rules ("optimizers")
SGD
– Simple, easy to implement
– Inefficient sometimes
[Figure: optimization path of SGD on $f(w_1, w_2) = \frac{1}{20} w_1^2 + w_2^2$]
Momentum
– Uses a "moving average" of the gradients to reflect the previous moving direction
– $v \leftarrow \alpha v - \eta \nabla_w L(w)$, then $w \leftarrow w + v$ (here $\alpha$ is the momentum coefficient and $\eta$ the learning rate)
AdaGrad
– Maintains a per-parameter learning rate
Adam
– Momentum + adaptive learning rate
Others: RMSprop, Adadelta, Adamax, ...

Computing gradients
How to compute the gradient of the loss function from a deep neural network?
e.g., $y = \sigma(W_2 \sigma(W_1 x + b_1) + b_2)$, $l = (y_{true} - y)^2$
– $\frac{\partial l(W_1, W_2, b_1, b_2)}{\partial W_2} = ?$
– $\frac{\partial l(W_1, W_2, b_1, b_2)}{\partial W_1} = ?$
Back-propagation is "just the chain rule" of calculus
– but a particular implementation of the chain rule
– avoids recomputing repeated subexpressions

Backpropagation
– Update weights using gradients
– BP calculates the gradients via the chain rule
– The gradient is propagated backward through the network
– Most deep learning software libraries calculate gradients automatically

Vanishing gradient problem
Difficult to train a very deep neural network with sigmoid (or tanh)
– In backprop, the product of many small terms goes to zero (the sigmoid derivative is at most 0.25)
ReLU does not saturate in the positive region, which helps prevent gradients from vanishing in deep networks
– In the negative region, ReLU can result in "dead" units, but in practice this does not seem to be a problem
– Most commonly used in DNNs

Example: classification (Keras)
[Code figure: (1) define the model architecture, (2) train the model, (3) make predictions on new samples (inference); example output: [0. 0. 0. 0.3 0. 0.7 … 0.]]

Review questions
– What is a softmax function?
– What is cross-entropy, and where is it used?
– Update rules for gradient descent
– Batch GD vs. SGD vs. mini-batch GD
– What is the back-propagation algorithm for?
– Why do gradients vanish in deep neural nets?

REGULARIZATION
How to combat overfitting

Hyperparameters for NNs
MLP Example
– Two-layer neural network: $y = W_2 \max(0, W_1 x + b_1) + b_2$
– Three-layer neural network: $y = W_3 \max(0, W_2 \max(0, W_1 x + b_1) + b_2) + b_3$
Hyperparameters to tune (users must choose a proper value to define the model):
– The number of hidden nodes at each layer (feature dimension at each intermediate layer)
– The number of layers
– Which activation to use

Hyperparameters for gradient descent
Update rule for GD: $\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} - \alpha \nabla_w J(\mathbf{w}^{(t)})$
Hyperparameters
– Learning rate ($\alpha$): (initial) value, or scheduling scheme
– Mini-batch size: how many samples per iteration
– Epochs: when to stop (one epoch = one pass in which the entire training data is processed once)

Model selection
How to determine the optimal hyperparameter value?
– We need a validation set that the training algorithm does not observe
– Training/validation/test set split
– Cross-validation

Generalization
Training error: some error measure on the training set
Generalization error: the expected value of the error on a new input
– typically estimated by measuring performance on a test set that was collected separately from the training set (test error)
In machine learning, we want the generalization error to be low, as well as the training error

Underfitting and overfitting
The following two factors correspond to the two central challenges in machine learning, underfitting and overfitting:
– Make the training error small (failing to do so is underfitting)
– Make the gap between training and test error small (failing to do so is overfitting)
We can control whether a model is more likely to overfit or underfit by altering the model complexity (capacity)

Overfitting
Especially with
– Complex models with many parameters (like deep neural networks)
– Small training data
How to avoid (in NNs): regularize, e.g.,
– Dropout: randomly drop (remove) neurons during training
– Batch normalization
– Norm penalties (e.g., an L2 penalty on W added to the loss)
– Early stopping, data augmentation, etc.

Dropout
Model averaging for combining the predictions of many different neural nets is non-trivial
– requires huge datasets and training time
Dropout can be considered an ensemble method for neural nets
– A bunch of networks with different structures
– The network is forced to learn a redundant representation
– Creates (smaller) thinned networks
Reference: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The Journal of Machine Learning Research, 2014
[Figure: Dropout trains an ensemble. Source: DL, Figure 7.6]

Batch normalization
– Normalization via mini-batch statistics to combat the internal covariate shift
– Differentiable transformation: can be trained with gradient descent through back-propagation
Reference: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015

Why Batch Norm?
Training a deep neural network is complicated because the inputs to each layer are affected by the parameters of all preceding layers.
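The normalization via mini-batch statistics described under "Batch normalization" can be sketched in plain Python. This is a minimal training-time sketch under stated assumptions: the function name `batch_norm`, the scale and shift parameters `gamma` and `beta`, the constant `eps`, and the toy mini-batch are all illustrative choices, and the running statistics that implementations typically keep for inference are omitted.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale by gamma and shift by beta."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_features)]
    return [[gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
             for j in range(n_features)]
            for row in batch]


# Made-up mini-batch: 4 samples, 2 features on very different scales.
batch = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])

col_means = [sum(row[j] for row in normalized) / 4 for j in range(2)]
col_vars = [sum((row[j] - col_means[j]) ** 2 for row in normalized) / 4
            for j in range(2)]
print(col_means, col_vars)
```

With `gamma = 1` and `beta = 0`, both features end up with roughly zero mean and unit variance regardless of their original scale; every step is differentiable, so `gamma` and `beta` can be learned by gradient descent like any other parameter.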
As the parameters of the preceding layers are updated, the distribution of each layer's inputs changes during training, making optimization more difficult
Covariate shift
– when the input distribution to a learning system changes (e.g., $p_{train}(x) \neq p_{test}(x)$)
Internal covariate shift (in a deep neural network)
– The change in the distribution of network activations due to the change in network parameters during training

Norm penalties
L1-regularization: $\tilde{L}(w) = L(w) + \alpha \|w\|_1$
– Encourages sparsity
L2-regularization (≈ weight decay): $\tilde{L}(w) = L(w) + \alpha \|w\|_2^2$
– Encourages small weights
– In standard SGD, weight decay is mathematically equivalent to L2 regularization (up to a rescaling of the coefficient)
– With optimizers like Momentum or Adam, they differ slightly
The L2 penalty directly modifies the loss function, while weight decay modifies the gradient update rule

Early stopping
The most commonly used form of regularization in deep learning
Treat the number of training steps as another hyperparameter
– Every time the error on the validation set improves, store a copy of the model parameters
– When the training algorithm terminates, return these parameters, rather than the latest parameters

Dataset augmentation
The best way to make a machine learning model generalize better is to train it on more data
In practice, the amount of data we have is limited.
One solution is to create fake data and add it to the training set
– particularly effective for object recognition

Deep learning approach in general
– Works great for unstructured (non-tabular) data (such as images, text, signal, or audio)
– Architectures are modularized (compose like Lego)
– Requires lots of data and computing resources
– Many pre-trained models are available (may not always need huge data)
– Regularization techniques improve generalization

Specialized deep learning architectures
– Convolutional neural network (CNN)
– Recurrent neural network (RNN): LSTM, GRU
– Transformer

Summary
– Feedforward neural network architecture
– Training deep neural networks: gradient descent to minimize loss; backpropagation for computing gradients
– Regularization techniques for deep neural networks
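To tie the training part of the summary together, the sketch below applies backpropagation and gradient descent to the network from the "Computing gradients" slide, $y = \sigma(W_2 \sigma(W_1 x + b_1) + b_2)$ with $l = (y_{true} - y)^2$, reduced to scalar weights so every chain-rule factor is visible. It is an illustrative sketch, not the course's implementation: the initial parameters, the single training pair, the learning rate, and all function names are made up, and the finite-difference check is included only to confirm the analytic gradient.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(params, x):
    """Forward pass of y = sigmoid(w2 * sigmoid(w1 * x + b1) + b2), keeping the hidden value."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = sigmoid(w2 * h + b2)
    return h, y


def loss(params, x, y_true):
    _, y = forward(params, x)
    return (y_true - y) ** 2


def backward(params, x, y_true):
    """Gradient w.r.t. (w1, b1, w2, b2), computed from the output back to the input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dl_dy = -2.0 * (y_true - y)
    dl_ds2 = dl_dy * y * (1.0 - y)      # sigmoid'(s) = sigmoid(s) * (1 - sigmoid(s))
    dl_dw2 = dl_ds2 * h
    dl_db2 = dl_ds2
    dl_dh = dl_ds2 * w2                 # dl_ds2 is reused, not recomputed
    dl_ds1 = dl_dh * h * (1.0 - h)
    dl_dw1 = dl_ds1 * x
    dl_db1 = dl_ds1
    return [dl_dw1, dl_db1, dl_dw2, dl_db2]


def numerical_gradient(params, x, y_true, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y_true) - loss(down, x, y_true)) / (2 * eps))
    return grads


# Made-up initial parameters (w1, b1, w2, b2) and a single training pair.
params = [0.5, -0.3, 0.8, 0.1]
x, y_true = 1.5, 1.0
alpha = 0.5                             # learning rate (arbitrary choice)

analytic = backward(params, x, y_true)
numeric = numerical_gradient(params, x, y_true)

losses = [loss(params, x, y_true)]
for _ in range(200):
    grad = backward(params, x, y_true)
    params = [p - alpha * g for p, g in zip(params, grad)]   # w <- w - alpha * gradient
    losses.append(loss(params, x, y_true))
print(losses[0], losses[-1])
```

The reuse of `dl_ds2` when computing the first-layer gradients is the "avoids recomputing repeated subexpressions" point: each backward step needs only the gradient arriving from the layer above and the values stored in the forward pass.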