Deep Neural Networks I
Kyung-Ah Sohn
Ajou University
Summary
These lecture notes cover deep neural networks: the multi-layer perceptron architecture, training with gradient descent and backpropagation, and regularization techniques.
Full Transcript
Deep neural networks I
Kyung-Ah Sohn, Ajou University

Contents
- Introduction
- Feed-forward neural network: Multi-Layer Perceptron (MLP)
- Training neural networks: Gradient Descent and Backpropagation
- Regularization

Supervised learning: predictive model
A predictive model maps an input x to an output y = f_θ(x).
- Regression model: the output is numeric (e.g., age = 5.5).
- Classification model: the output is categorical (e.g., class probabilities 0.7 Cat, 0.2 Dog, 0.1 Hamster).

2024 Nobel prize and AI
- [In Physics] Two scientists who laid the foundation for the core of artificial intelligence, "machine learning".
- [In Chemistry] Three scientists who uncovered the "secrets of proteins" using AI.

Secrets of proteins: predicting 3D structures
- Central dogma of molecular biology: DNA -> RNA -> Protein.
- The protein 3D structure prediction problem has been a 50-year grand challenge in life science.

Secrets of proteins x AI: AlphaFold
AlphaFold2 (2020, Google DeepMind)
- [In the past] Determining the full structure of a single protein required years of labor-intensive experiments and millions of dollars in specialized equipment.
- [AlphaFold] Can predict numerous protein structures with very high accuracy within days or even hours, reshaping the paradigm of biological research.
- Accelerates studies on drug development, disease mechanisms, the creation of enzymes, etc.
- (Introduced Evoformer, a deep learning architecture based on the Transformer.)
AlphaFold3, announced on 8 May 2024 (Nature)
- (Introduced Pairformer, similar to but simpler than Evoformer, combined with a diffusion model.)

Machine learning example
- Input -> ML algorithm -> Output: y = f(x).
- Example input: Gender = 0, GPA = 3.7, Age = 22, NumInternship = 1, NumProject = 3, EngScore = 85; output: Pass (1) or Fail (0).
- Learn a function f using data D = {(x_i, y_i)}, i = 1, ..., N.
- Example input (text prompt): "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots."
- How do we represent input (or output) data as a numeric vector? (representation learning)

Basics of machine learning
Predictive modeling (a.k.a. supervised learning)
- Given data D = {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))}, find a function y = f_θ(x) that best fits D.
- Modeling: which type of function to use for f (e.g., linear, polynomial, tree-based, ...).
- Training a model means learning (finding) the optimal θ for the given data D: given D, find f_θ such that Σ_i ||f_θ(x^(i)) - y^(i)|| is small.
- θ: the "(trainable/learnable) model parameters" (e.g., the coefficients in linear or logistic regression).
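To make the "best fit" criterion concrete, here is a minimal NumPy sketch for the simplest case, a linear f_θ(x) = w·x + b fitted by least squares; the data is synthetic and purely illustrative, not from the lecture.

```python
import numpy as np

# Synthetic data D = {(x_i, y_i)}: 100 samples with 3 features (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.1 * rng.normal(size=100)   # noisy targets

# Linear model f_theta(x) = w.x + b; choose theta = (w, b) to minimize
# the sum of squared errors sum_i ||f_theta(x_i) - y_i||^2
X1 = np.hstack([X, np.ones((100, 1))])                  # extra column of 1s for the bias
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)          # least-squares solution
w_hat, b_hat = theta[:-1], theta[-1]
print(w_hat, b_hat)                                     # close to true_w and true_b
```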
Deep learning
Represent f_θ by combining multiple (deep) layers.
- Each layer transforms the given representation (feature vector) into another representation that is better suited for the final prediction.
- Feature learning and prediction are done in the same framework.
Popular deep learning architectures
- Multilayer perceptron (MLP): the base architecture; uses matrix multiplication.
- Convolutional neural network: uses convolution instead of matrix multiplication.
- Recurrent neural network: similar to the MLP but with sequential processing.
- Transformer: uses the attention mechanism for sequence processing.
- Graph neural network: uses graph structural information.

DEEP NEURAL NETWORK ARCHITECTURE

Recap: linear regression
- s = w0 + w1*x1 + w2*x2, y = I(s), where I is the identity function.
- Trainable parameters: w0, w1, w2.
- What if the output y is binary (or categorical)?

Logistic regression: binary classification
Suppose y ∈ {1, 0}. We can model P_w(y = 1 | x) instead of modeling y directly, by using the sigmoid function σ as a non-linear activation function.
[Logistic regression]
- Linear combination: s = w0 + w1*x1 + w2*x2
- Non-linear activation: P(y = 1 | x) = σ(s) = 1 / (1 + e^(-s)) = σ(w0 + w1*x1 + w2*x2)
- σ(0) = 0.5; σ(s) > 0.5 if s > 0, so classify as 1; σ(s) < 0.5 if s < 0, so classify as 0: a linear classifier.

Softmax regression (classifier): multi-class classification
What if the label is not binary but multi-class, e.g., y ∈ {1, 2, 3}? If we have three target classes, introduce three output nodes and their associated weight parameters, and think of each linear combination as a score for the corresponding class:
- s1 = 3*x1 - 2*x2 + 1
- s2 = -2*x1 + x2 - 1
- s3 = x1 + x2 - 0.5
How can we obtain class probabilities from scores (so that we can compute the loss for training)? Use the softmax function:
softmax(s1, ..., sK) = (e^(s1) / Σ_{k=1..K} e^(sk), ..., e^(sK) / Σ_{k=1..K} e^(sk))
If (x1, x2) = (1, 1), the scores used in the example are s1 = 1 (class 1), s2 = -2 (class 2), and s3 = 1.5 (class 3), so
- P(y = 1 | x) = e^1 / (e^1 + e^-2 + e^1.5) = 0.37
- P(y = 2 | x) = e^-2 / (e^1 + e^-2 + e^1.5) = 0.02
- P(y = 3 | x) = e^1.5 / (e^1 + e^-2 + e^1.5) = 0.61
The softmax output is compared with the ground-truth label to compute the loss during training.

Softmax classifier: matrix form
The three scores can be written as a single matrix-vector product. Folding the bias into the weight matrix (appending a constant 1 to the input):
[s1; s2; s3] = [3 -2 1; -2 1 -1; 1 1 -0.5] [x1; x2; 1], i.e., s = W x
or equivalently, with an explicit bias vector:
[s1; s2; s3] = [3 -2; -2 1; 1 1] [x1; x2] + [1; -1; -0.5], i.e., s = W x + b
Training goal: find the weight parameters (and biases) that best fit the training data.
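A quick NumPy check of the softmax step, applied to the example scores (1, -2, 1.5) above; the rounded output matches the class probabilities quoted in the example.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())        # subtracting the max improves numerical stability
    return e / e.sum()

s = np.array([1.0, -2.0, 1.5])     # scores for classes 1, 2, 3 at (x1, x2) = (1, 1)
print(softmax(s).round(2))         # [0.37 0.02 0.61]
```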
Artificial neuron: building block
- z = w·x + b: a dot product between the weight vector and the input features, plus a bias.
- Model parameters: w, b.
- Examples: linear regression, logistic regression.

Layer: parallelized weighted sums and non-linearity (activation)
- A linear/dense/fully connected (fc) layer computes s_k = w_k·x + b_k for every output unit k, i.e., s = Wx + b (a matrix multiplication plus bias terms), followed by an activation h = g(s).
- Model parameters: W, b.
- Examples: multi-output linear regression, softmax regression.

Network: multi-layer perceptron (MLP)
A sequence of parallelized weighted sums and non-linearities.
- 1st layer: s^(1) = W^(1) x^(0) + b^(1), x^(1) = g(s^(1))
- 2nd layer: s^(2) = W^(2) x^(1) + b^(2), x^(2) = g(s^(2))
- ... so the output is the nested composition g(W^(2) g(W^(1) x^(0) + b^(1)) + b^(2)), and so on for deeper networks (input, 1st weights, 2nd weights, ..., output).

Activation functions
- Traditionally: sigmoid, tanh.
- More recently: ReLU and its variants.

Pop quiz
How many parameters must be estimated (in a single layer) if we have 200 input features and a hidden layer with 100 hidden nodes? That is, we want to transform a 200-dim input feature vector into a 100-dim output feature vector. What are the sizes of the weight matrix and the bias?
- Model size (number of parameters): 200 x 100 weights + 100 biases = 20,100 (in general, #inputs x #outputs + #outputs).

MLP example: Keras
- Two-layer neural network: y = W2 max(0, W1 x + b1) + b2
- Three-layer neural network: y = W3 max(0, W2 max(0, W1 x + b1) + b2) + b3
Example: the number of parameters in each layer for a 784-dim input (784x1) and a 10-dim output (10x1, class probabilities p0, ..., p9):
- L1: 235,500
- L2: 30,100
- L3: 1,010
- Total params: 266,610

Activation at the output layer
- For regression: the identity function (do nothing more).
- For classification: the softmax function. Example: scores (-1.5, 1.5, 0.2) -> exp -> (0.22, 4.48, 1.2) -> normalize -> (0.03, 0.76, 0.20).
- What should the dimension of the final output layer be?
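The parameter counts above pin down the layer sizes: assuming ReLU hidden layers and a softmax output, as in the formulas above, hidden widths of 300 and 100 reproduce them exactly (the widths are inferred from the counts, not stated explicitly). A minimal Keras sketch:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),                     # e.g., a flattened 28x28 image
    keras.layers.Dense(300, activation="relu"),    # 784*300 + 300 = 235,500 params
    keras.layers.Dense(100, activation="relu"),    # 300*100 + 100 =  30,100 params
    keras.layers.Dense(10, activation="softmax"),  # 100*10  + 10  =   1,010 params
])
model.summary()                                    # Total params: 266,610
```

Each Dense layer contributes (#inputs x #outputs) weights plus #outputs biases, which is exactly the pop-quiz formula.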
Training deep neural networks: GRADIENT DESCENT

Training neural network parameters
How do we learn the weight parameters?
- Step 1. Define the loss function L (also called the objective or cost function) to minimize: the mean squared error (MSE) between the predicted and the true outcome (for regression), or the cross-entropy loss (for classification).
- Step 2. Find the parameters that minimize the loss function: gradient descent for optimization, with the gradient of the loss computed by back-propagation.

Loss function for classification problems
How do we measure the difference between two probability distributions? A sum of squared errors? Instead, use cross-entropy (an information-theoretic measure).
Example: for an input x, the prediction is ŷ = (0.7 Cat, 0.2 Dog, 0.1 Hamster) while the ground truth is t = (1, 0, 0); the loss measures how "bad" the current estimate is.

Learning as optimization: gradient descent
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used for training a machine learning model, i.e., for finding the parameter values that minimize a cost function.
Main idea
- Climb down the hill until a local minimum is reached.
- In each iteration, take a step in the opposite direction of the gradient.
- The step size is determined by the learning rate and the gradient:
  w(new) ← w(old) - α ∇_w L(w)

Large-scale learning
Training data {(x^(i), y^(i))}; total loss L(w) = (1/N) Σ_i ℓ(x^(i), y^(i); w).
(Batch) gradient descent
- The weight update is calculated from the whole training set: w ← w - α (1/N) Σ_i ∇_w ℓ(x^(i), y^(i); w)
- Computationally very costly.
Stochastic gradient descent (SGD)
- Update the weights incrementally, for each training sample (x^(i), y^(i)): w ← w - α ∇_w ℓ(x^(i), y^(i); w)
- SGD typically converges faster because of the more frequent updates.

Mini-batch gradient descent
A compromise between batch gradient descent and SGD: apply batch gradient descent to smaller subsets of the training data, e.g., 64 training samples at a time.
- The optimal batch size depends on the problem, the data, and the hardware (memory).
Advantages
- Faster convergence than batch GD.
- Vectorized operations improve computational efficiency.
(One epoch = when the entire training data has been processed once.)
(See https://ml-explained.com/blog/gradient-descent-explained)

Learning rate (LR)
θ^(t+1) := θ^(t) - α ∇_θ J(θ^(t))
If the LR is too large, the updates can overshoot and fail to converge.

Adaptive learning rate
A constant learning rate often prevents convergence. Common learning rate schedules: time-based, step, and exponential decay. Typically the rate is decayed from large to small values.

GD for neural networks
Gradient computation: the back-propagation algorithm.
- The loss function is non-convex: there are many local minima and saddle points.
- It is hard to find a good learning rate -> adaptive learning rates.
- SGD can be too noisy and unstable -> momentum.
- Gradients often vanish or explode -> normalization.
(See https://www.telesens.co/2019/01/16/neural-network-loss-visualization/)

Parameter update rules ("optimizers")
SGD
- Simple and easy to implement, but sometimes inefficient (illustrated by the optimization path on f(w1, w2) = (1/20) w1^2 + w2^2).
Momentum
- Uses a "moving average" of the gradients to reflect the previous direction of movement: v ← αv - η ∇_w L(w), then w ← w + v.
AdaGrad
- Maintains a per-parameter learning rate.
Adam
- Momentum + adaptive learning rate.
Others: RMSprop, Adadelta, Adamax, ...

Computing gradients
How do we compute the gradient of the loss function of a deep neural network? E.g., for
y = σ(W2 σ(W1 x + b1) + b2), L = (y_true - y)^2,
what are ∂L(W1, W2, b1, b2)/∂W2 and ∂L(W1, W2, b1, b2)/∂W1?
Back-propagation is "just the chain rule" of calculus, but a particular implementation of the chain rule that avoids recomputing repeated subexpressions.
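A minimal NumPy sketch of those two gradients computed with the chain rule, assuming a sigmoid activation σ and the squared loss above; the shapes and values are arbitrary, only the recipe matters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # input vector
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y_true = 1.0

# Forward pass: cache the intermediate values; they are reused in the backward pass
z1 = W1 @ x + b1
h = sigmoid(z1)                             # hidden activations
z2 = W2 @ h + b2
y = sigmoid(z2)                             # network output
L = (y_true - y) ** 2                       # squared loss

# Backward pass: apply the chain rule layer by layer
dL_dy = -2.0 * (y_true - y)
dL_dz2 = dL_dy * y * (1.0 - y)              # sigma'(z2) = y (1 - y)
dL_dW2 = np.outer(dL_dz2, h)                # dL/dW2
dL_db2 = dL_dz2                             # dL/db2
dL_dh = W2.T @ dL_dz2                       # propagate back through W2
dL_dz1 = dL_dh * h * (1.0 - h)              # sigma'(z1) = h (1 - h)
dL_dW1 = np.outer(dL_dz1, x)                # dL/dW1
dL_db1 = dL_dz1                             # dL/db1
```

Note how the quantities cached in the forward pass (h and y) are reused in the backward pass rather than recomputed, which is exactly the point of back-propagation.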
Backpropagation
- Weights are updated using gradients.
- BP calculates the gradients via the chain rule; the gradient is propagated backward through the network.
- Most deep learning software libraries calculate gradients automatically.

Vanishing gradient problem
- It is difficult to train a very deep neural network with sigmoid (or tanh) activations: in backprop, the product of many small terms goes to zero.
- ReLU does not saturate in the positive region, preventing gradients from vanishing in deep networks.
- In the negative region, ReLU produces "dead" units, but in practice this does not seem to be a problem; ReLU is the most commonly used activation in DNNs.

Example: classification (Keras)
- Define the model architecture.
- Train the model.
- Make predictions on new samples (inference); the model outputs one probability per class, e.g., [0, 0, 0, 0.3, 0, 0.7, ..., 0].
(A sketch of these steps appears at the end of this section.)

Review questions
- What is the softmax function?
- What is cross-entropy, and where is it used?
- Update rules for gradient descent.
- Batch GD vs. SGD vs. mini-batch GD.
- What is the back-propagation algorithm for?
- Why do gradients vanish in deep neural nets?

How to combat overfitting: REGULARIZATION

Hyperparameters for NNs
MLP example
- Two-layer neural network: y = W2 max(0, W1 x + b1) + b2
- Three-layer neural network: y = W3 max(0, W2 max(0, W1 x + b1) + b2) + b3
Hyperparameters to tune (the user must choose proper values to define the model):
- The number of hidden nodes in each layer (the feature dimension of each intermediate layer).
- The number of layers.
- Which activation to use.

Hyperparameters for gradient descent
Update rule for GD: θ^(t+1) := θ^(t) - α ∇_θ J(θ^(t))
Hyperparameters
- Learning rate (α): its (initial) value, or the scheduling scheme.
- Mini-batch size: how many samples per iteration.
- Epochs: when to stop (one epoch = when the entire training data has been processed once).
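The code from the Keras classification example above is not reproduced in the transcript. The following is a minimal sketch of its three steps with the gradient-descent hyperparameters just listed (learning rate, mini-batch size, epochs) spelled out; the dataset and layer sizes are placeholders, not taken from the slides.

```python
import numpy as np
from tensorflow import keras

# Placeholder data: 1,000 samples with 20 features and 3 classes (illustration only)
X_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 3, size=1000)

# 1. Define the model architecture
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# 2. Train the model: cross-entropy loss, mini-batch SGD with an explicit learning rate
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=64,   # epochs and mini-batch size
          validation_split=0.2)                          # hold out part of the data

# 3. Make predictions on new samples (inference): one probability per class
X_new = np.random.rand(5, 20).astype("float32")
print(model.predict(X_new))
```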
Model selection
How do we determine the optimal hyperparameter values?
- We need a validation set that the training algorithm does not observe.
- Training/validation/test set split, or cross-validation.

Generalization
- Training error: an error measure on the training set.
- Generalization error: the expected value of the error on a new input, typically estimated by measuring performance on a test set collected separately from the training set (test error).
- In machine learning, we want the generalization error to be low as well as the training error.

Underfitting and overfitting
The following two factors correspond to the two central challenges in machine learning, underfitting and overfitting:
- Make the training error small.
- Make the gap between training and test error small.
We can control whether a model is more likely to overfit or underfit by altering the model complexity (capacity).

Overfitting
Overfitting occurs especially with
- complex models with many parameters (like deep neural networks), and
- small training data.
How to avoid it (in NNs): regularize, e.g., with
- Dropout: randomly drop (remove) neurons during training.
- Batch normalization.
- Norm penalties (e.g., an L2 penalty on W added to the loss).
- Early stopping, data augmentation, etc.
(A combined Keras sketch of these techniques appears at the end of these notes.)

Dropout
- Model averaging, i.e., combining the predictions of many different neural nets, is non-trivial: it requires huge datasets and training time.
- Dropout can be considered an ensemble method for neural nets: a bunch of networks with different structures; the network is forced to learn a redundant representation; it creates (smaller) thinned networks.
- Dropout trains an ensemble (DL, Figure 7.6).
("Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, 2014)

Batch normalization
- Normalization via mini-batch statistics to combat internal covariate shift.
- A differentiable transformation, so gradient descent can still be used through back-propagation.
("Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015)

Why batch norm?
- Training a deep neural network is complicated because the inputs to each layer are affected by the parameters of all preceding layers. This causes the distribution of each layer's inputs to change during training, making optimization more difficult.
- Covariate shift: when the input distribution to a learning system changes (e.g., p_training(x) ≠ p_test(x)).
- Internal covariate shift (in a deep neural network): the change in the distribution of network activations due to the change in network parameters during training.

Norm penalties
- L1 regularization: L_reg(w) = L(w) + α ||w||_1, which encourages sparsity.
- L2 regularization (≈ weight decay): L_reg(w) = L(w) + α ||w||_2^2, which encourages small weights.
- In standard SGD, weight decay is mathematically equivalent to L2 regularization; with optimizers like Momentum or Adam, they differ slightly. The L2 penalty directly modifies the loss function, while weight decay modifies the gradient update rule.

Early stopping
- The most commonly used form of regularization in deep learning: treat the number of training steps as another hyperparameter.
- Every time the error on the validation set improves, store a copy of the model parameters. When the training algorithm terminates, return these parameters rather than the latest ones.

Dataset augmentation
- The best way to make a machine learning model generalize better is to train it on more data. In practice, the amount of data we have is limited; one solution is to create fake data and add it to the training set.
- Particularly effective for object recognition.

Deep learning approach in general
- Works great for unstructured (non-tabular) data such as images, text, signals, or audio.
- Architectures are modularized (composed like Lego).
- Requires lots of data and computing resources.
- Many pre-trained models are available (so huge data may not always be needed).
- Regularization techniques improve generalization.

Specialized deep learning architectures
- Convolutional neural network (CNN)
- Recurrent neural network (RNN): LSTM, GRU
- Transformer

Summary
- Feedforward neural network architecture.
- Training deep neural networks: gradient descent to minimize the loss; backpropagation for computing gradients.
- Regularization techniques for deep neural networks.
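The combined sketch referenced in the overfitting section above: an illustrative Keras model that applies the regularizers discussed there (an L2 norm penalty, dropout, batch normalization, and early stopping on a validation set). The layer sizes and hyperparameter values are arbitrary choices, not from the slides.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(300, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),  # L2 penalty on W
    keras.layers.Dropout(0.5),                # randomly drop units during training
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),        # normalize activations with mini-batch statistics
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Early stopping: monitor the validation error and restore the best parameters seen,
# i.e., treat the number of training steps as another tuned hyperparameter.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100,
#           batch_size=64, callbacks=[early_stop])   # X_train, y_train: your own data
```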