Deep Learning and Variants_Lecture 6_20240204.pdf
Full Transcript
Presents: Deep Learning & its variants (GGU DBA) – Regularization & Autoencoders
Dr. Anand Jayaraman, Professor, upGrad; Chief Data Scientist, Agastya Data Solutions

Nurturing a deep neural network
- Vanishing gradient solutions – better activation functions
- Improving upon gradient descent
- Weight initialization
- Regularization

REGULARIZATIONS
- Regularization for ANNs
- L1 & L2 regularization
- Noise-based regularization – Dropout
- Batch normalization

Overfitting problem
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \cdots$
How many $\beta_i$ do we need for a reasonable fit, while at the same time minimizing the risk of overfitting? In general, given a regression problem with several features,
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \cdots,$
is there a way to determine how many $\beta_i$ we need for a reasonable fit while minimizing the risk of overfitting?

Regularization
Standard least-squares regression involves minimizing the sum of squared errors (SSE):
$\mathrm{SSE} = \sum_{j=1}^{N} \Big( y_j - \sum_{i=0}^{d} \beta_i x_{ji} \Big)^2 = (\mathbf{y} - X\boldsymbol{\beta})^T (\mathbf{y} - X\boldsymbol{\beta})$
where $\mathbf{y}$ is the vector of all training responses $y_j$ and $X$ is the matrix of all training samples $x_j$.
We can reduce overfitting by penalizing large coefficients:
$\min_{\boldsymbol{\beta}} \; \sum_{j=1}^{N} \Big( y_j - \sum_{i=0}^{d} \beta_i x_{ji} \Big)^2 + \lambda \sum_{i=1}^{d} \beta_i^2$

Regularization functions
- Ridge: prefer the simplest fit, i.e. the one whose sum of squared coefficients is the least.
- Lasso: prefer the simplest fit, i.e. the one whose sum of absolute coefficients is the least.
- A high lambda gives the simplest models; as lambda goes to infinity, we are left with no model at all.

Architecture
Demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Regularization is the key: a 20-node neural net, with regularization controlling the overfit.
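To make the ridge and lasso penalties concrete in the neural-network setting of the 2-D classification demo above, here is a minimal Keras sketch (not from the slides): it adds an L2 penalty on the weights of a small dense network, and swapping in an L1 penalty gives the lasso-style alternative. The layer sizes, the penalty strength of 0.01 and the toy data are illustrative assumptions.

    # Minimal sketch: ridge-style (L2) weight penalty on a small dense network.
    # Layer sizes, penalty strength and the toy data are illustrative, not from the lecture.
    import numpy as np
    from tensorflow.keras import Sequential, regularizers
    from tensorflow.keras.layers import Dense

    model = Sequential([
        Dense(20, activation="relu", input_shape=(2,),
              kernel_regularizer=regularizers.l2(0.01)),   # penalize the sum of squared weights
        Dense(20, activation="relu",
              kernel_regularizer=regularizers.l2(0.01)),   # use regularizers.l1(0.01) for lasso-style
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Toy 2-D binary classification data, standing in for the classify2d demo.
    X = np.random.randn(500, 2)
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype("float32")
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)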
Controlling overfit: DROPOUT

Training with dropout (p = 0.25): during training, each unit is dropped (set to zero) with probability p.
Dropout at test time: no units are dropped; the activations are rescaled so that the expected output matches what the network saw during training.

Dropout implementation
The slides compare a plain Keras model ("Without Dropout") with the same model after inserting Dropout layers ("With Dropout").
From: https://machinelearningknowledge.ai/keras-dropout-layer-explained-for-beginners/

Batch, layer and weight normalizations: NORMALIZATION METHODS

Normalization/Standardization
Why do we standardize the data? If you standardize, optimization gets easier; have you ever noticed this? If that statement is true, why?
https://arxiv.org/pdf/1502.03167v3.pdf

Pre-processing
Normalizing the data to zero mean gives the classifier more room to play with. But inside the network, all the good work of normalization evaporates:
- The summation operation destroys zero-centricity: (-2, 0, 2) can get linearly transformed to (4, 6, 8).
- The activation can further distort the distribution.

Batch normalization
Normalize the output of each hidden layer, either after the summation or after the activation; people have reported both choices winning on different data sets. Use the mini-batch to normalize: for every mini-batch, compute the mean and variance and apply the normalizing transformation (after the summation or after the activation).

During test time
At the end of training, several mini-batches will have gone through each node, so the mean and variance will have been computed many times. We compute one empirical mean and variance from these and use them at test time; we do not recompute them for every test sample.

Batch normalization details: code sample

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, BatchNormalization

    model = Sequential([
        Dense(64, input_shape=(4,), activation="relu"),
        BatchNormalization(),   # normalize the output of each hidden layer
        Dense(128, activation="relu"),
        BatchNormalization(),
        Dense(128, activation="relu"),
        BatchNormalization(),
        Dense(64, activation="relu"),
        BatchNormalization(),
        Dense(3, activation="softmax"),
    ])
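As a hedged illustration of the test-time behaviour described above (not from the slides), the sketch below shows how both Dropout and BatchNormalization switch behaviour between training and inference in Keras: with training=True, dropout masks units and batch normalization uses the current batch's statistics, while with training=False (what model.predict uses), dropout is disabled and batch normalization uses the moving mean and variance accumulated during training. The tiny model, layer sizes and random data are illustrative assumptions.

    # Sketch: training-mode vs inference-mode behaviour of Dropout and BatchNormalization.
    # The model, layer sizes and random inputs are illustrative assumptions.
    import numpy as np
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, BatchNormalization, Dropout

    model = Sequential([
        Dense(16, input_shape=(4,), activation="relu"),
        BatchNormalization(),
        Dropout(0.25),
        Dense(3, activation="softmax"),
    ])

    x = np.random.randn(8, 4).astype("float32")

    # Training mode: dropout randomly zeroes units and batch norm uses this batch's
    # mean/variance, so repeated calls on the same input generally differ.
    y_train_1 = model(x, training=True)
    y_train_2 = model(x, training=True)

    # Inference mode: dropout is a no-op and batch norm uses its moving averages,
    # so the output is deterministic.
    y_test_1 = model(x, training=False)
    y_test_2 = model(x, training=False)

    print("training mode identical:", np.allclose(y_train_1, y_train_2))   # usually False
    print("inference mode identical:", np.allclose(y_test_1, y_test_2))    # True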
Batch normalization: results
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps and also beats the original model by a significant margin. It acts as a regularizer too, in some cases eliminating the need for Dropout.
https://arxiv.org/pdf/1502.03167v3.pdf

Performance improvement
- Better activation (ReLU)
- Better initialization (Xavier/He or autoencoder-based)
- Varying learning rate (momentum corrections)
- Learning rate decay
- Dropout, batch normalization & regularization

Hyperparameters
- On a log scale, select 100 learning rates; with 10-50 mini-batches, pick the best among them.
- Fix the activation (ReLU), initialization (modified Xavier), dropout (10% on the input layer and 50% on hidden layers), logistic loss, ridge regularization and a mini-batch size of 128, with the Adam update using 0.9 momentum and 0.999 for RMSprop.
- Experiment with and then fix the weight decay.
- You can then experiment with momentum: pick 100 values on a log scale between 0.001 and 0.1, take momentum as 1 minus the value, and pick the best momentum based on the loss after one epoch.
- Finally, experiment to find the best dropout rate.

AUTOENCODERS

Unsupervised models
Autoencoders and Restricted Boltzmann Machines (RBMs) are popular unsupervised learning models. Theoretically, in many cases both models can be shown to be the same, with autoencoders operating in reconstruction space and RBMs in energy space.
Kamyshanska, H., Memisevic, R. The potential energy of an autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). (Image: en.wikipedia.org)

Auto-encoder structure
Input layer, hidden layer, output layer. In principle it is an MLP; hence, what will its properties be?

Auto-encoder: input and output
For an MLP, what is the input x and what is the target/output y? X: the input features; Y: the classification/regression label. For an autoencoder, X: the input data; Y: the same input data, since the network is trained to reconstruct its input.

Auto-encoder: dimensionality reduction
A vanilla autoencoder moves the input data into a lower-dimensional hidden space. What happens when we squeeze data through a bottleneck and yet try to reconstruct it on the other side?

MNIST dataset: 60k examples of hand-written digits, 28x28 = 784 pixels.

Autoencoders: dimensionality reduction with MNIST
https://blog.keras.io/building-autoencoders-in-keras.html
We have taken input data with a dimensionality of 784 (28 px * 28 px) and reduced it to 32 dimensions! This captures the essence of the images without too much loss.
http://dkopczyk.quantee.co.uk/dae-part1/

Dimensionality reduction
PCA helps with dimensionality reduction (through linear combinations). Autoencoders can be thought of as nonlinear PCA.

Auto-encoder as a replacement for PCA
Original problem: training data with 784 features (60000x784). Construct an autoencoder with an input dimension of 784 and an encoding layer of 32 neurons. Once the autoencoder is trained, use the output of the encoding layer as the new compressed features.

Autoencoder feature creation flow
Input: 60K rows, 784 columns -> define the model architecture -> train the model -> extract the output of the encoder layer -> use that output as the new feature set -> Output: 60K rows, 32 columns.

Over-completeness in hidden space
What happens when the number of hidden units is greater than the number of input units? Will we still learn useful features?

Sparse coding
- Sparse coding is a way to get a more explicit representation of the input space.
- A sparse representation implies that each input, or region of the input space, is represented by only a small subset of the hidden units.
- The smaller the subset, the sparser the representation.
- Sparse codes can be learned using autoencoders and regularization, as in the sketch below.
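To make the 784-to-32 reduction and the sparse-coding idea concrete, here is a minimal Keras sketch in the spirit of the blog post linked above; it is not the slides' own code. The single 32-unit encoding layer follows the slides, while the L1 activity-penalty strength (1e-5), the loss choice and the training settings are illustrative assumptions.

    # Sketch: autoencoder compressing MNIST digits from 784 to 32 dimensions,
    # with an optional L1 activity penalty on the code layer to encourage sparse codes.
    import numpy as np
    from tensorflow.keras import Model, Input, regularizers
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.datasets import mnist

    (x_train, _), (x_test, _) = mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

    inputs = Input(shape=(784,))
    # Encoder: 784 -> 32. Drop the activity_regularizer for a plain (non-sparse) autoencoder.
    code = Dense(32, activation="relu",
                 activity_regularizer=regularizers.l1(1e-5))(inputs)
    # Decoder: 32 -> 784, sigmoid so outputs lie in [0, 1] like the normalized pixels.
    outputs = Dense(784, activation="sigmoid")(code)

    autoencoder = Model(inputs, outputs)
    encoder = Model(inputs, code)   # the encoder half, used later for feature extraction

    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    # Target = input: the network learns to reconstruct what it is fed.
    autoencoder.fit(x_train, x_train, epochs=5, batch_size=256,
                    validation_data=(x_test, x_test), verbose=0)

    # New compressed feature set: 60K rows, 32 columns.
    features = encoder.predict(x_train, verbose=0)
    print(features.shape)   # (60000, 32)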
Autoencoder application: feature engineering
The original data has 35 features, a single binary classification target and a deep, non-linear structure. Is there a way to do feature engineering that represents the data in a simpler space? Train an autoencoder with a sparse encoding of 100 features, then use the coded features in the XGBoost algorithm.

Denoising autoencoders
- A simple regularization scheme is denoising, which mainly involves reconstructing the original input from a corrupted version of it.
- In this case, corruption involves randomly setting a portion of the input dimensions to zero, also termed zero-mask noise.
- Flow: original data -> corrupted data -> hidden layer -> target: original data (a minimal sketch follows at the end of this transcript).

Denoising autoencoder: MNIST example
https://blog.keras.io/building-autoencoders-in-keras.html

Denoising autoencoder
From: https://www.v7labs.com/blog/autoencoders-guide

Autoencoder for colorization
From: Autoencoder: Grayscale to color image | Kaggle

Autoencoder applications
1. Image processing: de-noising, auto-filling
2. Anomaly detection
3. Feature generation
4. Learning generative models
5. Text translation...

SegNet: object segmentation (image: www.cc.gatech.edu)
https://youtu.be/e9bHTlYFwhg
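To make the zero-mask corruption described under "Denoising autoencoders" concrete, here is a minimal sketch (not from the slides): a random fraction of the input pixels is set to zero and the network is trained to reconstruct the clean images. The corruption rate of 0.3, the layer sizes and the training settings are illustrative assumptions.

    # Sketch: denoising autoencoder with zero-mask noise on MNIST.
    # Corruption rate, layer sizes and training settings are illustrative.
    import numpy as np
    from tensorflow.keras import Model, Input
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.datasets import mnist

    (x_train, _), _ = mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

    # Zero-mask noise: randomly set roughly 30% of the input dimensions to zero.
    rng = np.random.default_rng(0)
    mask = rng.random(x_train.shape) > 0.3
    x_train_noisy = x_train * mask

    inputs = Input(shape=(784,))
    hidden = Dense(128, activation="relu")(inputs)
    outputs = Dense(784, activation="sigmoid")(hidden)
    denoiser = Model(inputs, outputs)
    denoiser.compile(optimizer="adam", loss="binary_crossentropy")

    # Input: corrupted data; target: the original, clean data.
    denoiser.fit(x_train_noisy, x_train, epochs=5, batch_size=256, verbose=0)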