
Deep Learning and Variants_Session 4_20240127.pdf


Full Transcript

Presents: Deep Learning & its variants. GGU DBA Deep Learning. Dr. Anand Jayaraman, Professor, upGrad; Chief Data Scientist, Agastya Data Solutions.

1. What is Deep Learning?

Shallow knowledge vs. deep knowledge. "Deep" in the context of neural networks means having more than two layers in the architecture: an input layer followed by hidden layer 1, hidden layer 2, ..., hidden layer N, where N > 2 is deep.

Why deep learning? Given an input x and the corresponding target y, with y = f(x), we learn a neural network to represent f. A deep network can represent a complex non-linear function compactly, using fewer parameters, and learning hierarchies of features gives a more general solution.

Universal approximation theorem: an artificial neural network with a single hidden layer containing a sufficiently large (finite) number of neurons can approximate any continuous function on compact subsets of R^n. Does anyone see a problem with such an approach of using a very large number of neurons in just one hidden layer?

Problem with shallow networks. To make this easy to understand, let us consider an image dataset (if we chose an unstructured-data problem, visualization would be difficult): a total of 10 classes, with huge variance within each class. Given a single-hidden-layer network with a large number of neurons, I will be able to get 0% training error. How is that possible? And are we learning a good model?

Learning hierarchies of features: a hierarchy of features learned on face images in a classification task. For ease of visualization we picked images as the input data, but such hierarchies exist in other types of input data as well.

Compact representations: a k-layer network can represent a function compactly, with a number of hidden units that is polynomial in the number of inputs. A (k−1)-layer network cannot represent the same function unless it has an exponentially large number of hidden units.

MLP as a deep network: an input layer followed by hidden layer 1, hidden layer 2, ..., hidden layer N. Is it that simple? Add more layers and we get better performance?
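To make the "more layers" idea concrete, here is a minimal sketch of building a shallow versus a deep multi-layer perceptron. It assumes PyTorch, and the layer widths, depth, and input/output sizes (784 inputs, 10 classes) are illustrative choices rather than anything specified in the slides.

```python
# Minimal sketch of an MLP builder: "deep" here just means more than two
# hidden layers stacked between the input and output layers.
import torch.nn as nn

def make_mlp(n_inputs, n_hidden_layers, width, n_outputs):
    layers, in_dim = [], n_inputs
    for _ in range(n_hidden_layers):                 # hidden layer 1 ... hidden layer N
        layers += [nn.Linear(in_dim, width), nn.Sigmoid()]
        in_dim = width
    layers.append(nn.Linear(in_dim, n_outputs))      # output layer
    return nn.Sequential(*layers)

shallow = make_mlp(784, 1, 2048, 10)   # one very wide hidden layer
deep    = make_mlp(784, 8, 64, 10)     # N > 2 hidden layers => "deep"
```

The shallow network leans on width alone, while the deep one spends its parameters on depth; the next section looks at what goes wrong when depth is increased naively.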
2. Problems with Deep Networks

Issues while using more layers:
∙ Vanishing gradients: as we add more and more hidden layers, back-propagation becomes less and less useful in passing information to the lower layers. In effect, as information is passed back, the gradients begin to vanish and become small relative to the weights of the network.
∙ Difficulty in optimization: in deeper networks the parameter space is large, and the surface being optimized has complex structure. Gradient descent becomes very inefficient.
∙ Overfitting: perhaps the central problem in machine learning. Briefly, overfitting describes the phenomenon of fitting the training data too closely, perhaps with hypotheses that are too complex. In such a case, the learner ends up fitting the training data really well but performs much, much more poorly on real examples.

ANN learning. Learning is changing the weights. In the very simplest cases:
– Start with random weights.
– If the output is correct, do nothing.
– If the output is too high, decrease the weights attached to high inputs.
– If the output is too low, increase the weights attached to high inputs.
More generally, the update rule is w_{t+1} = w_t − α ∂E/∂W.

ANN learning: back-propagation. The method of computing the sensitivity of the error to a change in the weights, ∂E/∂W, is called back-propagation. The term is an abbreviation for "backward propagation of errors". It was popularized by a paper by Geoffrey Hinton, which led to a renaissance in the area of neural networks.

[Visualization of back-propagation learning through the output layer; figures from https://www.slideshare.net/keepurcalm/backpropagation-in-neural-networks]

Vanishing gradients. When N is very large, the back-propagation of the error to the initial layers of the network is almost zero, and thereby no learning happens there. Remember the manager (Algorithm-X) who helps distribute the error back to the layers? Layer 1 reports "this much is our contribution to the output error", but across many layers the reply is effectively "can't hear you!"

Understanding vanishing gradients. Feed forward: w1·x → hidden1 → hidden2 → output → error. Backward propagation applies the chain rule back through each layer. The derivative of a sigmoid is always less than 0.3, so the derivative with respect to w1 is a product of multiple terms, each of which is less than 0.3.

Optimization is difficult. The surface over which we need to find a minimum can be very complicated, with saddle points, steep gradients, and extremely shallow gradients.

Neural nets can overfit. For example: training accuracy 97.36% with test accuracy 89.38%, versus training accuracy 100% with test accuracy 86.24%.

Overfitting. Because we carefully choose loss functions, local minima are not a big issue, but high-dimensional spaces are filled with saddle points. Vanishing gradients coupled with overfitting and with getting stuck at local minima or saddle points is a deadly combo!
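The chain-rule argument under "Understanding vanishing gradients" above can be checked numerically. The sketch below is a rough illustration in NumPy: the randomly drawn pre-activations stand in for a real network, and the only point is that one sigmoid-derivative factor per layer (each at most 0.25) makes the running product collapse toward zero as depth grows.

```python
# Illustration of vanishing gradients: dE/dw1 contains one sigmoid-derivative
# factor per layer, and each factor is bounded by 0.25, so the product shrinks
# geometrically with depth.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)                      # maximum value 0.25, at z = 0

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=20)         # illustrative: one per layer, 20 layers
factors = sigmoid_deriv(pre_activations)      # each factor < 0.25
print(np.cumprod(factors))                    # gradient scale for deeper and deeper stacks
```

After a handful of layers the cumulative product is already tiny, which is exactly why the lower layers "can't hear" the error signal.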

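The overfitting pattern from the "Neural nets can overfit" slide (near-perfect training accuracy, noticeably lower test accuracy) is easy to reproduce. The sketch below uses scikit-learn on a synthetic dataset with a deliberately over-parameterized MLP; the dataset, the model size, and the exact accuracies it produces are illustrative assumptions, not the numbers from the slides.

```python
# Small overfitting demo: a large MLP on a small noisy dataset fits the
# training set almost perfectly but generalizes noticeably worse.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)

print("train accuracy:", clf.score(X_tr, y_tr))   # typically close to 1.0
print("test accuracy: ", clf.score(X_te, y_te))   # noticeably lower: overfitting
```

The gap between the two scores is the overfitting pattern the slides warn about.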