Deep Learning Lecture 8 PDF

Summary

This document is a lecture on deep learning, focusing on Convolutional Neural Networks (CNNs) and related topics such as transfer learning, sequence models, transformers, and generative AI. It discusses various applications of deep learning, particularly for image processing.

Full Transcript

DS 3000 Lecture 8: Deep Learning, Nov 11, 2024

8.1. Why Deep Learning

Recap: Machine and Deep Learning
❑ Artificial intelligence was born in the 1950s, when a handful of pioneers from computer science started asking whether computers could be made to “think”!
❑ AI can be described as the effort to automate intellectual tasks normally performed by humans.
❑ Machine learning only started to flourish in the 1990s, but it has quickly become the most popular and most successful subfield of AI, a trend driven by the availability of faster hardware and larger datasets. ML is related to mathematical statistics.
❑ Deep learning is a specific subfield of ML: a new take on successively learning meaningful representations from data.

Why Deep Learning?
❑ Deep learning has proven to be effective in capturing complex patterns in data
  o e.g., remarkable results in image recognition, handwriting transcription, autonomous driving, machine translation, text-to-speech conversion, digital assistants, ad targeting, etc.
❑ DL completely automates the most crucial step of feature engineering in ML/shallow learning.
❑ The “deep” isn’t deeper understanding; rather, it stands for successive layers of representations (“layered representations learning” or “hierarchical representations learning”).

Some Practical Applications of CNNs
❑ Semantic segmentation
❑ Pose detection
❑ Object detection
❑ Face recognition

Deep Learning
❑ Deep learning: learning successive layers of increasingly meaningful representations from data (layered representations learning / hierarchical representations learning).
❑ There are several variants of deep learning, such as auto-encoders, Deep Belief Networks, Deep Boltzmann Machines, Convolutional Neural Networks, Recurrent Neural Networks, and Transformers.
[Figure: hierarchical representation]

Why Deep Learning?
❑ Deep learning also makes problem-solving much easier, because it completely automates what used to be the most crucial step in a machine learning workflow: feature engineering.
❑ Learn features by example: if a set of input data is repeatedly presented to the network, it gradually acquires the ability to recognize patterns.
❑ More accuracy.
❑ Hand-crafted feature extraction cannot be updated during training, while in DL the whole network is updated during training.

Deep Learning (DL) vs Classical Machine Learning (ML)
[Figure: accuracy vs. amount of data for deep learning and conventional machine learning]
❑ For smaller datasets, traditional ML often provides slightly better performance.
❑ Traditional ML often provides more interpretable insights and ways to handcraft features.
❑ For larger datasets, DL methods tend to dominate.
❑ Progress in computational tools enables DL algorithms to process large sets of data in a short period of time.
[Textbook: NNDL-1.1]
8.2. Convolutional Neural Networks (CNNs)

Convolutional Neural Network (CNN)
❑ Neural networks (NNs) composed of successive layers of multiple replicated image filters are called "convolutional neural networks (CNNs)". [cs230 Andrew Ng, 2022]

Convolutional Neural Network (CNN) vs Multi-Layer Perceptron (MLP): Fully Connected Layer vs Locally Connected Layer
MLP:
❑ Each neuron detects one feature/pattern.
❑ One pattern over the whole image (learns at a more abstract level).
❑ Too many weights for each neuron/pattern; example: a 100x100 image requires 10,000 weights per pattern.
CNN:
❑ Each neuron detects a different pattern.
❑ Kernel/filter of a neuron: spatially limited connections; example: if the kernel is 10x10, 100 weights per pattern.
❑ Each neuron only detects a pattern at a certain location. [Kamnitsas, 2017]

Convolutional Neural Network (CNN) vs Multi-Layer Perceptron (MLP)
❑ MLP: fully connected.
❑ CNN: slide a local receptive field across the entire image; for each local receptive field, there is a different neuron in the hidden layer. [cs231 Li et al., 2022 & Kamnitsas, 2017]

Weight Sharing in CNN – Feature Maps
❑ Each neuron in the hidden layer connects to a small region of the input (the local receptive field of that hidden neuron).
❑ Hidden neurons learn an overall bias, and each connection learns a weight (to analyze its receptive field).
❑ Multiple neurons of a feature map (FM) share the same kernel, i.e., the same weights and bias.
❑ All neurons of a FM look for the same pattern/feature, at different locations (⇒ translation invariance).
❑ The activations of a FM's neurons (the FM/activation map) are computed by convolving the image with the FM's kernel!
❑ A FM can be perceived as an image (the result of the convolution).
[Figures: single neuron vs. feature map] [Kamnitsas, 2017]

Convolutional Layer
❑ We can detect multiple features per layer (multiple feature maps).
❑ A layer can be perceived as a multi-channel image (similar to RGB).
❑ For each feature map we need 5×5×3 = 75 shared weights, plus a single shared bias: 76 parameters to learn in total.
❑ If we have 20 feature maps, that's a total of 20×76 = 1520 parameters (see the code sketch below). [cs231 Li et al., 2022 & Kamnitsas, 2017]

Feature Transformations For Images
❑ Individual pixels are not adequately discriminant in most image classification problems.
❑ Patches and filters are commonly used to extract other features from images.
[Figure sequence: a 3×3 filter with weights w1–w9 is applied to different image patches; filter 1 produces responses such as 0.9, 0.7, 0.2, and 0.5, while filter 2 produces 0.1, 0.3, 0.9, and 0.1, depending on the patch.]
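The parameter arithmetic on the Convolutional Layer slide (5×5×3 = 75 shared weights plus one shared bias per feature map, 1520 parameters for 20 maps) can be checked directly in code. The lecture does not name a framework, so the following is a minimal sketch assuming PyTorch; only the layer sizes (3 input channels, 20 feature maps, 5×5 kernels) come from the slide.

```python
import torch.nn as nn

# 3 input channels (e.g., RGB), 20 feature maps, 5x5 kernels, as in the slide.
conv = nn.Conv2d(in_channels=3, out_channels=20, kernel_size=5)

# One feature map shares a single 5x5x3 kernel (75 weights) plus one bias: 76 parameters.
params_per_map = conv.weight[0].numel() + 1
# All 20 feature maps together: 20 * 76 = 1520 learnable parameters.
total_params = sum(p.numel() for p in conv.parameters())

print(params_per_map, total_params)  # 76 1520
```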
Convolutional Neural Network
[Architecture diagram slide]

8.3. Deep Model – CNN Layers

Deep CNN - Multiple Convolutional Layers
❑ A sequence of convolutional layers.
❑ Deeper layers process the FMs (channels) of previous layers to extract more complex representations. [cs231 Li et al., 2022 & Kamnitsas, 2017]

Learned Features by Convolutional Layers
❑ Examples of convolutional filters learned by a deep neural network (DNN) on a supervised image classification task.
Krizhevsky, Sutskever, and Hinton, 2012, https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

Deeper Architectures
❑ The final layer is fully connected (FC).
❑ It has a fixed dimension.
❑ It throws away spatial coordinates. [Pingel, Nehemiah, 2017]

Stride
❑ Does a filter always have to move one pixel at a time? Of course not.
❑ We can also make it move two or three steps at a time, both horizontally and vertically. This skip is called the 'stride.'
❑ A useful trick for calculating convolutions in neural networks more efficiently.
❑ Calculating the convolution at every location corresponds to a stride of one. Skipping every second value is a stride of two, and so forth.

Pooling Layer - Down-Sampling Feature Maps
❑ Pooling is the process of merging, basically for the purpose of reducing the size of the data.
❑ It makes representations smaller and more manageable (it simplifies the output of the convolutional layer).
❑ It operates over each activation map independently.
❑ Pooling can be seen as a way for the network to ask whether a given feature is found anywhere in a region of the image; it then throws away the exact positional information.
❑ Once a feature is found, its exact location isn't as important as its rough location relative to other features.
❑ By removing some noise in the data and keeping only the significant information, we can reduce overfitting and speed up the computation. (A short code sketch of stride and pooling follows below.)
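To make the stride and pooling slides concrete, here is a minimal sketch, again assuming PyTorch; the 32×32 input and the layer sizes are illustrative choices, not values from the lecture.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # a single 32x32 RGB image (batch, channels, height, width)

# Stride 1: the 5x5 filter is evaluated at every position -> 28x28 feature maps.
conv_s1 = nn.Conv2d(3, 20, kernel_size=5, stride=1)
# Stride 2: every second position is skipped -> 14x14 feature maps.
conv_s2 = nn.Conv2d(3, 20, kernel_size=5, stride=2)
# 2x2 max pooling keeps only the strongest activation in each region,
# halving each spatial dimension (rough location instead of exact position).
pool = nn.MaxPool2d(kernel_size=2)

print(conv_s1(x).shape)        # torch.Size([1, 20, 28, 28])
print(conv_s2(x).shape)        # torch.Size([1, 20, 14, 14])
print(pool(conv_s1(x)).shape)  # torch.Size([1, 20, 14, 14])
```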
Convolutional Neural Network
[Architecture diagram slide]

Convolutional Neural Network
❑ Training progress on an image experiment: 1 hidden layer vs. convolutional. [Training-curve figure]

Transfer Learning
❑ Transfer learning is generally used to save time and resources, rather than training multiple machine learning models from scratch to complete similar tasks.
❑ Training a DNN from scratch needs lots of data and computational resources.
❑ With a medium amount of data, a pre-trained model that was trained on a large dataset for a task similar to ours can be used.
❑ Tune it and apply it to the new data:
  o Train the entire model: use the architecture of the pre-trained model and train it on your large dataset (a lot of computational power is required!).
  o Train some layers: lower layers extract general features and higher layers extract specific features.
  o Freeze the convolutional layers in their original form as a fixed feature extractor and fine-tune the classifier. Useful if you're short on computational power and data. (See the fine-tuning sketch below.)
[Figure: knowledge transfer]

Take Advantage of Popular Models
❑ Ex.: the ImageNet benchmark.
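A minimal sketch of the "freeze the convolutional layers and fine-tune the classifier" option described above, assuming PyTorch/torchvision; the choice of ResNet-18 and a 10-class target task are illustrative assumptions, not part of the lecture.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Start from a model pre-trained on ImageNet (one of the popular benchmark models).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pre-trained layers so they act as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the new task (10 classes here is an assumption).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new classifier's parameters are updated during fine-tuning.
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
```

Unfreezing some of the higher convolutional layers afterwards corresponds to the "train some layers" option on the slide.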
8.4. Other Deep Neural Networks – Sequential Models

Sequence Modeling
❑ How to predict sequential data using a neural network?
❑ A NN that is flexible in terms of sequence length is needed.
❑ Examples: handwritten text, machine translation, video captioning, text-to-speech.

Sequence Analysis
❑ How to predict sequential data using a neural network?
❑ Needed criteria:
  o Handle variable-length sequences (a NN that is flexible in terms of sequence length is needed).
  o Track long-term dependencies and maintain information about order.
  o Share parameters across the sequence.
❑ Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Transformers.

Other Deep Learning Methods
❑ Recurrent Neural Networks (RNN):
  o RNNs have the unique ability to process sequential data dynamically, making them ideal for natural language processing (NLP) and time series analysis.
  o RNNs maintain an internal memory or hidden state to capture dependencies.
❑ Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU):
  o Subclasses of RNN, specialized in remembering information for extended periods by introducing various gates which regulate the cell state by adding or removing information from it.

RNN Issues
❑ The sequence-to-sequence model (a category of RNN models) has an issue when the input sequence is quite long and contains a lot of information.
❑ Not every piece of the input sequence's context is required at every decoding stage for all text-production activities.

Other Deep Learning Methods
❑ Transformers:
  o Encoder-decoder models able to process a whole sequence with a sophisticated attention mechanism.
  o The Transformer can learn longer-range dependencies than RNNs and their variants such as GRUs and LSTMs.
  o The biggest benefit, however, comes from how the Transformer lends itself to parallelization.
❑ Encoder:
  o Assign to each unique word a unique identifier (a number serves as a token to represent that word).
  o Note the location of every token relative to every other token.
  o Using just the token and its location, determine the probability of it being adjacent to, or in the vicinity of, every other word. Feed these probabilities into a NN to build a map of relationships. Given any string of words, the NN predicts the next word (e.g., AutoCorrect).
❑ Decoder:
  o Takes in the token representation and decodes it back into text.

8.5. A Life-Changing Breakthrough – Chatbot (Generative AI)

Language Models (LMs)
❑ An LM is fed a large corpus (dataset) of text and tasked with predicting the next word in a sentence, by randomly truncating the last part of an input sentence and training the model to fill in the missing word(s).
❑ Pre-2017, while CNNs worked great for images, RNNs for language did not.
❑ RNNs were sequential, error-prone, did not capture language-model insights, and therefore did not output natural-looking text.
❑ Google Brain introduced the Transformer model architecture.
M. Ramponi, "The Full Story of Large Language Models and RLHF", 2023

Process Input
❑ In Natural Language Processing (NLP), our inputs are sequences of words, but deep learning needs vectors of numbers:
❑ Tokenization: split the text into individual words.
❑ Embedding: a word embedding layer can be thought of as a lookup table to grab a learned vector representation of each word. Ex: Word to Vector (word2vec), GloVe, BERT, etc.
[Figure: vocabulary, indexing, encoding]

Generative Pre-trained Transformers (GPT) by OpenAI
❑ The development of large LMs (LLMs) is characterized by a dramatic increase in size (number of parameters).
❑ GPT is a Transformer-based LLM.
❑ The largest models (like GPT-4) are close to 1 trillion parameters (the human brain has 100 trillion connections!).
C. Greyling, "What Are Realistic GPT-4 Size Expectations?", 2023

Generative Pre-trained Transformers (GPT) by OpenAI
❑ LLMs "read" more than 100 billion words during their training phase.
❑ That is more than 100 times what a human will ever hear or read in their lifetime!
❑ They learn much more slowly than us, but have access to much (much!) more data.
❑ The downside of training larger models on more data is that the cost of computing keeps increasing.
C. Greyling, "What Are Realistic GPT-4 Size Expectations?", 2023

Example in Text Generation
❑ Install the Hugging Face Transformers library by running the following command: pip install transformers
❑ Hugging Face provides pre-trained models for a wide range of natural language processing (NLP) tasks, including language translation, question answering, and text classification. (A minimal generation sketch follows below.)
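As a concrete version of the text-generation example slide, a minimal sketch using the Hugging Face pipeline API might look like the following; the GPT-2 model and the prompt are assumptions, since the slide does not show the actual code.

```python
# Requires: pip install transformers (plus a backend such as PyTorch)
from transformers import pipeline

# Load a small pre-trained language model for text generation (GPT-2 is an assumed choice).
generator = pipeline("text-generation", model="gpt2")

# Generate a continuation of an illustrative prompt.
result = generator("Deep learning is", max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])
```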
