EE6427 Lecture Part 4 AY2425S1 Convolutional Neural Networks PDF

Document Details

Uploaded by Deleted User

National University of Singapore

2024

Dr Yap Kim Hui

Tags

convolutional neural networks, deep learning, artificial intelligence, machine learning

Summary

These lecture notes cover Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTMs), and Transformers. They include discussions on different deep neural network architectures, linear classifiers, and various applications.

Full Transcript


Dr Yap Kim Hui, Room: S2-B2b-53, Tel: 6790 4339, Email: [email protected]

References:
- Stanford Lecture Notes, CS231n: Convolutional Neural Networks for Visual Recognition.
- Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, http://www.deeplearningbook.org
- PyTorch Tutorial, https://pytorch.org/tutorials/
- University of Wisconsin Madison Lecture Notes, CS638.
- University of Michigan, EECS 498-007 / 598-005: Deep Learning for Computer Vision.
- Simplilearn, Recurrent Neural Network Tutorial.
- Shusen Wang, Transformer Model, YouTube online videos.
- Jay Alammar, The Illustrated Transformer.

Part 4 Outline: AI Models & Architectures
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Transformer

Section I: Convolutional Neural Network (CNN)

Section I Overview. The section covers the following topics:
- Introduction
- Linear Classifier
- Convolutional Neural Networks (CNNs)
- CNN Training & Optimization
- Well-Known CNN Architectures
- Applications

Different Deep Neural Network (DNN) Architectures

Common DNN architectures: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transformers, Large Language Models (LLMs), etc. Different DNNs are different tools for different applications; each DNN has its own unique features to address a unique problem.

Convolutional Neural Network (CNN)
- Consists of deep layers that extract progressively higher-level abstraction features.
- Commonly used in classification and regression applications.
Source: "Efficiency Optimization of Trainable Feature Extractors for a Consumer Platform", M. Peemen et al., 2011.

Recurrent Neural Network (RNN)
- A type of neural network that specializes in processing sequences.
- Commonly used in applications involving time-series and state-series prediction and modelling.
- Example applications: stock price prediction, language translation.
Source: "Language modeling a billion words", Torch blog.

Transformer
- A type of network that uses the attention mechanism to process the input sequence in parallel.
- Good at modelling long-range dependency.
- Achieves state-of-the-art performance in many vision and NLP applications.
Source: Intro to Large Language Models, Andrej Karpathy, Nov 2023.

Foundation Models (FMs)
- FMs are models trained on large-scale broad data that can be adapted (finetuned) to a wide range of downstream tasks / applications.
- Examples: LLMs (e.g., GPT), Vision-Language Models (VLMs) (e.g., CLIP).
Source: Rishi Bommasani, On the Opportunities and Risks of Foundation Models.

Linear Classifiers

The primary visual cortex is the primary cortical region of the brain that receives, integrates, and processes visual information relayed from the retinas. (Source: ncbi.nlm.nih.gov)
(Figures: "Neural Networks and Learning Machines", Siamak Azodolmolky; Zhenzhu Meng et al., Using a Data Driven Approach to Predict Waves Generated by Gravity Driven Mass Flows; Stanford Lecture Notes, CS231n.)

A linear classifier computes class scores $f(x) = Wx + b$ from the input image $x$, using a weight matrix $W$ and a bias vector $b$. How do we determine W and b? We need a loss function (error measurement), which is a metric/distance between the predicted values and the target output values (teachers) during training.

The loss function depends on what type of problem we are addressing. Two common problems:
- Regression: the target output is a continuous value. E.g., share price prediction, rainfall estimation, etc.
- Classification: predict the label/category of the input. The target output is discrete (from a set of possible labels). E.g., cancer classification (binary), face recognition (multi-class).
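As a concrete illustration of the linear classifier $f(x) = Wx + b$ above, here is a minimal NumPy sketch. The input size (a flattened 32x32x3 image) and the number of classes (10) are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

# Illustrative sizes (assumed): a flattened 32x32x3 image and 10 classes.
D, C = 32 * 32 * 3, 10

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(C, D))  # weight matrix: one row of weights per class
b = np.zeros(C)                          # bias vector

x = rng.random(D)         # a flattened input image
z = W @ x + b             # class scores (logits), one per class
pred = int(np.argmax(z))  # predicted label = class with the highest score
print(z.shape, pred)
```

During training, W and b are adjusted to reduce a loss computed from these scores; the regression and classification losses that follow are the usual choices.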
Regression losses:
- Square loss / Mean Square Error (MSE): $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2$
- Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\big|y_i - f(x_i)\big|$
where x and y are the input data and the target output value, and f(x) is the network predicted output.

Softmax loss:
- Cross-entropy loss with softmax normalization: $p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$, where $z_j = f(x_j)$, and $L = -\sum_j y_j \log p_j$.

Example: The output scores of a linear classifier for an input cat image are given as follows. What is the softmax loss for this training data?

  Predicted output scores (z)   Ground truth labels (y)   Normalized probabilities (p)
  0.7                           0                         0.109
  0.1                           0                         0.060
  1.6                           0                         0.269
  2.2                           1                         0.489
  0.3                           0                         0.073

$L = -\sum_j y_j \log p_j = -\ln(0.489) \approx 0.715$. (Note: the log is the natural log, namely ln.)

CNN Architecture
Source: "A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way", towardsdatascience.com

A CNN is built from the following layer types:
- Convolutional Layer
- Activation Function Layer
- Pooling Layer
- Fully-Connected (FC) Layer / Linear Layer
- Softmax Layer

Convolutional Layer
- Extracts features in a hierarchical manner: early layers extract low-level features, whereas later layers extract high-level features.
- Reduces the number of parameters to be trained by sharing weights through the convolution kernel.
(Figures: Stanford Lecture Notes, CS231n; a second, green filter produces a second activation map.)

Padding
Pad borders of zeros around the input volume before convolution; with a suitable amount of zero padding, applying the conv layer (e.g., three filters with a stride of 1) produces an output volume with the same spatial size as the original input.
Source: "A Beginner's Guide To Understanding Convolutional Neural Networks Part 2", Adit Deshpande, github.io; cloud.tencent.com/developer/article/1031131; https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks

Stride
The filter slides across the input with a given step size; the figures contrast a filter with stride 1 against a filter with stride 2.
Source: "A Beginner's Guide To Understanding Convolutional Neural Networks Part 2", Adit Deshpande, github.io; cloud.tencent.com/developer/article/1031131

Activation Layer
- Performs element-by-element nonlinear activation function mapping.
- ReLU is one of the most popular activation functions.
- Often combined with the convolution layer.
Source: Stanford Lecture Notes, CS231n; https://learnopencv.com/understanding-convolutional-neural-networks-cnn/

Pooling Layer
- Reduces the activation map dimension, hence reducing computation and storage requirements.
- Common pooling operations: max pooling, average pooling.
- Each channel of the activation/feature maps is operated on independently.
Source: Stanford Lecture Notes, CS231n; https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

Fully Connected (FC) Layer
- All nodes in one layer are connected to all nodes in the next layer.
- The FC layer feature vector can be used as a feature (also known as an embedding) to represent the input image.
- The FC layer behaves like a linear layer for classification.
Source: Stanford Lecture Notes, CS231n

Softmax Layer
- Maps the output scores (logits) from the last FC layer into probabilities for classification problems: $p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$
- The softmax loss is computed based on these probabilities.
Source: https://stats.stackexchange.com/questions/265905/derivative-of-softmax-with-respect-to-weights
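To check the worked cat-image example above numerically, here is a short NumPy sketch; the scores and the one-hot ground-truth labels are taken directly from the table.

```python
import numpy as np

z = np.array([0.7, 0.1, 1.6, 2.2, 0.3])   # predicted output scores from the table
y = np.array([0, 0, 0, 1, 0])             # one-hot ground-truth labels

p = np.exp(z) / np.exp(z).sum()           # softmax normalization: p_j = e^{z_j} / sum_k e^{z_k}
L = -np.sum(y * np.log(p))                # cross-entropy loss (natural log)

print(np.round(p, 3))   # ~[0.109 0.060 0.269 0.489 0.073]
print(round(L, 3))      # ~0.715
```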
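The layer types described above can be stacked in a few lines of PyTorch. This is a minimal sketch rather than one of the well-known architectures covered next; the channel counts, kernel sizes, and the 32x32 RGB input are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Minimal CNN sketch: Conv -> ReLU -> MaxPool blocks, then a fully-connected layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer (stride 1, zero padding 1)
    nn.ReLU(),                                   # activation layer
    nn.MaxPool2d(2),                             # pooling layer: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully-connected (linear) layer -> 10 class logits
)

x = torch.randn(4, 3, 32, 32)                    # a batch of 4 RGB 32x32 images
logits = model(x)                                # shape: (4, 10)
probs = torch.softmax(logits, dim=1)             # softmax layer maps logits to probabilities
print(logits.shape, probs.sum(dim=1))
```

For training, the softmax and cross-entropy loss are usually fused (e.g., nn.CrossEntropyLoss applied to the raw logits), so the explicit softmax is only needed at inference time.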
Exercise: layer summary table — for each of the Conv Layer, Activation Layer, Pooling Layer and Fully-Connected Layer, give its definition and its feed-forward logic.

Exercise: CNN.
Exercise: CNN FC and Softmax Layers.

CNN Training & Optimization
Source: "Coding Deep Learning for Beginners — Linear Regression Part 3", Kamil Krzyk, towardsdatascience.com; Stanford Lecture Notes, CS231n

Well-Known CNN Architectures
Source: packtpub.com/../evolution-of-cnn-architectures

AlexNet
- First neural network applied on large-scale image data.
- Classical structure for image classification consisting of convolution layers and fully connected layers.
- ILSVRC champion in 2012.
Source: "ImageNet classification with deep convolutional neural networks", Alex Krizhevsky et al., 2012

VGGNet
- Runner-up of ILSVRC 2014.
- Elegant network architecture: a deep network built from small 3×3 kernels.
- Requires a large number of parameters.
- Two common variants: VGG-16 and VGG-19.
Source: "Very deep convolutional networks for large-scale image recognition", K. Simonyan et al., 2014

ResNet
- Winner of ILSVRC 2015.
- Very deep structure.
- Uses residual blocks and highway/skip connections for better gradient backpropagation.
Source: "Deep residual learning for image recognition", Kaiming He et al., 2016; "Are Deep Learning Networks Getting Too Deep?", principlesofdeeplearning.com

Comparing architectures:
- Accuracy
- Memory footprint (parameters/weights + activation maps)
- Speed/computational complexity (FLOPS)
Often, which metric matters most depends on the application/problem.

Applications
Example: food recognition mobile app (home page, taking a picture, food recognition).

Section I recap. The section covered the following topics:
- Introduction
- Linear Classifier
- Convolutional Neural Networks (CNNs)
- CNN Training & Optimization
- Well-Known CNN Architectures
- Applications

Section II: Recurrent Neural Network (RNN) & Long Short-Term Memory (LSTM)

Section II Overview. The section covers the following topics:
- Introduction
- Recurrent Neural Network (RNN)
- RNN Training & Optimization
- Long Short-Term Memory (LSTM)
- Applications
Source: Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | Simplilearn

Example applications of RNNs:
- Time series prediction / forecasting: prediction of stock prices, product sales, etc.
- Speech recognition
- Natural Language Processing (NLP): language translation, text sentiment classification
- Image captioning

Sequence Modelling

Sequence Modelling (One-to-Many Mapping)
- One input, many outputs (a sequence of outputs).
- E.g., image captioning: image → sequence of words.
Source: UMich EECS 498-007 / 598-005: Deep Learning for Computer Vision; https://www.analyticsvidhya.com/blog/2018/04/solving-an-image-captioning-task-using-deep-learning/

Sequence Modelling (Many-to-One Mapping)
- Many inputs (a sequence of inputs), one output.
- E.g., video classification: sequence of images → label; sentiment classification: sequence of words → label.

Sequence Modelling (Many-to-Many Mapping)
- Many inputs (a sequence of inputs), many outputs (a sequence of outputs).
- E.g., machine translation: sequence of words → sequence of words; per-frame video classification: sequence of images → sequence of labels.
Source: https://medium.com/analytics-vidhya/seq2seq-model-and-the-exposure-bias-problem-962bb5607097
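The many-to-one and many-to-many patterns above can be illustrated with tensor shapes. A minimal PyTorch sketch, using nn.RNN as a stand-in for the recurrent models introduced next; the sequence length, feature size, hidden size, and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

T, D, H, C = 12, 8, 16, 5           # assumed: sequence length, input features, hidden size, classes
rnn = nn.RNN(input_size=D, hidden_size=H, batch_first=True)
head = nn.Linear(H, C)              # shared classification head

x = torch.randn(1, T, D)            # one input sequence of T steps
outputs, h_last = rnn(x)            # outputs: hidden state at every step; h_last: final hidden state

per_step_logits = head(outputs)         # many-to-many: one prediction per step, shape (1, T, C)
sequence_logits = head(outputs[:, -1])  # many-to-one: single prediction from the last step, shape (1, C)
print(per_step_logits.shape, sequence_logits.shape)
```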
Recurrent Neural Networks (RNNs)

RNNs are a class of neural networks whose inputs take the form of sequential data, e.g., time-series data. Commonly used in the analysis of temporal / sequential data. Widely deployed in applications including Natural Language Processing (NLP), machine translation, image captioning, etc.
Source: https://medium.com/deeplearningbrasilia/deep-learning-recurrent-neural-networks-f9482a24d010; Stanford Lecture Notes, CS231n

Exercise: RNN.

RNN pros and cons:
- Pros:
  - Can process any length of input.
  - Computation for step t can in theory use information from many steps back.
  - Model size does not increase for longer inputs.
  - The same weights are applied at every timestep, so there is consistency in how inputs are processed.
- Cons:
  - Recurrent computation is slow.
  - In practice, it is difficult to leverage information from many steps back.

RNN Training & Optimization
Source: "Coding Deep Learning for Beginners — Linear Regression Part 3", Kamil Krzyk, towardsdatascience.com; Stanford Lecture Notes, CS231n

- Full Batch: use the entire set of training sequences at each iteration.
- Stochastic: use a single sequence at each iteration.
- Minibatch (Stochastic): use a few training sequences at each iteration.
Source: https://medium.com/analytics-vidhya/gradient-descent-vs-stochastic-gd-vs-mini-batch-sgd-fbd3a2cb4ba4

Exploding / Vanishing Gradients (1), (2)
Source: Stanford Lecture Notes, CS231n

Long Short-Term Memory (LSTM)
- A memory cell which can maintain its state over time.
- Consists of a cell state (ct), a hidden state (ht) and 4 gates (i, f, o, g). ct is the long-term memory; ht is the short-term memory.
- The cell state (ct) undergoes changes via forgetting old memory (through the forget (f) gate) and adding new memory (through the input (i) gate and the gate (g) gate).
- The hidden state (ht) is updated by passing the cell state (ct) through the output gate.
- Gates control the flow of information to the memory. Gates are obtained through a sigmoid/tanh layer, and they update the ct and ht states using the pointwise multiplication operator. (A code sketch of one LSTM step is given at the end of this section.)
Source: Stanford Lecture Notes, CS231n

Exercise: LSTM.

Applications
- Time-series forecasting. Source: https://www.kdnuggets.com/2019/05/machine-learning-time-series-forecasting.html
- Machine Translation. Source: Stanford Lecture Notes, CS224n

Section II recap. The section covered the following topics:
- Introduction
- Recurrent Neural Network (RNN)
- RNN Training & Optimization
- Long Short-Term Memory (LSTM)
- Applications
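Before leaving Section II, here is a minimal sketch of a single LSTM step, written directly from the gate description above (i, f, o through a sigmoid, g through a tanh). The input and hidden sizes are illustrative assumptions, and in practice one would use torch.nn.LSTM rather than writing the cell by hand.

```python
import torch

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: update the cell state c_t and the hidden state h_t."""
    z = x @ W["x"] + h_prev @ W["h"] + b          # joint pre-activation for all 4 gates
    i, f, o, g = z.chunk(4, dim=-1)               # input, forget, output gates and candidate memory
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g                        # forget old memory, add new memory
    h = o * torch.tanh(c)                         # hidden state: cell state passed through output gate
    return h, c

D, H = 8, 16                                      # assumed input and hidden sizes
W = {"x": torch.randn(D, 4 * H) * 0.1, "h": torch.randn(H, 4 * H) * 0.1}
b = torch.zeros(4 * H)
h, c = torch.zeros(1, H), torch.zeros(1, H)
x = torch.randn(1, D)
h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```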
Section III: Transformer

Section III Overview. The section covers the following topics:
- Attention Concept
- Transformer Architecture
- Vision Transformer

What is Attention?
- Attention is used to determine which input tokens (e.g., words in NLP, image patches in CV) are relevant to the current input / token.
- Attention is computed through the correlation (dot product) between 2 vectors.
- Correlation → similarity / relatedness / importance → attention.
Source: https://jalammar.github.io/illustrated-transformer/; "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020), Alexey Dosovitskiy et al.

How is Attention Computed?
Each input token generates 3 vectors: query (q), key (k) and value (v). These vectors provide a more flexible representation through linear mappings (WQ, WK, WV) that learn the underlying relationship / attention between input tokens. Attention is computed as follows (a code sketch is given later in this section):
- Step 1: compute the correlation (dot product) between the query (q) and key (k) vectors.
- Step 2: scale the correlation values from Step 1 and normalize them using the Softmax function.
- Step 3: multiply the output of Step 2 by the corresponding value (v) vectors and sum them up.

Scaled Dot-Product Attention follows these same three steps: dot-product correlation between q and k, scaling and softmax normalization, and a weighted sum of the value vectors.

Transformer
- Uses the attention mechanism to process the input sequence in parallel, together with dense/feedforward/MLP layers.
- Highly parallelizable, and can offer global attention.
- Good at modelling long-range dependency.
- Achieves state-of-the-art performance in many vision and NLP applications.
- Led to other SOTA methods such as BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding).

Transformer in Machine Translation (1), (2)
Source: https://jalammar.github.io/illustrated-transformer/

- Initially designed for neural machine translation; later extended to visual tasks such as recognition, detection, etc., with great success.
- Leverages the attention mechanism to analyze the importance of a token (word / image patch) with respect to other tokens / image patches.
- Consists of a transformer encoder and a transformer decoder.
Attention Is All You Need (2017), https://arxiv.org/abs/1706.03762 — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin.

Input pre-processing:
- Map input words / tokens (e.g., French words) into text embeddings / vectors.
- Add position encoding information for the input words.
Encoder:
- Maps the input vectors from pre-processing into context vectors using the attention mechanism.
- The context vectors pass through a feedforward layer to generate the encoder outputs.
- The encoder outputs are a better representation than the input vectors, as they leverage the context information of the other input tokens through the attention mechanism.

Output pre-processing:
- Map output words / tokens (e.g., English words) into text embeddings.
- Add position encoding information for the output words.
Decoder:
- Output masked self-attention: maps the output vectors into context vectors of the output words (e.g., English words). Masking is used to hide unseen words during training.
- Encoder-decoder self-attention (or cross-attention): performs cross-attention between the context vectors from the self-attention stage (e.g., English words) and the encoder outputs (e.g., French words).
- The resulting vectors pass through a feedforward layer to generate the decoder outputs.
- The decoder outputs leverage (1) self-attention among the English words and (2) cross-attention between the French and English words to obtain a better representation.
Output post-processing:
- Map the decoder outputs into probabilities using the Softmax function to generate the next output word (i.e., the next English word).
Source: Ria Kulshrestha, Transformers, https://towardsdatascience.com/transformers-89034557de14
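The three steps of scaled dot-product attention described above map almost line for line onto code. A minimal PyTorch sketch; the token count and dimensions are assumptions, and the optional mask corresponds to the masked self-attention used in the decoder.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # Step 1: correlation (dot product), scaled
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # hide "unseen" positions
    weights = F.softmax(scores, dim=-1)               # Step 2: normalize with softmax
    return weights @ V                                # Step 3: weighted sum of the value vectors

# Assumed sizes: 6 tokens, model dimension 32.
n, d = 6, 32
X = torch.randn(n, d)                                 # token embeddings
W_q, W_k, W_v = (torch.randn(d, d) * 0.1 for _ in range(3))  # learned projections (random here)
Q, K, V = X @ W_q, X @ W_k, X @ W_v                   # query, key, value vectors per token
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                                      # (6, 32): one attended vector per token
```

Multi-head attention, covered next, simply runs several such attention computations in parallel with different projection matrices and concatenates the results.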
Cross-Attention / Encoder-Decoder Self-Attention
Source: Shusen Wang, Transformer Model

More Details
- Multi-Head Attention
- Feed Forward Network
- Positional Encoding
- Residual Connection
- Layer Normalization (Norm)
- Input/Output Embedding
- Masked Attention

Multi-Head Attention
Source: https://deepfrench.gitlab.io/deep-learning-project/

Position Encoding
- Represents the position information of individual input tokens.
- Position encodings based on sin / cos functions are often used.
Source: https://jalammar.github.io/illustrated-transformer/

Residual Connection & Layer Norm
- The residual connection leverages the idea of residual learning in ResNet.
- Layer Norm is used to perform normalization.
Source: https://jalammar.github.io/illustrated-transformer/

Transformer Encoder + Decoder Architecture
Source: https://jalammar.github.io/illustrated-transformer/

Final Linear and Softmax Layer
Source: https://jalammar.github.io/illustrated-transformer/

LSTM vs Transformer:
- LSTM — Pros: works reasonably well for long sequences. Cons: sequential computation.
- Transformer — Pros: excellent at long sequences, as attention looks at all inputs; parallel computation. Cons: requires large memory and training data.

Vision Transformer (1)
A SOTA image classification model based on attention: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020), Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

Vision Transformer (2)
Key steps of the Vision Transformer (ViT) (a code sketch of the pre-processing steps is given at the end of this part):
- Partition an image into patches / tokens.
- Flatten the patches / tokens using lexicographical ordering.
- Generate linear embeddings from the flattened patches / tokens.
- Introduce an extra learnable class embedding.
- Add positional embeddings / encoding.
- Pass the tokens to a transformer encoder.
- The encoder output of the extra learnable token passes through an MLP head network for classification.
- Training involves using a pretrained model and finetuning it on the target dataset for image classification.

Exercise: Model Comparison.

Section III recap. The section covered the following topics:
- Attention Concept
- Transformer Architecture
- Vision Transformer

Part 4 Summary
This part covers the following:
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Transformer
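As a closing illustration, the ViT pre-processing steps listed in Section III (patching, flattening, linear embedding, class token, positional embedding) can be sketched as below. The 224x224 image, 16x16 patches, and 768-dimensional embedding follow the usual ViT-Base configuration but are assumptions here; the transformer encoder and MLP head are left out.

```python
import torch
import torch.nn as nn

img_size, patch, d = 224, 16, 768            # assumed: ViT-Base style sizes
n_patches = (img_size // patch) ** 2         # 14 * 14 = 196 patches ("16x16 words")

x = torch.randn(1, 3, img_size, img_size)    # one input image

# Partition into non-overlapping patches and flatten each patch into a vector.
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)            # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, 3 * patch * patch)

embed = nn.Linear(3 * patch * patch, d)      # linear embedding of the flattened patches
tokens = embed(patches)                      # (1, 196, 768)

cls_token = nn.Parameter(torch.zeros(1, 1, d))               # extra learnable class embedding
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d))   # positional embeddings

tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed   # (1, 197, 768) -> transformer encoder
print(tokens.shape)
```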
