Deep Neural Networks II - CNNs and RNNs - PDF

Document Details


Uploaded by MesmerizingGyrolite5380

Ajou University

Kyung-Ah Sohn

Tags

deep neural networks, convolutional neural networks, recurrent neural networks, machine learning

Summary

These lecture notes cover CNNs and RNNs within the broader topic of deep neural networks. Topics include architecture, training, named CNNs, transfer learning, sequence-based prediction, gated RNNs, and sequence-to-sequence problems. The document also features image data examples and includes illustrative code.

Full Transcript


Deep neural networks II - CNNs and RNNs
Kyung-Ah Sohn, Ajou University

Contents
Convolutional neural networks
– CNN architecture, training and regularization
– Named CNNs
– Transfer learning
Recurrent neural networks
– Sequence-based prediction
– Gated RNNs
– Sequence-to-sequence problems

Review: MLP for classification
Code steps: define the model architecture, train the model, make predictions on new samples (example output: a class-probability vector such as [0, 0, 0, 0.3, 0, 0.7, ..., 0])
Ex) The number of parameters in each layer: L1: 235,500, L2: 30,100, L3: 1,010, total params: 266,610
Ex) What are the hyper-parameters?

Image data
An image is a matrix (or a flattened vector) of numbers in [0, 255]
e.g., a 1080x1080x3-dimensional vector for an RGB image of size 1080x1080
[Figure: a grid of example pixel intensity values]

Fully-connected layer
FC layers model a relationship from every input feature to the output
If used for images, the input image must first be flattened into a vector

Limitation of FC layers for (large) images
High computational cost
– FC layers generate a massive number of parameters, which results in high memory and computational requirements
Loss of spatial structure
– FC layers treat all input pixels independently and do not take into account local relationships between pixels

Spatial locality
How to recognize a pattern in an image
– Design a filter (= kernel) that gives a higher score where the desired pattern matches
– Use a dot product to compute the score at each location
– The same filter can be used at different locations of the image

Convolution
How to recognize a pattern in an image
– Design a filter that gives a higher score where the desired pattern matches
– Exploit the spatial structure in images

Convolutional networks
Convolutional networks (LeCun, 1989) are a specialized kind of neural network for grid-like data (e.g., images, time series)
– Applicable to any input that is laid out on a grid (1-D, 2-D, 3-D, ...)
– Scale up neural networks to process very large images / video sequences
– Employ a mathematical operation called convolution
Convolutional neural networks (CNNs) are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers

Convolution
A 2D convolution takes the dot product of the filter (kernel) and each input location

Convolution with bias
A scalar (1x1) bias is added to each value produced by applying the filter (kernel) to the input

Convolution example
A kernel can act like a high-pass filter: pixels near edges survive
– Can be used to extract useful features
Traditionally, hand-crafted convolution kernels were used; now we let the neural network learn the kernels automatically from data
Convolution can be performed very efficiently with modern libraries and GPUs

Key idea of CNN
Replace matrix multiplication in neural nets with convolution; everything else stays almost the same

Convolutional networks
A network with fully connected (FC) layers consists of affine (/linear/dense/FC) layers
A network with convolutional layers consists of conv layers, pooling layers, and FC layers

Multiple channels
For a W x H x C input (e.g., B, G, R channels), each filter is of size K1 x K2 x C plus one bias; the filter weights and bias are the learnable parameters
Filter depth = input depth (number of channels)

Multiple filters
D filters produce D output feature maps

ConvLayer: multiple filters with bias
A conv layer maps the output of layer (l-1) through D filters and their biases (1 bias per filter) to the D feature maps of layer (l)
The filters and biases (the learnable parameters) are learned from data, e.g.:
    tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,3))
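To make the "dot product at every location" picture above concrete, here is a minimal NumPy sketch of a single-channel valid convolution with a bias and a stride; the edge-detecting kernel and the toy image are illustrative choices, not values from the slides.

    import numpy as np

    def conv2d(image, kernel, bias=0.0, stride=1):
        # Slide the kernel over the image and take a dot product at each location
        # (cross-correlation, as implemented in most deep learning libraries).
        kh, kw = kernel.shape
        oh = (image.shape[0] - kh) // stride + 1
        ow = (image.shape[1] - kw) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel) + bias  # dot product + bias
        return out

    # A simple vertical-edge kernel acts like a high-pass filter:
    image = np.array([[10, 10, 10, 200, 200, 200]] * 6, dtype=float)
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]], dtype=float)
    print(conv2d(image, kernel))  # large-magnitude responses only at the 10->200 edge

In a CNN the kernel values are not designed by hand as above; they are exactly the weights that tf.keras.layers.Conv2D learns from data.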
Stacking conv-layers
Common CNN architectures have multiple convolutional layers
– The learned filters usually represent lower-level to higher-level spatial features
If we use a larger filter, the activation map shrinks more quickly; if the input image is high resolution, conv layers require too much computation
Hyperparameters to control this: stride and padding

Stride
Use stride to downsample the input
– With stride = 2, the filter moves 2 pixels at a time

Padding
It is common to zero-pad the border, typically by adding zeros around the perimeter of the input

Pooling
Locally aggregates values in each feature map
– Makes the representations smaller and more manageable
– Provides some level of denoising and reduces overfitting
– Operates over each activation map independently
Example: 2x2 max pooling with stride 2; the operation is predefined (maximum, average, etc.)

Pooling layer
No parameters to learn
Hyperparameters: size (F) and stride (S)
The number of channels does not change
Robust to input variation

Typical CNN architecture
(Repeated) convolutional layers and pooling layers
– Act like a feature extractor (low-level to high-level)
Fully connected (/dense/affine) layers at the end of the network
– For classification / regression
End-to-end learning
Img src: https://developersbreach.com/convolution-neural-network-deep-learning/

3D volume transformation in CNNs
Each layer transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters
e.g., a 224x224x3 input volume is transformed through intermediate volumes such as 55x55x48 into a 1x1x10 output (w x h x c)
http://cs231n.github.io/convolutional-networks/

Example (Keras)
Input: 28x28x1 (a minimal sketch of such a model follows below)
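The following is a hedged reconstruction of the kind of Keras model the "Example (Keras)" slide refers to for a 28x28x1 input; the number of filters, dense units, and classes (10) are assumptions, not values taken from the slide.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D((2, 2)),      # 2x2 max pooling, stride 2
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),                 # hand over to the FC (dense) part
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()  # shows the output volume shape and parameter count of each layer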
CNN architecture summary
A ConvNet architecture is a list of layers that transform the image volume into an output volume
There are a few distinct types of layers (e.g., CONV/FC/POOL)
Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
Each layer may or may not have parameters
Each layer may or may not have additional hyperparameters
http://cs231n.github.io/convolutional-networks/

Training CNNs
Back-propagation
– SGD, Adam, etc.
(Trainable) parameters:
– The kernel weights of each layer
– A bias for each kernel
Hyperparameters
– e.g., number of kernels per layer, kernel size, stride, zero-padding, network depth
– Choice of activation function

Training deep neural networks
The larger the network, the more difficult it is to design and train
– Optimization difficulty: gradient vanishing/exploding
– Generalization difficulty: larger networks are more likely to overfit
Regularization techniques help improve generalization, reduce overfitting, and enhance robustness
– Batch normalization, dropout, data augmentation, weight decay, early stopping, etc.

Normalization methods for CNNs
Different normalization strategies operate across layers, channels, or groups: layer normalization normalizes within a single training sample, instance normalization normalizes each individual feature map, and group normalization divides the channels into groups and applies normalization within each group
Wu, Yuxin, and Kaiming He. Group normalization. ECCV 2018

Dropout in CNNs
Dropout: a fraction of hidden units is randomly dropped at every iteration with a certain probability (p = 0.5 is common)
More effective in fully connected layers than in convolutional layers
When used in conv layers, it is applied after the activation function (CONV -> RELU -> DROP), at a much lower rate (e.g., 0.1 or 0.2)
Batch normalization can partly replace dropout in CNNs (but not always)

Data augmentation for images
Cutout
– Training: randomly mask (set to zero) square sections of the training images
– Testing: use the full image
– DeVries and Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout", arXiv 2017
Mixup
– Training: train on random blends of images
– Testing: use the original images
– Zhang et al., "mixup: Beyond Empirical Risk Minimization", ICLR 2018

NAMED CNNS

Image classification

Natural image datasets

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Ran annually from 2010 to 2017, and is now hosted by Kaggle
ILSVRC contributed greatly to the development of CNN architectures

CNNs for classification
AlexNet (2012): the first work that popularized CNNs in computer vision
VGGNet (2014): small filters, deeper network

ResNet [He et al., 2015]
Swept 1st place in all ILSVRC and COCO 2015 competitions
Very deep networks using residual connections
– Training a >100-layer network without difficulty
ResNet also shows good generalization ability
Batch normalization after every CONV layer
No large fully connected layers at the end of the network
Mini-batch size 256, weight decay of 1e-5, SGD + momentum (0.9), learning rate 0.1 divided by 10 gradually, no dropout used

Residual connections
Skip-connections via addition
Learn the "residuals" by reparametrizing each layer so that it can easily represent an identity function
This helps mitigate issues like vanishing gradients and makes deep networks much easier to train
"Deep Residual Learning for Image Recognition", He et al., 2016

CNNs for detection, segmentation, etc.

Image generation models
– GAN: generative adversarial network, e.g., BigGAN (ICLR 2019)
– Diffusion models, e.g., the image created by Jason Allen via Midjourney (2022); Yang, Binxin, et al., Paint by Example: Exemplar-based Image Editing with Diffusion Models (CVPR 2023)

TRANSFER LEARNING

Transfer learning
Training a big model requires huge datasets and computing resources
In transfer learning, a model trained for a task (or a domain) is reused as the starting point for a model on a related task (or a domain)
Typical steps (pre-training and fine-tuning):
1. Download a pre-trained model (or find a very large dataset with similar data and train a big model on it)
2. Transfer-learn to your dataset: fine-tune some of the model parameters using your own dataset

Fine-tuning approach
Convolutional layers (actually any other kind of layer that extracts useful features) act like a feature extractor
A source model pre-trained on a large dataset, e.g., ImageNet, generalizes well, so one can expect it to be a good feature extractor or parameter initialization
– To avoid overfitting, one can often freeze the convolutional layers for small target datasets
– Can transfer to different domains and tasks
– But with the same architecture (at least for the feature-extraction part)

Transfer learning with CNNs
[Figure: strategies by target-data size and similarity to the pretraining data, e.g., fine-tune all the layers for a large dataset with low similarity to the pretraining data; freeze more layers for a small dataset with high similarity]
A code sketch of the freeze-and-fine-tune recipe follows below.
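Below is a hedged Keras sketch of the freeze-then-fine-tune recipe described above, using an ImageNet-pretrained ResNet50 as the frozen feature extractor; num_classes, the input size, and the classifier head are illustrative assumptions, not values from the slides.

    import tensorflow as tf

    num_classes = 5  # hypothetical target task

    # Pre-trained convolutional base without its original ImageNet classifier.
    base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                          input_shape=(224, 224, 3))
    base.trainable = False  # freeze the convolutional layers (small target dataset)

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation='softmax')  # new task-specific head
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model.fit(target_dataset, ...)  # train only the new head on the target data

With a larger target dataset (or one less similar to ImageNet), one would instead set base.trainable = True afterwards and continue training end-to-end with a small learning rate.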
Pretrained models and weights
https://pytorch.org/vision/stable/models.html

Deep learning for computer vision
Has largely relied on supervised learning (which needs labeled data)
Recently, self-supervised learning, in which implicit labels are extracted from the data points themselves and used to train the algorithm, has pushed the field towards fully unsupervised pre-training
Transfer learning, in which an algorithm is first trained on a large and unrelated corpus and then fine-tuned on a dataset of interest (e.g., medical), has been critical
Techniques to generate synthetic data, such as data augmentation and generative models, are being developed

CNN: Summary
– CNN architecture
– Many CNN architectures and pretrained models are available
– Regularization techniques for deep convolutional networks
– Transfer learning with CNNs (or any other recent models) is pervasive

RECURRENT NEURAL NETWORK

Applications
https://jinglescode.github.io/2020/05/21/three-types-sequence-prediction-problems/

Applications in NLP
Sentiment classification, speech recognition, machine translation, text generation

Challenges
How do we define a network architecture to model p("Hey Jude" | ...)?
– For example, the input sentences can be of variable length, but standard neural networks can only handle data of a fixed input size
How do we remember past information and use it for future prediction?

Example: Sequence classification
NN for non-sequential data: a single input vector x (e.g., [0.1, 2.5, 0.2]) is passed through a hidden layer h to produce the output y = σ(W h + b) (e.g., [0.3, 0.7])
NN for sequential data: an input sequence x_1, ..., x_T (e.g., observations of 3 variables for 5 days, i.e., sequence data of length T = 5) is processed step by step into hidden states h_1, ..., h_T, starting from a randomly initialized h_0

Sequence classification: input encoding
h_1 = σ(W h_0 + U x_1 + b): encodes information about x_1
h_2 = σ(W h_1 + U x_2 + b): encodes information about [x_1, x_2]
...
h_T: encodes information about the whole input [x_1, x_2, ..., x_T]
h_0 is randomly initialized, and the same weights (W, U) are shared across time

Sequence classification
Once the entire sequence is encoded (as the last hidden state h_T), we put a classifier (or a regressor) on top that maps this encoding (/the last hidden state/latent representation) to the output: y = σ(V h_T + c)
A sketch of this computation follows below.
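A minimal NumPy sketch of the recurrence and readout just described, h_t = tanh(W h_{t-1} + U x_t + b) followed by a softmax readout of the last hidden state; the dimensions (3 input features, 4 hidden units, 2 classes) and the random inputs are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d_in, d_h, d_out = 5, 3, 4, 2

    W = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights (shared across time)
    U = rng.normal(size=(d_h, d_in))   # input-to-hidden weights (shared across time)
    b = np.zeros(d_h)
    V = rng.normal(size=(d_out, d_h))  # readout applied to the last hidden state
    c = np.zeros(d_out)

    x = rng.normal(size=(T, d_in))     # e.g., 3 variables observed over 5 time steps
    h = rng.normal(size=d_h)           # h_0, randomly initialized as in the slides
    for t in range(T):
        h = np.tanh(W @ h + U @ x[t] + b)   # the same W, U, b are reused at every step

    logits = V @ h + c
    y = np.exp(logits) / np.exp(logits).sum()  # class probabilities from the last state
    print(y)

In practice the parameters W, U, b, V, c are learned by backpropagation through time rather than drawn at random.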
Recurrent neural network
We can process a sequence of vectors x by applying a recurrence formula h_t = f(h_{t-1}, x_t) at every time step and unrolling the network over time
Notice: the same function and the same set of parameters are used at every time step

Output of RNN
The recurrent layer can return the whole sequence of hidden states as output, or simply return the last output (at t = T)

Different categories of sequence modeling
– Image captioning (image -> word sequence)
– Sentiment analysis (word sequence -> sentiment)
– Machine translation (word sequence -> word sequence): the seq2seq problem
– Video classification on the frame level

RNN is hard to train
An RNN-based network is not always easy to learn

Exploding/vanishing gradient problem
With h_t = tanh(W h_{t-1} + U x_t + b), backpropagation through time multiplies many Jacobian factors:
∂L/∂W = Σ_t ∂L/∂h_T · ∂h_T/∂h_{T-1} · ... · ∂h_{t+1}/∂h_t · ∂h_t/∂W
– tanh' is almost always < 1 -> vanishing gradient
– Because the same term (W) enters every factor, if |W| > 1 -> exploding gradient
– For RNNs, the input sequence length acts like depth!

Practical measures
Exploding gradients
– Clip the gradient at a threshold
– Truncated backpropagation through time
– Adjust the learning rate
Vanishing gradients
– Harder to detect and resolve
– Use gated RNNs (LSTM, GRU) instead of the vanilla RNN

Long short-term memory (LSTM)
Standard RNNs suffer from short-term memory
– Long-term dependencies cannot be modeled properly
LSTMs were introduced to overcome the vanishing gradient problem
The building block of the LSTM is a memory cell, which replaces the hidden state of standard RNNs
A gating mechanism is employed to control the flow of information

Gating mechanism
A gate is a vector with each element between 0 and 1
– If an element is 1, the corresponding information is kept completely
– If it is 0, the information is flushed
– The sigmoid is an intuitive choice for producing gate values
– The gates are learned by the neural network

LSTM: using gates and a cell state
To avoid the vanishing gradient problem, a new set of hidden states called the cell state (c_t), with a "highway" detouring the FC layer, is introduced
– Hidden state in a vanilla RNN: h_t = tanh(W h_{t-1} + U x_t + b)
– The LSTM cell maintains both a cell state and a hidden state, and uses three types of gates: forget, input, output

Gates in LSTM
Forget gate f_t: controls the long-term memory (cell state)
Input gate i_t: controls the flow from the input
Output gate o_t: controls the value passed on to the hidden state
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c h_{t-1} + U_c x_t + b_c)
h_t = o_t ∘ σ_h(c_t)
The weight matrices and biases in the gates and in the cell update are the learnable parameters

Gated Recurrent Units (GRU)
The LSTM works well but seems redundant
– Do we need so many gates?
– Do we need both a hidden state and a cell state to remember?
Gated recurrent units (GRU, 2014)
– Simpler architecture than the LSTM, with fewer parameters, using only two gates: an update gate (which combines the forget and input gates) and a reset gate
– Merges the cell state and the hidden state
– With fewer parameters than the LSTM, it is faster to train

LSTM vs GRU
Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely used
LSTM is a good default choice; switch to GRU for speed and fewer parameters

Common variations
Bi-directional RNN
– Makes two passes (forward and reverse) over each input sequence
– The results are concatenated (or combined with 'sum', 'mul', 'ave')
Deep (multi-layer) RNN
– Stack more than one hidden layer
– Needs residual connections if it is deep

Example: LSTM for sequence classification (a minimal sketch follows below)
https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
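A hedged Keras sketch of LSTM-based sequence classification in the spirit of the tf.keras.layers.LSTM documentation linked above; the shapes (sequences of length 5 with 3 features, 2 classes, 16 hidden units) mirror the earlier toy example and are illustrative assumptions.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(5, 3)),  # (time steps, features per step)
        tf.keras.layers.LSTM(16),             # returns only the last hidden state
        tf.keras.layers.Dense(2, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()

    # A bidirectional or stacked variant (see "Common variations" above) would use, e.g.:
    # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16, return_sequences=True)),
    # tf.keras.layers.LSTM(16),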
Sequence-to-sequence problems
In seq2seq problems such as machine translation
– Both the input and the output are sequences
– The input and output sequence lengths can be different
– There is no one-to-one correspondence between input and output tokens
-> Use an encoder-decoder architecture

Encoder-decoder structure
Encoder: the semantics of the entire input sequence is encoded as an embedding vector
– For example, as the last hidden state (h_T) of the encoder RNN
Decoder: generates the outputs (one by one) starting from this embedding as input
Example: "How are you today?" [input] -> encoder -> input embedding -> decoder -> "I am good!" [output]

Decoder RNN: autoregressive generation
s_t: decoder hidden state, h_t: encoder hidden state
– The decoder's initial state s_0 is initialized as the last hidden state of the encoder, h_T
– A special token indicating 'start of sequence' starts the generation; a softmax activation produces token probabilities at each step, and generation continues until <EOS> is produced (e.g., "I", "am", "good", <EOS>)
– Loss at time t: the generated token probabilities vs. the one-hot encoded ground truth y_t
– At training: feed the ground-truth token as the next input
– At inference: feed the previously generated output, auto-regressively

Information loss in RNN
The entire sequence is encoded into a single embedding (h_T)
– Information from earlier inputs tends to be forgotten more
Main idea: make the decoder take into account the hidden states at all input steps, with closer attention on the more relevant input tokens (compute a context vector c_t for each decoder step instead of using a single constant one)

Attention mechanism
The attention mechanism lets the decoder look at all hidden states from the encoder when making each prediction
– It makes the context vector for the decoder vary across decoding steps
Learn from data which states are more useful (i.e., how to compute the attention scores α)
– Use a neural network to "learn" which hidden encoder states to "attend" to and by how much
– Or use a (weighted) dot product, similarity measures, etc.

Attention heatmap
Example: machine translation
– The x-axis and y-axis correspond to the words in the source sentence (English) and the generated translation (French), respectively
– Each pixel shows the weight of the j-th source word for the t-th target word
Neural machine translation by jointly learning to align and translate (ICLR 2015)

RNN encoder-decoder, without vs. with attention
α_j = score(query, key_j)
c_t = α_1 · val_1 + ... + α_T · val_T
The encoder hidden states act as the keys and values, and the decoder state acts as the query (a numerical sketch follows below)
Image src: https://medium.com/@edloginova/attention-in-nlp-734c6fa9d983
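A minimal NumPy sketch of the dot-product attention step described above: the decoder state is the query, and the encoder hidden states serve as both keys and values; the sizes (T = 4 encoder steps, hidden size 8) and random vectors are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    T, d = 4, 8
    encoder_states = rng.normal(size=(T, d))  # h_1 ... h_T (keys and values)
    decoder_state = rng.normal(size=d)        # s_t (query)

    scores = encoder_states @ decoder_state           # score(query, key_j) for each j
    weights = np.exp(scores) / np.exp(scores).sum()   # attention weights alpha_j (softmax)
    context = weights @ encoder_states                # c_t = sum_j alpha_j * value_j

    print(weights)  # one weight per source position, summing to 1
    print(context)  # context vector fed to the decoder at this step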
Attention function
In a seq2seq RNN model:
– Q (query): the decoder hidden state at time t (s_t)
– K (key): the encoder hidden states at all time steps (h_1, ..., h_T)
– V (value): the encoder hidden states at all time steps (h_1, ..., h_T)
Attention(Q, K, V) = attention value
– Q and K must be comparable, since they are combined through the score α(q, k_j)
– V and the attention value have the same dimensionality
– In many applications, all four have the same dimensionality
Role of K and V:
– K is used to compute the attention scores α
– V is used to compute the final attention value Σ_j α(q, k_j) v_j

Attention methods
Options for scoring similarity:
– Dot-product attention
– Learnable weighted dot-product, with the weight matrix (W) learned from data
– Concatenation followed by an additional FC layer

Attention-based seq2seq model
Allows modeling dependencies without regard to their distance in the input or output sequences
Limitations
– Dealing with long-range dependencies is still challenging
– Hard to parallelize
Example: attention weights aligning "Please come here" with "Komm bitte her"

Real-world success
In 2013-2015, LSTMs started achieving state-of-the-art results
Transformers have since become more dominant. For example, in WMT (a machine translation conference + competition):
– The WMT 2016 summary report contains "RNN" 44 times
– The WMT 2018 report contains "RNN" 9 times and "Transformer" 63 times
– The WMT 2019 report: "RNN" 7 times, "Transformer" 105 times
Now, the Transformer (attention-based) is taking over all the communities

RNN: Summary
RNNs are good for making predictions from sequence data, but suffer from short-term memory
LSTMs and GRUs mitigate the short-term memory problem using a gating mechanism
Clip your gradients; use bidirectionality when possible
Multi-layer RNNs are powerful, but you may need skip/dense connections if the network is deep
Use an encoder-decoder structure for seq2seq problems, together with the attention mechanism
