RNN Lecture 5 - Recurrent Neural Networks

Document Details


Uploaded by VictoriousGlockenspiel

2022

Fei-Fei Li, Ranjay Krishna, Danfei Xu

Tags

recurrent neural networks, RNNs, neural networks, machine learning

Summary

This document provides lecture notes on Recurrent Neural Networks (RNNs). It includes explanations, diagrams, and examples of RNN tasks such as image captioning and translation, and covers training RNNs (backpropagation through time), the vanishing/exploding gradient problem, and LSTMs.

Full Transcript


Neural Networks - Lecture 5: Recurrent Neural Networks
Slides include extracts from CS231n Lecture 10 @ Stanford – Fei-Fei Li et al., 2021

Outline
● RNN tasks
● Elman RNN model and backpropagation through time
● RNN training; vanishing and exploding gradients
● LSTM
● Intuitions + visualization of RNNs
● Tips for training RNNs

RNN Tasks
● Vanilla RNNs [diagram] (Source: CS231n Lecture 10)
● e.g. image captioning: image → sequence of words
● e.g. sentiment classification: sequence of words → sentiment
● e.g. translation: sequence of words → sequence of words
● e.g. video classification at the frame level
(Source: CS231n Lecture 10)

Sequential Processing of Non-Sequence Data
● Classify images by taking a series of “glimpses” (Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015; Gregor et al., “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015).
● Generate images one piece at a time (Gregor et al., “DRAW”, ICML 2015).
(Figures copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015; reproduced with permission.)

Why are existing convnets insufficient?
● Variable-length inputs and outputs. Example task: video captioning – the input video can have a variable number of frames, and the output caption can have variable length. (Krishna, Hata, Ren, Fei-Fei, Niebles, “Dense-Captioning Events in Videos”, ICCV 2019.)

RNN Model
● Key idea: RNNs have an “internal state” that is updated as a sequence is processed. [Diagram: a single RNN block with input x and output y, and its unrolled form with inputs x_1, x_2, x_3, …, x_t and outputs y_1, y_2, y_3, …, y_t.]
● RNN hidden state update: we can process a sequence of vectors x by applying a recurrence formula at every time step:
  h^{(t)} = f_W(h^{(t-1)}, x^{(t)})
  where h^{(t)} is the new state, h^{(t-1)} the old state, x^{(t)} the input vector at time step t, and f_W some function with parameters W.
● RNN output generation: the output at each step is another function, with parameters W_o, of the new state:
  y^{(t)} = f_{W_o}(h^{(t)})
● Notice: the same function and the same set of parameters are used at every time step.
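To make the recurrence concrete, here is a minimal NumPy sketch (an illustration, not code from the slides) of a vanilla RNN unrolled over a short sequence; the weight names W_hh, W_ih, W_ho follow the notation used below, and the sizes and random inputs are arbitrary assumptions.

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_ih, b):
    # One application of the recurrence h_t = tanh(W_hh h_{t-1} + W_ih x_t + b)
    return np.tanh(W_hh @ h_prev + W_ih @ x + b)

rng = np.random.default_rng(0)
H, D, T = 4, 3, 5                        # hidden size, input size, sequence length (arbitrary)
W_hh = rng.normal(scale=0.1, size=(H, H))
W_ih = rng.normal(scale=0.1, size=(H, D))
W_ho = rng.normal(scale=0.1, size=(2, H))  # maps the hidden state to a 2-dim output
b = np.zeros(H)

h = np.zeros(H)                          # initial hidden state h_0
xs = rng.normal(size=(T, D))             # a dummy input sequence x_1 .. x_T
for x in xs:
    h = rnn_step(h, x, W_hh, W_ih, b)    # the same weights are reused at every time step
    y = W_ho @ h                         # per-step output y_t = W_ho h_t
    print(y)
```

The only trainable objects are W_hh, W_ih, W_ho and b; the loop simply reapplies them, which is exactly the parameter sharing noted above.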
Vanilla RNN Model
[Diagram: the input x^{(t)} feeds the hidden state h^{(t)} through weights W_ih, the previous state h^{(t-1)} feeds it through W_hh, and the output y^{(t)} is produced through W_ho.]
● The current state depends on the current input and the previous state
● RNNs can yield outputs at each time step
  h^{(t)} = f_W(h^{(t-1)}, x^{(t)})
  h^{(t)} = \tanh(W_{hh} h^{(t-1)} + W_{ih} x^{(t)})
  y^{(t)} = W_{ho} h^{(t)}, \forall t \in \{1, \dots, \tau\}

RNN: Computational Graph: Many to Many
[Diagram: the unrolled graph h_0 → h_1 → … → h_T, with the same weights W used by f_W at every step, per-step outputs y_1 … y_T, per-step losses L_1 … L_T, and total loss L.]

Unfolding the RNN in time
[Figures; source: NN Lectures, Tudor Berariu, 2016.]

RNN: Computational Graph: Many to One
[Diagram: the input sequence x_1 … x_T is consumed step by step and a single output y is produced from the final hidden state h_T.]

RNN: Computational Graph: One to Many
[Diagrams: a single input x initializes the state and an output is produced at every step; in one variant the later inputs are zero vectors, in the other the previous output y_{t-1} is fed back as the next input.]

Sequence to Sequence: Many-to-one + One-to-many
● Many to one: encode the input sequence in a single vector
● One to many: produce the output sequence from that single vector
(Sutskever et al., “Sequence to Sequence Learning with Neural Networks”, NIPS 2014)

Example: Character-level Language Model
● Vocabulary: [h, e, l, o]
● Example training sequence: “hello”
[Diagram: at each time step the model is trained to predict the next character of the sequence.]

Example: Character-level Language Model – Sampling
● At test time, sample characters one at a time and feed each sample back into the model as the next input.
[Diagram: softmax distributions over {h, e, l, o} at each step; the sampled characters “e”, “l”, “l”, “o” are fed back as the next inputs.]
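Below is a minimal NumPy sketch (an illustration, not the slides' code) of this sampling loop for the toy vocabulary [h, e, l, o]; the one-hot encoding, hidden size, and random weights are assumptions standing in for a trained model.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
V, H = len(vocab), 8                     # vocab size, hidden size (arbitrary)
rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_ho = rng.normal(scale=0.1, size=(V, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(seed_char='h', length=4):
    h = np.zeros(H)
    x = np.eye(V)[vocab.index(seed_char)]    # one-hot input for the seed character
    out = []
    for _ in range(length):
        h = np.tanh(W_hh @ h + W_ih @ x)     # recurrence
        p = softmax(W_ho @ h)                # distribution over the next character
        idx = rng.choice(V, p=p)             # sample one character from that distribution
        out.append(vocab[idx])
        x = np.eye(V)[idx]                   # feed the sample back as the next input
    return ''.join(out)

print(sample('h'))   # with untrained weights this is essentially random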
RNN – the loss function
● The total loss is the sum of the per-timestep losses:
  L = \sum_{t=1}^{\tau} L^{(t)}
● Differentiating with respect to the parameters θ, and accounting for every timestep at which they are used:
  \frac{\partial L}{\partial \theta} = \sum_{t=1}^{\tau} \frac{\partial L^{(t)}}{\partial \theta} = \sum_{t=1}^{\tau} \sum_{s=1}^{t} \frac{\partial L^{(t)}}{\partial \theta^{(s)}}
  = \sum_{t=1}^{\tau} \sum_{s=1}^{t} \frac{\partial L^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(s)}} \frac{\partial h^{(s)}}{\partial \theta^{(s)}}
  = \sum_{t=1}^{\tau} \sum_{s=1}^{t} \frac{\partial L^{(t)}}{\partial h^{(t)}} \left( \prod_{i=s+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}} \right) \frac{\partial h^{(s)}}{\partial \theta^{(s)}}
● Parameters are shared and tied, so θ^{(s_i)} = θ^{(s_j)}, unless they were updated between timesteps s_i and s_j.

Backpropagation through time (BPTT)
● Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

Truncated backpropagation through time
● Run forward and backward through chunks of the sequence instead of the whole sequence.
● Carry hidden states forward in time indefinitely, but only backpropagate through some smaller number of steps.

Truncated BPTT
● Used in practice
● Summary of the algorithm:
  – Present a sequence of k1 timesteps of input and output pairs to the network.
  – Unroll the network, then calculate and accumulate errors across k2 timesteps.
  – Roll up the network and update the weights.
  – Repeat.

Truncated BPTT variants
● TBPTT(n, n): updates are performed at the end of the sequence, across all timesteps in the sequence (i.e. classical BPTT).
● TBPTT(1, n): timesteps are processed one at a time, followed by an update that covers all timesteps seen so far.
● TBPTT(k1, 1): the network likely does not have enough temporal context to learn, relying heavily on the internal state and inputs.
● TBPTT(k1, k2), where k1 < k2 < n: multiple updates are performed per sequence, which can accelerate training.
● TBPTT(k1, k2), where k1 = k2: a common configuration – a fixed number of timesteps is used for both the forward and backward pass (e.g. 10 to 100).

Teacher Forcing and Warm-start
● When training an RNN to generate a sequence, the predictions (outputs y^{(t)}) of an RNN cell are often used as the input of the cell at the next timestep.
● Teacher forcing: at training time, use the targets of the sequence, instead of the RNN's own predictions, as inputs to the next step.
● Warm-start: at evaluation time, give the network an initial portion of the ground-truth sequence before asking it to make future predictions on its own.
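A toy NumPy sketch (an assumption-laden illustration, not the slides' code) contrasting teacher forcing with free-running generation on the character-level example above; the summed per-step cross-entropy also mirrors the loss L = Σ_t L^{(t)} defined earlier, and the weight shapes are arbitrary.

```python
import numpy as np

V, H = 4, 8                               # vocab [h,e,l,o] size, hidden size (arbitrary)
rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_ho = rng.normal(scale=0.1, size=(V, H))

def step(h, x_onehot):
    h = np.tanh(W_hh @ h + W_ih @ x_onehot)
    logits = W_ho @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

def sequence_loss(inputs, targets, teacher_forcing=True):
    """Summed cross-entropy over the sequence, L = sum_t L_t."""
    h, loss = np.zeros(H), 0.0
    x = np.eye(V)[inputs[0]]              # first input character (one-hot)
    for tgt in targets:
        h, p = step(h, x)
        loss += -np.log(p[tgt] + 1e-12)   # per-step cross-entropy against the target
        if teacher_forcing:
            x = np.eye(V)[tgt]            # feed the ground-truth next character
        else:
            x = np.eye(V)[np.argmax(p)]   # feed the model's own prediction
    return loss

# "hello": inputs h,e,l,l predict targets e,l,l,o (indices into [h,e,l,o])
inputs, targets = [0, 1, 2, 2], [1, 2, 2, 3]
print(sequence_loss(inputs, targets, teacher_forcing=True))
print(sequence_loss(inputs, targets, teacher_forcing=False))
```

Teacher forcing keeps early training stable because one bad prediction does not corrupt all later inputs; warm-start plays the analogous role at evaluation time.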
Vanishing and exploding gradients problem

Vanishing and Exploding Gradient
● Remember the loss gradient:
  \frac{\partial L}{\partial \theta} = \sum_{t=1}^{\tau} \sum_{s=1}^{t} \frac{\partial L^{(t)}}{\partial h^{(t)}} \left( \prod_{i=s+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}} \right) \frac{\partial h^{(s)}}{\partial \theta^{(s)}}
● Consider a typical vanilla RNN implementation
  h^{(t)} = f(\theta_h h^{(t-1)} + \theta_x x^{(t)} + b) = f(a^{(t)})
  where f is a sigmoid-type squashing function (e.g. tanh or logistic).
● Consider the Jacobian
  \frac{\partial h^{(t)}}{\partial h^{(t-1)}} = \mathrm{diag}(f'(a^{(t)}))\, \theta_h
● Its norm is bounded:
  \left\| \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \right\|_2 \le \| \mathrm{diag}(f'(a^{(t)})) \|_2 \cdot \| \theta_h \|_2 \le \gamma_f \gamma_\theta
● Therefore, when backpropagating across k steps,
  \left\| \frac{\partial h^{(t)}}{\partial h^{(t-k)}} \right\|_2 \le (\gamma_f \gamma_\theta)^{k}
● For tanh, γ_f = 1; for the logistic sigmoid, γ_f = 0.25.
● If γ_θ γ_f < 1, the factor (γ_θ γ_f)^k shrinks as the sequence gets longer → vanishing gradients; if γ_θ γ_f > 1, it grows → exploding gradients.

Vanishing and Exploding Gradient – Solutions
● For exploding gradients:
  – Gradient clipping :-)
● For vanishing gradients:
  – e.g. gradient regularization (Pascanu et al., 2013, “On the difficulty of training recurrent neural networks”)
  – Truncated BPTT
  – LSTM :-)

Intuitions + Visualization of RNNs
● [Figures: samples from a character-level RNN at successive stages of training (“at first”, then after training more and more) – the generated text gradually becomes more structured.]
● The Stacks Project: an open-source algebraic geometry textbook (LaTeX source), http://stacks.math.columbia.edu/ (licensed under the GNU Free Documentation License). [Figures: RNN-generated samples mimicking the LaTeX source.]
● Generated C code. [Figures: RNN-generated samples mimicking C source code.]

Searching for interpretable cells
(Karpathy, Johnson, and Fei-Fei, “Visualizing and Understanding Recurrent Networks”, ICLR Workshop 2016; figures copyright Karpathy, Johnson, and Fei-Fei, 2015, reproduced with permission.)
● quote detection cell
● line length tracking cell
● if statement cell
● quote/comment cell
● code depth cell
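Looping back to the exploding-gradient remedy listed under the solutions above, here is a minimal sketch (an illustration, not the slides' code) of gradient clipping by global norm; the threshold and example values are arbitrary.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Scale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]  # rescale; directions are preserved
    return grads

# Example: one huge gradient gets scaled down while relative directions stay the same.
g1 = np.array([100.0, -200.0])
g2 = np.array([[0.5, 1.0], [2.0, -1.5]])
for g in clip_gradients([g1, g2], max_norm=5.0):
    print(g)
```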
RNN tradeoffs
RNN advantages:
- Can process input of any length
- Computation for step t can (in theory) use information from many steps back
- Model size doesn’t increase for longer inputs
- The same weights are applied at every timestep, so there is symmetry in how inputs are processed
RNN disadvantages:
- Recurrent computation is slow
- In practice, it is difficult to access information from many steps back

LSTM

Vanilla RNN Gradient Flow
(Bengio et al., “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994; Pascanu et al., “On the difficulty of training recurrent neural networks”, ICML 2013)
● Gradients over multiple time steps flow through repeated multiplication by the recurrent weight matrix. What if we assumed no non-linearity?
  – Largest singular value > 1: exploding gradients → gradient clipping (scale the gradient if its norm is too big)
  – Largest singular value < 1: vanishing gradients → change the RNN architecture

Long Short Term Memory (LSTM)
(Hochreiter and Schmidhuber, “Long Short-Term Memory”, Neural Computation 1997)
[Diagram: vanilla RNN update vs. LSTM update.]

LSTM Cell
(Image source: https://medium.com/@kangeugine/)
● Input gate (i ∈ (0, 1) – sigmoid) – scales the input to the cell (write)
● Output gate (o ∈ (0, 1) – sigmoid) – scales the output from the cell (read)
● Forget gate (f ∈ (0, 1) – sigmoid) – scales the old cell values (reset memory)

LSTM Cell – Equations
  i_t = \sigma(\theta_{xi} x^{(t)} + \theta_{hi} h^{(t-1)} + b_i)
  f_t = \sigma(\theta_{xf} x^{(t)} + \theta_{hf} h^{(t-1)} + b_f)
  o_t = \sigma(\theta_{xo} x^{(t)} + \theta_{ho} h^{(t-1)} + b_o)
  g_t = \tanh(\theta_{xg} x^{(t)} + \theta_{hg} h^{(t-1)} + b_g)
  c_t = f_t \odot c^{(t-1)} + i_t \odot g_t
  h_t = o_t \odot \tanh(c_t)
  where ⊙ is elementwise multiplication.

Long Short Term Memory (LSTM): Gradient Flow
[Diagram: the cell state path c_{t-1} → c_t through the forget gate and the additive update.]
● Backpropagation from c_t to c_{t-1} involves only elementwise multiplication by f, with no matrix multiply by W.
● Uninterrupted gradient flow along the cell state c_0 → c_1 → c_2 → c_3 → …
● Notice that the gradient contains the f gate’s vector of activations – this allows better control of the gradient values through suitable parameter updates of the forget gate.
● Also notice that gradient contributions are added through the f, i, g, and o gates – better balancing of gradient values.
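A minimal NumPy sketch of a single LSTM step implementing the equations above (an illustration, not the slides' code); stacking the four gate blocks into one matrix, and the sizes and random inputs, are assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step. Wx: (4H, D), Wh: (4H, H), b: (4H,) hold the i, f, o, g blocks stacked."""
    H = h_prev.shape[0]
    a = Wx @ x + Wh @ h_prev + b          # pre-activations for all four gates
    i = sigmoid(a[0:H])                   # input gate  (write)
    f = sigmoid(a[H:2*H])                 # forget gate (reset memory)
    o = sigmoid(a[2*H:3*H])               # output gate (read)
    g = np.tanh(a[3*H:4*H])               # candidate cell update
    c = f * c_prev + i * g                # elementwise: keep old memory + write new content
    h = o * np.tanh(c)                    # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 5                               # input and hidden sizes (arbitrary)
Wx = rng.normal(scale=0.1, size=(4*H, D))
Wh = rng.normal(scale=0.1, size=(4*H, H))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(6, D)):         # a dummy input sequence
    h, c = lstm_step(x, h, c, Wx, Wh, b)
print(h, c)
```

Note that c is updated only by elementwise scaling and addition, which is exactly the gradient highway described above.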
Do LSTMs solve the vanishing gradient problem?
● The LSTM architecture makes it easier for the RNN to preserve information over many timesteps – e.g. if f = 1 and i = 0, the information in that cell is preserved indefinitely.
● By contrast, it is harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves information in the hidden state.
● The LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies.

Long Short Term Memory (LSTM): Gradient Flow
● Uninterrupted gradient flow along the cell state – similar to ResNet! [Figure: a ResNet stack (7x7 conv, 3x3 conv blocks, pooling, FC-1000, softmax) shown alongside the LSTM cell-state path c_0 → c_1 → c_2 → c_3.]

Tips for Training RNNs
● Use LSTMs or GRUs (Gated Recurrent Units)
● Use adaptive-learning-rate optimizers (e.g. Adam, RMSProp)
● Apply Layer Normalization (remember the previous lecture) to the linear mappings of the RNN
● Use gradient clipping for the exploding-gradient case
● Use truncated backprop (train on overlapping chunks of the sequence rather than one single large sequence)
● Stacked recurrent nets
  – RNN cells use dense layers in the input-hidden, hidden-hidden and hidden-output updates → a high number of weights
  – In some cases it is more efficient to stack several smaller layers than to use one big one
● Sum the outputs of all layers instead of using just the last one
● Initial state (h^{(0)}) initialization
  – The default of 0 may lead to a large loss initially
  – Train the initial state as a variable (if there are many training sequences), or use a noisy initial state (see the discussion linked in the original slides)
● Forget gate bias
  – RNNs may be slow to “remember” on some tasks :-)
  – Initialize the biases of the LSTM’s forget gate to 1 → remember by default (see the short sketch after the Summary below)
● Perform regularization if the model overfits, e.g. dropout

Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don’t work very well
- It is common to use LSTM or GRU: their additive interactions improve gradient flow
- The backward flow of gradients in an RNN can explode or vanish. Exploding gradients are controlled with gradient clipping; vanishing gradients are mitigated with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research, as are new paradigms for reasoning over sequences
- Better understanding (both theoretical and empirical) is needed
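As a small illustration of the forget-gate bias tip in the training-tips list above (assuming the stacked i, f, o, g bias layout used in the LSTM sketch earlier), only the forget-gate block of the bias vector is set to 1:

```python
import numpy as np

def init_lstm_bias(hidden_size, forget_bias=1.0):
    """Bias vector for stacked gates in (i, f, o, g) order; only the forget block is non-zero."""
    b = np.zeros(4 * hidden_size)
    b[hidden_size:2 * hidden_size] = forget_bias   # sigmoid(1) ≈ 0.73, so the cell
                                                   # "remembers by default" at the start of training
    return b

print(init_lstm_bias(hidden_size=5))
```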
