Deep Learning Lecture 6: LSTM Architectures

Summary

These are lecture notes for the AIE332 Deep Learning course by Prof. Mohamed Abdel Rahman, covering special RNN architectures with a focus on LSTM (Long Short-Term Memory) networks, encoder-decoder models, and attention mechanisms. The notes discuss the limitations of traditional RNNs and how LSTMs address them by managing context and handling long-range dependencies within sequences. Applications such as image captioning and visual question answering are also discussed.

Full Transcript

AIE332 Deep Learning, Lecture 6: Special RNN Architectures (LSTM – Encoder/Decoder – Attention Mechanism), Prof. Mohamed Abdel Rahman.

Summary for Common RNN – NLP Architectures
- In sequence labeling (for example, part-of-speech tagging), we train a model to produce a label for each input word or token.
- In sequence classification (for example, sentiment analysis), we ignore the output at each token and take only one value from the end of the sequence.
- In language modeling (sentence completion), we train the model to predict the next word at each token step.
- In encoder-decoder models, we train the model for tasks such as machine translation.
- The main problem with the RNN is that it cannot support long dependencies / long sequences.

RNN Limitations
It is hard to reach high accuracy on long sentences using RNNs. It is also difficult to train RNNs to make use of distant information, i.e. the relation between two words should be kept across the intermediate words.
Example: "The flights the airline was canceling were full." An RNN could predict "was" after "airline", but it is difficult for it to predict "were" as the verb for "flights", for two reasons:
- there are four intermediate words, and
- the singular noun "airline" is closer in the context.
Ideally, we need a network able to retain distant information while still processing the intermediate parts of the sequence correctly.

RNN Limitations (continued)
Why can a conventional RNN not satisfy these needs? The hidden layer, and the weights that determine its values, are asked to perform two tasks simultaneously:
- provide information useful for the current decision, and
- update and carry forward information for near-future decisions (not too far ahead).
In addition, the backpropagation algorithm leads to the vanishing gradient problem with longer RNNs. A special RNN extension is the long short-term memory (LSTM) network, first introduced by Hochreiter and Schmidhuber.

The Long Short-Term Memory (LSTM)
LSTMs divide the context-management problem into two sub-problems:
- removing information that is no longer needed from the context (negative or small values in the h(t) vector), and
- adding information likely to be needed for later decision making (long dependencies).
LSTMs accomplish this by adding a context layer in addition to the RNN hidden layer. To do that, LSTMs compute three filters plus one conventional hidden-layer output. All four are trained with different (U, W) matrix pairs, because each has a different task. The three filters share a common design pattern: a feed-forward RNN layer with a sigmoid activation function, followed by a pointwise multiplication. The sigmoid activation pushes its outputs toward either 0 or 1, so the pointwise multiplication has an effect similar to that of a binary mask.

LSTM Gates
The LSTM has two gates: forget and add. Both gates use a weighted sum of h(t-1) and the current input. Each of them is trained with a different pair of weights, (Uf, Wf) and (Ui, Wi), because each performs a different function and therefore should have its own weights. This is exactly like an RNN, except that the activation function is a sigmoid (outputs near 0 or 1). Here, f(t) and i(t) represent filter masks.
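To make this shared gate pattern concrete, here is a minimal NumPy sketch (not from the lecture) of a sigmoid gate computed from a weighted sum of h(t-1) and x(t) and applied by pointwise multiplication; the dimensions, random weights, and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions (assumptions, not values from the slides)
d_h, d_x = 4, 3
rng = np.random.default_rng(0)

# Each gate has its own (U, W) pair because it performs a different task
U_f, W_f = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))  # forget gate
U_i, W_i = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))  # add (input) gate

h_prev = rng.normal(size=d_h)   # previous hidden state h(t-1)
x_t = rng.normal(size=d_x)      # current input x(t)

# Gate = sigmoid of a weighted sum of h(t-1) and x(t); outputs are pushed toward 0 or 1
f_t = sigmoid(U_f @ h_prev + W_f @ x_t)   # forget-gate mask
i_t = sigmoid(U_i @ h_prev + W_i @ x_t)   # add-gate mask

# Pointwise multiplication acts like a soft binary mask on a context vector
c_prev = rng.normal(size=d_h)
kept_context = f_t * c_prev    # entries where f_t is near 0 are (almost) deleted
print(f_t, kept_context)
```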
Forget & Add Gates
The first gate is the forget gate (short term). Its purpose is to delete information from the context that is no longer needed. The forget-gate mask is multiplied element-wise by the previous context vector, keeping only the fittest values (as fractions) and deleting what is no longer needed (zeros).
The second gate is the add gate (long term). Its purpose is to retain long-dependency information in the context. The add-gate mask operates similarly to the forget-gate mask, except that it is multiplied element-wise by the conventional RNN hidden-layer output.
Both resulting vectors are added to obtain the new context vector.

A Few Tips Comparing the RNN and LSTM Algorithms
1. The LSTM uses two versions of h(t): one with a sigmoid activation, used as a filter, and one with tanh, the conventional RNN h(t).
2. There are two similar filters, trained with different weights: the forget gate (f) and the add gate (i).
3. The context vector at any instant comes from both gates: multiply the conventional-RNN h(t) with the filter, then sum both gates' contributions.
4. The resulting new LSTM h(t) combines long-term and short-term information (the context), masked with a filter.

LSTM – Computational Graph
A single LSTM unit displayed as a computation graph. The inputs are the current input x(t), the previous hidden state h(t-1), and the previous context c(t-1). The outputs are a new hidden state h(t) and an updated context c(t).
Comments on the final h(t):
- Taking in the effect of the context vector means that we have both the current fittest information and the long dependencies.
- Using tanh adds more non-linearity to the context vector.
- Multiplication with another filter gives a stronger h(t).

The Forget Gate vs. the Add Gate
The forget gate uses the previous context:
- Using the previous context makes the forget gate (short-term memory) similar to the conventional RNN.
- It retains the effect of the previous words, highlighting this effect through the filtration process (deleting the non-fittest values).
- The model will forget irrelevant past information.
- Another advantage: keeping the h(t) values growing in the positive direction means that we prevent the vanishing gradient.
The add gate uses the current context:
- The focus on the current context allows the add gate (long-term memory) to incorporate new, relevant information into the cell state.
- Moreover, the model adds relevant new information about the current words.
- The context vectors from the add and forget gates are summed.
- This model can be repeated several times to learn long dependencies without falling into the vanishing gradient.

ANN – RNN – LSTM in the Simplest Diagram
(Figure: (a) a basic ANN, (b) a conventional RNN, and (c) an LSTM.)

Could RNNs Be Used for Image Applications?
(Figure: a Recurrent Neural Network combined with a Convolutional Neural Network.)

Stories: Examples (image captioning)
(Figure: example images with generated captions such as "A cat sitting on a suitcase on the floor", "A cat is sitting on a tree branch", "A dog is running in the grass with a frisbee", "A white teddy bear sitting in the grass", "Two people walking on the beach with surfboards", "A tennis player in action on the court", "Two giraffes standing in a grassy field", and "A man riding a dirt bike on a dirt track".)

Visual Question Answering
Agrawal et al., "VQA: Visual Question Answering", ICCV 2015. Zhu et al., "Visual 7W: Grounded Question Answering in Images", CVPR 2016. Figure from Zhu et al., copyright IEEE 2016; reproduced for educational purposes.
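Putting the pieces of the computational graph together, the sketch below performs one LSTM step under assumed shapes and naming: the forget gate masks the previous context, the add gate masks the tanh candidate, their sum gives the new context c(t), and h(t) is obtained by filtering tanh(c(t)) with one more sigmoid gate (the extra filter mentioned above). Biases and training are omitted; this is a sketch, not the lecture's reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM unit step: inputs x(t), h(t-1), c(t-1); outputs h(t), c(t)."""
    U_f, W_f, U_i, W_i, U_g, W_g, U_o, W_o = params
    f_t = sigmoid(U_f @ h_prev + W_f @ x_t)   # forget-gate mask
    i_t = sigmoid(U_i @ h_prev + W_i @ x_t)   # add (input) gate mask
    g_t = np.tanh(U_g @ h_prev + W_g @ x_t)   # conventional RNN-style candidate h(t)
    o_t = sigmoid(U_o @ h_prev + W_o @ x_t)   # extra filter applied to the final h(t)
    c_t = f_t * c_prev + i_t * g_t            # new context: masked old context + masked candidate
    h_t = o_t * np.tanh(c_t)                  # tanh adds non-linearity; o_t filters it
    return h_t, c_t

# Tiny example with assumed sizes
d_h, d_x = 4, 3
rng = np.random.default_rng(1)
params = tuple(rng.normal(size=(d_h, d_h if k % 2 == 0 else d_x)) for k in range(8))
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):   # run the same unit over a 5-step sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```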
Visual Question Answering (continued)
(Figures from Agrawal et al., "VQA: Visual Question Answering", ICCV 2015, copyright IEEE 2015; reproduced for educational purposes.)

The Encoder-Decoder Model
Encoder-decoder networks, sometimes called sequence-to-sequence networks, are models capable of generating contextually appropriate, arbitrary-length output sequences given an input sequence. The encoder-decoder model is used when we are taking an input sequence and transforming (translating) it into an output sequence of a different length (the input does not align with the output in a word-to-word manner). Recall that in the sequence labeling task we also have two sequences, but they are the same length. By contrast, encoder-decoder models are used especially for tasks like machine translation. Encoder-decoder networks have been applied to a very wide range of applications, including summarization, question answering, and dialogue, but they are particularly popular for machine translation.

The key idea underlying these networks is the use of an encoder network that takes an input sequence and creates a contextualized representation of it, often called the context. This representation is then passed to a decoder, which generates a task-specific output sequence.

Encoder-decoder networks consist of three conceptual components:
- An encoder that accepts an input sequence, x1:n, and generates a corresponding sequence of contextualized representations, h1:n. LSTMs, RNNs, and transformers can all be employed as encoders.
- A context vector, c, which is a function of h1:n and conveys the essence of the input to the decoder.
- A decoder, which accepts c as input and generates an arbitrary-length sequence of hidden states h1:m, from which a corresponding sequence of output states y1:m can be obtained. Just as with encoders, decoders can be realized using any kind of sequence architecture.

RNN-Encoder / RNN-Decoder Training
Each training example is a tuple of paired strings: a source input x1:n and a target output y1:m, separated by a separator token. The figure shows an encoder-decoder model built from two RNNs with the context concept described for the LSTM. The input words x1:n are processed by the encoder (an LSTM), resulting in a contextual output c. This contextual output represents the effect of the last word token. The input to the decoder is both the target output sequence and the contextual output of the encoder stage. The whole network is trained in the same manner until reaching convergence.

RNN-Encoder / RNN-Decoder Inference
During inference, the decoder receives only the contextual information from the encoder stage as input. It therefore uses its own estimated output ŷ(t) as an additional input for the next time step, x(t+1). Thus the decoder tends to deviate more and more from the gold target sentence as it keeps generating more tokens. In training, it is more common to use teacher forcing in the decoder. Teacher forcing means that we force the system to use the gold target token from the training data as the next input x(t+1), rather than allowing it to rely on the (possibly erroneous) decoder output ŷ(t). This speeds up training.
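To illustrate the data-flow difference between teacher forcing and free-running inference, here is a minimal toy sketch; the decoder_step function, embedding table, and token ids are assumptions for illustration only and stand in for a real RNN/LSTM decoder.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 10, 8                           # assumed vocabulary size and state size
E = rng.normal(size=(V, d))            # toy embedding table
W_out = rng.normal(size=(d, V))        # toy output projection
BOS = 0                                # assumed start-of-sequence token id

def decoder_step(y_prev_id, h_prev, c):
    """One toy decoder step: previous token + previous state + encoder context -> state, logits."""
    h_t = np.tanh(E[y_prev_id] + h_prev + c)   # stand-in for an RNN/LSTM decoder cell
    return h_t, h_t @ W_out

def decode_teacher_forcing(c, gold_ids):
    """Training-style decoding: feed the gold target token as the next input."""
    h, logits_seq, prev = np.zeros(d), [], BOS
    for gold in gold_ids:
        h, logits = decoder_step(prev, h, c)
        logits_seq.append(logits)      # a loss would compare these logits against `gold`
        prev = gold                    # teacher forcing: ignore the model's own prediction
    return logits_seq

def decode_free_running(c, max_len=5):
    """Inference-style decoding: feed the model's own previous prediction."""
    h, out, prev = np.zeros(d), [], BOS
    for _ in range(max_len):
        h, logits = decoder_step(prev, h, c)
        prev = int(np.argmax(logits))  # estimated token ŷ(t) becomes the next input
        out.append(prev)
    return out

context = rng.normal(size=d)           # stands in for the encoder's context vector c
print(decode_free_running(context))
print(len(decode_teacher_forcing(context, gold_ids=[3, 1, 4])))
```

The only difference between the two loops is which token is fed back at each step; the loss computation and any real model details are intentionally omitted.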
Drawback of the Encoder-Decoder
The context vector h(n) is the hidden state of the last (nth) time step of the source text. It does not represent everything about the meaning of the whole source text, and thus acts as a bottleneck: the only thing the decoder knows about the source text is the context vector (which mainly reflects the last word). This can be suitable for short sentences, but for long sentences the information at the beginning of the sentence will be poorly represented in the context vector.

Attention Mechanism
The attention mechanism is a solution to the bottleneck problem: a way of allowing the decoder to get information from all the hidden states of the encoder, not just the last hidden state. In the attention mechanism, as in the vanilla encoder-decoder model, the context vector c is a single vector that is a function of the hidden states of the encoder, that is, c = f(h1, ..., hn). The attention layer dynamically assigns importance to a few key tokens in the input sequence. (A small numerical sketch of such a weighted-sum context appears after the closing slide.)

More about the attention mechanism: next lecture.

RNN / LSTM / Encoder-Decoder
In the ANN, the input is fed to the hidden layer, which in turn fires the output. Using an ANN for language models is not recommended, since the relations between words of the same sentence are ignored.
In the RNN, the input is fed to the hidden layer, which in turn fires the output; here, a feedback from the previous state is also fed to the hidden layer (memory). We can understand that relations between words of the same sentence are considered. RNNs can be trained with a straightforward extension of the backpropagation algorithm, known as backpropagation through time (BPTT). Simple recurrent networks fail on long sentences because of problems like vanishing gradients.
In the LSTM, which is a modified RNN, the input is fed to the hidden layer together with the feedback from the previous state. In addition, the add/forget gates retain important relations and forget redundant ones.
In the encoder-decoder, the input is buffered (decoupled) from the output. The context vector is used as the output of the encoder (representing only the last word token). It can be seen as bias information that directs the decoder towards the desired output. Buffering the input from the output across two networks is exactly what a machine translation system needs, since the input and output sentences can have different numbers of words.
The attention mechanism, a modified version of the encoder-decoder, has an essentially similar design. In the calculations, we use a context vector computed from all word tokens as the output of the encoder and the input to the decoder.

Summary

Thank You
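As referenced in the attention slide above, the following is a minimal NumPy sketch of a weighted-sum context: a softmax over dot-product relevance scores between the current decoder state and every encoder hidden state, followed by a weighted sum of those states. The dot-product scoring function, shapes, and names are illustrative assumptions; the full mechanism is left to the next lecture.

```python
import numpy as np

def attention_context(dec_state, enc_states):
    """Context vector as a softmax-weighted sum of all encoder hidden states.

    Dot-product scoring is an assumption for illustration; other score
    functions are possible and are deferred to the next lecture.
    """
    scores = enc_states @ dec_state        # one relevance score per source token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: importance of each token
    return weights @ enc_states, weights   # weighted sum over h(1)..h(n)

rng = np.random.default_rng(3)
enc_states = rng.normal(size=(6, 8))       # h(1:6) for a 6-token source sentence
dec_state = rng.normal(size=8)             # current decoder hidden state
c, w = attention_context(dec_state, enc_states)
print(w.round(3))                          # dynamic importance over the input tokens
print(c.shape)                             # (8,): the context passed to the decoder
```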