Summary

This document provides a detailed explanation of Transformers, a type of neural network architecture. It covers the overall architecture, the attention mechanism, and training and inference methods, highlighting their use in natural language processing (NLP) tasks. The document is suitable for an educational context focusing on advanced AI topics.

Full Transcript

Transformers

What is a Transformer?
The Transformer architecture excels at handling text data, which is inherently sequential. Transformers take a text sequence as input and produce another text sequence as output, e.g. translating an input English sentence to Spanish.

At its core, a Transformer contains a stack of Encoder layers and Decoder layers. To avoid confusion, we will refer to an individual layer as an Encoder or a Decoder, and will use Encoder stack or Decoder stack for a group of Encoder or Decoder layers. The Encoder stack and the Decoder stack each have their corresponding Embedding layers for their respective inputs. Finally, there is an Output layer to generate the final output.

All the Encoders are identical to one another. Similarly, all the Decoders are identical. The Encoder contains the all-important Self-attention layer, which computes the relationship between different words in the sequence, as well as a Feed-forward layer. The Decoder contains the Self-attention layer and the Feed-forward layer, as well as a second Encoder-Decoder attention layer. Each Encoder and Decoder has its own set of weights.

The Encoder is a reusable module that is the defining component of all Transformer architectures. In addition to the above two layers, it also has Residual skip connections around both layers along with two LayerNorm layers. There are many variations of the Transformer architecture; some have no Decoder at all and rely only on the Encoder.

What does Attention Do?
The key to the Transformer's ground-breaking performance is its use of Attention. While processing a word, Attention enables the model to focus on other words in the input that are closely related to that word. For example, 'ball' is closely related to 'blue' and 'holding', while 'blue' is not related to 'boy'. The Transformer architecture uses self-attention, relating every word in the input sequence to every other word.

Example
Consider the two sentences "The cat drank the milk because it was hungry." and "The cat drank the milk because it was sweet." In the first sentence, the word 'it' refers to 'cat'; in the second, it refers to 'milk'. When the model processes the word 'it', self-attention gives the model more information about its meaning so that it can associate 'it' with the correct word.

To enable it to handle more nuances about the intent and semantics of the sentence, Transformers include multiple attention scores for each word. For example, while processing the word 'it', the first score highlights 'cat', while the second score highlights 'hungry'. So when it decodes the word 'it', by translating it into a different language, for instance, it will incorporate some aspect of both 'cat' and 'hungry' into the translated word.
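To make this example concrete, here is a small sketch that is not part of the original material: it inspects self-attention weights with the Hugging Face transformers library, using the Encoder-only BERT model as a stand-in. The model name and the choice to average the heads of the last layer are illustrative assumptions.

```python
# Illustrative sketch: inspect which words 'it' attends to in the example
# sentence, using an off-the-shelf Encoder-only Transformer (BERT).
# Assumes the "transformers" and "torch" packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat drank the milk because it was hungry.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
it_index = tokens.index("it")
last_layer = outputs.attentions[-1][0]             # (heads, seq_len, seq_len)
weights_for_it = last_layer.mean(dim=0)[it_index]  # average the heads, row for 'it'

for token, weight in zip(tokens, weights_for_it.tolist()):
    print(f"{token:>10s}  {weight:.3f}")
```

Each row of attention weights sums to 1, so the printout shows how strongly 'it' attends to every other token; different heads and layers highlight different relationships, as the multiple-attention-scores example above describes.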
Training the Transformer
The Transformer works slightly differently during Training and while doing Inference. Training data consists of two parts: the source or input sequence (e.g. "You are welcome" in English, for a translation problem) and the destination or target sequence (e.g. "De nada" in Spanish). The Transformer's goal is to learn how to output the target sequence, by using both the input and target sequence.

The Transformer processes the data like this:
1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.
2. The stack of Encoders processes this and produces an encoded representation of the input sequence.
3. The target sequence is prepended with a start-of-sentence token, converted into Embeddings (with Position Encoding), and fed to the Decoder.
4. The stack of Decoders processes this along with the Encoder stack's encoded representation to produce an encoded representation of the target sequence.
5. The Output layer converts it into word probabilities and the final output sequence.
6. The Transformer's Loss function compares this output sequence with the target sequence from the training data. This loss is used to generate gradients to train the Transformer during back-propagation.

Inference
During Inference, we have only the input sequence and don't have the target sequence to pass as input to the Decoder. The goal of the Transformer is to produce the target sequence from the input sequence alone. Like in a Seq2Seq model, we generate the output in a loop and feed the output sequence from the previous timestep to the Decoder in the next timestep, until we come across an end-of-sentence token. The difference from the Seq2Seq model is that, at each timestep, we re-feed the entire output sequence generated thus far, rather than just the last word.

The flow of data during Inference is:
1. The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder.
2. The stack of Encoders processes this and produces an encoded representation of the input sequence.
3. Instead of the target sequence, we use an empty sequence with only a start-of-sentence token. This is converted into Embeddings (with Position Encoding) and fed to the Decoder.
4. The stack of Decoders processes this along with the Encoder stack's encoded representation to produce an encoded representation of the target sequence.
5. The Output layer converts it into word probabilities and produces an output sequence.
6. We take the last word of the output sequence as the predicted word. That word is now filled into the second position of our Decoder input sequence, which now contains a start-of-sentence token and the first word.
7. Go back to step #1. As before, feed the Encoder input sequence and the new Decoder sequence into the model. Then take the second word of the output and append it to the Decoder sequence. Repeat this until it predicts an end-of-sentence token.

Teacher Forcing
The approach of feeding the target sequence to the Decoder during training is known as Teacher Forcing. Why do we do this, and what does that term mean? During training, we could have used the same approach that is used during inference: run the Transformer in a loop, take the last word from the output sequence, append it to the Decoder input, and feed it to the Decoder for the next iteration. Finally, when the end-of-sentence token is predicted, the Loss function would compare the generated output sequence to the target sequence in order to train the network. Not only would this looping cause training to take much longer, it would also make the model harder to train: the model would have to predict the second word based on a potentially erroneous first predicted word, and so on. Instead, by feeding the target sequence to the Decoder, we are giving it a hint, so to speak, just like a teacher would. Even if it predicted an erroneous first word, it can instead use the correct first word to predict the second word, so that those errors don't keep compounding. In addition, the Transformer is able to output all the words in parallel without looping, which greatly speeds up training.
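The inference loop described above can be written down compactly. The sketch below is not the document's code: it uses PyTorch's built-in nn.Transformer as an untrained stand-in model, a toy vocabulary size, and hypothetical token IDs 1 and 2 for the start-of-sentence and end-of-sentence markers, just to show the shape of the greedy decoding loop.

```python
# Minimal sketch of the inference loop: re-feed the entire generated sequence
# to the Decoder at every timestep until an end-of-sentence token appears.
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, SOS, EOS = 1000, 64, 1, 2     # hypothetical sizes and markers

embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
model = nn.Transformer(d_model=D_MODEL, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
output_layer = nn.Linear(D_MODEL, VOCAB_SIZE)      # the final Output (Linear) layer

def greedy_decode(src_ids, max_len=20):
    src = embed(src_ids)                           # (1, src_len, d_model)
    generated = torch.tensor([[SOS]])              # start with only a start token
    for _ in range(max_len):
        tgt = embed(generated)                     # (1, tgt_len, d_model)
        decoder_out = model(src, tgt)              # (1, tgt_len, d_model)
        logits = output_layer(decoder_out[:, -1])  # scores for the last position only
        next_id = logits.argmax(dim=-1)            # highest-probability word
        generated = torch.cat([generated, next_id.unsqueeze(0)], dim=1)
        if next_id.item() == EOS:                  # stop at end-of-sentence
            break
    return generated

print(greedy_decode(torch.tensor([[5, 6, 7]])))    # untrained, so output is arbitrary
```

During training, by contrast, Teacher Forcing feeds the whole (shifted) target sequence to the Decoder in a single pass, so no such loop is needed.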
What are Transformers used for?
Transformers are very versatile and are used for most NLP tasks, such as language modeling and text classification. They are frequently used in sequence-to-sequence models for applications such as Machine Translation, Text Summarization, Question-Answering, Named Entity Recognition, and Speech Recognition. There are different flavors of the Transformer architecture for different problems. The basic Encoder layer is used as a common building block for these architectures, with different application-specific 'heads' depending on the problem being solved.

Transformer Classification architecture
A Sentiment Analysis application, for instance, would take a text document as input. A Classification head takes the Transformer's output and generates predictions of the class labels, such as a positive or negative sentiment.

Transformer Language Model architecture
A Language Model architecture would take the initial part of an input sequence, such as a text sentence, as input, and generate new text by predicting sentences that would follow. A Language Model head takes the Transformer's output and generates a probability for every word in the vocabulary. The highest-probability word becomes the predicted output for the next word in the sentence.

How are they better than RNNs?
RNNs and their cousins, LSTMs and GRUs, were the de facto architecture for all NLP applications until Transformers came along and dethroned them. RNN-based sequence-to-sequence models performed well, and when the Attention mechanism was first introduced, it was used to enhance their performance. However, they had two limitations: it was challenging to deal with long-range dependencies between words that were spread far apart in a long sentence, and they process the input sequence sequentially one word at a time, which means they cannot do the computation for time-step t until they have completed the computation for time-step t - 1. This slows down training and inference.

The Transformer architecture addresses both of these limitations. It got rid of RNNs altogether and relies exclusively on the benefits of Attention. Transformers process all the words in the sequence in parallel, thus greatly speeding up computation. And the distance between words in the input sequence does not matter: the Transformer is equally good at computing dependencies between adjacent words and words that are far apart.

Returning to the details of the architecture:
- Data inputs for both the Encoder and Decoder, which contain an Embedding layer and a Position Encoding layer.
- The Encoder stack contains a number of Encoders. Each Encoder contains a Multi-Head Attention layer and a Feed-forward layer.
- The Decoder stack contains a number of Decoders. Each Decoder contains two Multi-Head Attention layers and a Feed-forward layer.
- The Output, which generates the final output and contains a Linear layer and a Softmax layer.

Detailed example
To understand what each component does, let's walk through the working of the Transformer while we are training it to solve a translation problem. We'll use one sample of our training data, which consists of an input sequence ('You are welcome' in English) and a target sequence ('De nada' in Spanish).

Embedding and Position Encoding
Like any NLP model, the Transformer needs two things about each word: the meaning of the word and its position in the sequence. The Embedding layer encodes the meaning of the word, and the Position Encoding layer represents the position of the word. The Transformer combines these two encodings by adding them.
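As a minimal sketch (PyTorch, with toy sizes chosen only for illustration, not code from the original), combining the two encodings by addition could look like this; the sinusoidal table follows the formula given in the Position Encoding section below.

```python
# Illustrative sketch: word embeddings plus a fixed position-encoding table.
import torch
import torch.nn as nn

vocab_size, seq_len, d_model = 100, 5, 8          # hypothetical sizes
token_ids = torch.tensor([[4, 9, 27, 3, 11]])     # (samples, sequence length)

embedding = nn.Embedding(vocab_size, d_model)
word_vectors = embedding(token_ids)               # (samples, seq_len, d_model)

# Fixed sinusoidal position encoding: sine on even indexes, cosine on odd ones.
position = torch.arange(seq_len).unsqueeze(1)                  # (seq_len, 1)
div_term = 10000 ** (torch.arange(0, d_model, 2) / d_model)    # 10000^(2i/d_model)
pos_encoding = torch.zeros(seq_len, d_model)
pos_encoding[:, 0::2] = torch.sin(position / div_term)
pos_encoding[:, 1::2] = torch.cos(position / div_term)

encoder_input = word_vectors + pos_encoding       # same shape: (samples, seq_len, d_model)
print(encoder_input.shape)                        # torch.Size([1, 5, 8])
```

Because the position-encoding table depends only on the position and the encoding size, it can be computed once and added to every sample.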
Embedding
The Transformer has two Embedding layers. The input sequence is fed to the first Embedding layer, known as the Input Embedding. The target sequence is fed to the second Embedding layer after shifting the targets right by one position and inserting a start token in the first position. Note that, during Inference, we have no target sequence and we feed the output sequence to this second layer in a loop; that is why it is called the Output Embedding. The text sequence is mapped to numeric word IDs using our vocabulary. The Embedding layer then maps each word ID to an embedding vector, which is a richer representation of the meaning of that word.

Position Encoding
Since an RNN implements a loop where each word is input sequentially, it implicitly knows the position of each word. Transformers, however, don't use RNNs, and all words in a sequence are input in parallel. This is their major advantage over the RNN architecture, but it means that the position information is lost and has to be added back in separately.

Just like the two Embedding layers, there are two Position Encoding layers. The Position Encoding is computed independently of the input sequence: these are fixed values that depend only on the max length of the sequence. For instance, the first item is a constant code that indicates the first position, the second item is a constant code that indicates the second position, and so on. These constants are computed using the formula below, where pos is the position of the word in the sequence, d_model is the length of the encoding vector (same as the embedding vector), and i is the index value into this vector:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

In other words, it interleaves a sine curve and a cosine curve, with sine values for all even indexes and cosine values for all odd indexes. As an example, if we encode a sequence of 40 words, the encoding of the 0th index traces one curve across all 40 word positions, the encoding of the 1st index traces another, and there are similar curves for the remaining index values.

Matrix Dimensions
Deep learning models process a batch of training samples at a time. The Embedding and Position Encoding layers operate on matrices representing a batch of sequence samples. The Embedding takes a (samples, sequence length) shaped matrix of word IDs. It encodes each word ID into a word vector whose length is the embedding size, resulting in a (samples, sequence length, embedding size) shaped output matrix. The Position Encoding uses an encoding size that is equal to the embedding size, so it produces a similarly shaped matrix that can be added to the embedding matrix.

The (samples, sequence length, embedding size) shape produced by the Embedding and Position Encoding layers is preserved all through the Transformer, as the data flows through the Encoder and Decoder stacks, until it is reshaped by the final Output layers. This gives a sense of the 3D matrix dimensions in the Transformer. However, to simplify the visualization, from here on we will drop the first dimension (for the samples) and use the 2D representation for a single sample.

Encoder
The Input Embedding sends its outputs into the Encoder. Similarly, the Output Embedding feeds into the Decoder. The Encoder and Decoder stacks each consist of several (usually six) Encoders and Decoders respectively, connected sequentially.
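Before looking inside a single Encoder, here is a minimal sketch of how such stacks could be wired together. It is not the document's own code: it uses PyTorch's built-in encoder and decoder layer modules as stand-ins, with the usual six layers.

```python
# Illustrative sketch: an Encoder stack, a Decoder stack, and the last
# Encoder's output ("memory") being fed to the Decoder stack.
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6

encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
encoder_stack = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
decoder_stack = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

src = torch.rand(1, 7, d_model)     # (samples, source sequence length, embedding size)
tgt = torch.rand(1, 3, d_model)     # (samples, target sequence length, embedding size)

memory = encoder_stack(src)         # encoded representation of the input sequence
out = decoder_stack(tgt, memory)    # Self-attention plus Encoder-Decoder attention
print(memory.shape, out.shape)      # the (sequence length, embedding size) shape is preserved
```

Each layer in these stacks is structurally identical to the others but carries its own weights, matching the description above.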
The first Encoder in the stack receives its input from the Embedding and Position Encoding. The other Encoders in the stack receive their input from the previous Encoder. The Encoder passes its input into a Multi-head Self-attention layer. The Self-attention output is passed into a Feed-forward layer, which then sends its output upwards to the next Encoder. Both the Self-attention and Feed-forward sub-layers have a residual skip-connection around them, followed by a Layer-Normalization. The output of the last Encoder is fed into each Decoder in the Decoder stack.

Decoder
The Decoder's structure is very similar to the Encoder's, but with a couple of differences. Like the Encoder, the first Decoder in the stack receives its input from the Output Embedding and Position Encoding. The other Decoders in the stack receive their input from the previous Decoder. The Decoder passes its input into a Multi-head Self-attention layer. This operates in a slightly different way than the one in the Encoder: it is only allowed to attend to earlier positions in the sequence. This is done by masking future positions, which we'll talk about shortly.

Unlike the Encoder, the Decoder has a second Multi-head attention layer, known as the Encoder-Decoder attention layer. The Encoder-Decoder attention layer works like Self-attention, except that it combines two sources of inputs: the Self-attention layer below it and the output of the Encoder stack. The Encoder-Decoder attention output is passed into a Feed-forward layer, which then sends its output upwards to the next Decoder. Each of these sub-layers, Self-attention, Encoder-Decoder attention, and Feed-forward, has a residual skip-connection around it, followed by a Layer-Normalization.

Attention
In the Transformer, Attention is used in three places:
- Self-attention in the Encoder: the input sequence pays attention to itself.
- Self-attention in the Decoder: the target sequence pays attention to itself.
- Encoder-Decoder attention in the Decoder: the target sequence pays attention to the input sequence.

The Attention layer takes its input in the form of three parameters, known as the Query, Key, and Value. In the Encoder's Self-attention, the Encoder's input is passed to all three parameters: Query, Key, and Value. In the Decoder's Self-attention, the Decoder's input is likewise passed to all three parameters. In the Decoder's Encoder-Decoder attention, however, the output of the final Encoder in the stack is passed to the Value and Key parameters, while the output of the Self-attention (and Layer Norm) module below it is passed to the Query parameter.

Multi-head Attention
The Transformer calls each Attention processor an Attention Head and repeats it several times in parallel. This is known as Multi-head attention. It gives the Attention mechanism greater power of discrimination by combining several similar Attention calculations. The Query, Key, and Value are each passed through separate Linear layers, each with their own weights, producing three results called Q, K, and V respectively. These are then combined using the Attention formula to produce the Attention Score, where d_k is the size of the Query and Key vectors:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

The important thing to realize here is that the Q, K, and V values carry an encoded representation of each word in the sequence.
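The following sketch (PyTorch, toy sizes; an illustration rather than the document's code) shows a single Attention Head computing that formula, including the masking step discussed next. Multi-head attention simply runs several such heads in parallel, each with its own Linear layers, and combines their outputs.

```python
# Illustrative sketch: one Attention Head with an optional causal mask.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_len, d_model, d_k = 4, 8, 8        # hypothetical sizes
x = torch.rand(seq_len, d_model)       # one sample: (sequence length, embedding size)

# Separate Linear layers, each with their own weights, produce Q, K, and V.
w_q, w_k, w_v = nn.Linear(d_model, d_k), nn.Linear(d_model, d_k), nn.Linear(d_model, d_k)
Q, K, V = w_q(x), w_k(x), w_v(x)       # each (seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)      # (seq_len, seq_len): every word against every word

# Decoder Self-attention masks future positions so position i cannot 'peek' ahead.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = F.softmax(scores, dim=-1)    # Attention Scores; each row sums to 1
attention_output = weights @ V         # weighted mix of the Value vectors
print(attention_output.shape)          # torch.Size([4, 8])
```

Padding masks work the same way: positions corresponding to padding tokens are set to negative infinity before the softmax, so they receive zero weight.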
The Attention calculations then combine each word with every other word in the sequence, so that the Attention Score encodes a score for each word in the sequence. When discussing the Decoder a little while back, we briefly mentioned masking. Let's see how it works.

Attention Masks
While computing the Attention Score, the Attention module implements a masking step. Masking serves two purposes:
- In the Encoder Self-attention and in the Encoder-Decoder attention: masking serves to zero attention outputs where there is padding in the input sentences, to ensure that padding doesn't contribute to the self-attention. (Note: since input sequences could be of different lengths, they are extended with padding tokens, as in most NLP applications, so that fixed-length vectors can be input to the Transformer.)
- In the Decoder Self-attention: masking serves to prevent the Decoder from 'peeking' ahead at the rest of the target sentence when predicting the next word.

The Decoder processes words in the source sequence and uses them to predict the words in the destination sequence. During training, this is done via Teacher Forcing, where the complete target sequence is fed as Decoder inputs. Therefore, while predicting a word at a certain position, the Decoder has available to it the target words preceding that word as well as the target words following that word. This would allow the Decoder to 'cheat' by using target words from future 'time steps'. For instance, when predicting 'Word 3', the Decoder should refer only to the first 3 words of the target, and not to the fourth word.

Generate Output
The last Decoder in the stack passes its output to the Output component, which converts it into the final output sentence. The Linear layer projects the Decoder vector into Word Scores, with a score value for each unique word in the target vocabulary, at each position in the sentence. For instance, if our final output sentence has 7 words and the target Spanish vocabulary has 10,000 unique words, we generate 10,000 score values for each of those 7 words. The score values indicate the likelihood of occurrence for each word in the vocabulary in that position of the sentence. The Softmax layer then turns those scores into probabilities (which add up to 1.0). In each position, we find the index of the word with the highest probability, and then map that index to the corresponding word in the vocabulary. Those words then form the output sequence of the Transformer.

Training and Loss Function
During training, we use a loss function such as cross-entropy loss to compare the generated output probability distribution to the target sequence. The probability distribution gives the probability of each word occurring in that position. Let's assume our target vocabulary contains just four words. Our goal is to produce a probability distribution that matches our expected target sequence "De nada END". This means that the probability distribution for the first word-position should have a probability of 1 for "De", with the probabilities for all other words in the vocabulary being 0. Similarly, "nada" and "END" should have a probability of 1 for the second and third word-positions respectively. As usual, the loss is used to compute gradients to train the Transformer via backpropagation.
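To close the loop, here is a minimal sketch (PyTorch; the four-word vocabulary and the random Decoder output are illustrative stand-ins, not the document's code) of the Output component and the loss computation described above.

```python
# Illustrative sketch: Linear layer -> Word Scores -> Softmax/argmax for the
# output words, and cross-entropy loss against the target sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["De", "nada", "END", "PAD"]            # hypothetical target vocabulary
target_ids = torch.tensor([0, 1, 2])            # "De nada END"

d_model, seq_len, vocab_size = 8, 3, len(vocab)
decoder_output = torch.rand(seq_len, d_model)   # stand-in for the last Decoder's output

output_layer = nn.Linear(d_model, vocab_size)
scores = output_layer(decoder_output)           # (seq_len, vocab_size) Word Scores

probs = F.softmax(scores, dim=-1)               # probabilities per position, summing to 1.0
predicted = [vocab[i] for i in probs.argmax(dim=-1).tolist()]
print(predicted)                                # untrained, so the words are arbitrary

# Cross-entropy compares the predicted distributions with the target sequence;
# its gradients are what train the Transformer during back-propagation.
loss = F.cross_entropy(scores, target_ids)
print(loss.item())
```

The loss is lowest when each position assigns probability 1 to the correct target word, which is exactly the goal described in the example above.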
