Attention Mechanisms and Transformer Models
Université Abdelmalek Essaâdi
2024
Fadoua BEN AHMED
Summary
This document explores attention mechanisms and transformer models, detailing their architecture and applications in natural language processing. Presented by Fadoua Ben Ahmed, the content covers the importance of Transformers, including their advantages over RNNs and CNNs, and how they are used in a range of real-world applications.
Full Transcript
Attention Mechanisms and Transformer Models
An Exploration of Modern NLP Techniques
Presented by: Fadoua BEN AHMED
14/06/2024

Table of Contents
1. Introduction
2. The Importance of Transformers
3. Understanding Transformer Models
4. Architecture of Transformers
   4.1 Encoder and Decoder Structure
   4.2 Input Embedding
   4.3 Positional Encoding
5. Attention Mechanisms
   5.1 Self-Attention Mechanism
   5.2 Multi-Head Attention
6. Additional Components of the Transformer
   6.1 Feedforward Neural Networks
   6.2 Layer Normalization and Residual Connections
   6.3 The Final Linear and Softmax Layer
7. Training Transformers
   7.1 The Loss Function
8. Applications of Transformers
9. Conclusion
10. References

1. Introduction

Self-attention has become a cornerstone of modern deep learning, driving significant advancements in both natural language processing (NLP) and computer vision. Introduced in the groundbreaking 2017 paper "Attention Is All You Need" by Ashish Vaswani and his team at Google Brain and the University of Toronto, Transformers leverage self-attention to efficiently process and understand complex data. The paper marked a watershed moment in AI: Transformers can capture long-range dependencies and process sequences in parallel, making them faster and more scalable. This innovation has paved the way for powerful models like BERT, GPT, and Vision Transformers, transforming how we approach language and vision tasks in AI.

2. The Importance of Transformers

Advantages over RNNs and CNNs: Transformers excel at capturing long-range dependencies and can process data in parallel, unlike RNNs and CNNs. This parallel processing capability significantly enhances their efficiency and scalability, making them ideal for handling complex and extensive datasets.

Real-world impact: Transformers power applications from real-time text translation to advanced image recognition. They are integral to tools like OpenAI's ChatGPT and Google's BERT, which optimizes search results. Beyond language and vision, Transformers aid fields like DNA sequencing, drug design, and fraud detection, showcasing their versatile impact across domains.

Problems with RNNs and CNNs: not practical for very long sequences

RNNs face a challenge with parallel computation because they process data sequentially, making it difficult to exploit the parallel computing power of GPUs. Transformers overcome this issue by avoiding recurrence altogether, allowing them to perform computations in parallel at every step, which makes them much faster and more efficient.

RNNs also struggle with long-range dependencies, making them ineffective for processing long text documents. In contrast, Transformers use attention blocks that can connect any parts of a sequence, handling long-range dependencies as effectively as short-range ones. This lets Transformers manage long text documents with ease.

3. Understanding Transformer Models

A Transformer is a type of neural network architecture designed for handling sequences of data, such as sentences, paragraphs, or time series. It was introduced in the paper "Attention Is All You Need" to address limitations of previous sequence models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The Transformer's self-attention mechanism allows it to consider the relationships between all elements in a sequence simultaneously, capturing complex dependencies.
The Transformer model was primarily designed for sequence-to-sequence tasks, such as machine translation, where the goal is to convert a sequence of words in one language into a sequence of words in another. Its architecture has proven versatile and effective for a wide range of other tasks as well, including text generation, text summarization, question answering, and sentiment analysis.

4. Architecture of Transformers

The Transformer architecture, introduced by Vaswani et al. in 2017, consists of an encoder-decoder structure. The encoder processes the input sequence, while the decoder generates the output sequence. Key components include multi-head self-attention mechanisms and feedforward neural networks, which allow the model to capture complex dependencies in data efficiently. This design has led to significant advancements in natural language processing and other fields.

Intuitively, the Transformer is like a super-smart listener that not only pays attention to each word you say but also understands the connections between all the words in the story.

Example: Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.

4.1 Encoder and Decoder Structure

Encoder: processes the input sequence and generates contextualized representations of each token.
Decoder: generates the output sequence based on the encoder's representations and the self-attention mechanism.

The encoding component is a stack of encoders (the paper stacks six of them on top of each other; there is nothing magical about the number six, and one can certainly experiment with other arrangements). The decoding component is a stack of decoders of the same number. The encoders are all identical in structure, yet they do not share weights. Each one is broken down into two sub-layers:

The encoder's inputs first flow through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We will look more closely at self-attention later.
The outputs of the self-attention layer are fed to a feedforward neural network. The exact same feedforward network is applied independently to each position.

The decoder has both of those layers, but between them sits an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

4.2 Input Embedding

Input embedding transforms an input sequence, such as a sentence, into numerical representations called embeddings. These embeddings capture the semantic meaning of the tokens, enabling the model to understand and process the information effectively. Each token is represented by a vector in a low-dimensional space, where semantically similar words are closer together. This happens in the bottom-most encoder, setting the foundation for subsequent layers to learn and generalize patterns in the data.

The embedding only happens in the bottom-most encoder. The abstraction common to all the encoders is that they receive a list of vectors, each of size 512: in the bottom encoder these are the word embeddings, but in the other encoders they are the outputs of the encoder directly below.

4.3 Positional Encoding

The position of a word plays a determining role in understanding the sequence we are trying to model. Therefore, we add positional information about each word's place within the sequence to its vector.
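To make the embedding and positional-information steps concrete, here is a minimal NumPy sketch (not the presentation's own code): it builds a toy embedding table, computes sinusoidal positional encodings using the functions described in the next section, and adds the two. The toy vocabulary, the dimensions, and the parameter n are illustrative assumptions.

```python
import numpy as np

def positional_encoding(seq_len, d_model, n=10000):
    """Sinusoidal positional encodings: sine on even indices, cosine on odd indices."""
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (n ** (i / d_model))
            pe[pos, i] = np.sin(angle)
            if i + 1 < d_model:
                pe[pos, i + 1] = np.cos(angle)
    return pe

# Toy setup: a 4-word sentence, d_model = 8 (the paper uses 512).
vocab = {"the": 0, "dog": 1, "ran": 2, "fast": 3}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in a real model

tokens = ["the", "dog", "ran", "fast"]
word_embeddings = embedding_table[[vocab[t] for t in tokens]]  # shape (4, d_model)

# The encoder input is each word embedding plus its positional encoding.
encoder_input = word_embeddings + positional_encoding(len(tokens), d_model)
print(encoder_input.shape)  # (4, 8)
```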
Positional Encoding

Since the Transformer processes all words in a sequence in parallel, it has no inherent understanding of word order. To address this, positional encodings are added to the embeddings. These encodings are vectors built from specific patterns of sine and cosine functions that represent the position of each word within the sequence. By adding these positional encodings to the input embeddings, the model retains information about the position and order of words, which is crucial for understanding the sequence.

The authors of the paper used the following functions to model the position of a word within a sequence:

PE(pos, 2i)   = sin(pos / n^(2i/d))
PE(pos, 2i+1) = cos(pos / n^(2i/d))

where pos is the word's position, i indexes the dimensions of the encoding, d is the embedding dimension, and n is a user-defined scalar (set to 10,000 in the original paper).

For example, consider the sentence "The dog ran fast" with n = 100 and d = 4: these functions produce a fixed 4x4 positional encoding matrix, and that matrix remains the same for any input of the same length with n = 100 and d = 4.

As mentioned already, an encoder receives a list of vectors as input. It processes this list by passing the vectors into a self-attention layer, then into a feedforward neural network, and then sends the output upwards to the next encoder.

5. Attention Mechanisms

Say the following sentence is an input sentence we want to translate:

"The animal didn't cross the street because it was too tired"

What does "it" refer to in this sentence? Is it referring to the street or to the animal? This is a simple question for a human, but not as simple for an algorithm. When the model is processing the word "it", self-attention allows it to associate "it" with "animal". As the model processes each word, self-attention allows it to look at other positions in the input sequence for clues that can help it build a better encoding for that word. If you are familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previously processed words/vectors into the one it is currently processing. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one currently being processed.

5.1 Self-Attention Mechanism

Self-attention is a mechanism in the Transformer model that allows each word in a sequence to focus on every other word, capturing their relationships. This is achieved by computing attention scores as dot products of queries and keys; these scores determine the importance of each word in the context of the sequence and are used to weight the values. By generating context-aware representations, self-attention helps the model understand the nuances and dependencies between words, enhancing its ability to process language effectively.

Calculating self-attention:
1. Create Query, Key, and Value vectors
2. Calculate attention scores
3. Scale and apply softmax to the scores
4. Compute the weighted sum of the Value vectors
5. Matrix form implementation
6. Condensed matrix calculation

Step 1: Create Query, Key, and Value Vectors

For each word in the input sequence, we create a Query vector, a Key vector, and a Value vector. These vectors are obtained by multiplying the word embedding by three different weight matrices (WQ, WK, WV) that are learned during training. These new vectors are smaller in dimension (e.g., 64) than the embedding vector (e.g., 512), an architectural choice that makes multi-headed attention computation more efficient.

Step 2: Calculate Attention Scores

For a given word, we calculate an attention score with respect to every word in the sequence.
This score is obtained by taking the dot product of the word's Query vector with the Key vector of each word in the sequence. For example, when calculating attention for the word at position 1 ("Thinking"), the score with respect to the word at position 1 is the dot product of q1 and k1, the score with respect to the word at position 2 is the dot product of q1 and k2, and so on.

Step 3: Scale and Apply Softmax to Scores

We divide the attention scores by the square root of the dimension of the Key vectors (e.g., 8 for a dimension of 64) to achieve more stable gradients. We then pass the scaled scores through a softmax function to normalize them. This converts the scores into probabilities, ensuring they sum to 1 and are easier to interpret.

Step 4: Weighted Sum of Value Vectors

We multiply each Value vector by its corresponding softmax score. This emphasizes the Value vectors of important words and diminishes the impact of less relevant ones. We then sum the weighted Value vectors to obtain the final output of the self-attention layer for the current word.

Step 5: Matrix Form Implementation

For efficiency, we perform the calculations in matrix form. We pack the embeddings into a matrix X and multiply it by the weight matrices (WQ, WK, WV). Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size between the embedding vectors (512) and the q/k/v vectors (64).

Step 6: Condensed Matrix Calculation

Finally, we condense the calculation into a single matrix formula that computes the outputs of the self-attention layer in one shot. This is the self-attention calculation in matrix form:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

(A code sketch of this matrix form appears after section 6.2.)

5.2 Multi-Head Attention

Multi-head attention extends self-attention by using multiple heads to simultaneously capture diverse and complementary information from the input sequence. This mechanism enables the model to attend to different parts of the sequence in parallel, enhancing representation learning and improving performance. By using multiple sets of Query, Key, and Value weight matrices, multi-head attention allows the model to better understand complex relationships and dependencies within the data. For example, it can help the model discern which word "it" refers to in the sentence "The animal didn't cross the street because it was too tired," ensuring more accurate comprehension of context.

Steps to implement multi-head attention:
1. Perform self-attention in parallel across the heads
2. Generate multiple Z matrices
3. Concatenate the Z matrices
4. Apply an additional linear transformation
5. Integrate the result into the model

Multi-headed attention enhances the performance of the attention layer in two key ways:

1. Focus on different positions: it allows the model to attend to different parts of the sequence simultaneously. For example, in translating the sentence "The animal didn't cross the street because it was too tired," multi-headed attention helps the model understand which word "it" refers to, ensuring more accurate comprehension of context.

2. Multiple representation subspaces: it provides the attention layer with multiple "representation subspaces." Instead of a single set, multi-headed attention uses multiple sets of Query/Key/Value weight matrices. For instance, the Transformer uses eight attention heads, resulting in eight sets for each encoder/decoder.
These sets are randomly initialized and, after training, project the input embeddings into different representation subspaces, enriching the model's understanding of the data.

Steps to implement multi-head attention:
i. Linear projections
ii. Self-attention calculation
iii. Combine and transform

Step 1: Linear Projections

For each word in the input sequence, we create multiple sets of Query, Key, and Value vectors by applying linear projections with different weight matrices (WQ, WK, WV). With eight heads, we generate eight sets of Query, Key, and Value vectors for each word. In multi-headed attention we maintain separate Q/K/V weight matrices for each head, resulting in different Q/K/V matrices. As before, we multiply X by the WQ/WK/WV matrices to produce the Q/K/V matrices.

Step 2: Self-Attention Calculation

If we perform the same self-attention calculation outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices. This leaves us with a bit of a challenge: the feedforward layer is not expecting eight matrices but a single matrix (one vector per word). So we need a way to condense these eight down into a single matrix.

Step 3: Combine and Transform

Concatenate the output matrices from all attention heads along the depth axis to form a single combined matrix. Then apply a final linear transformation with an additional weight matrix (WO) to the concatenated matrix to produce the final output. This ensures the output has the same dimension as the input embeddings. (A code sketch of the full multi-head computation appears after section 6.2.)

6. Additional Components of the Transformer

6.1 Feedforward Neural Networks

After the self-attention layer, the output is passed through a feedforward neural network. Each feedforward network consists of two linear layers with a ReLU activation function in between. These networks apply non-linear transformations to the token representations, allowing the model to capture complex patterns and relationships in the data. The weights and biases are the same across different positions in the sequence but differ between the individual encoders and decoders.

6.2 Layer Normalization and Residual Connections

In the Transformer's encoder architecture, each sub-layer (self-attention, feedforward network) has a residual connection around it and is followed by a layer normalization step. The residual connection adds the input embeddings to the context vectors produced by the self-attention layer, ensuring the original information is retained. These combined vectors are then normalized across the layer dimension, producing output vectors with zero mean and unit variance. Finally, the output of the feedforward sub-layer is added to the normalized output of the attention sub-layer, and this combined vector is once again normalized. The same applies to the sub-layers of the decoder. A Transformer of two stacked encoders and decoders simply repeats this structure twice on each side.

The decoder uses the encoder's output attention vectors K and V to focus on relevant parts of the input sequence during decoding. It generates output step by step, feeding each output to the next time step, and applies embeddings and positional encoding to maintain sequence order. Self-attention in the decoder only attends to earlier positions, which is enforced by masking future positions (setting them to -inf) before the softmax step. The "encoder-decoder attention" layer uses Queries from the decoder and Keys/Values from the encoder to align the input and output sequences.
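To tie together the matrix form of section 5.1 and the decoder masking just described, here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with an optional mask. The weight matrices, the sizes, and the use of -1e9 as a stand-in for -inf are illustrative assumptions; in a real model WQ, WK, and WV are learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # steps 2-3: dot products, then scaling
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get (effectively) -inf
    weights = softmax(scores)                  # step 3: softmax over each row
    return weights @ V                         # step 4: weighted sum of Value vectors

# Toy example: 4 tokens, embeddings of size 8, q/k/v vectors of size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # one row per token
WQ, WK, WV = (rng.normal(size=(8, 4)) for _ in range(3))
Q, K, V = X @ WQ, X @ WK, X @ WV               # step 1: Query, Key, Value matrices

Z = scaled_dot_product_attention(Q, K, V)      # encoder-style self-attention

# Decoder-style masked self-attention: each position may only attend to
# itself and earlier positions.
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
Z_masked = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```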
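Building on that sketch (and reusing its scaled_dot_product_attention function), the following sketches the multi-head attention of section 5.2: separate Q/K/V projections per head, one Z matrix per head, concatenation along the depth axis, and a final projection with WO. The eight heads and d_model = 512 follow the text; the random initialization and the function signature are illustrative assumptions.

```python
# Assumes numpy as np and scaled_dot_product_attention from the previous sketch.

def multi_head_attention(X, num_heads, d_model, rng):
    """Illustrative multi-head self-attention over X of shape (seq_len, d_model)."""
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Separate, randomly initialised Q/K/V projections per head
        # (these would be learned parameters in a trained model).
        WQ, WK, WV = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ WQ, X @ WK, X @ WV
        heads.append(scaled_dot_product_attention(Q, K, V))  # one Z matrix per head
    Z = np.concatenate(heads, axis=-1)          # concatenate the per-head Z matrices
    WO = rng.normal(size=(num_heads * d_head, d_model))
    return Z @ WO                               # final linear transformation

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 512))                   # 4 tokens, d_model = 512
out = multi_head_attention(X, num_heads=8, d_model=512, rng=rng)
print(out.shape)                                # (4, 512): same size as the input
```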
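Lastly, sections 6.1 and 6.2 can be sketched as a single encoder layer: self-attention with a residual connection and layer normalization, followed by a position-wise feedforward network (two linear layers with a ReLU in between) with its own residual connection and normalization. This reuses the multi-head sketch above; d_ff = 2048 is the value used in the original paper, and the rest of the parameter handling is an illustrative assumption.

```python
# Assumes numpy as np and multi_head_attention from the previous sketch.

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to each position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_fn, ffn_params):
    # Sub-layer 1: self-attention, residual connection, then layer normalization.
    x = layer_norm(x + attn_fn(x))
    # Sub-layer 2: feedforward network, residual connection, then layer normalization.
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 4
ffn_params = (rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff),
              rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model))

x = rng.normal(size=(seq_len, d_model))
attn = lambda h: multi_head_attention(h, num_heads=8, d_model=d_model, rng=rng)
out = encoder_layer(x, attn, ffn_params)
print(out.shape)  # (4, 512)
```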
6.3 The Final Linear and Softmax Layer

Finally, to get the output predictions, we need to turn the output vector of the last decoder into words. We first feed this vector into a fully connected linear layer to obtain a logits vector the size of the vocabulary. Next, we apply the softmax function to this vector to get a probability score for each word in the vocabulary. The word with the highest probability is then chosen as our prediction.

7. Training Transformers

During training, an untrained model goes through exactly the same forward pass. But since we are training it on a labeled dataset, we can compare its output with the actual correct output. To visualize this, let's assume our output vocabulary contains only six words ("a", "am", "i", "thanks", "student", and the end-of-sentence token "<eos>"). Once we define our output vocabulary, we can use a vector of the same width to indicate each word in the vocabulary. This is also known as one-hot encoding. For example, we can indicate the word "am" with a vector that is 1 at the position of "am" and 0 everywhere else: a one-hot encoding over our output vocabulary.

7.1 The Loss Function

(The accompanying slide shows an encoder-decoder pipeline translating "Je suis un etudiant" into "I am a student", with a linear layer, a softmax over the vocabulary, and a cross-entropy loss applied at each time step.)

To effectively train our model, we need a way to quantify how close the model's outputs are to the desired outputs. This is where the loss function comes into play. The loss function measures the difference between the predicted probability distributions and the actual target distributions. After training the model for long enough on a large enough dataset, we would hope the produced probability distributions place most of their mass on the correct word at each time step.

8. Applications of Transformers

Machine Translation: e.g., Google Translate uses Transformers for more accurate and context-aware translations.
Text Generation: models like GPT generate coherent and contextually relevant text.
Sentiment Analysis: understanding emotional tone in social media posts and reviews.
Chatbots and Virtual Assistants: enhanced natural language understanding in systems like Siri and Alexa.

Use cases of the Transformer:

1. BERT (Bidirectional Encoder Representations from Transformers): BERT enhances NLP tasks like text classification and question answering by using bidirectional context understanding, allowing it to be pre-trained on large text datasets and fine-tuned for specific applications.

2. GPT (Generative Pre-trained Transformer) series: known for their text-generation prowess, GPT models, especially GPT-3, excel at producing coherent and contextually relevant text across a wide range of topics thanks to their vast number of parameters.

3. T5 (Text-to-Text Transfer Transformer): T5 treats all NLP tasks as text-to-text problems, enabling a single model to handle diverse tasks such as translation and summarization by maintaining a consistent text-based approach.

9. Conclusion

In conclusion, the rise of Transformers and attention mechanisms has revolutionized various fields, from natural language processing to computer vision and beyond. These architectures, with their ability to capture long-range dependencies and contextual information efficiently, have propelled advancements in machine learning and AI. Understanding the intricacies of Transformer models, from their architecture to their training methodology, equips us with powerful tools for tackling complex tasks.
As we continue to explore and innovate with these models, it is crucial to remember their potential impact on society and to use them responsibly. With ongoing research and practical applications, the journey of Transformers and attention mechanisms is far from over, promising exciting possibilities for the future of artificial intelligence.

10. References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 6000-6010.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv preprint arXiv:1409.0473.
Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." arXiv preprint arXiv:1508.04025.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). "Layer Normalization." arXiv preprint arXiv:1607.06450.
Rae, J. W., Potapenko, A., Jayakumar, S. M., & Lillicrap, T. P. (2020). "Compressive Transformers for Long-Range Sequence Modelling." In International Conference on Learning Representations (ICLR).

THANK YOU FOR YOUR ATTENTION
Presented by: Fadoua BEN AHMED
14/06/2024