Transformers 101 : From Zero to Hero
By : ElMahdi ELBAKKAR

Contents

1 Prologue
2 Introduction : The age of reliance on RNNs is gone (for sequences)
  2.1 RNNs and sequences : How ?
  2.2 Sequential data in the world of medicine
3 Transformers : (Not Optimus Prime)
  3.1 Natural Language Processing (NLP)
  3.2 What are transformer models ?
    3.2.1 Key notions
    3.2.2 To retain : What you need to know before detailing the transformers
  3.3 The Encoder Workflow
  3.4 What you need to remember about encoders
  3.5 The decoder workflow
  3.6 What you need to remember about the decoders
4 Real-life transformers : ChatGPT
5 Conclusion

1 Prologue

This summary is an initiative to recap the main concepts learned during the Data Science Course of the first year for both programs (Medicine & Pharmacy). What this means is that, under no circumstance, is this a replacement for the original course or a ticket for absenteeism ; it is an attempt to simplify the notions learned in class and make Data Science the favorite subject you never expected. I tried my best to explain every notion, as the foundations are everything. I will be equipping you with a methodology that you can adapt to your own learning styles, and I link every notion to real-life aspects because we, human beings, are the product of our environments, and thus we learn better by connecting everything to our lived experiences.

It turns out that my favorite quote is one about genius. It goes like this :

" Genius takes time and extraordinary effort "

It is a beautiful quote but, more importantly, an insightful one, as it indicates two important things. First, if one desires to become a genius, the road is clear : one must dedicate time and effort to achieve the goal lying ahead of one’s eyes. Second, there is no preset list of genius people ; genius is created. The third component that I will be adding is curiosity : curious minds are the ones that never die, as they strive to enrich their knowledge. Curious minds show incredible consistency because they are motivated by an immaterial reason. I want you all to keep this in mind.

ElMahdi ELBAKKAR

2 Introduction : The age of reliance on RNNs is gone (for sequences)

Deep learning experienced a significant period of reliance on Recurrent Neural Networks (RNNs), which excelled at processing sequential data like time series and text. While RNNs and their variants, such as LSTMs and GRUs, brought notable advancements, their progress was relatively incremental and constrained by challenges like vanishing gradients and difficulty in handling long-range dependencies.

This trajectory changed dramatically with the introduction of transformers, a groundbreaking architecture that leverages self-attention mechanisms to process sequences in parallel rather than sequentially. Transformers revolutionized the field by enabling faster training, better scalability, and enhanced performance on tasks requiring context understanding. Their impact was profound, catalyzing the development of sophisticated chatbots and models like GPT, which marked a new era in natural language processing and AI applications.
Don’t feel overwhelmed by the multitude of concepts and terms you may encounter ; I’ll break everything down for you step by step. Each idea will be explained clearly and thoroughly, with examples to help you understand and connect the dots. My goal is to guide you through the complexities at a manageable pace, ensuring that no question goes unanswered and every concept becomes accessible. So, take a deep breath and trust the process—we’ll navigate this together !

2.1 RNNs and sequences : How ?

Recurrent Neural Networks (RNNs) were created to work with sequences, like sentences or time-based data. They process one part of the sequence at a time, remembering what came before to understand the context. This made them useful for tasks like recognizing speech, predicting the next word in a sentence, or analyzing patterns over time.

Definition of a Sequence
A sequence is an ordered list of items or events where the order matters and each element is connected to the next. It can represent various types of data, such as a sentence made up of words, a melody composed of notes, or time series data like daily temperatures. Sequences are important in many fields because they capture patterns, relationships, and trends that unfold over time or within a specific order. Understanding sequences helps in tasks like predicting what comes next, analyzing patterns, or finding meaningful connections in the data.

In Feed-forward Neural Networks (FNNs), the output for a given data point is entirely independent of any previous inputs. For example, the health risk prediction for one person does not rely on the health risk prediction for another. Similarly, in Convolutional Neural Networks (CNNs), the output from the softmax layer during image classification is independent of any prior input images. This highlights two important characteristics of Feed-forward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) :

1. Outputs are independent of previous inputs : Each output is determined solely by the current input and does not rely on any past data.
2. Input is of fixed length : These networks process inputs of a fixed size, making them unsuitable for tasks involving variable-length sequences or dependencies between inputs.

In summary, while FNNs and CNNs are powerful for tasks like health risk prediction or image classification, their inability to handle sequential dependencies and variable-length data limits their application in domains where context or temporal relationships are crucial.

Recurrent Neural Networks (RNNs) addressed this limitation by introducing mechanisms to retain information from previous inputs, enabling them to analyze sequences. However, RNNs faced challenges with long-range dependencies and computational inefficiencies. This paved the way for a revolutionary approach : the transformer architecture. By replacing sequential processing with a self-attention mechanism, transformers overcame these limitations, offering unparalleled performance on tasks requiring deep contextual understanding.

2.2 Sequential data in the world of medicine

Sequential data in medicine is abundant and plays a crucial role in understanding and predicting health outcomes. Some examples include :
1. Electronic Health Records (EHRs) : Patient histories, including sequences of medical visits, treatments, and lab results over time.
2. Time-Series Data : Continuous monitoring data such as heart rate, blood pressure, glucose levels, and oxygen saturation.

Figure 1 – Heart rate monitor

3. Genomic Sequences : DNA or RNA sequences analyzed to identify mutations or gene expression patterns.

Figure 2 – Genome sequencing

4. Medical Imaging Sequences : Time-ordered imaging data, such as cardiac MRI scans over a heartbeat cycle or CT scans over time to track disease progression.

Figure 3 – Cardiac MRI

5. Drug Administration Records : Sequences of medications administered and their effects, important for personalized medicine.
6. Speech and Language Data : Speech patterns in neurological disorders or sequential responses in cognitive assessments.
7. Epidemiological Data : Disease spread patterns and sequences of infection rates across regions and time.

Figure 4 – Disease spread patterns

Sequential data analysis in these areas can provide critical insights, such as predicting disease progression, optimizing treatment plans, and improving patient outcomes.

3 Transformers : (Not Optimus Prime)

The field of deep learning has undergone a monumental transformation with the advent and rapid advancement of Transformer models. These innovative architectures have not only set new benchmarks in Natural Language Processing (NLP) but have also extended their influence across various domains of artificial intelligence. Defined by their powerful attention mechanisms and parallel processing capabilities, Transformer models have revolutionized the ability to understand and generate human language with unprecedented accuracy and efficiency.

Introduced in 2017 through Google’s landmark paper “Attention Is All You Need”, the transformer architecture underpins groundbreaking models such as ChatGPT, fueling a wave of innovation and excitement in the AI community. These models have driven advancements in OpenAI’s language technologies and played pivotal roles in DeepMind’s AlphaStar.

In this new era of AI, mastering Transformer models is indispensable for data scientists and NLP practitioners seeking to stay at the forefront of the field. Before we delve into the transformative world of transformers, it’s essential to first explore the foundations of Natural Language Processing (NLP). NLP serves as the bridge between human language and machine understanding, enabling computers to analyze, interpret, and generate text in meaningful ways. By understanding the core principles and challenges of NLP, we can better appreciate the revolutionary impact transformers have had on this field and beyond. Let’s begin by unpacking the fundamentals of NLP.

3.1 Natural Language Processing (NLP)

Natural Language Processing (NLP) is a branch of artificial intelligence focused on enabling machines to understand, interpret, and generate human language in a way that is both meaningful and useful. It combines computational linguistics with machine learning to process and analyze text or speech data. NLP encompasses a wide range of tasks, such as sentiment analysis, machine translation, text summarization, and speech recognition. By bridging the gap between human communication and computer understanding, NLP has become a cornerstone of modern AI applications, powering tools like virtual assistants, chatbots, and language translation systems.
NLP powers advanced language models to create human-like text for various purposes. Pre-trained models, such as GPT-4, can generate articles, reports, marketing copy, product descriptions and even creative writing based on prompts provided by users. NLP-powered tools can also assist in automating tasks like drafting emails, writing social media posts or legal documentation. By understanding context, tone and style, NLP ensures that the generated content is coherent, relevant and aligned with the intended message, saving time and effort in content creation while maintaining quality.

NLP tasks

Several NLP tasks are essential for processing human text and voice data, allowing computers to make sense of the information they receive. These tasks include :
— Coreference Resolution : Determining which words or phrases in a text refer to the same entity (e.g., "John" and "he" referring to the same person).
— Named Entity Recognition (NER) : Identifying and classifying entities in text, such as names of people, organizations, locations, dates, and other specific information.
— Part-of-Speech Tagging : Assigning grammatical categories to words in a sentence (e.g., noun, verb, adjective) to understand their role in the sentence.
— Word Sense Disambiguation : Determining the correct meaning of a word based on its context, especially when the word has multiple meanings.
These tasks help in breaking down complex language into structured data that machines can analyze and interpret effectively.

How does NLP work ?

Figure 5 – NLP workflow

NLP text preprocessing gets raw text ready for analysis by changing it into a format that machines can easily understand. The process begins with tokenization, which breaks the text into smaller parts like words or sentences, making it easier to handle. Then, lowercasing is applied to make everything consistent, so words like "Apple" and "apple" are treated the same. Stop word removal follows, where common words like "is" or "the" are removed because they don’t add much meaning. Stemming or lemmatization reduces words to their basic form (e.g., "running" becomes "run"), so different forms of the same word are grouped together. Finally, text cleaning removes unnecessary things like punctuation, special characters, and numbers that could confuse the analysis.

Feature extraction is the process of turning raw text into numbers that machines can understand. This involves using NLP techniques like Bag of Words and TF-IDF to measure the presence and importance of words in a document.

Text analysis helps extract useful information from text using different computational methods. These include tasks like part-of-speech (POS) tagging, which identifies the grammatical role of words, and named entity recognition (NER), which detects things like names, places, and dates. Dependency parsing looks at how words relate to each other in a sentence to understand its structure, while sentiment analysis checks whether the tone of the text is positive, negative, or neutral. Topic modeling identifies common themes or topics in a text or a group of documents. Natural language understanding (NLU), a part of NLP, focuses on figuring out the meaning behind sentences, allowing software to understand similar meanings in different sentences or handle words with multiple meanings. These techniques turn raw text into useful insights.
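To make the preprocessing and feature-extraction steps above more concrete, here is a minimal sketch in Python using NLTK (mentioned later in this section) and scikit-learn, a common choice for TF-IDF that is not part of the course material ; the example sentences are made up purely for illustration :

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")      # tokenizer model
nltk.download("stopwords")  # list of common words like "is", "the"

docs = [
    "The patient is running a fever and the heart rate is high.",
    "Running tests showed a normal heart rate.",
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # tokenization + lowercasing
    tokens = [t for t in tokens if t.isalpha()]          # text cleaning: drop punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    return [stemmer.stem(t) for t in tokens]             # stemming ("running" -> "run")

print(preprocess(docs[0]))

# Feature extraction: turn the documents into TF-IDF vectors.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))

This mirrors the pipeline in Figure 5 : clean and tokenize first, then convert the text into numeric features that a model can learn from.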
Processed data is then used to train machine learning models, which learn patterns and relationships within the data. During training, the model adjusts its parameters to reduce errors and improve performance. Once trained, the model can make predictions or generate results on new, unseen data. The model’s accuracy and relevance are improved through evaluation and fine-tuning, ensuring it works well in real-world situations.

Various software tools are helpful in these processes. For example, the Natural Language Toolkit (NLTK) is a collection of Python libraries that supports text classification, tokenization, stemming, and other NLP tasks. TensorFlow, an open-source machine learning library, can also be used to train NLP models. There are many tutorials and certifications available for those who want to learn how to use these tools.

3.2 What are transformer models ?

A transformer model is a type of neural network designed to learn the context of sequential data and generate new data based on that understanding. In simple terms, a transformer is an advanced AI model that learns to comprehend and generate human-like text by analyzing patterns in large amounts of text data.

Transformers are currently state-of-the-art in Natural Language Processing (NLP) and are considered an evolution of the encoder-decoder architecture. Unlike traditional encoder-decoder models that rely heavily on Recurrent Neural Networks (RNNs) to capture sequential information, transformers completely eliminate the need for recurrence. Instead, they use a self-attention mechanism that allows the model to weigh the importance of different words in a sequence, regardless of their position, leading to more efficient and accurate language understanding.

Figure 6 – Transformer’s mechanism

3.2.1 Key notions

Encoder-decoder
The encoder-decoder architecture is a framework commonly used in machine learning, particularly in Natural Language Processing (NLP) and tasks like machine translation, text summarization, and image captioning. It consists of two main components : the encoder and the decoder.
— Encoder : The encoder processes the input data (e.g., a sentence in one language) and compresses it into a fixed-size vector, often referred to as the context vector or encoded representation. It extracts features and relevant information from the input sequence and transforms it into a form that the decoder can use. In traditional encoder-decoder models, Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks are typically used in the encoder to handle sequential data, where the model processes the input one element at a time.
— Decoder : The decoder takes the encoded information from the encoder and generates the desired output (e.g., the translated sentence or summarized text). It uses the context vector to produce the sequence of outputs. In sequence-to-sequence tasks like translation, the decoder generates one word at a time, feeding its previous predictions back into the model to produce the next output.

Figure 7 – Encoder Decoder

Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a specialized type of Recurrent Neural Network (RNN) architecture designed to handle the vanishing gradient problem and maintain long-term dependencies in sequential data, overcoming the limitations of traditional RNNs.
LSTMs are particularly adept at learning long-term dependencies and patterns in data, making them highly suitable for tasks such as speech recognition, time series prediction, and text processing. LSTMs achieve this through a unique architecture that incorporates gates, which regulate the flow of information through the network. These gates include :
— Forget Gate : Decides which information from the previous state should be discarded.
— Input Gate : Determines what new information will be added to the current state.
— Output Gate : Controls what information will be passed to the next step in the sequence.
The combination of these gates allows LSTMs to maintain and update a memory cell, enabling them to retain important information over long sequences while forgetting irrelevant details. This capability makes LSTMs more robust than standard RNNs for processing data with long-range dependencies.

The vanishing gradient problem
The vanishing gradient problem is a common issue in training deep neural networks, particularly in recurrent neural networks (RNNs) and feed-forward networks with many layers. It occurs when gradients, the values used to update the weights during backpropagation, become extremely small as they are propagated back through the layers of the network. During backpropagation, the gradients are computed using the chain rule, which involves multiplying the derivatives of the activation functions. If these derivatives are small (as is the case with activation functions like the sigmoid or tanh), the gradient shrinks as it is propagated backward through each layer. For example, the sigmoid’s derivative is at most 0.25, so across 20 layers the gradient can shrink by a factor of roughly 0.25^20 ≈ 10^-12. After many layers, the gradient becomes so small that it essentially "vanishes", preventing the network from updating its weights effectively (a small numeric sketch at the end of this subsection illustrates this).

3.2.2 To retain : What you need to know before detailing the transformers

In the field of deep learning, various architectures have been developed to address specific challenges in processing sequential data. The evolution of these models has led to significant advancements in tasks such as time-series forecasting, natural language processing, and machine translation. Below is an overview of key architectures, starting from the basic Recurrent Neural Network (RNN) and progressing through Long Short-Term Memory (LSTM) networks, Encoder-Decoder (seq2seq) models, and the Transformer architecture, which has revolutionized sequence-based learning.

1. RNN (Recurrent Neural Network) :
(a) Introduced as a basic sequential model to process time-series data and text.
(b) Suffers from issues like the vanishing gradient problem, making it unsuitable for long sequences.
2. LSTM (Long Short-Term Memory) :
(a) An improvement over RNNs, solving the vanishing gradient problem.
(b) Introduced mechanisms like gates (forget, input, output) to manage memory better.
3. Encoder-Decoder (seq2seq) :
(a) Built on LSTMs to handle sequence-to-sequence tasks (e.g., translation).
(b) Includes two components :
  i. Encoder : Encodes input into a fixed-size context vector.
  ii. Decoder : Decodes the context vector into an output sequence.
(c) Performance starts to decline for very long sequences due to reliance on a single context vector.
4. Transformer :
(a) Replaces RNNs/LSTMs in encoder-decoder architectures.
(b) Introduced attention mechanisms (e.g., self-attention) to focus on relevant parts of the input sequence without sequential processing.
(c) Allows for better parallelization and long-range dependency modeling.
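As a concrete illustration of the vanishing gradient problem discussed above, the tiny Python sketch below multiplies the sigmoid derivatives met along a chain of layers ; the numbers are illustrative, not taken from a real network :

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # never larger than 0.25

# Pretend the pre-activation value at every layer is 0 (the best case for sigmoid).
grad = 1.0
for layer in range(1, 21):
    grad *= sigmoid_derivative(0.0)   # multiply by 0.25 at each layer
    print(f"after layer {layer:2d}: gradient factor = {grad:.2e}")

# After 20 layers the factor is about 9e-13: the gradient has effectively vanished.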
Figure 8 – Summary

3.3 The Encoder Workflow

The encoder is a crucial component in sequence-to-sequence models, responsible for processing the input data and converting it into a form that can be interpreted by the decoder. It takes a variable-length input sequence and encodes it into a fixed-size context vector, which captures the relevant features of the input. This context vector is then passed to the decoder, which generates the output sequence. The encoder is typically composed of layers such as LSTMs or GRUs, which help capture temporal dependencies within the input sequence. Despite its effectiveness, the performance of traditional encoder-decoder models can degrade with longer sequences due to the limitations of encoding the entire input into a single vector.

Figure 9 – Encoder Workflow

Now that we have an understanding of the encoder’s role in sequence-to-sequence models, let’s dive deeper into its structure. The encoder is designed to capture the essential features of the input sequence through a series of layers, typically made up of recurrent units such as LSTMs or GRUs. These layers process the input step by step, maintaining hidden states that evolve over time to represent the temporal dependencies within the sequence. By examining the architecture and mechanisms behind the encoder, we can better understand how it encodes information and the challenges involved, especially when dealing with longer sequences and complex data.

Figure 10 – Detailed architecture of an encoder

Input embeddings

Embedding occurs only in the bottom-most encoder. The process starts with the encoder converting input tokens, such as words or subwords, into vectors through embedding layers. These embeddings capture the semantic meaning of the tokens and turn them into numerical vectors. Each encoder receives a list of vectors of a fixed size, typically 512. In the bottom encoder, these vectors correspond to the word embeddings, while in the higher encoders, the vectors are the outputs from the encoder directly beneath them.

Figure 11 – Input embeddings

Positional Encoding

In the encoder’s workflow, since Transformers do not rely on a recurrence mechanism like RNNs, they incorporate positional encodings into the input embeddings to capture the position of each token in the sequence. This enables the model to understand the order and structure of the words within the sentence. To achieve this, researchers proposed a method that combines various sine and cosine functions to generate positional vectors. This approach allows the positional encoding to be applied to sentences of any length. Each dimension of the positional encoding is associated with unique frequencies and offsets of the sine and cosine waves. The values of these functions range from −1 to 1, effectively encoding the position of each token within the sequence. This addition ensures that the model can distinguish between tokens based on their relative position, which is crucial for understanding sequential data.

Positional encoding helps Transformers understand the order of words in a sentence. Since Transformers don’t process words one after another like RNNs, they don’t naturally know which word comes first, second, and so on. To solve this, positional encoding adds special "position information" to each word, so the model knows where each word is in the sentence. The way it works is simple : for each position in the sentence (1st, 2nd, 3rd, etc.), we use sine and cosine waves to create a unique pattern of numbers. These numbers are then added to the word’s representation, so each word gets a mix of its meaning and its position. By using these unique patterns, the model can tell where each word belongs in the sequence and understand the structure of the sentence.

Figure 12 – Positional encoding
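The sinusoidal scheme from the original "Attention Is All You Need" paper can be written down in a few lines of Python. This is a minimal sketch using NumPy ; the sequence length and model dimension below are arbitrary example values :

import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
                             PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions use cosine
    return pe

# Example: 10 tokens, model dimension 512 (as in the original Transformer).
pe = positional_encoding(seq_len=10, d_model=512)
word_embeddings = np.random.randn(10, 512)                # stand-in for real embeddings
encoder_input = word_embeddings + pe                       # position info added to meaning
print(pe.shape, encoder_input.shape)

Each position gets a distinct pattern, and because the pattern depends only on the position, the same function works for sentences of any length.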
Stack of Encoder Layers

— The Transformer encoder consists of a stack of identical layers, typically 6 in the original Transformer model. Each encoder layer transforms the input sequence into a continuous, abstract representation that captures the learned information from the entire sequence.
— Each encoder layer has two main components :
— Multi-headed attention mechanism : This allows the model to focus on different parts of the input sequence simultaneously, helping it learn relationships between words, regardless of their position in the sentence.
— Fully connected network : This processes the information passed through the attention mechanism and transforms it further to enhance the model’s understanding.
— Additionally, the encoder layer includes residual connections around each sub-layer, allowing information to pass through the layer directly. Afterward, layer normalization is applied to stabilize the learning process and prevent issues like overfitting.
— The stack of these layers works together to refine and combine the features from each word in the input sequence, helping the model understand the overall structure and meaning of the sentence.

Multi-Headed Self-Attention Mechanism

— In the encoder, the multi-headed attention uses a specialized attention mechanism known as self-attention. This allows the model to relate each word in the input to other words in the sequence. For example, the model might learn to associate the word “are” with “you” in a given sentence.
— The self-attention mechanism enables the encoder to focus on different parts of the input sequence while processing each token. It computes attention scores based on the following components :
— Query : A vector representing a specific word or token from the input sequence in the attention mechanism.
— Key : A vector corresponding to each word or token in the input sequence, used to compare with the query.
— Value : Each value is associated with a key and is used to construct the output of the attention layer. When a query and a key match well (i.e., have a high attention score), the corresponding value is emphasized in the output.
— This first self-attention module allows the model to capture contextual information from the entire sequence. Instead of performing a single attention function, queries, keys, and values are linearly projected h times. For each of these projected versions of the queries, keys, and values, the attention mechanism is performed in parallel, producing h separate output vectors.

Figure 13 – Query, Key, and Value

Matrix Multiplication (MatMul) - Dot Product of Query and Key

— Once the query, key, and value vectors are passed through a linear layer, a dot product matrix multiplication is performed between the queries and keys, resulting in the creation of a score matrix.
— The score matrix determines the degree of emphasis each word should place on other words. Each word is assigned a score in relation to the other words within the same time step. A higher score indicates greater focus or attention on that particular word.
— This process effectively maps the queries to their corresponding keys, allowing the model to establish how much attention should be given to each word in the sequence (a small sketch below illustrates this).
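To see how queries, keys, and values are produced and compared, here is a minimal NumPy sketch ; the projection matrices are random stand-ins for the learned weights, and the dimensions are purely illustrative :

import numpy as np

seq_len, d_model, d_k = 4, 8, 8        # 4 tokens, small illustrative dimensions
x = np.random.randn(seq_len, d_model)  # embeddings + positional encodings for 4 tokens

# Learned projection matrices (random stand-ins here).
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q = x @ W_q   # queries: one row per token
K = x @ W_k   # keys:    one row per token
V = x @ W_v   # values:  one row per token

# Dot product of queries and keys -> score matrix (seq_len x seq_len).
# scores[i, j] says how strongly token i should attend to token j.
scores = Q @ K.T
print(scores.shape)   # (4, 4)
print(scores.round(2))

The next steps, scaling, softmax, and weighting the values, are described below and completed in the sketch that follows them.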
Figure 14 – Dot multiplication

Reducing the Magnitude of attention scores

The scores are then scaled down by dividing them by the square root of the dimension of the query and key vectors. This step is implemented to ensure more stable gradients, as the multiplication of values can otherwise lead to excessively large scores, which can destabilize the learning process.

Applying Softmax to the Adjusted Scores

Subsequently, a softmax function is applied to the adjusted scores to obtain the attention weights. This results in probability values ranging from 0 to 1. The softmax function emphasizes higher scores while diminishing lower scores, thereby enhancing the model’s ability to effectively determine which words should receive more attention.

Figure 15 – Softmax application

Combining Softmax Results with the Value Vector

In the following step of the attention mechanism, the weights derived from the softmax function are multiplied by the value vectors, resulting in an output vector. In this process, only the words with high softmax scores are preserved. Finally, this output vector is fed into a linear layer for further processing.

Figure 16 – Combining the softmax weights with the value vectors

And we finally get the output of the Attention mechanism ! So, you might be wondering why it’s called Multi-Head Attention ? Remember that before the whole process starts, we split our queries, keys and values h times. This process, known as self-attention, happens separately in each of these smaller stages or "heads". Each head works its magic independently, conjuring up an output vector. This ensemble passes through a final linear layer, much like a filter that fine-tunes their collective performance. The beauty here lies in the diversity of learning across each head, enriching the encoder model with a robust and multifaceted understanding.

Normalization and Residual connections

Each sub-layer in an encoder layer is followed by a normalization step. Also, each sub-layer output is added to its input (residual connection) to help mitigate the vanishing gradient problem, allowing deeper models. This process is repeated after the Feed-Forward Neural Network as well.

Figure 17 – Normalization

Feedforward Neural Networks

The journey of the normalized residual output continues as it navigates through a pointwise feed-forward network, a crucial phase for additional refinement. Picture this network as a duo of linear layers, with a ReLU activation nestled in between them, acting as a bridge. Once processed, the output embarks on a familiar path : it loops back and merges with the input of the pointwise feed-forward network. This reunion is followed by another round of normalization, ensuring everything is well-adjusted and in sync for the next steps.

Figure 18 – Feedforward Neural network

Output of the encoder

The output of the final encoder layer is a set of vectors, each representing the input sequence with a rich contextual understanding. This output is then used as the input for the decoder in a Transformer model. This careful encoding paves the way for the decoder, guiding it to pay attention to the right words in the input when it’s time to decode. Think of it like building a tower, where you can stack up N encoder layers. Each layer in this stack gets a chance to explore and learn different facets of attention, much like layers of knowledge. This not only diversifies the understanding but can significantly amplify the predictive capabilities of the transformer network.
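Before moving to the recap below, here is how the attention steps described in this section (scoring, scaling, softmax, and value weighting) fit together. This is a minimal NumPy sketch with random toy matrices standing in for the projected queries, keys, and values :

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # dot product + scaling
    weights = softmax(scores, axis=-1)         # attention weights between 0 and 1
    return weights @ V, weights                # weighted sum of the values

# Toy example: 4 tokens, key/value dimension 8.
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))        # each row sums to 1
print(output.shape)            # (4, 8): one context-aware vector per token

In multi-head attention, this same computation runs in parallel on h smaller projections of Q, K and V, and the h outputs are concatenated and passed through a final linear layer, as described above.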
3.4 What you need to remember about encoders

— General Architecture : The encoder consists of a stack of identical layers, typically 6 in the original Transformer model. Each layer refines input representations, capturing contextual information across the sequence.
— Input Embedding : Transforms each token in the input sequence into a fixed-size numerical vector. These embeddings represent the semantic meaning of the tokens for further processing.
— Positional Encoding : Adds positional encodings to the input embeddings to provide information about the position of each token in the sequence. The encodings use sine and cosine functions, enabling the model to capture token order.
— Stack of Encoder Layers : Comprises multiple identical layers that progressively refine and enhance the representations from the previous layer, learning deeper relationships within the sequence.
— Multi-Headed Attention Mechanism : Enables the encoder to focus on different parts of the sequence for each token. It splits the attention mechanism into multiple heads to learn diverse relationships simultaneously.
— Matrix Multiplication and Dot Product of Query and Key : Calculates attention scores by performing a dot product between query and key vectors. This determines the degree of similarity between tokens, identifying relevant relationships.
— Reducing the Magnitude of Attention Scores : Scales the attention scores by dividing them by the square root of the key vector’s dimension, ensuring numerical stability and smoother gradient updates.
— Combining Softmax Results with the Value Vector : The scaled scores are passed through a softmax function, producing attention weights (probabilities). These weights are combined with value vectors to generate the attention mechanism’s output.
— Normalization and Residual Connection : Residual connections add the input of each sub-layer to its output, allowing direct information flow. Layer normalization ensures consistent scaling and prevents divergence during training.
— Feedforward Neural Networks : Applies non-linear transformations to refine the representation further. It consists of two linear transformations with a ReLU activation in between.
— Output of the Encoder : Produces a continuous representation of the input sequence, where each token is encoded into a vector that captures its meaning and relationships with other tokens.

3.5 The decoder workflow

The decoder is designed to generate text sequences. Similar to the encoder, it consists of sub-layers that include two multi-headed attention mechanisms and a point-wise feed-forward network. Each sub-layer is followed by residual connections and layer normalization, ensuring efficient processing and stability.

Figure 19 – Decoder architecture

The decoder works similarly to the encoder’s layers but with a key difference : each multi-headed attention layer in the decoder has a specific purpose. At the end of the decoding process, a linear layer acts as a classifier, followed by a softmax function that calculates the probabilities of possible words.

The Transformer’s decoder is designed to generate output step by step by processing the encoded information. It works in an autoregressive way, starting with a special token (the start token) and using previously generated words as inputs. It also incorporates the encoder’s outputs, which carry important information about the original input. This process continues in sequence until the decoder generates a special token indicating the end of the output.
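The autoregressive loop just described can be sketched in a few lines of Python. Everything here is hypothetical scaffolding : decode_step stands in for a full decoder forward pass and the token IDs are arbitrary ; the point is simply the start-token / feed-back / end-token loop :

import numpy as np

VOCAB_SIZE = 1000
START_TOKEN, END_TOKEN = 1, 2   # arbitrary example IDs

def decode_step(encoder_output, generated_tokens):
    """Stand-in for one decoder forward pass: returns a probability
    distribution over the vocabulary for the next token."""
    logits = np.random.randn(VOCAB_SIZE)          # a real decoder would compute these
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                         # softmax -> probabilities

def generate(encoder_output, max_len=20):
    tokens = [START_TOKEN]                         # start with the special start token
    for _ in range(max_len):
        probs = decode_step(encoder_output, tokens)
        next_token = int(np.argmax(probs))         # pick the most probable word
        tokens.append(next_token)                  # feed the prediction back in
        if next_token == END_TOKEN:                # stop when the end token appears
            break
    return tokens

encoder_output = np.random.randn(4, 8)             # stand-in for the encoder's vectors
print(generate(encoder_output))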
Output Embeddings

The decoder starts by converting the input tokens into numerical vectors using an embedding layer, similar to the encoder. This step turns words into a format the model can understand.

Positional Encoding

Next, the input goes through a positional encoding layer, which adds information about the order of the tokens. This helps the model know which word comes first, second, and so on. These positional embeddings are then passed into the first multi-head attention layer, where the model calculates attention scores to understand how different parts of the input relate to each other.

Stack of Decoder Layers

The decoder is made up of several identical layers (6 in the original Transformer model). Each layer contains three main parts : the Masked Self-Attention Mechanism, the Encoder-Decoder Multi-Head Attention (or Cross Attention), and a Feed-Forward Neural Network.

Masked self-attention mechanism

The masked self-attention mechanism functions similarly to the self-attention used in the encoder but with a critical modification : it prevents each position from attending to future positions in the sequence. This ensures that predictions for a particular word are based only on the words that precede it. For instance, when computing attention scores for the word “are”, the mechanism ensures that it does not take into account the word “you”, which appears later in the sequence. As illustrated in the figure :
— First, scaled scores are computed.
— Next, a look-ahead mask is applied, where values corresponding to future positions are replaced with negative infinity (−∞).
— The resulting masked scores ensure that attention is restricted to known outputs up to the current position.
This masking guarantees that the decoder operates autoregressively, relying only on prior context to generate its output (a small sketch of the look-ahead mask follows this section).

Figure 20 – Masked self-attention mechanism

Encoder-Decoder Multi-Head Attention or Cross Attention

In the second multi-headed attention layer of the decoder, the encoder and decoder collaborate in a unique way. The encoder’s outputs act as both keys and values, while the queries come from the outputs of the first multi-headed attention layer of the decoder. This arrangement aligns the encoded input with the decoder’s ongoing process, enabling the decoder to focus on the most relevant information from the encoder’s input. The resulting output from this attention mechanism is then passed through a pointwise feedforward layer, which further refines the information and enhances processing.

In this sub-layer, queries are derived from the previous decoder layer, while keys and values are obtained from the encoder’s output. This mechanism allows every position in the decoder to attend to all positions in the input sequence, seamlessly integrating the encoder’s learned information with the decoder’s generated context.

Figure 21 – Decoder’s workflow. Encoder-Decoder Attention.

Feed-Forward Neural Network

Similar to the encoder, each decoder layer includes a fully connected feed-forward network, applied to each position separately and identically.

Generating Output with Linear Classifier and Softmax

The final step in the transformer’s workflow happens in the last linear layer, which acts as a classifier. This layer’s size matches the total number of possible words (the vocabulary size). For example, if there are 1,000 words in the vocabulary, the output will be an array with 1,000 values. These values are passed through a softmax layer, which converts them into probabilities between 0 and 1. The word with the highest probability is the model’s prediction for the next word in the sequence.
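Here is a minimal NumPy sketch of the look-ahead mask described above : positions after the current one are set to negative infinity before the softmax, so their attention weights become zero. The score matrix is random, purely for illustration :

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)        # scaled attention scores (toy values)

# Look-ahead mask: -inf above the diagonal, i.e. for every future position.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
masked_scores = scores + mask

weights = softmax(masked_scores, axis=-1)
print(weights.round(2))
# Row i has non-zero weights only for positions 0..i:
# each word can attend to itself and the words before it, never to later ones.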
Figure 22 – Decoder’s workflow. Transformer’s final output.

Final steps

Normalization and Residual Connections
Each sub-layer in the decoder (masked self-attention, encoder-decoder attention, feed-forward network) includes two important components :
— Normalization : Ensures consistent scaling of outputs, improving the model’s stability during training.
— Residual Connections : Adds the sub-layer’s input back to its output, helping retain original information while allowing new patterns to be learned.

Output of the Decoder
The final layer of the decoder generates a predicted sequence by passing its output through :
— A linear layer, which transforms the output into scores corresponding to the vocabulary size.
— A softmax function, which converts these scores into probabilities for all words in the vocabulary. The word with the highest probability is selected as the prediction.
The decoder operates in a loop, using its own predictions as inputs for the next step. This process continues until a special token (such as an end token) is predicted, signaling the sequence is complete.

Layered Structure of the Decoder
The decoder consists of multiple stacked layers (e.g., 6 in the original Transformer). Each layer builds upon :
— The outputs from the encoder.
— The outputs from the previous decoder layers.
This layered architecture enables the decoder to focus on different patterns and relationships using multiple attention heads, improving its predictive capabilities. By capturing complex dependencies in the input, the model develops a nuanced understanding of attention patterns, resulting in more accurate predictions.

3.6 What you need to remember about the decoders

— Input Embedding : The decoder begins by converting input tokens (e.g., words or subwords) into embeddings using an embedding layer, similar to the encoder.
— Positional Encoding : Positional encodings are added to embeddings to provide positional context, allowing the model to understand the order of tokens.
— Stack of Layers : The decoder is composed of N identical layers (6 in the original model). Each layer processes information sequentially, enhancing it step by step.
— Masked Self-Attention : A modified self-attention mechanism that prevents tokens from attending to future tokens, ensuring the model predicts sequentially.
— Encoder-Decoder Attention : Attention mechanism where queries come from the decoder, and keys and values come from the encoder, allowing the decoder to focus on relevant encoder outputs.
— Feed-Forward Network : A fully connected feed-forward network processes the outputs of the attention mechanisms, refining the representation further.
— Residual Connections and Normalization : Each sub-layer includes residual connections to help with gradient flow and is followed by layer normalization for stability.
— Linear Classifier and Softmax : The final layer transforms the processed information into vocabulary probabilities using a linear layer and softmax function. The highest probability corresponds to the next predicted token.
— Autoregressive Decoding : The decoder generates outputs one token at a time, using previously generated tokens as input, and stops when an end token is predicted.

4 Real-life transformers : ChatGPT

GPT and ChatGPT, developed by OpenAI, are powerful generative AI models renowned for their ability to produce clear and contextually relevant text. GPT-1, the first version, was launched in June 2018, while GPT-3, a groundbreaking iteration, was released in 2020. These models excel in diverse tasks such as content creation, conversation, language translation, and more. Their architecture allows them to generate text that mimics human writing, making them valuable in areas like creative writing, customer service, and coding assistance. ChatGPT, a specialized variant designed for conversational use, stands out for its ability to produce natural and human-like dialogue. This makes it especially useful for chatbots, virtual assistants, and other interactive applications.

5 Conclusion

In conclusion, Transformers have emerged as a monumental breakthrough in the fields of artificial intelligence and NLP. By effectively managing sequential data through their unique self-attention mechanism, these models have outperformed traditional RNNs. Their ability to handle long sequences more efficiently and to parallelize data processing significantly accelerates training. Pioneering models like Google’s BERT and OpenAI’s GPT series exemplify the transformative impact of Transformers in enhancing search engines and generating human-like text. As a result, they have become indispensable in modern machine learning, driving forward the boundaries of AI and opening new avenues in technological advancements.