2b3da1d0-5dfe-11ef-8703-8e1c4cca5750.pdf
Document Details
Uploaded by SolidAlgorithm
Tags
Related
- SwinTRG: Swin Transformer Based Radiology Report Generation for Chest X-rays PDF
- Large Language Models PDF
- Chapter 3 Introduction to AI, Machine Learning, Deep Learning, and Large Language Models (LLMs).pdf
- Chapter 3 Introduction to AI, Machine Learning, Deep Learning, and Large Language Models (LLMs).pdf
- InstructGPT PDF - Training Language Models for Instructions (NeurIPS 2022)
- A Survey of Large Language Models PDF
Full Transcript
Build a Large Language Model (From Scratch) 1. welcome 2. 1_Understanding_Large_Language_Models 3. 2_Working_with_Text_Data 4. 3_Coding_Attention_Mechanisms 5. 4_Implementing_a_GPT_model_from_Scratch_To_Generate_Text 6. 5_Pretraining_on_Unlabeled_Data 7. Appendix_A._Introducti...
Build a Large Language Model (From Scratch) 1. welcome 2. 1_Understanding_Large_Language_Models 3. 2_Working_with_Text_Data 4. 3_Coding_Attention_Mechanisms 5. 4_Implementing_a_GPT_model_from_Scratch_To_Generate_Text 6. 5_Pretraining_on_Unlabeled_Data 7. Appendix_A._Introduction_to_PyTorch 8. Appendix_B._References_and_Further_Reading 9. Appendix_C._Exercise_Solutions 10. Appendix_D._Adding_Bells_and_Whistles_to_the_Training_Loop welcome Thank you for purchasing the MEAP edition of Build a Large Language Model (From Scratch). In this book, I invite you to embark on an educational journey with me to learn how to build Large Language Models (LLMs) from the ground up. Together, we'll delve deep into the LLM training pipeline, starting from data loading and culminating in finetuning LLMs on custom datasets. For many years, I've been deeply immersed in the world of deep learning, coding LLMs, and have found great joy in explaining complex concepts thoroughly. This book has been a long-standing idea in my mind, and I'm thrilled to finally have the opportunity to write it and share it with you. Those of you familiar with my work, especially from my blog, have likely seen glimpses of my approach to coding from scratch. This method has resonated well with many readers, and I hope it will be equally effective for you. I've designed the book to emphasize hands-on learning, primarily using PyTorch and without relying on pre-existing libraries. With this approach, coupled with numerous figures and illustrations, I aim to provide you with a thorough understanding of how LLMs work, their limitations, and customization methods. Moreover, we'll explore commonly used workflows and paradigms in pretraining and fine-tuning LLMs, offering insights into their development and customization. The book is structured with detailed step-by-step introductions, ensuring no critical detail is overlooked. To gain the most from this book, you should have a background in Python programming. Prior experience in deep learning and a foundational understanding of PyTorch, or familiarity with other deep learning frameworks like TensorFlow, will be beneficial. I warmly invite you to engage in the liveBook discussion forum for any questions, suggestions, or feedback you might have. Your contributions are immensely valuable and appreciated in enhancing this learning journey. — Sebastian Raschka In this book welcome 1 Understanding Large Language Models 2 Working with Text Data 3 Coding Attention Mechanisms 4 Implementing a GPT model from Scratch To Generate Text 5 Pretraining on Unlabeled Data Appendix A. Introduction to PyTorch Appendix B. References and Further Reading Appendix C. Exercise Solutions Appendix D. Adding Bells and Whistles to the Training Loop 1 Understanding Large Language Models This chapter covers High-level explanations of the fundamental concepts behind large language models (LLMs) Insights into the transformer architecture from which ChatGPT-like LLMs are derived A plan for building an LLM from scratch Large language models (LLMs), such as those offered in OpenAI's ChatGPT, are deep neural network models that have been developed over the past few years. They ushered in a new era for Natural Language Processing (NLP). Before the advent of large language models, traditional methods excelled at categorization tasks such as email spam classification and straightforward pattern recognition that could be captured with handcrafted rules or simpler models. However, they typically underperformed in language tasks that demanded complex understanding and generation abilities, such as parsing detailed instructions, conducting contextual analysis, or creating coherent and contextually appropriate original text. For example, previous generations of language models could not write an email from a list of keywords—a task that is trivial for contemporary LLMs. LLMs have remarkable capabilities to understand, generate, and interpret human language. However, it's important to clarify that when we say language models "understand," we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension. Enabled by advancements in deep learning, which is a subset of machine learning and artificial intelligence (AI) focused on neural networks, LLMs are trained on vast quantities of text data. This allows LLMs to capture deeper contextual information and subtleties of human language compared to previous approaches. As a result, LLMs have significantly improved performance in a wide range of NLP tasks, including text translation, sentiment analysis, question answering, and many more. Another important distinction between contemporary LLMs and earlier NLP models is that the latter were typically designed for specific tasks; whereas those earlier NLP models excelled in their narrow applications, LLMs demonstrate a broader proficiency across a wide range of NLP tasks. The success behind LLMs can be attributed to the transformer architecture which underpins many LLMs, and the vast amounts of data LLMs are trained on, allowing them to capture a wide variety of linguistic nuances, contexts, and patterns that would be challenging to manually encode. This shift towards implementing models based on the transformer architecture and using large training datasets to train LLMs has fundamentally transformed NLP, providing more capable tools for understanding and interacting with human language. Beginning with this chapter, we set the foundation to accomplish the primary objective of this book: understanding LLMs by implementing a ChatGPT- like LLM based on the transformer architecture step by step in code. 1.1 What is an LLM? An LLM, a large language model, is a neural network designed to understand, generate, and respond to human-like text. These models are deep neural networks trained on massive amounts of text data, sometimes encompassing large portions of the entire publicly available text on the internet. The "large" in large language model refers to both the model's size in terms of parameters and the immense dataset on which it's trained. Models like this often have tens or even hundreds of billions of parameters, which are the adjustable weights in the network that are optimized during training to predict the next word in a sequence. Next-word prediction is sensible because it harnesses the inherent sequential nature of language to train models on understanding context, structure, and relationships within text. Yet, it is a very simple task and so it is surprising to many researchers that it can produce such capable models. We will discuss and implement the next-word training procedure in later chapters step by step. LLMs utilize an architecture called the transformer (covered in more detail in section 1.4), which allows them to pay selective attention to different parts of the input when making predictions, making them especially adept at handling the nuances and complexities of human language. Since LLMs are capable of generating text, LLMs are also often referred to as a form of generative artificial intelligence (AI), often abbreviated as generative AI or GenAI. As illustrated in Figure 1.1, AI encompasses the broader field of creating machines that can perform tasks requiring human- like intelligence, including understanding language, recognizing patterns, and making decisions, and includes subfields like machine learning and deep learning. Figure 1.1 As this hierarchical depiction of the relationship between the different fields suggests, LLMs represent a specific application of deep learning techniques, leveraging their ability to process and generate human-like text. Deep learning is a specialized branch of machine learning that focuses on using multi-layer neural networks. And machine learning and deep learning are fields aimed at implementing algorithms that enable computers to learn from data and perform tasks that typically require human intelligence. The algorithms used to implement AI are the focus of the field of machine learning. Specifically, machine learning involves the development of algorithms that can learn from and make predictions or decisions based on data without being explicitly programmed. To illustrate this, imagine a spam filter as a practical application of machine learning. Instead of manually writing rules to identify spam emails, a machine learning algorithm is fed examples of emails labeled as spam and legitimate emails. By minimizing the error in its predictions on a training dataset, the model then learns to recognize patterns and characteristics indicative of spam, enabling it to classify new emails as either spam or legitimate. As illustrated in Figure 1.1, deep learning is a subset of machine learning that focuses on utilizing neural networks with three or more layers (also called deep neural networks) to model complex patterns and abstractions in data. In contrast to deep learning, traditional machine learning requires manual feature extraction. This means that human experts need to identify and select the most relevant features for the model. While the field of AI is nowadays dominated by machine learning and deep learning, it also includes other approaches, for example, using rule-based systems, genetic algorithms, expert systems, fuzzy logic, or symbolic reasoning. Returning to the spam classification example, in traditional machine learning, human experts might manually extract features from email text such as the frequency of certain trigger words ("prize," "win," "free"), the number of exclamation marks, use of all uppercase words, or the presence of suspicious links. This dataset, created based on these expert-defined features, would then be used to train the model. In contrast to traditional machine learning, deep learning does not require manual feature extraction. This means that human experts do not need to identify and select the most relevant features for a deep learning model. (However, in both traditional machine learning and deep learning for spam classification, you still require the collection of labels, such as spam or non-spam, which need to be gathered either by an expert or users.) The upcoming sections will cover some of the problems LLMs can solve today, the challenges that LLMs address, and the general LLM architecture, which we will implement in this book. 1.2 Applications of LLMs Owing to their advanced capabilities to parse and understand unstructured text data, LLMs have a broad range of applications across various domains. Today, LLMs are employed for machine translation, generation of novel texts (see Figure 1.2), sentiment analysis, text summarization, and many other tasks. LLMs have recently been used for content creation, such as writing fiction, articles, and even computer code. Figure 1.2 LLM interfaces enable natural language communication between users and AI systems. This screenshot shows ChatGPT writing a poem according to a user's specifications. LLMs can also power sophisticated chatbots and virtual assistants, such as OpenAI's ChatGPT or Google's Gemini (formerly called Bard), which can answer user queries and augment traditional search engines such as Google Search or Microsoft Bing. Moreover, LLMs may be used for effective knowledge retrieval from vast volumes of text in specialized areas such as medicine or law. This includes sifting through documents, summarizing lengthy passages, and answering technical questions. In short, LLMs are invaluable for automating almost any task that involves parsing and generating text. Their applications are virtually endless, and as we continue to innovate and explore new ways to use these models, it's clear that LLMs have the potential to redefine our relationship with technology, making it more conversational, intuitive, and accessible. In this book, we will focus on understanding how LLMs work from the ground up, coding an LLM that can generate texts. We will also learn about techniques that allow LLMs to carry out queries, ranging from answering questions to summarizing text, translating text into different languages, and more. In other words, in this book, we will learn how complex LLM assistants such as ChatGPT work by building one step by step. 1.3 Stages of building and using LLMs Why should we build our own LLMs? Coding an LLM from the ground up is an excellent exercise to understand its mechanics and limitations. Also, it equips us with the required knowledge for pretraining or finetuning existing open-source LLM architectures to our own domain-specific datasets or tasks. Research has shown that when it comes to modeling performance, custom- built LLMs—those tailored for specific tasks or domains—can outperform general-purpose LLMs, such as those provided by ChatGPT, which are designed for a wide array of applications. Examples of this include BloombergGPT, which is specialized for finance, and LLMs that are tailored for medical question answering (please see the Further Reading and References section in Appendix B for more details). The general process of creating an LLM includes pretraining and finetuning. The term "pre" in "pretraining" refers to the initial phase where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language. This pretrained model then serves as a foundational resource that can be further refined through finetuning, a process where the model is specifically trained on a narrower dataset that is more specific to particular tasks or domains. This two-stage training approach consisting of pretraining and finetuning is depicted in Figure 1.3. Figure 1.3 Pretraining an LLM involves next-word prediction on large text datasets. A pretrained LLM can then be finetuned using a smaller labeled dataset. As illustrated in Figure 1.3, the first step in creating an LLM is to train it in on a large corpus of text data, sometimes referred to as raw text. Here, "raw" refers to the fact that this data is just regular text without any labeling information. (Filtering may be applied, such as removing formatting characters or documents in unknown languages.) This first training stage of an LLM is also known as pretraining, creating an initial pretrained LLM, often called a base or foundation model. A typical example of such a model is the GPT-3 model (the precursor of the original model offered in ChatGPT). This model is capable of text completion, that is, finishing a half-written sentence provided by a user. It also has limited few- shot capabilities, which means it can learn to perform new tasks based on only a few examples instead of needing extensive training data. This is further illustrated in the next section, Using transformers for different tasks. After obtaining a pretrained LLM from training on large text datasets, where the LLM is trained to predict the next word in the text, we can further train the LLM on labeled data, also known as finetuning. The two most popular categories of finetuning LLMs include instruction- finetuning and finetuning for classification tasks. In instruction-finetuning, the labeled dataset consists of instruction and answer pairs, such as a query to translate a text accompanied by the correctly translated text. In classification finetuning, the labeled dataset consists of texts and associated class labels, for example, emails associated with spam and non-spam labels. In this book, we will cover both code implementations for pretraining and finetuning LLM, and we will delve deeper into the specifics of instruction- finetuning and finetuning for classification later in this book after pretraining a base LLM. 1.4 Using LLMs for different tasks Most modern LLMs rely on the transformer architecture, which is a deep neural network architecture introduced in the 2017 paper Attention Is All You Need. To understand LLMs we briefly have to go over the original transformer, which was originally developed for machine translation, translating English texts to German and French. A simplified version of the transformer architecture is depicted in Figure 1.4. Figure 1.4 A simplified depiction of the original transformer architecture, which is a deep learning model for language translation. The transformer consists of two parts, an encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text that the decoder can use to generate the translated text one word at a time. Note that this figure shows the final stage of the translation process where the decoder has to generate only the final word ("Beispiel"), given the original input text ("This is an example") and a partially translated sentence ("Das ist ein"), to complete the translation. The transformer architecture depicted in Figure 1.4 consists of two submodules, an encoder and a decoder. The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input. Then, the decoder module takes these encoded vectors and generates the output text from them. In a translation task, for example, the encoder would encode the text from the source language into vectors, and the decoder would decode these vectors to generate text in the target language. Both the encoder and decoder consist of many layers connected by a so-called self-attention mechanism. You may have many questions regarding how the inputs are preprocessed and encoded. These will be addressed in a step-by-step implementation in the subsequent chapters. A key component of transformers and LLMs is the self-attention mechanism (not shown), which allows the model to weigh the importance of different words or tokens in a sequence relative to each other. This mechanism enables the model to capture long-range dependencies and contextual relationships within the input data, enhancing its ability to generate coherent and contextually relevant output. However, due to its complexity, we will defer the explanation to chapter 3, where we will discuss and implement it step by step. Moreover, we will also discuss and implement the data preprocessing steps to create the model inputs in chapter 2, Working with Text Data. Later variants of the transformer architecture, such as the so-called BERT (short for bidirectional encoder representations from transformers) and the various GPT models (short for generative pretrained transformers), built on this concept to adapt this architecture for different tasks. (References can be found in Appendix B.) BERT, which is built upon the original transformer's encoder submodule, differs in its training approach from GPT. While GPT is designed for generative tasks, BERT and its variants specialize in masked word prediction, where the model predicts masked or hidden words in a given sentence as illustrated in Figure 1.5. This unique training strategy equips BERT with strengths in text classification tasks, including sentiment prediction and document categorization. As an application of its capabilities, as of this writing, Twitter uses BERT to detect toxic content. Figure 1.5 A visual representation of the transformer's encoder and decoder submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification. On the right, the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences. GPT, on the other hand, focuses on the decoder portion of the original transformer architecture and is designed for tasks that require generating texts. This includes machine translation, text summarization, fiction writing, writing computer code, and more. We will discuss the GPT architecture in more detail in the remaining sections of this chapter and implement it from scratch in this book. GPT models, primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities. These models are adept at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior specific examples. On the other hand, few-shot learning involves learning from a minimal number of examples the user provides as input, as shown in Figure 1.6. Figure 1.6 In addition to text completion, GPT-like LLMs can solve various tasks based on their inputs without needing retraining, finetuning, or task-specific model architecture changes. Sometimes, it is helpful to provide examples of the target within the input, which is known as a few-shot setting. However, GPT-like LLMs are also capable of carrying out tasks without a specific example, which is called zero-shot setting. Transformers versus LLMs Today's LLMs are based on the transformer architecture introduced in the previous section. Hence, transformers and LLMs are terms that are often used synonymously in the literature. However, note that not all transformers are LLMs since transformers can also be used for computer vision. Also, not all LLMs are transformers, as there are large language models based on recurrent and convolutional architectures. The main motivation behind these alternative approaches is to improve the computational efficiency of LLMs. However, whether these alternative LLM architectures can compete with the capabilities of transformer-based LLMs and whether they are going to be adopted in practice remains to be seen. (Interested readers can find literature references describing these architectures in the Further Reading section at the end of this chapter.) 1.5 Utilizing large datasets The large training datasets for popular GPT- and BERT-like models represent diverse and comprehensive text corpora encompassing billions of words, which include a vast array of topics and natural and computer languages. To provide a concrete example, Table 1.1 summarizes the dataset used for pretraining GPT-3, which served as the base model for the first version of ChatGPT. Table 1.1 The pretraining dataset of the popular GPT-3 LLM Dataset name Dataset description Number of Proportion in tokens training data CommonCrawl Web crawl data 410 billion 60% (filtered) WebText2 Web crawl data 19 billion 22% Books1 Internet-based book 12 billion 8% corpus Books2 Internet-based book 55 billion 8% corpus Wikipedia High-quality text 3 billion 3% Table 1.1 reports the number of tokens, where a token is a unit of text that a model reads, and the number of tokens in a dataset is roughly equivalent to the number of words and punctuation characters in the text. We will cover tokenization, the process of converting text into tokens, in more detail in the next chapter. The main takeaway is that the scale and diversity of this training dataset allows these models to perform well on diverse tasks including language syntax, semantics, and context, and even some requiring general knowledge. GPT-3 dataset details In Table 1.1, it's important to note that from each dataset, only a fraction of the data, (amounting to a total of 300 billion tokens) was used in the training process. This sampling approach means that the training didn't encompass every single piece of data available in each dataset. Instead, a selected subset of 300 billion tokens, drawn from all datasets combined, was utilized. Also, while some datasets were not entirely covered in this subset, others might have been included multiple times to reach the total count of 300 billion tokens. The column indicating proportions in the table, when summed up without considering rounding errors, accounts for 100% of this sampled data. For context, consider the size of the CommonCrawl dataset, which alone consists of 410 billion tokens and requires about 570 GB of storage. In comparison, later iterations of models like GPT-3, such as Meta's LLaMA, have expanded their training scope to include additional data sources like Arxiv research papers (92 GB) and StackExchange's code-related Q&As (78 GB). The Wikipedia corpus consists of English-language Wikipedia. While the authors of the GPT-3 paper didn't further specify the details, Books1 is likely a sample from Project Gutenberg (https://www.gutenberg.org/), and Books2 is likely from Libgen (https://en.wikipedia.org/wiki/Library_Genesis). CommonCrawl is a filtered subset of the CommonCrawl database (https://commoncrawl.org/), and WebText2 is the text of web pages from all outbound Reddit links from posts with 3+ upvotes. The authors of the GPT-3 paper did not share the training dataset but a comparable dataset that is publicly available is The Pile (https://pile.eleuther.ai/). However, the collection may contain copyrighted works, and the exact usage terms may depend on the intended use case and country. For more information, see the HackerNews discussion at https://news.ycombinator.com/item?id=25607809. The pretrained nature of these models makes them incredibly versatile for further finetuning on downstream tasks, which is why they are also known as base or foundation models. Pretraining LLMs requires access to significant resources and is very expensive. For example, the GPT-3 pretraining cost is estimated to be $4.6 million in terms of cloud computing credits. The good news is that many pretrained LLMs, available as open-source models, can be used as general purpose tools to write, extract, and edit texts that were not part of the training data. Also, LLMs can be finetuned on specific tasks with relatively smaller datasets, reducing the computational resources needed and improving performance on the specific task. In this book, we will implement the code for pretraining and use it to pretrain an LLM for educational purposes. All computations will be executable on consumer hardware. After implementing the pretraining code we will learn how to reuse openly available model weights and load them into the architecture we will implement, allowing us to skip the expensive pretraining stage when we finetune LLMs later in this book. 1.6 A closer look at the GPT architecture Previously in this chapter, we mentioned the terms GPT-like models, GPT-3, and ChatGPT. Let's now take a closer look at the general GPT architecture. First, GPT stands for Generative Pretrained Transformer and was originally introduced in the following paper: Improving Language Understanding by Generative Pre-Training (2018) by Radford et al. from OpenAI, http://cdn.openai.com/research- covers/language-unsupervised/language_understanding_paper.pdf GPT-3 is a scaled-up version of this model that has more parameters and was trained on a larger dataset. And the original model offered in ChatGPT was created by finetuning GPT-3 on a large instruction dataset using a method from OpenAI's InstructGPT paper, which we will cover in more detail in chapter 7, Finetuning with Human Feedback To Follow Instructions. As we have seen earlier in Figure 1.6, these models are competent text completion models and can carry out other tasks such as spelling correction, classification, or language translation. This is actually very remarkable given that GPT models are pretrained on a relatively simple next-word prediction task, as illustrated in Figure 1.7. Figure 1.7 In the next-word pretraining task for GPT models, the system learns to predict the upcoming word in a sentence by looking at the words that have come before it. This approach helps the model understand how words and phrases typically fit together in language, forming a foundation that can be applied to various other tasks. The next-word prediction task is a form of self-supervised learning, which is a form of self-labeling. This means that we don't need to collect labels for the training data explicitly but can leverage the structure of the data itself: we can use the next word in a sentence or document as the label that the model is supposed to predict. Since this next-word prediction task allows us to create labels "on the fly," it is possible to leverage massive unlabeled text datasets to train LLMs as previously discussed in section 1.5, Utilizing large datasets. Compared to the original transformer architecture we covered in section 1.4, Using LLMs for different tasks, the general GPT architecture is relatively simple. Essentially, it's just the decoder part without the encoder as illustrated in Figure 1.8. Since decoder-style models like GPT generate text by predicting text one word at a time, they are considered a type of autoregressive model. Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves coherence of the resulting text. Architectures such as GPT-3 are also significantly larger than the original transformer model. For instance, the original transformer repeated the encoder and decoder blocks six times. GPT-3 has 96 transformer layers and 175 billion parameters in total. Figure 1.8 The GPT architecture employs only the decoder portion of the original transformer. It is designed for unidirectional, left-to-right processing, making it well-suited for text generation and next-word prediction tasks to generate text in iterative fashion one word at a time. GPT-3 was introduced in 2020, which, by the standards of deep learning and large language model (LLM) development, is considered a long time ago. However, more recent architectures, such as Meta's Llama models, are still based on the same underlying concepts, introducing only minor modifications. Hence, understanding GPT remains as relevant as ever, and this book focuses on implementing the prominent architecture behind GPT while providing pointers to specific tweaks employed by alternative LLMs. Lastly, it's interesting to note that although the original transformer model was explicitly designed for language translation, GPT models—despite their larger yet simpler architecture aimed at next-word prediction—are also capable of performing translation tasks. This capability was initially unexpected to researchers, as it emerged from a model primarily trained on a next-word prediction task, which is a task that did not specifically target translation. The ability to perform tasks that the model wasn't explicitly trained to perform is called an "emergent behavior." This capability isn't explicitly taught during training but emerges as a natural consequence of the model's exposure to vast quantities of multilingual data in diverse contexts. The fact that GPT models can "learn" the translation patterns between languages and perform translation tasks even though they weren't specifically trained for it demonstrates the benefits and capabilities of these large-scale, generative language models. We can perform diverse tasks without using diverse models for each. 1.7 Building a large language model In this chapter, we laid the groundwork for understanding LLMs. In the remainder of this book, we will be coding one from scratch. We will take the fundamental idea behind GPT as a blueprint and tackle this in three stages, as outlined in Figure 1.9. Figure 1.9 The stages of building LLMs covered in this book include implementing the LLM architecture and data preparation process, pretraining an LLM to create a foundation model, and finetuning the foundation model to become a personal assistant or text classifier. First, we will learn about the fundamental data preprocessing steps and code the attention mechanism that is at the heart of every LLM. Next, in stage 2, we will learn how to code and pretrain a GPT-like LLM capable of generating new texts. And we will also go over the fundamentals of evaluating LLMs, which is essential for developing capable NLP systems. Note that pretraining a large LLM from scratch is a significant endeavor, demanding thousands to millions of dollars in computing costs for GPT-like models. Therefore, the focus of stage 2 is on implementing training for educational purposes using a small dataset. In addition, the book will also provide code examples for loading openly available model weights. Finally, in stage 3, we will take a pretrained LLM and finetune it to follow instructions such as answering queries or classifying texts -- the most common tasks in many real-world applications and research. I hope you are looking forward to embarking on this exciting journey! 1.8 Summary LLMs have transformed the field of natural language processing, which previously mostly relied on explicit rule-based systems and simpler statistical methods. The advent of LLMs introduced new deep learning- driven approaches that led to advancements in understanding, generating, and translating human language. Modern LLMs are trained in two main steps. First, they are pretrained on a large corpus of unlabeled text by using the prediction of the next word in a sentence as a "label." Then, they are finetuned on a smaller, labeled target dataset to follow instructions or perform classification tasks. LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output one word at a time. The original transformer architecture consists of an encoder for parsing text and a decoder for generating text. LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, only implement decoder modules, simplifying the architecture. Large datasets consisting of billions of words are essential for pretraining LLMs. In this book, we will implement and train LLMs on small datasets for educational purposes but also see how we can load openly available model weights. While the general pretraining task for GPT-like models is to predict the next word in a sentence, these LLMs exhibit "emergent" properties such as capabilities to classify, translate, or summarize texts. Once an LLM is pretrained, the resulting foundation model can be finetuned more efficiently for various downstream tasks. LLMs finetuned on custom datasets can outperform general LLMs on specific tasks. Readers with a background in machine learning may note that labeling information is typically required for traditional machine learning models and deep neural networks trained via the conventional supervised learning paradigm. However, this is not the case for the pretraining stage of LLMs. In this phase, LLMs leverage self-supervised learning, where the model generates its own labels from the input data. This concept is covered later in this chapter GPT-3, The $4,600,000 Language Model, https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_lang 2 Working with Text Data This chapter covers Preparing text for large language model training Splitting text into word and subword tokens Byte pair encoding as a more advanced way of tokenizing text Sampling training examples with a sliding window approach Converting tokens into vectors that feed into a large language model In the previous chapter, we delved into the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, our focus was on decoder-only LLMs based on the transformer architecture, which underlies the models used in ChatGPT and other popular GPT-like LLMs. During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters using a next-word prediction task yields models with impressive capabilities. These models can then be further finetuned to follow general instructions or perform specific target tasks. But before we can implement and train LLMs in the upcoming chapters, we need to prepare the training dataset, which is the focus of this chapter, as illustrated in Figure 2.1 Figure 2.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter will explain and code the data preparation and sampling pipeline that provides the LLM with the text data for pretraining. In this chapter, you'll learn how to prepare input text for training LLMs. This involves splitting text into individual word and subword tokens, which can then be encoded into vector representations for the LLM. You'll also learn about advanced tokenization schemes like byte pair encoding, which is utilized in popular LLMs like GPT. Lastly, we'll implement a sampling and data loading strategy to produce the input-output pairs necessary for training LLMs in subsequent chapters. 2.1 Understanding word embeddings Deep neural network models, including LLMs, cannot process raw text directly. Since text is categorical, it isn't compatible with the mathematical operations used to implement and train neural networks. Therefore, we need a way to represent words as continuous-valued vectors. (Readers unfamiliar with vectors and tensors in a computational context can learn more in Appendix A, section A2.2 Understanding tensors.) The concept of converting data into a vector format is often referred to as embedding. Using a specific neural network layer or another pretrained neural network model, we can embed different data types, for example, video, audio, and text, as illustrated in Figure 2.2. Figure 2.2 Deep learning models cannot process data formats like video, audio, and text in their raw form. Thus, we use an embedding model to transform this raw data into a dense vector representation that deep learning architectures can easily understand and process. Specifically, this figure illustrates the process of converting raw data into a three-dimensional numerical vector. As shown in Figure 2.2, we can process various different data formats via embedding models. However, it's important to note that different data formats require distinct embedding models. For example, an embedding model designed for text would not be suitable for embedding audio or video data. At its core, an embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space -- the primary purpose of embeddings is to convert non-numeric data into a format that neural networks can process. While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for retrieval- augmented generation. Retrieval-augmented generation combines generation (like producing text) with retrieval (like searching an external knowledge base) to pull relevant information when generating text, which is a technique that is beyond the scope of this book. Since our goal is to train GPT-like LLMs, which learn to generate text one word at a time, this chapter focuses on word embeddings. There are several algorithms and frameworks that have been developed to generate word embeddings. One of the earlier and most popular examples is the Word2Vec approach. Word2Vec trained neural network architecture to generate word embeddings by predicting the context of a word given the target word or vice versa. The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings. Consequently, when projected into 2-dimensional word embeddings for visualization purposes, it can be seen that similar terms cluster together, as shown in Figure 2.3. Figure 2.3 If word embeddings are two-dimensional, we can plot them in a two-dimensional scatterplot for visualization purposes as shown here. When using word embedding techniques, such as Word2Vec, words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space compared to countries and cities. Word embeddings can have varying dimensions, from one to thousands. As shown in Figure 2.3, we can choose two-dimensional word embeddings for visualization purposes. A higher dimensionality might capture more nuanced relationships but at the cost of computational efficiency. While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand. We will implement such embedding layers later in this chapter. Furthermore, LLMs can also create contextualized output embeddings, as we discuss in chapter 3. Unfortunately, high-dimensional embeddings present a challenge for visualization because our sensory perception and common graphical representations are inherently limited to three dimensions or fewer, which is why Figure 2.3 showed two-dimensional embeddings in a two-dimensional scatterplot. However, when working with LLMs, we typically use embeddings with a much higher dimensionality than shown in Figure 2.3. For both GPT-2 and GPT-3, the embedding size (often referred to as the dimensionality of the model's hidden states) varies based on the specific model variant and size. It is a trade-off between performance and efficiency. The smallest GPT-2 models (117M and 125M parameters) use an embedding size of 768 dimensions to provide concrete examples. The largest GPT-3 model (175B parameters) uses an embedding size of 12,288 dimensions. The upcoming sections in this chapter will walk through the required steps for preparing the embeddings used by an LLM, which include splitting text into words, converting words into tokens, and turning tokens into embedding vectors. 2.2 Tokenizing text This section covers how we split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM. These tokens are either individual words or special characters, including punctuation characters, as shown in Figure 2.4. Figure 2.4 A view of the text processing steps covered in this section in the context of an LLM. Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters. In upcoming sections, we will convert the text into token IDs and create token embeddings. The text we will tokenize for LLM training is a short story by Edith Wharton called The Verdict, which has been released into the public domain and is thus permitted to be used for LLM training tasks. The text is available on Wikisource at https://en.wikisource.org/wiki/The_Verdict, and you can copy and paste it into a text file, which I copied into a text file "the-verdict.txt" to load using Python's standard file reading utilities: Listing 2.1 Reading in a short story as text sample into Python with open("the-verdict.txt", "r", encoding="utf-8") as f: raw_text = f.read() print("Total number of character:", len(raw_text)) print(raw_text[:99]) Alternatively, you can find this "the-verdict.txt" file in this book's GitHub repository at https://github.com/rasbt/LLMs-from- scratch/tree/main/ch02/01_main-chapter-code. The print command prints the total number of characters followed by the first 100 characters of this file for illustration purposes: Total number of character: 20479 I HAD always thought Jack Gisburn rather a cheap genius--though a good fello Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training in the upcoming chapters. Text sample sizes Note that it's common to process millions of articles and hundreds of thousands of books -- many gigabytes of text -- when working with LLMs. However, for educational purposes, it's sufficient to work with smaller text samples like a single book to illustrate the main ideas behind the text processing steps and to make it possible to run it in reasonable time on consumer hardware. How can we best split this text to obtain a list of tokens? For this, we go on a small excursion and use Python's regular expression library re for illustration purposes. (Note that you don't have to learn or memorize any regular expression syntax since we will transition to a pre-built tokenizer later in this chapter.) Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace characters: import re text = "Hello, world. This, is a test." result = re.split(r'(\s)', text) print(result) The result is a list of individual words, whitespaces, and punctuation characters: ['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.'] Note that the simple tokenization scheme above mostly works for separating the example text into individual words, however, some words are still connected to punctuation characters that we want to have as separate list entries. We also refrain from making all text lowercase because capitalization helps LLMs distinguish between proper nouns and common nouns, understand sentence structure, and learn to generate text with proper capitalization. Let's modify the regular expression splits on whitespaces (\s) and commas, and periods ([,.]): result = re.split(r'([,.]|\s)', text) print(result) We can see that the words and punctuation characters are now separate list entries just as we wanted: ['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' A small remaining issue is that the list still includes whitespace characters. Optionally, we can remove these redundant characters safely as follows: result = [item for item in result if item.strip()] print(result) The resulting whitespace-free output looks like as follows: ['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.'] Removing whitespaces or not When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces. The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double-dashes we have seen earlier in the first 100 characters of Edith Wharton's short story, along with additional special characters: text = "Hello, world. Is this-- a test?" result = re.split(r'([,.:;?_!"()\']|--|\s)', text) result = [item.strip() for item in result if item.strip()] print(result) The resulting output is as follows: ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?'] As we can see based on the results summarized in Figure 2.5, our tokenization scheme can now handle the various special characters in the text successfully. Figure 2.5 The tokenization scheme we implemented so far splits text into individual words and punctuation characters. In the specific example shown in this figure, the sample text gets split into 10 individual tokens. Now that we got a basic tokenizer working, let's apply it to Edith Wharton's entire short story: preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text) preprocessed = [item.strip() for item in preprocessed if item.strip()] print(len(preprocessed)) The above print statement outputs 4649, which is the number of tokens in this text (without whitespaces). Let's print the first 30 tokens for a quick visual check: print(preprocessed[:30]) The resulting output shows that our tokenizer appears to be handling the text well since all words and special characters are neatly separated: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 2.3 Converting tokens into token IDs In the previous section, we tokenized a short story by Edith Wharton into individual tokens. In this section, we will convert these tokens from a Python string to an integer representation to produce the so-called token IDs. This conversion is an intermediate step before converting the token IDs into embedding vectors. To map the previously generated tokens into token IDs, we have to build a so-called vocabulary first. This vocabulary defines how we map each unique word and special character to a unique integer, as shown in Figure 2.6. Figure 2.6 We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity. In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable called preprocessed. Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size: all_words = sorted(list(set(preprocessed))) vocab_size = len(all_words) print(vocab_size) After determining that the vocabulary size is 1,159 via the above code, we create the vocabulary and print its first 50 entries for illustration purposes: Listing 2.2 Creating a vocabulary vocab = {token:integer for integer,token in enumerate(all_words)} for i, item in enumerate(vocab.items()): print(item) if i > 50: break ('!', 0) ('"', 1) ("'", 2)... ('Has', 49) ('He', 50) As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels. Our next goal is to apply this vocabulary to convert new text into token IDs, as illustrated in Figure 2.7. Figure 2.7 Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs. The vocabulary is built from the entire training set and can be applied to the training set itself and any new text samples. The depicted vocabulary contains no punctuation or special characters for simplicity. Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text. For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens. Let's implement a complete tokenizer class in Python with an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text. The code for this tokenizer implementation is as in listing 2.3: Listing 2.3 Implementing a simple text tokenizer class SimpleTokenizerV1: def __init__(self, vocab): self.str_to_int = vocab #A self.int_to_str = {i:s for s,i in vocab.items()} #B def encode(self, text): #C preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text) preprocessed = [item.strip() for item in preprocessed if item.strip( ids = [self.str_to_int[s] for s in preprocessed] return ids def decode(self, ids): #D text = " ".join([self.int_to_str[i] for i in ids]) text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) #E return text Using the SimpleTokenizerV1 Python class above, we can now instantiate new tokenizer objects via an existing vocabulary, which we can then use to encode and decode text, as illustrated in Figure 2.8. Figure 2.8 Tokenizer implementations share two common methods: an encode method and a decode method. The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary. The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text. Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice: tokenizer = SimpleTokenizerV1(vocab) text = """"It's the last he painted, you know," Mrs. Gisburn said with pardo ids = tokenizer.encode(text) print(ids) The code above prints the following token IDs: [1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 113 Next, let's see if we can turn these token IDs back into text using the decode method: print(tokenizer.decode(ids)) This outputs the following text: '" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable Based on the output above, we can see that the decode method successfully converted the token IDs back into the original text. So far, so good. We implemented a tokenizer capable of tokenizing and de- tokenizing text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set: text = "Hello, do you like tea?" tokenizer.encode(text) Executing the code above will result in the following error:... KeyError: 'Hello' The problem is that the word "Hello" was not used in the The Verdict short story. Hence, it is not contained in the vocabulary. This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs. In the next section, we will test the tokenizer further on text that contains unknown words, and we will also discuss additional special tokens that can be used to provide further context for an LLM during training. 2.4 Adding special context tokens In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. In this section, we will modify this tokenizer to handle unknown words. We will also discuss the usage and addition of special context tokens that can enhance a model's understanding of context or other relevant information in the text. These special tokens can include markers for unknown words and document boundaries, for example. In particular, we will modify the vocabulary and tokenizer we implemented in the previous section, SimpleTokenizerV2, to support two new tokens, and , as illustrated in Figure 2.9. Figure 2.9 We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an token that we can use to separate two unrelated text sources. As shown in Figure 2.9, we can modify the tokenizer to use an token if it encounters a word that is not part of the vocabulary. Furthermore, we add a token between unrelated texts. For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source, as illustrated in Figure 2.10. This helps the LLM understand that, although these text sources are concatenated for training, they are, in fact, unrelated. Figure 2.10 When working with multiple independent text source, we add tokens between these texts. These tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM. Let's now modify the vocabulary to include these two special tokens, and , by adding these to the list of all unique words that we created in the previous section: all_tokens = sorted(list(set(preprocessed))) all_tokens.extend(["", ""]) vocab = {token:integer for integer,token in enumerate(all_tokens)} print(len(vocab.items())) Based on the output of the print statement above, the new vocabulary size is 1161 (the vocabulary size in the previous section was 1159). As an additional quick check, let's print the last 5 entries of the updated vocabulary: for i, item in enumerate(list(vocab.items())[-5:]): print(item) The code above prints the following: ('younger', 1156) ('your', 1157) ('yourself', 1158) ('', 1159) ('', 1160) Based on the code output above, we can confirm that the two new special tokens were indeed successfully incorporated into the vocabulary. Next, we adjust the tokenizer from code listing 2.3 accordingly, as shown in listing 2.4: Listing 2.4 A simple text tokenizer that handles unknown words class SimpleTokenizerV2: def __init__(self, vocab): self.str_to_int = vocab self.int_to_str = { i:s for s,i in vocab.items()} def encode(self, text): preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text) preprocessed = [item.strip() for item in preprocessed if item.strip( preprocessed = [item if item in self.str_to_int #A else "" for item in preprocessed] ids = [self.str_to_int[s] for s in preprocessed] return ids def decode(self, ids): text = " ".join([self.int_to_str[i] for i in ids]) text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) #B return text Compared to the SimpleTokenizerV1 we implemented in code listing 2.3 in the previous section, the new SimpleTokenizerV2 replaces unknown words by tokens. Let's now try this new tokenizer out in practice. For this, we will use a simple text sample that we concatenate from two independent and unrelated sentences: text1 = "Hello, do you like tea?" text2 = "In the sunlit terraces of the palace." text = " ".join((text1, text2)) print(text) The output is as follows: 'Hello, do you like tea? In the sunlit terraces of the palace. Next, let's tokenize the sample text using the SimpleTokenizerV2 on the vocab we previously created in listing 2.2: tokenizer = SimpleTokenizerV2(vocab) print(tokenizer.encode(text)) This prints the following token IDs: [1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 11 Above, we can see that the list of token IDs contains 1159 for the separator token as well as two 1160 tokens, which are used for unknown words. Let's de-tokenize the text for a quick sanity check: print(tokenizer.decode(tokenizer.encode(text))) The output is as follows: ', do you like tea? In the sunlit terraces of the established and established ----> himself and established himself ----> in and established himself in ----> a We've now created the input-target pairs that we can turn into use for the LLM training in upcoming chapters. There's only one more task before we can turn the tokens into embeddings, as we mentioned at the beginning of this chapter: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays. In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict, as depicted in Figure 2.13. Figure 2.13 To implement efficient data loaders, we collect the inputs in a tensor, x, where each row represents one input context. A second tensor, y, contains the corresponding prediction targets (next words), which are created by shifting the input by one position. While Figure 2.13 shows the tokens in string format for illustration purposes, the code implementation will operate on token IDs directly since the encode method of the BPE tokenizer performs both tokenization and conversion into token IDs as a single step. For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes. For additional information and guidance on installing PyTorch, please see section A.1.3, Installing PyTorch, in Appendix A. The code for the dataset class is shown in code listing 2.5: Listing 2.5 A dataset for batched inputs and targets import torch from torch.utils.data import Dataset, DataLoader class GPTDatasetV1(Dataset): def __init__(self, txt, tokenizer, max_length, stride): self.tokenizer = tokenizer self.input_ids = [] self.target_ids = [] token_ids = tokenizer.encode(txt) #A for i in range(0, len(token_ids) - max_length, stride): #B input_chunk = token_ids[i:i + max_length] target_chunk = token_ids[i + 1: i + max_length + 1] self.input_ids.append(torch.tensor(input_chunk)) self.target_ids.append(torch.tensor(target_chunk)) def __len__(self): #C return len(self.input_ids) def __getitem__(self, idx): #D return self.input_ids[idx], self.target_ids[idx] The GPTDatasetV1 class in listing 2.5 is based on the PyTorch Dataset class and defines how individual rows are fetched from the dataset, where each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor. The target_chunk tensor contains the corresponding targets. I recommend reading on to see how the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader - - this will bring additional intuition and clarity. If you are new to the structure of PyTorch Dataset classes, such as shown in listing 2.5, please read section A.6, Setting up efficient data loaders, in Appendix A, which explains the general structure and usage of PyTorch Dataset and DataLoader classes. The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader: Listing 2.6 A data loader to generate batches with input-with pairs def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True): tokenizer = tiktoken.get_encoding("gpt2") #A dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) #B dataloader = DataLoader( dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last return dataloader Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4 to develop an intuition of how the GPTDatasetV1 class from listing 2.5 and the create_dataloader_v1 function from listing 2.6 work together: with open("the-verdict.txt", "r", encoding="utf-8") as f: raw_text = f.read() dataloader = create_dataloader_v1( raw_text, batch_size=1, max_length=4, stride=1, shuffle=False) data_iter = iter(dataloader) #A first_batch = next(data_iter) print(first_batch) Executing the preceding code prints the following: [tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])] The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs. Since the max_length is set to 4, each of the two tensors contains 4 token IDs. Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256. To illustrate the meaning of stride=1, let's fetch another batch from this dataset: second_batch = next(data_iter) print(second_batch) The second batch has the following contents: [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])] If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch (for example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input). The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach, as demonstrated in Figure 2.14. Figure 2.14 When creating multiple batches from the input dataset, we slide an input window across the text. If the stride is set to 1, we shift the input window by 1 position when creating the next batch. If we set the stride equal to the input window size, we can prevent overlaps between the batches. Exercise 2.2 Data loaders with different strides and context sizes To develop more intuition for how the data loader works, try to run it with different settings such as max_length=2 and stride=2 and max_length=8 and stride=2. Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes. If you have previous experience with deep learning, you may know that small batch sizes require less memory during training but lead to more noisy model updates. Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs. Before we move on to the two final sections of this chapter that are focused on creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1: dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stri data_iter = iter(dataloader) inputs, targets = next(data_iter) print("Inputs:\n", inputs) print("\nTargets:\n", targets) This prints the following: Inputs: tensor([[ 40, 367, 2885, 1464], [ 1807, 3619, 402, 271], [10899, 2138, 257, 7026], [15632, 438, 2016, 257], [ 922, 5891, 1576, 438], [ 568, 340, 373, 645], [ 1049, 5975, 284, 502], [ 284, 3285, 326, 11]]) Targets: tensor([[ 367, 2885, 1464, 1807], [ 3619, 402, 271, 10899], [ 2138, 257, 7026, 15632], [ 438, 2016, 257, 922], [ 5891, 1576, 438, 568], [ 340, 373, 645, 1049], [ 5975, 284, 502, 284], [ 3285, 326, 11, 287]]) Note that we increase the stride to 4. This is to utilize the data set fully (we don't skip a single word) but also avoid any overlap between the batches, since more overlap could lead to increased overfitting. In the final two sections of this chapter, we will implement embedding layers that convert the token IDs into continuous vector representations, which serve as input data format for LLMs. 2.7 Creating token embeddings The last step for preparing the input text for LLM training is to convert the token IDs into embedding vectors, as illustrated in Figure 2.15, which will be the focus of these two last remaining sections of this chapter. Figure 2.15 Preparing the input text for an LLM involves tokenizing text, converting text tokens to token IDs, and converting token IDs into vector embedding vectors. In this section, we consider the token IDs created in previous sections to create the token embedding vectors. In addition to the processes outlined in Figure 2.15, it is important to note that we initialize these embedding weights with random values as a preliminary step. This initialization serves as the starting point for the LLM's learning process. We will optimize the embedding weights as part of the LLM training in chapter 5. A continuous vector representation, or embedding, is necessary since GPT- like LLMs are deep neural networks trained with the backpropagation algorithm. If you are unfamiliar with how neural networks are trained with backpropagation, please read section A.4, Automatic differentiation made easy, in Appendix A. Let's illustrate how the token ID to embedding vector conversion works with a hands-on example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1: input_ids = torch.tensor([2, 3, 5, 1]) For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of only 6 words (instead of the 50,257 words in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT- 3, the embedding size is 12,288 dimensions): vocab_size = 6 output_dim = 3 Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility purposes: torch.manual_seed(123) embedding_layer = torch.nn.Embedding(vocab_size, output_dim) print(embedding_layer.weight) The print statement in the preceding code example prints the embedding layer's underlying weight matrix: Parameter containing: tensor([[ 0.3374, -0.1778, -0.1690], [ 0.9178, 1.5810, 1.3010], [ 1.2753, -0.2010, -0.1606], [-0.4015, 0.9666, -1.1481], [-1.1589, 0.3255, -0.6315], [-2.8400, -0.7849, -1.4096]], requires_grad=True) We can see that the weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training as part of the LLM optimization itself, as we will see in upcoming chapters. Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions. After we instantiated the embedding layer, let's now apply it to a token ID to obtain the embedding vector: print(embedding_layer(torch.tensor())) The returned embedding vector is as follows: tensor([[-0.4015, 0.9666, -1.1481]], grad_fn=) If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row (Python starts with a zero index, so it's the row corresponding to index 3). In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID. Embedding layers versus matrix multiplication For those who are familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one- hot encoding followed by matrix multiplication in a fully connected layer, which is illustrated in the supplementary code on GitHub at https://github.com/rasbt/LLMs-from- scratch/tree/main/ch02/03_bonus_embedding-vs-matmul. Because the embedding layer is just a more efficient implementation equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation. Previously, we have seen how to convert a single token ID into a three- dimensional embedding vector. Let's now apply that to all four input IDs we defined earlier (torch.tensor([2, 3, 5, 1])): print(embedding_layer(input_ids)) The print output reveals that this results in a 4x3 matrix: tensor([[ 1.2753, -0.2010, -0.1606], [-0.4015, 0.9666, -1.1481], [-2.8400, -0.7849, -1.4096], [ 0.9178, 1.5810, 1.3010]], grad_fn=) Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix, as illustrated in Figure 2.16. Figure 2.16 Embedding layers perform a look-up operation, retrieving the embedding vector corresponding to the token ID from the embedding layer's weight matrix. For instance, the embedding vector of the token ID 5 is the sixth row of the embedding layer weight matrix (it is the sixth instead of the fifth row because Python starts counting at 0). For illustration purposes, we assume that the token IDs were produced by the small vocabulary we used in section 2.3. This section covered how we create embedding vectors from token IDs. The next and final section of this chapter will add a small modification to these embedding vectors to encode positional information about a token within a text. 2.8 Encoding word positions In the previous section, we converted the token IDs into a continuous vector representation, the so-called token embeddings. In principle, this is a suitable input for an LLM. However, a minor shortcoming of LLMs is that their self- attention mechanism, which will be covered in detail in chapter 3, doesn't have a notion of position or order for the tokens within a sequence. The way the previously introduced embedding layer works is that the same token ID always gets mapped to the same vector representation, regardless of where the token ID is positioned in the input sequence, as illustrated in Figure 2.17. Figure 2.17 The embedding layer converts a token ID into the same vector representation regardless of where it is located in the input sequence. For example, the token ID 5, whether it's in the first or third position in the token ID input vector, will result in the same embedding vector. In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes. However, since the self-attention mechanism of LLMs itself is also position-agnostic, it is helpful to inject additional position information into the LLM. To achieve this, there are two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings. Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. For instance, the first token will have a specific positional embedding, the second token another distinct embedding, and so on, as illustrated in Figure 2.18. Figure 2.18 Positional embeddings are added to the token embedding vector to create the input embeddings for an LLM. The positional vectors have the same dimension as the original token embeddings. The token embeddings are shown with value 1 for simplicity. Instead of focusing on the absolute position of a token, the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position." The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn't seen such lengths during training. Both types of positional embeddings aim to augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more accurate and context-aware predictions. The choice between them often depends on the specific application and the nature of the data being processed. OpenAI's GPT models use absolute positional embeddings that are optimized during the training process rather than being fixed or predefined like the positional encodings in the original Transformer model. This optimization process is part of the model training itself, which we will implement later in this book. For now, let's create the initial positional embeddings to create the LLM inputs for the upcoming chapters. Previously, we focused on very small embedding sizes in this chapter for illustration purposes. We now consider more realistic and useful embedding sizes and encode the input tokens into a 256-dimensional vector representation. This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation. Furthermore, we assume that the token IDs were created by the BPE tokenizer that we implemented earlier, which has a vocabulary size of 50,257: output_dim = 256 vocab_size = 50257 token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim) Using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor. Let's instantiate the data loader from section 2.6, Data sampling with a sliding window, first: max_length = 4 dataloader = create_dataloader_v1( raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffl data_iter = iter(dataloader) inputs, targets = next(data_iter) print("Token IDs:\n", inputs) print("\nInputs shape:\n", inputs.shape) The preceding code prints the following output: Token IDs: tensor([[ 40, 367, 2885, 1464], [ 1807, 3619, 402, 271], [10899, 2138, 257, 7026], [15632, 438, 2016, 257], [ 922, 5891, 1576, 438], [ 568, 340, 373, 645], [ 1049, 5975, 284, 502], [ 284, 3285, 326, 11]]) Inputs shape: torch.Size([8, 4]) As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each. Let's now use the embedding layer to embed these token IDs into 256- dimensional vectors: token_embeddings = token_embedding_layer(inputs) print(token_embeddings.shape) The preceding print function call returns the following: torch.Size([8, 4, 256]) As we can tell based on the 8x4x256-dimensional tensor output, each token ID is now embedded as a 256-dimensional vector. For a GPT model's absolute embedding approach, we just need to create another embedding layer that has the same dimension as the token_embedding_layer: context_length = max_length pos_embedding_layer = torch.nn.Embedding(context_lengthe, output_dim) pos_embeddings = pos_embedding_layer(torch.arange(context_length)) print(pos_embeddings.shape) As shown in the preceding code example, the input to the pos_embeddings is usually a placeholder vector torch.arange(context_length), which contains a sequence of numbers 0, 1,..., up to the maximum input length − 1. The context_length is a variable that represents the supported input size of the LLM. Here, we choose it similar to the maximum length of the input text. In practice, input text can be longer than the supported context length, in which case we have to truncate the text. The output of the print statement is as follows: torch.Size([4, 256]) As we can see, the positional embedding tensor consists of four 256- dimensional vectors. We can now add these directly to the token embeddings, where PyTorch will add the 4x256-dimensional pos_embeddings tensor to each 4x256-dimensional token embedding tensor in each of the 8 batches: input_embeddings = token_embeddings + pos_embeddings print(input_embeddings.shape) The print output is as follows: torch.Size([8, 4, 256]) The input_embeddings we created, as summarized in Figure 2.19, are the embedded input examples that can now be processed by the main LLM modules, which we will begin implementing in chapter 3 Figure 2.19 As part of the input processing pipeline, input text is first broken up into individual tokens. These tokens are then converted into token IDs using a vocabulary. The token IDs are converted into embedding vectors to which positional embeddings of a similar size are added, resulting in input embeddings that are used as input for the main LLM layers. 2.9 Summary LLMs require textual data to be converted into numerical vectors, known as embeddings since they can't process raw text. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations. As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs. Special tokens, such as and , can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts. The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters. We use a sliding window approach on tokenized data to generate input- target pairs for LLM training. Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs. While token embeddings provide consistent vector representations for each token, they lack a sense of the token's position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI's GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training. 3 Coding Attention Mechanisms This chapter covers Exploring the reasons for using attention mechanisms in neural networks Introducing a basic self-attention framework and progressing to an enhanced self-attention mechanism Implementing a causal attention module that allows LLMs to generate one token at a time Masking randomly selected attention weights with dropout to reduce overfitting Stacking multiple causal attention modules into a multi-head attention module In the previous chapter, you learned how to prepare the input text for training LLMs. This involved splitting text into individual word and subword tokens, which can be encoded into vector representations, the so-called embeddings, for the LLM. In this chapter, we will now look at an integral part of the LLM architecture itself, attention mechanisms, as illustrated in Figure 3.1. Figure 3.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on attention mechanisms, which are an integral part of an LLM architecture. Attention mechanisms are a comprehensive topic, which is why we are devoting a whole chapter to it. We will largely look at these attention mechanisms in isolation and focus on them at a mechanistic level. In the next chapter, we will then code the remaining parts of the LLM surrounding the self-attention mechanism to see it in action and to create a model to generate text. Over the course of this chapter, we will implement four different variants of attention mechanisms, as illustrated in Figure 3.2. Figure 3.2 The figure depicts different attention mechanisms we will code in this chapter, starting with a simplified version of self-attention before adding the trainable weights. The causal attention mechanism adds a mask to self-attention that allows the LLM to generate one word at a time. Finally, multi-head attention organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel. These different attention variants shown in Figure 3.2 build on each other, and the goal is to arrive at a compact and efficient implementation of multi- head attention at the end of this chapter that we can then plug into the LLM architecture we will code in the next chapter. 3.1 The problem with modeling long sequences Before we dive into the self-attention mechanism that is at the heart of LLMs later in this chapter, what is the problem with architectures without attention mechanisms that predate LLMs? Suppose we want to develop a language translation model that translates text from one language into another. As shown in Figure 3.3, we can't simply translate a text word by word due to the grammatical structures in the source and target language. Figure 3.3 When translating text from one language to another, such as German to English, it's not possible to merely translate word by word. Instead, the translation process requires contextual understanding and grammar alignment. To address the issue that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder. The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text. We already briefly discussed encoder-decoder networks when we introduced the transformer architecture in chapter 1 (section 1.4, Using LLMs for different tasks). Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation. An RNN is a type of neural network where outputs from previous steps are fed as inputs to the current step, making them well-suited for sequential data like text. If you are unfamiliar with RNNs, don't worry, you don't need to know the detailed workings of RNNs to follow this discussion; our focus here is more on the general concept of the encoder-decoder setup. In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state, as illustrated in Figure 3.4. The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction. Figure 3.4 Before the advent of transformer models, encoder-decoder RNNs were a popular choice for machine translation. The encoder takes a sequence of tokens from the source language as input, where a hidden state (an intermediate neural network layer) of the encoder encodes a compressed representation of the entire input sequence. Then, the decoder uses its current hidden state to begin the translation, token by token. While we don't need to know the inner workings of these encoder-decoder RNNs, the key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes in this hidden state to produce the output. You can think of this hidden state as an embedding vector, a concept we discussed in chapter 2. The big issue and limitation of encoder-decoder RNNs is that the RNN can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances. For readers unfamiliar with RNNs, it is not essential to understand or study this architecture as we will not be using it in this book. The takeaway message of this section is that encoder-decoder RNNs had a shortcoming that motivated the design of attention mechanisms. 3.2 Capturing data dependencies with attention mechanisms Before transformer LLMs, it was common to use RNNs for language modeling tasks such as language translation, as mentioned previously. RNNs work fine for translating short sentences but don't work well for longer texts as they don't have direct access to previous words in the input. One major shortcoming in this approach is that the RNN must remember the entire encoded input in a single hidden state before passing it to the decoder, as illustrated in Figure 3.4 in the previous section. Hence, researchers developed the so-called Bahdanau attention mechanism for RNNs in 2014 (named after the first author of the respective paper), which modifies the encoder-decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step as illustrated in Figure 3.5. Figure 3.5 Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by the so-called attention weights, which we will compute later. Note that this figure shows the general idea behind attention and does not depict the exact implementation of the Bahdanau mechanism, which is an RNN method outside this book's scope. Interestingly, only three years later, researchers found that RNN architectures are not required for building deep neural networks for natural language processing and proposed the original transformer architecture (discussed in chapter 1) with a self-attention mechanism inspired by the Bahdanau attention mechanism. Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series. This chapter focuses on coding and understanding this self-attention mechanism used in GPT-like models, as illustrated in Figure 3.6. In the next chapter, we will then code the remaining parts of the LLM. Figure 3.6 Self-attention is a mechanism in transformers that is used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence. In this chapter, we will code this self- attention mechanism from the ground up before we code the remaining parts of the GPT-like LLM in the following chapter. 3.3 Attending to different parts of the input with self-attention We'll now delve into the inner workings of the self-attention mechanism and learn how to code it from the ground up. Self-attention serves as the cornerstone of every LLM based on the transformer architecture. It's worth noting that this topic may require a lot of focus and attention (no pun intended), but once you grasp its fundamentals, you will have conquered one of the toughest aspects of this book and implementing LLMs in general. The "self" in self-attention In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image. This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence, such as the example depicted in Figure 3.5. Since self-attention can appear complex, especially if you are encountering it for the first time, we will begin by introducing a simplified version of self- attention in the next subsection. Afterwards, in section 3.4, we will then implement the self-attention mechanism with trainable weights, which is used in LLMs. 3.3.1 A simple self-attention mechanism without trainable weights In this section, we implement a simplified variant of self-attention, free from any trainable weights, which is summarized in Figure 3.7. The goal of this section is to illustrate a few key concepts in self-attention before adding trainable weights next in section 3.4. Figure 3.7 The goal of self-attention is to compute a context vector, for each input element, that combines information from all other input elements. In the example depicted in this figure, we compute the context vector z(2). The importance or contribution of each input element for computing z(2) is determined by the attention weights α21 to α2T. When computing z(2), the attention weights are calculated with respect to input element x(2) and all other inputs. The exact computation of these attention weights is discussed later in this section. Figure 3.7 shows an input sequence, denoted as x, consisting of T elements represented as x(1) to x(T). This sequence typically represents text, such as a sentence, that has already been transformed into token embeddings, as explained in chapter 2. For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as x(1), corresponds to a d- dimensional embedding vector representing a specific token, like "Your." In Figure 3.7, these input vectors are shown as 3-dimensional embeddings. In self-attention, our goal is to calculate context vectors z(i) for each element x(i) in the input sequence. A context vector can be interpreted as an enriched embedding vector. To illustrate this concept, let's focus on the embedding vector of the second input element, x(2) (which corresponds to the token "journey"), and the corresponding context vector, z(2), shown at the bottom of Figure 3.7. This enhanced context vector, z(2), is an embedding that contains information about x(2) and all other input elements x(1) to x(T). In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence, as illustrated in Figure 3.7. This is essential in LLMs, which need to understand the relationship and relevance of words in a sentence to each other. Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token. In this section, we implement a simplified self-attention mechanism to compute these weights and the resulting context vector one step at a time. Consider the following input sentence, which has already been embedded into 3-dimensional vectors as discussed in chapter 2. We choose a small embedding dimension for illustration purposes to ensure it fits on the page without line breaks: import torch inputs = torch.tensor( [[0.43, 0.15, 0.89], # Your (x^1) [0.55, 0.87, 0.66], # journey (x^2) [0.57, 0.85, 0.64], # starts (x^3) [0.22, 0.58, 0.33], # with (x^4) [0.77, 0.25, 0.10], # one (x^5) [0.05, 0.80, 0.55]] # step (x^6) ) The first step of implementing self-attention is to compute the intermediate values ω, referred to as attention scores, as illustrated in Figure 3.8. Figure 3.8 The overall goal of this section is to illustrate the computation of the context vector z(2) using the second input sequence, x(2) as a query. This figure shows the first intermediate step, computing the attention scores ω between the query x(2) and all other input elements as a dot product. (Note that the numbers in the figure are truncated to one digit after the decimal point to reduce visual clutter.) Figure 3.8 illustrates how we calculate the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query, x(2), with every other input token: query = inputs #A attn_scores_2 = torch.empty(inputs.shape) for i, x_i in enumerate(inputs): attn_scores_2[i] = torch.dot(x_i, query) print(attn_scores_2) The computed attention scores are as follows: tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865]) Understanding dot products A dot product is essentially just a concise way of multiplying two vectors element-wise and then summing the products, which we can demonstrate as follows: res = 0. for idx, element in enumerate(inputs): res += inputs[idx] * query[idx] print(res) print(torch.dot(inputs, query)) The outputs confirms that the sum of the element-wise multiplication gives the same results as the dot product: tensor(0.9544) tensor(0.9544) Beyond viewing the dot product operation as a mathematical tool that combines two vectors to yield a scalar value, the dot product is a measure of similarity because it quantifies how much two vectors are aligned: a higher dot product indicates a greater degree of alignment or similarity between the vectors. In the context of self-attention mechanisms, the dot product determines the extent to which elements in a sequence attend to each other: the higher the dot product, the higher the similarity and attention score between two elements. In the next step, as shown in Figure 3.9, we normalize each of the attention scores that we computed previously. Figure 3.9 After computing the attention scores ω21 to ω2T with respect to the input query x(2), the next step is to obtain the attention weights α21 to α2T by normalizing the attention scores. The main goal behind the normalization shown in Figure 3.9 is to obtain attention weights that sum up to 1. This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here's a straightforward method for achieving this normalization step: attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum() print("Attention weights:", attn_weights_2_tmp) print("Sum:", attn_weights_2_tmp.sum()) As the output shows, the attention weights now sum to 1: Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656]) Sum: tensor(1.0000) In practice, it's more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. Below is a basic implementation of the softmax function for normalizing the attention scores: def softmax_naive(x): return torch.exp(x) / torch.exp(x).sum(dim=0) attn_weights_2_naive = softmax_naive(attn_scores_2) print("Attention weights:", attn_weights_2_naive) print("Sum:", attn_weights_2_naive.sum()) As the output shows, the softmax function also meets the objective and normalizes the attention weights such that they sum to 1: Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581]) Sum: tensor(1.) In addition, the softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance. Note that this naive softmax implementation (softmax_naive) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore, in practice, it's advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance: attn_weights_2 = torch.softmax(attn_scores_2, dim=0) print("Attention weights:", attn_weights_2) print("Sum:", attn_weights_2.sum()) In this case, we can see that it yields the same results as our previous softmax_naive function: Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581]) Sum: tensor(1.) Now that we computed the normalized attention weights, we are ready for the final step illustrated in Figure 3.10: calculating the context vector z(2) by multiplying the embedded input tokens, x(i), with the corresponding attention weights and then summing the resulting vectors. Figure 3.10 The final step, after calculating and normalizing the attention scores to obtain the attention weights for query x(2), is to compute the context vector z(2). This context vector is a combination of all input vectors x(1)