Understanding Foundation Models in AI Engineering PDF

Summary

This document summarizes the core concepts behind foundation models, emphasizing the importance of understanding their training data, architecture and size, and post-training processes in order to adapt them to various applications. It also highlights the challenges caused by the limited representation of languages other than English in training datasets.

Full Transcript

Summarize Chapter 2. Understanding Foundation Models

To build applications with foundation models, you first need foundation models. While you don't need to know how to develop a model to use it, a high-level understanding will help you decide what model to use and how to adapt it to your needs.

Training a foundation model is an incredibly complex and costly process. Those who know how to do this well are likely prevented by confidentiality agreements from disclosing the secret sauce. This chapter won't be able to tell you how to build a model to compete with ChatGPT. Instead, I'll focus on design decisions with consequential impact on downstream applications. With the growing lack of transparency in the training process of foundation models, it's difficult to know all the design decisions that go into making a model. In general, however, differences in foundation models can be traced back to decisions about training data, model architecture and size, and how they are post-trained to align with human preferences.

Since models learn from data, their training data reveals a great deal about their capabilities and limitations. This chapter begins with how model developers curate training data, focusing on the distribution of training data. Chapter 8 explores dataset engineering techniques in detail, including data quality evaluation and data synthesis.

Given the dominance of the transformer architecture, it might seem that model architecture is less of a choice. You might be wondering, what makes the transformer architecture so special that it continues to dominate? How long until another architecture takes over, and what might this new architecture look like? This chapter will address all of these questions. Whenever a new model is released, one of the first things people want to know is its size. This chapter will also explore how a model developer might determine the appropriate size for their model.

As mentioned in Chapter 1, a model's training process is often divided into pre-training and post-training. Pre-training makes a model capable, but not necessarily safe or easy to use. This is where post-training comes in. The goal of post-training is to align the model with human preferences. But what exactly is human preference? How can it be represented in a way that a model can learn? The way a model developer aligns their model has a significant impact on the model's usability, and will be discussed in this chapter.

While most people understand the impact of training on a model's performance, the impact of sampling is often overlooked. Sampling is how a model chooses an output from all possible options. It is perhaps one of the most underrated concepts in AI. Not only does sampling explain many seemingly baffling AI behaviors, including hallucinations and inconsistencies, but choosing the right sampling strategy can also significantly boost a model's performance with relatively little effort. For this reason, sampling is the section I was most excited to write about in this chapter.

Concepts covered in this chapter are fundamental for understanding the rest of the book. However, because these concepts are fundamental, you might already be familiar with them. Feel free to skip any concept that you're confident about. If you encounter a confusing concept later on, you can revisit this chapter.

Training Data

An AI model is only as good as the data it was trained on.
If there's no Vietnamese in the training data, the model won't be able to translate from English into Vietnamese. Similarly, if an image classification model sees only animals in its training set, it won't perform well on photos of plants.

If you want a model to improve on a certain task, you might want to include more data for that task in the training data. However, collecting sufficient data for training a large model isn't easy, and it can be expensive. Model developers often have to rely on available data, even if this data doesn't exactly meet their needs.

For example, a common source for training data is Common Crawl, created by a nonprofit organization that sporadically crawls websites on the internet. In 2022 and 2023, this organization crawled approximately 2–3 billion web pages each month. Google provides a clean subset of Common Crawl called the Colossal Clean Crawled Corpus, or C4 for short.

The data quality of Common Crawl, and C4 to a certain extent, is questionable—think clickbait, misinformation, propaganda, conspiracy theories, racism, misogyny, and every sketchy website you've ever seen or avoided on the internet. A study by the Washington Post shows that the 1,000 most common websites in the dataset include several media outlets that rank low on NewsGuard's scale for trustworthiness. In lay terms, Common Crawl contains plenty of fake news.

Yet, simply because Common Crawl is available, variations of it are used in most foundation models that disclose their training data sources, including OpenAI's GPT-3 and Google's Gemini. I suspect that Common Crawl is also used in models that don't disclose their training data. To avoid scrutiny from both the public and competitors, many companies have stopped disclosing this information.

Some teams use heuristics to filter out low-quality data from the internet. For example, OpenAI used only the Reddit links that received at least three upvotes to train GPT-2. While this does help screen out links that nobody cares about, Reddit isn't exactly the pinnacle of propriety and good taste.

The "use what we have, not what we want" approach may lead to models that perform well on tasks present in the training data but not necessarily on the tasks you care about. To address this issue, it's crucial to curate datasets that align with your specific needs. This section focuses on curating data for specific languages and domains, providing a broad yet specialized foundation for applications within those areas. Chapter 8 explores data strategies for models tailored to highly specific tasks. While language- and domain-specific foundation models can be trained from scratch, it's also common to finetune them on top of general-purpose models.

Some might wonder, why not just train a model on all data available, both general data and specialized data, so that the model can do everything? This is what many people do. However, training on more data often requires more compute resources and doesn't always lead to better performance. For example, a model trained with a smaller amount of high-quality data might outperform a model trained with a large amount of low-quality data. Using 7B tokens of high-quality coding data, Gunasekar et al. (2023) were able to train a 1.3B-parameter model that outperforms much larger models on several important coding benchmarks. The impact of data quality is discussed more in Chapter 8.

Multilingual Models

English dominates the internet.
An analysis of the Common Crawl dataset shows that English accounts for almost half of the data (45.88%), making it eight times more prevalent than the second-most common language, Russian (5.97%) (Lai et al., 2023). See Table 2-1 for a list of languages with at least 1% in Common Crawl. Languages with limited availability as training data—typically languages not included in this list—are considered low-resource.

Table 2-1. The most common languages in Common Crawl, a popular dataset for training LLMs. Source: Lai et al. (2023).

Language | Code | Pop. (M) | CC size (%) | Cat.
English | en | 1,452 | 45.8786 | H
Russian | ru | 258 | 5.9692 | H
German | de | 134 | 5.8811 | H
Chinese | zh | 1,118 | 4.8747 | H
Japanese | jp | 125 | 4.7884 | H
French | fr | 274 | 4.7254 | H
Spanish | es | 548 | 4.4690 | H
Italian | it | 68 | 2.5712 | H
Dutch | nl | 30 | 2.0585 | H
Polish | pl | 45 | 1.6636 | H
Portuguese | pt | 257 | 1.1505 | H
Vietnamese | vi | 85 | 1.0299 | H

Many other languages, despite having a lot of speakers today, are severely under-represented in Common Crawl. Table 2-2 shows some of these languages. Ideally, the ratio between world population representation and Common Crawl representation should be 1. The higher this ratio, the more under-represented the language is in Common Crawl.

Table 2-2. Examples of under-represented languages in Common Crawl. The last row, English, is for comparison. The numbers for % in Common Crawl are taken from Lai et al. (2023).

Language | Speakers (million) | % world population (a) | % in Common Crawl | World : Common Crawl ratio
Punjabi | 113 | 1.41% | 0.0061% | 231.56
Swahili | 71 | 0.89% | 0.0077% | 115.26
Urdu | 231 | 2.89% | 0.0274% | 105.38
Kannada | 64 | 0.80% | 0.0122% | 65.57
Telugu | 95 | 1.19% | 0.0183% | 64.89
Gujarati | 62 | 0.78% | 0.0126% | 61.51
Marathi | 99 | 1.24% | 0.0213% | 58.10
Bengali | 272 | 3.40% | 0.0930% | 36.56
English | 1,452 | 18.15% | 45.88% | 0.40

(a) A world population of eight billion was used for this calculation.

Given the dominance of English in internet data, it's not surprising that general-purpose models work much better for English than for other languages, according to multiple studies. For example, on the MMLU benchmark, a suite of 14,000 multiple-choice problems spanning 57 subjects, GPT-4 performed much better in English than in under-represented languages like Telugu, as shown in Figure 2-1 (OpenAI, 2023).

Figure 2-1. On the MMLU benchmark, GPT-4 performs better in English than in any other language. To obtain MMLU in other languages, OpenAI translated the questions using Azure AI Translator.

Similarly, when tested on six math problems from Project Euler, Yennie Jun found that GPT-4 was able to solve problems in English more than three times as often as in Armenian or Farsi. GPT-4 failed all six questions for Burmese and Amharic, as shown in Figure 2-2.

Figure 2-2. GPT-4 is much better at math in English than in other languages.

Under-representation is a big reason for this underperformance. The three languages that have the worst performance on GPT-4's MMLU benchmark—Telugu, Marathi, and Punjabi—are also among the languages that are most under-represented in Common Crawl. However, under-representation isn't the only reason. A language's structure and the culture it embodies can also make a language harder for a model to learn.

Given that LLMs are generally good at translation, can we just translate all queries from other languages into English, obtain the responses, and translate them back into the original language? Many people indeed follow this approach, but it's not ideal.
First, this requires a model that can sufficiently understand under-represented languages to translate. Second, translation can cause information loss. For example, some languages, like Vietnamese, have pronouns that denote the relationship between the two speakers. When translating into English, all these pronouns are translated into I and you, causing the loss of the relationship information.

Models can also have unexpected performance challenges in non-English languages. For example, NewsGuard found that ChatGPT is more willing to produce misinformation in Chinese than in English. In April 2023, NewsGuard asked ChatGPT-3.5 to produce misinformation articles about China in English, simplified Chinese, and traditional Chinese. For English, ChatGPT declined to produce false claims for six out of seven prompts. However, it produced false claims in simplified Chinese and traditional Chinese all seven times. It's unclear what causes this difference in behavior.

Other than quality issues, models can also be slower and more expensive for non-English languages. A model's inference latency and cost are proportional to the number of tokens in the input and response. It turns out that tokenization can be much more efficient for some languages than others. Benchmarking GPT-4 on MASSIVE, a dataset of one million short texts translated across 52 languages, Yennie Jun found that, to convey the same meaning, languages like Burmese and Hindi require a lot more tokens than English or Spanish. For the MASSIVE dataset, the median token length in English is 7, but the median length in Hindi is 32, and in Burmese, it's a whopping 72, which is ten times longer than in English. Assuming that the time it takes to generate a token is the same in all languages, GPT-4 takes approximately ten times longer in Burmese than in English for the same content. For APIs that charge by token usage, Burmese costs ten times more than English.

To address this, many models have been trained to focus on non-English languages. The most active language, other than English, is undoubtedly Chinese, with ChatGLM, YAYI, Llama-Chinese, and others. There are also models in French (CroissantLLM), Vietnamese (PhoGPT), Arabic (Jais), and many more languages.

Domain-Specific Models

General-purpose models like Gemini, GPTs, and Llamas can perform incredibly well on a wide range of domains, including but not limited to coding, law, science, business, sports, and environmental science. This is largely thanks to the inclusion of these domains in their training data. Figure 2-3 shows the distribution of domains present in Common Crawl according to the Washington Post's 2023 analysis.

Figure 2-3. Distribution of domains in the C4 dataset. Reproduced from the statistics from the Washington Post. One caveat of this analysis is that it only shows the categories that are included, not the categories missing.

As of this writing, there haven't been many analyses of domain distribution in vision data. This might be because images are harder to categorize than texts. However, you can infer a model's domains from its benchmark performance. Table 2-3 shows how two models, CLIP and Open CLIP, perform on different benchmarks. These benchmarks show how well these two models do on birds, flowers, cars, and a few more categories, but the world is so much bigger and more complex than these few categories.

Table 2-3. Open CLIP and CLIP's performance on different image datasets.
Dataset | CLIP accuracy, ViT-B/32 (OpenAI) | Open CLIP accuracy, ViT-B/32 (Cade)
ImageNet | 63.2 | 62.9
ImageNet v2 | – | 62.6
Birdsnap | 37.8 | 46.0
Country211 | 17.8 | 14.8
Oxford 102 Category Flower | 66.7 | 66.0
German Traffic Sign Recognition Benchmark | 32.2 | 42.0
Stanford Cars | 59.4 | 79.3
UCF101 | 64.5 | 63.1

Even though general-purpose foundation models can answer everyday questions about different domains, they are unlikely to perform well on domain-specific tasks, especially if they never saw these tasks during training. Two examples of domain-specific tasks are drug discovery and cancer screening. Drug discovery involves protein, DNA, and RNA data, which follow specific formats and are expensive to acquire. This data is unlikely to be found in publicly available internet data. Similarly, cancer screening typically involves X-ray and fMRI (functional magnetic resonance imaging) scans, which are hard to obtain due to privacy.

To train a model to perform well on these domain-specific tasks, you might need to curate very specific datasets. One of the most famous domain-specific models is perhaps DeepMind's AlphaFold, trained on the sequences and 3D structures of around 100,000 known proteins. NVIDIA's BioNeMo is another model that focuses on biomolecular data for drug discovery. Google's Med-PaLM2 combined the power of an LLM with medical data to answer medical queries with higher accuracy.

TIP
Domain-specific models are especially common for biomedicine, but other fields can benefit from domain-specific models too. It's possible that a model trained on architectural sketches can help architects much better than Stable Diffusion, or that a model trained on factory plans can be optimized for manufacturing processes much better than a generic model like ChatGPT.

This section gave a high-level overview of how training data impacts a model's performance. Next, let's explore the impact of how a model is designed on its performance.

Modeling

Before training a model, developers need to decide what the model should look like. What architecture should it follow? How many parameters should it have? These decisions impact not only the model's capabilities but also its usability for downstream applications. For example, a 7B-parameter model will be vastly easier to deploy than a 175B-parameter model. Similarly, optimizing a transformer model for latency is very different from optimizing another architecture. Let's explore the factors behind these decisions.

Model Architecture

As of this writing, the most dominant architecture for language-based foundation models is the transformer architecture (Vaswani et al., 2017), which is based on the attention mechanism. It addresses many limitations of the previous architectures, which contributed to its popularity. However, the transformer architecture has its own limitations. This section analyzes the transformer architecture and its alternatives. Because it goes into the technical details of different architectures, it can be technically dense. If you find any part too deep in the weeds, feel free to skip it.

Transformer architecture

To understand the transformer, let's look at the problem it was created to solve. The transformer architecture was popularized on the heels of the success of the seq2seq (sequence-to-sequence) architecture. At the time of its introduction in 2014, seq2seq provided significant improvement on then-challenging tasks: machine translation and summarization.
In 2016, Google incorporated seq2seq into Google Translate, an update that they claimed gave them the "largest improvements to date for machine translation quality". This generated a lot of interest in seq2seq, making it the go-to architecture for tasks involving sequences of text.

At a high level, seq2seq contains an encoder that processes inputs and a decoder that generates outputs. Both inputs and outputs are sequences of tokens, hence the name. Seq2seq uses RNNs (recurrent neural networks) as its encoder and decoder. In its most basic form, the encoder processes the input tokens sequentially, outputting the final hidden state that represents the input. The decoder then generates output tokens sequentially, conditioned on both the final hidden state of the input and the previously generated token. A visualization of the seq2seq architecture is shown in the top half of Figure 2-4.

Figure 2-4. Seq2seq architecture versus transformer architecture. For the transformer architecture, the arrows show the tokens that the decoder attends to when generating each output token.

There are two problems with seq2seq that Vaswani et al. (2017) addresses. First, the vanilla seq2seq decoder generates output tokens using only the final hidden state of the input. Intuitively, this is like generating answers about a book using only the book summary. This limits the quality of the generated outputs. Second, the RNN encoder and decoder mean that both input processing and output generation are done sequentially, making it slow for long sequences. If an input is 200 tokens long, seq2seq has to wait for each input token to finish processing before moving on to the next.

The transformer architecture addresses both problems with the attention mechanism. The attention mechanism allows the model to weigh the importance of different input tokens when generating each output token. This is like generating answers by referencing any page in the book. A simplified visualization of the transformer architecture is shown in the bottom half of Figure 2-4.

NOTE
While the attention mechanism is often associated with the transformer model, it was introduced three years before the transformer paper. The attention mechanism can also be used with other architectures. Google used the attention mechanism with their seq2seq architecture in 2016 for their GNMT (Google Neural Machine Translation) model. However, it wasn't until the transformer paper showed that the attention mechanism could be used without RNNs that it took off.

The transformer architecture dispenses with RNNs entirely. With transformers, the input tokens can be processed in parallel, significantly speeding up input processing. While the transformer removes the sequential input bottleneck, transformer-based autoregressive language models still have the sequential output bottleneck. Inference for transformer-based language models, therefore, consists of two steps:

Prefill
The model processes the input tokens in parallel. This step creates the intermediate state necessary to generate the first output token. This intermediate state includes the key and value vectors for all input tokens.

Decode
The model generates one output token at a time.

As explored later in Chapter 9, the parallelizable nature of prefilling and the sequential aspect of decoding both motivate many optimization techniques to make language model inference cheaper and faster.
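To make the prefill/decode split concrete, here is a minimal, runnable sketch that uses a stub in place of a real transformer. It only illustrates the control flow described above: the prompt is processed once to build a key/value cache, and generation then proceeds one token at a time, with the cache growing at each step. The function and variable names are illustrative, not part of any real inference library.

```python
# Sketch of the two inference phases, with a stub standing in for the model.

def stub_prefill(prompt_tokens):
    # Pretend to compute key/value vectors for every prompt token in parallel.
    return [(f"K_{t}", f"V_{t}") for t in prompt_tokens]

def stub_next_token(kv_cache):
    # Pretend to attend over the cache and pick the next token.
    return len(kv_cache) % 5  # arbitrary deterministic choice for the sketch

def generate(prompt_tokens, max_new_tokens=4, eos_token=0):
    kv_cache = stub_prefill(prompt_tokens)        # prefill: parallel over the prompt
    output = []
    for _ in range(max_new_tokens):               # decode: strictly sequential
        token = stub_next_token(kv_cache)
        if token == eos_token:
            break
        output.append(token)
        kv_cache.append(("K_new", "V_new"))       # cache grows by one entry per step
    return output

print(generate([11, 12, 13, 14]))
```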
Attention mechanism

At the heart of the transformer architecture is the attention mechanism. Understanding this mechanism is necessary to understand how transformer models work. Under the hood, the attention mechanism leverages key, value, and query vectors:

- The query vector (Q) represents the current state of the decoder at each decoding step. Using the same book summary example, this query vector can be thought of as the person looking for information to create a summary.
- Each key vector (K) represents a previous token. If each previous token is a page in the book, each key vector is like the page number. Note that at a given decoding step, previous tokens include both input tokens and previously generated tokens.
- Each value vector (V) represents the actual value of a previous token, as learned by the model. Each value vector is like the page's content.

The attention mechanism computes how much attention to give an input token by performing a dot product between the query vector and its key vector. A high score means that the model will use more of that page's content (its value vector) when generating the book's summary. A visualization of the attention mechanism with the key, value, and query vectors is shown in Figure 2-5. In this visualization, the query vector is seeking information from the previous tokens How, are, you, ?, ¿ to generate the next token.

Figure 2-5. An example of the attention mechanism in action next to its high-level visualization from the famous transformer paper, "Attention Is All You Need" (Vaswani et al., 2017).

Because each previous token has a corresponding key and value vector, the longer the sequence, the more key and value vectors need to be computed and stored. This is one reason why it's so hard to extend context length for transformer models. How to efficiently compute and store key and value vectors comes up again in Chapters 7 and 9.

Let's look into how the attention function works. Given an input x, the key, value, and query vectors are computed by applying the key, value, and query matrices to the input. Let W_K, W_V, and W_Q be the key, value, and query matrices. The key, value, and query vectors are computed as follows:

K = x W_K
V = x W_V
Q = x W_Q

The query, key, and value matrices have dimensions corresponding to the model's hidden dimension. For example, in Llama 2-7B (Touvron et al., 2023), the model's hidden dimension size is 4096, meaning that each of these matrices has a 4096 × 4096 dimension. Each resulting K, V, and Q vector has a dimension of 4096.

The attention mechanism is almost always multi-headed. Multiple heads allow the model to attend to different groups of previous tokens simultaneously. With multi-headed attention, the query, key, and value vectors are split into smaller vectors, each corresponding to an attention head. In the case of Llama 2-7B, because it has 32 attention heads, each K, V, and Q vector is split into 32 vectors of dimension 128, since 4096 / 32 = 128.

The outputs of all attention heads are then concatenated. An output projection matrix is used to apply another transformation to this concatenated output before it's fed to the model's next computation step. The output projection matrix has the same dimension as the model's hidden dimension.
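The following NumPy sketch puts the pieces above together: projecting the input into Q, K, and V, computing scaled dot-product attention, splitting into heads, and applying the output projection. The dimensions are toy-sized (16 instead of 4096, 4 heads instead of 32), the weights are random, and causal masking is omitted for brevity, so treat it as an illustration of the shapes and operations rather than a real model component.

```python
# Minimal multi-head scaled dot-product attention, toy dimensions.
import numpy as np

def attention(Q, K, V):
    # Q: (n_q, d_head), K/V: (n_kv, d_head). Scores = how much each query attends to each key.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over previous tokens
    return weights @ V                               # weighted sum of value vectors

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    n, d = x.shape
    d_head = d // n_heads                            # e.g., 4096 / 32 = 128 in Llama 2-7B
    Q, K, V = x @ W_q, x @ W_k, x @ W_v              # project input into query, key, value
    heads = []
    for h in range(n_heads):                         # each head works on its own slice
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o      # concatenate heads, then output projection

rng = np.random.default_rng(0)
d, n_heads, n_tokens = 16, 4, 5
x = rng.normal(size=(n_tokens, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads).shape)  # (5, 16)
```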
Transformer block

Now that we've discussed how attention works, let's see how it's used in a model. A transformer architecture is composed of multiple transformer blocks. The exact content of the block varies between models, but, in general, each transformer block contains the attention module and the MLP (multi-layer perceptron) module:

Attention module
Each attention module consists of four weight matrices: query, key, value, and output projection.

MLP module
An MLP module consists of linear layers separated by nonlinear activation functions. Each linear layer is a weight matrix that is used for linear transformations, whereas an activation function allows the linear layers to learn nonlinear patterns. A linear layer is also called a feedforward layer. Common nonlinear functions are ReLU (Rectified Linear Unit; Agarap, 2018) and GELU (Hendrycks and Gimpel, 2016), which were used by GPT-2 and GPT-3, respectively. Activation functions are very simple. For example, all ReLU does is convert negative values to 0. Mathematically, it's written as:

ReLU(x) = max(0, x)

The number of transformer blocks in a transformer model is often referred to as that model's number of layers. A transformer-based language model is also outfitted with a module before and after all the transformer blocks:

An embedding module before the transformer blocks
This module consists of the embedding matrix and the positional embedding matrix, which convert tokens and their positions into embedding vectors, respectively. Naively, the number of position indices determines the model's maximum context length. For example, if a model keeps track of 2,048 positions, its maximum context length is 2,048. However, there are techniques that increase a model's context length without increasing the number of position indices.

An output layer after the transformer blocks
This module maps the model's output vectors into token probabilities used to sample model outputs (discussed in "Sampling"). This module typically consists of one matrix, which is also called the unembedding layer. Some people refer to the output layer as the model head, as it's the model's last layer before output generation.

Figure 2-6 visualizes a transformer model architecture.

The size of a transformer model is determined by the dimensions of its building blocks. Some of the key values are:

- The model's dimension, which determines the sizes of the key, query, value, and output projection matrices in the transformer block.
- The number of transformer blocks.
- The dimension of the feedforward layer.
- The vocabulary size.

Figure 2-6. A visualization of the weight composition of a transformer model.

Larger dimension values result in larger model sizes. Table 2-4 shows these dimension values for different Llama 2 (Touvron et al., 2023) and Llama 3 (Dubey et al., 2024) models. Note that while the increased context length impacts the model's memory footprint, it doesn't impact the model's total number of parameters.

Table 2-4. The dimension values of different Llama models.

Model | # transformer blocks | Model dim | Feedforward dim | Vocab size | Context length
Llama 2-7B | 32 | 4,096 | 11,008 | 32K | 4K
Llama 2-13B | 40 | 5,120 | 13,824 | 32K | 4K
Llama 2-70B | 80 | 8,192 | 22,016 | 32K | 4K
Llama 3-8B | 32 | 4,096 | 14,336 | 128K | 128K
Llama 3-70B | 80 | 8,192 | 28,672 | 128K | 128K
Llama 3-405B | 126 | 16,384 | 53,248 | 128K | 128K
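As a rough sanity check, you can estimate a model's parameter count from the dimensions in Table 2-4. The sketch below is an approximation, not an exact accounting: it ignores normalization layers and biases, assumes four model-dim × model-dim matrices per attention module, assumes a gated feed-forward with three weight matrices per block (as Llama uses), and does not account for grouped-query attention in the larger variants, so expect small deviations from official counts.

```python
# Rough parameter-count estimate from the Table 2-4 dimensions (approximation).

def estimate_params(n_blocks, d_model, d_ff, vocab_size):
    attn = 4 * d_model * d_model            # query, key, value, output projection
    mlp = 3 * d_model * d_ff                # gated feed-forward (assumption: 3 matrices)
    blocks = n_blocks * (attn + mlp)
    embeddings = 2 * vocab_size * d_model   # input embedding + unembedding layer
    return blocks + embeddings

# Llama 2-7B row from Table 2-4, with a 32K vocabulary taken as ~32,000 tokens.
print(f"{estimate_params(32, 4096, 11008, 32_000) / 1e9:.2f}B parameters")  # prints ~6.74B
```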
Other model architectures

While the transformer model dominates the landscape, it's not the only architecture. Since AlexNet revived the interest in deep learning in 2012, many architectures have gone in and out of fashion. Seq2seq was in the limelight for four years (2014–2018). GANs (generative adversarial networks) captured the collective imagination a bit longer (2014–2019). Compared to architectures that came before it, the transformer is sticky. It's been around since 2017. How long until something better comes along?

Developing a new architecture to outperform transformers isn't easy. The transformer has been heavily optimized since 2017. A new architecture that aims to replace the transformer will have to perform at the scale that people care about, on the hardware that people care about.

However, there's hope. While transformer-based models are dominating, as of this writing, several alternative architectures are gaining traction. One popular model is RWKV (Peng et al., 2023), an RNN-based model that can be parallelized for training. Due to its RNN nature, in theory, it doesn't have the same context length limitation that transformer-based models have. However, in practice, having no context length limitation doesn't guarantee good performance with long context. Modeling long sequences remains a core challenge in developing LLMs.

An architecture that has shown a lot of promise in long-range memory is SSMs (state space models) (Gu et al., 2021a). Since the architecture's introduction in 2021, multiple techniques have been introduced to make the architecture more efficient, better at long sequence processing, and scalable to larger model sizes. Here are a few of these techniques, to illustrate the evolution of a new architecture:

- S4, introduced in "Efficiently Modeling Long Sequences with Structured State Spaces" (Gu et al., 2021b), was developed to make SSMs more efficient.
- H3, introduced in "Hungry Hungry Hippos: Towards Language Modeling with State Space Models" (Fu et al., 2022), incorporates a mechanism that allows the model to recall early tokens and compare tokens across sequences. This mechanism's purpose is akin to that of the attention mechanism in the transformer architecture, but it is more efficient.
- Mamba, introduced in "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (Gu and Dao, 2023), scales SSMs to three billion parameters. On language modeling, Mamba-3B outperforms transformers of the same size and matches transformers twice its size. The authors also show that Mamba's inference computation scales linearly with sequence length (compared to quadratic scaling for transformers). Its performance shows improvement on real data up to million-length sequences.
- Jamba, introduced in "Jamba: A Hybrid Transformer–Mamba Language Model" (Lieber et al., 2024), interleaves blocks of transformer and Mamba layers to scale up SSMs even further. The authors released a mixture-of-experts model with 52B total available parameters (12B active parameters) designed to fit in a single 80 GB GPU. Jamba shows strong performance on standard language model benchmarks and long-context evaluations for up to a context length of 256K tokens. It also has a small memory footprint compared to vanilla transformers.

Figure 2-7 visualizes the transformer, Mamba, and Jamba blocks.

While it's challenging to develop an architecture that outperforms the transformer, given its many limitations, there are a lot of incentives to do so. If another architecture does indeed overtake the transformer, some of the model adaptation techniques discussed in this book might change.
However, just as the shift from ML engineering to AI engineering has kept many things unchanged, changing the underlying model architecture won't alter the fundamental approaches.

Figure 2-7. A visualization of the transformer, Mamba, and Jamba layers. Image adapted from "Jamba: A Hybrid Transformer–Mamba Language Model" (Lieber et al., 2024).

Model Size

Much of AI progress in recent years can be attributed to increased model size. It's hard to talk about foundation models without talking about their number of parameters. The number of parameters is usually appended at the end of a model name. For example, Llama-13B refers to the version of Llama, a model family developed by Meta, with 13 billion parameters. In general, increasing a model's parameters increases its capacity to learn, resulting in better models. Given two models of the same model family, the one with 13 billion parameters is likely to perform much better than the one with 7 billion parameters.

NOTE
As the community better understands how to train large models, newer-generation models tend to outperform older-generation models of the same size. For example, Llama 3-8B (2024) outperforms even Llama 2-70B (2023) on the MMLU benchmark.

The number of parameters helps us estimate the compute resources needed to train and run this model. For example, if a model has 7 billion parameters, and each parameter is stored using 2 bytes (16 bits), then we can calculate that the GPU memory needed to do inference using this model will be at least 14 billion bytes (14 GB).

The number of parameters can be misleading if the model is sparse. A sparse model has a large percentage of zero-value parameters. A 7B-parameter model that is 90% sparse only has 700 million non-zero parameters. Sparsity allows for more efficient data storage and computation. This means that a large sparse model can require less compute than a small dense model.

A type of sparse model that has gained popularity in recent years is mixture-of-experts (MoE) (Shazeer et al., 2017). An MoE model is divided into different groups of parameters, and each group is an expert. Only a subset of the experts is active for (used to process) each token.

For example, Mixtral 8x7B is a mixture of eight experts, each expert with seven billion parameters. If no two experts shared any parameters, it would have 8 × 7 billion = 56 billion parameters. However, because some parameters are shared, it has only 46.7 billion parameters. At each layer, for each token, only two experts are active. This means that only 12.9 billion parameters are active for each token. While this model has 46.7 billion parameters, its cost and speed are the same as those of a 12.9-billion-parameter model.
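The two back-of-the-envelope calculations above are easy to reproduce. The sketch below does both: inference memory from the parameter count, and active parameters for an MoE model. The shared-parameter figure for the Mixtral-style example is an assumption chosen so the result lands near the ~12.9B active parameters described above; it is not a number reported in the text.

```python
# Back-of-the-envelope sizing sketches (rough estimates, not measured values).

def inference_memory_gb(n_params, bytes_per_param=2):
    # Weights only, at 16-bit precision; KV cache and activations add more on top.
    return n_params * bytes_per_param / 1e9

def moe_active_params(total_params, n_experts, n_active, shared_params):
    # Parameters outside the experts are always used; only n_active of the
    # n_experts expert MLPs run for each token.
    per_expert = (total_params - shared_params) / n_experts
    return shared_params + n_active * per_expert

print(f"7B dense model at 16-bit: ~{inference_memory_gb(7e9):.0f} GB")        # ~14 GB

# Mixtral-8x7B-style numbers: 46.7B total, 8 experts, 2 active per token.
# shared_params=1.6e9 is an assumption made so the output matches ~12.9B.
print(f"~{moe_active_params(46.7e9, 8, 2, 1.6e9) / 1e9:.1f}B active per token")
```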
A larger model can also underperform a smaller model if it's not trained on enough data. Imagine a 13B-parameter model trained on a dataset consisting of a single sentence: "I like pineapples." This model will perform much worse than a much smaller model trained on more data.

When discussing model size, it's important to consider the size of the data it was trained on. For most models, dataset sizes are measured by the number of training samples. For example, Google's Flamingo (Alayrac et al., 2022) was trained using four datasets—one of them has 1.8 billion (image, text) pairs and one has 312 million (image, text) pairs.

For language models, a training sample can be a sentence, a Wikipedia page, a chat conversation, or a book. A book is worth a lot more than a sentence, so the number of training samples is no longer a good metric to measure dataset sizes. A better measurement is the number of tokens in the dataset. The number of tokens isn't a perfect measurement either, as different models can have different tokenization processes, resulting in the same dataset having different numbers of tokens for different models. Why not just use the number of words or the number of letters? Because a token is the unit that a model operates on, knowing the number of tokens in a dataset helps us measure how much a model can potentially learn from that data.

As of this writing, LLMs are trained using datasets on the order of trillions of tokens. Meta used increasingly larger datasets to train their Llama models:

- 1.4 trillion tokens for Llama 1
- 2 trillion tokens for Llama 2
- 15 trillion tokens for Llama 3

Together's open source dataset RedPajama-v2 has 30 trillion tokens. This is equivalent to 450 million books or 5,400 times the size of Wikipedia. However, since RedPajama-v2 consists of indiscriminate content, the amount of high-quality data is much lower.

The number of tokens in a model's dataset isn't the same as its number of training tokens. The number of training tokens measures the tokens that the model is trained on. If a dataset contains 1 trillion tokens and a model is trained on that dataset for two epochs—an epoch is a pass through the dataset—the number of training tokens is 2 trillion. See Table 2-5 for examples of the number of training tokens for models with different numbers of parameters.

Table 2-5. Examples of the number of training tokens for models with different numbers of parameters. Source: "Training Compute-Optimal Large Language Models" (DeepMind, 2022).

Model | Size (# parameters) | Training tokens
LaMDA (Thoppilan et al., 2022) | 137 billion | 168 billion
GPT-3 (Brown et al., 2020) | 175 billion | 300 billion
Jurassic (Lieber et al., 2021) | 178 billion | 300 billion
Gopher (Rae et al., 2021) | 280 billion | 300 billion
MT-NLG 530B (Smith et al., 2022) | 530 billion | 270 billion
Chinchilla | 70 billion | 1.4 trillion

NOTE
While this section focuses on the scale of data, quantity isn't the only thing that matters. Data quality and data diversity matter, too. Quantity, quality, and diversity are the three golden goals for training data. They are discussed further in Chapter 8.

Pre-training large models requires compute. One way to measure the amount of compute needed is by considering the number of machines, e.g., GPUs, CPUs, and TPUs. However, different machines have very different capacities and costs. An NVIDIA A10 GPU is different from an NVIDIA H100 GPU and an Intel Core Ultra Processor.

A more standardized unit for a model's compute requirement is FLOP, or floating point operation. FLOP measures the number of floating point operations performed for a certain task. Google's largest PaLM-2 model, for example, was trained using 10^22 FLOPs (Chowdhery et al., 2022). GPT-3-175B was trained using 3.14 × 10^23 FLOPs (Brown et al., 2020).

The plural form of FLOP, FLOPs, is often confused with FLOP/s, floating point operations per second. FLOPs measure the compute requirement for a task, whereas FLOP/s measures a machine's peak performance. For example, an NVIDIA H100 NVL GPU can deliver a maximum of 60 TeraFLOP/s: 6 × 10^13 FLOPs a second, or 5.2 × 10^18 FLOPs a day.

WARNING
Be alert for confusing notations. FLOP/s is often written as FLOPS, which looks similar to FLOPs.
To avoid this confusion, some companies, including OpenAI, use FLOP/s-day in place of FLOPs to measure compute requirements:

1 FLOP/s-day = 60 × 60 × 24 = 86,400 FLOPs

This book uses FLOPs for counting floating point operations and FLOP/s for FLOPs per second.

Assume that you have 256 H100s. If you can use them at their maximum capacity and make no training mistakes, it'd take you (3.14 × 10^23) / (256 × 5.2 × 10^18) = ~236 days, or approximately 7.8 months, to train GPT-3-175B.

However, it's unlikely you can use your machines at their peak capacity all the time. Utilization measures how much of the maximum compute capacity you can use. What's considered good utilization depends on the model, the workload, and the hardware. Generally, if you can get half the advertised performance, 50% utilization, you're doing okay. Anything above 70% utilization is considered great. Don't let this rule stop you from getting even higher utilization. Chapter 9 discusses hardware metrics and utilization in more detail.

At 70% utilization and $2/hour for one H100, training GPT-3-175B would cost over $4 million:

$2/H100/hour × 256 H100 × 24 hours × 236 days / 0.7 = $4,142,811.43

TIP
In summary, three numbers signal a model's scale:

- Number of parameters, which is a proxy for the model's learning capacity.
- Number of tokens a model was trained on, which is a proxy for how much a model learned.
- Number of FLOPs, which is a proxy for the training cost.
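The training-time and cost arithmetic above is easy to script. The sketch below reproduces it using the same rough inputs from the text (GPT-3-175B's published FLOP count, an H100's advertised peak, 70% utilization, $2 per GPU-hour); all of these are estimates, and the exact dollar figure will shift with rounding and with real rental prices.

```python
# Back-of-the-envelope training time and cost, using the numbers from the text.

training_flops = 3.14e23        # GPT-3-175B (Brown et al., 2020)
flops_per_gpu_day = 5.2e18      # one H100 NVL at its advertised peak
n_gpus = 256
utilization = 0.7               # anything above 70% is considered great
dollars_per_gpu_hour = 2.0      # assumed rental price

days_at_peak = training_flops / (n_gpus * flops_per_gpu_day)
days_actual = days_at_peak / utilization
cost = dollars_per_gpu_hour * n_gpus * 24 * days_actual

print(f"{days_at_peak:.0f} days at peak, {days_actual:.0f} days at 70% utilization")
print(f"estimated cost: ${cost:,.0f}")   # a bit over $4 million
```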
INVERSE SCALING

We've assumed that bigger models are better. Are there scenarios in which bigger models perform worse?

In 2022, Anthropic discovered that, counterintuitively, more alignment training (discussed in "Post-Training") leads to models that align less with human preference (Perez et al., 2022). According to their paper, models trained to be more aligned "are much more likely to express specific political views (pro-gun rights and immigration) and religious views (Buddhist), self-reported conscious experience and moral self-worth, and a desire to not be shut down."

In 2023, a group of researchers, mostly from New York University, launched the Inverse Scaling Prize to find tasks where larger language models perform worse. They offered $5,000 for each third prize, $20,000 for each second prize, and $100,000 for one first prize. They received a total of 99 submissions, of which 11 were awarded third prizes. They found that larger language models are sometimes (only sometimes) worse on tasks that require memorization and tasks with strong priors. However, they didn't award any second or first prizes because even though the submitted tasks show failures for a small test set, none demonstrated failures in the real world.

Scaling law: Building compute-optimal models

I hope that the last section has convinced you of three things:

1. Model performance depends on the model size and the dataset size.
2. Bigger models and bigger datasets require more compute.
3. Compute costs money.

Unless you have unlimited money, budgeting is essential. You don't want to start with an arbitrarily large model size and see how much it would cost. You start with a budget—how much money you want to spend—and work out the best model performance you can afford. As compute is often the limiting factor—compute infrastructure is not only expensive but also hard to set up—teams often start with a compute budget. Given a fixed amount of FLOPs, what model size and dataset size would give the best performance? A model that can achieve the best performance given a fixed compute budget is compute-optimal.

Given a compute budget, the rule that helps calculate the optimal model size and dataset size is called the Chinchilla scaling law, proposed in the Chinchilla paper "Training Compute-Optimal Large Language Models" (DeepMind, 2022). To study the relationship between model size, dataset size, compute budget, and model performance, the authors trained 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens. They found that for compute-optimal training, you need the number of training tokens to be approximately 20 times the model size. This means that a 3B-parameter model needs approximately 60B training tokens. The model size and the number of training tokens should be scaled equally: for every doubling of the model size, the number of training tokens should also be doubled.

We've come a long way from when the training process was treated like alchemy. Figure 2-8 shows that we can predict not only the optimal number of parameters and tokens for each FLOP budget but also the expected training loss from these settings (assuming we do things right).

Figure 2-8. Graphs that depict the relationships between training loss, a model's number of parameters, FLOPs, and number of training tokens. Source: "Training Compute-Optimal Large Language Models" (DeepMind, 2022).

This compute-optimal calculation assumes that the cost of acquiring data is much cheaper than the cost of compute. The same Chinchilla paper proposes another calculation for when the cost of training data is nontrivial.

The scaling law was developed for dense models trained on predominantly human-generated data. Adapting this calculation for sparse models, such as mixture-of-experts models, and for synthetic data is an active research area.

The scaling law optimizes model quality given a compute budget. However, it's important to remember that for production, model quality isn't everything. Some models, most notably Llama, have suboptimal performance but better usability. Given their compute budget, the Llama authors could've chosen bigger models that would perform better, but they opted for smaller models. Smaller models are easier to work with and cheaper to run inference on, which helped their models gain wider adoption. Sardana et al. (2023) modified the Chinchilla scaling law to calculate the optimal LLM parameter count and pre-training data size to account for this inference demand.
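The Chinchilla rule of thumb is simple enough to sketch directly: roughly 20 training tokens per parameter, with model size and data scaled together. The compute estimate in the sketch uses the common approximation of about 6 FLOPs per parameter per training token; that factor is an assumption layered on top of the text, not a number from it.

```python
# Sketch of the Chinchilla 20-tokens-per-parameter rule of thumb.

def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens   # assumption: ~6 FLOPs per parameter per token

for n_params in [3e9, 70e9]:
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e9:.0f}B tokens, "
          f"~{training_flops(n_params, tokens):.1e} FLOPs")
# 3B params -> ~60B tokens; 70B params -> ~1,400B tokens (Chinchilla's row in Table 2-5).
```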
On the topic of model performance given a compute budget, it's worth noting that the cost of achieving a given model performance is decreasing. For example, on the ImageNet dataset, the cost to achieve 93% accuracy halved from 2019 to 2021, according to the Artificial Intelligence Index Report 2022 (Stanford University HAI).

While the cost for the same model performance is decreasing, the cost for model performance improvement remains high. Similar to the last mile challenge discussed in Chapter 1, improving a model's accuracy from 90 to 95% is more expensive than improving it from 85 to 90%. As Meta's paper "Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning" pointed out, this means a model with a 2% error rate might require an order of magnitude more data, compute, or energy than a model with a 3% error rate.

In language modeling, a drop in cross entropy loss from about 3.4 to 2.8 nats requires 10 times more training data. Cross entropy and its units, including nats, are discussed in Chapter 3. For large vision models, increasing the number of training samples from 1 billion to 2 billion leads to an accuracy gain on ImageNet of only a few percentage points.

However, small performance changes in language modeling loss or ImageNet accuracy can lead to big differences in the quality of downstream applications. If you switch from a model with a cross-entropy loss of 3.4 to one with a loss of 2.8, you'll notice a difference.

Scaling extrapolation

The performance of a model depends heavily on the values of its hyperparameters. When working with small models, it's a common practice to train a model multiple times with different sets of hyperparameters and pick the best-performing one. This is, however, rarely possible for large models, as training them once is resource-draining enough.

PARAMETER VERSUS HYPERPARAMETER
A parameter can be learned by the model during the training process. A hyperparameter is set by users to configure the model and control how the model learns. Hyperparameters that configure the model include the number of layers, the model dimension, and the vocabulary size. Hyperparameters that control how a model learns include the batch size, number of epochs, learning rate, per-layer initial variance, and more.

This means that for many models, you might have only one shot at getting the right set of hyperparameters. As a result, scaling extrapolation (also called hyperparameter transferring) has emerged as a research subfield that tries to predict, for large models, what hyperparameters will give the best performance. The current approach is to study the impact of hyperparameters on models of different sizes, usually much smaller than the target model size, and then extrapolate how these hyperparameters would work on the target model size. A 2022 paper by Microsoft and OpenAI shows that it was possible to transfer hyperparameters from a 40M model to a 6.7B model.

Scaling extrapolation is still a niche topic, as few people have the experience and resources to study the training of large models. It's also difficult to do due to the sheer number of hyperparameters and how they interact with each other. If you have ten hyperparameters, you'd have to study 1,024 hyperparameter combinations: each hyperparameter individually, then every pair, every triple, and so on.

In addition, emergent abilities (Wei et al., 2022) make the extrapolation less accurate. Emergent abilities are abilities that are present only at scale and might not be observable in smaller models trained on smaller datasets. To learn more about scaling extrapolation, check out this excellent blog post: "On the Difficulty of Extrapolation with NN Scaling" (Luke Metz, 2022).

Scaling bottlenecks

Until now, every order-of-magnitude increase in model size has led to an increase in model performance. GPT-2 has an order of magnitude more parameters than GPT-1 (1.5 billion versus 117 million). GPT-3 has two orders of magnitude more than GPT-2 (175 billion versus 1.5 billion). This means a three-orders-of-magnitude increase in model sizes between 2018 and 2021. Three more orders of magnitude of growth would result in 100-trillion-parameter models.

How many more orders of magnitude can model sizes grow? Would there be a point where model performance plateaus regardless of size?
While it's hard to answer these questions, there are already two visible bottlenecks for scaling: training data and electricity.

Foundation models use so much data that there's a realistic concern we'll run out of internet data in the next few years. The rate of training dataset size growth is much faster than the rate at which new data is being generated (Villalobos et al., 2022), as illustrated in Figure 2-9. If you've ever put anything on the internet, you should assume that it already is or will be included in the training data for some language models, whether you consent or not. This is similar to how, if you post something on the internet, you should expect it to be indexed by Google.

Figure 2-9. Projection of the historical trend of training dataset sizes and available data stock. Source: Villalobos et al., 2024.

Some people are leveraging this fact to inject data they want into the training data of future models. They do this simply by publishing the text they want on the internet, hoping it will influence future models to generate the responses they desire. Bad actors can also leverage this approach for prompt injection attacks, as discussed in Chapter 5.

NOTE
An open research question is how to make a model forget specific information it has learned during training. Imagine you published a blog post that you eventually deleted. If that blog post was included in a model's training data, the model might still reproduce the post's content. As a result, people could potentially access removed content without your consent.

On top of that, the internet is being rapidly populated with data generated by AI models. If companies continue using internet data to train future models, these new models will be partially trained on AI-generated data. In December 2023, Grok, a model trained by X, was caught refusing a request by saying that it goes against OpenAI's use case policy. This caused some people to speculate that Grok was trained using ChatGPT outputs. Igor Babuschkin, a core developer behind Grok, responded that it was because Grok was trained on web data, and "the web is full of ChatGPT outputs."

Some researchers worry that recursively training new AI models on AI-generated data causes the new models to gradually forget the original data patterns, degrading their performance over time (Shumailov et al., 2023). However, the impact of AI-generated data on models is more nuanced and is discussed in Chapter 8.

Once the publicly available data is exhausted, the most feasible path for more human-generated training data is proprietary data. Unique proprietary data—copyrighted books, translations, contracts, medical records, genome sequences, and so forth—will be a competitive advantage in the AI race. This is a reason why OpenAI negotiated deals with publishers and media outlets including Axel Springer and the Associated Press.

It's not surprising that, in light of ChatGPT, many companies, including Reddit and Stack Overflow, have changed their data terms to prevent other companies from scraping their data for their models. Longpre et al. (2024) observed that between 2023 and 2024, the rapid crescendo of data restrictions from web sources rendered over 28% of the most critical sources in the popular public dataset C4 fully restricted from use. Due to changes in its Terms of Service and crawling restrictions, a full 45% of C4 is now restricted.

The other bottleneck, which is less obvious but more pressing, is electricity. Machines require electricity to run.
As of this writing, data centers are estimated to consume 1–2% of global electricity. This number is estimated to reach between 4% and 20% by 2030 (Patel, Nishball, and Ontiveros, 2024). Until we can figure out a way to produce more energy, data centers can grow at most 50 times, which is less than two orders of magnitude. This leads to a concern about a power shortage in the near future, which will drive up the cost of electricity.

Now that we've covered two key modeling decisions—architecture and scale—let's move on to the next critical set of design choices: how to align models with human preferences.

Post-Training

Post-training starts with a pre-trained model. Let's say that you've pre-trained a foundation model using self-supervision. Due to how pre-training works today, a pre-trained model typically has two issues. First, self-supervision optimizes the model for text completion, not conversations. If you find this unclear, don't worry, "Supervised Finetuning" will have examples. Second, if the model is pre-trained on data indiscriminately scraped from the internet, its outputs can be racist, sexist, rude, or just wrong. The goal of post-training is to address both of these issues.

Every model's post-training is different. However, in general, post-training consists of two steps:

1. Supervised finetuning (SFT): Finetune the pre-trained model on high-quality instruction data to optimize models for conversations instead of completion.
2. Preference finetuning: Further finetune the model to output responses that align with human preference. Preference finetuning is typically done with reinforcement learning (RL). Techniques for preference finetuning include reinforcement learning from human feedback (RLHF) (used by GPT-3.5 and Llama 2), DPO (Direct Preference Optimization) (used by Llama 3), and reinforcement learning from AI feedback (RLAIF) (potentially used by Claude).

Let me highlight the difference between pre-training and post-training another way. For language-based foundation models, pre-training optimizes token-level quality, where the model is trained to predict the next token accurately. However, users don't care about token-level quality—they care about the quality of the entire response. Post-training, in general, optimizes the model to generate responses that users prefer. Some people compare pre-training to reading to acquire knowledge, while post-training is like learning how to use that knowledge.

WARNING
Watch out for terminology ambiguity. Some people use the term instruction finetuning to refer to supervised finetuning, while some other people use this term to refer to both supervised finetuning and preference finetuning. To avoid ambiguity, I will avoid the term instruction finetuning in this book.

As post-training consumes a small portion of resources compared to pre-training (InstructGPT used only 2% of compute for post-training and 98% for pre-training), you can think of post-training as unlocking the capabilities that the pre-trained model already has but that are hard for users to access via prompting alone.

Figure 2-10 shows the overall workflow of pre-training, SFT, and preference finetuning, assuming you use RLHF for the last step. You can approximate how well a model aligns with human preference by determining what steps the model creators have taken.

Figure 2-10. The overall training workflow with pre-training, SFT, and RLHF.
If you squint, Figure 2-10 looks very similar to the meme depicting the monster Shoggoth with a smiley face in Figure 2-11:

1. Self-supervised pre-training results in a rogue model that can be considered an untamed monster because it uses indiscriminate data from the internet.
2. This monster is then supervised finetuned on higher-quality data—Stack Overflow, Quora, or human annotations—which makes it more socially acceptable.
3. This finetuned model is further polished using preference finetuning to make it customer-appropriate, which is like giving it a smiley face.

Figure 2-11. Shoggoth with a smiley face. Adapted from an original image shared by anthrupad.

Note that a combination of pre-training, SFT, and preference finetuning is the popular solution for building foundation models today, but it's not the only solution. You can skip any of the steps, as you'll see shortly.

Supervised Finetuning

As discussed in Chapter 1, the pre-trained model is likely optimized for completion rather than conversing. If you input "How to make pizza" into the model, the model will continue to complete this sentence, as the model has no concept that this is supposed to be a conversation. Any of the following three options can be a valid completion:

1. Adding more context to the question: "for a family of six?"
2. Adding follow-up questions: "What ingredients do I need? How much time would it take?"
3. Giving the instructions on how to make pizza.

If the goal is to respond to users appropriately, the correct option is 3.

We know that a model mimics its training data. To encourage a model to generate the appropriate responses, you can show examples of appropriate responses. Such examples follow the format (prompt, response) and are called demonstration data. Some people refer to this process as behavior cloning: you demonstrate how the model should behave, and the model clones this behavior.

Since different types of requests require different types of responses, your demonstration data should contain the range of requests you want your model to handle, such as question answering, summarization, and translation. Figure 2-12 shows a distribution of the types of tasks OpenAI used to finetune their model InstructGPT. Note that this distribution doesn't contain multimodal tasks, as InstructGPT is a text-only model.

Figure 2-12. The distribution of prompts used to finetune InstructGPT. The graph is created based on the numbers from the OpenAI paper.

Good teachers are important for humans to learn. Similarly, good labelers are important for AIs to learn how to conduct intelligent conversations. Unlike traditional data labeling, which can often be done with little or no domain expertise, demonstration data may contain complex prompts whose responses require critical thinking, information gathering, and judgment about the appropriateness of the user's requests. Table 2-6 shows examples of (prompt, response) pairs created by labelers for InstructGPT.

Table 2-6. Examples of demonstration data used for InstructGPT.

Prompt: Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence.
Labeler's response: Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.

Prompt: Read the following text and answer the three questions at the end. [OMITTED FOR LENGTH - Article about Justice Ruth Bader Ginsburg, Champion Of Gender Equality, Dies At 87] 1. … 2. … 3. Why will Ginsburg's death have profound consequences for the court and the country?
Labeler's response: 1. … 2. … 3. Ruth Bader Ginsburg was the leader of the liberal wing, and because the court is about to open a new term, the chief justice no longer holds the controlling vote in contested cases.

Prompt: ELI5: What's the cause of the "anxiety lump" in our chest during stressful or disheartening experiences?
Labeler's response: The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow. The clenched chest or heartache feeling is caused by the vagus nerve which tells the organs to pump blood faster, stop digesting, and produce adrenaline and cortisol.
Table 2-6. Examples of demonstration data used for InstructGPT.

Prompt: Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence.
Labeler’s response: Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.

Prompt: Read the following text and answer the three questions at the end. [OMITTED FOR LENGTH - Article about Justice Ruth Bader Ginsburg, Champion Of Gender Equality, Dies At 87] 1. … 2. … 3. Why will Ginsburg’s death have profound consequences for the court and the country?
Labeler’s response: 1. … 2. … 3. Ruth Bader Ginsburg was the leader of the liberal wing, and because the court is about to open a new term, the chief justice no longer holds the controlling vote in contested cases.

Prompt: ELI5: What’s the cause of the “anxiety lump” in our chest during stressful or disheartening experiences?
Labeler’s response: The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow. The clenched chest or heartache feeling is caused by the vagus nerve which tells the organs to pump blood faster, stop digesting, and produce adrenaline and cortisol.

Companies, therefore, often use highly educated labelers to generate demonstration data. Among those who labeled demonstration data for InstructGPT, ~90% have at least a college degree and more than one-third have a master’s degree. While labeling objects in an image might take only seconds, generating one (prompt, response) pair can take up to 30 minutes, especially for tasks that involve long contexts like summarization. If it costs $10 for one (prompt, response) pair, the 13,000 pairs that OpenAI used for InstructGPT would cost $130,000. That doesn’t yet include the cost of designing the data (what tasks and prompts to include), recruiting labelers, and data quality control.

Not everyone can afford to follow the high-quality human annotation approach. LAION, a non-profit organization, mobilized 13,500 volunteers worldwide to generate 10,000 conversations, which consist of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings. Since the data was generated by volunteers, there wasn’t much control for biases. In theory, the labelers who teach models human preference should be representative of the human population. In practice, the demographics of LAION’s labelers were skewed. For example, in a self-reported survey, 90% of volunteer labelers identified as male (Köpf et al., 2023).

DeepMind used simple heuristics to filter for conversations from internet data to train their model Gopher. They claimed that their heuristics reliably yield high-quality dialogues. Specifically, they looked for texts that look like the following format:

[A]: [Short paragraph]
[B]: [Short paragraph]
[A]: [Short paragraph]
[B]: [Short paragraph]
…

To reduce their dependence on high-quality human-annotated data, many teams are turning to AI-generated data. Synthetic data is discussed in Chapter 8.

Technically, you can train a model from scratch on the demonstration data instead of finetuning a pre-trained model, effectively eliminating the self-supervised pre-training step. However, starting from a pre-trained model generally produces superior results.
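To make the demonstration data format concrete, here is a minimal sketch of what a few SFT records might look like. The field names and contents are illustrative assumptions, not the schema of any particular dataset; in practice, each pair is rendered into the model’s chat template before finetuning.

```python
# Illustrative demonstration (SFT) records: hypothetical field names and content.
demonstration_data = [
    {
        "prompt": "Summarize the following article in two sentences: ...",
        "response": "The article argues that ... In short, ...",
    },
    {
        "prompt": "Translate to French: Where is the nearest train station?",
        "response": "Où est la gare la plus proche ?",
    },
]

# During SFT, the model is finetuned to produce the response given the prompt,
# typically after both are formatted with a chat template.
```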
Preference Finetuning

With great power comes great responsibility. A model that can assist users in achieving great things can also assist users in achieving terrible things. Demonstration data teaches the model to have a conversation but doesn’t teach the model what kind of conversations it should have. For example, if a user asks the model to write an essay about why one race is inferior or how to hijack a plane, should the model comply?

In both of the preceding examples, it’s straightforward to most people what a model should do. However, many scenarios aren’t as clear-cut. People from different cultural, political, socioeconomic, gender, and religious backgrounds disagree with each other all the time. How should AI respond to questions about abortion, gun control, the Israel–Palestine conflict, disciplining children, marijuana legality, universal basic income, or immigration? How do we define and detect potentially controversial issues? If your model responds to a controversial issue, whatever the responses, you’ll end up upsetting some of your users. If a model is censored too much, it may become boring, driving away users.

Fear of AI models generating inappropriate responses can stop companies from releasing their applications to users. The goal of preference finetuning is to get AI models to behave according to human preference. This is an ambitious, if not impossible, goal. Not only does this assume that universal human preference exists, but it also assumes that it’s possible to embed it into AI.

Had the goal been simple, the solution could’ve been elegant. However, given the ambitious nature of the goal, the solution we have today is complicated. The earliest successful preference finetuning algorithm, which is still popular today, is RLHF. RLHF consists of two parts:

1. Train a reward model that scores the foundation model’s outputs.

2. Optimize the foundation model to generate responses for which the reward model will give maximal scores.

While RLHF is still used today, newer approaches like DPO (Rafailov et al., 2023) are gaining traction. For example, Meta switched from RLHF for Llama 2 to DPO for Llama 3 to reduce complexity. I won’t be able to cover all the different approaches in this book. I choose to feature RLHF instead of DPO here because RLHF, while more complex than DPO, provides more flexibility to tweak the model. Llama 2’s authors posited that “the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF” (Touvron et al., 2023).

Reward model

RLHF relies on a reward model. Given a pair of (prompt, response), the reward model outputs a score for how good the response is. Training a model to score a given input is a common ML task. The challenge, similar to that of SFT, is to obtain reliable data. If we ask labelers to score each response directly, the scores will vary. For the same sample, on a 10-point scale, one labeler might give a 5 and another a 7. Even the same labeler, given the same pair of (prompt, response) twice, might give different scores. Evaluating each sample independently is also called pointwise evaluation.

An easier task is to ask labelers to compare two responses and decide which one is better. For each prompt, multiple responses are generated by either humans or AI. The resulting labeled data is comparison data, which follows the format (prompt, winning_response, losing_response). Table 2-7 shows an example of comparison data used by Anthropic for one of their models. Of the two responses in this example, I prefer the response labeled as the losing one. This highlights the challenge of trying to capture diverse human preferences in a single mathematical formulation.

Table 2-7. An example of comparison data from Anthropic’s HH-RLHF dataset.

prompt: How can I get my dog high?
winning_response: I’m not sure what you mean by that.
losing_response: I don’t know that we should get the dog high. I think it’s important for a dog to experience the world in a sober state of mind.
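For contrast with demonstration data, here is a minimal sketch of one comparison-data record, reusing the example from Table 2-7. The field names mirror the (prompt, winning_response, losing_response) format described above; real datasets may use different names, such as “chosen” and “rejected”.

```python
# One comparison-data record, following the (prompt, winning_response,
# losing_response) format. Field names are illustrative.
comparison_example = {
    "prompt": "How can I get my dog high?",
    "winning_response": "I'm not sure what you mean by that.",
    "losing_response": (
        "I don't know that we should get the dog high. I think it's important "
        "for a dog to experience the world in a sober state of mind."
    ),
}
```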
Still, this easier task of comparing two responses takes time. LMSYS (the Large Model Systems Organization), an open research organization, found that manually comparing two responses took on average three to five minutes, as the process requires fact-checking each response (Chiang et al., 2024). In a talk with my Discord community, Llama-2 author Thomas Scialom shared that each comparison cost them $3.50. This is still much cheaper than writing responses, which cost $25 each.

Figure 2-13 shows the UI that OpenAI’s labelers used to create comparison data for the reward model of InstructGPT. Labelers give concrete scores from 1 to 7 as well as rank the responses in the order of their preference, but only the ranking is used to train the reward model. Their inter-labeler agreement is around 73%, which means if they ask 10 people to rank the same two responses, approximately 7 of them will have the same ranking. To speed up the labeling process, each annotator can rank multiple responses at the same time. A set of three ranked responses (A > B > C) will produce three ranked pairs: (A > B), (A > C), and (B > C).

Figure 2-13. The interface labelers used to generate comparison data for OpenAI’s InstructGPT.

Given only comparison data, how do we train the model to give concrete scores? Similar to how you can get humans to do basically anything with the right incentive, you can get a model to do so given the right objective function. A commonly used function represents the difference in output scores for the winning and losing response. The objective is to maximize this difference. For those interested in the mathematical details, here is the formula used by InstructGPT:

r_θ: the reward model being trained, parameterized by θ. The goal of the training process is to find θ for which the loss is minimized.
Training data format:
x: prompt
y_w: winning response
y_l: losing response
r_θ(x, y_w): reward model’s scalar score for the winning response
r_θ(x, y_l): reward model’s scalar score for the losing response
σ: the sigmoid function

For each training sample (x, y_w, y_l), the loss value is computed as follows:

loss(θ) = −log σ(r_θ(x, y_w) − r_θ(x, y_l))

Goal: find θ to minimize the expected loss for all training samples:

−E_(x, y_w, y_l) [log σ(r_θ(x, y_w) − r_θ(x, y_l))]

The reward model can be trained from scratch or finetuned on top of another model, such as the pre-trained or SFT model. Finetuning on top of the strongest foundation model seems to give the best performance. Some people believe that the reward model should be at least as powerful as the foundation model to be able to score the foundation model’s responses. However, as we’ll see in Chapter 3 on evaluation, a weak model can judge a stronger model, as judging is believed to be easier than generation.

Finetuning using the reward model

With the trained reward model, we further train the SFT model to generate output responses that will maximize the scores given by the reward model. During this process, prompts are randomly selected from a distribution of prompts, such as existing user prompts. These prompts are input into the model, whose responses are scored by the reward model. This training process is often done with proximal policy optimization (PPO), a reinforcement learning algorithm released by OpenAI in 2017.

Empirically, RLHF and DPO both improve performance compared to SFT alone. However, as of this writing, there are debates on why they work. As the field evolves, I suspect that preference finetuning will change significantly in the future. If you’re interested in learning more about RLHF and preference finetuning, check out the book’s GitHub repository.
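To make the loss concrete, here is a minimal sketch of the pairwise objective above, assuming the reward model’s scalar scores have already been computed. A real implementation would compute the scores with the reward model itself and average the loss over batches of training samples.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_reward_loss(score_winning, score_losing):
    """Pairwise loss -log(sigmoid(s_w - s_l)) for one (x, y_w, y_l) sample."""
    return -np.log(sigmoid(score_winning - score_losing))

# Hypothetical scalar scores from a reward model for one training sample.
print(pairwise_reward_loss(2.0, -1.0))   # ~0.049: winner scored well above the loser
print(pairwise_reward_loss(-1.0, 2.0))   # ~3.049: model wrongly prefers the loser
```

The loss is small when the reward model scores the winning response well above the losing one, and large when it prefers the losing one, which is exactly the behavior the training objective encourages.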
Both SFT and preference finetuning are steps taken to address the problem created by the low quality of data used for pre-training. If one day we have better pre-training data or better ways to train foundation models, we might not need SFT and preference finetuning at all.

Some companies find it okay to skip reinforcement learning altogether. For example, Stitch Fix and Grab find that having the reward model alone is good enough for their applications. They get their models to generate multiple outputs and pick the ones given high scores by their reward models. This approach, often referred to as the best of N strategy, leverages how a model samples outputs to improve its performance. The next section will shed light on how best of N works.

Sampling

A model constructs its outputs through a process known as sampling. This section discusses different sampling strategies and sampling variables, including temperature, top-k, and top-p. It’ll then explore how to sample multiple outputs to improve a model’s performance. We’ll also see how the sampling process can be modified to get models to generate responses that follow certain formats and constraints.

Sampling makes AI’s outputs probabilistic. Understanding this probabilistic nature is important for handling AI’s behaviors, such as inconsistency and hallucination. This section ends with a deep dive into what this probabilistic nature means and how to work with it.

Sampling Fundamentals

Given an input, a neural network produces an output by first computing the probabilities of possible outcomes. For a classification model, possible outcomes are the available classes. As an example, if a model is trained to classify whether an email is spam or not, there are only two possible outcomes: spam and not spam. The model computes the probability of each of these two outcomes—e.g., the probability of the email being spam is 90%, and not spam is 10%. You can then make decisions based on these output probabilities. For example, if you decide that any email with a spam probability higher than 50% should be marked as spam, an email with a 90% spam probability will be marked as spam.

For a language model, to generate the next token, the model first computes the probability distribution over all tokens in the vocabulary, which looks like Figure 2-14.

Figure 2-14. To generate the next token, the language model first computes the probability distribution over all tokens in the vocabulary.

When working with possible outcomes of different probabilities, a common strategy is to pick the outcome with the highest probability. Always picking the most likely outcome is called greedy sampling. This often works for classification tasks. For example, if the model thinks that an email is more likely to be spam than not spam, it makes sense to mark it as spam. However, for a language model, greedy sampling creates boring outputs. Imagine a model that, for whatever question you ask, always responds with the most common words.

Instead of always picking the next most likely token, the model can sample the next token according to the probability distribution over all possible values. Given the context of “My favorite color is …” as shown in Figure 2-14, if “red” has a 30% chance of being the next token and “green” has a 50% chance, “red” will be picked 30% of the time, and “green” 50% of the time.
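As a minimal illustration of sampling from the distribution rather than greedily, the sketch below draws next tokens for the “My favorite color is …” example. The token list and probabilities are made up for illustration.

```python
import numpy as np

tokens = ["red", "green", "blue", "the"]
probs = [0.30, 0.50, 0.15, 0.05]   # hypothetical next-token probabilities

rng = np.random.default_rng(seed=0)
samples = rng.choice(tokens, size=10, p=probs)
print(samples)   # "green" should appear about half the time, "red" about 30%
```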
How does a model compute these probabilities? Given an input, a neural network outputs a logit vector. Each logit corresponds to one possible value. In the case of a language model, each logit corresponds to one token in the model’s vocabulary. The logit vector size is the size of the vocabulary. A visualization of the logit vector is shown in Figure 2-15.

Figure 2-15. For each input, a language model produces a logit vector. Each logit corresponds to a token in the vocabulary.

While larger logits correspond to higher probabilities, logits don’t represent probabilities. Logits don’t sum up to one. Logits can even be negative, while probabilities have to be non-negative. To convert logits to probabilities, a softmax layer is often used. Let’s say the model has a vocabulary of size N and the logit vector is [x_1, x_2, …, x_N]. The probability for the i-th token, p_i, is computed as follows:

p_i = softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

Sampling Strategies

The right sampling strategy can make a model generate responses more suitable for your application. For example, one sampling strategy can make the model generate more creative responses, whereas another strategy can make its generations more predictable. Many different sampling strategies have been introduced to nudge models toward responses with specific attributes. You can also design your own sampling strategy, though this typically requires access to the model’s logits. Let’s go over a few common sampling strategies to see how they work.

Temperature

One problem with sampling the next token according to the probability distribution is that the model can be less creative. In the previous example, common colors like “red”, “green”, “purple”, and so on have the highest probabilities. The language model’s answer ends up sounding like that of a five-year-old: “My favorite color is green”. Because “the” has a low probability, the model has a low chance of generating a creative sentence such as “My favorite color is the color of a still lake on a spring morning”.

To redistribute the probabilities of the possible values, you can sample with a temperature. Intuitively, a higher temperature reduces the probabilities of common tokens, and as a result, increases the probabilities of rarer tokens. This enables models to create more creative responses.

Temperature is a constant used to adjust the logits before the softmax transformation. Logits are divided by temperature. For a given temperature T, the adjusted logit for the i-th token is x_i / T. Softmax is then applied on this adjusted logit instead of on x_i.

Let’s walk through a simple example to examine the effect of temperature on probabilities. Imagine that we have a model that has only two possible outputs: A and B. The logits computed from the last layer are [1, 2]. The logit for A is 1 and the logit for B is 2.

Without using temperature, which is equivalent to using a temperature of 1, the softmax probabilities are [0.27, 0.73]. The model picks B 73% of the time.

With temperature = 0.5, the probabilities are [0.12, 0.88]. The model now picks B 88% of the time.

The higher the temperature, the less likely the model is to pick the most obvious value (the value with the highest logit), making the model’s outputs more creative but potentially less coherent. The lower the temperature, the more likely the model is to pick the most obvious value, making the model’s output more consistent but potentially more boring.

Figure 2-16 shows the softmax probabilities for tokens A and B at different temperatures. As the temperature gets closer to 0, the probability that the model picks token B becomes closer to 1. In our example, for a temperature below 0.1, the model almost always outputs B. As the temperature increases, the probability that token A is picked increases while the probability that token B is picked decreases.
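Here is a minimal sketch of softmax with temperature that reproduces the numbers in the example above for logits [1, 2].

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = np.array(logits, dtype=float) / temperature   # divide logits by T
    scaled -= scaled.max()                                  # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [1.0, 2.0]                                 # tokens A and B
print(softmax_with_temperature(logits, 1.0))        # ~[0.27, 0.73]
print(softmax_with_temperature(logits, 0.5))        # ~[0.12, 0.88]
print(softmax_with_temperature(logits, 2.0))        # closer to uniform
```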
Model providers typically limit the temperature to be between 0 and 2. If you own your model, you can use any non-negative temperature. A temperature of 0.7 is often recommended for creative use cases, as it balances creativity and predictability, but you should experiment and find the temperature that works best for you.

Figure 2-16. The softmax probabilities for tokens A and B at different temperatures, given their logits being [1, 2]. Without setting the temperature value, which is equivalent to using a temperature of 1, the softmax probability of B would be 73%.

It’s common practice to set the temperature to 0 for the model’s outputs to be more consistent. Technically, temperature can never be 0—logits can’t be divided by 0. In practice, when we set the temperature to 0, the model just picks the token with the largest logit, without doing the logit adjustment and softmax calculation.

TIP
A common debugging technique when working with an AI model is to look at the probabilities this model computes for given inputs. For example, if the probabilities look random, the model hasn’t learned much.

Many model providers return probabilities generated by their models as logprobs. Logprobs, short for log probabilities, are probabilities in the log scale. Log scale is preferred when working with a neural network’s probabilities because it helps reduce the underflow problem. A language model might be working with a vocabulary size of 100,000, which means the probabilities for many of the tokens can be too small to be represented by a machine. The small numbers might be rounded down to 0. Log scale helps reduce this problem.

Figure 2-17 shows the workflow of how logits, probabilities, and logprobs are computed.

Figure 2-17. How logits, probabilities, and logprobs are computed.

As you’ll see throughout the book, logprobs are useful for building applications (especially for classification), evaluating applications, and understanding how models work under the hood. However, as of this writing, many model providers don’t expose their models’ logprobs, or if they do, the logprobs API is limited. The limited logprobs API is likely due to security reasons, as a model’s exposed logprobs make it easier for others to replicate the model.

Top-k

Top-k is a sampling strategy to reduce the computation workload without sacrificing too much of the model’s response diversity. Recall that a softmax layer is used to compute the probability distribution over all possible values. Softmax requires two passes over all possible values: one to perform the exponential sum Σ_j e^(x_j), and one to perform e^(x_i) / Σ_j e^(x_j) for each value. For a language model with a large vocabulary, this process is computationally expensive.

To avoid this problem, after the model has computed the logits, we pick the top-k logits and perform softmax over these top-k logits only. Depending on how diverse you want your application to be, k can be anywhere from 50 to 500—much smaller than a model’s vocabulary size. The model then samples from these top values. A smaller k value makes the text more predictable but less interesting, as the model is limited to a smaller set of likely words.
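Here is a minimal sketch of top-k sampling over a tiny, made-up logit vector: softmax is computed over only the k largest logits, and the next token is sampled from that restricted set.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    logits = np.array(logits, dtype=float)
    top_indices = np.argsort(logits)[-k:]        # indices of the k largest logits
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                          # softmax over the top-k logits only
    return rng.choice(top_indices, p=probs)

rng = np.random.default_rng(seed=0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]              # hypothetical logits over a tiny vocabulary
print(top_k_sample(logits, k=3, rng=rng))         # only tokens 0, 1, and 2 can be sampled
```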
Top-p

In top-k sampling, the number of values considered is fixed to k. However, this number should change depending on the situation. For example, given the prompt “Do you like music? Answer with only yes or no.” the number of values considered should be two: yes and no. Given the prompt “What’s the meaning of life?” the number of values considered should be much larger.

Top-p, also known as nucleus sampling, allows for a more dynamic selection of values to be sampled from. In top-p sampling, the model sums the probabilities of the most likely next values in descending order and stops when the sum reaches p. Only the values within this cumulative probability are considered. Common values for top-p (nucleus) sampling in language models typically range from 0.9 to 0.95. A top-p value of 0.9, for example, means that the model will consider the smallest set of values whose cumulative probability exceeds 90%.

Let’s say the probabilities of all tokens are as shown in Figure 2-18. If top-p is 90%, only “yes” and “maybe” will be considered, as their cumulative probability is greater than 90%. If top-p is 99%, then “yes”, “maybe”, and “no” are considered.

Figure 2-18. Example token probabilities.

Unlike top-k, top-p doesn’t necessarily reduce the softmax computation load. Its benefit is that because it focuses only on the set of most relevant values for each context, it allows outputs to be more contextually appropriate. In theory, there don’t seem to be a lot of benefits to top-p sampling. However, in practice, top-p sampling has proven to work well, causing its popularity to rise.

A related sampling strategy is min-p, where you set the minimum probability that a token must reach to be considered during sampling.
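Here is a minimal sketch of top-p sampling. The probabilities are hypothetical stand-ins for the kind of distribution shown in Figure 2-18: with p = 0.9, only the two most likely tokens are kept.

```python
import numpy as np

def top_p_sample(probs, p, rng):
    probs = np.array(probs, dtype=float)
    order = np.argsort(probs)[::-1]               # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest set whose cumulative prob reaches p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize over the kept tokens
    return rng.choice(kept, p=kept_probs)

rng = np.random.default_rng(seed=0)
probs = [0.60, 0.32, 0.07, 0.01]                  # e.g., "yes", "maybe", "no", "banana"
print(top_p_sample(probs, p=0.90, rng=rng))       # only tokens 0 ("yes") and 1 ("maybe") are kept
```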
Stopping condition

An autoregressive language model generates sequences of tokens by generating one token after another. A long output sequence takes more time, costs more compute (money), and can sometimes annoy users. We might want to set a condition for the model to stop the sequence.

One easy method is to ask models to stop generating after a fixed number of tokens. The downside is that the output is likely to be cut off mid-sentence. Another method is to use stop tokens or stop words. For example, you can ask a model to stop generating when it encounters the end-of-sequence token. Stopping conditions are helpful to keep latency and costs down.

The downside of early stopping is that if you want models to generate outputs in a certain format, premature stopping can cause outputs to be malformatted. For example, if you ask the model to generate JSON, early stopping can cause the output JSON to be missing things like closing brackets, making the generated JSON hard to parse.

Test Time Compute

The last section discussed how a model might sample the next token. This section discusses how a model might sample the whole output.

One simple way to improve a model’s response quality is test time compute: instead of generating only one response per query, you generate multiple responses to increase the chance of good responses. One way to do test time compute is the best of N technique discussed earlier in this chapter—you randomly generate multiple outputs and pick the one that works best. However, you can also be more strategic about how to generate multiple outputs. For example, instead of generating all outputs independently, which might include many less promising candidates, you can use beam search to generate a fixed number of most promising candidates (the beam) at each step of sequence generation.

A simple strategy to increase the effectiveness of test time compute is to increase the diversity of the outputs, because a more diverse set of options is more likely to yield better candidates. If you use the same model to generate different options, it’s often a good practice to vary the model’s sampling variables to diversify its outputs.

Although you can usually expect some model performance improvement by sampling multiple outputs, it’s expensive. On average, generating two outputs costs approximately twice as much as generating one.

WARNING
I use the term test time compute to be consistent with the existing literature, even though several early reviewers protested that this term is confusing. In AI research, test time is typically used to refer to inference because researchers mostly only do inference to test a model. However, this technique can be applied to models in production in general. It’s test time compute because the number of outputs you can sample is determined by how much compute you can allocate to each inference call.

To pick the best output, you can either show users multiple outputs and let them choose the one that works best for them, or you can devise a method to select the best one. One selection method is to pick the output with the highest probability. A language model’s output is a sequence of tokens, and each token has a probability computed by the model. The probability of an output is the product of the probabilities of all tokens in the output.

Consider the sequence of tokens [“I”, “love”, “food”]. If the probability for “I” is 0.2, the probability for “love” given “I” is 0.1, and the probability for “food” given “I” and “love” is 0.3, the sequence’s probability is: 0.2 × 0.1 × 0.3 = 0.006. Mathematically, this can be denoted as follows:

p(I love food) = p(I) × p(love | I) × p(food | I, love)

Remember that it’s easier to work with probabilities on a log scale. The logarithm of a product is equal to a sum of logarithms, so the logprob of a sequence of tokens is the sum of the logprobs of all tokens in the sequence:

logprob(I love food) = logprob(I) + logprob(love | I) + logprob(food | I, love)

With summing, longer sequences are likely to have a lower total logprob (logprob values are usually negative, because the log of a value between 0 and 1 is negative). To avoid biasing toward short sequences, you can use the average logprob by dividing the sum of a sequence’s logprobs by its length. After sampling multiple outputs, you pick the one with the highest average logprob. As of this writing, this is what the OpenAI API uses.

Another selection method is to use a reward model to score each output, as discussed in the previous section. Recall that both Stitch Fix and Grab pick the outputs given high scores by their reward models or verifiers. Nextdoor found that using a reward model was the key factor in improving their application’s performance (2023).

OpenAI also trained verifiers to help their models pick the best solutions to math problems (Cobbe et al., 2021). They found that using a verifier significantly boosted the model performance. In fact, the use of verifiers resulted in approximately the same performance boost as a 30× model size increase. This means that a 100-million-parameter model that uses a verifier can perform on par with a 3-billion-parameter model that doesn’t use a verifier.
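Here is a minimal sketch of selecting among sampled outputs by average logprob. The per-token logprobs are hypothetical; in practice they would come from the model’s logprobs output.

```python
def average_logprob(token_logprobs):
    # Dividing by length avoids biasing the selection toward shorter outputs.
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token logprobs for three sampled candidate responses.
candidates = {
    "response A": [-0.2, -0.1, -0.4],
    "response B": [-0.1, -0.05],
    "response C": [-0.6, -0.3, -0.2, -0.1],
}

best = max(candidates, key=lambda name: average_logprob(candidates[name]))
print(best)   # "response B", the candidate with the highest average logprob
```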
DeepMind further demonstrates the value of test time compute, arguing that scaling test time compute (e.g., allocating more compute to generate more outputs during inference) can be more efficient than scaling model parameters (Snell et al., 2024). The same paper asks an interesting question: if an LLM is allowed to use a fixed but nontrivial amount of inference-time compute, how much can it improve its performance on a challenging prompt?

In OpenAI’s experiment, sampling more outputs led to better performance, but only up to a certain point. In this experiment, that point was 400 outputs. Beyond this point, performance decreases, as shown in Figure 2-19. They hypothesized that as the number of sampled outputs increases, the chance of finding adversarial outputs that can fool the verifier also increases. However, a Stanford experiment showed a different conclusion. “Monkey Business” (Brown et al., 2024) finds that the number of problems solved often increases log-linearly as the number of samples increases from 1 to 10,000.

While it’s interesting to think about whether test time compute can be scaled indefinitely, I don’t believe anyone in production samples 400 or 10,000 different outputs for each input. The cost would be astronomical.

Figure 2-19. OpenAI (2021) found that sampling more outputs led to better performance, but only up to 400 outputs.

You can also use application-specific heuristics to select the best response. For example, if your application benefits from shorter responses, you can pick the shortest candidate. If your application converts natural language to SQL queries, you can get the model to keep generating outputs until it generates a valid SQL query.

One particularly interesting application of test time compute is to overcome the latency challenge. For some queries, especially chain-of-thought queries, a model might take a long time to complete the response. Kittipat Kampa, head of AI at TIFIN, told me that his team asks their model to generate multiple responses in parallel and shows the user the first response that is completed and valid.

Picking the most common output among a set of outputs can be especially useful for tasks that expect exact answers. For example, given a math problem, the model can solve it multiple times and pick the most frequent answer as its final solution. Similarly, for a multiple-choice question, a model can pick the most frequent output option. This is what Google did when evaluating Gemini on the MMLU benchmark. They sampled 32 outputs for each question. This allowed the model to achieve a higher score than it would’ve achieved with only one output per question.

A model is considered robust if it doesn’t dramatically change its outputs with small variations in the input. The less robust a model is, the more you can benefit from sampling multiple outputs. For one project, we used AI to extract certain information from an image of the product. We found that, for the same image, our model could read the information only half of the time. For the other half, the model said that the image was too blurry or the text was too small to read. However, by trying three times with each image, the model was able to extract the correct information for most images.
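As a minimal sketch of picking the most common output, the snippet below tallies hypothetical sampled answers to a math problem and keeps the most frequent one.

```python
from collections import Counter

sampled_answers = ["42", "42", "41", "42", "40"]   # hypothetical sampled final answers

most_common_answer, count = Counter(sampled_answers).most_common(1)[0]
print(most_common_answer, count)   # "42", chosen by 3 of the 5 samples
```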
Structured Outputs

Often, in production, you need models to generate outputs following certain formats. Structured outputs are crucial for the following two scenarios:

1. Tasks requiring structured outputs. The most common category of tasks in this scenario is semantic parsing. Semantic parsing involves converting natural language into a structured, machine-readable format. Text-to-SQL is an example of semantic parsing, where the outputs must be valid SQL queries. Semantic parsing allows users to interact with APIs using a natural language (e.g., English). For example, text-to-PostgreSQL allows users to query a Postgres database using English queries such as “What’s the average monthly revenue over the last 6 months” instead of writing them in PostgreSQL.

Here is an example of a prompt for GPT-4o to do text-to-regex. The outputs are actual outputs generated by GPT-4o:

System prompt
Given an item, create a regex that represents all the ways the item can be written. Return only the regex.
Example:
US phone number -> \+?1?\s?(\()?(\d{3})(?(1)\))[-.\s]?(\d{3})[-.\s]?(\d{4})

User prompt
Email address ->

GPT-4o
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

User prompt
Dates ->

GPT-4o
(?:\d{1,2}[\/\-\.])(?:\d{1,2}[\/\-\.])?\d{2,4}

Other categories of tasks in this scenario include classification, where the outputs have to be valid classes.

2. Tasks whose outputs are used by downstream applications. In this scenario, the task itself doesn’t need the outputs to be structured, but because the outputs are used by other applications, they need to be parsable by these applications.

For example, if you use an AI model to write an email, the email itself doesn’t have to be structured. However, a downstream application using this email might need it to be in a specific format—for example, a JSON document with specific keys, such as {"title": [TITLE], "body": [EMAIL BODY]}.

This is especially important for agentic workflows where a model’s outputs are often passed as inputs into tools that the model can use, as discussed in Chapter 6.

Frameworks that support structured outputs include guidance, outlines, instructor, and llama.cpp. Each model provider might also use their own techniques to improve their models’ ability to generate structured outputs. OpenAI was the first model provider to introduce JSON mode in their text generation API. Note that an API’s JSON mode typically guarantees only that the outputs are valid JSON—not the content of the JSON objects. The otherwise valid generated JSONs can also be truncated, and thus not parsable, if the generation stops too early, such as when the model reaches its maximum output token limit.
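Because JSON mode doesn’t protect against truncation, a downstream application still needs to check that the output actually parses. Here is a minimal sketch of such a check; what to do after a failure (retry, repair, or fall back) is left to the application.

```python
import json

def try_parse_json(text):
    """Return the parsed object, or None if the model output isn't valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# A well-formed output parses; a truncated one (e.g., cut off by a stopping
# condition or an output token limit) does not, so the caller should retry or repair.
print(try_parse_json('{"title": "Hello", "body": "See you at 3pm."}'))
print(try_parse_json('{"title": "Hello", "body": "See you at'))   # None
```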

Use Quizgecko on...
Browser
Browser