Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2024. All rights reserved. Draft of August 20, 2024.

CHAPTER 10 Large Language Models

"How much do we know at any time? Much more, or so I believe, than we know we know."
Agatha Christie, The Moving Finger

Fluent speakers of a language bring an enormous amount of knowledge to bear during comprehension and production. This knowledge is embodied in many forms, perhaps most obviously in the vocabulary, the rich representations we have of words and their meanings and usage. This makes the vocabulary a useful lens to explore the acquisition of knowledge from text, by both people and machines.

Estimates of the size of adult vocabularies vary widely both within and across languages. For example, estimates of the vocabulary size of young adult speakers of American English range from 30,000 to 100,000 depending on the resources used to make the estimate and the definition of what it means to know a word. What is agreed upon is that the vast majority of words that mature speakers use in their day-to-day interactions are acquired early in life through spoken interactions with caregivers and peers, usually well before the start of formal schooling. This active vocabulary (usually on the order of 2000 words for young speakers) is extremely limited compared to the size of the adult vocabulary, and is quite stable, with very few additional words learned via casual conversation beyond this early stage. Obviously, this leaves a very large number of words to be acquired by other means.

A simple consequence of these facts is that children have to learn about 7 to 10 words a day, every single day, to arrive at observed vocabulary levels by the time they are 20 years of age. And indeed empirical estimates of vocabulary growth in late elementary through high school are consistent with this rate. How do children achieve this rate of vocabulary growth? The bulk of this knowledge acquisition seems to happen as a by-product of reading, as part of the rich processing and reasoning that we perform when we read. Research into the average amount of time children spend reading, and the lexical diversity of the texts they read, indicate that it is possible to achieve the desired rate. But the mechanism behind this rate of learning must be remarkable indeed, since at some points during learning the rate of vocabulary growth exceeds the rate at which new words are appearing to the learner!

Such facts have motivated the distributional hypothesis of Chapter 6, which suggests that aspects of meaning can be learned solely from the texts we encounter over our lives, based on the complex association of words with the words they co-occur with (and with the words that those words occur with). The distributional hypothesis suggests both that we can acquire remarkable amounts of knowledge from text, and that this knowledge can be brought to bear long after its initial acquisition. Of course, grounding from real-world interaction or other modalities can help build even more powerful models, but even text alone is remarkably useful.

In this chapter we formalize this idea of pretraining—learning knowledge about language and the world from vast amounts of text—and call the resulting pretrained language models large language models.
Large language models exhibit remarkable performance on all sorts of natural language tasks because of the knowledge they learn in pretraining, and they will play a role throughout the rest of this book. They have been especially transformative for tasks where we need to produce text, like summarization, machine translation, question answering, or chatbots.

We'll start by seeing how to apply the transformer of Chapter 9 to language modeling, in a setting often called causal or autoregressive language models, in which we iteratively predict words left-to-right from earlier words. We'll first introduce training, seeing how language models are self-trained by iteratively being taught to guess the next word in the text from the prior words. We'll then talk about the process of text generation. The application of LLMs to generate text has vastly broadened the scope of NLP. Text generation, code generation, and image generation together constitute the important new area of generative AI. We'll introduce specific algorithms for generating text from a language model, like greedy decoding and sampling. And we'll see that almost any NLP task can be modeled as word prediction in a large language model, if we think about it in the right way. We'll work through an example of using large language models to solve one classic NLP task of summarization (generating a short text that summarizes some larger document).

10.1 Large Language Models with Transformers

The prior chapter introduced most of the components of a transformer in the domain of language modeling: the transformer block including multi-head attention, the language modeling head, and the positional encoding of the input. In the following sections we'll introduce the remaining aspects of the transformer LLM: sampling and training. Before we do that, we use this section to talk about why and how we apply transformer-based large language models to NLP tasks.

The tasks we will describe are all cases of conditional generation. Conditional generation is the task of generating text conditioned on an input piece of text. That is, we give the LLM an input piece of text, generally called a prompt, and then have the LLM continue generating text token by token, conditioned on the prompt. The fact that transformers have such long contexts (many thousands of tokens) makes them very powerful for conditional generation, because they can look back so far into the prompting text.

Consider the simple task of text completion, illustrated in Fig. 10.1. Here a language model is given a text prefix and is asked to generate a possible completion. Note that as the generation process proceeds, the model has direct access to the priming context as well as to all of its own subsequently generated outputs (at least as much as fits in the large context window). This ability to incorporate the entirety of the earlier context and generated outputs at each time step is the key to the power of large language models built from transformers.

Figure 10.1 Left-to-right (also called autoregressive) text completion with transformer-based large language models. As each token is generated, it gets added onto the context as a prefix for generating the next token.
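To make the prompting setup concrete, here is a minimal sketch of conditional generation from a causal language model. The chapter does not name a particular library or checkpoint; the Hugging Face transformers library and the small GPT-2 model below are assumptions purely for illustration, and the prompt is the prefix shown in Fig. 10.1.

```python
# A minimal sketch of conditional generation (text completion).
# Assumptions for illustration only: the Hugging Face transformers library
# and the small GPT-2 checkpoint; the chapter does not prescribe either.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# The prefix from Fig. 10.1; the model continues it token by token.
prompt = "So long and thanks for all the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,                      # pick the most likely token each step
        pad_token_id=tokenizer.eos_token_id,  # silences a padding warning for GPT-2
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Setting do_sample=False makes generate pick the single most probable token at each step, which corresponds to the greedy decoding strategy discussed later in this section.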
So why should we care about predicting upcoming words or tokens? The insight of large language modeling is that many practical NLP tasks can be cast as word prediction, and that a powerful-enough language model can solve them with a high degree of accuracy. For example, we can cast sentiment analysis as language modeling by giving a language model a context like:

The sentiment of the sentence "I like Jackie Chan" is:

and comparing the following conditional probability of the words "positive" and "negative" to see which is higher:

P(positive|The sentiment of the sentence "I like Jackie Chan" is:)
P(negative|The sentiment of the sentence "I like Jackie Chan" is:)

If the word "positive" is more probable, we say the sentiment of the sentence is positive, otherwise we say the sentiment is negative.

We can also cast more complex tasks as word prediction. Consider question answering, in which the system is given a question (for example a question with a simple factual answer) and must give a textual answer; we introduce this task in detail in Chapter 14. We can cast the task of question answering as word prediction by giving a language model a question and a token like A: suggesting that an answer should come next:

Q: Who wrote the book "The Origin of Species"? A:

If we ask a language model to compute the probability distribution over possible next words given this prefix:

P(w|Q: Who wrote the book "The Origin of Species"? A:)

and look at which words w have high probabilities, we might expect to see that Charles is very likely, and then if we choose Charles and continue and ask

P(w|Q: Who wrote the book "The Origin of Species"? A: Charles)

we might now see that Darwin is the most probable token, and select it.

Conditional generation can even be used to accomplish tasks that must generate longer responses. Consider the task of text summarization, which is to take a long text, such as a full-length article, and produce an effective shorter summary of it. We can cast summarization as language modeling by giving a large language model a text, and follow the text by a token like tl;dr; this token is short for something like 'too long; didn't read' and in recent years people often use this token, especially in informal work emails, when they are going to give a short summary. Since this token is sufficiently frequent in language model training data, language models have seen many texts in which the token occurs before a summary, and hence will interpret the token as instructions to generate a summary. We can then do conditional generation: give the language model this prefix, and then have it generate the following words, one by one, and take the entire response as a summary.
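Returning to the sentiment example above, the next-token probability comparison can be sketched in code as follows; again, the transformers library and GPT-2 are illustrative assumptions rather than anything this chapter specifies.

```python
# Casting sentiment classification as next-token prediction: score the
# candidate words "positive" and "negative" after the prompt and compare.
# Assumptions for illustration only: the transformers library and GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = 'The sentiment of the sentence "I like Jackie Chan" is:'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Use each candidate word's first subword id (note GPT-2's leading space).
for word in [" positive", " negative"]:
    token_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    print(word.strip(), float(next_token_probs[token_id]))
```

If the probability printed for "positive" is higher, we label the sentence positive; the same pattern of scoring candidate continuations applies to the Q:/A: question-answering prompt above.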
Fig. 10.2 shows an example of a text and a human-produced summary from a widely used summarization corpus consisting of CNN and Daily Mail news articles.

Original Article: The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states. "We will not ship snow to any states in the northeast!" says Waring's website, ShipSnowYo.com. "We're in the business of expunging snow!" His website and social media accounts claim to have filled more than 133 orders for snow – more than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a record this winter for the snowiest month in its history. Most residents see the huge piles of snow choking their yards and sidewalks as a nuisance, but Waring saw an opportunity. According to Boston.com, it all started a few weeks ago, when Waring and his wife were shoveling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston. He joked about shipping the stuff to friends and family in warmer states, and an idea was born. [...]

Summary: Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states.

Figure 10.2 Excerpt from a sample article and its summary from the CNN/Daily Mail summarization corpus (Hermann et al., 2015; Nallapati et al., 2016).

If we take this full article and append the token tl;dr, we can use this as the context to prime the generation process to produce a summary as illustrated in Fig. 10.3. Again, what makes transformers able to succeed at this task (as compared, say, to the primitive n-gram language model) is that attention can incorporate information from the large context window, giving the model access to the original article as well as to the newly generated text throughout the process.

Which words do we generate at each step? One simple way to generate words is to always generate the most likely word given the context. Generating the most likely word given the context is called greedy decoding. A greedy algorithm is one that makes a choice that is locally optimal, whether or not it will turn out to have been the best choice with hindsight. Thus in greedy decoding, at each time step in generation, the output ŵ_t is chosen by computing the probability for each possible output (every word in the vocabulary) and then choosing the highest probability word (the argmax):

ŵ_t = argmax_{w ∈ V} P(w | w_{<t})
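The argmax above can also be written as an explicit decoding loop. Below is a rough sketch of greedy decoding, once more assuming the transformers library and GPT-2; in practice one would usually rely on a library's built-in generation utilities rather than hand-rolling the loop.

```python
# A rough sketch of greedy decoding: at each step t, take the argmax of
# P(w | w_<t) and append it to the context before predicting again.
# Assumptions for illustration only: the transformers library and GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = tokenizer("So long and thanks for all the", return_tensors="pt")["input_ids"]

for _ in range(10):                          # generate ten tokens greedily
    with torch.no_grad():
        logits = model(context).logits       # (1, seq_len, vocab_size)
    next_id = torch.argmax(logits[0, -1])    # ŵ_t = argmax_w P(w | w_<t)
    context = torch.cat([context, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(context[0]))
```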
