Introduction to Large Language Models

Document Details


Uploaded by HardyZeugma8683

Yarmouk University

Tags

large language models, natural language processing, artificial intelligence, machine learning

Summary

This document provides an introduction to large language models, covering their architectures (decoders, encoders, and encoder-decoders), their pretraining process, tasks, and evaluation. Key concepts like perplexity are discussed, along with different types of pretraining data and the use of finetuning.

Full Transcript


Introduction to Large Language Models

Language models
- Remember the simple n-gram language model:
  - Assigns probabilities to sequences of words
  - Generates text by sampling possible next words
  - Is trained on counts computed from lots of text
- Large language models are similar and different:
  - Assign probabilities to sequences of words
  - Generate text by sampling possible next words
  - Are trained by learning to guess the next word

Large language models
- Even though they are pretrained only to predict words, they learn a lot of useful language knowledge, since they train on a lot of text.

Three architectures for large language models
What's the best way to pretrain them? The neural architecture influences the type of pretraining.
- Decoders: language models, what we've seen so far; nice to generate from, but can't condition on future words. Examples: GPT, Claude, Llama, Mixtral.
- Encoders: get bidirectional context. Examples: BERT family, HuBERT.
- Encoder-decoders: combine the good parts of decoders and encoders. Examples: Flan-T5, Whisper.

Encoders
- Many varieties! Popular: Masked Language Models (MLMs), e.g. the BERT family
- Trained by predicting words from surrounding words on both sides
- Are usually finetuned (trained on supervised data) for classification tasks

Encoder-decoders
- Trained to map from one sequence to another
- Very popular for:
  - machine translation (map from one language to another)
  - speech recognition (map from acoustics to words)

Large Language Models: What tasks can they do?
Big idea: many tasks can be turned into tasks of predicting words!

This lecture: decoder-only models, also called:
- Causal LLMs
- Autoregressive LLMs
- Left-to-right LLMs
They predict words left to right.

Conditional generation: generating text conditioned on previous text.
[Figure: left-to-right (also called autoregressive) text completion with a transformer-based large language model. The prefix text ("So long and thanks for all ...") is encoded, passed through the stacked transformer blocks, and the language modeling head (unembedding layer U plus a softmax over the logits) predicts the completion; as each token is generated, it gets added onto the context as a prefix for generating the next token.]

Many practical NLP tasks can be cast as word prediction!

Sentiment analysis: "I like Jackie Chan"
1. We give the language model this string:
   The sentiment of the sentence "I like Jackie Chan" is:
2. And see which word it thinks comes next, comparing:
   P(positive | The sentiment of the sentence "I like Jackie Chan" is:)
   P(negative | The sentiment of the sentence "I like Jackie Chan" is:)
If the word "positive" is more probable, we say the sentiment of the sentence is positive; otherwise we say the sentiment is negative.
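As a concrete illustration, here is a minimal sketch of sentiment-as-next-word-prediction. It assumes the Hugging Face transformers library and GPT-2 as a stand-in decoder-only model (the slides do not prescribe any particular toolkit or model); it compares the probabilities the model assigns to "positive" and "negative" as the next word.

```python
# Minimal sketch: sentiment analysis as next-word prediction.
# Assumes the Hugging Face `transformers` library and GPT-2 as a stand-in
# decoder-only (causal) LM; any model with the same API would work.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = 'The sentiment of the sentence "I like Jackie Chan" is:'

def next_word_prob(prompt: str, word: str) -> float:
    """Probability the model assigns to `word` as the next token after `prompt`."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
    next_token_probs = logits[0, -1].softmax(dim=-1)
    # Leading space so the word is tokenized as it would appear after the prompt;
    # using only the first sub-token is an approximation.
    word_id = tokenizer(" " + word).input_ids[0]
    return next_token_probs[word_id].item()

p_pos = next_word_prob(prompt, "positive")
p_neg = next_word_prob(prompt, "negative")
print("positive" if p_pos > p_neg else "negative", p_pos, p_neg)
```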
Framing lots of tasks as conditional generation

We can also cast more complex tasks as word prediction. Consider question answering, in which the system is given a question (for example a question with a simple factual answer) and must give a textual answer, a task we return to in detail in Chapters 14 and 15.

QA: the system is given the question "Who wrote the Origin of Species?" and must give a textual answer. We can cast question answering as word prediction by giving a language model the question and a token like A: suggesting that an answer should come next:

1. We give the language model this string:
   Q: Who wrote the book "The Origin of Species"? A:
2. And see what word it thinks comes next. We ask the language model to compute the probability distribution over possible next words given this prefix,
   P(w | Q: Who wrote the book "The Origin of Species"? A:)
   and look at which words w have high probabilities; we might expect to see that Charles is very likely.
3. And iterate: if we choose Charles and continue and ask, we compute
   P(w | Q: Who wrote the book "The Origin of Species"? A: Charles)
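A minimal sketch of this iterative "pick a likely word, add it to the prefix, and continue" procedure, i.e. greedy decoding. It again assumes Hugging Face transformers with GPT-2 as a stand-in causal LM; the 10-token limit is an arbitrary placeholder, and a real model would typically use model.generate or a sampling strategy rather than pure argmax.

```python
# Minimal sketch of greedy left-to-right decoding: repeatedly pick the most
# probable next token and append it to the context.
# Assumes Hugging Face `transformers` with GPT-2 as a stand-in causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = 'Q: Who wrote the book "The Origin of Species"? A:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(10):                                  # generate up to 10 tokens
    with torch.no_grad():
        logits = model(input_ids).logits             # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()                 # greedy: most probable next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
# In practice one would stop at a newline or end-of-sequence token.
```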
Summarization

LLMs for summarization (using "tl;dr")
[Figure: the original story ("The only ... idea was born.") is given as the prefix, followed by the delimiter "tl;dr"; the model then generates the summary ("Kyle Waring will ...") token by token with the language modeling head.]

Pretraining Large Language Models: Algorithm

Pretraining
The big idea that underlies all the amazing performance of language models:
- First pretrain a transformer model on enormous amounts of text
- Then apply it to new tasks

Self-supervised training algorithm
We just train them to predict the next word!
1. Take a corpus of text
2. At each time step t:
   i. ask the model to predict the next word
   ii. train the model using gradient descent to minimize the error in this prediction
"Self-supervised" because it just uses the next word as the label!

Intuition of language model training: loss
- Same loss function: cross-entropy loss
- We want the model to assign a high probability to the true word w
  = we want the loss to be high if the model assigns too low a probability to w
- CE loss: the negative log probability that the model assigns to the true next word w
- If the model assigns too low a probability to w, we move the model weights in the direction that assigns a higher probability to w

Cross-entropy loss for language modeling
- CE loss: the difference between the correct probability distribution and the predicted distribution
- The correct distribution y_t knows the next word, so it is 1 for the actual next word and 0 for the others
- So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:
  L_CE(t) = -log P(w_{t+1} | w_{1:t})

Teacher forcing
- At each token position t, the model sees the correct tokens w_{1:t} and computes the loss (-log probability) for the next token w_{t+1}
- At the next token position t+1 we ignore what the model predicted for w_{t+1}; instead we take the correct word w_{t+1}, add it to the context, and move on

[Figure: training a transformer language model on "So long and thanks for all ...". At each position the stacked transformer blocks and language modeling head produce logits over the next token, and the per-token losses (-log y_and, -log y_thanks, ...) are combined to train the model.]
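To make the training recipe above concrete, here is a minimal sketch of a single self-supervised training step with teacher forcing and cross-entropy loss. It assumes PyTorch and Hugging Face transformers, with GPT-2 standing in for the model; the one-sentence "corpus", optimizer, and learning rate are placeholder choices, not anything specified in the slides.

```python
# Minimal sketch of self-supervised next-word-prediction training with
# teacher forcing and cross-entropy loss (PyTorch + Hugging Face transformers).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

corpus = ["So long and thanks for all the fish."]    # stand-in pretraining data

for text in corpus:
    input_ids = tokenizer(text, return_tensors="pt").input_ids   # (1, n)
    logits = model(input_ids).logits                              # (1, n, vocab)

    # Teacher forcing: the model always conditions on the *correct* prefix
    # w_1..w_t and is scored on the true next token w_{t+1}.
    pred = logits[:, :-1, :]      # logits at positions 1..n-1 predict tokens 2..n
    target = input_ids[:, 1:]     # the true next tokens w_2..w_n

    # Cross-entropy = -log probability assigned to the true next word,
    # averaged over token positions.
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```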
Pretraining data for Large Language Models

LLMs are mainly trained on the web:
- Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages
- Colossal Clean Crawled Corpus (C4; Raffel et al. 2020): 156 billion tokens of English, filtered
  - What's in it? Mostly patent text documents, Wikipedia, and news sites
- The Pile: a pretraining corpus of academic text, web text, books, and dialog

Filtering for quality and safety
- Quality is subjective
  - Many LLMs attempt to match Wikipedia, books, particular websites
  - Need to remove boilerplate and adult content
  - Deduplication at many levels (URLs, documents, even lines)
- Safety is also subjective
  - Toxicity detection is important, although it has mixed results
  - It can mistakenly flag data written in dialects like African American English

What does a model learn from pretraining?
- "There are canines everywhere! One dog in the front room, and two dogs ..."
- "It wasn't just big it was enormous"
- "The author of 'A Room of One's Own' is Virginia Woolf"
- "The doctor told me that he ..."
- "The square root of 4 is 2"

Big idea
- Text contains enormous amounts of knowledge
- Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much

But there are problems with scraping from the web
- Copyright: much of the text in these datasets is copyrighted
  - It is not clear whether the fair use doctrine in the US allows for this use; this remains an open legal question
- Data consent: website owners can indicate that they don't want their site crawled
- Privacy: websites can contain private IP addresses and phone numbers

Finetuning Large Language Models

Finetuning for adaptation to new domains
- What happens if we need our LLM to work well on a domain it didn't see in pretraining?
- Perhaps some specific medical or legal domain?
- Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?

Finetuning
[Figure: pretraining data is used to produce a pretrained LM; finetuning data is then used to further train it, producing a fine-tuned LM.]

"Finetuning" means 4 different things; we'll discuss 1 here and 3 in later lectures. In all four cases, finetuning means taking a pretrained model and further adapting some or all of its parameters to some new data.

1. Finetuning as "continued pretraining" on new data
- Further train all the parameters of the model on new data, using the same method (word prediction) and loss function (cross-entropy loss) as for pretraining
- As if the new data were at the tail end of the pretraining data
- Hence sometimes called continued pretraining

Evaluating Large Language Models

Perplexity
- Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text
- The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length
- For a test set of n tokens w_{1:n} the perplexity is:
  perplexity_θ(w_{1:n}) = P_θ(w_{1:n})^(-1/n)

How is this probability calculated? (Chain rule of probability)
The chain rule lets the model calculate the probability of a sequence of words step by step:
  P(w_{1:n}) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_{1:2}) · ... · P(w_n | w_{1:n-1})
(Each word's probability depends on the context of the preceding words.)

The problem with probability and text length
When using raw probability to evaluate a language model, the total probability of a sequence decreases as the sequence gets longer. This happens because of the chain rule: the probability of the entire sequence is the product of the probabilities of the individual words (or tokens). Multiplying many small probabilities together results in a very small final value for P(w_{1:n}), especially for longer texts. That is why we need to normalize the probability.

Why perplexity instead of raw probability of the test set?
- Probability depends on the size of the test set: it gets smaller the longer the text
- Better: a metric that is per-word, normalized by length
- By normalizing with the nth root, perplexity evaluates how well the model predicts individual words on average, rather than the entire sequence
- Perplexity is the inverse probability of the test set, normalized by the number of words (the inverse comes from the original definition of perplexity via cross-entropy rate in information theory)
- Probability ranges over [0, 1]; perplexity ranges over [1, ∞)

Perplexity
- The higher the probability of the word sequence, the lower the perplexity; thus the lower the perplexity of a model on the data, the better the model
- Minimizing perplexity is the same as maximizing probability
- Also: perplexity is sensitive to length/tokenization, so it is best used when comparing LMs that use the same tokenizer
(A minimal worked sketch of this calculation appears at the end of this transcript.)

Many other factors that we evaluate, like:
- Size: big models take lots of GPUs and time to train, and memory to store
- Energy usage: can be measured in kWh or kilograms of CO2 emitted
- Fairness: benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups

Harms of Large Language Models
- Hallucination
- Copyright
- Privacy
- Toxicity and abuse
- Misinformation
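As referenced in the perplexity discussion above, here is a minimal sketch of the perplexity calculation for a decoder-only LM. It assumes Hugging Face transformers with GPT-2 and a one-sentence placeholder test text; a real evaluation would average the per-token log probabilities over a full held-out test set.

```python
# Minimal sketch: perplexity of a causal LM on a (tiny, placeholder) test text.
# perplexity = P(w_1..w_n)^(-1/n) = exp(mean over tokens of -log P(w_t | w_<t)).
# Assumes Hugging Face `transformers` with GPT-2 as a stand-in model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

test_text = "So long and thanks for all the fish."
input_ids = tokenizer(test_text, return_tensors="pt").input_ids    # (1, n)

with torch.no_grad():
    logits = model(input_ids).logits                                # (1, n, vocab)

# Chain rule: score each token w_t given its prefix w_1..w_{t-1}
# (the very first token has no prefix and is not scored here).
pred = logits[:, :-1, :]
target = input_ids[:, 1:]
neg_log_probs = F.cross_entropy(
    pred.reshape(-1, pred.size(-1)), target.reshape(-1), reduction="none"
)

# Average negative log probability per token, then exponentiate.
perplexity = torch.exp(neg_log_probs.mean()).item()
print(f"perplexity on the test text: {perplexity:.2f}")
```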
