
LLM_RNN_Transformer.pdf


Transcript


Large Language Models

NLP to LLM
- An LLM is a type of neural network.
- Uses a Transformer architecture (by contrast, CNNs are suited to images, which have a grid structure).
- Builds itself through machine learning.
- Predicts which word should follow the previous ones.
- Produces text that sounds right, but cannot guarantee that it is right (example: ChatGPT).

(Figures: NLP to LLM; NLP vs LLM.)

What is an LLM?
- A new class of natural language processing (NLP) model.
- A deep learning algorithm equipped to summarize, translate, predict, and generate human-sounding text to convey ideas and concepts.
- Trained on massive datasets of text and code using advanced machine learning algorithms to learn the patterns and structures of human language. This allows the model to learn the statistical relationships between words and phrases.
- Common LLMs:
  - GPT-3.5 & GPT-4
  - Bard
  - Llama versions
  - BERT Large (available via Hugging Face)

What is an LLM? (capabilities)
Able to perform a variety of tasks, such as:
- Answering open-ended questions
- Chat (generating text)
- Content summarization
- Execution of near-arbitrary instructions
- Translation
- Content and code generation
- Answering your questions in an informative way

Gen AI vs LLMs
- Not all generative AI tools are built on LLMs, but all LLMs are a form of generative AI.
- LLMs create text-only outputs.
- LLMs are only growing: ChatGPT, Google's Bard, etc.

(Figures: general architecture of GenAI; GPT; ChatGPT neural network layers; SFT – supervised fine-tuning.)

GPT architecture
Source: https://towardsdatascience.com/large-language-models-gpt-1-generative-pre-trained-transformer-7b895f296d3b
(Figure: GPT and BERT–GPT architecture comparison, panels (a)–(c).)

GPT downstream tasks
- Classification
- Textual entailment
- Semantic similarity
- Question answering & multiple-choice answering

BERT vs GPT

| Aspect | BERT | GPT |
| --- | --- | --- |
| Architecture | Designed for bidirectional representation learning. Uses a masked language model objective, predicting missing words in a sentence based on both left and right context. | Designed for generative language modeling. Predicts the next word in a sentence given the preceding context, using a unidirectional autoregressive approach. |
| Pre-training objectives | Pre-trained using a masked language model objective and next sentence prediction. Focuses on capturing bidirectional context and understanding relationships between words in a sentence. | Pre-trained to predict the next word in a sentence, which encourages the model to learn a coherent representation of language and generate contextually relevant sequences. |
| Context understanding | Effective for tasks that require a deep understanding of context and relationships within a sentence, such as text classification, named entity recognition, and question answering. | Strong at generating coherent and contextually relevant text. Often used in creative tasks, dialogue systems, and tasks requiring the generation of natural language sequences. |
| Task types and use cases | Commonly used in tasks like text classification, named entity recognition, sentiment analysis, and question answering. | Applied to tasks such as text generation, dialogue systems, summarization, and creative writing. |
| Fine-tuning vs few-shot learning | Often fine-tuned on specific downstream tasks with labeled data to adapt its pre-trained representations to the task at hand. | Designed to perform few-shot learning, where it can generalize to new tasks with minimal task-specific training data. |
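The contrast in the table can be seen directly with pre-trained checkpoints. Below is a minimal sketch (not part of the original slides) that assumes the Hugging Face transformers library and the public "bert-base-uncased" and "gpt2" checkpoints: BERT fills in a masked word using context on both sides, while GPT-2 autoregressively continues the text.

```python
# A minimal sketch, assuming `transformers` is installed and the public
# checkpoints "bert-base-uncased" and "gpt2" are available.
from transformers import pipeline

# BERT-style: bidirectional masked-word prediction.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The man went to the [MASK] to buy milk.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))

# GPT-style: unidirectional next-word (autoregressive) generation.
generate = pipeline("text-generation", model="gpt2")
print(generate("The man went to the", max_new_tokens=8)[0]["generated_text"])
```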
BERT (Bidirectional Encoder Representations from Transformers)
- Sentiment analysis: can determine how positive or negative a movie's reviews are.
- Question answering: helps chatbots answer your questions.
- Text prediction: predicts your text when writing an email (Gmail).
- Text generation: can write an article about any topic from just a few sentences of input.
- Summarization: can quickly summarize long legal contracts.
- Can differentiate words that have multiple meanings (like "bank") based on the surrounding text (polysemy resolution).
- Used in Google Translate, voice assistants (Alexa, Siri, etc.), chatbots, Google Search, and voice-operated GPS.
- Google Search: RankBrain (ML only) & BERT (NLP).

(Figure: BERT.)

Pre-training in NLP
- Word embeddings are the basis of deep learning for NLP, e.g. king → [-0.5, -0.9, 1.4, …], queen → [-0.6, -0.8, -0.2, …].
- Word embeddings (word2vec, GloVe) are often pre-trained on a text corpus from co-occurrence statistics.
(Figure: inner products between word embeddings and contexts such as "the king wore a crown" / "the queen wore a crown".)

Contextual Representations
- Problem: word embeddings are applied in a context-free manner. "open a bank account" and "on the river bank" both map "bank" to the same vector, e.g. [0.3, 0.2, -0.8, …].
- Solution: train contextual representations on a text corpus, so the two occurrences of "bank" get different vectors, e.g. [0.9, -0.2, 1.6, …] and [-1.9, -0.4, 0.1, …].
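To make the context-free limitation above concrete, here is a tiny numpy sketch (toy vectors, not real word2vec/GloVe values): a static lookup table returns the identical vector for "bank" in both sentences, no matter what surrounds it.

```python
# A minimal numpy sketch of context-free embedding lookup (toy values).
import numpy as np

embedding = {"open": np.array([0.1, 0.4, -0.2]),
             "river": np.array([-0.7, 0.2, 0.5]),
             "bank": np.array([0.3, 0.2, -0.8])}   # one vector per word type

sent1 = ["open", "a", "bank", "account"]
sent2 = ["on", "the", "river", "bank"]

# The lookup ignores the surrounding words, so "bank" gets the same vector
# in both the financial and the river sense.
for sent in (sent1, sent2):
    print(sent, "-> bank =", embedding["bank"])
```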
History of Contextual Representations
- Semi-Supervised Sequence Learning, Google, 2015: train an LSTM language model, then fine-tune it on a classification task. (Figure: LSTM language model over "open a bank", then fine-tuning on "open a very funny movie" → POSITIVE.)
- ELMo: Deep Contextual Word Embeddings, 2017: train separate left-to-right and right-to-left LMs, then apply them as "pre-trained embeddings" inside an existing model architecture. (Figure: forward and backward LSTMs over "open a bank".)
- Improving Language Understanding by Generative Pre-Training, OpenAI, 2018: train a deep (12-layer) Transformer LM, then fine-tune on a classification task. (Figure: Transformer LM pre-training, then fine-tuning.)

Problem with Previous Methods
- Problem: language models only use left context or right context, but language understanding is bidirectional.
- Why are LMs unidirectional?
  - Reason 1: directionality is needed to generate a well-formed probability distribution. (We don't care about this.)
  - Reason 2: words can "see themselves" in a bidirectional encoder.

Unidirectional vs. Bidirectional Models
- Unidirectional context: build the representation incrementally.
- Bidirectional context: words can "see themselves".
(Figure: two-layer unidirectional vs. bidirectional encoders over "open a bank".)

Masked LM
- Solution: mask out k% of the input words, then predict the masked words. We always use k = 15%.
  Example: "the man went to the [MASK] to buy a [MASK] of milk" → store, gallon.
- Too little masking: too expensive to train. Too much masking: not enough context.

Masked LM (masking strategy)
- Problem: the [MASK] token is never seen at fine-tuning.
- Solution: select 15% of the words to predict, but don't replace them with [MASK] 100% of the time. Instead:
  - 80% of the time, replace with [MASK]: "went to the store" → "went to the [MASK]"
  - 10% of the time, replace with a random word: "went to the store" → "went to the running"
  - 10% of the time, keep the word unchanged: "went to the store" → "went to the store"

Next Sentence Prediction
- To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or a random sentence (50% each).
- BERT is trained on both the MLM and NSP objectives at the same time.

Input Representation
- Uses a 30,000-token WordPiece vocabulary on the input.
- Each token is the sum of three embeddings (token, segment, and position).
- A single sequence is much more efficient.

Model Architecture
- Transformer encoder.
- Multi-headed self-attention: models context.
- Feed-forward layers: compute non-linear hierarchical features.
- Layer norm and residuals: make training deep networks healthy.
- Positional embeddings: allow the model to learn relative positioning.

Empirical advantages of Transformer vs. LSTM:
1. Self-attention == no locality bias; long-distance context has "equal opportunity".
2. A single multiplication per layer == efficiency on TPU; the effective batch size is the number of words, not sequences.
(Figure: Transformer vs. LSTM weight application across positions.)

Model Details
- Data: Wikipedia (2.5B words) + BookCorpus (800M words).
- Batch size: 131,072 words (1,024 sequences × 128 length, or 256 sequences × 512 length).
- Training time: 1M steps (~40 epochs).
- Optimizer: AdamW, 1e-4 learning rate, linear decay.
- BERT-Base: 12 layers, 768 hidden units, 12 heads.
- BERT-Large: 24 layers, 1,024 hidden units, 16 heads.
- Trained on a 4×4 or 8×8 TPU slice for 4 days.

(Figure: Fine-Tuning Procedure.)

GLUE Results
The GLUE (General Language Understanding Evaluation) benchmark is a group of resources for training, measuring, and analyzing language models comparatively to one another. These resources consist of nine "difficult" tasks designed to test an NLP model's understanding.
- MultiNLI example — Premise: "Hills and mountains are especially sanctified in Jainism." Hypothesis: "Jainism hates nature." Label: Contradiction.
- CoLA examples — Sentence: "The wagon rumbled down the road." Label: Acceptable. Sentence: "The car honked down the road." Label: Unacceptable.

SQuAD
SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset of around 108k questions that can be answered from a corresponding paragraph of Wikipedia text. BERT's performance on this evaluation was a big achievement, beating previous state-of-the-art models and human-level performance.

SQuAD 1.1
- Only new parameters: a start vector and an end vector.
- Softmax over all positions.
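Below is a minimal numpy sketch of that SQuAD 1.1 span head (toy shapes and random values, not the actual BERT fine-tuning code): the only new parameters are a start vector and an end vector, and a softmax is taken over all token positions.

```python
# A minimal sketch of span prediction with a start vector and an end vector.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

seq_len, hidden = 6, 768
H = np.random.randn(seq_len, hidden)      # contextual token representations from BERT
start_vec = np.random.randn(hidden)       # learned start vector
end_vec = np.random.randn(hidden)         # learned end vector

p_start = softmax(H @ start_vec)          # probability each position starts the answer
p_end = softmax(H @ end_vec)              # probability each position ends the answer
print(p_start.argmax(), p_end.argmax())   # predicted (start, end) span
```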
SQuAD 2.0
- Use token 0 ([CLS]) to emit a logit for "no answer".
- "No answer" directly competes with the answer span.
- The threshold is optimized on the dev set.

SWAG (Situations With Adversarial Generations)
- SWAG is an interesting evaluation in that it tests a model's ability to infer common sense. It does this through a large-scale dataset of 113k multiple-choice questions about common-sense situations.
- Scoring: run each Premise + Ending pair through BERT and produce a logit for each pair on token 0 ([CLS]).

Effect of Pre-training Task
- Masked LM (compared to a left-to-right LM) is very important on some tasks; Next Sentence Prediction is important on other tasks.
- A left-to-right model does very poorly on a word-level task (SQuAD), although this is mitigated by a BiLSTM.

Effect of Directionality and Training Time
- The masked LM takes slightly longer to converge because we only predict 15% of tokens instead of 100%.
- But absolute results are much better almost immediately.

Effect of Model Size
- Big models help a lot.
- Going from 110M → 340M parameters helps even on datasets with only 3,600 labeled examples.
- Improvements have not asymptoted.

Effect of Masking Strategy
- Masking 100% of the time hurts the feature-based approach.
- Using a random word 100% of the time hurts slightly.

Multilingual BERT
A single model trained on 104 languages from Wikipedia, with a shared 110k WordPiece vocabulary.

| System | English | Chinese | Spanish |
| --- | --- | --- | --- |
| XNLI Baseline – Translate Train | 73.7 | 67.0 | 68.8 |
| XNLI Baseline – Translate Test | 73.7 | 68.4 | 70.7 |
| BERT – Translate Train | 81.9 | 76.6 | 77.8 |
| BERT – Translate Test | 81.9 | 70.1 | 74.9 |
| BERT – Zero Shot | 81.9 | 63.8 | 74.3 |

- XNLI is MultiNLI translated into multiple languages. Always evaluate on the human-translated test set.
- Translate Train: machine-translate the English training set into the foreign language, then fine-tune.
- Translate Test: machine-translate the foreign test set into English and use the English model.
- Zero Shot: use the foreign test set on the English model.

Synthetic Training Data
1. Use a seq2seq model to generate positive questions from context + answer.
2. Heuristically transform positive questions into negatives (i.e., "no answer"/impossible).
Result: +3.0 F1/EM, a new state of the art.

The procedure in more detail:
1. Pre-train a seq2seq model on Wikipedia: the encoder is trained with BERT, and the decoder is trained to decode the next sentence.
2. Fine-tune the model on SQuAD, Context + Answer → Question:
   "Ceratosaurus was a theropod dinosaur in the Late Jurassic, around 150 million years ago." → "When did the Ceratosaurus live?"
3. Train a model to predict answer spans without questions:
   "Ceratosaurus was a theropod dinosaur in the Late Jurassic, around 150 million years ago." → {150 million years ago, 150 million, theropod dinosaur, Late Jurassic, in the Late Jurassic}
4. Generate answer spans from a lot of Wikipedia paragraphs using the model from (3).
5. Use the output of (4) as input to the seq2seq model from (2) to generate synthetic questions:
   "Roxy Ann Peak is a 3,576-foot-tall mountain in the Western Cascade Range in the U.S. state of Oregon." → "What state is Roxy Ann Peak in?"
6. Filter with a baseline SQuAD 2.0 system to throw out bad questions:
   "Roxy Ann Peak is a 3,576-foot-tall mountain in the Western Cascade Range in the U.S. state of Oregon." → "What state is Roxy Ann Peak in?" (Good)
   "Roxy Ann Peak is a 3,576-foot-tall mountain in the Western Cascade Range in the U.S. state of Oregon." → "Where is Oregon?" (Bad)
7. Heuristically generate "strong negatives":
   a. Positive questions from other paragraphs of the same document: "What state is Roxy Ann Peak in?" → "When was Roxy Ann Peak first summited?"
   b. Replace a span of text with another span of the same type (based on POS tags); the replacement is usually from the same paragraph: "What state is Roxy Ann Peak in?" → "What state is Oregon in?", "What mountain is Roxy Ann Peak in?"
8. Optionally: two-pass training, where no-answer is modeled as a regression in the second pass (~+0.5 F1).

Using separate left-to-right and right-to-left LMs (as in ELMo) instead of a masked LM:
- Advantage: slightly faster training time.
- Disadvantages:
  - Would need to add a non-pre-trained bidirectional model on top.
  - The right-to-left SQuAD model doesn't see the question.
  - Need to train two models.
  - Off-by-one: the LTR model predicts the next word, the RTL model predicts the previous word.
  - Not trivial to add arbitrary pre-training tasks.

Good results from pre-training are >1,000x to 100,000x more expensive than supervised training, e.g. a 10x–100x bigger model trained for 100x–1,000x as many steps.
- Imagine it's 2013: a well-tuned 2-layer, 512-dim LSTM sentiment-analysis model gets 80% accuracy after training for 8 hours. Pre-train an LM with the same architecture for a week and get 80.5%.
- Conference reviewers: "Who would do something so expensive for such a small gain?"

The model must be learning more than "contextual embeddings".
- Alternate interpretation: predicting missing words (or next words) requires learning many types of language-understanding features: syntax, semantics, pragmatics, coreference, etc.
- Implication: the pre-trained model is much bigger than it needs to be to solve a specific task; task-specific model distillation works very well.

Examples of NLP problems that are not "solved":
- Models that minimize total training cost vs. accuracy on modern hardware.
- Models that are very parameter-efficient (e.g., for mobile deployment).
- Models that represent knowledge/context in latent space.
- Models that represent structured data (e.g., knowledge graphs).
- Models that jointly represent vision and language.

(Slide: Claude.)
Further reading: https://research.aimultiple.com/large-language-models/

Deep Learning

Recurrent Neural Network
- CNNs: mainly for images. RNNs: for natural language processing and time-series data.
- An RNN is a sequence model.
- Use cases: Gmail sentence autocompletion, Google Translate, named entity recognition (identifying persons, places, companies, etc.), sentiment analysis.

Applications of RNN
An RNN learns past patterns in the data and can use this knowledge to forecast the future. RNNs can analyze time-series data such as the hourly temperature in your city, your home's daily power consumption, the number of daily active users on your website, the trajectories of nearby cars, and more, as well as NLP applications such as automatic translation or speech-to-text.
Recurrent Neural Network (RNN)
A recurrent neural network, or RNN, is a deep neural network trained on sequential or time-series data to create a machine learning model that can make sequential predictions or conclusions based on sequential inputs. The most important component of an RNN is the hidden state, which remembers specific information about a sequence.

The simplest RNN is composed of one neuron receiving inputs, producing an output, and sending that output back to itself. At each time step t (also called a frame), this recurrent neuron receives the input x(t) as well as its own output from the previous time step, ŷ(t–1). Since there is no previous output at the first time step, it is generally set to 0.
(Figure: a recurrent neuron (left) unrolled through time (right).)

RNN equations
- Output of a recurrent layer for a single instance: ŷ(t) = φ(Wxᵀ x(t) + Wŷᵀ ŷ(t–1) + b)
- Outputs of a layer of recurrent neurons for all instances in a pass: Ŷ(t) = φ(X(t) Wx + Ŷ(t–1) Wŷ + b)

RNN unique properties
- Internal memory: a part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell).
- Sequential data processing.
- Contextual understanding.
- Dynamic processing.

(Figures: RNN architecture; a single unit rolled out through time.)

RNN architecture – single unit
- Input layer, hidden layer, activation function, output layer.
- Recurrent connection within the hidden layer: this connection allows the network to pass the hidden-state information (the network's memory) to the next time step.
(Figure: a simple diagram of a recurrent layer.)

Input and output sequences in an RNN
- Sequence-to-sequence: an RNN can simultaneously take a sequence of inputs and produce a sequence of outputs. Useful for forecasting time series, e.g. a home's daily power consumption: feed it the data over the last N days and train it to output the power consumption shifted by one day into the future.
- Sequence-to-vector: feed the network a sequence of inputs and ignore all outputs except the last one. E.g., feed it a sequence of words corresponding to a movie review, and the network outputs a sentiment score.
- Vector-to-sequence: feed the network the same input vector over and over again at each time step and let it output a sequence. For example, the input could be an image (or the output of a CNN), and the output could be a caption for that image.
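Here is a minimal numpy sketch (toy dimensions, random weights) of the single-instance recurrence above, ŷ(t) = φ(Wxᵀ x(t) + Wŷᵀ ŷ(t–1) + b), unrolled over a short sequence with tanh as φ.

```python
# A minimal sketch of one recurrent layer unrolled through time.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons, n_steps = 3, 5, 4

Wx = rng.normal(size=(n_inputs, n_neurons))   # input-to-hidden weights
Wy = rng.normal(size=(n_neurons, n_neurons))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_neurons)

X = rng.normal(size=(n_steps, n_inputs))      # one instance: n_steps input frames
y = np.zeros(n_neurons)                       # no previous output at the first step

for t in range(n_steps):
    y = np.tanh(X[t] @ Wx + y @ Wy + b)       # the same weights are reused at every step
    print(f"step {t}:", np.round(y, 2))
```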
Encoder–decoder network
A sequence-to-vector network (the encoder) followed by a vector-to-sequence network (the decoder). For example, this is used for translating a sentence from one language to another: feed the network a sentence in one language, the encoder converts the sentence into a single vector representation, and the decoder decodes this vector into a sentence in another language.

Types of RNN
- One-to-one: there is only one input/output pair; a one-to-one architecture is used in traditional neural networks.
- One-to-many: a single input may result in numerous outputs. One-to-many networks are used in music generation, for example.
- Many-to-one: a single output is produced by combining many inputs from distinct time steps. Sentiment analysis and emotion identification use such networks, where the class label is determined by a sequence of words.
- Many-to-many: there are numerous options, e.g. two inputs yielding three outputs. Machine translation systems, such as English-to-French (or vice versa), use many-to-many networks.
(Figure: types of RNN.)

Training an RNN
To train an RNN, the trick is to unroll it through time and then use regular backpropagation. This strategy is called backpropagation through time (BPTT).
(Figure: backpropagation through time.)

Deep RNN
It is quite common to stack multiple layers of cells; this gives you a deep RNN.
(Figure: a deep RNN (left) unrolled through time (right).)

Working with text vs. working with images

| Working with text | Working with images |
| --- | --- |
| Text data is composed of discrete chunks (either characters or words). | Pixels in an image are points in a continuous color spectrum. |
| With discrete text data, we can't obviously apply backpropagation: we cannot change "cat" into "dog". | We can easily apply backpropagation to image data: we can change a green pixel toward blue. |
| Difficult to calculate the gradient of our loss function. | Can calculate the gradient of our loss function. |
| Has a time dimension. | Has two spatial dimensions, but no time dimension. |
| The order of words is highly important; text wouldn't make sense in reverse. | Images can usually be flipped without affecting the content. |
| Highly sensitive to small changes in the individual units (words). | Generally less sensitive to changes in individual pixel values. |
| Has a rules-based grammatical structure. | Doesn't follow set rules about how pixel values should be assigned. |

Steps before training an RNN: tokenization
The first step is to clean up and tokenize the text. Tokenization is the process of splitting the text up into individual units, such as words or characters.
- Word tokens:
  - All text can be converted to lowercase.
  - The text vocabulary (the set of distinct words in the training set) may be very large, with some words appearing very sparsely or perhaps only once; it is wise to replace sparse words with an "unknown word" token.
  - Words can be stemmed, i.e. reduced to their simplest form. For example, browse, browsing, browses, and browsed would all be stemmed to brows.
  - Punctuation can either be tokenized or removed altogether.
- Character tokens:
  - Can generate sequences of characters that form new words outside of the training vocabulary.
  - Capital letters can be converted to lowercase.
  - The vocabulary is usually much smaller when using character tokenization. Benefits: speeds up model training; fewer weights to learn in the final output layer.
- Training set: the training set is created using the tokens generated for the vocabulary (the input).
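A minimal sketch of the word-level cleanup steps above in plain Python (lowercase, split, drop punctuation, replace rare words with an unknown token); a real pipeline might use a library tokenizer or Keras TextVectorization instead.

```python
# A minimal word-level tokenization sketch (toy corpus, illustrative threshold).
import re
from collections import Counter

corpus = ["The man went to the store.", "The man browsed the store shelves."]

def tokenize(text):
    text = text.lower()                      # convert to lowercase
    return re.findall(r"[a-z']+", text)      # keep word tokens, drop punctuation

tokens = [tokenize(t) for t in corpus]
counts = Counter(tok for sent in tokens for tok in sent)

min_count = 2                                # words seen fewer times become <unk>
vocab = {tok for tok, c in counts.items() if c >= min_count}

cleaned = [[tok if tok in vocab else "<unk>" for tok in sent] for sent in tokens]
print(cleaned)
```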
Difficulties RNNs face
- Unstable gradients, which can be alleviated using various techniques, including recurrent dropout and recurrent layer normalization.
- A (very) limited short-term memory, which can be extended using LSTM and GRU cells.

Variants of RNN: long short-term memory (LSTM) and gated recurrent units (GRUs).

LSTM
In short, an LSTM cell can learn to recognize an important input (that's the role of the input gate), store it in the long-term state, preserve it for as long as it is needed (that's the role of the forget gate), and extract it whenever it is needed. These are the reasons why LSTM cells have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more.

An LSTM cell architecture
The LSTM cell looks exactly like a regular cell, except that its state is split into two vectors: h(t) and c(t), where h(t) is the short-term state and c(t) is the long-term state. The idea behind the network is that it can learn what to store in the long-term state, what to throw away, and what to read from it.
(Figure: an LSTM cell.)

As the long-term state c(t–1) traverses the network from left to right, it first goes through a forget gate, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an input gate). The result, c(t), is sent straight out, without any further transformation. So, at each time step, some memories are dropped and some memories are added. Moreover, after the addition operation, the long-term state is copied and passed through the tanh function, and the result is then filtered by the output gate. This produces the short-term state h(t), which is equal to the cell's output for this time step, y(t).
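In practice the two state vectors are visible directly from a framework LSTM layer. Below is a minimal Keras sketch (assumes TensorFlow is installed; toy shapes only) that returns the per-step outputs plus the final short-term state h and long-term state c described above.

```python
# A minimal sketch of inspecting an LSTM's two state vectors with Keras.
import numpy as np
import tensorflow as tf

batch, timesteps, features, units = 2, 5, 3, 4
x = np.random.rand(batch, timesteps, features).astype("float32")

lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
outputs, final_h, final_c = lstm(x)

print(outputs.shape)  # (2, 5, 4): h(t) at every time step (also the cell output y(t))
print(final_h.shape)  # (2, 4): short-term state h at the last step
print(final_c.shape)  # (2, 4): long-term state c at the last step
```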
How the LSTM cell works
First, the current input vector x(t) and the previous short-term state h(t–1) are fed to four different fully connected layers. They all serve a different purpose:
- The main layer is the one that outputs g(t). It has the usual role of analyzing the current inputs x(t) and the previous (short-term) state h(t–1). In a basic cell, there is nothing other than this layer, and its output goes straight out to y(t) and h(t). But in an LSTM cell, this layer's output does not go straight out; instead, its most important parts are stored in the long-term state (and the rest is dropped).
- The three other layers are gate controllers. Since they use the logistic activation function, their outputs range from 0 to 1. The gate controllers' outputs are fed to element-wise multiplication operations: if they output 0s they close the gate, and if they output 1s they open it. Specifically:
  - The forget gate (controlled by f(t)) controls which parts of the long-term state should be erased.
  - The input gate (controlled by i(t)) controls which parts of g(t) should be added to the long-term state.
  - The output gate (controlled by o(t)) controls which parts of the long-term state should be read and output at this time step, both to h(t) and to y(t).

LSTM cell state
An LSTM cell maintains a cell state, Ct, which can be thought of as the cell's internal beliefs about the current status of the sequence. This is distinct from the hidden state, ht, which is ultimately output by the cell after the final time step. The cell state has the same length as the hidden state (the number of units in the cell).

LSTM computations
The hidden state is updated in six steps:
1. The hidden state of the previous time step, ht-1, and the current word embedding, xt, are concatenated and passed through the forget gate. This gate is simply a dense layer with weights matrix Wf, bias bf, and a sigmoid activation function. The resulting vector, ft, has length equal to the number of units in the cell and contains values between 0 and 1 that determine how much of the previous cell state, Ct-1, should be retained.
2. The concatenated vector is also passed through an input gate that, like the forget gate, is a dense layer with weights matrix Wi, bias bi, and a sigmoid activation function. The output from this gate, it, has length equal to the number of units in the cell and contains values between 0 and 1 that determine how much new information will be added to the previous cell state, Ct-1.
3. The concatenated vector is passed through a dense layer with weights matrix WC, bias bC, and a tanh activation function to generate a vector C̃t that contains the new information the cell wants to consider keeping. It also has length equal to the number of units in the cell and contains values between –1 and 1.
4. ft and Ct-1 are multiplied element-wise and added to the element-wise multiplication of it and C̃t. This represents forgetting parts of the previous cell state and then adding new relevant information to produce the updated cell state, Ct.
5. The concatenated vector is passed through an output gate: a dense layer with weights matrix Wo, bias bo, and a sigmoid activation. The resulting vector, ot, has length equal to the number of units in the cell and stores values between 0 and 1 that determine how much of the updated cell state, Ct, to output from the cell.
6. ot is multiplied element-wise with the updated cell state, Ct, after a tanh activation has been applied, to produce the new hidden state, ht.
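Here is a minimal numpy sketch of those six steps for a single time step (toy dimensions, random weights), mapping each line of code to the step it implements.

```python
# A minimal sketch of one LSTM update (steps 1-6 above).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_embed, n_units = 3, 4

x_t = rng.normal(size=n_embed)        # current word embedding
h_prev = np.zeros(n_units)            # previous hidden state h(t-1)
C_prev = np.zeros(n_units)            # previous cell state C(t-1)
concat = np.concatenate([h_prev, x_t])

# one weight matrix and bias per dense layer (forget, input, candidate, output)
Wf, Wi, Wc, Wo = (rng.normal(size=(n_units + n_embed, n_units)) for _ in range(4))
bf, bi, bc, bo = (np.zeros(n_units) for _ in range(4))

f_t = sigmoid(concat @ Wf + bf)       # 1. forget gate
i_t = sigmoid(concat @ Wi + bi)       # 2. input gate
C_tilde = np.tanh(concat @ Wc + bc)   # 3. candidate new information
C_t = f_t * C_prev + i_t * C_tilde    # 4. updated cell state
o_t = sigmoid(concat @ Wo + bo)       # 5. output gate
h_t = o_t * np.tanh(C_t)              # 6. new hidden state
print(np.round(h_t, 3))
```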
LSTM computations (compact form)
i(t) = σ(Wxiᵀ x(t) + Whiᵀ h(t–1) + bi)
f(t) = σ(Wxfᵀ x(t) + Whfᵀ h(t–1) + bf)
o(t) = σ(Wxoᵀ x(t) + Whoᵀ h(t–1) + bo)
g(t) = tanh(Wxgᵀ x(t) + Whgᵀ h(t–1) + bg)
c(t) = f(t) ⊗ c(t–1) + i(t) ⊗ g(t)
y(t) = h(t) = o(t) ⊗ tanh(c(t))

In the above equations:
- Wxi, Wxf, Wxo, and Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).
- Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the previous short-term state h(t–1).
- bi, bf, bo, and bg are the bias terms for each of the four layers. Note that TensorFlow initializes bf to a vector full of 1s instead of 0s; this prevents forgetting everything at the beginning of training.

GRU
Another commonly used type of RNN layer is the gated recurrent unit (GRU). The key differences from the LSTM unit are as follows:
1. The forget and input gates are replaced by reset and update gates.
2. There is no cell state or output gate, only a hidden state that is output from the cell.

GRU computations
The hidden state is updated in four steps:
1. The hidden state of the previous time step, ht-1, and the current word embedding, xt, are concatenated and used to create the reset gate. This gate is a dense layer with weights matrix Wr and a sigmoid activation function. The resulting vector, rt, has length equal to the number of units in the cell and stores values between 0 and 1 that determine how much of the previous hidden state, ht-1, should be carried forward into the calculation of the new beliefs of the cell.
2. The reset gate is applied to the hidden state, ht-1, and the result is concatenated with the current word embedding, xt. This vector is then fed to a dense layer with weights matrix W and a tanh activation function to generate a vector, h̃t, that stores the new beliefs of the cell. It has length equal to the number of units in the cell and stores values between –1 and 1.
3. The concatenation of the hidden state of the previous time step, ht-1, and the current word embedding, xt, is also used to create the update gate. This gate is a dense layer with weights matrix Wz and a sigmoid activation. The resulting vector, zt, has length equal to the number of units in the cell and stores values between 0 and 1, which are used to determine how much of the new beliefs, h̃t, to blend into the current hidden state, ht-1.
4. The new beliefs of the cell, h̃t, and the current hidden state, ht-1, are blended in a proportion determined by the update gate, zt, to produce the updated hidden state, ht, which is output from the cell.
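Below is a minimal numpy sketch of those four GRU steps for a single time step (toy dimensions, random weights). Note that references differ on whether z(t) weights the new beliefs or the old state; the line for step 4 follows the description above, where z(t) weights the new beliefs h̃t.

```python
# A minimal sketch of one GRU update (steps 1-4 above).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n_embed, n_units = 3, 4

x_t = rng.normal(size=n_embed)                       # current word embedding
h_prev = np.zeros(n_units)                           # previous hidden state h(t-1)

Wr = rng.normal(size=(n_units + n_embed, n_units))   # reset-gate weights
Wz = rng.normal(size=(n_units + n_embed, n_units))   # update-gate weights
W = rng.normal(size=(n_units + n_embed, n_units))    # "new beliefs" weights

concat = np.concatenate([h_prev, x_t])
r_t = sigmoid(concat @ Wr)                                   # 1. reset gate
h_tilde = np.tanh(np.concatenate([r_t * h_prev, x_t]) @ W)   # 2. new beliefs
z_t = sigmoid(concat @ Wz)                                   # 3. update gate
h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                   # 4. blend old state and new beliefs
print(np.round(h_t, 3))
```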
Gated recurrent unit (GRU)
The GRU cell is a simplified variant of the LSTM cell. The main simplifications are:
- Both state vectors are merged into a single vector h(t).
- A single gate controller z(t) controls both the forget gate and the input gate. If the gate controller outputs a 1, the forget gate is open (= 1) and the input gate is closed (1 – 1 = 0). If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant of the LSTM cell in and of itself.
- There is no output gate; the full state vector is output at every time step. However, there is a new gate controller r(t) that controls which part of the previous state will be shown to the main layer (g(t)).
(Figure: GRU computations.)

Generative Models
Autoencoders, Generative Adversarial Networks (GANs), and Diffusion Models. The above models are all unsupervised; they learn from latent representations.

Generative Adversarial Network (GAN)
GANs are composed of two neural networks:
- A generator that tries to generate data that looks similar to the training data.
- A discriminator that tries to tell real data from fake data.
This architecture is very original in deep learning in that the generator and the discriminator compete against each other during training. The generator is often compared to a criminal trying to make realistic counterfeit money, while the discriminator is like the police investigator trying to tell real money from fake. Adversarial training (training competing neural networks) is widely considered one of the most important innovations of the 2010s.

Generator: takes a random distribution as input (typically Gaussian) and outputs some data—typically, an image. You can think of the random inputs as the latent representations (i.e., codings) of the image to be generated. The generator therefore offers the same functionality as a decoder in a variational autoencoder, and it can be used in the same way to generate new images: just feed it some Gaussian noise, and it outputs a brand-new image.

Discriminator: takes either a fake image from the generator or a real image from the training set as input, and must guess whether the input image is fake or real.
(Figures: GAN; training the DCGAN—gray boxes indicate that the weights are frozen during training.)

Training a GAN
During training, the generator and the discriminator have opposite goals: the discriminator tries to tell fake images from real images, while the generator tries to produce images that look real enough to trick the discriminator. Because the GAN is composed of two networks with different objectives, it cannot be trained like a regular neural network. Each training iteration is divided into two phases:
- In the first phase, we train the discriminator. A batch of real images is sampled from the training set and completed with an equal number of fake images produced by the generator. The labels are set to 0 for fake images and 1 for real images, and the discriminator is trained on this labeled batch for one step, using the binary cross-entropy loss. Importantly, backpropagation only optimizes the weights of the discriminator during this phase.
- In the second phase, we train the generator. We first use it to produce another batch of fake images, and once again the discriminator is used to tell whether the images are fake or real. This time we do not add real images to the batch, and all the labels are set to 1 (real): in other words, we want the generator to produce images that the discriminator will (wrongly) believe to be real. Crucially, the weights of the discriminator are frozen during this step, so backpropagation only affects the weights of the generator.
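A minimal Keras sketch of that two-phase loop (assumes TensorFlow; tiny toy networks and random stand-in "images", not a full DCGAN). The discriminator is compiled first so it still trains in phase 1, and its weights are frozen inside the combined model so phase 2 only updates the generator.

```python
# A minimal two-phase GAN training loop sketch.
import numpy as np
from tensorflow import keras

codings_size, image_size = 8, 16            # toy latent and image dimensions

generator = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=[codings_size]),
    keras.layers.Dense(image_size, activation="sigmoid"),
])
discriminator = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=[image_size]),
    keras.layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(loss="binary_crossentropy", optimizer="rmsprop")

# Combined model used only to train the generator: freeze the discriminator.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(loss="binary_crossentropy", optimizer="rmsprop")

real_images = np.random.rand(256, image_size)   # stand-in for a real dataset
batch_size = 32

for step in range(10):
    # Phase 1: train the discriminator on half fake (label 0), half real (label 1).
    noise = np.random.normal(size=(batch_size, codings_size))
    fake = generator.predict(noise, verbose=0)
    real = real_images[np.random.randint(0, len(real_images), batch_size)]
    X = np.concatenate([fake, real])
    y1 = np.concatenate([np.zeros(batch_size), np.ones(batch_size)])
    discriminator.train_on_batch(X, y1)

    # Phase 2: train the generator through the frozen discriminator, labels all 1.
    noise = np.random.normal(size=(batch_size, codings_size))
    gan.train_on_batch(noise, np.ones(batch_size))
```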
Transformers
- Rely on the attention mechanism.
- Downside of RNNs: they are challenging to parallelize, as they must process sequences one token at a time.
- Transformers are highly parallelizable, allowing them to be trained on massive datasets.

Generative Pre-trained Transformer (GPT)
- A type of autoregressive model.
- Current state of the art for text generation; it powers OpenAI's GPT-4 model.

GPT
In GPT, a transformer architecture is first trained on a huge amount of text data to predict the next word in a sequence, and then subsequently fine-tuned for specific downstream tasks. The pre-training process of the original GPT involved training the model on a large corpus of text called BookCorpus (4.5 GB of text from 7,000 unpublished books of different genres). During pre-training, the model is trained to predict the next word in a sequence given the previous words. This process is known as language modeling and is used to teach the model to understand the structure and patterns of natural language. After pre-training, the GPT model can be fine-tuned for a specific task by providing it with a smaller, task-specific dataset. Fine-tuning involves adjusting the parameters of the model to better fit the task at hand. The GPT architecture has since been improved and extended by OpenAI with the release of subsequent models such as GPT-2, GPT-3, GPT-3.5, and GPT-4.

Attention Mechanism
The first step in understanding how GPT works is to understand how the attention mechanism works. This is what makes the Transformer architecture unique and distinct from recurrent approaches to language modeling. What is an attention mechanism? Paying attention to certain words in a sentence and largely ignoring others. An attention mechanism (also known as an attention head) in a Transformer is designed to do exactly this.
Example: "The pink elephant tried to get into the car but it was too …" (big?).
- If it were a sloth rather than an elephant, the next word would be "slow" rather than "big".
- If it were a swimming pool rather than a car, the next word would be "scared" rather than "big".

Recurrent layers vs. attention heads
- Recurrent layers: a weakness of this approach is that many of the words already incorporated into the hidden vector will not be directly relevant to the immediate task at hand.
- Attention heads: do not suffer from this problem, because they can pick and choose how to combine information from nearby words, depending on the context.

(Figures: model training; queries, keys, and values.)

Queries, Keys, and Values
A query ("What word follows too?") is made against a key/value store (the other words in the sentence), and the resulting output is a sum of the values, weighted by the resonance between the query and each key.
Attention equation: Attention(Q, K, V) = softmax(QKᵀ / √dk) V
Attention mechanism in detail
- The query (Q) can be thought of as a representation of the current task at hand (e.g., "What word follows too?"). In this example, it is derived from the embedding of the word "too" by passing it through a weights matrix WQ to change the dimensionality of the vector from de to dk.
- The key vectors (K) are representations of each word in the sentence—you can think of these as descriptions of the kinds of prediction tasks that each word can help with. They are derived in a similar fashion to the query, by passing each embedding through a weights matrix WK to change the dimensionality of each vector from de to dk. Notice that the keys and the query are the same length (dk).
- Inside the attention head, each key is compared to the query using a dot product between each pair of vectors (QKᵀ). This is why the keys and the query have to be the same length. The higher this number is for a particular key/query pair, the more the key resonates with the query, so it is allowed to make more of a contribution to the output of the attention head.
- The resulting vector is scaled by √dk to keep the variance of the vector sum stable (approximately equal to 1), and a softmax is applied to ensure the contributions sum to 1. This is the vector of attention weights.
- The value vectors (V) are also representations of the words in the sentence—you can think of these as the unweighted contributions of each word. They are derived by passing each embedding through a weights matrix WV to change the dimensionality of each vector from de to dv. Notice that the value vectors do not necessarily have to have the same length as the keys and the query (but they often do, for simplicity).

Multihead Attention
We can build a multihead attention layer that concatenates the output from multiple attention heads, allowing each to learn a distinct attention mechanism so that the layer as a whole can learn more complex relationships. The concatenated outputs are passed through one final weights matrix WO to project the vector into the desired output dimension, which in our case is the same as the input dimension of the query (de), so that the layers can be stacked sequentially on top of each other.

Causal Masking
For efficiency during training, we would like the attention layer to be able to operate on every word in the input at once, predicting for each what the subsequent word will be. In other words, we want our GPT model to be able to handle a group of query vectors in parallel (i.e., a matrix). In doing so, we need one extra step: we must apply a mask to the query/key dot product, to avoid information from future words leaking through. This is known as causal masking.
(Figure: matrix calculation of the attention scores for a batch of input queries, using a causal attention mask to hide keys that are not available to the query because they come later in the sentence.)
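Here is a minimal numpy sketch of causally masked scaled dot-product attention for one sequence (toy dimensions, random projection matrices): the mask adds −∞ to every query/key score where the key comes later in the sentence, so those positions receive zero attention weight after the softmax.

```python
# A minimal sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(dk) + mask) V.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
seq_len, d_e, d_k, d_v = 5, 8, 4, 4

X = rng.normal(size=(seq_len, d_e))          # token embeddings
WQ = rng.normal(size=(d_e, d_k))             # query projection
WK = rng.normal(size=(d_e, d_k))             # key projection
WV = rng.normal(size=(d_e, d_v))             # value projection

Q, K, V = X @ WQ, X @ WK, X @ WV
scores = Q @ K.T / np.sqrt(d_k)              # query/key dot products, scaled

# Causal mask: position i may only attend to positions <= i (no future keys).
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = softmax(scores + mask)             # attention weights, each row sums to 1
output = weights @ V                         # weighted sum of the values
print(np.round(weights, 2))
```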
Causal masking is only required in decoder Transformers such as GPT, where the task is to sequentially generate tokens given previous tokens, so masking out future tokens during training is essential. Other flavors of Transformer (e.g., encoder Transformers) do not need causal masking, because they are not trained to predict the next token. For example, Google's BERT predicts masked words within a given sentence, so it can use context from both before and after the word in question.

Learnable parameters
Three densely connected weights matrices per attention head (WQ, WK, WV) and one further weights matrix to reshape the output (WO). There are no convolutions or recurrent mechanisms at all in a multihead attention layer.

Transformer Block
A Transformer block is a single component within a Transformer that applies skip connections, feed-forward (dense) layers, and normalization around the multihead attention layer.
- Firstly, the query is passed around the multihead attention layer and added to its output—this is a skip connection, which is common in modern deep learning architectures. The skip connection provides a gradient-free highway that allows the network to transfer information forward uninterrupted.
- Secondly, layer normalization is used in the Transformer block to provide stability to the training process. In a batch normalization layer, the output from each channel is normalized to have a mean of 0 and a standard deviation of 1, with the normalization statistics calculated across the batch and spatial dimensions. In contrast, layer normalization in a Transformer block normalizes each position of each sequence in the batch by calculating the normalization statistics across the channels. It is the complete opposite of batch normalization in terms of how the normalization statistics are calculated.
- Lastly, a set of feed-forward (i.e., densely connected) layers is included in the Transformer block, to allow the component to extract higher-level features as we go deeper into the network.
(Figure: layer normalization versus batch normalization—the normalization statistics are calculated across the red cells.)

Positional Encoding
In a multihead attention layer, the dot product between each key and the query is calculated in parallel, not sequentially as in a recurrent neural network. This is a strength (because of the parallelization efficiency gains) but also a problem, because we need the attention layer to be able to predict different outputs for the following two sentences:
- The dog looked at the boy and … (barked?)
- The boy looked at the dog and … (smiled?)
To solve this problem, we use a technique called positional encoding when creating the inputs to the initial Transformer block. Instead of only encoding each token using a token embedding, we also encode the position of the token using a position embedding. The token embedding is created using a standard embedding layer that converts each token into a learned vector. We can create the positional embedding in the same way, using a standard embedding layer to convert each integer position into a learned vector. To construct the joint token–position encoding, the token embedding is added to the positional embedding. This way, the meaning and position of each word in the sequence are captured in a single vector.

Training GPT
We need to pass the input text through the token and position embedding layer, then through our Transformer block. The final output of the network is a simple dense layer with softmax activation over the number of words in the vocabulary.
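Below is a minimal Keras sketch (assumes a recent TensorFlow/Keras) of that stack: token + position embeddings added together, one Transformer block (causally masked multihead attention with skip connections, layer normalization, and a feed-forward sublayer), and a softmax head over the vocabulary. All hyperparameters are illustrative only, not the slides' values.

```python
# A minimal GPT-style model sketch in Keras.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, max_len, embed_dim, num_heads, ff_dim = 10_000, 80, 256, 4, 256

inputs = keras.Input(shape=(max_len,), dtype="int32")          # token ids

# Joint token-position encoding: the two embeddings are simply added.
positions = tf.range(start=0, limit=max_len, delta=1)
x = layers.Embedding(vocab_size, embed_dim)(inputs) + \
    layers.Embedding(max_len, embed_dim)(positions)

# Transformer block: causal self-attention + skip + layer norm,
# then a feed-forward sublayer + skip + layer norm.
attn_out = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=embed_dim // num_heads)(
    x, x, use_causal_mask=True)                                # needs TF >= 2.10
x = layers.LayerNormalization()(x + attn_out)                  # skip connection
ff_out = layers.Dense(ff_dim, activation="relu")(x)
ff_out = layers.Dense(embed_dim)(ff_out)
x = layers.LayerNormalization()(x + ff_out)                    # skip connection

# Output head: next-token probabilities at every position.
outputs = layers.Dense(vocab_size, activation="softmax")(x)

gpt = keras.Model(inputs, outputs)
gpt.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
gpt.summary()
```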
GPT Architecture
(Figure: the simplified GPT model architecture.)

Other Transformers

| Type | Examples | Use cases |
| --- | --- | --- |
| Encoder | BERT (Google) | Sentence classification, named entity recognition, extractive question answering |
| Encoder-decoder | T5 (Google) | Summarization, translation, question answering |
| Decoder | GPT-3 (OpenAI) | Text generation |

(Figure: T5.)

The evolution of OpenAI's GPT collection of models

| Model | Date | Layers | Attention heads | Word embedding size | Context window | # Parameters | Training data |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT | Jun 2018 | 12 | 12 | 768 | 512 | 120,000,000 | BookCorpus: 4.5 GB of text from unpublished books |
| GPT-2 | Feb 2019 | 48 | 48 | 1,600 | 1,024 | 1,500,000,000 | WebText: 40 GB of text from outbound Reddit links |
| GPT-3 | May 2020 | 96 | 96 | 12,288 | 2,048 | 175,000,000,000 | CommonCrawl, WebText, English Wikipedia, book corpora, and others: 570 GB |
| GPT-4 | Mar 2023 | - | - | - | - | - | - |

ChatGPT
Before the beta release of GPT-4, OpenAI announced ChatGPT—a tool that allows users to interact with its suite of large language models through a conversational interface. The original release in November 2022 was powered by GPT-3.5, a version of the model that was more powerful than GPT-3 and was fine-tuned for conversational responses. ChatGPT uses a technique called reinforcement learning from human feedback (RLHF) to fine-tune the GPT-3.5 model.

The training process for ChatGPT
1. Supervised fine-tuning: collect a demonstration dataset of conversational inputs (prompts) and desired outputs that have been written by humans. This is used to fine-tune the underlying language model (GPT-3.5) using supervised learning.
2. Reward modeling: present a human labeler with examples of prompts and several sampled model outputs, and ask them to rank the outputs from best to worst. Train a reward model that predicts the score given to each output, given the conversation history.
3. Reinforcement learning: treat the conversation as a reinforcement learning environment where the policy is the underlying language model, initialized to the fine-tuned model from step 1. Given the current state (the conversation history), the policy outputs an action (a sequence of tokens), which is scored by the reward model trained in step 2. A reinforcement learning algorithm—proximal policy optimization (PPO)—can then be trained to maximize the reward by adjusting the weights of the language model.
(Figure: the reinforcement learning from human feedback fine-tuning process used in ChatGPT.)

THE END
