Large Language Models (LLMs) PDF
Document Details
Uploaded by BetterKnownEpigram
UBC
Summary
This document provides an overview of large language models (LLMs), including their architecture, training methods, and the main types of machine learning. It discusses core concepts such as traditional programming versus machine learning, embeddings, and neural networks, and it also covers why the problem is hard and how pre-training, fine-tuning, and prompting (few-shot and chain-of-thought) address it.
Full Transcript
Large Language Models (CPEN 320)

Traditional Programming (Software 1.0)
Input -> Program -> Output

Traditional Programming vs Machine Learning
- Software 1.0 (traditional programming): a human writes the program, which turns inputs into outputs.
- Software 2.0 (machine learning): example inputs and outputs are fed into training, which produces the program (the model).

Types of Machine Learning
- Unsupervised Learning: learn the structure of the data in order to predict (e.g., clustering).
- Supervised Learning: learn how data maps to labels in order to recognize or predict.
- Reinforcement Learning: learn how to act in an environment to obtain reward.
Examples: a review ("This product does what it is supposed..."), a photo of a cat, the spoken phrase "Hey Siri".

Embedding: Inputs and outputs are always numbers
- What we see: "Lincoln"
- What the machine "sees": [76, 105, 110, 99, 111, 108, 110]
(See the short Python check below.)

Why is this hard?
- An infinite variety of inputs can all mean the same thing (e.g., "I loved this movie", "As good as The Godfather").
- Meaningful differences can be tiny.
- The structure of the world is complex.

How is it done?
There are many methods for machine learning:
- Logistic Regression
- Support Vector Machines
- Decision Trees
But one is dominant: Neural Networks (also called Deep Learning).

Inspiration
Neural networks are inspired by what we know to be intelligent: the brain (input: see a cat -> output: say "cat"). The brain is composed of billions of neurons; each neuron receives electrical inputs and sends an electrical output.
(Image sources: https://www.the-scientist.com/the-nutshell/what-made-human-brains-so-big-36663, https://medicalxpress.com/news/2018-07-neuron-axons-spindly-theyre-optimizing.html)

Formalization
A "perceptron" is a vector of numbers: weights applied to the inputs to produce an output.
(Figures: https://www.jessicayung.com/explaining-tensorflow-code-for-a-multilayer-perceptron/)

Step Function
- The step function (also called the threshold function) decides, based on the weighted sum, whether the output is 0 or 1.
- In modern deep learning, the step function is less often used because it is not differentiable at the threshold point.
- Instead, smoother functions that approximate the step function but retain differentiability are used; for example, the sigmoid outputs a continuous value in the range (0, 1).
(See the NumPy sketch below.)

A "layer" is a matrix of numbers, and the neural network is a set of matrices, called "parameters" or "weights". Neural-network operations are just matrix multiplications, and GPUs are really fast at matrix multiplications.

Training
- The dataset contains n items.
- Batch: a subset of the n items used for one training step (if n is large, the whole dataset won't fit in memory).
- Epoch: one pass over all n items in the dataset.
(Figure: https://www.guru99.com/backpropogation-neural-network.html)
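To see the "everything is numbers" point from the Embedding slide concretely, the list [76, 105, 110, 99, 111, 108, 110] is just the character code of each letter of "Lincoln". A minimal Python check (real LLMs map tokens to learned embedding vectors rather than raw character codes, but the principle is the same):

```python
text = "Lincoln"
codes = [ord(c) for c in text]          # what the machine "sees"
print(codes)                            # [76, 105, 110, 99, 111, 108, 110]
print("".join(chr(n) for n in codes))   # back to what we see: "Lincoln"
```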
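The perceptron, step-function, and layer ideas above can also be written out in a few lines of NumPy. This is only an illustrative sketch: the weights, bias, and sizes below are made up, not taken from the slides.

```python
import numpy as np

def step(z):
    """Step / threshold function: 1 if the weighted sum is positive, else 0 (not differentiable at 0)."""
    return (z > 0).astype(float)

def sigmoid(z):
    """Smooth, differentiable approximation of the step function; output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([76.0, 105.0, 110.0])            # an input is just a vector of numbers

# A "perceptron" is a vector of weights (plus a bias): output = activation(w . x + b).
w = np.array([0.01, -0.02, 0.03])
b = -0.5
print(step(w @ x + b), sigmoid(w @ x + b))

# A "layer" is a matrix of weights: several perceptrons applied to the same input at once.
W = np.random.randn(4, 3) * 0.1               # 4 perceptrons, each with 3 weights
print(sigmoid(W @ x + np.zeros(4)))           # the layer's output is again a vector of numbers
```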
Training data: X (e.g., images) and labels y. Take a little batch of data x:
- Use the current model to make a prediction x -> y'
- Compute loss(y, y')
- Back-propagate the loss through all the layers of the model
Repeat until the loss stops decreasing (see the PyTorch sketch below).
(Source: https://developers.google.com/machine-learning/testing-debugging/metrics/interpretic)

Dataset Splitting
Split (X, y) into training (~80%), validation (~10%), and test (~10%) sets.
The validation set is for:
- ensuring that training is not "overfitting"
- setting hyper-parameters of the model (e.g., the number of parameters)
The test set is for measuring the validity of predictions on new data. THIS APPLIES TO YOUR EXPERIMENTATION WITH PROMPTS!

Pre-training and Fine-tuning
- Pre-training: slow training on a LOT of data, producing a large model.
- Fine-tuning: fast training on much less data, producing a fine-tuned large model.

Model Hubs
People share pre-trained models!
- ~180K models
- ~30K datasets
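Putting the training recipe above into code: the following is a minimal sketch (not the course's implementation) using PyTorch, with invented toy data, an arbitrary small model, and arbitrary hyper-parameters. It shows the ~80/10/10 split, the batch loop, the loss, and back-propagation.

```python
import torch
from torch import nn

# Toy data, invented for illustration: 1000 items, 20 features, binary labels.
X, y = torch.randn(1000, 20), torch.randint(0, 2, (1000,))

# Dataset splitting: ~80% training, ~10% validation, ~10% test.
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:900], y[800:900]
X_test, y_test = X[900:], y[900:]

# The model is a set of matrices ("parameters" / "weights").
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size = 64
for epoch in range(10):                                    # one epoch = one pass over all training items
    for i in range(0, len(X_train), batch_size):
        xb, yb = X_train[i:i + batch_size], y_train[i:i + batch_size]  # take a little batch x
        y_pred = model(xb)                                 # use the current model: x -> y'
        loss = loss_fn(y_pred, yb)                         # compute loss(y, y')
        optimizer.zero_grad()
        loss.backward()                                    # back-propagate through all the layers
        optimizer.step()                                   # update the weights
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val)            # watch the validation set for overfitting
        print(f"epoch {epoch}: validation loss {val_loss.item():.3f}")
```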
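The pre-training / fine-tuning split and the model hubs above are what make "download a pre-trained model, then fine-tune it" practical. As an illustration only (assuming the Hugging Face transformers library; the slides do not name a specific hub or checkpoint), loading a shared pre-trained model looks roughly like this:

```python
# Assumes: pip install transformers torch. The checkpoint name is chosen only as an example.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

batch = tokenizer(["I loved this movie", "My eyes are bleeding!"],
                  padding=True, return_tensors="pt")
logits = model(**batch).logits   # shape (2 sentences, 2 classes); the new head is untrained,
print(logits.shape)              # so the outputs are meaningless until fine-tuning

# Fine-tuning would now run the same batch / loss / back-propagation loop as above,
# but on a small labelled dataset and starting from these pre-trained weights.
```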
Before ~2020, each task had its own neural-network architecture. Now, all is Transformers. (http://lucasb.eyer.be/transformer; Transformer cartoon generated with DALL-E)

The Transformer Architecture
"Attention Is All You Need" (2017), https://arxiv.org/abs/1706.03762: a ground-breaking architecture that set the state of the art first on translation and later on all other NLP tasks.
How does the Transformer work? A breakthrough in AI that changes how we process data. The core idea: instead of reading word by word, transformers focus on the whole sentence, and on the important parts of a sentence, all at once.

Foundation Models - A New Paradigm
- Old paradigm: for each task, collect data and train a model.
- New paradigm: one foundation model serves many tasks through prompting.
Foundation models can produce human-like text (including code, e.g., Codex), do speech-to-text and translation, create images from text, create a description for an image, and write code from a text description ("Write my program for me!").

GPT Timeline
- June 2018: GPT-1, 117M parameters
- Feb 2019: GPT-2, 1.5 billion parameters, zero-shot learning
- June 2020: GPT-3, 175 billion parameters, in-context learning
- Nov 2022: GPT-3.5 (ChatGPT), Reinforcement Learning from Human Feedback (RLHF)
- March 2023: GPT-4 (ChatGPT), multimodal models

Interacting with LLMs: Prompting
Model fine-tuning used to be necessary. Larger (or instruction-tuned) models give intelligible responses even without it: just prompt them! (Taken from the GPT-3 paper: https://arxiv.org/pdf/2005.14165.pdf)

Few-shot Prompting
Running example: "Is the sentiment positive or negative?"
- Zero-Shot ([Instruction] [Input]): Is the sentiment positive or negative? "This movie sucks!" A:
- One-Shot, No Instruction ([Ex In 1] [Ex Out 1] [Input]): Q: This movie rocks! A: Positive. Q: "This movie sucks!" A:
- One-Shot ([Instruction] [Ex In 1] [Ex Out 1] [Input]): Is the sentiment positive or negative? Q: This movie rocks! A: Positive. Q: "This movie sucks!" A:
- Two-Shot ([Instruction] [Ex In 1] [Ex Out 1] [Ex In 2] [Ex Out 2] [Input]): Is the sentiment positive or negative? Q: This movie rocks! A: Positive. Q: My eyes are bleeding! A: Negative. Q: "This movie sucks!" A:
- Chain-of-Thought ([Instruction] [Request step-by-step explanation] [Input]): Is the sentiment positive or negative? Please explain your answer step-by-step. Q: "This movie sucks!" A:
(A code sketch assembling these formats appears at the end of the transcript.)

Prompting examples
Prompt Example 1 – Grammar Correction
Q: Correct this to standard English: Anna and Mike is going skiing.
Model: Anna and Mike are going skiing.
Prompt Example 2 – Summarization
Zhang, Tianyi, et al. "Benchmarking Large Language Models for News Summarization." arXiv preprint arXiv:2301.13848 (2023).
Prompt Example 3 – Style Transfer
Reif, Emily, et al. "A recipe for arbitrary text style transfer with large language models." arXiv preprint arXiv:2109.03910 (2021).
Prompt Example 4 – "Let's Think Step By Step."
Kojima, Takeshi, et al. "Large language models are zero-shot reasoners." arXiv preprint arXiv:2205.11916 (2022).
Prompt Example 5 – Joke Explanation
Taken from the PaLM paper: https://arxiv.org/pdf/2204.02311.pdf

Back to the Transformer architecture: the Transformer block is stacked many times.

Why does this work so well? We mostly don't understand it, though.

Should you be able to code a Transformer? Definitely not necessary! BUT: it is not difficult, it is fun, and it is probably worth doing. Andrej Karpathy's GPT-2 implementation is…
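The transcript cuts off while pointing to Andrej Karpathy's GPT-2 implementation. For a taste of what coding a Transformer involves, here is a minimal NumPy sketch of scaled dot-product self-attention, the operation that lets the model look at the whole sentence at once. The sequence length, dimensions, and random weights are purely illustrative, and a real Transformer adds multiple heads, feed-forward layers, and stacking.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token attends to every other token
    weights = softmax(scores, axis=-1)        # each row is a distribution over the whole sentence
    return weights @ V                        # every output mixes information from all positions at once

seq_len, d_model = 5, 16                      # e.g., 5 token embeddings of dimension 16
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16)
```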
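Finally, to make the few-shot prompting formats from the table above concrete, here is a small sketch that assembles the zero-shot, two-shot, and chain-of-thought variants as plain strings. The wording is taken from the slides; no particular model API is assumed.

```python
instruction = "Is the sentiment positive or negative?"
examples = [("This movie rocks!", "Positive."), ("My eyes are bleeding!", "Negative.")]
new_input = '"This movie sucks!"'

# Zero-shot: [Instruction] [Input]
zero_shot = f"{instruction}\n{new_input} A:"

# Two-shot: [Instruction] [Ex In 1] [Ex Out 1] [Ex In 2] [Ex Out 2] [Input]
shots = "\n".join(f"Q: {q} A: {a}" for q, a in examples)
two_shot = f"{instruction}\n{shots}\nQ: {new_input} A:"

# Chain-of-thought: [Instruction] [Request step-by-step explanation] [Input]
chain_of_thought = (f"{instruction} Please explain your answer step-by-step.\n"
                    f"Q: {new_input} A:")

print(two_shot)
```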