Web and Text Analytics 2024-25 Week 12 PDF
Document Details
University of Macedonia
2024
Evangelos Kalampokis
Summary
This presentation discusses large language models (LLMs) and their applications. It covers generative AI fundamentals, the Transformer architecture, prompting and in-context learning, generative configuration parameters (max tokens, top-p, temperature), retrieval-augmented generation (RAG), and fine-tuning approaches, including parameter-efficient fine-tuning.
Full Transcript
Web and Text Analytics 2024-25, Week 12
Evangelos Kalampokis
https://kalampokis.github.io | http://islab.uom.gr
© Information Systems Lab

Sam Altman's Reflections for 2024
https://blog.samaltman.com/reflections

Large Language Models
▪ Generative AI is a subset of traditional machine learning.
▪ The machine learning models that underpin generative AI have learned these abilities by finding statistical patterns in massive datasets of content originally generated by humans.
▪ Large language models (foundation models) have been trained on trillions of words over many weeks and months, using large amounts of compute power.

Foundation model | Parameters | Description
GPT-3 | 175 billion | Developed by OpenAI
BERT | 340 million | Created by Google
T5 | 11 billion | Google's Text-to-Text Transfer Transformer, versatile for various NLP tasks
PaLM | 540 billion | Pathways Language Model by Google
LLaMA | 65 billion | Meta's Large Language Model Meta AI
Claude 3.5 | Size not publicly disclosed | Developed by Anthropic

LLM terminology
▪ The way you interact with language models is quite different from other machine learning and programming paradigms.
▪ In those cases, you write computer code with formalized syntax to interact with libraries and APIs.
▪ In contrast, large language models are able to take natural language, or human-written instructions, and perform tasks much as a human would.
▪ The text that you pass to an LLM is known as a prompt.
▪ The space or memory that is available to the prompt is called the context window, and this is typically large enough for a few thousand words, but differs from model to model.
▪ The output of the model is called a completion, and the act of using the model to generate text is known as inference.

Transformer architecture
▪ Building large language models using the transformer architecture dramatically improved the performance of natural language tasks over the earlier generation of RNNs, and led to an explosion in generative capability.
▪ The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence.
▪ Attention weights are learned during LLM training.
▪ An attention map can be used to illustrate the attention weights between each word and every other word.

Encoder - Decoder
▪ The transformer architecture is split into two distinct parts, the encoder and the decoder.
▪ These components work in conjunction with each other and share a number of similarities.
▪ The embedding layer is a trainable vector embedding space, a high-dimensional space where each token is represented as a vector and occupies a unique location within that space.

Tokenizer – Embedding – Positional Encoding

Multi-headed self-attention
▪ Once we've summed the input tokens and the positional encodings, we pass the resulting vectors to the self-attention layer.
▪ The self-attention weights that are learned during training and stored in these layers reflect the importance of each word in the input sequence to all other words in the sequence.
▪ The transformer architecture uses multi-headed self-attention, meaning that multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other.
▪ Each self-attention head will learn a different aspect of language.
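To make the multi-headed self-attention mechanism concrete, below is a minimal NumPy sketch (not the lecture's code): the projection matrices, dimensions, and toy inputs are illustrative assumptions. Each head computes its own attention weights over the whole sequence in parallel, and the per-head outputs are concatenated and projected back to the model dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product self-attention with several heads computed in parallel.

    x: (seq_len, d_model) token embeddings, already summed with positional encodings.
    Wq, Wk, Wv, Wo: (d_model, d_model) learned projection matrices (random here).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the inputs into queries, keys and values, then split into heads.
    def split_heads(m):
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Attention weights: how relevant each token is to every other token, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                     # each row sums to 1

    # Weighted sum of values, heads concatenated back, final output projection.
    context = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return context @ Wo, weights

# Toy usage: 5 tokens, d_model = 8, 2 heads, random "learned" weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out, attn = multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads=2)
print(out.shape, attn.shape)   # (5, 8) (2, 5, 5)
```

The returned attention weights are exactly what an attention map visualizes: one (seq_len × seq_len) grid per head, showing how strongly each word attends to every other word.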
Feed-forward network and softmax output
▪ Now that all of the attention weights have been applied to the input data, the output is processed through a fully connected feed-forward network.
▪ The output of this layer is a vector of logits proportional to the probability score for each and every token in the tokenizer dictionary.
▪ We pass these logits to a final softmax layer, where they are normalized into a probability score for each word.
▪ This output includes a probability for every single word in the vocabulary, so there are likely to be thousands of scores here.

Prompt engineering
▪ The work to develop and improve the prompt is known as prompt engineering.
▪ Types of prompts include:
– Instruction-based prompts: clearly state what you want the model to do.
– Few-shot learning prompts
– Zero-shot prompts

Transformer Layers and the Prompt
▪ Encoder-Decoder Models (e.g., T5)
▪ In models like T5:
– The encoder processes the prompt (input sequence) to create a contextual representation.
– The decoder uses this representation, along with its own self-attention, to generate output.
▪ Decoder-Only Models (e.g., GPT)
▪ In models like GPT:
– The prompt is directly processed by the decoder's layers.
– The same transformer stack handles both the input (prompt) and the output (generated tokens), building context as more tokens are added.

In-context learning
▪ One powerful strategy to get the model to produce better outcomes is to include examples of the task that you want the model to carry out inside the prompt.
▪ Providing examples inside the context window is called in-context learning.
▪ The inclusion of a single example is known as one-shot inference, in contrast to the zero-shot prompt.
▪ We can extend the idea of giving a single example to include multiple examples. This is known as few-shot inference.

Zero-shot inference
▪ Within the prompt shown here, we ask the model to classify the sentiment of a review.
▪ The prompt consists of the instruction, "Classify this review," followed by some context, which in this case is the review text itself, and an instruction to produce the sentiment at the end.
▪ This method, including the input data within the prompt, is called zero-shot inference.

Zero-shot evaluation (Example)
▪ Performance of Open-source and Proprietary Large Language Models on Cardiology Board Exam-style Questions

One-shot inference
▪ The prompt text is longer and now starts with a completed example that demonstrates the task to be carried out to the model.
▪ The inclusion of a single example is known as one-shot inference.

Few-shot inference
▪ We can extend the idea of giving a single example to include multiple examples.
▪ This is known as few-shot inference.

In-context learning
▪ While the largest models are good at zero-shot inference with no examples, smaller models can benefit from one-shot or few-shot inference that includes examples of the desired behavior.
▪ But remember the context window, because there is a limit on the amount of in-context learning that you can pass into the model.
▪ You may have to try out a few models to find the right one for your use case.
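To illustrate the difference between zero-, one-, and few-shot prompts, here is a small sketch that assembles each variant for the sentiment-classification example above. The example reviews, the prompt template, and the call_llm placeholder are hypothetical; any completion API could stand in for it.

```python
# Hypothetical sketch: building zero-, one- and few-shot prompts for
# sentiment classification. `call_llm` stands in for any completion API.

EXAMPLES = [  # completed prompt-completion examples for in-context learning
    ("I loved this movie, the acting was superb.", "Positive"),
    ("Total waste of time, the plot made no sense.", "Negative"),
]

def build_prompt(review: str, shots: int = 0) -> str:
    """Return a prompt with `shots` completed examples prepended (0 = zero-shot)."""
    parts = []
    for text, label in EXAMPLES[:shots]:
        parts.append(f"Classify this review: {text}\nSentiment: {label}\n")
    parts.append(f"Classify this review: {review}\nSentiment:")
    return "\n".join(parts)

new_review = "The soundtrack was wonderful but the story dragged."
print(build_prompt(new_review, shots=0))  # zero-shot: instruction + input only
print(build_prompt(new_review, shots=1))  # one-shot: a single worked example
print(build_prompt(new_review, shots=2))  # few-shot: multiple worked examples
# completion = call_llm(build_prompt(new_review, shots=2))  # hypothetical API call
```

Note that every added example consumes part of the context window, which is the practical limit on how much in-context learning can be packed into one prompt.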
Generative configuration
▪ Each model exposes a set of configuration parameters that can influence the model's output during inference.
▪ These are different from the training parameters, which are learned during training time.

Max tokens
▪ The max tokens setting defines the maximum number of tokens a model can process in one go. It includes both:
– Input tokens (the text you provide to the model).
– Output tokens (the model's response or completion).
▪ The max tokens parameter is constrained by the context window size: input tokens plus output tokens cannot exceed the context window.

Max new tokens
▪ Max new tokens (or response length) can be used to limit the number of tokens that the model will generate.
▪ Note that this is a maximum of new tokens, not a hard number of new tokens generated; the model may stop earlier, for example when it predicts an end-of-sequence token.

Random sampling
▪ The output from the transformer's softmax layer is a probability distribution across the entire dictionary of words that the model uses.
▪ Most large language models by default will operate with so-called greedy decoding.
– This is the simplest form of next-word prediction, where the model will always choose the word with the highest probability.
▪ Random sampling is the easiest way to introduce some variability.
▪ Instead of selecting the most probable word every time, with random sampling the model chooses an output word at random, using the probability distribution to weight the selection.

Top-P
▪ Top-p is a parameter that determines the subset of possible next tokens the model considers when generating text.
▪ Instead of picking from all possible tokens, the model selects from the smallest group of tokens whose cumulative probability adds up to p (e.g., 0.9 or 90%).
▪ It helps balance creativity and coherence:
– High top-p values (close to 1) allow for more diverse and creative outputs.
– Low top-p values focus on more predictable, high-probability tokens, making the text more deterministic.

Top-P Example
▪ The model is predicting the next word in a sentence such as "The cat sat on the ...", where the candidate next tokens are "mat" (0.4), "floor" (0.3), "roof" (0.15), "sofa", and "tree".
▪ Top-p = 0.9:
– The model will consider the tokens "mat," "floor," and "roof" (cumulative probability: 0.4 + 0.3 + 0.15 = 0.85).
– Tokens like "sofa" and "tree" are excluded, because adding them would push the cumulative probability beyond 0.9.
▪ Top-p = 0.6:
– The model will only consider "mat" and "floor": their cumulative probability (0.4 + 0.3 = 0.7) already covers 0.6, so "roof" and the remaining tokens are excluded.

Key Takeaways on Top-P
▪ Top-p = 1.0: equivalent to no filtering (all tokens considered).
▪ Lower top-p: narrows down choices, leading to more predictable text.
▪ Higher top-p: allows for more diverse and unexpected outputs.

Temperature
▪ The temperature parameter in language models controls the randomness or creativity of the generated output. Adjusting the temperature influences how the model selects the next token from the probability distribution produced by the softmax layer.
▪ The temperature parameter is a scaling factor applied to the logits (raw scores) before they are converted to probabilities by the softmax function: the logits are divided by T, so T below 1 sharpens the distribution and T above 1 flattens it.

Key Effects of Temperature
▪ High temperature (T > 1):
– Smooths the probability distribution, making less probable tokens more likely to be selected.
– Encourages creativity and diversity, but may lead to nonsensical outputs.
– Example: "The cat sat on the roof."
▪ Low temperature (T < 1):
– Sharpens the distribution around the most probable tokens, so the output becomes more predictable and closer to greedy decoding (with the probabilities above, the model would almost always pick "mat").
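Below is a minimal sketch of how temperature and top-p act on the softmax output, using the toy five-token vocabulary and probabilities from the example above (an assumption for illustration); real decoders apply the same steps over the full vocabulary at every generation step. Implementations differ on whether the token that crosses the cumulative threshold p is kept; this sketch keeps it.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab  = np.array(["mat", "floor", "roof", "sofa", "tree"])
logits = np.log(np.array([0.40, 0.30, 0.15, 0.10, 0.05]))  # toy scores for the example

def sample_next_token(logits, temperature=1.0, top_p=1.0):
    """Temperature scaling, then top-p (nucleus) filtering, then weighted sampling."""
    # Temperature: divide the logits by T before applying the softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative probability covers p
    # (the token that crosses the threshold is kept in this variant).
    order = np.argsort(probs)[::-1]                  # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]

    kept_probs = probs[keep] / probs[keep].sum()     # renormalise over the kept tokens
    return vocab[rng.choice(keep, p=kept_probs)]

print(sample_next_token(logits, temperature=0.3, top_p=1.0))  # almost always "mat"
print(sample_next_token(logits, temperature=1.5, top_p=0.9))  # flatter, more varied choices
```

Setting temperature very low approaches greedy decoding, while top_p=1.0 disables the nucleus filter entirely, matching the key takeaways above.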
Temperature effects (figure slides)

Retrieval-Augmented Generation (RAG)
▪ Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model so that it references an authoritative knowledge base outside of its training data sources before generating a response.
▪ RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model.

RAG (Example)
▪ Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions
– Used Harrison's Principles of Internal Medicine as an external data source
– https://doi.org/10.1371/journal.pdig.0000604

Fine-tuning LLMs
▪ Fine-tuning is the process of adapting a pre-trained language model to a specific task, domain, or dataset by continuing its training. This allows the model to perform better on specialized tasks while leveraging its general language understanding.
▪ Types of fine-tuning:
– Task-specific fine-tuning: e.g., fine-tuning GPT-3 for financial report summarization.
– Domain adaptation: adapts the model to work better in specific domains like healthcare, legal, or technical text.
– Instruction tuning: focuses on improving the model's ability to follow user instructions effectively.

Continuous Pretraining vs Fine-Tuning
▪ Continuous pretraining and fine-tuning are two distinct approaches to adapting a pre-trained language model to a specific task, domain, or dataset. While both involve further training, their goals, methods, and scopes differ significantly.
▪ Continuous pretraining extends the general training phase of the model by exposing it to additional, typically domain-specific, unlabeled or lightly curated text data.
– Goal: to improve the model's general understanding of a specific domain or type of text.
– Scope: broad and domain-specific, not task-specific.
– Benefits: improves the model's foundational knowledge in a specific domain and provides better generalization for tasks within that domain.
– Drawbacks: requires large amounts of domain-specific data; computationally intensive and time-consuming.

Fine-tuning
▪ In contrast to pre-training, where you train the LLM using vast amounts of unstructured textual data via self-supervised learning, fine-tuning is a supervised learning process where you use a dataset of labeled examples to update the weights of the LLM.
▪ The labeled examples are prompt-completion pairs; the fine-tuning process extends the training of the model to improve its ability to generate good completions for a specific task.

Instruction fine-tuning
▪ These prompt-completion examples allow the model to learn to generate responses that follow the given instructions.
▪ Instruction fine-tuning where all of the model's weights are updated is known as full fine-tuning.
▪ The process results in a new version of the model with updated weights.
▪ It is important to note that, just like pre-training, full fine-tuning requires enough memory and compute budget to store and process all the gradients, optimizer states, and other components that are being updated during training.
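As a concrete illustration of the prompt-completion pairs used for instruction fine-tuning, here is a small sketch that converts labeled examples into training records in a typical JSONL format. The template, reviews, and file name are hypothetical; an actual full fine-tuning run would then feed these records to a supervised trainer that updates all of the model's weights.

```python
import json

# Hypothetical sketch: turning labeled examples into the prompt-completion
# records used for instruction fine-tuning.
labeled_reviews = [
    ("I loved this movie, the acting was superb.", "Positive"),
    ("Total waste of time, the plot made no sense.", "Negative"),
]

TEMPLATE = "Classify this review:\n{review}\nSentiment:"

records = [
    {"prompt": TEMPLATE.format(review=text), "completion": " " + label}
    for text, label in labeled_reviews
]

# Write the pairs out in a common JSONL training format; a supervised trainer
# would consume this file and update the model weights on these completions.
with open("instruction_tuning_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

print(records[0])
```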
Fine-tune on a single task
▪ We can fine-tune a pre-trained model to improve performance on only a task of interest, e.g., summarization.
▪ In this case, 500-1,000 examples can result in good performance, in contrast to the billions of pieces of text that the model saw during pre-training.
▪ The process may lead to a phenomenon called catastrophic forgetting.
▪ Catastrophic forgetting happens because the full fine-tuning process modifies the weights of the original LLM.
▪ While this leads to great performance on the single fine-tuning task, it can degrade performance on other tasks.
▪ In order for the model to maintain its multitask, generalized capabilities, we can perform fine-tuning on multiple tasks at one time.
▪ Good multitask fine-tuning may require 50-100,000 examples across many tasks, and so will require more data and compute to train.

Parameter efficient fine-tuning (PEFT)
▪ In contrast to full fine-tuning, where every model weight is updated during supervised learning, parameter efficient fine-tuning methods only update a small subset of parameters.
▪ Some PEFT techniques freeze most of the model weights and focus on fine-tuning a subset of existing model parameters, for example, particular layers or components.
▪ Other techniques don't touch the original model weights at all, and instead add a small number of new parameters or layers and fine-tune only the new components.
▪ Because the original LLM is only slightly modified or left unchanged, PEFT is less prone to the catastrophic forgetting problems of full fine-tuning.
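As a sketch of the second family of PEFT techniques (freezing the original weights and adding a small number of new trainable parameters), the snippet below attaches a LoRA adapter using the Hugging Face peft library; the base model (gpt2) and the hyperparameters are illustrative assumptions, not the lecture's setup.

```python
# Sketch of LoRA-style parameter-efficient fine-tuning with the `peft` library
# (assumed installed alongside `transformers`); values here are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # small model for illustration

# Freeze the original weights and add small trainable low-rank adapter matrices.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the adapter matrices
    lora_alpha=32,    # scaling factor for the adapter update
    lora_dropout=0.05,
)
peft_model = get_peft_model(base_model, lora_config)

# Only a tiny fraction of parameters is trainable, so fine-tuning needs far less
# memory, and the original LLM weights are left untouched, which is why PEFT is
# less prone to catastrophic forgetting than full fine-tuning.
peft_model.print_trainable_parameters()
```

The wrapped model can then be trained with the same supervised prompt-completion data as full fine-tuning, but only the adapter parameters receive gradient updates.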