Large Language Models (LLMs)

Questions and Answers

Which of the following best describes the role of parameter count in qualifying an LLM as 'large'?

  • It’s inherently vague, and what's considered 'large' evolves over time. (correct)
  • A definitive threshold exists, universally agreed upon by experts.
  • The parameter count is irrelevant; size is determined by the dataset used.
  • A model must have at least 1 trillion parameters to be considered large.

What is the primary function of Retrieval-Augmented Generation (RAG) in the context of LLMs?

  • To compress the size of LLMs for easier deployment.
  • To enhance LLMs with information retrieved from external document sources. (correct)
  • To reduce the computational cost of training LLMs.
  • To enable LLMs to interact with dynamic environments.

Which approach involves initially correcting a few naive AI responses to bootstrap a large dataset of correct responses?

  • Self-Instruct (correct)
  • Mixture of Experts
  • Reinforcement learning from human feedback
  • Proximal Policy Optimization

In the context of LLMs, what is the purpose of 'attention heads'?

  • They calculate 'soft' weights to determine the relevance of other tokens within the context window. (correct)

What is the main goal of post-training quantization in the context of LLMs?

  • To decrease the space requirement of a trained model while preserving performance. (correct)

Which of the following is a key consideration when evaluating the effectiveness of LLMs using perplexity?

  • The model's likelihood of including portions of the test set in its training data. (correct)

What characterizes 'emergent abilities' in large language models?

  • They arise from the complex interaction of the model's components and are not explicitly programmed. (correct)

According to the information provided, what is the main role of Reinforcement Learning from Human Feedback (RLHF) in LLMs?

  • To fine-tune a model based on a dataset of human preferences. (correct)

What is the primary challenge associated with using LLM-generated content for training new LLMs?

  • The difficulty in distinguishing LLM-generated content from human text, potentially degrading performance. (correct)

How do models trained for step-by-step solutions improve complex reasoning tasks?

  • By spending more time on generating step-by-step solutions before final answers. (correct)

What is a token in the context of dataset preprocessing for LLMs?

  • A numerical index assigned to a vocabulary entry. (correct)

What is the purpose of using control characters like [MASK] or [UNK] during tokenization?

  • To represent masked tokens or characters not in the vocabulary. (correct)

Why might a token vocabulary based on English frequencies be suboptimal for other languages?

  • It may split words into a suboptimal number of tokens compared to the ideal. (correct)

What is a potential consequence of 'sleeper agents' in large language models?

  • They can cause the model to deviate from expected behavior and take insecure actions. (correct)

Complete the analogy: Algorithmic bias is to skewed representations and unfair treatment as Stereotyping is to...

  • Reinforcement of a wide range of stereotypes based on demographic information. (correct)

What is one reason transformer-based LLM training is expensive?

  • The high number of FLOPs per parameter required to train on one token. (correct)

According to the content, what constitutes a 'large' language model in 2024, in terms of model size?

  • There is no absolute number; models previously considered large may no longer be considered so. (correct)

Which practice does not contribute to dataset cleaning in the context of training LLMs?

  • Adding more data regardless of quality (correct)

What is a potential problem that greedy tokenization can cause?

  • It can cause problems with text completion. (correct)

Based on the text, what factors mostly affect the performance of an LLM after pretraining?

  • The cost of pretraining, the size of the artificial neural network itself, and the size of its pretraining dataset. (correct)

Some commenters expressed concern over which of the following pertaining to Large Language Models?

  • Accidental/deliberate creation of misinformation (correct)

In the context of language models, what does 'hallucination' refer to?

  • The model's tendency to generate text that is factually incorrect or nonsensical. (correct)

What does the acronym BPE stand for in relation to LLMs?

  • Byte-Pair Encoding (correct)

Why are some AI researchers concerned about RLHF creating a “smiling facade” in LLMs?

  • It obscures less desirable thought processes or insanity. (correct)

GPT-4o achieved 13% accuracy while o1 reached 83% when both took an International Mathematics Olympiad qualifying exam; this gap illustrates improvements in what?

  • Reasoning tasks, solved by generating step-by-step solutions before final answers. (correct)

What is the main idea behind the ReAct pattern?

  • Making an agent out of an LLM, using the LLM as a planner. (correct)

Which of the following best defines 'multimodality' in the context of AI models like LLMs?

  • The capacity to understand and generate different types of input or output, such as video, image, or text. (correct)

Which of the following is not a major reason for algorithmic bias?

  • Poor data quality (correct)

How can in-context learning be described?

  • Learning from examples provided in the prompt during a conversation, without updating the model's weights. (correct)

Which of the following is an example of a problem that an LLM cannot reliably solve on its own, without an external tool?

  • 354*139 (correct)

According to the reading, why must text be converted to numbers?

  • Because machine learning algorithms process numbers rather than text. (correct)

Regarding emergent abilities, what does recent research suggest?

  • LLMs can employ heuristic reasoning akin to human cognition. (correct)

Which statement describes the relationship between LLMs and their energy demands?

  • The energy demands of LLMs have grown along with their size and capabilities. (correct)

What can be used to further fine-tune a model that is based on a dataset of human preferences?

  • Reinforcement learning from human feedback (correct)

What is the function of tokenization?

  • It converts text to numbers. (correct)

Some people are skeptical of LLMs; what do they believe LLMs are doing?

  • Simply remixing and recombining existing writing. (correct)

The canonical measure of the performance of an LLM is its __________ on a given text corpus.

  • Perplexity (correct)

Besides 'What is the time now? It is ', what are two other examples of tasks that require external tools?

  • Cases where a separate program interpreter would need to execute code, and 354*139. (correct)

Concerning transformer-based LLM training cost, which statement is accurate?

  • It is much higher than inference cost. (correct)

Flashcards

Large Language Model (LLM)

A type of machine learning model designed for natural language processing tasks like language generation. They have many parameters and are trained with self-supervised learning on a vast amount of text.

Self-Supervised Learning

Training a model to predict the next word in a sequence, allowing it to acquire predictive power regarding syntax, semantics, and ontologies.

Generative Pretrained Transformers (GPTs)

The largest and most capable LLMs that use the transformer architecture to generate human-like text.

Fine-tuning

Adapting a pre-trained model for specific tasks using a task-specific dataset or guiding the model.

Tokenization

The process of converting words or pieces of text into numerical representations (tokens) that machine learning algorithms can process.

Byte-Pair Encoding (BPE)

An algorithm that merges the most frequent pair of adjacent characters into a bi-gram, and repeatedly merges frequent n-grams until a vocabulary of a prescribed size is obtained.

Dataset Cleaning

The process of cleaning datasets by removing low-quality, duplicated, or toxic data to increase training efficiency and improve performance.

Synthetic Data

Training a language model on artificially created data. Used when naturally occurring data is insufficient or of poor quality.

Reinforcement Learning from Human Feedback (RLHF)

A technique used to further fine-tune a model based on a dataset of human preferences, via algorithms such as proximal policy optimization.

Instruction Tuning

A technique to enable LLMs to generate correct responses by starting from human-generated corrections of a few cases.

Mixture of Experts (MoE)

A technique for models that are too expensive to train and use directly; a line of research pursued by Google researchers since 2017, it has been used to train models reaching up to 1 trillion parameters.

Prompt Engineering

A technique to achieve results previously only achievable by costly fine-tuning, although the results are limited to the scope of a single conversation.

Attention Mechanism

To find out which tokens are relevant to each other within the scope of the context window, the algorithm calculates "soft" weights for each token (more precisely, for its embedding) using multiple attention heads, each with its own notion of "relevance" for calculating its own soft weights.

Context Window

The number of tokens a model can consider when generating a response. Limits how much of a conversation the model remembers.

Tool Use

Using external tools to enhance the capabilities of language models. Language models cannot solve all problems themselves, so tools can be required to get the best answers.

Retrieval-Augmented Generation (RAG)

An approach that enhances LLMs by integrating them with document retrieval systems. A document retriever is called to retrieve the most relevant documents, which are then passed to the LLM.

Agency

The ability of an LLM to interact with dynamic environments, recall past behaviors, and plan future actions, typically achieved by integrating modules like profiling, memory, planning, and action.

Post-Training Quantization

Decreasing the space requirement of a trained model by lowering the precision of the parameters, while preserving most of its performance.

Multimodality

Having the ability to process or generate other types of data, such as images or audio.

Reasoning models

Models trained to generate step-by-step solutions, improving performance on problems that require multi-step reasoning.

Scaling laws

Empirical laws relating LLM performance to factors like the cost of training, the size of the neural network, and the size of the pretraining dataset.

Emergent abilities

Abilities that arise from the complex interaction of a model's components rather than being explicitly programmed. Recent research suggests such systems can employ heuristic reasoning akin to human cognition, balancing exhaustive logical processing against cognitive shortcuts.

Next Sentence Prediction (NSP)

Models may be trained on auxiliary tasks which test their understanding of the data distribution, such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether they appear consecutively in the training corpus.

Misinformation, by LLMs

The creation of misinformation, or other forms of misuse, which large language models now facilitate by reducing the skill required.

"Sleeper agents" in LLMs

The potential presence of hidden functionalities built into the model that remain dormant until triggered by a specific event or condition.

Algorithmic bias

The tendency for LLMs to inherit and amplify biases present in their training data, leading to skewed representations or unfair treatment of different demographics.

Perplexity

The canonical measure of an LLM's performance on a given text corpus. It measures how well a model predicts the contents of a dataset: the higher the likelihood the model assigns to the dataset, the lower the perplexity.

Study Notes

Large Language Models (LLMs) Defined

  • LLMs are a type of machine learning model for natural language processing tasks like language generation
  • LLMs are language models trained with self-supervised learning on large quantities of text
  • They have many parameters
  • Generative Pre-trained Transformers (GPTs) are among the largest and most capable LLMs
  • Modern models can be fine-tuned for specific tasks or guided by prompt engineering
  • LLMs gain predictive abilities with syntax, semantics, and ontologies, but can also inherit inaccuracies and biases

Historical Context

  • In the 1990s, the IBM alignment models pioneered statistical language modeling
  • In 2001, a smoothed n-gram model trained on 0.3 billion words achieved state-of-the-art perplexity
  • Researchers constructed internet-scale language datasets in the 2000s as internet use became prevalent
  • By 2009, statistical language models dominated over symbolic language models because they could ingest large datasets
  • Google converted its translation service to Neural Machine Translation in 2016, using seq2seq deep LSTM networks
  • Google introduced the transformer architecture in 2017, aiming to improve upon the 2014 seq2seq technology
  • The transformer architecture is based on the attention mechanism
  • Decoder-only GPT-1 was introduced in 2018
  • BERT, an encoder-only model, became "ubiquitous" in 2018, though its use declined after 2023 following rapid advancements in decoder-only models
  • GPT-2 caught attention in 2019; OpenAI initially deemed it too powerful to release publicly, fearing malicious use
  • GPT-3 in 2020 went a step further and is available only via API
  • ChatGPT, a consumer-facing, browser-based interface, captured the public imagination in 2022
  • GPT-4, released in 2023, was praised for its increased accuracy and hailed as a "holy grail" for its multimodal capabilities
  • OpenAI released the reasoning model OpenAI o1 in 2024, which generates long chains of thought before returning a final answer
  • BLOOM and LLaMA are source-available models with restrictions on the field of use
  • Mistral AI's Mistral 7B and Mixtral 8x7b models carry the more permissive Apache License
  • DeepSeek released DeepSeek R1 in January 2025

Dataset Preprocessing And Tokenization

  • As machine learning algorithms process numbers rather than text, the text must be converted to numbers
  • To begin, a vocabulary is selected; integer indices are then assigned to its entries, and an embedding is associated with each integer index
  • Byte-pair encoding (BPE) and WordPiece are algorithms used to perform this conversion (a minimal BPE sketch follows this list)
  • "[MASK]" is a special token for a masked-out token, used in BERT
  • "[UNK]" ("unknown") is a special token for characters not appearing in the vocabulary
  • Special symbols such as "Ġ" and "##" are used to denote special text formatting
  • For example, the BPE tokenizer used by GPT-3 (Legacy) would split tokenizer: texts -> series of numerical "tokens" as: token / izer / : / texts / -> / series / of / numerical / " / t / ok / ens / "
  • Tokenization also compresses the datasets
  • LLMs generally require input to be an array that is not jagged, so shorter texts must be "padded" to match the length of the longest one
  • Byte-pair encoding merges the most frequent pairs of adjacent characters (and then of the resulting n-grams) until the vocabulary reaches a prescribed size
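
To make the byte-pair encoding step above concrete, here is a minimal sketch in Python. It is an illustration only, not the tokenizer of any particular model; the toy corpus and the num_merges value are arbitrary assumptions.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent pair of adjacent symbols."""
    words = [list(word) for word in corpus]   # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in words:
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for symbols in words:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)     # replace the pair with the merged n-gram
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges, tokenized = train_bpe(corpus, num_merges=5)
print(merges)      # learned merge rules, e.g. ('l', 'o'), ('lo', 'w'), ...
print(tokenized)   # each word as a sequence of sub-word tokens
```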

Data and Model Refinement

  • Once a tokenizer is trained, any text can be tokenized by it, as long as the text contains only characters appearing in the initial set of uni-grams
  • A token vocabulary based on English frequencies uses as few tokens as possible for an average English word
  • Languages such as Portuguese and German have "a premium of 50%" compared to English
  • An English-optimized tokenizer splits a word in another language into a suboptimal number of tokens
  • In the context of training LLMs, datasets are typically cleaned by removing low-quality, duplicated, or toxic data
  • Datasets are usually cleaned to increase training efficiency and improve downstream performance
  • Microsoft's Phi series of LLMs is trained on textbook-like data generated by another LLM
  • Reinforcement learning from human feedback (RLHF) through algorithms, such as proximal policy optimization, is used to further fine-tune a model based on a dataset of human preferences

Approaches to AI Model Training

  • "Self-instruct" approaches to the AI model have been able to bootstrap correct responses
  • The largest LLM may be too expensive to train and use directly
  • Mixture of experts (MoE) can be applied, pursued by Google researchers since 2017 to train models, to address this cost
  • "Soft" weights is achieved via prompt engineering
  • Prompts and multiple heads can have varying relevancy

Components of an LLM

  • The attention mechanism calculates "soft" weights for each token, more precisely for its embedding, to determine the relevance of other tokens within the context window (see the sketch after this list)
  • The small version of GPT-2 has twelve attention heads and a context window of only about 1,000 tokens
  • Google's Gemini 1.5 has a context window of up to 1 million tokens
  • Anthropic's Claude 2.1 has a context window of up to 200k tokens
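
As a rough illustration of how an attention head computes these "soft" weights, here is a minimal single-head scaled dot-product attention in Python/NumPy. The dimensions and random projection matrices are made-up assumptions; real models use many heads and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings within the context window.
    Returns the head output and the (seq_len, seq_len) matrix of soft weights,
    i.e. how relevant each token is to every other token."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)    # pairwise relevance scores
    weights = softmax(scores, axis=-1)    # "soft" weights; each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                  # 4 tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))  # one 4-dim head
out, weights = attention_head(X, W_q, W_k, W_v)
print(weights.round(2))   # row i: how much token i attends to each other token
```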

Context Limitations and Pre-Training

  • Models may be pre-trained either to predict how the segment continues, or what is missing in the segment from its training dataset
  • Regularization loss is also used during training to stabilize it
  • Infrastructure has substantial bearing on the development of AI models

LLMs in terms of training cost and parameter counts

  • GPT-1 of 2018 has only 0.117 billion parameters
  • The qualifier "large" in "large language model" is inherently vague
  • Training of the GPT-2 (i.e. a 1.5-billion-parameters model) in 2019 cost $50,000
  • The PaLM (i.e. a 540-billion-parameters model) in 2022 cost $8 million
  • Megatron-Turing NLG 530B (in 2021) cost around $11 million
  • For a transformer-based LLM, training costs about 6 FLOPs per parameter per token, whereas inference costs 1 to 2 FLOPs per parameter per token (see the back-of-the-envelope calculation below)
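
A back-of-the-envelope check of the 6-FLOPs rule above; the parameter and token counts are illustrative assumptions, not figures from any vendor's actual training run.

```python
# "~6 FLOPs per parameter per token" for training,
# "~1-2 FLOPs per parameter per token" for inference.
params = 1.5e9      # e.g. a GPT-2-sized model (1.5 billion parameters)
tokens = 40e9       # hypothetical pretraining set size, in tokens

train_flops = 6 * params * tokens          # ~3.6e20 FLOPs total
infer_flops_per_token = 2 * params         # upper end of the inference rule

print(f"training:  ~{train_flops:.2e} FLOPs")
print(f"inference: ~{infer_flops_per_token:.2e} FLOPs per generated token")
```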

Tool Integration and Agents

  • Certain tasks cannot be solved by LLMs on their own and require the use of external tools or software
  • Retrieval-augmented generation (RAG) enhances LLMs with document retrieval systems (a minimal sketch follows this list):
    • A document retriever retrieves the most relevant documents given a query
    • Documents and the query are encoded as vectors, and the documents with vectors closest to the query vector are selected
    • The LLM then generates an output based on both the query and the retrieved context
  • Integrating modules such as profiling, memory, planning, and action transforms an LLM into an autonomous agent that interacts with dynamic environments, recalls past behaviors, and plans future actions
  • The ReAct pattern (a portmanteau of "Reason + Act") constructs an agent out of an LLM, using the LLM as a planner
  • "Describe, Explain, Plan and Select" (DEPS) is a method in which the LLM produces plans based on image descriptions

Advanced Tools

  • The Reflexion method constructs an agent that learns over multiple episodes:
    • At the end of each episode, the LLM is prompted to think up "lessons learned" from the record of the episode
    • These lessons are then given to the agent in subsequent episodes
  • For open-ended exploration, an LLM can be used to score observations for their "interestingness"
  • LLM planners can also construct "skills" or "functions" for useful action sequences, which are stored and invoked later

Memory and compression

  • LLM-powered agents can keep long-term memories of previous contexts, retrieve them later, and interact socially with other agents
  • LLMs are typically trained with single- or half-precision floating point numbers (float32 and float16)
  • The largest models typically have 100 billion parameters, requiring 200 gigabytes to load, which places them outside the range of most consumer electronics
  • Post-training quantization cuts the space requirement by lowering the precision of the parameters while preserving most of the performance (a sketch follows this list)
  • Compression replaces single-precision floating point numbers with lower-precision representations
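
A small sketch of post-training quantization as described above: float32 weights are mapped to 8-bit integers plus a per-tensor scale, cutting the space requirement roughly 4x while approximately preserving the values. Real schemes (per-channel scales, 4-bit formats, outlier handling) are more sophisticated.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(w).max() / 127.0                      # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes before:", w.nbytes, "after:", q.nbytes)     # 64 -> 16
print("max abs error:", float(np.abs(w - w_hat).max()))  # small: performance mostly preserved
```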

Multimodality and LLMs

  • Multimodality means handling several types of input or output, such as video, image, audio, text, or proprioception; some AI models are trained to ingest one modality and output another modality
  • Examples of single-modality converters include AlexNet (image to label) and speech recognition (speech to text)
  • A common method to create a multimodal model out of an LLM is to "tokenize" the output of a trained encoder for the other modality
  • One can construct an LLM that understands images as follows (see the sketch after this list):
    • Take a trained LLM and a trained image encoder
    • Train a small multilayer perceptron so that the encoded image has the same dimensions as an encoded token, and feed it to the LLM like any other token
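
A sketch of the image-understanding recipe above: a frozen image encoder's output is passed through a small multilayer perceptron so that it matches the dimension of a token embedding, and the result is fed to the LLM like any other token. The encoder, dimensions, and class names here are illustrative assumptions; in practice the projection is trained on image-text pairs.

```python
import numpy as np

D_IMAGE, D_MODEL = 512, 768   # hypothetical encoder output size and LLM embedding size

def image_encoder(image) -> np.ndarray:
    """Placeholder for a trained image encoder E(y)."""
    return np.random.default_rng(0).normal(size=D_IMAGE)

class ProjectionMLP:
    """Small MLP f so that f(E(y)) has the same dimensions as an encoded token."""
    def __init__(self, rng):
        self.W1 = rng.normal(scale=0.02, size=(D_IMAGE, D_MODEL))
        self.W2 = rng.normal(scale=0.02, size=(D_MODEL, D_MODEL))

    def __call__(self, x):
        h = np.maximum(0.0, x @ self.W1)   # ReLU hidden layer
        return h @ self.W2                 # one "image token" of shape (D_MODEL,)

rng = np.random.default_rng(1)
project = ProjectionMLP(rng)
image_token = project(image_encoder("photo.png"))
text_tokens = rng.normal(size=(5, D_MODEL))                 # embeddings of 5 text tokens
llm_input = np.vstack([image_token[None, :], text_tokens])  # image token prepended
print(llm_input.shape)   # (6, 768)
```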

Reasoning Skills

  • A new direction in LLM development is "reasoning models", trained to spend more time generating step-by-step solutions before providing a final answer; OpenAI introduced this with o1 in September 2024, followed by o3 in December 2024
  • Reasoning models show significant improvements in coding, mathematics, and science; on an International Mathematics Olympiad qualifying exam, GPT-4o achieved 13% accuracy while o1 reached 83%
  • DeepSeek-R1 is more cost-effective to operate
  • Reasoning models require more computational resources but offer superior capabilities in these domains

Hallucinations

  • To mitigate hallucinations, techniques such as automated reasoning checks and retrieval-augmented generation (RAG) are used

Model Properties Depend On

  • The cost of pretraining (C) is the total amount of compute used, measured in FLOPs
  • Scaling laws relate LLM performance to the cost of pretraining, the size of the neural network itself, and the size of its pretraining dataset

Scaling laws

  • Scaling laws express model performance as a function of the cost of model training and related quantities
  • The statistical hyper-parameters involved are C (the training cost in FLOPs), N (the number of parameters in the model), and D (the size of the pretraining dataset in tokens); one concrete fitted form is shown below
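
One widely cited concrete form is the "Chinchilla" scaling law, which predicts pretraining loss L from N and D (the fitted constants below are approximate and specific to that study), together with the rule of thumb C ≈ 6·N·D relating compute to parameters and tokens:

```latex
L(N, D) = L_0 + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad L_0 \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\ \alpha \approx 0.34,\ \beta \approx 0.28
```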

Emergent Abilities

  • Emergent abilities appear as models become larger and more complex; they arise from the interaction of the model's components rather than being explicitly programmed
  • Recent research suggests models balance exhaustive logical processing against cognitive shortcuts (heuristics), adapting their strategies to trade off accuracy and effort

Measuring understanding

  • Transcoders, which are more interpretable than transformers, have been used to develop "replacement models"
  • Techniques such as partial dependency plots and SHAP (SHapley Additive exPlanations) can be used to visualize which features matter
  • Small transformers have been trained on modular arithmetic addition to study how they learn
  • There is debate over whether such models simply predict the next token or genuinely plan ahead
  • A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks spanning math, coding, vision, medicine, psychology and more"

Bias

  • A model may produce outputs that assign roles and characteristics based on traditional gender norms (stereotyping)
  • Selection bias is the tendency to favor certain option identifiers regardless of their content, largely stemming from the probability the model assigns to particular answer tokens
  • A model may systematically favor certain political viewpoints and ideologies, for example in election coverage

Problems During LLM Training

  • LLMs may memorize and reproduce portions of their training data
  • Evaluation is a problem with larger models, which are increasingly likely to have included portions of the test set in their training data
  • Entropy-based measures such as bits per word (BPW), bits per character (BPC), and bits per token (BPT) quantify how well a model predicts text
  • BPT is not directly comparable across models, because different models use different tokenizers and so words map to different numbers of tokens
  • A model can also be evaluated by its capacity as a data compressor (see the sketch below for perplexity and BPT)
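
To make the perplexity and bits-per-token measures above concrete, here is a minimal sketch assuming the model exposes per-token probabilities for a held-out text; the probabilities below are made up.

```python
import math

# Hypothetical probabilities a model assigned to each token of a held-out text.
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]

n = len(token_probs)
avg_nll = -sum(math.log(p) for p in token_probs) / n   # average negative log-likelihood (nats)

perplexity = math.exp(avg_nll)            # higher likelihood assigned => lower perplexity
bits_per_token = avg_nll / math.log(2)    # the same quantity expressed in bits (BPT)

print(f"perplexity:     {perplexity:.2f}")
print(f"bits per token: {bits_per_token:.2f}")
# Caveats from above: if parts of the held-out text leaked into training data, these
# numbers overstate generalization, and BPT is not comparable across different tokenizers.
```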
