Questions and Answers
Which of the following best describes the role of parameter count in qualifying an LLM as 'large'?
- It’s inherently vague, and what's considered 'large' evolves over time. (correct)
- A definitive threshold exists, universally agreed upon by experts.
- The parameter count is irrelevant; size is determined by the dataset used.
- A model must have at least 1 trillion parameters to be considered large.
What is the primary function of Retrieval-Augmented Generation (RAG) in the context of LLMs?
- To compress the size of LLMs for easier deployment.
- To enhance LLMs with information retrieved from external document sources. (correct)
- To reduce the computational cost of training LLMs.
- To enable LLMs to interact with dynamic environments.
Which approach involves initially correcting a few naive AI responses to bootstrap a large dataset of correct responses?
- Self-Instruct (correct)
- Mixture of Experts
- Reinforcement learning from human feedback
- Proximal Policy Optimization
In the context of LLMs, what is the purpose of 'attention heads'?
What is the main goal of post-training quantization in the context of LLMs?
Which of the following is a key consideration when evaluating the effectiveness of LLMs using perplexity?
What characterizes 'emergent abilities' in large language models?
According to the information provided, what is the main role of Reinforcement Learning from Human Feedback (RLHF) in LLMs?
What is the primary challenge associated with using LLM-generated content for training new LLMs?
How do models trained for step-by-step solutions improve complex reasoning tasks?
What is a token in the context of dataset preprocessing for LLMs?
What is the purpose of using control characters like [MASK] or [UNK] during tokenization?
Why might a token vocabulary based on English frequencies be suboptimal for other languages?
What is a potential consequence of 'sleeper agents' in large language models?
Complete the analogy: Algorithmic bias is to skewed representations and unfair treatment as Stereotyping is to...
What is one reason transformer-based LLM training is expensive?
According to the content, what constitutes a 'large' language model in 2024, in terms of model size?
Which practice does not contribute to dataset cleaning in the context of training LLMs?
What is a potential problem that greedy tokenization can cause?
Based on the text, what factors mostly affect the performance of an LLM after pretraining?
Some commenters expressed concern over which of the following aspects of large language models?
In the context of language models, what does 'hallucination' refer to?
What does the acronym BPE stand for in relation to LLMs?
Why are some AI researchers concerned about RLHF creating a “smiling facade” in LLMs?
GPT-4o achieved 13% accuracy while o1 reached 83% on what, when both took an International Mathematics Olympiad qualifying exam?
What is the main idea behind the ReAct pattern?
Which of the following best defines 'multimodality' in the context of AI models like LLMs?
Which is not a major reason for algorithmic bias?
How can in-context learning be described?
Which kinds of tasks cannot be solved by an LLM on its own, without external tools?
According to the reading, why must text be converted to numbers?
Regarding emergent abilities, what does recent research suggest?
Which statement describes the energy demands of LLMs?
What can be used to further fine-tune a model based on a dataset of human preferences?
What is the function of tokenization?
Some people are skeptics of LLMs; what do they believe?
The canonical measure of the performance of an LLM is its __________ on a given text corpus.
What are the two other examples given for the prompt 'What is the time now? It is'?
Concerning transformer-based LLM training cost, select the accurate statement.
Flashcards
Large Language Model (LLM)
A type of machine learning model designed for natural language processing tasks like language generation. Such models have many parameters and are trained with self-supervised learning on vast amounts of text.
Self-Supervised Learning
Training a model to predict the next word in a sequence, allowing it to acquire predictive power regarding syntax, semantics, and ontologies.
Generative Pretrained Transformers (GPTs)
The largest and most capable LLMs that use the transformer architecture to generate human-like text.
Fine-tuning
Tokenization
Byte-Pair Encoding (BPE)
Dataset Cleaning
Synthetic Data
Reinforcement Learning from Human Feedback (RLHF)
Instruction Tuning
Mixture of Experts (MoE)
Prompt Engineering
Attention Mechanism
Context Window
Tool Use
Retrieval-Augmented Generation (RAG)
Agency
Post-Training Quantization
Multimodality
Reasoning models
Scaling laws
Emergent abilities
Next Sentence Prediction (NSP)
Misinformation by LLMs
"Sleeper agents" in LLMs
Algorithmic bias
Perplexity
Study Notes
Large Language Models (LLMs) Defined
- LLMs are a type of machine learning model for natural language processing tasks like language generation
- LLMs are language models trained with self-supervised learning on large quantities of text
- They have many parameters
- Generative Pre-trained Transformers (GPTs) are among the largest and most capable LLMs
- Modern models can be fine-tuned for specific tasks or guided by prompt engineering
- LLMs acquire predictive power over syntax, semantics, and ontologies, but they also inherit inaccuracies and biases present in their training data
Historical Context
- In the 1990s, the IBM alignment models pioneered statistical language modeling
- In 2001, a smoothed n-gram model trained on 0.3 billion words achieved state-of-the-art perplexity
- Researchers constructed internet-scale language datasets in the 2000s as internet use became prevalent
- By 2009, statistical language models dominated over symbolic language models because they could ingest large datasets
- Google converted its translation service to Neural Machine Translation in 2016, implemented with seq2seq deep LSTM networks
- In 2017, Google introduced the transformer architecture, which aimed to improve upon the 2014 seq2seq technology
- Transformer architecture is based on attention mechanism
- BERT, an encoder-only model introduced in 2018, became "ubiquitous"
- Research interest in BERT declined by 2023, followed by rapid advancements in decoder-only models
- Decoder-only GPT-1 was introduced in 2018
- GPT-2 caught attention in 2019, but OpenAI initially deemed it too powerful to release publicly, fearing malicious use
- GPT-3 in 2020 went a step further and is available only via API
- ChatGPT, a consumer-facing, browser-based model, captured the public imagination in 2022
- In 2024, OpenAI released the reasoning model o1, which generates long chains of thought before returning a final answer
- GPT-4, released in 2023, was praised for its increased accuracy and hailed as a "holy grail" for its multimodal capabilities
Popular LLMs
- BLOOM and LLaMA are source-available models with restrictions on the field of use
- The Mistral 7B and Mixtral 8x7b models by Mistral AI carry the more permissive Apache License
- DeepSeek released DeepSeek R1 in January 2025
Dataset Preprocessing And Tokenization
- As machine learning algorithms process numbers rather than text, the text must be converted to numbers
- A vocabulary is selected first, then integer indices are assigned to its entries, and an embedding is associated with each integer index
- Byte-pair encoding (BPE) and WordPiece are algorithms used to perform this conversion
- "[MASK]" for masked-out token used in BERT
- "[UNK]" ("unknown") for characters not appearing in the vocabulary are special tokens
- Special symbols such as "Ġ" (a preceding space in GPT-style BPE) and "##" (word continuation in WordPiece) are used to denote special text formatting
- For example, GPT-3 (legacy) uses a BPE tokenizer that splits the text tokenizer: texts -> series of numerical "tokens" into the token sequence token / izer / : / texts / -> / series / of / numerical / " / tok / ens / "
- Tokenization also compresses the datasets
- LLMs generally require input to be a non-jagged array, so shorter texts must be "padded" to match the length of the longest one
- Byte pair encoding merges the pairs of adjacent characters
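A minimal, illustrative sketch of the BPE merge step described above; the toy corpus and helper names (most_frequent_pair, merge_pair) are hypothetical, not taken from the source.

```python
# Minimal byte-pair encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent symbol pair. Toy illustration only; real tokenizers (e.g. GPT's BPE)
# operate on bytes and handle whitespace markers, special tokens, etc.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(5):                      # learn 5 merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    print("merged", pair)
```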
Data and Model Refinement
- Once trained, a tokenizer can tokenize any text, as long as the text does not contain characters absent from its initial set of uni-grams
- A token vocabulary based on frequencies extracted from mainly-English corpora uses as few tokens as possible for an average English word
- Languages such as Portuguese and German carry "a premium of 50%" compared to English
- With an English-optimized tokenizer, an average word in another language is split into a suboptimal number of tokens
- In the context of training LLMs, datasets are typically cleaned by removing low-quality, duplicated, or toxic data
- Datasets are usually cleaned to increase training efficiency
- Microsoft's Phi series of LLMs is trained on textbook-like data generated by another LLM
- Reinforcement learning from human feedback (RLHF) through algorithms, such as proximal policy optimization, is used to further fine-tune a model based on a dataset of human preferences
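As a rough illustration of the human-preference step, below is a hedged sketch of the pairwise (Bradley-Terry) loss commonly used to fit a reward model in RLHF pipelines; the scalar rewards are stand-ins for a real model's outputs, and this is not the exact procedure the notes describe (PPO fine-tuning happens after a reward model is fitted).

```python
# Sketch of the pairwise (Bradley-Terry) loss often used to fit the reward model
# in RLHF pipelines: the reward of the human-preferred response should exceed
# the reward of the rejected one.
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected); small when the chosen response scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# A reward model that ranks the preferred answer higher gets a lower loss.
print(preference_loss(2.0, -1.0))   # ~0.049, correct ordering
print(preference_loss(-1.0, 2.0))   # ~3.049, wrong ordering
```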
Approaches to AI Model Training
- "Self-instruct" approaches to the AI model have been able to bootstrap correct responses
- The largest LLMs may be too expensive to train and use directly
- To address this cost, mixture of experts (MoE) can be applied, an approach pursued by Google researchers since 2017
- "Soft" weights is achieved via prompt engineering
- Prompts and multiple heads can have varying relevancy
Components of an LLM
- The attention mechanism calculates "soft" weights for each token, or more precisely for its embedding (see the sketch after this list)
- The GPT-2 model has twelve attention heads and a context window of only about 1k tokens
- Google's Gemini 1.5 has a context window of up to 1 million tokens
- Anthropic's Claude 2.1, has a context window of up to 200k tokens
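A minimal sketch of how one attention head computes "soft" weights over token embeddings (scaled dot-product attention); the shapes, random weights, and variable names are illustrative assumptions, not values from the source.

```python
# One attention head computing "soft" weights via scaled dot-product attention.
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V -- each softmax row is a set of soft weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (seq, seq) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # soft weights sum to 1 per token
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                               # 4 tokens, 8-dim embeddings
x = rng.normal(size=(seq_len, d_model))               # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, soft_weights = attention(x @ Wq, x @ Wk, x @ Wv)
print(soft_weights.round(2))                          # one row of soft weights per token
```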
Context Limitations and Pre-Training
- Models may be pre-trained either to predict how the segment continues, or what is missing in the segment from its training dataset
- Regularization loss is used to stabilize training during auxiliary tasks
- Infrastructure has substantial bearing on the development of AI models
LLMs in terms of training cost and parameter counts
- GPT-1 of 2018 has only 0.117 billion parameters
- The qualifier "large" in "large language model" is inherently vague
- Training of the GPT-2 (i.e. a 1.5-billion-parameters model) in 2019 cost $50,000
- The PaLM (i.e. a 540-billion-parameters model) in 2022 cost $8 million
- Megatron-Turing NLG 530B (in 2021) cost around $11 million
- For a transformer-based LLM, training costs about 6 FLOPs per parameter per token, whereas inference costs 1 to 2 FLOPs per parameter per token
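A quick back-of-the-envelope illustration of that rule of thumb; the parameter and token counts below are hypothetical, chosen only to show the arithmetic.

```python
# Rule of thumb from the notes: training ~6 FLOPs per parameter per token,
# inference ~1-2 FLOPs per parameter per token. Figures are illustrative.
params = 1.5e9          # e.g. a GPT-2-scale model (1.5 billion parameters)
train_tokens = 40e9     # hypothetical training-set size in tokens
gen_tokens = 1_000      # tokens produced in one inference request

train_flops = 6 * params * train_tokens     # ~3.6e20 FLOPs
infer_flops = 2 * params * gen_tokens       # ~3e12 FLOPs (upper bound)
print(f"training: {train_flops:.2e} FLOPs, inference: {infer_flops:.2e} FLOPs")
```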
Tool Integration and Agents
- Certain tasks cannot, in principle, be solved by an LLM alone and require the use of external tools or additional software
- Retrieval-augmented generation (RAG) enhances LLMs with document retrieval systems (see the sketch after this list):
- A document retriever retrieves the most relevant documents given a query
- Documents and the query are encoded as vectors, and the documents whose vectors are closest to the query vector are selected
- The LLM then generates an output using both the query and the retrieved documents as context
- Integration modules can turn an LLM into an autonomous agent that interacts with dynamic environments, recalls past behaviors, and plans future actions
- The ReAct pattern, a portmanteau of "Reason + Act", constructs an agent out of an LLM, using the LLM as a planner
- "Describe, Explain, Plan and Select" (DEPS) is a method where an LLM produces plans based on image descriptions
Advanced Tools
- The Reflexion method constructs an agent that learns over multiple episodes:
- At the end of each episode, the LLM is prompted to think up lessons learned
- These lessons are then given to the agent in subsequent episodes
- For open-ended exploration, an LLM can be used to score observations for their "interestingness"
- LLM planners can also construct "skills" or "functions" for sequences of actions, which are stored and invoked later
Memory and compression
- LLM-powered agents can keep long-term memories of previous contexts, retrieve them later, and interact socially with other agents
- LLMs are typically trained with single- or half-precision floating point numbers (float32 and float16)
- The largest models typically have 100 billion or more parameters, requiring 200 gigabytes to load in float16, which is outside the range of most consumer electronics
- Post-training quantization cuts the "space requirement" by lowering the precision of a trained model's parameters
- The simplest compression replaces single-precision floating point numbers with lower-precision representations
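A naive sketch of symmetric int8 post-training quantization, assuming one per-tensor scale factor; real schemes (per-channel scales, GPTQ, 4-bit formats) are more sophisticated.

```python
# Naive symmetric post-training quantization: scale float weights into
# [-127, 127], store small integers plus one scale factor, dequantize on use.
import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0                       # single per-tensor scale
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("bytes:", w.nbytes, "->", q.nbytes)              # 4000 -> 1000 bytes (4x smaller)
print("max abs error:", np.abs(w - w_hat).max())       # small reconstruction error
```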
Multimodality and LLMs
- Multimodality means handling several types of input/output, such as video, image, audio, text, and proprioception
- Multimodal AI models are trained to ingest one modality and output another modality
- Examples include AlexNet for image-to-label and speech recognition for speech-to-text
- A common method to create a multimodal model out of an LLM is to "tokenize" the output of a trained encoder
- One can construct an LLM that understands images as follows:
- Take a trained LLM
- Take a trained image encoder
- Add a small multilayer perceptron that maps the encoder's output to the same dimensions as an encoded token
Reasoning Skills
- A new direction in LLM development is "reasoning models", trained to spend more time generating step-by-step solutions before providing a final answer; OpenAI introduced o1 in September 2024 and o3 in December 2024
- Reasoning models show significant improvements in coding, mathematics, and science; on an International Mathematics Olympiad qualifying exam, GPT-4o achieved 13% accuracy while o1 reached 83%
- DeepSeek-R1 is more cost-effective to operate
- Reasoning models require more computational resources but offer superior capabilities in these domains
Hallucinations
- To help mitigate hallucinations, automated reasoning checks and retrieval-augmented generation (RAG) are used
Model Properties Depend On
- The cost of pretraining is the total compute used to train the model
- Scaling laws relate LLM performance to model size, dataset size, and training cost
Scaling laws
- Scaling laws are statistical relationships that predict LLM performance from the cost of training, the number of parameters, and the size of the training dataset
- The number of parameters is one of the statistical variables these laws depend on
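A sketch of a Chinchilla-style scaling law relating loss to parameter count and training tokens; the coefficients are the commonly cited Hoffmann et al. (2022) fit and should be treated as illustrative rather than as values from this text.

```python
# Chinchilla-style scaling law: predicted loss L(N, D) = E + A / N**alpha + B / D**beta,
# where N is the parameter count and D the number of training tokens.
# Coefficients below are the widely cited Hoffmann et al. (2022) fit (illustrative).
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Larger models and more training data both lower the predicted loss.
print(chinchilla_loss(1e9, 20e9))     # ~2.6 : 1B parameters, 20B tokens
print(chinchilla_loss(70e9, 1.4e12))  # ~1.9 : Chinchilla-scale, 70B parameters, 1.4T tokens
```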
Emergent Abilities
- Emergent abilities are skills that appear in larger models but not in smaller ones, arising as the model becomes more complex
- Models balance logical processing and cognitive shortcuts, adapting their strategies to trade off accuracy against effort
Measuring understanding
- Transcoders, which are more interpretable than transformers, have been used to develop "replacement models"
- Techniques such as partial dependency plots and SHAP (SHapley Additive exPlanations) make it possible to visualize which features matter
- Small transformers have been trained on modular arithmetic addition to study how they solve the task
- It is debated whether LLMs simply predict the next token or genuinely plan ahead
- A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks" spanning math, coding, vision, medicine, psychology, and more
Evaluation Metrics
- A model may produce outputs that assign roles and characteristics based on traditional gender norms (stereotyping)
- Selection bias refers to a model favoring certain answer options or tokens regardless of their content
- Political bias occurs when a model systematically favors certain political viewpoints and ideologies, for example in election coverage
Problems During LLM Training
- LLMs may memorize and reproduce portions of their training data
- Evaluating on a test set can be a problem with larger models, since they are increasingly likely to have been trained on parts of that text, making their predictions on it look more accurate
- Entropy, measured in bits per word (BPW) or bits per token (BPT), quantifies how "surprising" a text is to the model
- BPW and BPT differ depending on how words are split into tokens
- Evaluating a model this way measures its capacity as a compressor of its dataset
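A small sketch showing how perplexity is computed from per-token probabilities; the probability values are hypothetical stand-ins for a real model's outputs.

```python
# Perplexity sketch: exp of the average negative log-likelihood per token.
# Lower perplexity means the model finds the held-out text less "surprising".
import math

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]     # per-token negative log-likelihoods
    return math.exp(sum(nll) / len(nll))          # exp(mean NLL)

confident_model = [0.5, 0.4, 0.6, 0.3]      # assigns high probability to the true tokens
uncertain_model = [0.05, 0.02, 0.1, 0.01]
print(perplexity(confident_model))          # ~2.3
print(perplexity(uncertain_model))          # ~31.6
```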