Questions and Answers
In the context of language models, what does computing $P(W)$ signify, where $W$ represents a sequence of words?
- Calculating the probability of observing the given sequence of words in a language. (correct)
- Estimating the likelihood of a specific word appearing in a vocabulary.
- Measuring the semantic similarity between words in the sequence.
- Determining the syntactic correctness of the word sequence.
Which task is LEAST directly aided by language models?
- Spell Correction.
- Machine Translation.
- Speech Recognition.
- Sentiment Analysis. (correct)
What is the primary role of the Chain Rule of Probability in language modeling?
- Smoothing probability distributions to account for unseen events.
- Decomposing the joint probability of a word sequence into a product of conditional probabilities. (correct)
- Simplifying the calculation of conditional probabilities by assuming independence between events.
- Estimating the probability of a word based on its frequency in a corpus.
Why is it insufficient to simply count and divide when estimating the probability of a long sequence of words?
How does applying the Markov Assumption simplify the computation of the probability of a word sequence?
How does a bigram model differ from a unigram model in estimating the probability of a word sequence?
What is the Maximum Likelihood Estimate (MLE) for estimating the probability of a word in a unigram language model?
In a bigram language model, what data is required to apply Maximum Likelihood Estimation?
Why is it beneficial to perform calculations in log space when working with language models?
What is the primary purpose of evaluating a language model?
What is the difference between intrinsic and extrinsic evaluation of language models?
Why might intrinsic evaluation be considered a 'bad approximation' of a language model's performance?
What does perplexity measure in the context of language models?
What does a lower perplexity score generally indicate about a language model's performance?
In language modeling, what is the 'zero probability' problem?
What is the primary motivation behind using smoothing techniques in language models?
How does Add-one (Laplace) smoothing adjust probabilities in a language model?
Which of the following is a potential drawback of Add-one (Laplace) smoothing?
What is the main idea behind interpolation in language modeling?
How are the weights (lambdas) typically determined in linear interpolation?
In the formulas for linear interpolation that mix trigrams, bigrams, and unigrams, what constraint applies to all lambda values?
What problem prompted the evolution from unigram models to bigram and subsequently trigram models?
Which technique would most directly handle the issue of dividing by zero when estimating a likely sequence?
Extrinsic evaluation of N-gram models focuses on what aspect of the model in question?
In the expression $P(w_i | w_{i-1})$, where the context is 'the dog', what word is $w_i$ if we suppose the expression denotes 'the dog barks'?
Assuming frequent calls to estimate probabilities such as $P(w_i | w_{i-1})$ using the chain rule, what optimization would MOST improve performance?
When all the probabilities obtained during the estimation of a sequence are uniformly very low, which is the MOST beneficial strategy?
When are pilot experiments MOST useful during N-gram development?
Under what conditions is Intrinsic Evaluation MOST useful?
Which strategy MOST directly helps with the issue that the test dataset may contain legitimate and commonplace n-grams that did not occur in the training set?
If 'The quick brown fox' occurred 10 times, and 'quick brown fox jumps' never occurred, what would Add-1 Laplace estimation adjust here?
Which is the most direct downside to extrapolation within N-gram modeling?
Which is the most appropriate situation to prefer 'held-out' data over other types of data?
Interpolation improves upon specific shortcomings of Laplace Smoothing. Which is the most direct improvement interpolation introduces?
Compared to a unigram model, what estimation overhead considerations exist for a good bigram model?
Which expression is most synonymous with 'held-out' data?
Generally, why is training time not as strong a consideration as perplexity?
When extrapolating, what potential issue may more intensely impact the model outcome?
Flashcards
Language Modeling Goal
Predicts the probability of a sentence or sequence of words.
Language Model
A model that computes the probability of a sequence of words or the probability of the next word given the previous words.
Markov Assumption
Simplifies probability computation by assuming a word's probability depends only on the preceding k words.
Bigram Model Condition
The probability of a word is conditioned only on the immediately preceding word: P(wi | w1 ... wi-1) ≈ P(wi | wi-1).
Unigram MLE Estimation
P(wi) = count(wi) / total number of tokens in the corpus.
Bigram MLE Estimation
P(wi | wi-1) = count(wi-1, wi) / count(wi-1).
Perplexity in Language Models
The inverse probability of the test set, normalized by the number of words; lower perplexity indicates a better model.
Laplace Smoothing
Adds one to every count so that no event has zero probability.
Linear Interpolation
Combines trigram, bigram, and unigram estimates using weights λ that are non-negative and sum to 1.
Study Notes
- The topic is N-gram language models.
- Other points covered include estimating N-gram probabilities, evaluating language models, generalization and zeros, and smoothing techniques.
The Language Modeling Problem
- The objective is to determine the probability of a word sequence or sentence.
P(W) = P(w1, w2, ..., wn)
- For example,
P(Hôm nay trời đẹp quá) = P(Hôm, nay, trời, đẹp, quá)   (Vietnamese for "The weather is so nice today").
- Determining the likelihood of an upcoming word is a related task.
P(w4 | w1, w2, w3)
- For example,
P(đẹp | Hôm, nay, trời).
- A Language Model is a model that can compute either P(W) or P(wn | w1, w2, ..., wn-1).
Why Language Models?
- Language models are useful for solving many problems.
- In Machine Translation, it can determine which translation has higher probability.
- For example, P(high winds tonite) > P(large winds tonite).
- In Spell Correction, it can be used to select the most probable word.
- For example, P(about fifteen minutes from) > P(about fifteen minuets from)
- Speech Recognition also uses language models.
- For example, P(I saw a van) >> P(eyes awe of an).
How to Compute P(W)
- To compute the joint probability, apply the Chain Rule of Probability.
- P(its, water, is, so, transparent, that).
The Chain Rule
- Conditional probabilities are defined as P(B|A) = P(A, B) / P(A).
- Rewriting this formula gives P(A, B) = P(A) * P(B|A).
The Chain Rule (Continued)
- For more variables, applying chain rule yields:
P(A, B, C, D) = P(A, B, C) * P(D|A, B, C) = P(A, B) * P(C|A, B) * P(D|A, B, C) = P(A) * P(B|A) * P(C|A, B) * P(D|A, B, C)
- Generally, chain rule states
P(x1, x2, x3, ..., xn) = P(x1) * P(x2|x1) * P(x3|x1, x2) ... P(xn|x1, ..., xn-1)
Chain Rule Application
- Chain rule can be used for joint probability calculations
P(w1 w2 ... wn) = ∏ P(wi|w1 w2 ... wi-1)
- Given "its water is so transparent", the calculations will be:
- P(its) * P(water | its) * P(is | its water) * P(so | its water is) * P(transparent | its water is so)
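- As a sketch, the chain-rule decomposition can be computed programmatically; the following minimal Python example assumes a hypothetical cond_prob(word, history) function and made-up toy probabilities, purely for illustration:

```python
# Illustrative sketch: chain-rule decomposition P(w1 ... wn) = ∏ P(wi | w1 ... wi-1).
def sentence_probability(words, cond_prob):
    prob = 1.0
    history = []
    for w in words:
        prob *= cond_prob(w, tuple(history))  # P(w_i | w_1 ... w_{i-1})
        history.append(w)
    return prob

# Toy (made-up) conditional probability table, just to make the sketch runnable.
toy = {
    ("its", ()): 0.002,
    ("water", ("its",)): 0.01,
    ("is", ("its", "water")): 0.3,
    ("so", ("its", "water", "is")): 0.05,
    ("transparent", ("its", "water", "is", "so")): 0.02,
}
p = sentence_probability(["its", "water", "is", "so", "transparent"],
                         lambda w, h: toy[(w, h)])
print(p)
```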
Estimating Probabilities
- Counting and dividing is one possible method, but it is unlikely to succeed because of the vast number of possible sentences.
- Enough data will probably never be available for estimation.
Markov Assumption
- The Markov assumption simplifies probability calculation by assuming a word's probability only depends on a limited history.
P(the | its water is so transparent that) ≈ P(the | that)
- It can be adjusted to longer histories.
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Approximation
- Each component in the product is approximated by a limited history
- The joint probability of the sequence would be:
P(w1 w2 ... wn) ≈ ∏ P(wi | wi-k ... wi-1)
Unigram Model
- In the unigram model, independence between words is assumed
P(w1 w2 ... wn) ≈ ∏ P(wi)
Bigram Model
- The probability of a word depends only on the previous word.
P(wi|w1 w2 ... wi-1) ≈ P(wi | wi-1)
Unigram Model Details
- The unigram model simplifies by not using history.
P(w1 w2 ... wn) = ∏ P(wi)
- Maximum Likelihood Estimate (MLE) is used to find P(wi):
P(wi) = count(wi) / Σcount(w')
Unigram Example
- Given the example corpus: "i live in osaka . </s>", "i am a graduate student . </s>", "my school is in nara . </s>" (20 tokens in total, where </s> marks the end of a sentence)
- P(nara) = 1/20 = 0.05
- P(i) = 2/20 = 0.1
- P(</s>) = 3/20 = 0.15
- P(W = i live in nara . </s>) = 0.1 × 0.05 × 0.1 × 0.05 × 0.15 × 0.15 = 5.625 × 10^-7
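- A minimal Python sketch of the unigram MLE computation above (the corpus and the 5.625 × 10^-7 result come from the notes; the code itself is an illustrative assumption):

```python
from collections import Counter

# Toy corpus from the notes; "</s>" marks the end of each sentence.
corpus = [
    "i live in osaka . </s>",
    "i am a graduate student . </s>",
    "my school is in nara . </s>",
]
tokens = [w for sent in corpus for w in sent.split()]
counts = Counter(tokens)
total = len(tokens)  # 20 tokens

def p_unigram(w):
    # Maximum likelihood estimate: count(w) / total token count.
    return counts[w] / total

prob = 1.0
for w in "i live in nara . </s>".split():
    prob *= p_unigram(w)
print(prob)  # ≈ 5.625e-07
```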
Bigram Model Details
- Bigram model conditions on the previous word.
- Maximum Likelihood Estimate can be applied to estimate the probabilities:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
Bigram Example
- Given the sentence "
I am Sam,Sam I am,I do not like green eggs and ham" - P(I |
) = 2/3 = 0.67, P(Sam |) = 1/3 = 0.33, P(am | I) = 2/3 = 0.67 - P( | Sam) = 1/2 = 0.5, P(Sam | am) = 1/2 = 0.5, P(do | I) = 1/3 = 0.33
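- A minimal Python sketch of the bigram MLE counts above, with <s> and </s> as sentence markers (the code is an illustrative assumption, not part of the original notes):

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
bigram_counts = Counter()
unigram_counts = Counter()
for sent in sentences:
    words = sent.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_bigram(w, prev):
    # MLE: count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("I", "<s>"))     # 2/3 ≈ 0.67
print(p_bigram("Sam", "<s>"))   # 1/3 ≈ 0.33
print(p_bigram("am", "I"))      # 2/3 ≈ 0.67
print(p_bigram("</s>", "Sam"))  # 1/2 = 0.5
```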
Practical Issues
- To avoid underflow, calculations should be done in log space.
- Addition is also faster than multiplication.
- log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
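- A small illustrative sketch of accumulating log probabilities (any log base works; natural log is used here):

```python
import math

# Per-word probabilities from the unigram example above.
probs = [0.1, 0.05, 0.1, 0.05, 0.15, 0.15]

# Summing logs avoids the underflow that repeated multiplication of small numbers can cause.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)            # log P(W)
print(math.exp(log_prob))  # ≈ 5.625e-07, exponentiated back only for display
```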
Language Modeling Toolkits
- Some language modeling toolkits are SRILM and KenLM
- SRILM: http://www.speech.sri.com/projects/srilm/
- KenLM: https://kheafield.com/code/kenlm/
Evaluating Language Models
- A key question is whether the language model prefers good sentences over bad ones
- "Real" sentences should be assigned higher probabilities than "ungrammatical" ones.
Model Evaluation
- The model's parameters are trained on a training set.
- The model performance is assessed on unseen data.
- An evaluation metric indicates how well the model performs on the test set.
- The test set is completely unseen and disjoint from the training set.
Evaluation Approaches
- Extrinsic evaluation compares two language models in downstream tasks like MT, speech recognition, and spelling correction.
- Intrinsic evaluation uses evaluation measures like perplexity on the test set.
Extrinsic Evaluation In Detail
- The best method is to put each model (A and B) into a task such as an MT system, speech recognizer, or spelling corrector.
- The task is run and accuracy is collected for both model A and model B.
- Accuracy is measured by, for example, how many words are translated correctly or how many misspelled words are corrected.
- The final step is to compare the accuracy of models A and B.
Difficulty of Extrinsic Evaluation
- Extrinsic evaluation is time-consuming and can take weeks.
- Intrinsic evaluation, such as perplexity, is sometimes employed as a rough approximation, and it is generally only useful in pilot experiments.
- But it is still helpful to think about.
- This holds only on the condition that the test data looks like the training data.
Intuition of Perplexity
- The Shannon Game shows how well we predict the next word given the text "I always order pizza with cheese and ____", "The 33rd President of the US was ____" and "I saw a ____".
- A better model of a text gives a higher probability to the word that actually occurs.
- Unigrams are terrible at this game
Perplexity Defined
- The best language model is the one that best predicts an unseen test set, i.e. assigns it the highest probability.
- Perplexity is the inverse probability of the test set, normalized by the number of words.
PP(W) = P(w1 w2 ... wN)^(-1/N)
- Minimizing PPL is equivalent to maximizing probability.
- Using chain rule,
PP(W) = [∏ P(wi | w1 ... wi-1)]^(-1/N)
- For the bigram model
PP(W) = [∏ P(wi | wi-1)]^(-1/N)
Perplexity Calculations
- log2 PP(W) = 1/N Σ log2 1/P(wi|wi-1) = - 1/N Σ log2 P(wi | wi-1)
- PP(W) = 2^H, where H is the cross-entropy computed above.
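- A minimal sketch of the perplexity computation from these formulas, assuming a list of per-word conditional probabilities over a test set (the numbers are made up for illustration):

```python
import math

# Hypothetical per-word probabilities P(w_i | w_{i-1}) for a test set of N words.
word_probs = [0.2, 0.1, 0.05, 0.3, 0.25]
N = len(word_probs)

# Cross-entropy H = -(1/N) * Σ log2 P(w_i | w_{i-1}); perplexity PP = 2^H.
H = -sum(math.log2(p) for p in word_probs) / N
perplexity = 2 ** H
print(H, perplexity)
```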
Lower Perplexity Implies Better Model
- A model was trained on 38 million words and tested on 1.5 million words from WSJ
- The unigram's perplexity was 962, bigram was 170 and trigram was 109
Zero Probabilities
- The training set has "denied the allegations", "denied the reports", "denied the claims", and "denied the request".
- The test set has "denied the offer" and "denied the loan"
- Then, "P(“offer” | denied the) = 0."
Potential Problems with Zero Probabilities
- One problem is underestimating the probability of words that can actually occur.
- The entire probability of the test set is 0 so perplexity cannot be calculated.
Laplace Smoothing
- Add one to all counts, pretending each word occurred one more time than it actually did.
- The MLE unigram probabilities:
PML(wi) = c(wi) / N
- The Add-1 estimate:
PLaplace(wi) = (c(wi) + 1) / Σw'(c(w') + 1) = (c(wi) + 1) / (N + V)
Laplace Example
- Considering the example corpus "i live in osaka . </s>", "i am a graduate student . </s>", "my school is in nara . </s>"
- The vocabulary = {i, live, in, osaka, am, graduate, student, my, school, is, nara, </s>} and its size V = 12
- Without smoothing, P(kyoto) = 0/20 = 0; with Laplace smoothing, P(kyoto) = (0+1)/(20+12) = 0.03125
- P(nara) = (1+1)/(20+12) = 0.0625, P(i) = (2+1)/(20+12) = 0.09375, and P(</s>) = (3+1)/(20+12) = 0.125
Add-1 Estimate
- Applying MLE on bigrams,
PML(wi | wi-1) = c(wi-1, wi) / c(wi-1)
- Add-1 estimate:
PLaplace(wi | wi-1) = (c(wi-1, wi) + 1) / Σw'(c(wi-1, w') + 1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
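- A minimal sketch of the Add-1 bigram estimate above (the helper name and the toy counts are assumptions for illustration):

```python
from collections import Counter

def laplace_bigram_prob(w, prev, bigram_counts, unigram_counts, vocab_size):
    # Add-1 estimate: (c(prev, w) + 1) / (c(prev) + V)
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

# Toy counts in the spirit of the Sam corpus example (illustrative only).
unigram_counts = Counter({"<s>": 3, "I": 3, "am": 2, "Sam": 2, "</s>": 3})
bigram_counts = Counter({("<s>", "I"): 2, ("I", "am"): 2, ("am", "Sam"): 1})
V = 12  # assumed vocabulary size for this toy example

print(laplace_bigram_prob("I", "<s>", bigram_counts, unigram_counts, V))  # (2+1)/(3+12) = 0.2
print(laplace_bigram_prob("Sam", "I", bigram_counts, unigram_counts, V))  # (0+1)/(3+12) ≈ 0.067
```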
Linear Interpolation
- It combines trigram, bigram, and unigram estimates
P(wi | wi-2, wi-1) = λ1 * PML(wi | wi-2, wi-1) + λ2 * PML(wi | wi-1) + λ3 * PML(wi)
- Where
λ1 + λ2 + λ3 = 1, and λi >= 0 for all i
- For words whose trigram, bigram, or even unigram counts don't exist, the unigram estimate can be interpolated with a uniform distribution:
P(wi) = λ × PML(wi) + (1 - λ) × 1/N
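- A minimal sketch of linear interpolation over trigram, bigram, and unigram ML estimates (the function name and the λ values are assumptions for illustration):

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """λ1*PML(wi|wi-2,wi-1) + λ2*PML(wi|wi-1) + λ3*PML(wi), with λ1+λ2+λ3 = 1 and λi >= 0."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# Example: the trigram was never seen (ML estimate 0), but the bigram and unigram were.
print(interpolated_prob(0.0, 0.5, 0.05))  # 0.6*0 + 0.3*0.5 + 0.1*0.05 = 0.155
```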
Setting Lambdas
- Employ a held-out corpus: choose the lambdas that maximize the probability of the held-out data.
- First, fix the N-gram probabilities estimated from the training data.
- Then search for the lambdas that give the largest probability to the held-out set:
log P(w1 ... wn | M(λ1 ... λk)) = Σi log PM(λ1 ... λk)(wi | wi-1)
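- A small sketch of choosing λ by grid search on held-out data, here for the bigram-unigram interpolation case (the function names are assumptions; the N-gram probabilities themselves stay fixed from training):

```python
import math

def heldout_log_prob(lam, heldout_bigrams, p_bigram, p_unigram):
    # log-probability of the held-out data under λ*P(w|prev) + (1-λ)*P(w)
    total = 0.0
    for prev, w in heldout_bigrams:
        total += math.log(lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w))
    return total

def choose_lambda(heldout_bigrams, p_bigram, p_unigram):
    # Simple grid search over candidate λ values; pick the one maximizing held-out probability.
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda lam: heldout_log_prob(lam, heldout_bigrams, p_bigram, p_unigram))
```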
Interpolation Example
- The corpus is "i live in osaka , i am a graduate student , my school is in nara ".
- Estimating the maximum likelihood values
- P(osaka | in) = c(in osaka)/c(in) = 1/2 = 0.5
- P(nara | in) = c(in nara)/c(in) = 1/2 = 0.5
- P(school | in) = c(in school)/c(in) = 0/2 = 0
- Use interpolation
- P(school | in) = λ2 * PML(school | in) + (1 - λ2) * P(school)
- P(school) = λ1 * PML(school) + (1 - λ1) * 1/N = λ1 * 1/20 + (1-λ1) * 1/N