N-gram Language Models


Questions and Answers

In the context of language models, what does computing $P(W)$ signify, where $W$ represents a sequence of words?

  • Calculating the probability of observing the given sequence of words in a language. (correct)
  • Estimating the likelihood of a specific word appearing in a vocabulary.
  • Measuring the semantic similarity between words in the sequence.
  • Determining the syntactic correctness of the word sequence.

Which task is LEAST directly aided by language models?

  • Spell Correction.
  • Machine Translation.
  • Speech Recognition.
  • Sentiment Analysis. (correct)

What is the primary role of the Chain Rule of Probability in language modeling?

  • Smoothing probability distributions to account for unseen events.
  • Decomposing the joint probability of a word sequence into a product of conditional probabilities. (correct)
  • Simplifying the calculation of conditional probabilities by assuming independence between events.
  • Estimating the probability of a word based on its frequency in a corpus.

Why is it insufficient to simply count and divide when estimating the probability of a long sequence of words?

  • The number of possible sentences is too vast, leading to data sparsity and unreliable estimates. (correct)

How does applying the Markov Assumption simplify the computation of the probability of a word sequence?

  • By limiting the context to a fixed number of preceding words when estimating the probability of the next word. (correct)

How does a bigram model differ from a unigram model in estimating the probability of a word sequence?

  • A bigram model conditions the probability of a word on the previous word, while a unigram model treats each word independently. (correct)

What is the Maximum Likelihood Estimate (MLE) for estimating the probability of a word in a unigram language model?

  • The word's frequency in the corpus divided by the total number of words in the corpus. (correct)

In a bigram language model, what data is required to apply Maximum Likelihood Estimation?

  • The frequency of each word pair and the frequency of the first word in each pair. (correct)

Why is it beneficial to perform calculations in log space when working with language models?

  • It avoids numerical underflow, and addition is faster than multiplication. (correct)

What is the primary purpose of evaluating a language model?

  • To assess how well the model distinguishes between plausible and implausible sentences. (correct)

What is the difference between intrinsic and extrinsic evaluation of language models?

  • Intrinsic evaluation assesses the model's performance directly using a test set, while extrinsic evaluation measures the model's impact when integrated into a downstream application. (correct)

Why might intrinsic evaluation be considered a 'bad approximation' of a language model's performance?

  • Intrinsic evaluation scores, such as perplexity, might improve (decrease) even when the language model performs worse in downstream tasks. (correct)

What does perplexity measure in the context of language models?

  • The inverse probability of the test set, normalized by the number of words. (correct)

What does a lower perplexity score generally indicate about a language model's performance?

  • The model is better at predicting the test set. (correct)

In language modeling, what is the 'zero probability' problem?

  • The assignment of a probability of zero to unseen n-grams, causing the entire sequence probability to be zero. (correct)

What is the primary motivation behind using smoothing techniques in language models?

  • To address the issue of zero probabilities by assigning non-zero probabilities to unseen n-grams. (correct)

How does Add-one (Laplace) smoothing adjust probabilities in a language model?

  • By adding one to the count of each n-gram and adjusting the normalization factor. (correct)

Which of the following is a potential drawback of Add-one (Laplace) smoothing?

  • It may significantly alter the probability distribution, assigning too much probability mass to unseen events. (correct)

What is the main idea behind interpolation in language modeling?

  • To combine the probabilities from multiple n-gram models of different orders. (correct)

How are the weights (lambdas) typically determined in linear interpolation?

  • They are optimized on a held-out corpus to maximize the probability of the data. (correct)

In the formulas for linear interpolation that mix trigrams, bigrams, and unigrams, what constraint applies to all lambda values?

  • The sum of the lambdas must be equal to 1, and each lambda must be greater than or equal to zero. (correct)

What problem prompted the evolution from unigram models to bigram and subsequently trigram models?

  • The inability of unigram models to accurately predict the next word in a sequence due to lack of context. (correct)

Which technique most directly handles zero counts (and the resulting division by zero) when estimating the probability of a likely sequence?

  • Laplace Smoothing (correct)

Extrinsic evaluation of N-gram models focuses on what aspect of the model?

  • Measuring the model's performance indirectly through its performance in a downstream task. (correct)

In the expression $P(w_i | w_{i-1})$, where the context is 'the dog', what word is $w_i$ if we suppose the expression denotes 'the dog barks'?

  • barks (correct)

Assuming frequent calls to estimate probabilities such as $P(w_i | w_{i-1})$ using the chain rule, what optimization would MOST improve performance?

  • Using log space transformations. (correct)

When all the probabilities obtained during the estimation of a sequence are uniformly very low, which is the MOST beneficial strategy?

  • Using log space transformations. (correct)

When are pilot experiments MOST useful during N-gram development?

  • Prior to extrinsic evaluation. (correct)

Under what conditions is Intrinsic Evaluation MOST useful?

  • When the testing dataset mirrors the training data closely. (correct)

Which strategy MOST directly helps when the test dataset contains legitimate and commonplace n-grams that did not occur in the training set?

  • Using a larger dataset. (correct)

If 'The quick brown fox' occurred 10 times, and 'quick brown fox jumps' never occurred, what would Add-1 Laplace estimation adjust here?

  • Add 1 to the count of 'quick brown fox jumps' and renormalize by adding the vocabulary size to the denominator. (correct)

Which is the most direct downside to extrapolation within N-gram modeling?

  • The new estimated values may be extremely high compared to other reasonable values, violating the prior shape of the data. (correct)

Which is the most appropriate situation to prefer 'held out' data over other types of data?

  • When you aim to determine optimal lambdas for interpolation. (correct)

Interpolation improves upon specific shortcomings of Laplace Smoothing. Which is the most direct improvement interpolation introduces?

  • Estimates can reflect the contributions of distinct N-gram orders, rather than adding probability mass uniformly. (correct)

Compared to a unigram model, what estimation overhead considerations exist regarding a good bigram model?

  • Bigram estimation requires significantly greater overhead. (correct)

Which expression is most synonymous with 'held out' data?

  • Validation data (correct)

Generally, why is training time not as strong a consideration as perplexity?

  • Training time is a one-time, design-time cost, whereas the deployed system has a long lifespan and requires the high-quality results that come from a low-perplexity model. (correct)

When extrapolating, what potential issue may more intensely impact the model outcome?

  • The model may estimate large but completely unrealistic probability values. (correct)

Flashcards

Language Modeling Goal

Predict the probability of a sentence or sequence of words.

Language Model

A model that computes the probability of a sequence of words or the probability of the next word given the previous words.

Markov Assumption

Simplifies probability computation by assuming a word's probability depends only on the preceding k words.

Bigram Model Condition

Each word in a sequence relies only on the previous single word.

Unigram MLE Estimation

Estimates word probability by dividing the count of the word by the total number of words.

Bigram MLE Estimation

Estimates the probability of a word given the previous word by dividing the count of the word pair by the count of the previous word.

Perplexity in Language Models

A method for evaluating language models by calculating the inverse probability of a test set, normalized by the number of words.

Laplace Smoothing

Adding one to all the counts to avoid zero probabilities.

Linear Interpolation

Linear interpolation mixes probabilities from N-gram models of different orders.

Study Notes

  • The topic is N-gram language models.
  • Other points covered include estimating N-gram probabilities, evaluating language models, generalization and zeros, and smoothing techniques.

The Language Modeling Problem

  • The objective is to determine the probability of a word sequence or sentence.
  • P(W) = P(w1, w2, ..., wn)
  • For example, P(Hôm nay trời đẹp quá) = P(Hôm, nay, trời, đẹp, quá).
  • Determining the likelihood of an upcoming word is a related task.
  • P(w4 | w1, w2, w3)
  • For example, P(đẹp|Hôm, nay, trời).
  • A Language Model is a model that can compute either P(W) or P(wn | w1, w2, ..., wn-1).

Why Language Models?

  • Language models are useful for solving many problems.
  • In Machine Translation, it can determine which translation has higher probability.
  • For example, P(high winds tonite) > P(large winds tonite).
  • In Spell Correction, it can be used to select the most probable word.
  • For example, P(about fifteen minutes from) > P(about fifteen minuets from)
  • Speech Recognition also uses language models.
  • For example, P(I saw a van) >> P(eyes awe of an).

How to Compute P(W)

  • To compute the joint probability P(its, water, is, so, transparent, that), apply the Chain Rule of Probability.

The Chain Rule

  • Conditional probabilities are defined as P(B|A) = P(A, B) / P(A).
  • Rewriting this formula gives P(A, B) = P(A) * P(B|A).

The Chain Rule (Continued)

  • For more variables, applying chain rule yields:
  • P(A, B, C, D) = P(A, B, C) * P(D|A, B, C) = P(A, B) * P(C|A, B) * P(D|A, B, C) = P(A) * P(B|A) * P(C|A, B) * P(D|A, B, C)
  • Generally, chain rule states
  • P(x1, x2, x3, ..., xn) = P(x1) * P(x2|x1) * P(x3|x1, x2) ... P(xn|x1, ..., xn-1)

Chain Rule Application

  • Chain rule can be used for joint probability calculations
  • P(w1 w2 ... wn) = ∏ P(wi|w1 w2 ... wi-1)
  • Given "its water is so transparent", the calculations will be:
  • P(its) * P(water | its) * P(is | its water) * P(so | its water is) * P(transparent | its water is so)
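
To make the expansion concrete, here is a minimal Python sketch that multiplies the conditional factors together. The probability values are hypothetical placeholders, not estimates from any corpus:

```python
# Chain-rule expansion of P(its water is so transparent) as a running
# product of conditional probabilities. All values below are made-up
# placeholders chosen only to show the mechanics of the decomposition.
factors = [
    ("P(its)", 0.02),
    ("P(water | its)", 0.10),
    ("P(is | its water)", 0.30),
    ("P(so | its water is)", 0.05),
    ("P(transparent | its water is so)", 0.15),
]

prob = 1.0
for name, p in factors:
    prob *= p  # each factor conditions on the full preceding history

print(f"P(its water is so transparent) = {prob:.2e}")  # 4.50e-06
```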

Estimating Probabilities

  • One method is to count occurrences and divide, but it is unlikely to succeed because of the vast number of possible sentences.
  • Enough data will probably never be available for such estimates.

Markov Assumption

  • The Markov assumption simplifies probability calculation by assuming a word's probability only depends on a limited history.
  • P(the | its water is so transparent that) ≈ P(the | that)
  • It can be adjusted to longer histories.
  • P(the | its water is so transparent that) ≈ P(the | transparent that)

Markov Approximation

  • Each component in the product is approximated by a limited history
  • The joint probability of the sequence would be:
  • P(w1 w2 ... wn) ≈ ∏ P(wi | wi-k ... wi-1)

Unigram Model

  • In the unigram model, independence between words is assumed
  • P(w1 w2 ... wn) ≈ ∏ P(wi)

Bigram Model

  • The probability of a word depends only on the previous word.
  • P(wi|w1 w2 ... wi-1) ≈ P(wi | wi-1)

Unigram Model Details

  • The unigram model simplifies by not using history.
  • P(w1 w2 ... wn) = ∏ P(wi)
  • Maximum Likelihood Estimate (MLE) is used to find P(wi):
  • P(wi) = count(wi) / Σcount(w')

Unigram Example

  • Given the example corpus: "i live in osaka . </s>", "i am a graduate student . </s>", "my school is in nara . </s>"
  • P(nara) = 1/20 = 0.05
  • P(i) = 2/20 = 0.1
  • P(</s>) = 3/20 = 0.15
  • P(W = i live in nara . </s>) = 0.1 * 0.05 * 0.1 * 0.05 * 0.15 * 0.15 = 5.625 * 10^-7
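
A minimal sketch of unigram MLE that reproduces these numbers; it assumes the corpus is tokenized into words with an explicit </s> end-of-sentence token:

```python
from collections import Counter

corpus = ("i live in osaka . </s> "
          "i am a graduate student . </s> "
          "my school is in nara . </s>").split()

counts = Counter(corpus)
N = len(corpus)  # 20 tokens in total

def p_unigram(w):
    """MLE: P(w) = count(w) / total number of tokens."""
    return counts[w] / N

print(p_unigram("nara"))   # 1/20 = 0.05
print(p_unigram("i"))      # 2/20 = 0.1
print(p_unigram("</s>"))   # 3/20 = 0.15

# The probability of a whole sentence is the product of word probabilities:
p = 1.0
for w in "i live in nara . </s>".split():
    p *= p_unigram(w)
print(p)  # ≈ 5.625e-07
```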

Bigram Model Details

  • Bigram model conditions on the previous word.
  • Maximum Likelihood Estimate can be applied to estimate the probabilities:
  • P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

Bigram Example

  • Given the corpus: "<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>"
  • P(I | <s>) = 2/3 = 0.67, P(Sam | <s>) = 1/3 = 0.33, P(am | I) = 2/3 = 0.67
  • P(</s> | Sam) = 1/2 = 0.5, P(Sam | am) = 1/2 = 0.5, P(do | I) = 1/3 = 0.33
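
A sketch of bigram MLE on the same toy corpus, assuming explicit <s>/</s> sentence boundary markers:

```python
from collections import Counter

sentences = ["<s> I am Sam </s>",
             "<s> Sam I am </s>",
             "<s> I do not like green eggs and ham </s>"]

bigrams, unigrams = Counter(), Counter()
for s in sentences:
    tokens = s.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs

def p_bigram(w, prev):
    """MLE: P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("I", "<s>"))     # 2/3 ≈ 0.67
print(p_bigram("Sam", "<s>"))   # 1/3 ≈ 0.33
print(p_bigram("am", "I"))      # 2/3 ≈ 0.67
print(p_bigram("</s>", "Sam"))  # 1/2 = 0.5
```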

Practical Issues

  • To avoid underflow, calculations should be performed in log space.
  • Addition is also faster than multiplication.
  • log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
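
A short sketch of why log space matters, using only the standard library: a long product of small probabilities underflows to 0.0, while the sum of logs stays representable:

```python
import math

probs = [1e-5] * 100  # 100 factors; the true product is 1e-500

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- underflows, since doubles bottom out near 1e-308

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # ≈ -1151.29, i.e. log(1e-500), no underflow
```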

Language Modeling Toolkits

Evaluating Language Models

  • A key question is whether the language model prefers good sentences over bad ones.
  • "Real" sentences should be assigned higher probabilities than "ungrammatical" ones.

Model Evaluation

  • The model's parameters are trained on a training set.
  • The model performance is assessed on unseen data.
  • An evaluation metric indicates how well the model performs on the test set.
  • The test set is completely unseen and disjoint from the training set.

Evaluation Approaches

  • Extrinsic evaluation compares two language models in downstream tasks like MT, speech recognition, and spelling correction.
  • Intrinsic evaluation uses evaluation measures like perplexity on the test set.

Extrinsic Evaluation In Detail

  • The best method is to put each model (A and B) into a task such as an MT system, speech recognizer, or spelling corrector.
  • The task is run, and accuracy is collected for both model A and model B.
  • Accuracy is measured by, for example, how many words are translated correctly or how many misspelled words are corrected.
  • The final stage is comparing the accuracy of models A and B.

Difficulty of Extrinsic Evaluation

  • Extrinsic evaluation is time-consuming and can take weeks.
  • Intrinsic evaluation, such as perplexity, is sometimes employed as an approximation and is generally only useful in pilot experiments.
  • It is still helpful to think about.
  • It is a good approximation only on the condition that the test data looks like the training data.

Intuition of Perplexity

  • The Shannon Game asks how well we can predict the next word, given texts like "I always order pizza with cheese and ____", "The 33rd President of the US was ____", and "I saw a ____".
  • A better model of a text gives a higher probability to the word that actually occurs.
  • Unigrams are terrible at this game.

Perplexity Defined

  • The best language model is the one that best predicts an unseen test set, i.e., gives the highest probability P(sentence) to the sentences that actually occur.
  • It's the inverse probability of the test set normalized by number of words.
  • PP(W) = P(w1 w2 ... wN)^(-1/N)
  • Minimizing PPL is equivalent to maximizing probability.
  • Using chain rule, PP(W) = [∏ P(wi | w1 ... wi-1)]^(-1/N)
  • For the bigram model PP(W) = [∏ P(wi | wi-1)]^(-1/N)

Perplexity Calculations

  • log2 PP(W) = 1/N Σ log2 1/P(wi|wi-1) = - 1/N Σ log2 P(wi | wi-1)
  • PP(W) = 2^H, where H is the (cross-)entropy.
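
A minimal perplexity function following these formulas; p_bigram stands for any conditional probability function (assumed non-zero everywhere, e.g. a smoothed model):

```python
import math

def perplexity(tokens, p_bigram):
    """PP(W) = 2^H, where H = -(1/N) * sum(log2 P(w_i | w_{i-1}))."""
    N = len(tokens) - 1  # number of predicted words
    log_sum = sum(math.log2(p_bigram(w, prev))
                  for prev, w in zip(tokens, tokens[1:]))
    return 2 ** (-log_sum / N)

# Sanity check: a uniform model over a 100-word vocabulary has
# perplexity 100 -- its "effective branching factor".
uniform = lambda w, prev: 1 / 100
print(perplexity("<s> a b c d </s>".split(), uniform))  # ≈ 100.0
```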

Lower Perplexity Implies Better Model

  • A model was trained on 38 million words and tested on 1.5 million words from WSJ
  • The unigram's perplexity was 962, the bigram's was 170, and the trigram's was 109.

Zero Probabilities

  • The training set has "denied the allegations", "denied the reports", "denied the claims", and "denied the request".
  • The test set has "denied the offer" and "denied the loan"
  • Then, "P(“offer” | denied the) = 0."

Potential Problems with Zero Probabilities

  • Underestimating the probability of words that can actually occur is a problem.
  • The entire probability of the test set is 0 so perplexity cannot be calculated.

Laplace Smoothing

  • Add one to all counts, pretending each word occurred one more time than it actually did.
  • The MLE unigram probabilities: PML(wi) = c(wi) / N
  • The Add-1 estimate: PLaplace = (c(wi) + 1) / (Σ(c(w) + 1)) = (c(wi) + 1) / (N + V)

Laplace Example

  • Considering the example corpus "i live in osaka . </s>", "i am a graduate student . </s>", "my school is in nara . </s>"
  • The vocabulary = {i, live, in, osaka, am, graduate, student, my, school, is, nara, </s>}, with count V = 12.
  • Without smoothing, P(kyoto) = 0/20 = 0; with Laplace smoothing, P(kyoto) = (0+1)/(20+12) = 0.03125.
  • P(nara) = (1+1)/(20+12) = 0.0625, P(i) = (2+1)/(20+12) = 0.09375, and P(</s>) = (3+1)/(20+12) = 0.125

Add-1 Estimate

  • Applying MLE on bigrams, PML(wi | wi-1) = c(wi-1, wi) / c(wi-1)
  • Add-1 estimate: PLaplace(wi | wi-1) = (c(wi-1, wi) + 1) / (Σw (c(wi-1, w) + 1)) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
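
A sketch of the Add-1 bigram estimate applied to the earlier "denied the offer" example; the counts and the vocabulary size V are made-up for illustration:

```python
from collections import Counter

# Toy training counts: "denied the allegations/reports/claims/request".
bigrams = Counter({("the", "allegations"): 1, ("the", "reports"): 1,
                   ("the", "claims"): 1, ("the", "request"): 1})
unigrams = Counter({"the": 4})
V = 10  # hypothetical vocabulary size

def p_laplace(w, prev):
    """Add-1: (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# The unseen bigram "the offer" now gets a small non-zero probability:
print(p_laplace("offer", "the"))        # (0 + 1) / (4 + 10) ≈ 0.071
print(p_laplace("allegations", "the"))  # (1 + 1) / (4 + 10) ≈ 0.143
```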

Linear Interpolation

  • It combines trigram, bigram, and unigram estimates (a code sketch follows this list).
  • P(wi | wi-2, wi-1) = λ1 * PML(wi | wi-2, wi-1) + λ2 * PML(wi | wi-1) + λ3 * PML(wi)
  • Where λ1 + λ2 + λ3 = 1, and λi >= 0 for all i.
  • When even the component n-grams may not exist (e.g., for unknown words), the unigram itself can be interpolated with a uniform distribution:
  • P(wi) = λ × PML(wi) + (1 - λ) × 1/N
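
A sketch of the interpolated estimate; the component models are passed in as functions, and the lambdas are arbitrary placeholders that satisfy the sum-to-one constraint:

```python
def p_interp(w, w2, w1, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P(w | w2 w1) = l1*P_tri(w | w2 w1) + l2*P_bi(w | w1) + l3*P_uni(w)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0
    return l1 * p_tri(w, w2, w1) + l2 * p_bi(w, w1) + l3 * p_uni(w)

# Usage with toy constant models (placeholders, not real estimates):
p = p_interp("offer", "denied", "the",
             p_tri=lambda w, a, b: 0.0,   # trigram never observed
             p_bi=lambda w, a: 0.05,
             p_uni=lambda w: 0.001)
print(p)  # 0.6*0 + 0.3*0.05 + 0.1*0.001 = 0.0151
```

Even though the trigram assigns zero probability, the bigram and unigram terms keep the interpolated estimate non-zero.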

Setting Lambdas

  • Employ a held-out corpus during training by choosing lambdas to maximize held-out data probability.
  • First, fix the N-gram probabilities estimated on the training data.
  • Next, search for the lambdas that give the largest probability to the held-out set (a sketch follows this list):
  • log P(w1 ... wn | M(λ1, ..., λk)) = Σi log P_M(λ1,...,λk)(wi | wi-1)
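
A sketch of the lambda search for the simple unigram-plus-uniform case from the previous section: the MLE probabilities are fixed on a toy training set, then a grid search picks the lambda that maximizes held-out log-likelihood (the corpus and vocabulary size here are made-up):

```python
import math
from collections import Counter

def best_lambda(held_out, p_ml, vocab_size, steps=99):
    """Grid-search the lambda maximizing the held-out log-likelihood of
    P(w) = lam * P_ML(w) + (1 - lam) / vocab_size."""
    best_lam, best_ll = None, float("-inf")
    for i in range(1, steps + 1):
        lam = i / (steps + 1)
        ll = sum(math.log(lam * p_ml(w) + (1 - lam) / vocab_size)
                 for w in held_out)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam

# First fix P_ML on the training data, then tune lambda on held-out data.
train = "a a a b".split()
counts, N = Counter(train), len(train)
p_ml = lambda w: counts[w] / N
print(best_lambda("a b c".split(), p_ml, vocab_size=3))
```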

Interpolation Example

  • The corpus is "i live in osaka . </s>", "i am a graduate student . </s>", "my school is in nara . </s>".
  • Estimating the maximum likelihood values
  • P(osaka | in) = c(in osaka)/c(in) = 1/2 = 0.5
  • P(nara | in) = c(in nara)/c(in) = 1/2 = 0.5
  • P(school | in) = c(in school)/c(in) = 0/2 = 0
  • Use interpolation
  • P(school | in) = λ2 * PML(school | in) + (1 - λ2) * P(school)
  • P(school) = λ1 * PML(school) + (1 - λ1) * 1/N = λ1 * 1/20 + (1-λ1) * 1/N
