Questions and Answers
In the context of language models, what does computing $P(W)$ signify, where $W$ represents a sequence of words?
- Calculating the probability of observing the given sequence of words in a language. (correct)
- Estimating the likelihood of a specific word appearing in a vocabulary.
- Measuring the semantic similarity between words in the sequence.
- Determining the syntactic correctness of the word sequence.
Which task is LEAST directly aided by language models?
- Spell Correction.
- Machine Translation.
- Speech Recognition.
- Sentiment Analysis. (correct)
What is the primary role of the Chain Rule of Probability in language modeling?
- Smoothing probability distributions to account for unseen events.
- Decomposing the joint probability of a word sequence into a product of conditional probabilities. (correct)
- Simplifying the calculation of conditional probabilities by assuming independence between events.
- Estimating the probability of a word based on its frequency in a corpus.
Why is it insufficient to simply count and divide when estimating the probability of a long sequence of words?
How does applying the Markov Assumption simplify the computation of the probability of a word sequence?
How does a bigram model differ from a unigram model in estimating the probability of a word sequence?
What is the Maximum Likelihood Estimate (MLE) for estimating the probability of a word in a unigram language model?
In a bigram language model, what data is required to apply Maximum Likelihood Estimation?
Why is it beneficial to perform calculations in log space when working with language models?
What is the primary purpose of evaluating a language model?
What is the difference between intrinsic and extrinsic evaluation of language models?
Why might intrinsic evaluation be considered a 'bad approximation' of a language model's performance?
What does perplexity measure in the context of language models?
What does a lower perplexity score generally indicate about a language model's performance?
In language modeling, what is the 'zero probability' problem?
What is the primary motivation behind using smoothing techniques in language models?
How does Add-one (Laplace) smoothing adjust probabilities in a language model?
Which of the following is a potential drawback of Add-one (Laplace) smoothing?
What is the main idea behind interpolation in language modeling?
How are the weights (lambdas) typically determined in linear interpolation?
In the formulas for linear interpolation that mix trigrams, bigrams, and unigrams, what constraint applies to all lambda values?
What problem prompted the evolution from unigram models to bigram and subsequently trigram models?
Which technique would most directly handle the issue of dividing by zero when estimating a likely sequence?
Extrinsic evaluation of N-gram models focuses on what aspect of the model in question?
In the expression $P(w_i | w_{i-1})$, where the context is 'the dog', what word is $w_i$ if we suppose the expression denotes 'the dog barks'?
Assuming frequent calls to estimate probabilities such as $P(w_i | w_{i-1})$ using the chain rule, what optimization would MOST improve performance?
When all the probabilities obtained during the estimation of a sequence are uniformly very low, which is the MOST beneficial strategy?
When are pilot experiments MOST useful during N-gram development?
Under what conditions is Intrinsic Evaluation MOST useful?
Which strategy MOST directly helps with the issue that the test dataset may contain legitimate and commonplace n-grams that did not occur in the training set?
If 'The quick brown fox' occurred 10 times, and 'quick brown fox jumps' never occurred, what would Add-1 Laplace estimation adjust here?
Which is the most direct downside to extrapolation within N-gram modeling?
Which is the most appropriate situation to prefer 'held-out' data over other types of data?
Interpolation improves upon specific shortcomings of Laplace Smoothing. Which is the most direct improvement interpolation introduces?
Compared to a unigram model, what estimation overhead considerations exist for a good bigram model?
Which expression is most synonymous with 'held-out' data?
Generally, why is training time not as strong a consideration as perplexity?
When extrapolating, what potential issue may more intensely impact the model outcome?
Flashcards
Language Modeling Goal
Predicts the probability of a sentence or sequence of words.
Language Model
A model that computes the probability of a sequence of words or the probability of the next word given the previous words.
Markov Assumption
Simplifies probability computation by assuming a word's probability depends only on the preceding k words.
Bigram Model Condition
The probability of a word is conditioned only on the immediately preceding word: P(wi | w1 ... wi-1) ≈ P(wi | wi-1).
Unigram MLE Estimation
P(wi) = count(wi) / total number of tokens in the corpus.
Bigram MLE Estimation
P(wi | wi-1) = count(wi-1, wi) / count(wi-1).
Perplexity in Language Models
The inverse probability of the test set, normalized by the number of words; lower perplexity indicates a better model.
Laplace Smoothing
Adds one to every count so that no event has zero probability.
Linear Interpolation
Combines trigram, bigram, and unigram estimates using weights λ that are non-negative and sum to 1.
Study Notes
- The topic is N-gram language models.
- Other points covered include estimating N-gram probabilities, evaluating language models, generalization and zeros, and smoothing techniques.
The Language Modeling Problem
- The objective is to determine the probability of a word sequence or sentence.
P(W) = P(w1, w2, ..., wn)
- For example,
P(Hôm nay trời đẹp quá) = P(Hôm, nay, trời, đẹp, quá)   (Vietnamese for "The weather is so nice today").
- Determining the likelihood of an upcoming word is a related task.
P(w4 | w1, w2, w3)
- For example,
P(đẹp | Hôm, nay, trời).
- A Language Model is a model that can compute either P(W) or P(wn | w1, w2, ..., wn-1).
Why Language Models?
- Language models are useful for solving many problems.
- In Machine Translation, it can determine which translation has higher probability.
- For example, P(high winds tonite) > P(large winds tonite).
- In Spell Correction, it can be used to select the most probable word.
- For example, P(about fifteen minutes from) > P(about fifteen minuets from)
- Speech Recognition also uses language models.
- For example, P(I saw a van) >> P(eyes awe of an).
How to Compute P(W)
- To compute the joint probability, apply the Chain Rule of Probability.
- P(its, water, is, so, transparent, that).
The Chain Rule
- Conditional probabilities are defined as P(B|A) = P(A, B) / P(A).
- Rewriting this formula gives P(A, B) = P(A) * P(B|A).
The Chain Rule (Continued)
- For more variables, applying chain rule yields:
P(A, B, C, D) = P(A, B, C) * P(D|A, B, C) = P(A, B) * P(C|A, B) * P(D|A, B, C) = P(A) * P(B|A) * P(C|A, B) * P(D|A, B, C)
- Generally, chain rule states
P(x1, x2, x3, ..., xn) = P(x1) * P(x2|x1) * P(x3|x1, x2) ... P(xn|x1, ..., xn-1)
Chain Rule Application
- Chain rule can be used for joint probability calculations
P(w1 w2 ... wn) = ∏ P(wi|w1 w2 ... wi-1)
- Given "its water is so transparent", the calculations will be:
- P(its) * P(water | its) * P(is | its water) * P(so | its water is) * P(transparent | its water is so)
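- As a sketch, the chain-rule decomposition can be computed programmatically; the following minimal Python example assumes a hypothetical cond_prob(word, history) function and made-up toy probabilities, purely for illustration:

```python
# Illustrative sketch: chain-rule decomposition P(w1 ... wn) = ∏ P(wi | w1 ... wi-1).
def sentence_probability(words, cond_prob):
    prob = 1.0
    history = []
    for w in words:
        prob *= cond_prob(w, tuple(history))  # P(w_i | w_1 ... w_{i-1})
        history.append(w)
    return prob

# Toy (made-up) conditional probability table, just to make the sketch runnable.
toy = {
    ("its", ()): 0.002,
    ("water", ("its",)): 0.01,
    ("is", ("its", "water")): 0.3,
    ("so", ("its", "water", "is")): 0.05,
    ("transparent", ("its", "water", "is", "so")): 0.02,
}
p = sentence_probability(["its", "water", "is", "so", "transparent"],
                         lambda w, h: toy[(w, h)])
print(p)
```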
Estimating Probabilities
- Counting and dividing is one possible method, but it is unlikely to succeed because of the vast number of possible sentences.
- Enough data will probably never be available for estimation.
Markov Assumption
- The Markov assumption simplifies probability calculation by assuming a word's probability only depends on a limited history.
P(the | its water is so transparent that) ≈ P(the | that)
- It can be adjusted to longer histories.
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Approximation
- Each component in the product is approximated by a limited history
- The joint probability of the sequence would be:
P(w1 w2 ... wn) ≈ ∏ P(wi | wi-k ... wi-1)
Unigram Model
- In the unigram model, independence between words is assumed
P(w1 w2 ... wn) ≈ ∏ P(wi)
Bigram Model
- The probability of a word depends only on the previous word.
P(wi|w1 w2 ... wi-1) ≈ P(wi | wi-1)
Unigram Model Details
- The unigram model simplifies by not using history.
P(w1 w2 ... wn) = ∏ P(wi)
- Maximum Likelihood Estimate (MLE) is used to find P(wi):
P(wi) = count(wi) / Σcount(w')
Unigram Example
- Given the example corpus: "i live in osaka . </s>", "i am a graduate student . </s>", "my school is in nara . </s>" (20 tokens in total, where </s> marks the end of a sentence)
- P(nara) = 1/20 = 0.05
- P(i) = 2/20 = 0.1
- P(</s>) = 3/20 = 0.15
- P(W = i live in nara . </s>) = 0.1 × 0.05 × 0.1 × 0.05 × 0.15 × 0.15 = 5.625 × 10^-7
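- A minimal Python sketch of the unigram MLE computation above (the corpus and the 5.625 × 10^-7 result come from the notes; the code itself is an illustrative assumption):

```python
from collections import Counter

# Toy corpus from the notes; "</s>" marks the end of each sentence.
corpus = [
    "i live in osaka . </s>",
    "i am a graduate student . </s>",
    "my school is in nara . </s>",
]
tokens = [w for sent in corpus for w in sent.split()]
counts = Counter(tokens)
total = len(tokens)  # 20 tokens

def p_unigram(w):
    # Maximum likelihood estimate: count(w) / total token count.
    return counts[w] / total

prob = 1.0
for w in "i live in nara . </s>".split():
    prob *= p_unigram(w)
print(prob)  # ≈ 5.625e-07
```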
Bigram Model Details
- Bigram model conditions on the previous word.
- Maximum Likelihood Estimate can be applied to estimate the probabilities:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
Bigram Example
- Given the sentence "
I am Sam,Sam I am,I do not like green eggs and ham" - P(I |
) = 2/3 = 0.67, P(Sam |) = 1/3 = 0.33, P(am | I) = 2/3 = 0.67 - P( | Sam) = 1/2 = 0.5, P(Sam | am) = 1/2 = 0.5, P(do | I) = 1/3 = 0.33
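- A minimal Python sketch of the bigram MLE counts above, with <s> and </s> as sentence markers (the code is an illustrative assumption, not part of the original notes):

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
bigram_counts = Counter()
unigram_counts = Counter()
for sent in sentences:
    words = sent.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_bigram(w, prev):
    # MLE: count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("I", "<s>"))     # 2/3 ≈ 0.67
print(p_bigram("Sam", "<s>"))   # 1/3 ≈ 0.33
print(p_bigram("am", "I"))      # 2/3 ≈ 0.67
print(p_bigram("</s>", "Sam"))  # 1/2 = 0.5
```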
Practical Issues
- To avoid underflow, calculations should be done in log space.
- Addition is also faster than multiplication.
- log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
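- A small illustrative sketch of accumulating log probabilities (any log base works; natural log is used here):

```python
import math

# Per-word probabilities from the unigram example above.
probs = [0.1, 0.05, 0.1, 0.05, 0.15, 0.15]

# Summing logs avoids the underflow that repeated multiplication of small numbers can cause.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)            # log P(W)
print(math.exp(log_prob))  # ≈ 5.625e-07, exponentiated back only for display
```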
Language Modeling Toolkits
- Some language modeling toolkits are SRILM and KenLM
- SRILM: http://www.speech.sri.com/projects/srilm/
- KenLM: https://kheafield.com/code/kenlm/
Evaluating Language Models
- A key question is whether the language model prefers good sentences over bad ones
- "Real" sentences should be assigned higher probabilities than "ungrammatical" ones.
Model Evaluation
- The model's parameters are trained on a training set.
- The model performance is assessed on unseen data.
- An evaluation metric indicates how well the model performs on the test set.
- The test set is completely unseen and disjoint from the training set.
Evaluation Approaches
- Extrinsic evaluation compares two language models in downstream tasks like MT, speech recognition, and spelling correction.
- Intrinsic evaluation uses evaluation measures like perplexity on the test set.
Extrinsic Evaluation In Detail
- The best method is to put each model (A and B) into a task such as an MT system, speech recognizer, or spelling corrector.
- The task is run and accuracy is collected for both model A and model B.
- Accuracy is measured by, for example, how many words are translated correctly or how many misspelled words are corrected.
- The final step is to compare the accuracy of models A and B.
Difficulty of Extrinsic Evaluation
- Extrinsic evaluation is time-consuming and can take weeks.
- Intrinsic evaluation, such as perplexity, is sometimes employed as a rough approximation, and it is generally only useful in pilot experiments.
- But it is still helpful to think about.
- This holds only on the condition that the test data looks like the training data.
Intuition of Perplexity
- The Shannon Game shows how well we predict the next word given the text "I always order pizza with cheese and ____", "The 33rd President of the US was ____" and "I saw a ____".
- A better model of a text gives a higher probability to the word that actually occurs.
- Unigrams are terrible at this game
Perplexity Defined
- The best language model is the one that best predicts an unseen test set, i.e. assigns it the highest probability.
- Perplexity is the inverse probability of the test set, normalized by the number of words.
PP(W) = P(w1 w2 ... wN)^(-1/N)
- Minimizing PPL is equivalent to maximizing probability.
- Using chain rule,
PP(W) = [∏ P(wi | w1 ... wi-1)]^(-1/N)
- For the bigram model
PP(W) = [∏ P(wi | wi-1)]^(-1/N)
Perplexity Calculations
- log2 PP(W) = 1/N Σ log2 1/P(wi|wi-1) = - 1/N Σ log2 P(wi | wi-1)
- PP(W) = 2^H, where H is the cross-entropy computed above.
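- A minimal sketch of the perplexity computation from these formulas, assuming a list of per-word conditional probabilities over a test set (the numbers are made up for illustration):

```python
import math

# Hypothetical per-word probabilities P(w_i | w_{i-1}) for a test set of N words.
word_probs = [0.2, 0.1, 0.05, 0.3, 0.25]
N = len(word_probs)

# Cross-entropy H = -(1/N) * Σ log2 P(w_i | w_{i-1}); perplexity PP = 2^H.
H = -sum(math.log2(p) for p in word_probs) / N
perplexity = 2 ** H
print(H, perplexity)
```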
Lower Perplexity Implies Better Model
- A model was trained on 38 million words and tested on 1.5 million words from WSJ
- The unigram's perplexity was 962, bigram was 170 and trigram was 109
Zero Probabilities
- The training set has "denied the allegations", "denied the reports", "denied the claims", and "denied the request".
- The test set has "denied the offer" and "denied the loan"
- Then, "P(“offer” | denied the) = 0."
Potential Problems with Zero Probabilities
- One problem is underestimating the probability of words that can actually occur.
- The entire probability of the test set is 0 so perplexity cannot be calculated.
Laplace Smoothing
- Add one to all counts, pretending each word occurred one more time than it actually did.
- The MLE unigram probabilities:
PML(wi) = c(wi) / N
- The Add-1 estimate:
PLaplace(wi) = (c(wi) + 1) / Σw'(c(w') + 1) = (c(wi) + 1) / (N + V)
Laplace Example
- Considering the example corpus "i live in osaka . </s>", "i am a graduate student . </s>", "my school is in nara . </s>"
- The vocabulary = {i, live, in, osaka, am, graduate, student, my, school, is, nara, </s>} and its size V = 12
- Without smoothing, P(kyoto) = 0/20 = 0; with Laplace smoothing, P(kyoto) = (0+1)/(20+12) = 0.03125
- P(nara) = (1+1)/(20+12) = 0.0625, P(i) = (2+1)/(20+12) = 0.09375, and P(</s>) = (3+1)/(20+12) = 0.125
Add-1 Estimate
- Applying MLE on bigrams,
PML(wi | wi-1) = c(wi-1, wi) / c(wi-1)
- Add-1 estimate:
PLaplace(wi | wi-1) = (c(wi-1, wi) + 1) / Σw'(c(wi-1, w') + 1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
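- A minimal sketch of the Add-1 bigram estimate above (the helper name and the toy counts are assumptions for illustration):

```python
from collections import Counter

def laplace_bigram_prob(w, prev, bigram_counts, unigram_counts, vocab_size):
    # Add-1 estimate: (c(prev, w) + 1) / (c(prev) + V)
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

# Toy counts in the spirit of the Sam corpus example (illustrative only).
unigram_counts = Counter({"<s>": 3, "I": 3, "am": 2, "Sam": 2, "</s>": 3})
bigram_counts = Counter({("<s>", "I"): 2, ("I", "am"): 2, ("am", "Sam"): 1})
V = 12  # assumed vocabulary size for this toy example

print(laplace_bigram_prob("I", "<s>", bigram_counts, unigram_counts, V))  # (2+1)/(3+12) = 0.2
print(laplace_bigram_prob("Sam", "I", bigram_counts, unigram_counts, V))  # (0+1)/(3+12) ≈ 0.067
```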
Linear Interpolation
- It combines trigram, bigram, and unigram estimates
P(wi | wi-2, wi-1) = λ1 * PML(wi | wi-2, wi-1) + λ2 * PML(wi | wi-1) + λ3 * PML(wi)
- Where
λ1 + λ2 + λ3 = 1, and λi >= 0 for all i
- For words whose trigram, bigram, or even unigram counts don't exist, the unigram estimate can be interpolated with a uniform distribution:
P(wi) = λ × PML(wi) + (1 - λ) × 1/N
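- A minimal sketch of linear interpolation over trigram, bigram, and unigram ML estimates (the function name and the λ values are assumptions for illustration):

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """λ1*PML(wi|wi-2,wi-1) + λ2*PML(wi|wi-1) + λ3*PML(wi), with λ1+λ2+λ3 = 1 and λi >= 0."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# Example: the trigram was never seen (ML estimate 0), but the bigram and unigram were.
print(interpolated_prob(0.0, 0.5, 0.05))  # 0.6*0 + 0.3*0.5 + 0.1*0.05 = 0.155
```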
Setting Lambdas
- Employ a held-out corpus: choose the lambdas that maximize the probability of the held-out data.
- First, fix the N-gram probabilities estimated from the training data.
- Then search for the lambdas that give the largest probability to the held-out set:
log P(w1 ... wn | M(λ1 ... λk)) = Σi log PM(λ1 ... λk)(wi | wi-1)
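- A small sketch of choosing λ by grid search on held-out data, here for the bigram-unigram interpolation case (the function names are assumptions; the N-gram probabilities themselves stay fixed from training):

```python
import math

def heldout_log_prob(lam, heldout_bigrams, p_bigram, p_unigram):
    # log-probability of the held-out data under λ*P(w|prev) + (1-λ)*P(w)
    total = 0.0
    for prev, w in heldout_bigrams:
        total += math.log(lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w))
    return total

def choose_lambda(heldout_bigrams, p_bigram, p_unigram):
    # Simple grid search over candidate λ values; pick the one maximizing held-out probability.
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda lam: heldout_log_prob(lam, heldout_bigrams, p_bigram, p_unigram))
```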
Interpolation Example
- The corpus is "i live in osaka , i am a graduate student , my school is in nara ".
- Estimating the maximum likelihood values
- P(osaka | in) = c(in osaka)/c(in) = 1/2 = 0.5
- P(nara | in) = c(in nara)/c(in) = 1/2 = 0.5
- P(school | in) = c(in school)/c(in) = 0/2 = 0
- Use interpolation
- P(school | in) = λ2 * PML(school | in) + (1 - λ2) * P(school)
- P(school) = λ1 * PML(school) + (1 - λ1) * 1/N = λ1 * 1/20 + (1-λ1) * 1/N