Web Engineering & Development Lecture 5: Search Engines

Questions and Answers

What is the purpose of rotating a wildcard query?

To ensure the * symbol is at the end of the string (correct)
To improve search speed
To make the query more complex
To eliminate irrelevant terms

The term 'fishmonger' would not survive the filtering process when checking for the substring 'mo'.

False (B)
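The filtering step can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes the question refers to a query such as fi*mo*er, whose permuterm lookup key er$fi* has already returned a candidate list. The candidate terms and the helper name `post_filter` are hypothetical.

```python
from fnmatch import fnmatchcase

def post_filter(candidates, query):
    """Keep only the candidate terms that match the full wildcard query."""
    return [term for term in candidates if fnmatchcase(term, query)]

# Hypothetical candidates returned by the permuterm lookup for er$fi*:
# each one starts with "fi" and ends with "er".
candidates = ["fishmonger", "filibuster", "fiber"]
survivors = post_filter(candidates, "fi*mo*er")
print(survivors)
```

Only 'fishmonger' contains 'mo' between the 'fi' prefix and the 'er' suffix, so it is the only candidate that survives.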

What is the lookup key for the wildcard query 'uni*rse'?

rse$uni*
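A short sketch of the rotation, assuming $ marks the end of the term. The function name `permuterm_key` is illustrative. For queries of the form X*Y*Z it keeps only the outer parts, which means the results still need a filtering pass afterwards.

```python
def permuterm_key(query: str) -> str:
    """Rotate a wildcard query so that '*' ends up at the end of the key.

    X*Y   -> Y$X*
    X*Y*Z -> Z$X*  (the middle part Y must be checked by a later filter)
    """
    if "*" not in query:
        return query + "$"                 # no wildcard: exact rotation
    head, _, rest = query.partition("*")   # text before the first '*'
    tail = rest.rpartition("*")[2]         # text after the last '*'
    return f"{tail}${head}*"

print(permuterm_key("uni*rse"))
print(permuterm_key("fi*mo*er"))
```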

The process of creating an inverted index from bigrams involves enumerating all ______ occurring in any term.

k-grams

Match the query to its corresponding lookup key:

uni*rse = rse$uni*
fi*mo*er = er$fi*
uni*e*se = se$uni*
mon* = $m AND mo AND on

What does the * symbol indicate in wildcard queries?

Any (possibly empty) string of characters (B)

Permuterm index is a technique used to handle general wildcard queries.

True (A)

Give an example of a wildcard query that seeks terms containing all five vowels in sequence.

*a*e*i*o*u*

A wildcard query such as `automat*` seeks documents containing variants of the query term such as ______, ______, and ______.

automatic, automation, automated

Match the techniques used to handle wildcard queries with their explanations:

Permuterm index = Rotate queries so that * is at the end
K-gram index = Break terms into segments of k characters
TF-IDF weights = Rank documents based on term frequency
Precision & Recall = Evaluate the effectiveness of search results

What is the purpose of the K-gram index?

To break terms into segments of k characters (C)

Edit distance is a method used for ranking documents.

False (B)

What common strategy do the Permuterm index and K-gram index share?

Express the given wildcard query as a Boolean query on a specially constructed index.

The edit distance algorithm is also known as Hamming distance.

False (B)

What is the edit distance between the words 'cat' and 'dog'?

3
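The answer can be checked with a minimal dynamic-programming sketch of Levenshtein distance, assuming unit cost for insert, delete, and replace. The function name `edit_distance` is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and replace."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]                      # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,            # delete cs
                curr[j - 1] + 1,        # insert ct
                prev[j - 1] + cost,     # replace cs with ct (or copy if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("cat", "dog"))
```

'cat' and 'dog' share no character in any position, so all three characters must be replaced.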

Given a lexicon and a character sequence Q, the spelling correction method can return the words closest to ___

Q

Match the following terms with their correct definitions:

Edit Distance = Minimum operations to convert one string to another
Levenshtein Distance = Another name for edit distance
Dynamic Programming = A method used to calculate the edit distance
K-gram = A sequence of k characters used in indexing

Which of the following operations is NOT one of the basic operations in edit distance?

Copy (C)

The primary function of spelling correction systems is to retrieve documents indexed by the misspelled word.

False (B)

What are two alternatives mentioned for isolated word correction?

Edit distance and K-gram overlap

What does the Jaccard coefficient measure?

The ratio of the intersection to the union of two sets (C)

The Jaccard coefficient can only be used with sets of the same size.

False (B)

What threshold value would indicate a match using the Jaccard coefficient in this context?

greater than 0.8

The process of identifying overlapping elements in k-grams is called __________.

k-gram overlap

Match the bigram sets to their respective words from the given entries:

lorm = lo, or, rm
alone = al, lo, on, ne
lord = lo, or, rd
sloth = sl, lo, ot, th

Which of the following statements is true about ranked retrieval?

It ranks documents based on their relevance to the query. (D)

In a k-gram index, overlapping k-grams must be the same length.

False (B)

What is the Jaccard coefficient for the terms 'lorm' and 'lord'?

0.5
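The value can be reproduced with a small sketch. It assumes bigrams are taken without boundary markers, as in the bigram sets listed in this quiz. The helper names `bigrams` and `jaccard` are illustrative.

```python
def bigrams(term: str) -> set:
    """Set of 2-character substrings of a term (no boundary markers)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| over the bigram sets of a and b."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

print(jaccard("lorm", "lord"))
```

The two terms share 'lo' and 'or' out of four distinct bigrams (lo, or, rm, rd), which gives 2 / 4.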

What does the term frequency (tf) refer to?

The number of times a term occurs in a document (D)

Document frequency (df) is used to assign high weights to common terms.

False (B)

What does tf-idf stand for?

term frequency-inverse document frequency

The idf for a term is calculated as log(n / df_t), where n is the total number of documents and df_t is the ________.

document frequency of the term
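The formula can be illustrated with a short sketch. The logarithm base is not stated here, so base 10 is an assumption, and the document counts in the example are made up for illustration.

```python
import math

def idf(n_docs: int, df_t: int) -> float:
    """Inverse document frequency log10(n / df_t); base 10 is an assumption."""
    return math.log10(n_docs / df_t)

# Hypothetical collection of 1,000 documents.
print(idf(1000, 10))     # rare term: high idf
print(idf(1000, 1000))   # term in every document: idf of zero
```

A term that appears in every document gets an idf of zero, which is why common terms contribute little to ranking.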

Match the term with its definition:

Term Frequency (tf) = Number of times a term occurs in a document
Document Frequency (df) = Number of documents that contain a term
Inverse Document Frequency (idf) = Measure of informativeness based on term rarity
tf-idf = Combines term frequency and inverse document frequency

When is the tf-idf score highest?

When a term occurs frequently in a single document (B)

The higher the document frequency, the lower the discriminating power of a term.

True (A)

What is the primary goal of using term weighting in document queries?

To assess the relevance of documents to a query more accurately.

What formula is used to compute the score for a document based on a query?

Score(q, d) = ∑_{t ∈ q} tf-idf_{t,d} (B)
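A minimal sketch of this scoring rule, assuming raw term counts for tf and a base-10 logarithm for idf; both are assumptions, since other weighting variants exist. The function name, the toy document, and the document frequencies are hypothetical.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, df, n_docs):
    """Sum tf * idf over the distinct query terms that occur in the document."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_terms):
        if tf[term] and df.get(term):
            score += tf[term] * math.log10(n_docs / df[term])
    return score

# Hypothetical data: one tokenized document and document frequencies
# taken from an imaginary collection of 1,000 documents.
doc = ["brutus", "caesar", "caesar", "mercy"]
df = {"brutus": 10, "caesar": 100, "mercy": 1}
print(tf_idf_score(["caesar", "mercy"], doc, df, n_docs=1000))
```

Query terms that do not occur in the document contribute nothing, so the sum effectively runs over the terms shared by the query and the document.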

The tf-idf weight for 'Caesar' in document 3 is greater than zero.

False (B)

Calculate the tf-idf for 'mercy' in document 1.

79.2

The measure of __________ is the ratio of relevant records retrieved to all retrieved records.

precision

Match the term with its corresponding tf-idf value in document 1:

Brutus = 5.1
Caesar = 21
mercy = 79.2

What is the idf value of 'mercy'?

2.4 (A)

Recall measures the number of relevant records retrieved compared to the total number of irrelevant records.

False (B)

What is the formula for recall?

Recall = (relevant records retrieved) / (total relevant records)
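Precision and recall can be computed together from two sets of document IDs. The sketch below uses made-up IDs and an illustrative function name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant records retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result: 4 documents retrieved, 5 documents actually relevant.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(precision, recall)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).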

Flashcards

$

A special symbol appended to a term in permuterm indexing to mark where the word ends and begins again, so that every rotation of the term can be indexed for wildcard searches.

Permuterm Indexing

A technique for finding terms that match a wildcard query: the query is rotated so that the wildcard character (*) is at the end of the string, and the rotated string is used as the lookup key in the permuterm index. For example, the wildcard query "uni*rse" is rotated to "rse$uni*" for lookup.
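A toy sketch of the idea, assuming a query with a single *. It stores every rotation of term + $ in a dictionary and answers a query by scanning for rotations that start with the rotated prefix. A real index would use a sorted structure for prefix lookup; the linear scan, the function names, and the sample terms are only for illustration.

```python
def build_permuterm(terms):
    """Map every rotation of term + '$' back to the original term."""
    index = {}
    for term in terms:
        marked = term + "$"
        for i in range(len(marked)):
            index[marked[i:] + marked[:i]] = term
    return index

def lookup(index, query):
    """Answer a single-'*' query X*Y by prefix-matching rotations on Y$X."""
    head, _, tail = query.partition("*")
    prefix = f"{tail}${head}"
    return sorted({t for rot, t in index.items() if rot.startswith(prefix)})

index = build_permuterm(["universe", "unicorn", "reverse"])
print(lookup(index, "uni*rse"))
```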

k-gram Index

A type of index that stores all possible k-grams (sequences of k characters) occurring in a set of terms. This allows for efficient searching of terms containing specific k-grams.
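As a rough sketch, a k-gram index can be built as a mapping from each k-gram to the set of terms containing it. The example assumes k = 2 and uses $ to mark term boundaries; the function names and sample terms are illustrative.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """All k-grams of a term, with '$' marking its beginning and end."""
    marked = f"${term}$"
    return {marked[i:i + k] for i in range(len(marked) - k + 1)}

def build_kgram_index(terms, k=2):
    """Inverted index: k-gram -> set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

index = build_kgram_index(["month", "moon", "salmon"])
print(sorted(index["mo"]))
```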

k-gram Wildcard Processing

A special technique for processing wildcard queries using k-gram indexes. It involves breaking down the query into individual k-grams and then retrieving terms that match all the specified k-grams. Example: The query "mon*" could be broken into "$m", "mo", and "on".
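A hedged sketch of this process with bigrams. The Boolean AND over k-grams can return false positives (for example, 'moon' contains $m, mo, and on but does not match mon*), so the sketch adds a filtering pass against the original query. The function names and the sample term list are hypothetical, and the term list is scanned directly instead of going through a prebuilt index.

```python
from fnmatch import fnmatchcase

def term_grams(term, k=2):
    """k-grams of a term with '$' boundary markers."""
    marked = f"${term}$"
    return {marked[i:i + k] for i in range(len(marked) - k + 1)}

def query_grams(query, k=2):
    """k-grams of each literal piece of the query; pieces are split at '*'."""
    grams = set()
    for piece in f"${query}$".split("*"):
        grams |= {piece[i:i + k] for i in range(len(piece) - k + 1)}
    return grams

def wildcard_search(terms, query, k=2):
    needed = query_grams(query, k)
    # Boolean AND: the term must contain every k-gram of the query.
    candidates = [t for t in terms if needed <= term_grams(t, k)]
    # Filtering pass removes false positives such as 'moon' for mon*.
    return [t for t in candidates if fnmatchcase(t, query)]

print(sorted(query_grams("mon*")))
print(wildcard_search(["month", "moon", "monday", "salmon"], "mon*"))
```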