Web Engineering Midterm Exam Notes

Summary

These lecture notes cover Web Engineering, focusing on tolerance retrieval in search engines. Topics include wild-card queries (permuterm and k-gram indexes), spelling correction (edit distance and k-gram overlap), ranking documents (tf-idf weights), and evaluating search engines (precision and recall). Worked exercises with examples and questions are included throughout.

Full Transcript


Web Engineering & Development, Lecture 5: Search Engines (Tolerance Retrieval). Presented by Dr. Naglaa Fathy.

Agenda
▪ Wild-card queries
o Permuterm index
o K-gram index
▪ Spelling correction
o Edit distance
o K-gram overlap
▪ Ranking documents
o TF-IDF weights
▪ Evaluating search engines
o Precision & recall

Wild-card queries
▪ The idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all five vowels in sequence.
▪ The * symbol indicates any (possibly empty) string of characters.
▪ Users pose such queries to a search engine when they are uncertain about how to spell a query term, or when they seek documents containing variants of a query term;
o for instance, the query automat* would seek documents containing any of the terms automatic, automation and automated.

Handling wildcard queries
▪ There are two techniques for handling general wildcard queries: the permuterm index and the k-gram index.
▪ Both techniques share a common strategy:
o Express the given wildcard query qw as a Boolean query Q on a specially constructed index, such that the answer to Q is a superset of the set of vocabulary terms matching qw.
o Then check each term in the answer to Q against qw, discarding the vocabulary terms that do not match qw.
o At this point we have the vocabulary terms matching qw and can resort to the standard inverted index.

Permuterm index
▪ Basic idea: rotate every wildcard query so that the * occurs at the end.
▪ For the term hello, index under:
o hello$, ello$h, llo$he, lo$hel, o$hell, where $ is a special symbol marking the end of a term.
▪ Queries:
o X → look up X$
o X* → look up $X*
o *X → look up X$*
o *X* → look up X*
o X*Y → look up Y$X*
o X*Y*Z → ??? (handled like EX2 below: look up Z$X*, then post-filter for Y)
o Example: for the query hel*o, X = hel and Y = o, so we look up o$hel*.
▪ EX1: query m*n.
o The key is to rotate the wildcard query so that the * symbol appears at the end of the string.
o Thus, the rotated wildcard query becomes n$m*.
o Next, we look up this string in the permuterm index, where seeking n$m* leads to rotations of the terms man and moron.
▪ EX2: query fi*mo*er.
o The rotated wildcard query becomes er$fi*.
o We then filter the results by exhaustive enumeration, checking each candidate to see whether it contains mo.
o The term fishmonger would survive this filtering, but filibuster would not.
▪ Once the permuterm index identifies the original vocabulary terms matching a wildcard query, we look up these terms in the standard inverted index to retrieve matching documents.

Exercise
▪ Determine the lookups in the permuterm index dictionary for the following wild-card queries: uni*rse, uni*e*se.
▪ Answer:
o The lookup key for uni*rse is rse$uni*.
o The lookup key for uni*e*se is se$uni*.

k-gram indexes
▪ Enumerate all k-grams (sequences of k characters) occurring in any term.
▪ E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):
o $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
o $ is a special word-boundary symbol.
▪ Maintain an "inverted" index from bigrams to the dictionary terms that match each bigram.

Bigram index example
$m → mace, madden
mo → among, amortize
on → among, rondo

Processing n-gram wild-cards
▪ The query mon* can now be run as: $m AND mo AND on
▪ This gets the terms that match the AND version of our wildcard query.
▪ But we would also enumerate moon.
▪ We must post-filter these terms against the query.
▪ Surviving enumerated terms are then looked up in the term-document inverted index.
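To make the two lookup strategies concrete, here is a minimal Python sketch assuming a small in-memory vocabulary; the function names (rotations, permuterm_lookup, wildcard_via_kgrams) are illustrative, not part of the lecture material:

```python
import bisect
import re
from collections import defaultdict

def rotations(term):
    """All rotations of term + '$', e.g. hello -> hello$, ello$h, ..."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in vocab for rot in rotations(term))

def permuterm_lookup(entries, query):
    """Handle a single-* query X*Y by rotating it to Y$X and prefix-matching."""
    head, _, tail = query.partition("*")
    prefix = tail + "$" + head                    # m*n -> n$m, mon* -> $mon
    keys = [rot for rot, _ in entries]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\uffff")
    return sorted({term for _, term in entries[lo:hi]})

def build_kgram_index(vocab, k=2):
    """Inverted index from k-grams (with $ word boundaries) to terms."""
    index = defaultdict(set)
    for term in vocab:
        padded = "$" + term + "$"
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index

def wildcard_via_kgrams(index, query, k=2):
    """AND the postings of the query's k-grams, then post-filter the candidates."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= {piece[i:i + k] for i in range(len(piece) - k + 1)}
    postings = [index.get(g, set()) for g in grams]
    candidates = set.intersection(*postings) if postings else set()
    # post-filter: turn the wildcard query into an anchored regex
    pattern = ".*".join(map(re.escape, query.split("*")))
    return sorted(t for t in candidates if re.fullmatch(pattern, t))

vocab = ["man", "moron", "moon", "month", "mace", "madden", "among", "amortize", "rondo"]
print(permuterm_lookup(build_permuterm(vocab), "m*n"))       # ['man', 'moon', 'moron']
print(wildcard_via_kgrams(build_kgram_index(vocab), "mon*")) # ['month']
```

Note that moon legitimately matches m*n, so it appears alongside man and moron in the permuterm result; in the k-gram path it is the post-filtering step that removes moon (and moron) for the query mon*, exactly as the slide describes.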
Bigram index
▪ Note that we now have two different types of inverted index:
o The term-document inverted index, for finding documents based on a query consisting of terms.
o The k-gram index, for finding terms based on a query consisting of k-grams.

Spelling correction

Query mis-spellings
▪ We can either:
o retrieve documents indexed by the correct spelling, OR
o return several suggested alternative queries with the correct spelling ("Did you mean ...?").

Isolated word correction
▪ Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
▪ We will study several alternatives:
o Edit distance
o K-gram overlap

Edit distance
▪ Given two strings S1 and S2, the edit distance is the minimum number of basic operations needed to convert one to the other.
▪ Basic operations are typically character-level:
o Insert
o Delete
o Replace
▪ E.g., the edit distance from cat to dog is 3.
▪ The algorithm is also called "Levenshtein distance"; the distance is generally found by dynamic programming.

Edit distance example
(This walkthrough is based on the video "How do Spell Checkers work? Levenshtein Edit Distance" on YouTube.)
▪ Calculate the edit distance between the words "CABBAGES" and "RABBIT" by filling in a matrix. The first row and column correspond to empty strings; each box corresponds to a pair of substrings (prefixes). For example, the box for "CAB" and "RA" stores the Levenshtein distance between "CAB" and "RA".
▪ The first row shows how many edits are required to transform substrings of "CABBAGES" into the empty string "": to transform any string into the empty string, we delete each of its letters. It takes 1 edit to change "C" into "" (delete the "C"), and 3 edits to change "CAB" into "" (delete "C", "A" and "B").
▪ The first column shows how many edits are required to transform the empty string into some substring of "RABBIT": we insert all the characters of that string. Transforming "" into "R" takes 1 insertion; transforming "" into "RAB" takes 3 insertions ("R", "A", "B").
▪ How many edits are required to transform "CABBA" into "RAB"? Look at the three neighbouring boxes:
o the box above: it takes 4 edits to transform "CABBA" into "RA";
o the box to the left: it takes 2 edits to transform "CABB" into "RAB";
o the diagonal box: it takes 3 edits to transform "CABB" into "RA".
▪ These three boxes correspond to the three operations available in Levenshtein; any of them can produce "RAB" from "CABBA":
o 1. Transform "CABBA" into "RA" (4 edits), then insert "B" on the end of "RA" to get "RAB".
o 2. Transform "CABB" into "RAB" (2 edits), then delete the "A" on the end of "CABBA".
o 3. Transform "CABB" into "RA" (3 edits), then substitute (replace) the "A" on the end of "CABBA" with a "B".
▪ If the two characters are different, like the "A" and "B" here, take the minimum of the 3 boxes and add 1; so this box holds min(4, 2, 3) + 1 = 3.
▪ If the two characters are exactly the same (e.g., the "B" in both strings), we do not have to perform any additional operation: just copy the value from the box diagonally above-left.
▪ The value in the lower-right corner is the final edit distance between the two strings; in this example, the edit distance between "CABBAGES" and "RABBIT" is 5.
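As a sketch of the dynamic-programming algorithm just described (this is a standard Levenshtein implementation; only the test strings come from the slides):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and replacements
    needed to turn s1 into s2 (Levenshtein distance)."""
    m, n = len(s1), len(s2)
    # d[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                d[i][j] = d[i - 1][j - 1]          # same character: copy diagonal
            else:
                d[i][j] = 1 + min(d[i - 1][j],     # delete
                                  d[i][j - 1],     # insert
                                  d[i - 1][j - 1]) # replace
    return d[m][n]

print(levenshtein("CABBAGES", "RABBIT"))  # 5, as in the walkthrough above
print(levenshtein("cat", "dog"))          # 3, as stated on the definition slide
```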
Exercise
▪ Draw the matrix and compute the edit distance between "oslo" and "snow". The possible edit operations surrounding a given box are the boxes above, to the left, and diagonally above-left, as in the walkthrough above.

       ""   s   n   o   w
  ""    0   1   2   3   4
  o     1   1   2   2   3
  s     2   1   2   3   3
  l     3   2   2   3   4
  o     4   3   3   2   3

▪ The edit distance equals 3.

k-gram overlap
▪ To further limit the set of vocabulary terms for which we compute edit distances to the query term, a k-gram index can be used to assist with retrieving vocabulary terms with low edit distance to the query q.
o Once we retrieve such terms, we can then find the ones of least edit distance from q.
▪ The k-gram index is used to retrieve vocabulary terms that have many k-grams in common with the query:
o Enumerate all the k-grams in the query string as well as in the lexicon.
o Use the k-gram index to retrieve all lexicon terms matching any of the query k-grams.
o Measure the overlap in k-grams between a vocabulary term and q, and take the terms that exceed a certain threshold.

Jaccard coefficient for k-gram overlap
▪ A commonly used measure of overlap.
▪ Let X and Y be two sets: the set of k-grams in the query q, and the set of k-grams in a vocabulary term.
▪ Then the Jaccard coefficient is |X ∩ Y| / |X ∪ Y|.
▪ It equals 1 when X and Y have the same elements, and 0 when they are disjoint.
▪ X and Y do not have to be of the same size.
▪ It always assigns a number between 0 and 1.
o Threshold this number to decide whether you have a match.
o E.g., if J.C. > 0.8, declare a match.

Matching bigrams
▪ Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd).

lo → alone, lord, sloth
or → border, lord, morbid
rd → ardent, border, card

▪ A standard postings "merge" will enumerate …
▪ Adapt this to use the Jaccard (or another) measure.

Exercise
▪ Compute the Jaccard coefficients between the query "lorm" and each of the terms that contain the bigram "lo" in the following entry of a bigram index:
lo → alone, lord, sloth

Answer
▪ Bigrams:
o lorm → lo, or, rm
o alone → al, lo, on, ne
o lord → lo, or, rd
o sloth → sl, lo, ot, th
▪ Jaccard coefficient = |X ∩ Y| / |X ∪ Y|:
o J.C. (lorm, alone) = 1/(7 - 1) = 1/6
o J.C. (lorm, lord) = 2/(6 - 2) = 2/4 = 1/2
o J.C. (lorm, sloth) = 1/(7 - 1) = 1/6
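A short Python check of this exercise; as in the answer above, bigrams are taken without $ boundary markers, and the helper names are illustrative:

```python
def bigrams(term: str) -> set:
    # character bigrams without boundary markers, e.g. lorm -> {lo, or, rm}
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(x: set, y: set) -> float:
    # |X ∩ Y| / |X ∪ Y|
    return len(x & y) / len(x | y)

query = bigrams("lorm")
for term in ("alone", "lord", "sloth"):
    print(term, jaccard(query, bigrams(term)))
# alone 0.1666...   lord 0.5   sloth 0.1666...  (1/6, 1/2, 1/6 as computed above)
```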
Scoring and ranking

Ranked retrieval
▪ Rather than returning a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.

Scoring as the basis of ranked retrieval
▪ We wish to return, in order, the documents most likely to be useful to the searcher.
▪ How can we rank-order the documents in the corpus with respect to a query?
▪ Assign a score, say in [0, 1], to each document for each query.
▪ This score measures how well document and query "match".

Term weighting

Term frequency (tf)
▪ The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d.
▪ Raw term frequency as above suffers from a critical problem: all terms are considered equally important when assessing relevance to a query.
▪ In fact, certain terms have little or no discriminating power in determining relevance.
o For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document.
▪ Therefore, we would like to reduce the weight of a common term, using document frequency.

Document frequency
▪ We want high weights for rare terms and low (positive) weights for frequent words.
▪ We will use document frequency to factor this into computing the matching score.
▪ The document frequency df(t) is the number of documents in the collection that the term occurs in.

tf x idf term weights
▪ The tf x idf measure combines:
o term frequency (tf): the number of occurrences of a term in a document;
o inverse document frequency (idf): a measure of the informativeness of a term, i.e. its rarity across the whole corpus:
  idf(t) = log10(N / df(t))
  where logarithms are to the base 10 and N is the total number of documents.

tf x idf (or tf-idf)
▪ Assign a tf-idf weight to each term t in each document d:
  w(t,d) = tf(t,d) × log10(N / df(t))
  where tf(t,d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents that contain term t.
1. The weight is highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents).
2. It is lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal).
3. It is lowest when the term occurs in virtually all documents.

Final ranking of documents for a query
  Score(q,d) = Σ over t in q ∩ d of tf-idf(t,d)

Exercise
▪ Consider the following table of frequencies of the terms Brutus, Caesar and mercy in 3 documents. Compute the tf-idf weights for each term in each document (total number of documents N = 791,860).

            df(t)   Doc 1   Doc 2   Doc 3
  Brutus   16,175       3      30      25
  Caesar   25,235      14      27       0
  mercy     3,276      33       0      17

Answer
▪ idf(Brutus) = log10(791,860 / 16,175) = 1.7
o tf-idf(Brutus, doc1) = 1.7 × 3 = 5.1
o tf-idf(Brutus, doc2) = 1.7 × 30 = 51
o tf-idf(Brutus, doc3) = 1.7 × 25 = 42.5
▪ idf(Caesar) = log10(791,860 / 25,235) = 1.5
o tf-idf(Caesar, doc1) = 1.5 × 14 = 21
o tf-idf(Caesar, doc2) = 1.5 × 27 = 40.5
o tf-idf(Caesar, doc3) = 1.5 × 0 = 0
▪ idf(mercy) = log10(791,860 / 3,276) = 2.4
o tf-idf(mercy, doc1) = 2.4 × 33 = 79.2
o tf-idf(mercy, doc2) = 2.4 × 0 = 0
o tf-idf(mercy, doc3) = 2.4 × 17 = 40.8
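A small Python reproduction of this exercise. To match the slide's numbers exactly, idf is rounded to one decimal place before multiplying, as the worked answer does:

```python
import math

N = 791_860                                   # total number of documents
df = {"Brutus": 16_175, "Caesar": 25_235, "mercy": 3_276}
tf = {"Brutus": (3, 30, 25), "Caesar": (14, 27, 0), "mercy": (33, 0, 17)}

for term in df:
    idf = round(math.log10(N / df[term]), 1)          # idf(t) = log10(N / df(t))
    weights = [round(f * idf, 1) for f in tf[term]]   # w(t,d) = tf(t,d) * idf(t)
    print(term, idf, weights)

# Brutus 1.7 [5.1, 51.0, 42.5]
# Caesar 1.5 [21.0, 40.5, 0.0]
# mercy  2.4 [79.2, 0.0, 40.8]
```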
Evaluating search engines

Measuring search effectiveness: precision and recall
▪ Precision and recall are the basic measures used in evaluating search strategies.
▪ These measures assume:
o There is a set of records in the database which is relevant to the search topic.
o Records are assumed to be either relevant or irrelevant (these measures do not allow for degrees of relevancy).
▪ The actual retrieval set may not perfectly match the set of relevant records.

Recall
▪ RECALL is the ratio of the number of relevant records retrieved to the total number of relevant records in the database. It is usually expressed as a percentage.

Precision
▪ PRECISION is the ratio of the number of relevant records retrieved to the total number of records retrieved (irrelevant and relevant). It is usually expressed as a percentage.

Exercise
▪ Assume the following:
o A database contains 80 records on a particular topic.
o A search was conducted on that topic and 60 records were retrieved.
o Of the 60 records retrieved, 45 were relevant.
▪ Calculate the precision and recall scores for the search.

Solution
▪ The number of relevant records retrieved = 45
▪ The total number of relevant records = 80
▪ The number of retrieved records = 60
▪ Recall = 45/80 × 100% = 56%
▪ Precision = 45/60 × 100% = 75%
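Finally, a one-function Python sketch of the two measures, checked against the solution above (the parameter names are illustrative):

```python
def precision_recall(retrieved: int, relevant_total: int, relevant_retrieved: int):
    precision = relevant_retrieved / retrieved       # relevant retrieved / all retrieved
    recall = relevant_retrieved / relevant_total     # relevant retrieved / all relevant
    return precision, recall

p, r = precision_recall(retrieved=60, relevant_total=80, relevant_retrieved=45)
print(f"precision = {p:.0%}, recall = {r:.0%}")      # precision = 75%, recall = 56%
```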

Web Engineering Midterm Exam Notes
▪ Lectures included: 1 + 2 + 3 + (Lec 4 until slide 33)
▪ Midterm total marks: ASU: 15 marks → UEL: 100 marks
▪ Questions include:
o Question 1: MCQ
o Question 2: Replace each statement with its key term
o Question 3: Fill in the missing spaces with the suitable step
▪ Exam pages: 2 (one double-sided paper). Answer in the SAME exam sheet.