4-Retrieval.pdf
Exact Text Search – Binary Retrieval Model

If we want to search for bird + early, this corresponds to using AND on these columns:

[Term–document incidence matrix: dimensions 1–26 for the terms a, and, bed, bird, but, cheese, early, feather, flock, get, healthy, is, it, make, man, mouse, of, rise, second, that, the, to, together, wealthy, wise, worm; one 0/1 column per document Doc 1–Doc 4, with a 1 wherever the term occurs in the document.]

Intersection of bird AND early = 0011 AND 1010 = 0010; matching document: bit 2 = document 2.
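A minimal sketch of this Boolean AND in Python, using the bit patterns from the slide (bit i, counted from the right and starting at 1, stands for document i; the variable names are ours):

```python
# Toy bit vectors from the slide: bit i (from the right, starting at 1) = document i.
bird  = 0b0011   # bird occurs in documents 1 and 2
early = 0b1010   # early occurs in documents 2 and 4

matches = bird & early                                   # AND of the two columns
docs = [i + 1 for i in range(4) if (matches >> i) & 1]   # decode set bits to doc ids
print(format(matches, "04b"), docs)                      # 0010 [2] -> document 2
```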
Exact Text Search – Inverted Index

We cannot store the binary matrix for large document collections.
Solution: we store only the 1's, for each term, sorted.
bird → Doc 1, Doc 2
early → Doc 2, Doc 4

Optimizations:
▶ sort lists, and use a merge operation to intersect lists in linear time (🔗 “Intersect” algorithm in the IR Book; see the sketch at the end of this section)
▶ skip pointers to skip multiple entries at once
▶ B-trees with prefix compression to organize lists
▶ compress lists by storing deltas, variable-length integer encoding, etc. [LeBo15; LeKaKa16] (reduce I/O cost, use SIMD instructions for fast decoding and intersection)
▶ sharding (horizontal partitioning), also for distributed search

Exact Text Search – Inverted Index

We can store auxiliary information in inverted indexes, each with an associated symbol and storage cost:
▶ total number of documents
▶ number of occurrences
▶ total number of occurrences
▶ maximum number of occurrences
▶ term positions within document (for phrase matches and the “NEAR” operator – usually a 2–4× larger index)

For TF-IDF, we can store the quantity/positions along with the document:
bird → Doc 1 ×1, Doc 2 ×1
early → Doc 2 ×1, Doc 4 ×2
early → Doc 2 @ 4, Doc 4 @ 1,5

Ranked Retrieval – Approximative Matching

Exact matches (Boolean model) tend to return no results, or too many results.
➜ We often want to see the “best” matches only. We need a scoring function to sort results.
Intuition (TF): the more words match in the document, the better.
Intuition (IDF): common words should be given less weight than rare words.
Cosine similarity:
$\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$
where $q$, $d$ are the TF-IDF vectors.
Note: with cosine normalization, $\cos(q, d) = q \cdot d$.

Ranked Retrieval II – Approximative Matching with TF-IDF

For text search (unweighted query $q$), we can simplify this to:
$\operatorname{score}(q, d) = \sum_{t \in q} \operatorname{tf\text{-}idf}(t, d)$
This has the benefits:
▶ efficient computation, one term at a time, and $|q|$ is usually small
▶ documents not in the inverted list get $0$ (i.e., no change)
Improvements:
▶ order terms by descending idf, and try to stop early if the remaining documents cannot become a top-$k$ result anymore (if we know $k$)
▶ stop early and skip low-idf terms, even if we cannot guarantee correctness (it is only an approximation of real relevance anyway – it is never “correct”, and frequent low-idf terms such as “in” and “the” will not change much anyway)
A term-at-a-time scoring sketch follows at the end of this section.

TF-IDF Similarity for Clustering

For clustering and similar analysis techniques, we do not want such a query–document asymmetry, so we will usually use cosine similarity. If we normalize documents to unit length, this can be computed as the dot product of the TF-IDF vectors:
$\cos(d_1, d_2) = d_1 \cdot d_2 = \sum_t \operatorname{tf\text{-}idf}(t, d_1) \cdot \operatorname{tf\text{-}idf}(t, d_2)$
Since terms with weight $0$ in either document contribute nothing, we only need the terms occurring in both documents, which is usually a small set if the documents are dissimilar.
Do not materialize the full vectors; skipping all terms where either weight is $0$, compute the dot product by merging the (sorted) sparse vectors (optimized sparse vector product), as sketched below.
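A minimal sketch (Python, toy posting lists from the slides) of the merge-based intersection of two sorted posting lists mentioned under the inverted-index optimizations above:

```python
def intersect(p1, p2):
    """Merge-based intersection of two sorted posting lists of document ids."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # doc id occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller doc id
        else:
            j += 1
    return result

# Toy posting lists from the slides: bird -> Doc 1, Doc 2; early -> Doc 2, Doc 4
print(intersect([1, 2], [2, 4]))   # [2] -> document 2
```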
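A sketch of the term-at-a-time scoring described in “Ranked Retrieval II”, assuming a hypothetical inverted index that stores (doc id, TF-IDF weight) pairs per term; the index contents and weights are made up for illustration. Documents that never occur in a posting list simply keep a score of 0:

```python
from collections import defaultdict

# Hypothetical index: term -> sorted list of (doc id, tf-idf weight) pairs.
index = {
    "bird":  [(1, 0.5), (2, 0.75)],
    "early": [(2, 0.5), (4, 1.0)],
}

def score(query_terms, index, k=2):
    """Term-at-a-time accumulation of TF-IDF scores for an unweighted query."""
    acc = defaultdict(float)
    for term in query_terms:
        for doc, weight in index.get(term, []):
            acc[doc] += weight            # one posting list at a time
    # Return the top-k documents by accumulated score.
    return sorted(acc.items(), key=lambda x: -x[1])[:k]

print(score(["bird", "early"], index))    # [(2, 1.25), (4, 1.0)]
```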
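And a sketch of the optimized sparse vector product from “TF-IDF Similarity for Clustering”: both documents are stored as sorted (term, weight) lists, and the two lists are merged so that only terms occurring in both documents are touched (toy, roughly unit-normalized weights):

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors given as sorted (term, weight) lists."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        ta, wa = a[i]
        tb, wb = b[j]
        if ta == tb:
            total += wa * wb   # term occurs in both documents
            i += 1
            j += 1
        elif ta < tb:
            i += 1             # term only in a -> contributes 0
        else:
            j += 1             # term only in b -> contributes 0
    return total

# With unit-normalized TF-IDF vectors, this dot product is the cosine similarity.
d1 = [("bird", 0.6), ("early", 0.8)]
d2 = [("early", 0.5), ("worm", 0.866)]
print(sparse_dot(d1, d2))      # 0.4
```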
Evaluation of Retrieval Models

Unordered Evaluation

We can use precision, recall, and accuracy:

                 relevant                nonrelevant
retrieved        True Positives (TP)     False Positives (FP)
not retrieved    False Negatives (FN)    True Negatives (TN)

Precision (exactness): what fraction of the retrieved objects were relevant:
$\text{Precision} = \frac{TP}{TP + FP}$
Recall (completeness): what fraction of the relevant objects were retrieved:
$\text{Recall} = \frac{TP}{TP + FN}$
F1-Measure: harmonic mean of precision and recall:
$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Accuracy: fraction of test set tuples that are correctly classified as (non-)relevant:
$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$
Problem with “Accuracy”: retrieval is usually a very unbalanced task. F1 is better suited here.

Ranked Evaluation

In many cases, we evaluate a ranked list of results, not a binary classification/subset.
When the relevant objects are at ranks $r_1 < r_2 < \ldots < r_k$ in the ranking, let $R_i$ be the top $r_i$ results each.
Average Precision (AveP): average precision at the relevant object locations:
$\text{AveP} = \frac{1}{k} \sum_{i=1}^{k} \text{Precision}(R_i) = \frac{1}{k} \sum_{i=1}^{k} \frac{i}{r_i}$
Mean Average Precision (MAP): the mean average precision over multiple queries.
Normalized Discounted Cumulative Gain (NDCG) for binary relevancy [JäKe02]:
$\text{NDCG} = \left( \sum_{i=1}^{k} \frac{1}{\log_2(1 + r_i)} \right) \Big/ \left( \sum_{i=1}^{k} \frac{1}{\log_2(1 + i)} \right)$
where the last term is a normalization term such that the perfect ranking yields an NDCG of 1.
Several other measures exist, and enjoy popularity in particular contexts: area under the ROC curve (AUROC, ROC AUC), area under the precision-recall curve (AUPRC), …

Entropy and Perplexity

The (information-theoretic) entropy of a probability distribution $p$ is:
$H(p) = -\sum_x p(x) \log_2 p(x)$
Perplexity is defined using the entropy as:
$PP(p) = 2^{H(p)}$
The perplexity of a fair coin is 2, of a fair D6 die it is 6. The perplexity of a text model is usually given per word!
Lower perplexity ⇝ the model can better predict the text.
Note: while perplexity is more common here, entropy is likely the better measure: the number of bits per word in an optimal compression vs. the misleading analogy to successfully predicting a $PP$-sided fair die.
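To make the unordered measures above concrete, here is a small Python sketch with made-up counts; it also illustrates why accuracy can look flattering on an unbalanced retrieval task while F1 does not:

```python
# Made-up counts for an unbalanced collection: 10 relevant documents among 1000.
tp, fp = 8, 12          # retrieved: 8 relevant, 12 nonrelevant
fn, tn = 2, 978         # not retrieved: 2 relevant, 978 nonrelevant

precision = tp / (tp + fp)                                  # 0.4
recall    = tp / (tp + fn)                                  # 0.8
f1        = 2 * precision * recall / (precision + recall)   # ~0.533
accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # 0.986 despite mediocre precision
print(precision, recall, f1, accuracy)
```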
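A sketch of average precision and binary-relevance NDCG, computed directly from the ranks of the relevant objects as in the formulas above (toy ranks; the exact log-discount variant used on the slide is an assumption):

```python
from math import log2

def average_precision(ranks):
    """AveP from the sorted ranks r_1 < ... < r_k of the relevant objects."""
    return sum((i + 1) / r for i, r in enumerate(ranks)) / len(ranks)

def ndcg_binary(ranks):
    """NDCG for binary relevance: DCG at the relevant ranks, normalized by the ideal DCG."""
    dcg  = sum(1 / log2(1 + r) for r in ranks)
    idcg = sum(1 / log2(1 + i) for i in range(1, len(ranks) + 1))
    return dcg / idcg

relevant_ranks = [1, 3, 6]                 # toy example: relevant results at ranks 1, 3, and 6
print(average_precision(relevant_ranks))   # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722
print(ndcg_binary(relevant_ranks))         # < 1, since the ranking is not perfect
```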
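Finally, a sketch of entropy and perplexity for the fair-coin and fair-die examples above:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity = 2 ** entropy."""
    return 2 ** entropy(probs)

print(perplexity([0.5, 0.5]))     # fair coin -> 2.0
print(perplexity([1 / 6] * 6))    # fair D6 die -> ~6.0
print(perplexity([0.9, 0.1]))     # biased coin -> ~1.38 (easier to predict)
```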