Information Retrieval I
Master in Artificial Intelligence 2024/25
ESEI – University of Vigo
Table of Contents
- What is Information Retrieval?
- Typical Architecture and Components
- Document Analysis and Indexing (sparse vs dense)
- IR with Sparse Representations
  - Sparse doc. processing
  - Inverted indices
- Retrieval Models

What is Information Retrieval?

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Source: C. Manning et al., Introduction to Information Retrieval, Cambridge University Press, 2008.

Goals
- Efficient and effective access to relevant information from a large unstructured dataset, scalable to handle large and growing datasets
- Retrieve docs. that match a user's information need (≈ query)
  - understanding and processing natural language queries and docs.
  - ranking retrieved results based on their relevance to the user's query

IR Evolution

Early Developments (1950s-1960s)
- The birth of IR: indexing and retrieving information from text
- Boolean model for exact matching
- Cranfield reference collection (evaluation metrics) and first versions of SMART (System for the Mechanical Analysis and Retrieval of Text), by G. Salton

Probabilistic Models (1970s-1980s)
- Probabilistic models for IR (term probabilities → doc. relevance and ranking)
- Vector Space Model (VSM): representing docs. and queries as vectors
- TF-IDF weighting scheme, Okapi BM25 ranking function (by Karen Spärck Jones, S. Robertson and others)

IR Evolution (II)

Digital Libraries and Web Search (1990s-2000s)
- Digital Libraries (PubMed Central) and first Web Search Engines (Yahoo, AltaVista)
- Dominance of Google's PageRank algorithm (exploits link-based info.)
- Web-scale indexing, search personalization, recommendation systems

Modern Information Retrieval (2010s-Present)
- Deep learning neural networks for semantic understanding
- AI-powered search engines and chatbots

Typical Architecture and Components

Typical IR Systems Architecture
[Figure: typical tasks/components of an IR system. Source: D. Jurafsky, J.H. Martin, Speech and Language Processing (3rd ed.)]

Typical IR Systems Architecture (II)

Document Indexing
- Processing and indexing the documents in the collection → creates the doc. representation
- Indexing creates a data structure that maps terms (words, phrases) in the docs. to their corresponding locations in the collection
- Enables faster retrieval

Query Processing
- Analyzes user queries and converts them into a format useful for doc. retrieval → creates the query representation from the information need
- Transforms user queries so they can be compared with the indexed docs. to find relevant matches efficiently

Typical IR Systems Architecture (III)

Search and Ranking
- The system ranks the set of candidate docs. that match the user's query → matches query vs. doc. representations
- Several scoring and ranking algorithms are available
- Resulting docs. are ordered based on their relevance to the query
- Most relevant docs. appear at the top → improves the user's experience

Typical IR Systems Architecture (IV)

Optional tasks/components

Relevance feedback
- Allows users to provide feedback on the retrieved results
- The system can iteratively adjust its ranking and retrieval strategies (user personalization)
- Helps improve retrieval accuracy by taking user preferences and feedback into account

Query expansion
- Expands the user's query by adding synonyms, hypernyms, related terms, or conceptually similar words (a minimal sketch follows this slide)
- Improves the retrieval of additional relevant docs.
- Covers a wider range of relevant docs. by accounting for variations in terminology and user intent
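As a concrete illustration of the idea (not part of the original slides), the following Python sketch expands a keyword query using a small hand-built synonym table; a real system would instead draw the added terms from a thesaurus or ontology, as noted above. The SYNONYMS table and the expand_query function are hypothetical names chosen for the example.

```python
# Illustrative query expansion with a tiny hand-built synonym table.
# The vocabulary and the synonym choices are invented for this example.

SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
    "doctor": ["physician"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query terms plus any known synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("fast car"))
# ['fast', 'quick', 'rapid', 'car', 'automobile', 'vehicle']
```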
Document Analysis and Indexing (sparse vs dense)

Goals/Subtasks

(1) Document representation
- Extract informative content from docs.
- Sparse representations → docs. represented as sets of discrete index terms
  - Typical doc. processing: tokenization + normalization (stemming) + filtering (stop-word removal)
- Dense representations → docs. represented as dense semantic vectors
  - Typical doc. processing: Deep Learning language models extract semantically rich contextual vectors

Document Analysis and Indexing (sparse vs dense) (II)

(2) Indexing
- Create a machine-processable representation from the analyzed docs.' contents
- Sparse representations → inverted indices (map index terms to docs. plus additional info.)
- Dense representations → vector stores (arrange dense vectors for efficient similarity comparisons)

IR with Sparse Representations

Doc. processing in Sparse IR
- Manually assigned index terms (by human annotators)
  - Free-form keyword indexing
  - Controlled vocabularies (Ex: MeSH Thesaurus in PubMed/Medline)
- Automatic indexing (with NLP techniques). Typical steps:
  - Tokenization: docs. are divided into smaller meaningful items (tokens): words, phrases, characters, ...
  - Stop-word removal: common and uninformative tokens (stop-words) are removed; reduces the index's size and improves retrieval efficiency
  - Normalization: unify lexical variants of tokens, reducing words to their base forms (≈ lexical roots) by stemming or lemmatization
  - Also, more sophisticated NLP: NER, multi-word detection, syntactic chunking, etc.

Inverted indices in Sparse IR
- Data structure that allows fast retrieval of docs.
  - associates index terms with the docs. in which they appear
  - records various statistics about their occurrences
- Primary components
  - Term Dictionary: list of unique terms found in the doc. collection
  - Posting Lists: each term is linked to a posting list with info. about the docs. containing the term; additional info: term frequencies (TF), positions within docs.

Inverted indices in Sparse IR (II)
[Figure: example inverted index. Source: C. Manning et al., Introduction to Information Retrieval, Cambridge University Press, 2008.]

Inverted indices in Sparse IR (III)

Retrieval time
- The system processes the user's query → identifies the query terms (analogous steps to doc. processing)
- Terms are looked up in the inverted index → get the corresponding posting lists
- The intersection of the posting lists for the query terms is computed → finds docs. that contain the required query terms (a toy example follows this slide)
- (optional) Ranking and scoring mechanisms → rank candidate docs. based on the presence and frequency of query terms in the posting lists
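To make the retrieval-time steps concrete, here is a minimal Python sketch (not from the slides) that builds a toy inverted index over a few in-memory documents and answers a conjunctive query by intersecting posting lists. The sample documents, the stop-word list and the whitespace tokenizer are assumptions made only for illustration; no stemming or position information is included.

```python
# Toy inverted index: maps each term to the set of doc IDs containing it.
# Documents, stop-words and the whitespace tokenizer are illustrative only.
from collections import defaultdict

DOCS = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "dogs and cats make good pets",
}
STOP_WORDS = {"the", "on", "and"}

def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace and drop stop-words (no stemming here)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Term dictionary + posting lists (here: plain sets of doc IDs)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index: dict[str, set[int]], query: str) -> set[int]:
    """Intersect the posting lists of all query terms (Boolean AND)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

index = build_index(DOCS)
print(boolean_and(index, "the cat"))   # {1, 2}
print(boolean_and(index, "dog cat"))   # {2}
```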
Retrieval Models

Mathematical models/frameworks → how are docs. scored and ranked based on their relevance to a query?

Boolean Models
- Based on Boolean algebra
- AND, OR, NOT operators match query terms with doc. terms in the inverted index
- No doc. ranking → docs. are either relevant (1) or non-relevant (0)

Vector Space Models (VSM)
- Represent docs. and queries as vectors in a multi-dimensional space
- Similarity measures (cosine) between the query vector and the doc. vectors → docs. ranking

Probabilistic Models
- Evolution of VSM
- Probabilistic principles to rank docs. according to their likelihood of being relevant to a query

Vector Space Models
- Docs. and queries as high-dimensional sparse vectors ($\vec{d_i}$ and $\vec{q}$)
  - Vector dimension → total number of index terms in the collection
  - Sparse vectors → very few non-zero elements
- Vector similarity between $\vec{d_i}$ and $\vec{q}$ ⇒ relevance of doc. $d_i$ to query $q$
- Key points: (1) how index terms are weighted in $\vec{d_i}$ and $\vec{q}$; (2) how vector similarity is computed
  - For (1): TF-IDF weighting
  - For (2): Euclidean distance (many drawbacks), Cosine similarity

TF-IDF Weighting

The TF-IDF formula is typically defined as:

$$ w_{t_i, d_j} = \text{TF-IDF}(t_i, d_j) = \text{TF}(t_i, d_j) \times \text{IDF}(t_i) $$

Where:

$$ \text{TF}(t_i, d_j) = \frac{\text{number of times term } t_i \text{ appears in document } d_j}{\text{total number of terms in document } d_j} $$

$$ \text{IDF}(t_i) = \log \frac{\text{total number of documents in corpus } D}{\text{number of documents containing term } t_i + 1} $$

with $t_i \in T$ and $d_j \in D$, where $D$ is the document collection and $T$ is the vocabulary with all terms in $D$.

Note: in practice, actual IR systems use different variations of this basic formula (see the Wikipedia entry for TF-IDF).

TF-IDF Weighting (II)

Term Frequency (TF)
- Measures how often a term appears in a doc.
- Higher weight to terms that are "important" in a given doc. (informative terms)

Inverse Document Frequency (IDF)
- Measures the rarity of a term across the entire doc. collection
- Higher weight to terms that are relatively rare (discriminative terms)

TF-IDF mixes 2 heuristics (informativeness + discriminativeness) to highlight terms that are both frequent within a document and rare across the collection.

Vector Similarity

First approach (not recommended): Euclidean distance between $\vec{q}$ and $\vec{d_j}$, with $d_j \in D$

$$ \text{distance}(\vec{q}, \vec{d_j}) = \sqrt{\sum_{t_i \in T} \left( \text{TF-IDF}(t_i, q) - \text{TF-IDF}(t_i, d_j) \right)^2} $$

Euclidean distance drawbacks
- Sensitive to scaling differences between dimensions (high TF values and long documents); may require normalization
- Less effective with sparse data (the absence of a term means a zero in the vector space), leading to inaccurate distance measurements
- Can suffer from the "curse of dimensionality": as the number of dimensions/terms increases, distances between data points no longer accurately reflect similarity

Vector Similarity (II)
[Figure. Source: C. Manning et al., Introduction to Information Retrieval, Cambridge University Press, 2008.]

Vector Similarity (III)

Actual similarity measure: Cosine similarity

$$ \text{similarity}(\vec{q}, \vec{d_j}) = \frac{\vec{q} \cdot \vec{d_j}}{|\vec{q}| \times |\vec{d_j}|} = \frac{\sum_{t_i \in T} \text{TF-IDF}(t_i, q) \cdot \text{TF-IDF}(t_i, d_j)}{\sqrt{\sum_{t_i \in T} \text{TF-IDF}(t_i, q)^2} \times \sqrt{\sum_{t_i \in T} \text{TF-IDF}(t_i, d_j)^2}} $$

- Measures the angle between $\vec{q}$ and $\vec{d_j}$ (≈ how much the vectors overlap)
- Scale invariance: it doesn't depend on the magnitude of the vectors; it focuses on direction rather than on vector length
- Robust in high-dimensional spaces and able to deal with sparse vectors
- Efficiency (easy to optimize dot-product computation), effective ranking and direct semantic interpretation
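The following Python sketch (an illustration, not part of the slides) ranks a toy collection with the TF-IDF and cosine formulas given above, using TF as relative frequency and IDF = log(N / (df + 1)). The sample documents, the query and the plain whitespace tokenization are assumptions made for the example.

```python
# Minimal TF-IDF + cosine ranking, following the formulas on the previous
# slides (TF = relative frequency, IDF = log(N / (df + 1))).
# The toy corpus and query are invented for illustration.
import math
from collections import Counter

DOCS = {
    "d1": "information retrieval finds relevant documents",
    "d2": "neural networks learn dense document representations",
    "d3": "an inverted index maps terms to documents",
}

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Sparse TF-IDF vector represented as a {term: weight} dict."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Collection statistics: document frequency and IDF per term.
N = len(DOCS)
doc_tokens = {d: tokenize(text) for d, text in DOCS.items()}
df = Counter(t for tokens in doc_tokens.values() for t in set(tokens))
idf = {t: math.log(N / (df[t] + 1)) for t in df}

doc_vectors = {d: tfidf_vector(tokens, idf) for d, tokens in doc_tokens.items()}
query_vector = tfidf_vector(tokenize("relevant documents"), idf)

# Rank documents by decreasing cosine similarity to the query.
ranking = sorted(doc_vectors.items(),
                 key=lambda item: cosine(query_vector, item[1]),
                 reverse=True)
for doc_id, vec in ranking:
    print(doc_id, round(cosine(query_vector, vec), 3))
```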
Okapi BM25

BM25 (Best Matching 25) is a family of probabilistic ranking functions in IR
- Ranks a set of docs. based on the query terms appearing in them
- Adapts TF-IDF to overcome some of its limitations (BM25 is more robust)

Aspects included in the BM25 ranking score:
- A modified TF weighting scheme, less sensitive to extreme values
- A modified IDF, which attenuates the impact of frequently occurring terms
- Doc. length normalization, which penalizes longer docs.
- Term saturation: a saturation function prevents overly high TF scores (an increasing TF may not necessarily indicate greater relevance)
- Tunable parameters (b and k1), which can be adjusted to optimize retrieval performance on specific datasets

Okapi BM25 (II)

Given a query Q containing index terms q1, ..., qn, the BM25 score of a document D is:

$$ \text{score}_{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{\text{TF}(q_i, D) \cdot (k_1 + 1)}{\text{TF}(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} $$

$$ \text{IDF}(q_i) = \ln\left( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \right) $$

Where:
- n = number of query terms
- k1 = term frequency saturation parameter
- b = document length normalization parameter
- TF(qi, D) = number of times that qi occurs in doc. D
- N = total number of docs. in the collection
- n(qi) = number of docs. containing query term qi
- |D| = length of doc. D
- avgdl = average doc. length in the collection

(A minimal scoring sketch based on this formula is included at the end of these notes.)

IR libraries and frameworks (sparse IR)

Apache Lucene
- Open-source IR library and search engine written in Java (interfaces to other languages, e.g. PyLucene)
- Related projects:
  - Apache Solr: indexing/search engine server on top of Lucene (REST API, load balancing, high availability)
  - ElasticSearch: similar to Solr; main focus on log storage (Logstash) and analysis; provides data visualization tools (Kibana)

Anserini and Pyserini
- Toolkits for reproducible IR research (both sparse IR, backed by Lucene, and dense IR)

Whoosh
- Easy-to-use Python clone of Lucene (mimics Lucene's main elements). No active development.

Terrier IR Platform
- Open-source search engine, focused on research and experimentation in IR

Sparse IR Improvements

Query Term Expansion
- Query expansion (QE) and pseudo-relevance feedback (RF) expand the user's query with additional terms (improve recall and address the vocabulary mismatch problem)
- QE adds synonyms, hypernyms (is-a relationships) or related concepts taken from external resources (thesauri, ontologies)
- RF iteratively adjusts the query with new terms/weights taken from relevant results selected by the user (user feedback, e.g. using Rocchio's algorithm)

Learning to Rank (LTR)
- ML techniques applied to learn complex ranking models that improve ranking accuracy
- Trains models that combine features: relevance signals, query-document similarity, user interactions, ...

Query Language Improvements
- Approximate (fuzzy) queries, phrase and n-gram terms, term proximity queries, ...

Lab 1

Lab 1 Architecture
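As referenced from the "Okapi BM25 (II)" slide, here is a minimal Python sketch of BM25 scoring that follows the formula given there. The toy collection, the query and the parameter values (k1 = 1.5, b = 0.75) are assumptions chosen only for illustration; production systems would use a library such as Lucene (or Pyserini) instead.

```python
# Minimal BM25 scoring, following the formula on the "Okapi BM25 (II)" slide.
# Toy documents, query and parameter values (k1 = 1.5, b = 0.75) are
# illustrative only.
import math
from collections import Counter

DOCS = {
    "d1": "information retrieval with sparse representations",
    "d2": "bm25 is a probabilistic ranking function for information retrieval",
    "d3": "dense vectors capture semantic similarity",
}
K1, B = 1.5, 0.75

doc_tokens = {d: text.lower().split() for d, text in DOCS.items()}
N = len(doc_tokens)
avgdl = sum(len(toks) for toks in doc_tokens.values()) / N

def idf(term: str) -> float:
    """IDF(q_i) = ln((N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1)."""
    n_qi = sum(1 for toks in doc_tokens.values() if term in toks)
    return math.log((N - n_qi + 0.5) / (n_qi + 0.5) + 1)

def bm25(query: str, doc_id: str) -> float:
    """Sum the saturated, length-normalized TF x IDF contributions per term."""
    tokens = doc_tokens[doc_id]
    tf = Counter(tokens)
    score = 0.0
    for term in query.lower().split():
        numerator = tf[term] * (K1 + 1)
        denominator = tf[term] + K1 * (1 - B + B * len(tokens) / avgdl)
        score += idf(term) * numerator / denominator
    return score

query = "information retrieval ranking"
for doc_id in sorted(DOCS, key=lambda d: bm25(query, d), reverse=True):
    print(doc_id, round(bm25(query, doc_id), 3))
```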