CSE419 Information Retrieval Unit 1 PDF
Document Details
Uploaded by FinestClearQuartz
null
null
null
Tags
Summary
This document introduces information retrieval (IR) focusing on the concepts, evolution, and core problem of IR systems. It discusses the representation, organization, and retrieval of information. The document also provides an overview of user tasks, and defines fundamental data and information concepts.
Full Transcript
CSE419 Information Retrieval Unit1 Contributors Students of B.Tech. CSE-G, CSE-I, CSE-J, CSE-K Dr Isunuri Bala Venkateswarlu Reference 1 “Modern Information Retrieval: The Concepts and Technology Behind Search”, Ricar...
CSE419 Information Retrieval Unit1 Contributors Students of B.Tech. CSE-G, CSE-I, CSE-J, CSE-K Dr Isunuri Bala Venkateswarlu Reference 1 “Modern Information Retrieval: The Concepts and Technology Behind Search”, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Second Edition, (Pearson Education India, 2010) Reference 2 “Introduction to Information Retrieval”, C. Manning, P. Raghavan, and H. Schütze, (Cambridge University Press, 2008) Reference 3 Internet sources Note Material is for reference and students are recommended to read text books (Reference 1 and 2) along with the material. 1 Introduction to Information Retrieval(IR) Definitions: IR deals with representation, organization, storage of and access to information items. IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Information Retrieval (IR) focuses on finding relevant information from large, unstructured datasets, such as documents or multimedia files, based on user queries. Data that does not have clear, semantically overt, easy for computer structure is know as unstructured data. IR is meant for Unstructured or Semi-structured records. Point of views of IR study is as follows. Computer-centered: building of efficient indexes, processing user queries with high-performance and developing ranking-algorithms. Human-centered: study the behaviour of user, understanding their need and of determining how such understanding affects organization and operations of system. 1.1 Evolution of IR system Evolving from early library systems, modern IR powers search engines, digital libraries, and enterprise applications. The primary goal is to return relevant results by ranking documents through algorithms like TF-IDF, BM25, and machine learning-based models. Natural language processing (NLP) advancements, such as BERT, allow IR systems to better understand the intent behind queries, leading to more accurate and context-aware retrieval. While personalization and contextual search improve user experiences, challenges like privacy and algorithmic bias remain. Future advancements in deep learning and neural networks are expected to further refine IR systems, particularly in multimedia retrieval. Contemporary IR research includes: Modeling, Systems architecture, Web search, Text classification, User interfaces, Data visualization, and Filtering. 2 IR problem The primary goal of IR system is to retrieve all documents that are relevant to user query while retrieving few non-relevant documents as possible. Description of the user queries are given as input to IR system as 1 shown in Figure 1. Then, the IR system searches and produced the relevant information or documents as the output (based on the user query). Figure 1: General IR system Consider, some of the sample query descriptions are as follows: Final all documents that address the role of Federal Government of financing operation of National Railroad Transportation Corporation. Find all pages containing information on college tennis teams which: (1) are maintained by an university in the USA and (2) participate in NCAA tennis tournament. Find all students in each class: (1) whose attendance percentage greater than 90% and having 9 CGPA (2) able to develop simple applications. The core Information Retrieval (IR) problem is centred on finding relevant documents or information from a vast collection of resources based on a user’s query. The problem becomes complex due to several factors, such as natural language ambiguity, user intent interpretation, and the enormous amount of data available. Thus, the users of modern IR systems have information need of varying complexity. User must first translate this information into “Query” which can be processed by search engine (or IR System). Translation yields set of key words or index terms which summarizes the description of information need. Emphasis is on the retrieval of “information” instead of “data”. Difference between Data and Information. Data: Facts and statistics collected together for reference or analysis; Raw and Unprocessed; Unorga- nized and has no context. Information: Facts provided or learned about something or someone; Processed; Organized and has some context. Interpretation of document contents of information items according to user query. Relevance is of central importance in information retrieval. It depends on task being solved and its context. It can change with time or location or device 2.1 User tasks User is searching or querying for information of their interests as shown in Figure 2. Interest of the user may be poorly defined or inherently broad. For example, ”Documents about car racing?” is poorly defined query about car racing. Tasks of user might be of two distinct types: Searching or querying: user can give descriptive query to retrieve interested information; has more flexibility. Browsing or navigating: user need to select through navigation of provided options; has less flexi- bility. 2 Figure 2: User tasks 2.2 Data retrieval vs Information retrieval Data retrieval: Aims at retrieving all objects which satisfy clearly defined conditions such as regular expression or relational algebra expression; Deals with data that has a well defined structure and semantics; Single erroneous object among thousand retrieved objects means total failure. Information retrieval: Aims at retrieving information about a subject than with retrieving data which satisfies a given query; Deals with natural language text which is not always well structured and semantically ambiguous; The retrieved objects might be inaccurate and small errors are likely to go unnoticed. Figure 3: IR system 3 IR system An IR system is a set of algorithms and tools designed to solve the IR problem. IR system accepts user query as input and produces relevant documents as output. IR system consists of three process as shown in Figure 3. The details of each process are as follows. 1. Crawling process: Assemble document collection which can be private or crawled from the web. Crawler is responsible for collecting the documents. Collected documents are stored in disk storage 3 usually referred as Central repository. The document collection refers to the vast amount of data stored in the system, which can be in various formats such as text, multimedia, or structured records. These documents are indexed to enable fast retrieval. The collection may be static (rarely updated) or dynamic (frequently updated), depending on the use case. For example, search engines like Google continuously update their document collections with new web pages. 2. Indexing process: Central repository needs to be indexed for fast retrieval and ranking. Indexing is a key process that organizes documents in a way that facilitates quick retrieval. The idea is like a book’s index, where specific terms are mapped to the pages they appear on. In IR systems, an index is typically an inverted index, which maps each keyword to the set of documents that contain the keyword. 3. Retrieval and Ranking process: It consists of retrieving documents that satisfy either user query (searching) or click in hyperlink (browsing). Query is expanded (system query) and then processed against index to retrieve subset of all documents. Query processing transforms the user’s input into a form that can be matched against the indexed documents. A user might input a natural language query, but the IR system breaks it down into components, removing stop words and applying stemming techniques. The processed query is then compared to the indexed terms. This step is essential for improving the quality and relevance of retrieved documents. Retrieved documents are ranked and top documents are returned to user. Ranking is to identify documents that are most likely to be considered relevant by user. Evaluating quality of answer set is key step for improving IR system. Common evaluation procedure consists of comparing set of results produced with results suggested by human specialists. Collect feedback from users and use this information to change results. 4 The Web and Search interface The Web is an immense repository of information, constantly growing and evolving. Web is a collection of website and web pages. Hypertext markup language is used to design web pages. Hypertext allows reader to jump from one electronic document to another through hyper links. Thus, the hyper text follows non-linear reading method which is navigation from one page to other page. Now a days people can publish their ideas on web and reach millions of people over night. It has led to a freedom to publish ideas and hence this new era is known as e-Publishing era. Web search is today the most prominent application of IR and its techniques. Major impacts of web on search related tasks are as follows. Crawling: The first major impact of web on search is related to characteristics of document collection itself. The web collection is composed of documents or pages distributed over millions of sites and connected through hyperlinks. Thus, collecting all documents and storing copies of them in central repository is the primary task in IR indexing. This process is known as Crawling. Performance and scalability: The second major impact is size of collection and volume of user queries submitted on daily basis. The combination of very large text collection with a very high query traffic has pushed performance and scalability of search engines to limits that largely exceed those of any previous IR system. Noisy documents: The third major impact is prediction of relevant document over vast size of document collection. Query retrieves large number of documents that match its terms, which means that there are many noisy documents. Providing effective answers to queries requires identification of structured data associated with object of interest. Web advertising and other economic incentives led to web spams. Issues on the web: Electronic commerce is major trend on web. In electronic transaction, some critical information like credit card information need to be encrypted which need authentication process. Privacy is another issue which affects deployment of web and which has not been properly addressed. Copyright and patent rights are some other issues for web. 4 4.1 Interfaces for search The role of search user interface is to aid in searchers understanding and expression of their information. It also need to help users formulate their queries, select among available information sources, understand search results and keep track of progress of their search. Search interface should support a range of tasks from simple to complex. User interaction with search interfaces differs depending on the type of task. 4.1.1 Information lookup vs exploratory search Information lookup tasks are akin to fact retrieval or question answering, and are satisfied by short, discrete pieces of information: numbers, dates, names or names of files or web site. Standard web search interactions can work well for these. Exploratory search are learning and investigating type information seeking tasks. Learning searches require more than single query response pairs, and involve the searcher spending time scanning and reading multiple information terms and synthesizing content to form new understanding. Investigation refers to longer-term process that involves multiple interactions that take place over perhaps very long periods of time and many return results that are critically assessed before being integrated into personal and professional knowledge base. Classic vs Dynamic model of information seeking: The classic notation of information seeking process model is formulated as cycle considering of four main activities. – Problem identification – Articulation of information needs – Query formulation – Results evaluation More recent models emphasize dynamic nature of search process, noting that users learn as they document surrogates. This dynamic process is sometimes referred as Berry picking model of search. Navigation vs search: Navigation and browsing are used somewhat interchangeably to refer to strat- egy in which searcher looks at information structure and moves among views of available information, in sequence of scan and select operations. Browsing is often preferred because it is mentally less taxing to recognize piece of information that it is to recall or remember it. Observation of search process: One common observation is that users often reformulate their queries with slight modifications, since this can be easier than trying to specify query precisely in the first attempt. Another is that searchers often search for information that they have previously accessed and that users search strategies different when searching over previously seen materials. 4.2 Search interfaces There are several standard user interface components in search interface for high usability. Query specification: Once search staring point has been selected, the primary methods for search to express their information need are either entering words into search entry form or selecting links from directory or other information organization display. Typically in web queries, text is very short and multi-word queries like phrases are used often. Query specification interfaces: The standard interface for textual query is search box entry forms, in which user type query, activated by hitting return key on keyboard or selection button associated with the form. Some entry forms are divided into multiple components, allowing for more general free text query followed by form that filters query in some way. An innovation that has greatly improved query specification is the inclusion of dynamically generated list of query suggestions known as auto- complete/auto-suggest or dynamic query suggestions. Retrieval results display: When displaying search results, either documents must be shown in full or else searcher must be presented with some kind of representation of content of those documents. The document surrogate refers the information that summarizes the document, and is key part of success of search interface. Showing query terms in the context in which they appear in document improves users ability to gauge relevance of result known as KWIC (key word in context) or query-based summaries or query-oriented summaries or user-directed summaries. 5 Query reformulation: After query is specified and results have been produced, a number of tools exist to help user reformulate their query or take their information seeking process in new direction. One of the most important query reformulation techniques consists of showing terms related to query or to documents retrieved in response to query. A special case of this is spelling corrections or suggestions. Search interfaces are increasingly employing related term suggestions known as term expansion. Organizing search results: Searchers often express desire for user interface that organizes search results into meaningful groups to help understand results and decide what to do next. Two methods for grouping search results are popular: category systems and clustering. A category system is a set of meaningful labels organized in such a way as to reflect concepts relevant to domain. Clustering refers to grouping of items according to some measure of similarity. In document clustering, similarity is typically computed using associations and commonalities among features where features are typically words and phrases. 5 Visualizing search interface A visual representation can communicate some kinds of information much more rapidly and effectively than any other method. Visualization of inherently abstract information is more difficult and challenging. Aside from using icons and color highlighting, spatial layout in the form of lines, bars, circles and canvas views of information are common. The main interactive information visualization techniques include panning and zooming, distortion based views and use of animations. Visualization of search can be applied in following ways. 5.1 Visualizing boolean syntax Boolean query syntax is difficult for most users and is rarely used in web search. A common approach is to show Venn diagrams visually for Boolean queries. VQuery system is the more flexible version of this idea. VQuery system represents dis junction by sets of circles with in active area of canvas for AND operation. One problem with boolean queries is that they can easily end up with empty results or too many results. Figure 4: Visualization of Boolean queries with VQuery 6 5.2 Visualizing query terms within retrieval results In standard search results listings, summary sentences are often selected that contain query terms. In addition to that occurrence of these terms are highlighted or boldfaced where they appear in the title, summary and URL. TileBars interface is one of the best known for the query term highlighting. The documents are shown as horizontal glyphs with locations of query term hits marked along the glyph. The user is encouraged to divide query into its different facets, with once concept per line. Then, the horizontal rows within each documents representation show frequency of occurrence of query terms within each topic. A number of variations of TileBars display have ben proposed, including simplified version which shows only one sequence per query term, and color gradiaiton is used to show query term frequency. Other approaches to show query term hits within document collections include placing query terms in bar charts, scatter plots and tables. Another variation on the idea of showing query term hits within documents is to show thumbnails. Figure 5: Visualization using TileBars 5.3 Visualizing relationships among words and documents It uses an idea of placing words and documents on two dimensional canvas where proximity of glyphs represents semantic relationships among the the terms or documents. VIBE interface is one early version of it, where queries are laid out on plane and documents that contain combinations of queries are placed midway between icons representing those terms. Figure 6: Visualization using VIBE 7 Another variation of this idea is to map documents or words from very high dimensional term space down into two-dimensional plane. The documents or words fall with in that plane, using 2D or 3D maps. Figure 7: Visualization using 3D maps A more promising application of this kind of idea is in the layout of treasures terms, in a small network graph such as used in visual wordnet. Figure 8: Visualization using visual wordnet 5.4 Visualization of text mining The previous methods show that usability results for visualization in search are not particularly strong. It seems in fact that visualization is better used for purposed of analysis and exploration of textual data. Word 8 Tree show a piece of text concordance, allowing user to view which words and phrases commonly precede or follow a given word. Figure 9: Visualization using Word Tree 6 Inverted index and boolean queries Grepping is the simplest form of document retrieval is for a computer to do this sort of linear scan through documents. However, we need more efficient way due to following reasons. To process large document collections quickly To allow more flexible matching operations To allow ranked retrieval Indexing is the way to avoid linearly scanning the texts for each query is to index the documents in advance. Hence, indexing plays vital role in information retrieval. 6.1 Incidence matrix based indexing An incidence matrix is the most basic way to represent the presence or absence of terms in documents. Incidence matrix representation is one of the basic techniques to represent text data. A binary matrix is used for indexing and is creating using following steps. Generate unique words across all the documents. For each document, the corresponding cell of matrix is assigned by 1 if the term exists in the document otherwise 0 is assigned. 6.1.1 Types of incidence matrix There are two types of incidence matrix representation as follows. Forward index: is a data structure that stores mapping from documents to words i.e. directs you from document to word. In this method, rows represents documents or document ids and columns represent term or term ids. It is also known as document-term incidence matrix. Indexing is quite fast as it only append keywords as it move forwards. Searching is quite difficult as it has to look at every contents of index just to retrieve all pages related to word. 9 Inverted index: is a data structure that stores mapping from words to documents or set of documents i.e. directs you from word to document. In this method, columns represents documents or document ids and rows represent term or term ids. It is also known as term-document incidence matrix. Indexing is slow as it first checks that word is present or not and searching is very fast. Hence, IR system uses inverted index representation. Consider a collection of four documents with the following contents: Document 1 ”Brutus killed Caesar.” Document 2 ”Caesar was betrayed by Brutus.” Document 3 ”Brutus and Caesar fought in battle.” Document 4 ”Caesar won the battle.” Forward index incidence matrix for the above documents is: Docs/Terms brutus killed caesar betrayed fought battle won Document 1 1 1 1 0 0 0 0 Document 2 1 0 1 1 0 0 0 Document 3 1 0 1 0 1 1 0 Document 4 0 0 1 0 0 1 1 Inverted index incidence matrix for the above documents is: Terms/Docs Document 1 Document 2 Document 3 Document 4 brutus 1 1 1 0 killed 1 0 0 0 caesar 1 1 1 1 betrayed 0 1 0 0 fought 0 0 1 0 battle 0 0 1 1 won 0 0 0 1 6.2 Boolean queries Boolean retrieval refers to a query system that uses Boolean logic (AND, OR, NOT) to combine search terms. The incidence matrix can be used to answer Boolean queries. This method treats terms as either present or absent in documents, making retrieval precise but inflexible compared to ranked retrieval models. Basic boolean operations are as follows. AND: The AND operator retrieves documents that contain all specified terms. OR: The OR operator retrieves documents that contain any of the specified terms. NOT: The NOT operator excludes documents that contain the specified term. For example, the query ”brutus AND caesar” would correspond to finding documents where both ”Bru- tus” and ”Caesar” have a 1 in the corresponding cells. Query processing using forward index incidence matrix: brutus caesar brutus AND caesar Document 1 1 1 1 Document 2 1 1 1 Document 3 1 1 1 Document 4 0 1 0 Query processing using inverted index incidence matrix: Document 1 Document 2 Document 3 Document 4 brutus 1 1 1 0 caesar 1 1 1 1 brutus AND caesar 1 1 1 0 Disadvantages of term-document matrix: Adding new document is difficult which needs to add new columns. Incidence matrices are inefficient for large datasets. With millions of documents and terms, the matrix becomes too sparse (most entries are 0) and consumes too much memory. 10 6.3 Sort-Based Indexing (SBI) A more efficient alternative to the incidence matrix is the sort-based inverted index. Instead of storing the entire matrix, the system maintains a postings list for each term. Each postings list contains the document IDs where the term occurs, along with additional metadata like term frequency or positions within the document. Steps for sort-based index construction is as follows. Building an inverted index involves three key steps: Tokenization: Break down each document into individual terms (words). Punctuation and stopwords (common words like ”the” or ”is”) may be removed during this phase. Term-Document Pairs: Create a list of all terms and the documents in which they appear. It is also known as term-document dictionary. Sorting and Grouping: Sort the term-document pairs by term and group them into postings lists. Consider the same document collection as above, the inverted index with postings lists looks like this: term doc. freq. posting list battle 2 [3, 4] betrayed 1 brutus 3 [1, 2, 3] caesar 4 [1, 2, 3, 4 ] fought 1 killed 1 won 1 Query processing using sort-based inverted index: AND: It corresponds to the intersection of postings lists. Algorithm for intersection of two posting lists is as follows. For example: ”Brutus AND Caesar” retrieve the following document. brutus [1, 2, 3] caesar [1, 2, 3, 4] brutus AND caeser [1, 2, 3] OR: It corresponds to the union of postings lists. For example: ”brutus OR battle” retrieve the following documents. brutus [1, 2, 3] battle [3, 4] brutus OR caeser [1, 2, 3, 4] NOT: It corresponds to set subtraction. For example: ”caesar AND NOT battle” retrieve the following documents. caesar [1, 2, 3, 4] battle [3, 4] caesar AND NOT battle [1, 2] 11 6.4 Faster postings list intersection via skip pointers If the list lengths are m and n, the intersection takes O(m+n). We can do better than this with Skip list. It augmenting postings lists with skip pointers (at indexing time). Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results. Skip pointers only helps for AND queries, not for OR queries. Building√effective skip pointers is easy if an index is relatively static. Simple heuristic for placing skips: use P evenly-spaced skip pointers for postings list of length P. Figure 10: Sample skip list Figure 11: Intersection of skip lists Advantages: – Achieves sublinear list intersection time which is better than O(m+n). – Works better for static index. – Useful for AND queries. Disadvantages: – Needs space to store skip pointers. – The effectiveness depends on where we place skip pointers – More skip: shorted skip interval and more comparisons – Less skip: Less extra space more linear traversal and lesser comparison 12 7 Index construction Index construction or Indexing is the process or machine that performs it the indexer. Indexers compress and decompress intermediate files and the final index to save disk space. The indexer needs raw text, but documents are encoded in many ways which need addition decoding. Figure 12 visualizes process of index construction and the major steps in Inverted Index construction are as follows. – Collect the documents to be indexed. – Tokenization of the text. – Do linguistic pre-processing of tokens. – Index the documents that each term occurs in. Figure 12: Index construction steps 7.1 Tokenization The first step of processing is to convert this byte sequence into a linear sequence of characters is known as token stream generation or tokenization. Tokenization is the process of chopping character streams into tokens. Token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. Tokenization is critical because it helps to convert unstructured text into a structured form for easier data manipulation and analysis. It is one of the first and most critical steps in Natural Language Processing (NLP). Challenges or issues of tokinzation are language-specific and are as follows. – Document or text scanning order like left to right (like English language) or right to left (like Urdu language) as shown in Figure 13. Figure 13: Tokenization of Urdu text 13 – Handling punctuation and special characters. Consider the input text ”Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.”. Figure 14 depicts some problems with punctuation and special characters. Figure 14: Tokenization of English text – Dealing with languages that don’t use spaces (e.g., Chinese or Japanese). – Determine the document unit for indexing. Token type is the class of all tokens containing the same character sequence. Term is a type that is included in the IR system’s dictionary. Types of Tokenization are as follows. – Word Tokenization: The text is split into individual words. – Sentence Tokenization: The text is divided into sentences. For example, consider a text: ”Artificial Intelligence is transforming the world.”. The results of tokenization are as follows. – Word tokenization: [’Artificial’, ’Intelligence’, ’is’, ’transforming’, ’the’, ’world’] – Sentence tokenization: [’Artificial Intelligence is transforming the world.’] 7.2 Stemming and lemmatization In general, documents are going to use different forms of a word and families of derivationally re- lated words with similar meanings. Thus, the goal of this process is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. 7.2.1 Stemming Stemming involves reducing words to their root form by chopping off suffixes. The goal is to treat related words the same during analysis. Stemming is useful for applications where grammatical cor- rectness is less important, but text simplification is needed. Popular stemming algorithms are as follows. – Porter Stemmer: Uses a set of rules to iteratively trim suffixes from words. – Snowball Stemmer: An improved version of the Porter stemmer, designed for efficiency. – Lancaster Stemmer: More aggressive and can sometimes over-stem words, producing less in- tuitive roots. Benefits of stemming are: – Reduce dimensionality in text analysis by converting inflected words into a common base. – It Simplifies text and reduces computational load. Drawbacks of stemming are: – Can produce non-meaningful stems (e.g., “studies” becomes “studi”). 14 – Might lead to loss of context, as different words may have the same stem but different meanings (e.g., ”better” and ”bad” may be stemmed similarly). Example: – Words: [’eating’, ’eaten’, ’eats’] – Stemmed Output: [’eat’, ’eat’, ’eat’] 7.2.2 Lemmatization Lemmatization involves converting words to their base form (lemma) using the context and grammar of the sentence. It is more sophisticated than stemming and preserves the meaning of words. Lemma- tization uses a language’s vocabulary and morphological analysis. It requires understanding the part of speech (POS) of each word to accurately convert it into its lemma. Lemmatization is especially useful in search engines, chatbots, and applications requiring semantic understanding of text.Popular lemmatization algorithms are as follows. – WordNet Lemmatizer: A popular tool that maps words to their dictionary forms. – spaCy’s Lemmatizer: Another widely used library that provides more advanced lemmatization. Example: – Input words: [’am’, ’are’, ’is’] – Lemmatized Output: [’be’, ’be’, ’be’] 7.2.3 Stemming vs Lemmatization – Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time. Often includes the removal of derivational affixes. stemming most commonly collapses derivationally related words. Stemming chops off prefixes and suffixes without considering context. – Lemmatization usually refers to doing things properly with the use of a vocabulary and mor- phological analysis of words. Normally aiming to remove inflectional endings only. Lemmatization commonly only collapses the different inflectional forms of a lemma. Lemmatization is context- aware, ensuring that words are reduced to their meaningful base form. 7.2.4 Stop words removal Stop words are common words (e.g., ”the”, ”is”, ”in”) that usually do not contribute significantly to the meaning of a sentence and can be removed to reduce noise. some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. The general strategy for determining a stop word list is as follows. – Sort the terms by collection frequency. – Take the most frequent terms (often hand-filtered for their semantic content) A stop word list of 25 semantically non-selective words which are common in Reuters-RCV1 are shown in Figure 15. Figure 15: List of stop words 15 Example: – Original Text: ”This is a great day for AI.” – After Stop Words Removal: [’great’, ’day’, ’AI’] Stop word removal is used to reduce the number of words to process and focuses on key terms that convey the most meaning. NLTK and spaCy provide pre-built stop word lists for different languages. These lists can be customized to include domain- specific words (e.g., ”data” might be considered a stop word in some contexts). Some stop words may be necessary depending on the use case (e.g., in sentiment analysis, words like “not” are important). Stop words removal is essential in making NLP models more efficient by eliminating redundant or non-essential information. 7.2.5 Normalization Normalization refers to the process of transforming text into a standard format. It ensures uniformity and reduces variability in text that could cause processing errors. Key Techniques for normalization of text are as follows. – Lowercasing: Converting all characters to lowercase to avoid case-sensitive mismatches. – Removing Punctuation: Eliminates symbols like commas, periods, etc.s – Expanding Contractions: Converting ”don’t” to ”do not” to ensure uniformity. – Removing Special Characters: Stripping out non- alphabetical characters like #, , $, etc. – Spelling Correction: Standardizing the spelling of words. Uniform text helps in tasks like text classification and keyword extraction. Ensures better matching of terms for tasks like information retrieval. Normalization makes text more predictable and easier for models to process. Example: – Original Text: ”The U.S. has a GDP of $21.43 trillion.” – Normalized Output: [’the’, ’us’, ’has’, ’a’, ’gdp’, ’of’, ’trillion’] 7.3 Index construction methods Index construction is a crucial step in information retrieval systems, especially for efficiently searching large document collections. The goal of an index is to map content, such as terms, to document IDs that contain those terms, facilitating fast retrieval during queries. The design of indexing algorithms is governed by hardware constraints. – Caching: Access to data in memory is much faster than access to data on disk. – Disk seek time: To maximize data transfer rates, chunks of data that will be read together should therefore be stored contiguously on disk. – Buffer: Reading a single byte from disk can take as much time as reading the entire block. – Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). – Disk space is several orders of magnitude larger. – Fault tolerance is very expensive: it is much cheaper to use many regular machines rather than one fault tolerance systems 16 7.3.1 Sort-based Index Construction A common approach to index construction is sort-based. Key Steps in Index Construction are as follows. – First pass: compile the vocabulary ∗ Parsing documents one by one, accumulating postings (termID, docID pairs). – Second pass: construct the inverted index ∗ Sorting the accumulated postings by termID. ∗ Organize postings list: the docIDs for each term into a postings list and compute statistics like term and document frequency. Consider the documents collection: – Doc1: ”apple banana” – Doc2: ”apple orange” – Doc3: ”banana orange apple” Step 1: Generate Postings for each document, we extract the terms (words) and create (term, docID) pairs. – Doc1: (”apple”, 1), (”banana”, 1) – Doc2: (”apple”, 2), (”orange”, 2) – Doc3: (”banana”, 3), (”orange”, 3), (”apple”, 3) Step 2: Sort Postings by Term We collect all the postings and sort them by the term: – (”apple”, 1) – (”apple”, 2) – (”apple”, 3) – (”banana”, 1) – (”banana”, 3) – (”orange”, 2) – (”orange”, 3) Step 3: Write Sorted Blocks to Disk If the dataset is large, the sorted postings are written to disk in blocks, but since this is a small dataset, we can treat it as a single block. Step 4: Merge Sorted Blocks (if necessary) For large datasets, multiple blocks would be merged in this step to form the final inverted index. In this case, there’s only one block, so no merge is needed. Final Inverted Index: – apple: [Doc1, Doc2, Doc3] – banana: [Doc1, Doc3] – orange: [Doc2, Doc3] This final index maps each term to the list of document IDs where it appears, allowing for efficient retrieval of documents during search queries. 7.3.2 Blocked Sort-based Indexing (BSBI) The BSBI algorithm solves memory limitation issues by splitting the dataset into blocks that fit in memory. Each block is processed individually, and the block indices are eventually merged to form the final inverted index. Algorithm for BSBI is as follows: – Block Parsing: The collection is parsed into chunks, each containing approximately 10 million termID-docID pairs. Each block is designed to fit into memory. 17 Figure 16: BSBI algorithm – Inversion: For each chunk, termID-docID pairs are sorted in memory with time complexity O(N log N). After sorting, the termID is concatenated with the corresponding docID to form a postings list. – Block Writing: The inverted index for each block is written to disk as a separate file. At this stage, we have several intermediate sorted index files. – Merge: Once all blocks are processed, the intermediate block indexes are merged to create a global index. This is done using the two-way merge or multi-way merge algorithms. In two-way merge, every time only two block are merged to produce the inverted index and the process is repeated for all blocks. Figure 17 visualizes the two-way merge process. It is more efficient to do a multi-way merge, where you are reading from all blocks simultaneously. Open all block files simultaneously and maintain a read buffer for each one and a write buffer for the output file. In each iteration, pick the lowest termID that hasn’t been processed using a priority queue. Merge all postings lists for that termID and write it out. It read decent-sized chunks of each block into memory and then write out a decent-sized output chunk. Hence, it multi-way merge reduces disk seeks. Figure 17: Two-way merger process Consider documents collection: – Doc1: ”apple banana” – Doc2: ”banana orange” Step 1: Parse and collect postings We process each document as a separate block due to memory constraints, collecting the (term, docID) pairs (postings). Block 1: (Doc1) 18 – Doc1: (”apple”, 1), (”banana”, 1) Block 2: (Doc2) – Doc2: (”banana”, 2), (”orange”, 2) Step 2: Sort Postings in Each Block Now, we sort the postings within each block. Sorted Postings (Block 1): – (”apple”, 1), (”banana”, 1) Sorted Postings (Block 2): – (”banana”, 2), (”orange”, 2) Step 3: Write Each Block to Disk Each sorted block is written to disk after sorting. – Disk Block 1: (”apple”, 1), (”banana”, 1) – Disk Block 2: (”banana”, 2), (”orange”, 2) Step 4: Merge Sorted Blocks Once both blocks have been written to disk, we merge them to create the final inverted index by combining postings for the same terms across the blocks. Merged Postings: – apple: [Doc1] – banana: [Doc1, Doc2] – orange: [Doc2] 7.3.3 Single-pass In-memory Indexing (SPIMI) SPIMI uses single-pass for generation of index. The key ideas for index construction are as follows. – Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks. – Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur. With these two ideas we can generate a complete inverted index for each block. These separate indexes can then be merged into one big index. Algorithm for SPIMI is as follows. Figure 18: SPIMI algorithm Consider documents collection: – Doc1: ”apple banana” – Doc2: ”banana orange” 19 Step 1: Process Each Document and Build the Inverted Index In-memory Document 1: ”apple banana” – Parse the first term ”apple”: Create an entry in the dictionary: apple → [Doc1] – Parse the second term ”banana”: Create an entry in the dictionary: banana → [Doc1] At this point, the in-memory index looks like this: – apple: [Doc1] – banana: [Doc1] Document 2: ”banana orange” – Parse the term ”banana”: Since ”banana” already exists in the dictionary, append Doc2 to its posting list: banana → [Doc1, Doc2] – Parse the term ”orange”: Create an entry in the dictionary: orange → [Doc2] The in-memory index now looks like this: – apple: [Doc1] – banana: [Doc1, Doc2] – orange: [Doc2] Step 2: Write the In-memory Index to Disk Once the memory block is full or we have finished processing the documents, we write the current index to disk. For this small example, the in-memory index is written to disk in one pass. – apple → [Doc1] – banana → [Doc1, Doc2] – orange → [Doc2] 7.3.4 Distributed Indexing Web search engines use distributed indexing algorithms for index construction. For web-scale collec- tions, such as those managed by search engines like Google or Bing, distributed indexing is necessary. In a distributed system, the indexing task is divided across multiple machines to handle large document collections. A master machine directs the job, assigning tasks to various machines in parallel. Tasks are divided into parsers (which process documents) and inverters (which generate posting lists from parsed data). Distributed indexing ensures fault tolerance, scalability, and faster processing by utilizing multiple machines concurrently. Distributed indexing for a term-partitioned index is an application of MapReduce. MapReduce is a general architecture for distributed computing. – Maintain a master machine directing the indexing job – considered “safe”. – Break up indexing into sets of (parallel) tasks: Parsers, Inverters – Master machine assigns each task to an idle machine from a pool. Operations of Parser: – Master assigns a split to an idle parser machine. – Parser reads a document at a time and emits (term, doc) pairs. – Parser writes pairs into j partitions; Example: Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j = 3. 20 Operations of Inverter: – An inverter collects all (term,doc) pairs (= postings) for one term-partition. – Sorts and writes to postings lists. Figure 19: Distributed indexing MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipu- lation of key-value pairs. Steps in MapReduce are as follows. – The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. – For the reduce phase , we want all values for a given key to be stored close together, so that they can be read and processed quickly. – Collecting all values (here: docIDs) for a given key (here: termID) into one list is the task of the inverters in the reduce phase. – The master assigns each term partition to a different inverter - and, as in the case of parsers, reassigns term partitions in case of failing or slow inverters. – Finally, the list of values is sorted for each key and written to the final sorted postings list (“postings”’ in the figure). Figure 20: Distributed indexing Consider the documents collection: – Doc1: ”apple banana” – Doc2: ”banana orange” Distributed Indexing Process: Step 1: Distribution of Documents: 21 – Machine 1: Doc1 – Machine 2: Doc2 Step 2: Local Index Construction: Machine 1: – apple: [Doc1] – banana: [Doc1] Machine 2: – banana: [Doc2] – orange: [Doc2] Step 3: Combine Local Indices: – apple: [Doc1] – banana: [Doc1, Doc2] – orange: [Doc2] 7.3.5 Dynamic Indexing Till now the assumption is the document collection is static. What about if document collection is dynamic? The simplest way to achieve this is to periodically reconstruct the index from scratch. If there is a requirement that new documents be included quickly. Then, maintain two indexes: a large main index and a small auxiliary index that stores new documents. – The auxiliary index is kept in memory. – Searches are run across both indexes and results merged. – Deletions are stored in an invalidation bit vector. – We can then filter out deleted documents before returning the search result. – Documents are updated by deleting and reinserting them. – Each time the auxiliary index becomes too large, we merge it into the main index. The reason for keeping the auxiliary index is to reduce the number of disk seeks required over time. Dynamic indexing addresses the issue of updating an index over time as documents are added, deleted, or modified. This is particularly important for frequently changing collections, such as news sites or blogs. The simplest approach is to maintain a small auxiliary index for new documents, merging it with the main index periodically. Initial documents collection: – Doc1: ”apple banana” – Doc2: ”banana orange” Initial Index Construction: – Doc1: ”apple banana” – Doc2: ”banana orange” Initial Inverted Index: – apple: [Doc1] – banana: [Doc1, Doc2] 22 – orange: [Doc2] Adding a new documents: – Doc3: ”apple grape” Updating the Index: Step 1: Update the Index with New Document: – apple: Add Doc3 to the postings list. – grape: Create a new postings list for Doc3. Step 2: Updated Inverted Index: – apple: [Doc1, Doc3] – banana: [Doc1, Doc2] – orange: [Doc2] – grape: [Doc3] Deleting a Document: – Doc2 is removed from the index Updating the Index after Deletion: Step 1: Remove Doc2 from the Index: – banana: Remove Doc2 from the postings list. – orange: Remove Doc2 from the postings list. Step 2: Updated Inverted Index: – apple: [Doc1, Doc3] – banana: [Doc1] – orange: [] – grape: [Doc3] Computation complexity of auxiliary and main index: – T /n merges where T is of postings and n is size of auxiliary. – Index construction time is O(T 2 /n) as in the worst case a posting is touched T /n times. Issues with main and auxiliary indexes – Problem of frequent merges : poor performance during merge. – Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list. – We would need a lot of files – inefficient for OS. More efficient approaches, like the logarithmic merge, are also used to manage frequent updates while ensuring query processing remains fast. Maintain a series of indexes, each twice as large as the previous one. At any time, some of these powers of 2 are instantiated as follows. – Keep smallest (Z0) in memory 23 Figure 21: Logarithmic merge algorithm – Larger ones (I0, I1,... ) on disk – If Z0 gets too big (≥ n), write to disk as I0 – or merge with I0 (if I0 already exists) as Z1 – Either write merge Z1 to disk as I1 (if no I1) – Or merge with I1 to form Z2 Computation complexity of logarithmic merge: – Each posting is merged at most O(log(T /n)) times, so complexity is O(T log(T /n)). – So logarithmic merge is much more efficient for index construction. 7.4 Challenges of Large-Scale Index design Several issues arise when creating an inverted index for multiple datasets: – Memory Limitation: For large files, it is impossible to store all items and their associated file IDs in memory. – Disk I/O: Sorting and merging large files stored on disk requires efficient disk access to minimize expensive disk seeks. – Dynamic Dictionary: The dictionary mapping terms to termIDs grows dynamically as the data is processed. 8 Index compression Index compression is a key aspect of information retrieval (IR) systems, such as search engines. The goal is to reduce the amount of storage required for large indexes, which map terms to the documents in which they appear. Compressing these indexes allows systems to use less memory and disk space, improving the efficiency of query processing. Index compression is critical because, in large-scale IR systems, indexes can grow to an enormous size, especially for large document collections such as the web. Compressed indexes reduce I/O operations when searching and retrieving information, leading to faster search times, especially when indexes are loaded into memory. However, compression techniques 24 must ensure that the indexes are still easily searchable without needing to decompress everything. Advantaged of index compression. – Storage Efficiency: Indexes for large-scale text collections can be huge, so compressing them reduces the storage footprint. – I/O Reduction: A smaller index requires fewer I/O operations, leading to faster access times. – Memory Utilization: Compressed indexes can be stored in memory more efficiently, speeding up search queries. – Data Transfer: In distributed systems, compression reduces the amount of data transferred between nodes. If the main goal of compression is to conserve disk space, then the speed of compression algorithms is of no concern. But for improved cache utilization and faster disk-to-memory transfer, decompression speeds must be high. There are primarily two types of compression relevant to index compression in IR systems: – Lossless Compression: This is the primary form of compression used in IR systems. Lossless compression ensures that no information is lost during compression, meaning the original index can be fully reconstructed after decompression. Examples include Variable-Length Encoding and Delta Encoding. – Lossy Compression: This form of compression sacrifices some information for even higher compression rates. However, it is rarely used in indexing because full accuracy is needed for retrieval in most IR applications. Case folding, stemming, and stop word elimination are forms of lossy compression. 8.1 Statistical properties of terms in IR In practice, the vocabulary will keep growing with the collection size. The statistics properties of terms in IR can be estimated using two laws: Heaps’ law which estimating the number of terms and Zipf’s law which models the distribution of terms. 8.1.1 Heaps’ law Heaps’ law estimates vocabulary size as a function of collection size. Vocabulary size M as a function of collection size T (number of tokens). M ∝ T = k.T b Where, M is the size of the vocabulary, T is the number of tokens in the collection. Typical values: 30 ≤ k ≤ 100 and b ≃ 0.5. In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½. It is the simplest possible (linear) relationship between the two in log-log space which an empirical finding (“empirical law”). 25 log M = log k + b log T Motivation for Heaps’ law: simplest possible relationship between collection size and vocabulary size is linear in log-log space. For RCV1, log10 M = 0.49 log10 T + 1.64 is the best least squares fit. Thus, k ≃ 44 and b ≃ 0.49. Good empirical fit for Reuters RCV1. For first 1,000,020 tokens, law predicts 38,323 terms which is actually 38,365 terms. 8.1.2 Zipf ’s law Heaps’ law gives the vocabulary size in collections. We also study the relative frequencies of terms. In natural language, there are a few very frequent terms and very many very rare terms. Zipf’s law: The ith most frequent term has frequency proportional to 1/i. cfi ∝ 1/i = K/i Where K is a normalizing constant, cfi is collection frequency: the number of occurrences of the term ti in the collection. If the most frequent term (the) occurs cf1 times, then the second most frequent term (of) occurs cf1/2 times. The third most frequent term (and) occurs cf1/3 times. In log-log plot, the zipf’s law can be represents Linear relationship between logcfi and logi which is power law relationship. logcfi = logK − logi 8.2 Index compression techniques Components of an Inverted Index: To better understand index compression, let’s break down the two main components of an inverted index. – The Dictionary: Contains all unique terms in the corpus. Each term is associated with a postings list. – Postings List: A list of document IDs where the term appears, often accompanied by other data such as the term’s position in the document and term frequency. These components need to be efficiently compressed to reduce the overall size of the index while maintaining retrieval speed. 8.2.1 Dictionary Compression In information retrieval, a dictionary contains all the unique terms (words) from the documents in a collection. Dictionary compression focuses on reducing the size of this dictionary. – Dictionary as fixed-width string: Simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries. For Reuters-RCV1: M x (20 + 4 + 4) = 400000 x 28 = 11.2 MB for storing the dictionary in this scheme. Using 26 fixed-width entries for terms is clearly wasteful. The average length of a term in English is about 8 characters. Wasting twelve characters (or 24 bytes) in the fixed-width scheme. We have no way of storing terms with more than twenty characters. Figure 22: Fixed-width string – Dictionary as term pointers into string: Pointers mark the end of the preceding term and the beginning of the next. A pointer to the next term is also used to demarcate the end of the current term. This scheme saves us 60% compared to fixed-width storage - 24 bytes on average of the 40 bytes 12 bytes on average of the 20 bytes we allocated for terms before. We now also need to store term pointers. For example, the first three terms in this example are systile, syzygetic, and syzygial. In this new scheme, we need 400000 x (4 + 4 + 3 + 8) = 7.6 MB for the Reuters- RCV1 dictionary. We locate terms in the data structure by way of binary search in the (now Figure 23: Term pointers into string smaller) table. Thus, computational complexity assuming each dictionary term equally likely in query, average number of comparisons = (0+1+2+3+2+1+2+2)/8 =1.6 – Dictionary as blocked storage: We can further compress the dictionary by grouping terms in the string into of size k and keeping a term pointer only for the first term of each block. Each term is preceded by a byte encoding its length that indicates how many bytes to skip to reach subsequent terms. We store the length of the term in the string as an additional byte at the beginning of the term. We thus eliminate k-1 term pointers, but need an additional k bytes for storing the length of each term. For k=4, we save (k-1) x 3 = 9 bytes for term pointers, but need an additional k=4 bytes for term lengths. So the total space requirements for the dictionary of Reuters-RCV1 are reduced by 5 bytes per four-term block, or a total of 400000 x 1/4 x 5 = 0.5 MB. Binary search down to 4-term block, then linear search through terms in block. Thus, average comparisons = (0+1+2+3+4+1+2+3)/8 =2 – Dictionary as blocked storage and front coding: Front coding: One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. A common prefix is identified for a subsequence of the term list and then referred to with a special character. 27 Figure 24: Dictionary search without blocked storage Figure 25: Blocked storage Figure 26: Dictionary search with blocked storage 8.2.2 Postings List Compression The postings list is the core part of the inverted index, containing lists of document IDs where each term appears. Compression of these lists is critical for index efficiency. The postings list is where most of the space in the index is consumed, as it stores the document IDs for each term. The goal of postings list compression is to reduce the size of the document IDs and other term-related metadata. We can observe that the postings for frequent terms are close together. Key idea is gaps between postings are short, requiring a lot less space than 20 bits to store. Gaps for the most frequent terms are mostly equal to 1. – Variable byte (VB) encoding: It uses an integral number of bytes to encode a gap. The last 7 28 Figure 27: Front coding (a) (b) Figure 28: Variable byte encoding bits of a byte are “payload” and encode part of the gap. The first bit of the byte is a continuation bit. It is set to 1 for the last byte of the encoded gap and to 0 otherwise. To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1. We then extract and concatenate the 7-bit parts. Instead of storing absolute document IDs, the differences (gaps) between consecutive document IDs are stored. For example, instead of storing document IDs [10, 14, 20], we store [10, 4, 6] as the gaps between document IDs. Since gaps are smaller than absolute values, they can be compressed more effectively. Consider a sample VB encoding. docIDs 824 829 215406 gaps - 5 214577 VB code 00000110 10111000 10000101 00001101 00001100 10110001 With VB compression, the size of the compressed index for Reuters-RCV1 is 116 MB as we verified in an experiment. The idea of VB encoding can also be applied to larger or smaller units than bytes: 32-bit words, 16-bit words, and 4-bit words or nibbles. Word sizes smaller than bytes get even better compression ratios at the cost of more bit manipulation. For most IR systems variable byte codes offer an excellent tradeoff between time and space. – Gamma coding: We can achieve better compression ratios by using bit-level encodings, in particular two closely related encodings: Gamma codes, and Delta codes. VB codes use an adaptive number of bytes depending on the size of the gap. Bit-level codes adapt the length of the code on the finer grained bit level. The simplest bit-level code is unary code. The unary code of n is a string of n 1s followed by a 0. How efficient can a code be in principle? ∗ Assuming the 2n gaps G with 1 ≤ G ≤ 2n are all equally likely, the optimal encoding uses n bits for each G. ∗ So some gaps (G = 2n in this case) cannot be encoded with fewer than log2 G bits. Our 29 ∗ goal is to get as close to this lower bound as possible. A method that is within a factor of optimal is Gamma encoding. Gamma codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset. Offset is G in binary, but with the leading 1 removed. For example, offset of 13 (binary 1101) is 101. Length encodes the length of offset in unary code. For 13, the length of offset is 3 bits, which is 1110 in unary. The Gamma code of 13 is therefore 1110,101 (Unary code of length, Offset code), the concatenation of length 1110 and offset 101. A Gamma code is decoded by first reading the unary code up to the 0 that terminates it. For example, the four bits 1110 when decoding 1110,101. Now, we know how long the offset is: 3 bits. The offset 101 can then be read correctly and the 1 that was chopped off in encoding is prepended: 101 → 1101 = 13. The length of offset is log2 G bits and the length of length is log2 G + 1, so the length of the entire code is 2log2 G + 1bits. Gamma codes are always of odd length and they are within a factor of 2 of what we claimed to be the optimal encoding length log2 G. We derived this optimum from the assumption that the 2n gaps between 1 and 2n are equiprobable. In general, we do not know the probability distribution over gaps a priori. The characteristic of a discrete probability distribution P that determines its coding properties (including whether a code is optimal) is its entropy H(P). Entropy is a measure of uncertainty for a probability distribution P over two possible outcomes, namely, X = {x1 , x2}. Entropy is maximized (H(P)=1) for P(x1) = P(x2) = 0.5 when uncertainty about which xi will appear next is largest. Entropy is minimized (H(P)=0) for P(x1) =1, P(x2)=0 and for P(x1) =0, P(x2)=1 when there is absolute certainty. It can be shown that the lower bound for the expected length E(L) of a code L is H(P) if certain conditions hold (see the references). It can further be shown that for 1 < H(P ) < ∞, Gamma encoding is within a factor of 3 of this optimal encoding, approaching 2 for large H(P): A code like Gamma code with the property of being within a factor of optimal for an arbitrary distribution P is called universal. In addition to universality, Gamma codes have two other properties that are useful for index compression. ∗ Prefix free: no gamma code is the prefix of another. This means that there is always a unique decoding of a sequence of gamma codes - and we do not need delimiters between them, which would decrease the efficiency of the code. ∗ Parameter free: For many other efficient codes, we have to fit the parameters of a model (e.g., the binomial distribution) to the distribution of gaps in the index. 30 9 Phrases and Phrasal queries A phrase is defined as a sequence of words that together convey a specific concept or meaning. In IR, phrases are important because users often search for content that reflects a specific multi- word idea, rather than individual words. A phrasal query is a search query that looks for an exact sequence of words in a document, in the same order as specified. In most search engines, this is done by enclosing the words within double quotation syntax (“stanford university”) for phrase queries. Postings lists to be simply lists of documents that contain individual terms are not sufficient for phrasal queries. Approaches for handling phrase queries are Bi-word indexes and Positional indexes. Key features are as follows. – Exact match requirement: The search engine will return results where the specified word sequence appears exactly as it does in the query. – Enhanced specificity: Phrasal queries allow users to narrow down their search to find highly specific information. – Proximity handling: Some search engines allow users to define proximity-based queries, where words in the phrase may be separated by a certain number of words while still matching the query. 9.1 Bi-ward indexes Consider every pair of consecutive terms in a document as a phrase. For example: Friends, Romans, Countrymen Biwords: friends romans — romans countrymen Steps for bi-wards generation: – First, we tokenize the text and perform part-of-speech-tagging. – We can then group terms into nouns, including proper nouns, (N) and function words, including articles and prepositions, (X), among other classes. – Now deem any string of terms of the form NX*N to be an extended biword. – Each such extended biword is made a term in the vocabulary. Example: ”cost overruns on a power plant” Bi-word query: “cost overruns” AND “overruns power” AND “power plant” It can be extended to longer sequences of words, and if the index includes variable length word se- quences. Searches for a single term are not naturally handled in a biword index. Disadvantages: – We also need to have an index of single-word terms. – Always a chance of false positive matches. – Exhaustive biword dictionary greatly expands the size of the vocabulary. 9.2 Positional indexes For each term in the vocabulary, we store postings of the form docID: < position1, position2,... >. Each posting will also usually record the term frequency. Where each position is a token index in the document. Accordingly, algorithm for intersection of two posting list need be modified such that it should consider the position of term also. Many Boolean search systems also support proximity operators, which allow for more flexible queries by specifying that terms must occur close to each other in a document. For example, the proximity operator /k specifies that two terms must appear within k words of each other. Consider the query ”Brutus /3 Caesar”. This means that ”Brutus” and ”Caesar” must appear within 3 words of each other in the document. Positional indexes can be used for such proximity queries, biword indexes cannot. 31 Figure 29: Sample positional posting list representation Figure 30: Positional posting list intersection algorithm 9.3 Combination scheme Biword indexes and positional indexes can be fruitfully combined. A combination strategy uses: a phrase index/biword index, for certain queries and a positional index for other phrase queries. 10 Wildcard queries Wildcard query is a query that uses regular expressions and pattern matching-based searching in text using wild card symbols like *, +, and ?. For example: ”*a*e*i*o*u*” which seeks documents containing any term that includes all the five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term. For example: the query ”automat*” would seek documents containing any of the terms ”automatic”, ”automation” and ”automated”. Wildcard queries are used in any of the following situations: 32 – the user is uncertain of the spelling of a query term. – the user is aware of multiple variants of spelling a term and (consciously) seeks documents con- taining any of the variants. – the user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming. – the user is uncertain of the correct rendition of a foreign word or phrase. Types of wildcard queries: – Trailing wildcard: A query such as ”mon*” is known as a trailing wildcard query, because the * symbol occurs only once, at the end of the search string. – Leading wildcard: A query of the form ”*mon” is known as leading wildcard query. – General wildcard: A query of the form ”se*mon” is know as general wild card query. Express the given wildcard query qw as a Boolean query Q on a specially constructed index, such that the answer to Q is a superset of the set of vocabulary terms matching qw. 10.1 Permuterm indexes First special index for general wildcard queries is permuterm index into our character set, to mark the end of a term. The term “hello” is shown here as the augmented term hello$. Next, we construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term. We refer to the set of rotated terms in the permuterm index as the permuterm vocabulary. Figure 31: Permuterm index of ”hello” How does this index help us with wildcard queries? – Consider the wildcard query ”m*n”. – The key is to rotate such a wildcard query so that the * symbol appears at the end of the string - thus the rotated wildcard query becomes n$m*. 33 – Next, we look up this string in the permuterm index, where seeking n$m* (via a search tree) leads to rotations of (among others) the terms man and moron. The permuterm index enables us to identify the original vocabulary terms matching a wildcard query, we look up these terms in the standard inverted index to retrieve matching documents. We can thus handle any wildcard query with a single * symbol. 10.2 K-gram indexes A K-gram index is a special type of index used in information retrieval systems to support efficient and flexible string matching. It works by breaking down words into small substrings known as k-grams, which are sequences of k consecutive characters. We can use a special character $ to denote beginning and ending of term. Full set of 3-grams for “castle” is : [$ca, cas, ast, stl, tle, le$]. In k-gram index, dictionary contains all k-grams that occur in any term in vocabulary. Each postings list points from a k-gram to all vocabulary terms containing that k-gram. For example, the 3-gram “etr” would point to vocabulary terms such as “metric” and “retrieval”. Consider using 3-gram index described for query red*: – First issue boolean query $re AND red to 3-gram index may result “retired” – Post-filtering step: in which terms enumerated by boolean query on 3-gram index are checked individually against original query ”red*” through string matching. Search engines allow combination of wild card queries using boolean operators for example “red*d AND fe*ri” Benefits of K-gram Indexes: – Wildcard search efficiency: K-gram indexes are particularly helpful for handling wildcard queries, such as “net”*, which might match words like ”network”, ”networking”, etc. – Spelling correction: By capturing parts of a word, k-gram indexes can help detect minor spelling mistakes in queries and suggest correct alternatives. – Substring search: They support efficient substring matching within words, allowing users to find documents that match only part of a word or phrase. *** 34