Chapter 3 Indexing-2024.docx
Document Details
Tags
Full Transcript
Outline ![](media/image9.png) **Subsystem of IR system** **The two subsystems of an IR system:** - is an offline process of organizing documents using keywords extracted from the collection - Indexing is used to speed up access to desired information from document collection as per...
Outline ![](media/image9.png) **Subsystem of IR system** **The two subsystems of an IR system:** - is an offline process of organizing documents using keywords extracted from the collection - Indexing is used to speed up access to desired information from document collection as per users query - Is an online process that scans document corpus to find relevant documents that matches users query **Subsystem of IR system** ![](media/image8.png) **Indexing subsystem** **Subsystem of IR system** ![](media/image8.png) - Indexing and searching: are inexorably connected - you cannot search that was not first indexed in some manner or other - indexing of documents or objects is done in order to be searchable - There are many ways to do indexing - To index one needs an indexing language - There are many indexing languages - Even taking every word in a document is an indexing language - Knowing searching is knowing indexing **Subsystem of IR system** **Searching subsystem** Indexing: Basic Concepts ======================== ![](media/image8.png) - Indexing is an arrangement of index terms to permit fast searching and reducing memory space requirement - It used to speed up access to desired information from document collection as per users query such that - It enhances efficiency in terms of time for retrieval. - Relevant documents are searched and retrieved quick - Index file usually has index terms in a sorted order. - Index files are much smaller than the original file. - Remember **Heaps Law**: in 1 GB of text collection the vocabulary has a size of only 5 MB. This size may be further reduced by **Linguistic pre-processing** (or text operations). - The usual unit for indexing is the word How current search engines indexing? ==================================== - Indexes are built using a **web crawler**, which retrieves each page on the Web for indexing. - After indexing, the local copy of each page is discarded, unless stored in a cache. **Some search engines: automatically index** - All the index activities are done by the search engines - **such search engines include:** Google, AltaVista, Excite, HotBot, InfoSeek, Lycos **Some others: semi automatically index** - Partially human indexed, hierarchically organized - **Such search engines include:** Yahoo, Magellan, Galaxy, WWW Virtual Library **Common features** - allow Boolean searches How current search engines indexing? ==================================== ![](media/image8.png) Web search system #### Major Steps in Index Construction ![](media/image8.png) - **Source file**: Collection of text document - A document can be described by a set of representative keywords called index terms. - **Index Terms Selection**: apply text operations or preprocessing - - - - - #### Major Steps in Index Construction - **Term relevance weight**: Different index terms have varying relevance when used to describe document contents. - This effect is captured through the **assignment of numerical weights to each index term** of a document. - There are different index terms weighting methods: **TF, IDF, TF\*IDF**... - **Output**: a set of index terms (vocabulary) or indexing structure - **Indexing structure**: a set of index terms (vocabulary) are organized in **Index File** to easily identify documents in which each term occurs in. ![](media/image13.png) - **Index file** is a file consisting of a **list of index terms** and a link to one or more documents that has the index term - A good **index file** maps each keyword *K~i~* to a set of documents D*~i~* that contain the keyword - **Index file** usually has index terms in a sorted order. - The sort order of the terms in the index file provides an order on a physical file - An **index file** is list of search terms that are organized for associative look-up, i.e., to answer user's query: - In which documents does a specified search term appear? - Where within each document does each term appear? (There may be several occurrences.) - For organizing index file for a collection of documents, there are various options available: - Decide what data structure and/or file structure to use. - Is it sequential file, inverted file, suffix tree, etc.? ### ![](media/image8.png)Index file Evaluation Metrics - **Running time of the main operations** - **Indexing time** - **Access/search time** - How much is the running time to find the required search key from the list? - ##### Update time (Insertion time, Deletion time) - How much time does it take to update existing records in an attempt to add new terms or delete existing unnecessary terms? - Does the indexing structure allows incremental update or re- indexing? - ##### Space overhead - Computer storage space consumed for keeping the list. - **Access types supported efficiently.** - - ### Building Index file (1. Sequential File) - Sequential file is the most simple file structures. - It has no vocabulary as well as linking pointers. - The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field. i.e. - A particular **attribute is chosen as primary key whose** value will determine the order of the records. - When the first key fails to discriminate among records, a second key is chosen to give an order. ![](media/image8.png) - Given a collection of documents, they are parsed to extract words and these are saved with the Document ID. #### Sorting the - After all documents have been tokenized, stop words are removed, and normalization and stemming are applied, to generate index terms - These index terms in sequential file are sorted in alphabetical order #### ![](media/image8.png)Sequential File... - To access records search serially; starting at the first record read and investigate all the succeeding records until the required record is found or end of the file is reached. **Its main advantages are:** - - - **Its disadvantages:** - - - - - Inverted file ============= - A word oriented indexing mechanism based on sorted list of keywords, with each keyword having links to the documents containing it - Building and maintaining an inverted index is a relatively low cost risk - ![](media/image20.png)This list is inverted from a list of terms in location order to a list of terms in alphabetical order. - Data to be held in the inverted file includes : - The vocabulary (List of terms) - The occurrence (Location and frequency of terms in a document collection) - **The occurrence**: contains one record per term, listing - **Frequency of each term in a document,** i.e. count number of occurrences of keywords in a document - *TF~ij~,* number of occurrences of term *t~j~* in document **d***~i~* - *DF~j~,* number of documents containing term *t~j~* - *max~i~,* maximum frequency of any term in **d***~i~ (highest frequent terms in doc i)* - *N,* total number of documents in a collection - CF*~j~,*, collection frequency of *t~j~* in *n~j~* - **Locations/Positions of words in the text** #### Inverted Files..... **Why vocabulary**? - **Why location**? - - - **Why frequencies**? - - - Inverted File ------------- ### Construction of Inverted file - ##### A vocabulary file (Word list): - stores all of the **distinct** terms (keywords) that appear in any of the documents (in lexicographical order, i.e like that of a dictionary) and - For each word a pointer to a ***posting file*** - Records kept for each term *j* in the vocabulary (word list) contains the following: - term *j* - number of documents in which term *j* occurs (*DFj*) - Collection frequency of term j (Cf) - pointer to inverted (postings) list for term *j* ### Postings File (Inverted List) - For each distinct term in the vocabulary, the posting file stores a list of pointers to the documents that contain that term. - Each element in an inverted list is called a **posting**, i.e., the occurrence of a term in a document - Each list consists of one or many individual postings ![](media/image8.png)General structure of Inverted File ------------------------------------------------------- - The following figure shows the general structure of inverted index file. Organization of Index File -------------------------- ![](media/image8.png) -- -- -- -- -- -- -- -- ### Example: - Given a collection of documents, they are parsed to extract words and these are saved with the Document ID. #### ![](media/image15.png)Sorting the - After all documents have been tokenized the inverted file is sorted by terms - Steps - Extract the terms in each doc - Sort the terms - Compile the terms #### Remove stop words and compute frequency - Multiple term entries in a single document are merged and frequency information added #### Stemming & compute frequency - ![](media/image15.png)Multiple term entries in a single document are merged and frequency information added ![](media/image36.png) #### ![](media/image15.png)Vocabulary and postings file **posting** **CF** -- -- -------- **2** **1** **2** **3** **1** **3** **1** **1** **2** ### ![](media/image8.png)Searching on Inverted File - Since the whole index file is divided into two, searching can be done faster by loading vocabulary list which takes less memory even for large document collection - Using binary Search the searching takes logarithmic time - The search is done in the vocabulary lists - Updating inverted file is complex. - We need to update both vocabulary and posting files Example: Create Inverted file ============================= - **Create an inverted file (both the vocabulary list and the posting file) for the following document collection** -- -- -- -- Example: Create Inverted file ============================= ![](media/image8.png) - After text operation red color terms remain as index term -- -- -- -- - D1= department comput science establish - D2= department launch bsc comput study - D3= follow msc comput science start - D4= department produce phd graduat - D5= staff contribut intellect profession advance field #### ![](media/image8.png)Inverted Index - It is a data structure that stores mapping from words to documents or set of documents i.e. directs you from word to document. - Steps to build Inverted index are: - Fetch the document and gather all the words. - Check for each word, if it is present then add reference of document to index else create new entry in index for that word. - Repeat above steps for all documents and sort the words. - *Indexing is slow as it first checks that word is present or not.* - *Searching is very fast.* #### Example of Inverted index - It does not store duplicate keywords in index. ![](media/image1.png)