Information Retrieval Systems (IRS) Lecture
UNESWA
Summary
This lecture introduces Information Retrieval Systems (IRS), detailing their purpose, processes, and different models. It explains the importance of indexing documents and formulating effective queries to retrieve relevant information. The lecture also compares different retrieval models like exact match and best match.
Full Transcript
Information Retrieval Systems (IRS)

Recap

An IRS is a system (in most cases a software programme) that stores and manages documents (information), often textual documents but possibly multimedia (Hiemstra, 2009).

PURPOSE
- Store information in a well-organised way.
- Provide indexes to the stored information.
- Accept user queries.
- Search the information.
- Retrieve the documents required by the user; in other words, provide users with documents that will satisfy their information need.
- (Some systems) evaluate the importance of all query results.
- Support users in browsing document collections.

The 3 processes supported by an IRS

There are three basic processes an IRS has to support:
1) The representation of the content of the documents
2) The representation of the user's information need
3) The comparison of the two representations

The representation of the content of the documents
- Representing the documents is usually called the indexing process; in other words, the indexing process results in a representation of the document.
- The aim of indexing is to give a brief description of the document and label it with a few keywords.
- Indexing is the manual (by an individual, the indexer) or automated (taking place on-line) process of making statements about a document.
  - The end user of the IRS is not directly involved in the indexing process.
- The indexer captures what the document is about and indicates other features of interest to users.
- Indexing involves the selection and assignment of terms to (or extraction of terms from) a document in order to indicate the subjects (topics), features, or possible uses of the document.
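As a rough illustration of automated indexing by term extraction, the sketch below builds a small index that maps each extracted term to the documents it came from. The documents, the stop-word list, and the function names are hypothetical and are not part of the lecture; a real indexer would also deal with word forms, phrases, and controlled vocabularies.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real indexer would use a much fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for"}

def extract_terms(text):
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def build_index(documents):
    """Map each index term to the set of document ids it was extracted from."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in extract_terms(text):
            index[term].add(doc_id)
    return dict(index)

# Made-up documents, for illustration only.
documents = {
    1: "Training a puppy in the first year",
    2: "The kitten and the puppy: feeding guide",
    3: "Caring for a kitten",
}
index = build_index(documents)
print(sorted(index["puppy"]))   # [1, 2]
print(sorted(index["kitten"]))  # [2, 3]
```

Each document ends up represented by a handful of terms, and the index can be consulted without reading the documents again.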

The representation of the user's information need
- Users do not search for information just for fun; they search for information that would satisfy their information need.
- They search when they need information, and they look for information that will meet that need.
- The full description of the user's information need is not necessarily a good query to submit to the IRS; it is not effective to type the whole information need as is.
- The user translates the information need into a search query (a set of keywords). This process of representing the information need is often referred to as the query formulation process.
  - The resulting representation of the information need is the query.
- In a broad sense, query formulation (or the query) denotes the complete interactive dialogue between the IRS and the user.
  - A query can specify text words or phrases that the IRS should look for.
  - The user modifies the query so that it yields the largest number of relevant results.

Comparison of the two representations
- This process is called the matching process.
- The IRS compares (matches) the representation of the information need (the query) with the document representations (the indexed documents) and displays the documents found; the user then selects the documents that are relevant to his/her information need.
- During the matching process, the features in the query are used to predict document relevance.
  - In exact match, the system finds the documents that match all the conditions of a query.
  - In best match, the system finds documents that match most or all of the conditions in the query.

Exact match systems vs best match systems

Exact Match
- In an exact match system, the query is matched to the indexed documents in a strict manner and only documents that match the query exactly are retrieved.
- Every document either matches or fails to match the query.
- Exact match systems display a list of retrieved documents that is not ranked by relevance to the user's query; e.g. retrieved documents could be sorted alphabetically or by date.

Best Match
- Best match systems allow some leeway (flexibility): they retrieve not only documents that match the user's query exactly but also those that match it fairly well.
- Best match systems rank the retrieved documents according to how relevant they are to the user's query.
  - The result is a ranked list of documents, e.g. from the most relevant to the least relevant.
- Users (more especially in exact match) will go through this document list in search of the information they need.
- Ranked retrieval (best match) will hopefully put the relevant documents towards the top of the ranked list, minimising the time the user has to invest in reading the documents.

INFORMATION RETRIEVAL MODELS

What is a model?
- General definition: a graphical or physical representation of a concept, relationship, or system that shows what it looks like or how it works.
- There are two reasons for having IR models:
  1) Models provide the means for academic discussion;
  2) Models can serve as a blueprint to implement an actual retrieval system.

Information retrieval models
Examples:
- Boolean
- Vector space
- Probabilistic model
- Latent semantic indexing
- Statistical model
- Inference network
Many models of IR have been developed, but in this module we will look at the Boolean model.

The Boolean model
- The Boolean model is the first model of Information Retrieval (IR).
- It is the most common IR model, used in many search engines, library OPACs, etc.
- This model provides exact matching: a document either matches the condition or it does not.
  - Simply put, the user's query is matched to the indexed documents in a strict manner and only documents that match the user's query exactly are retrieved.
- Retrieved documents are not ranked according to how relevant they are to the user's query.
- Query terms can be combined using Boolean operators.
- Formal Boolean operators are named after George Boole, a mathematician who lived in the 19th century. Boole defined three basic operators (Boolean operators): AND, OR and NOT.

The Boolean operator AND
- The AND operator combines search terms in such a way that all of the terms combined by AND in the query must appear in the document for it to be retrieved.
- e.g. the query puppy AND kitten will retrieve the set of documents that are indexed with both the term puppy and the term kitten.

The Boolean operator OR
- The OR operator combines terms in such a way that any of the terms combined by OR can be present in the document for it to be retrieved.
- e.g. the query puppy OR kitten will retrieve the set of documents that are indexed with either the term puppy or the term kitten, or both.

The Boolean operator NOT
- The Boolean operator NOT is also called the exclusion operator.
- This operator is used to exclude certain search terms from your query.

Advantages of the Boolean model
- It is easy to implement and computationally efficient. Hence, it is the standard and common model for current large-scale IRSs.
- The Boolean approach possesses great expressive power and clarity.
  - It enables users to express conceptual constraints to describe important linguistic features; e.g. when formulating queries, users can specify synonyms using the OR operator and exclude unwanted features/search terms using the NOT operator.
- Users are in control of the search results.
  - It offers techniques to broaden or narrow a query.
  - It gives users a sense of control over the system.
- It is immediately clear why a document has been retrieved given a query.
- If the resulting document set is either too small or too big, it is directly clear which operators will produce respectively a bigger or smaller set.
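The three operators described above can be sketched as set operations over the documents indexed with each term. This is a minimal illustration, not the lecture's own implementation: the index contents and function names are made up, and reading "a NOT b" as "a, excluding anything indexed with b" is an assumption about query syntax.

```python
# Hypothetical index: each term maps to the ids of the documents indexed with it.
index = {
    "puppy": {1, 2, 4},
    "kitten": {2, 3},
    "training": {1, 4},
}

def docs_for(term):
    """Return the set of documents indexed with the term (empty if unknown)."""
    return index.get(term, set())

def boolean_and(a, b):
    """Documents that appear in both sets."""
    return a & b

def boolean_or(a, b):
    """Documents that appear in either set, or in both."""
    return a | b

def boolean_not(a, b):
    """Documents in a that are not in b (assumed reading of 'a NOT b')."""
    return a - b

puppy, kitten = docs_for("puppy"), docs_for("kitten")
print(sorted(boolean_and(puppy, kitten)))  # [2]
print(sorted(boolean_or(puppy, kitten)))   # [1, 2, 3, 4]
print(sorted(boolean_not(puppy, kitten)))  # [1, 4]
```

In this sketch OR can only keep or enlarge a result set, while AND and NOT can only keep or shrink it, which is why it is clear which operator to reach for when a result set is too small or too big. The results are plain sets with no relevance order.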

Disadvantages of the Boolean model
- The main disadvantage is that it does not provide a ranking of retrieved documents. The model either retrieves a document or not.
  - In other words, it only gives inclusion or exclusion of documents, not rankings.
- Users need to spend more effort manually examining the returned sets of documents.
  - Sometimes that process is very labour intensive.
- (Sometimes) users find it difficult to construct effective Boolean queries, for several reasons.
  - The words AND, OR and NOT have a different meaning in a query than they have in natural language.
  - Thus, users make errors when they form a Boolean query, because they resort to their knowledge of English.

CLASS EXERCISE

The IR Problem
- The full description of the user's information need is not necessarily a good query to submit to the IRS.
  - It is not effective to type the whole information need as is.
- It is helpful for the user to translate the information need into a search query:
  - This translation process yields a set of keywords, or index terms, which summarises the user's information need.
  - It is also helpful to list synonyms or related terms of these keywords, so that if you don't find relevant documents with the search query using the keywords, you can try again using the synonyms and related terms.
  - Experiment and try several searches with different search terms.

Practical Examples

Example 1
Do college athletes get paid?
Main ideas (keywords) are: college, athletes, pay

Related terms or synonyms for the key search terms:

College      Athletes     Pay
University   players      salary
Tertiary     sports       wage
             football     compensation
             basketball   remuneration

Examples of a search query:
College AND Athletes AND pay
University AND players AND salary

Example 2
How do transportation choices affect the environment?
Main ideas (keywords) are: transportation, environment

Related terms or synonyms for the key search terms:

Transportation    Environment
Travel            Air quality
commuting         pollution
Motor vehicles    ecology
                  Ozone layer depletion

Example 3
Do violent video games cause aggressive behaviour in children?
Main ideas (keywords) are: violent video games, aggressive behaviour, children

Related terms or synonyms for the key search terms:

Violent video games    Aggressive behaviour    Children
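
One possible way to turn keyword and synonym lists like those in the examples above into a single Boolean query is to join the related terms for each main idea with OR and then join the main ideas with AND. The sketch below assumes that approach; the function name, the parentheses, and the quoting of multi-word phrases are hypothetical choices, and the exact syntax accepted differs from one IRS to another.

```python
def build_query(concepts):
    """Join the terms of each concept with OR, then join the concepts with AND."""
    groups = []
    for terms in concepts:
        # Quote multi-word phrases so they stay together (assumed syntax).
        quoted = ['"%s"' % t if " " in t else t for t in terms]
        group = " OR ".join(quoted)
        # Parenthesise only when a concept has more than one term.
        groups.append("(%s)" % group if len(terms) > 1 else group)
    return " AND ".join(groups)

# Terms taken from Example 1; each inner list is one main idea.
concepts = [
    ["college", "university", "tertiary"],
    ["athletes", "players"],
    ["pay", "salary", "wage"],
]
print(build_query(concepts))
# (college OR university OR tertiary) AND (athletes OR players) AND (pay OR salary OR wage)
```

Adding a synonym to a group broadens the search, and adding another main idea narrows it, so the term lists can be adjusted and the query rebuilt when a search returns too few or too many documents.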