LIBS 894 - Information Retrieval


DISTANCE LEARNING CENTRE
AHMADU BELLO UNIVERSITY, ZARIA, NIGERIA

COURSE MATERIAL FOR
COURSE CODE & TITLE: LIBS 894 (INFORMATION RETRIEVAL)
PROGRAMME TITLE: MASTERS IN INFORMATION MANAGEMENT

ACKNOWLEDGEMENT
We acknowledge the use of the Courseware of the NOUN as the primary resource. Internal reviewers in the Ahmadu Bello University who extensively reviewed and enhanced the material have been duly listed as members of the Courseware development team.

COPYRIGHT PAGE
© 2018 Ahmadu Bello University (ABU) Zaria, Nigeria
All rights reserved. No part of this publication may be reproduced in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior permission of the Ahmadu Bello University, Zaria, Nigeria.
First published 2018 in Nigeria.
ISBN:
Ahmadu Bello University Press Ltd, Ahmadu Bello University Zaria, Nigeria.
Tel: +234
E-mail:

COURSE WRITERS/DEVELOPMENT TEAM
Editor: Prof. M.I Sule
Course Materials Development Overseer: Dr. Usman Abubakar Zaria
Subject Matter Expert: Dr (Mrs) Monsurat F. Mohammed
Subject Matter Reviewer: Abdullahi Hussaini
Language Reviewer: Enegoloinu Adakole
Instructional Designers/Graphics: Ibrahim Otukoya, Abubakar Haruna
Proposed Course Coordinator: Abdullahi Hussaini
ODL Expert: Dr (Mrs) Monsurat F. Mohammed

QUOTE
The retrieval process begins when a lack of information shows itself in a human mind and the decision is taken to find out if this information has been discovered and published.
Douglas John Foskett (1963)

CONTENT
Title Page
Acknowledgement Page
Copyright Page
Course Writers/Development Team
Table of Content
COURSE STUDY GUIDE
i. Course Information
ii. Course Introduction and Description
iii. Course Prerequisites
iv. Course Learning Resources
v. Course Objectives and Outcomes
vi. Activities to Meet Course Objectives
vii. Time (To complete Syllabus/Course)
viii. Grading Criteria and Scale
ix. OER Resources
x. ABU DLC Academic Calendar
xi. Course Structure and Outline
xii. STUDY MODULES
1.0 MODULE 1: Overview of Information Retrieval
Study Session 1: Data and Information
Study Session 2: Information Retrieval Definitions and Concepts
Study Session 3: Information Retrieval Systems
2.0 MODULE 2: Information Retrieval Models
Study Session 1: Information Retrieval Models - Introduction
Study Session 2: Boolean Model and Extended Boolean Model
Study Session 3: Vector Space Model and Fuzzy Model
Study Session 4: Probabilistic Model and Natural Language Model
3.0 MODULE 3: Query and Query Negotiation
Study Session 1: Query - an overview
Study Session 2: Query Structure
Study Session 3: Query Negotiation
4.0 MODULE 4: Information Retrieval Process and Evaluation
Study Session 1: Information Retrieval Process - an overview
Study Session 2: Search Strategy
Study Session 3: Evaluation of Retrieval Products
Study Session 4: Standard Metadata and their description

COURSE STUDY GUIDE

i. COURSE INFORMATION
Course Code: LIBS 894
Course Title: Information Retrieval
Credit Units: 2
Semester: Second

ii. COURSE INTRODUCTION AND DESCRIPTION
Introduction: You are welcome to LIBS 894. This is a 2-credit-unit, year one, second semester course. I will be your guide throughout this course. Feel free to ask questions whenever you have difficulties relating to the course; my door is always open.
Description: The course should engage you for about 15 weeks in a semester. Each study unit is designed to provide reading material for two to three hours of study.
This course does not require prior skills in any area of knowledge other than the general admission requirements. The approach at this level is to help you appreciate the basic characteristics of information, the need for good organisation of information, and the fundamental concepts of storage and retrieval. The rest of this guide tells you what you are expected to learn in this course, how to work through it, what it contains, what to expect from the exercises and assignments, and how to get the most out of the course.

iii. COURSE PREREQUISITES
You should note that although this course has no subject prerequisite, you are expected to have:
1. Satisfactory level of English proficiency
2. Basic Computer Operations proficiency
3. Online interaction proficiency
4. Web 2.0 and Social media interactive skills

iv. COURSE LEARNING RESOURCES
i. Course Textbooks
1. Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley.
2. Chowdhury, G.G. (2003). Introduction to Modern Information Retrieval. Neal-Schuman.
3. Sparck Jones, K. & Willett, P. (1997). Readings in Information Retrieval. Morgan Kaufmann.
4. Kowalski, G. & Maybury, M.T. (2005). Information Storage and Retrieval Systems. Springer.
5. van Rijsbergen, C.J. (2004). The Geometry of Information Retrieval. Cambridge UP.

v. COURSE OUTCOMES
After studying this course, you should be able to:
1. Explain the concepts of data, data processing, and information.
2. Distinguish between document and information and describe the processes of documentation.
3. Correctly assign key words to be used for retrieval purposes.
4. Describe the basic architecture of a computer and the role of computers in storage and retrieval.
5. Describe the storage media in use and the basic structure of records, files, and databases.
6. Explain the concepts of information retrieval.
7. Identify the information retrieval components.
8. Identify the Information Retrieval Models.
9. Discuss user characteristics and user needs, which are fundamental to information storage and retrieval.
10. Negotiate Queries using Reference Interview.
11. Describe different Methods of Querying.
12. Correctly analyse requests for information and formulate search strategies.
13. Retrieve information from an Information System.
14. Identify the factors that affect online search.
15. Optimize the retrieval process.
16. Evaluate the retrieval product.
17. Search the Internet.

vi. ACTIVITIES TO MEET COURSE OBJECTIVES
To go through this course you are required to read the study units, answer the self-assessment exercises and do the assignment in each unit. The self-assessment exercises are meant to help you reinforce what you have learnt. It will be very helpful to try to answer the questions first before looking at the answers. At the end of each unit, you will find an assignment, which will be marked by your tutor. Work diligently on it and submit your work to your tutor for grading. Individual assignments/tests, group assignments, discussions, quizzes and out-of-class engagements will constitute 40% of the total marks in this course.
You will need to have access to a computer and be familiar with the basic elements of a computer system. In due course, you will have practical exercises in the library and on the Internet, so you need to have access to these facilities.
It is expected that you will spend on average two to three hours studying one unit and about 15 weeks completing the whole course. However, you should realise that you are free to work at your own pace.
Specifically, this course shall comprise the following activities:
1. Studying courseware
2. Listening to course audios
3. Watching relevant course videos
4. Field activities, industrial attachment or internship, laboratory or studio work (whichever is applicable)
5. Course assignments (individual and group)
6. Forum discussion participation
7. Tutorials (optional)
8. Semester examinations (CBT and essay based).

vii. TIME (TO COMPLETE SYLLABUS/COURSE)
To cope with this course, you would be expected to commit a minimum of 3 hours weekly to the course.

viii. GRADING CRITERIA AND SCALE
Grading Criteria
A. Formative assessment
Grades will be based on the following:
    Individual assignments/test (CA 1, 2 etc.)            20
    Group assignments (GCA 1, 2 etc.)                     10
    Discussions/Quizzes/Out of class engagements etc.     10
B. Summative assessment (Semester examination)
    CBT based                                             30
    Essay based                                           30
    TOTAL                                                 100%
C. Grading Scale:
    A = 70-100
    B = 60-69
    C = 50-59
    D = 45-49
    F = 0-44
D. Feedback
Courseware based:
1. In-text questions and answers (answers preceding references)
2. Self-assessment questions and answers (answers preceding references)
Tutor based:
1. Discussion Forum tutor input
2. Graded Continuous assessments
Student based:
1. Online programme assessment (administration, learning resource, deployment, and assessment).

ix. LINKS TO OPEN EDUCATION RESOURCES
OSS Watch provides tips for selecting open source, or for procuring free or open software. SchoolForge and SourceForge are good places to find, create, and publish open software. SourceForge, for one, has millions of downloads each day. Open Source Education Foundation, Open Source Initiative, and other organisations like these help disseminate knowledge. Creative Commons has a number of open projects, from Khan Academy to Curriki, where teachers and parents can find educational materials for children or learn about Creative Commons licenses. They also recently launched the School of Open, which offers courses on the meaning, application, and impact of "openness."
Numerous open or open educational resource databases and search engines exist.
Some examples include:
- OEDb: over 10,000 free courses from universities as well as reviews of colleges and rankings of college degree programmes
- Open Tapestry: over 100,000 open licensed online learning resources for an academic and general audience
- OER Commons: over 40,000 open educational resources from elementary school through to higher education; many of the elementary, middle, and high school resources are aligned to the Common Core State Standards
- Open Content: a blog, definition, and game of open source as well as a friendly search engine for open educational resources from MIT, Stanford, and other universities with subject and description listings
- Academic Earth: over 1,500 video lectures from MIT, Stanford, Berkeley, Harvard, Princeton, and Yale
- JISC: Joint Information Systems Committee works on behalf of UK higher education and is involved in many open resources and open projects, including digitising British newspapers from 1620-1900

Other sources for open education resources
Universities
- The University of Cambridge's guide on Open Educational Resources for Teacher Education (ORBIT)
- OpenLearn from Open University in the UK
Global
- Unesco's searchable open database is a portal to worldwide courses and research initiatives
- African Virtual University (http://oer.avu.org/) has numerous modules on subjects in English, French, and Portuguese
- https://code.google.com/p/course-builder/ is Google's open source software that is designed to let anyone create online education courses
- Global Voices (http://globalvoicesonline.org/) is an international community of bloggers who report on blogs and citizen media from around the world, including on open source and open educational resources
Individuals (which include OERs)
- Librarian Chick: everything from books to quizzes and videos, including directories on open source and open educational resources
- K-12 Tech Tools: OERs, from art to special education
- Web 2.0: Cool Tools for Schools: audio and video tools
- Web 2.0 Guru: animation and various collections of free open source software
- Livebinders: search, create, or organise digital information binders by age, grade, or subject (why re-invent the wheel?)

x. ABU DLC ACADEMIC CALENDAR/PLANNER
[Calendar chart: the activities Registration, Resumption, Late Registration, Facilitation, Revision/Consolidation and Semester Examination plotted against the months January to December for Semester 1, Semester 2 and Semester 3.]
N.B:
- All Sessions commence in January
- 1 week break between Semesters and 6 weeks vacation at end of session.
- Semester 3 is OPTIONAL (Fast-tracking, making up carry-overs & deferments)

xi. COURSE STRUCTURE AND OUTLINE
Course Structure
For every Study Session listed below, the activities are:
1. Read Courseware for the corresponding Study Session.
2. View the Video(s) on this Study Session.
3. Listen to the Audio on this Study Session.
4. View any other Video/YouTube (address given below).
5. View referred Animation (address given below).

STUDY MODULE 1: Overview of Information Retrieval
Week 1. Study Session 1: Data and Information
    Video/YouTube: https://bit.ly/2C14W1y
    Animation: https://bit.ly/2HNFRZT
Week 2. Study Session 2: Information Retrieval Definitions and Concepts
    Video/YouTube: https://bit.ly/2NGfUOk
    Animation: https://bit.ly/30DPN0C
Week 3. Study Session 3: Information Retrieval Systems
    Video/YouTube: https://bit.ly/1bhYPOq
    Animation: https://bit.ly/30DPN0C

STUDY MODULE 2: Information Retrieval Models
Week 4. Study Session 1: Information Retrieval Models: Introduction
    Video/YouTube: https://bit.ly/2Uk2WZo
    Animation: https://bit.ly/2Wk8WFx
Week 5. Study Session 2: Boolean Model and Extended Boolean Model
    Video/YouTube: https://bit.ly/1hxT0ot
    Animation: https://bit.ly/2YJyIk8
Week 6. Study Session 3: Vector Space Model and Fuzzy Model
    Video/YouTube: https://bit.ly/2TqRNsu
    Animation: https://bit.ly/2HwVeai
Week 7. Study Session 4: Probabilistic Model and Natural Language Model
    Video/YouTube: https://bit.ly/2SGOJ79
    Animation: https://bit.ly/2jHWJFQ

STUDY MODULE 3: Query and Query Negotiation
Week 8. Study Session 1: Query - an overview
    Video/YouTube: https://bit.ly/2tPPAs1
    Animation: https://bit.ly/30Ap7hv
Week 9. Study Session 2: Query Structure
    Video/YouTube: https://bit.ly/2EJntRR
    Animation: https://bit.ly/30EtqbG
Week 10. Study Session 3: Query Negotiation
    Video/YouTube: https://bit.ly/2Vtn6QO
    Animation: https://bit.ly/2HKpLAk

STUDY MODULE 4: Information Retrieval Process and Evaluation
Week 11. Study Session 1: Information Retrieval Process - an overview
    Video/YouTube: https://bit.ly/2Vtn6QO
    Animation: https://bit.ly/30DPN0C
Week 11. Study Session 2: Search Strategy
    Video/YouTube: https://bit.ly/1FtyP54
    Animation: https://bit.ly/2JA0M5U
Week 11. Study Session 3: Evaluation of Retrieval Products
    Video/YouTube: https://bit.ly/2tLZsTI
    Animation: https://bit.ly/2HQ9Osj
Week 12. Study Session 4: Standard Metadata and their description
    Video/YouTube: https://bit.ly/2TxoAfk and https://bit.ly/2TvwqpO
    Animation: https://bit.ly/2HAn1Xy

Week 13: REVISION/TUTORIALS (On Campus or Online) & CONSOLIDATION WEEK
Week 14 & 15: SEMESTER EXAMINATION

Course Outline
MODULE 1: Overview of Information Retrieval
Study Session 1: Data and Information
Study Session 2: Information Retrieval Definitions and Concepts
Study Session 3: Information Retrieval Systems
MODULE 2: Information Retrieval Models
Study Session 1: Information Retrieval Models - Introduction
Study Session 2: Boolean Model and Extended Boolean Model
Study Session 3: Vector Space Model and Fuzzy Model
Study Session 4: Probabilistic Model and Natural Language Model
MODULE 3: Query and Query Negotiation
Study Session 1: Query - an overview
Study Session 2: Query Structure
Study Session 3: Query Negotiation
MODULE 4: Information Retrieval Process and Evaluation
Study Session 1: Information Retrieval Process - an overview
Study Session 2: Search Strategy
Study Session 3: Evaluation of Retrieval Products
Study Session 4: Standard Metadata and their description

xii. STUDY MODULES
1.0 MODULE 1: Overview of Information Retrieval
Contents:
Study Session 1: Data and Information
Study Session 2: Information Retrieval Definitions and Concepts
Study Session 3: Information Retrieval Systems

STUDY SESSION 1
Data and Information
Section and Subsection Headings:
Contents
1.0 Learning Outcomes
2.0 Main Contents
2.1- Data
2.1.1- Sources of Data
2.1.2- Data Processing
2.2- Information
2.2.1- Value of Information
3.0 Study Session Summary and Conclusion
4.0 Self-Assessment Questions and Answers
5.0 Additional Activities (Videos, Animations & Out of Class activities)
6.0 References/Further Readings

Introduction
You are welcome to Study Session 1. We will explore the concepts of data and information. You will learn how to use the two terms properly, because we have found that many people use them wrongly, and how to appreciate the value of data and information.
1.0 Study Session Learning Outcomes
After you have finished studying this session, you should be able to:
1. Explain the meaning of Data
2. Explain the concept of Data Processing
3. Explain the meaning of Information
4. Distinguish between Data and Information

2.0 Main Content
2.1 Data
What do you understand by the term data? The word "data" is the plural form of the word "datum". Data may be regarded as symbols or figures that have potential value or to which meaning can be given.
Now, let us consider how people record events. Think of the old man in the village who made nine strokes of chalk on the lintel of his front door to remind him that it was exactly nine hundred naira that he borrowed. Each time he paid back one hundred naira, he wiped away one stroke; when he was able to pay back a multiple of one hundred naira, he cleaned off the corresponding number of strokes. Would you not consider such a man well organised? He kept accurate data. He could always tell how much of his debt was outstanding by counting the number of strokes left. Those strokes might not mean anything to someone else, but they meant a lot to the old man.

Fig 1.1.1: Data

Look at the table below and try to make some meaning out of it.

Table 1.1.1: Three Ways of Recording Data
Day          Column 1    Column 2      Column 3
Sunday       ***         !!!!          12
Monday       *****       !!!!! !!!!    3
Tuesday      *****       !!!! !!!      3
Wednesday    *******     !!!!! !       3
Thursday     *****       !!!!!         5
Friday       ***         !!!!!         6
Saturday     *****       !!!           7

You may not know the events that were recorded, but you can see the frequency of occurrence for each day of the week. Column 1 shows how a little boy recorded the number of cars that came to his father's house during a particular week. Column 2 shows the number of cows a dealer sold in the space of one week; the record was kept by his son. Column 3 shows the number of phone calls a young lady received during a particular week.
Let us look at another table below.

Table 1.1.2: Data from a Local Government Election
Ward    Number of Registered Voters    Number of Votes
1       1,567                          1,345
2       833                            823
3       1,234                          1,090
4       1,400                          1,390
5       1,556                          1,460
6       1,456                          1,240
7       2,034                          1,900
8       1,902                          1,890
9       1,890                          1,790
10      987                            970
11      456                            450
12      1,340                          1,322

Table 1.1.2 above captures the statistical data of the last Local Government election in the 12 wards of a Local Government.
Now consider the votes on four motions in a state house of assembly. The following data were generated:

Table 1.1.3: Data from the Votes on Four Motions in a State House of Assembly
            YES    NO    ABSENT
Motion 1    18     7     2
Motion 2    11     12    3
Motion 3    13     8     6
Motion 4    9      17    1

2.1.1 Sources of Data
1. Internal Source: When data are collected from the reports and records of the organization itself, the source is known as an internal source. For example, a company publishes its 'Annual Report' on Profit and Loss, Total Sales, Loans, Wages etc.
2. External Source: When data are collected from outside the organization, the source is known as an external source. For example, if a Tour and Travels Company obtains information on 'Karnataka Tourism' from Karnataka Transport Corporation, that would be an external source of data.

In-text Question
What is Data?
Answer
Data may be regarded as symbols or figures that have potential value or to which meaning can be given.

2.1.2 Data Processing
To put it simply, data processing is what we do to data in order to make some sense out of them. There are many ways of processing data. We may just inspect data and be able to see a pattern in them. From such a pattern we can make a statement about what has happened and even take a decision. This is visual inspection. Let us apply it to the set of data in Table 1.1.1. The data in column 1 show that more cars came to the house on Wednesday than on any other day. Column 2 not only tells us that more cows were sold on Monday, but also that sales declined during the rest of the week.
From column 3, we can say that the young lady received the highest number of phone calls on Sunday. We can add that she is likely to receive more calls during the weekend.
We can build more stories on the data in Table 1.1.2. In each ward, we can compare the number of registered voters with the number of those that voted. Naturally, not everyone who registered would have voted. We can compute the voting rate for each ward as the percentage of the registered voters who voted. We can find the ward with the highest voting rate and the one that comes next. We can also rank the wards according to the number of registered voters and, on the basis of the ranking, predict in which wards to expect the highest number of voters. Now, here is a sticky point: what do you say when the number of voters is higher than the number who registered? A big mistake somewhere, or what?
Now let us turn to Table 1.1.3. Can you say something about the popularity of the four motions? Obviously, Motion 1 was the most popular. Motion 2 must have been highly controversial. Motion 4 was highly unpopular.
Besides visual inspection and rearrangement or sorting, data processing could take the form of arithmetic processes such as computing the sum or mean. Statistical techniques may also be used to show how much the data deviate from some reference value, or to show the relationships within the data. Then we are able to make statements about the data as well as generalise our observations to other similar situations. Suppose we found that out of 420 boys that sat for mathematics in SSSE, 303 passed with credit and above, whereas out of 392 girls who sat for it, the number who passed with credit and above was 121. Then we can at least say that boys did much better than girls in mathematics in that school. If we collected our data from a number of schools across the country and got the same pattern, then we could say that in Nigeria boys are better in mathematics than girls, provided that the data were generated properly.
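The arithmetic described above can also be handed to a computer. The short Python sketch below is only an illustration and is not part of the original courseware: the names WARDS, percentage, rank_by_voting_rate and suspicious_wards are made up for this example. It copies the figures from Table 1.1.2, computes the voting rate for each ward, ranks the wards, flags any ward where votes exceed registered voters, and then works out the pass rates from the mathematics example.

```python
# Figures copied from Table 1.1.2: ward -> (registered voters, votes cast).
WARDS = {
    1: (1567, 1345), 2: (833, 823), 3: (1234, 1090), 4: (1400, 1390),
    5: (1556, 1460), 6: (1456, 1240), 7: (2034, 1900), 8: (1902, 1890),
    9: (1890, 1790), 10: (987, 970), 11: (456, 450), 12: (1340, 1322),
}


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def rank_by_voting_rate(wards):
    """Return (ward, voting rate) pairs, highest voting rate first."""
    rates = [(ward, percentage(votes, registered))
             for ward, (registered, votes) in wards.items()]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


def suspicious_wards(wards):
    """Return the wards where more votes were recorded than voters registered."""
    return [ward for ward, (registered, votes) in wards.items()
            if votes > registered]


ranked = rank_by_voting_rate(WARDS)
anomalies = suspicious_wards(WARDS)

# Pass rates from the mathematics example in the text.
boys_rate = percentage(303, 420)
girls_rate = percentage(121, 392)

for ward, rate in ranked:
    print(f"Ward {ward}: {rate:.1f}% of registered voters voted")
print("Wards with more votes than registered voters:", anomalies)
print(f"Boys with credit and above: {boys_rate:.1f}%")
print(f"Girls with credit and above: {girls_rate:.1f}%")
```

With the figures in Table 1.1.2, Ward 8 comes first (about 99.4%) and Ward 4 second (about 99.3%), and no ward records more votes than registered voters. The pass rates come to about 72.1% for boys and 30.9% for girls, which is the comparison the paragraph above makes in words.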
You may find it difficult to estimate the volume of information a person handles every day. Unless a person is sleeping, his or her brain is always busy processing data and handling information. The brain receives both data and information through the eyes, for instance from what the person reads or sees; through the ears, for instance from what the person hears or listens to; through the nose, for instance from what the person smells; and through the senses of touch (exposure and physical contact) and taste.
In essence, data processing is simply the conversion of raw data into meaningful information through a process. This process follows a cycle which is referred to as the Data Processing Cycle. The Data Processing Cycle is a series of steps carried out to extract useful information from raw data. The steps fall into six stages, namely: Collection, Preparation, Input, Processing, Output and Storage. Read the six stages of data processing at http://medium.com

2.2 Information
In the paragraphs above, we considered the processing of data. We found that, having done such processing, we were able to make some statements about the situations or events on which the data were obtained. Such statements could be used to guide our future response or action. This is the function of information: to guide a person on what to do, and how and when to do it. Consequently, information may be defined as a fact or set of facts that can influence a person's response in a given situation. Ideally, information reduces or even eliminates a person's sense of uncertainty. We could also define information as the outcome of data processing.

Fig 1.1.2: Information

Information is a vital resource. It would be impossible to live a normal life without having adequate information. Just imagine waking up one morning to find that you are alone in the house. Not even one other person is around in your house, in your compound, or in the neighbourhood.
A thousand questions are racing through your mind, but there is not a soul around to answer you. You decide to walk some distance, but still there is not a single person anywhere. What would you do? For how long would you be able to endure that experience?
In order to embark on tertiary education, you needed information to decide what programme to choose. After completing your application for admission, you waited with some anxiety for information on whether or not you had been given admission. You did not know what else to do until you got that information. A businessman would need up-to-date information on the market situation and on changes in government policies that may affect his business. He would normally consult his colleagues on a regular basis and share information with them.
Today, people are realising more and more the value of information, and so a science has developed around it. Information science is the science that deals with the generation, acquisition, organisation, storage, retrieval, dissemination and use of information, its characteristics, and its impact at the individual, corporate and societal levels.
You will agree that information plays a key role in every sphere of life. Information is at the core of the success of both individuals and corporate bodies in commercial and business enterprises. Information confers competitive advantage on those who have it over their counterparts who do not. Information gives power. The countries that have the capacity to acquire or generate and manage information use it to improve their socio-economic status and advance ahead of other nations that do not have such capacity. We who live in Third World countries do not seem to fully appreciate the value of information.
In many of these countries, most people in the civil service and in government seem to regard information as what government wants the citizenry to hear: the kind of releases made by the Minister of Information, or the news one hears on the radio and television or reads in the newspaper. That is information, quite all right, but it is "soft" information. The kinds of information that confer power in our age include scientific, technological, economic and developmental information.
The crucial importance of information has also dictated that both organisations and national governments take appropriate measures to put in place the infrastructure necessary for managing and possibly for controlling it. More and more investment is being made in the establishment of information systems. National, regional and global computer and telecommunication networks have been developed for the management and communication of information. The countries in the Third World have been talking of a new information order. That is a reaction against the domination of the global information industry by the western and industrialised countries.

2.2.1 Value of Information
The value of information is enhanced by its accuracy, relevance, timeliness, source and up-to-dateness, as well as its packaging format.
If you have any reason to doubt the accuracy of a piece of information, you would not like to act on it. As a matter of fact, inaccurate information could be more disastrous than no information. If it becomes necessary to verify information coming from a particular channel, the cost of such information will be higher, and that may discourage its use.
It is quite important that information be relevant to the purpose for which it is needed. If you would like to read something about the industrial revolution and a library staff member gives you a book on the colonisation of Africa, would you be pleased?
You may accept the book if you think you might have time to read it, but your need for information on the industrial revolution has not yet been satisfied. You may in fact reject the book. The book is still important and will be useful to someone else, but not to you at that point in time.
The other consideration is timeliness. For information to be useful, it must be received in good time to make a difference to what the person who receives it is actually doing. If you wanted the information on the industrial revolution in the course of preparing for an examination and you could not get it until after that examination, you would no longer attach much importance to it.
The source of information could be very important. The source could have a high level of credibility, so that the person using the information could do so with much confidence. Otherwise it could have only some qualified level of credibility. Certainly you would be more assured of the authenticity of information when you hear that it has come from an impeccable source.
It must be noted that information has to be presented in a way that makes it easy to use. A brief summary may be enough, and much better for a business executive than a voluminous report. A graphic representation may make more of an impression than pages of a statistical report. A video presentation could be better appreciated than a written version on a particular subject. The reverse may be the case in another situation.

In-text Question
What is Information?
Answer
Information may be defined as a fact or set of facts that can influence a person's response in a given situation.

3.0 Study Session Summary and Conclusion
This study session has explained the meaning of data with relevant examples, how data can be processed into useful information, and what information entails and why it is important. A tutor-marked assignment was given to test the students' ability to distinguish between Data and Information.
Self-assessment questions were provided for students' practice.

4.0 Self-Assessment Questions
1. What is data? Illustrate with relevant examples.
2. How do you process data to become useful information?
3. Name four ways of processing data. Give examples for each.
4. Explain the concept of information.
5. Name six factors that determine the value of information and explain them.
6. With relevant examples, explain the six stages of data processing.

5.0 Additional Activities (Videos, Animations & Out of Class activities)
a. Visit YouTube at https://bit.ly/2C14W1y. Watch the video and summarise it in one paragraph.
b. View the animation at https://bit.ly/2HNFRZT and critique it in the discussion forum.

6.0 References/Further Readings
Susan, A. (1972). An Introduction to Computers in Information Science. Metuchen, N.J.: Scarecrow.
Alexander, P.M. (2013). Towards reconstructing meaning when text is communicated electronically. Chapter 3: Data, Information and Meaning (pages 61-72). (http://repository.up.ac.za/dspace)

STUDY SESSION 2
Information Retrieval Definitions and Concepts

Section and Subsection Headings:
Contents
1.0 Learning Outcomes
2.0 Main Contents
2.1- Information and Documents
2.1.1- Characteristics of Documents
2.2- Information Retrieval Definitions and Concepts
2.2.1- Components of Information Retrieval
3.0 Study Session Summary and Conclusion
4.0 Self-Assessment Questions and Answers
5.0 Additional Activities (Videos, Animations & Out of Class activities)
6.0 References/Further Readings

Introduction
You are welcome to study session 2. In the previous session, you learnt the basic concepts of data and information, how data can be processed into useful information by following the Data Processing Cycle, and the value of information.
Since the value of information cannot be overemphasised, it is imperative to know how information is stored and retrieved. This study session therefore explains what information retrieval is and the components that make up the retrieval process. However, "document" will be a recurring term in this course, so we first need to understand its meaning and how it differs from information.

1.0 Study Session Learning Outcomes
After you have finished studying this session, you should be able to:
1. Distinguish between information and document
2. Describe the characteristics of documents
3. Define Information Retrieval
4. Explain the concept of Information Retrieval
5. Understand the different components of Information Retrieval

2.0 Main Content

2.1 Information and Documents
What information is, and how it differs from data, was discussed in detail in study session 1; refer to that session to read more on information. This section therefore dwells on the relationship between information and documents.

A document is a source of information, and it is as useful as the information in it. A letter from your uncle is a document. So is every other letter you receive. Your birth certificate is a document. It says something about your birth. Every book you buy and add to your collection is a document. The minutes of the meetings of your club are documents. Your photo albums and your photographs are more examples of documents. Your diary is another document. We can add to the list such documents as an issue of a newspaper or a magazine, an audio tape with music or speech, a video recording of any event, a film strip of a local festival, directories, sales promotion brochures and leaflets, paintings and maps.

Fig 1.2.1: Information and Documents

Now you can see that when we use the term "document" we are actually referring to a wide range of information sources.
A document is therefore the vehicle conveying information.

2.1.1 Characteristics of Documents
We shall now examine some of the characteristics of a document. The first fact to note is that every document must originate from somewhere. In many cases, one or more persons created the document. A person who creates a document is its author. If the document is a book, the author did all the work of writing it. The course material you are reading now was written by somebody, who is its author. A book may have one author or multiple authors. Sometimes a number of authors contribute sections or chapters of a book, while one of them or someone else compiles these contributions and edits them to produce the book. The latter is the editor. On the cover of such a book you will find something like:

Democracy in Nigeria
Edited by O.O Galu

Fig 1.2.2: Characteristics of Documents

Sometimes the originator of a document is not an individual or a group of individuals but an organisation, for example the National Universities Commission, the Ministry of Information, UNESCO, and so forth. That is a case of corporate authorship. Even though somebody in the organisation wrote the material, it was written in the name of the organisation. Later in this course you will see the importance of the author in locating information.

Another feature of a document is the place of publication. When a document has been produced in large quantity for sale or distribution, it is said to be published. The place where the work of preparing the document was done is the place of publication. Usually, the place of publication is the place where the publisher has its main office. However, a publisher may publish a document in more than one place. Take any book and inspect the title page or the reverse of the title page and you will find where the book was published. There you will also find the publisher. There are a number of well-known publishers in Nigeria.
There are also many little-known publishers.

After the author, the title of the document follows in a bibliographic record. The title is the name that the author gives to the document. The title could be in two parts: the main title followed by a sub-title, for example "Computer Concepts: A User Perspective". Besides identifying a document, the title could also give a glimpse into what the document is about. The title above suggests that the book is about computers. It must be noted that this is not always the case. If you see a document with the title "An Enterprise in Futility", would you be able to guess what it is about?

Another feature that may identify a document is the edition. A new edition comes out when a document is revised. Revision is usually necessary for the purpose of correcting errors in a document, enhancing aspects of the document, supplying new information, expanding the scope of coverage and so forth. Some documents go through many revisions, which generate successive editions. Two different editions of the same document are actually regarded as two different documents.

The date of publication is no less important. The date of publication says how recent the document is. In some subject fields, publications quickly become out of date.

Lastly, we should mention the unique identification number that every book should have. It is called the ISBN (International Standard Book Number). It is a 13-digit number, for instance 978-1-846-14792-0. The ISBN identifies not only the book but also the publisher and the country or language area of publication. Serial publications such as journals and magazines also have a unique identification number, called the ISSN (International Standard Serial Number).

You may wonder if every document has all the features described above. Not every document has them. A document that has not been published cannot have a publisher or a place and date of publication. We may classify information sources as published or unpublished.
The value of an information source depends less on whether or not it has been published than on the novelty and potential usefulness of the information it contains.

In-text Question
What is a document?

Answer
A document is a source of information, and it is as useful as the information in it.

2.2 Information Retrieval Definitions and Concepts
Information Retrieval has been defined in several ways that share essentially the same meaning. Here are several definitions of Information Retrieval:
1. A process of searching for specific information among a large collection.
2. A discipline concerned with the process by which queries presented to information retrieval systems are matched against a store of documents.
3. A process in which sets of records or documents are searched using queries or terms to find items which may help to satisfy the information need or interest of an individual or a group.

All three definitions above are acceptable for Information Retrieval; however, the last can be taken as more appropriate than the other two because it names more of the components that constitute the process of information retrieval, which will be discussed in detail in the next section.

Information can be in electronic or paper-based form, which means it can be retrieved both manually and electronically. However, this course deals more with electronic retrieval than with manual retrieval. In either case, the way users express their information needs affects what is retrieved. It is worth mentioning that information retrieval can take place through an intermediary, which could be a human or a computer. This is why information retrieval is also a discipline that people study.
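The third definition can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the `Record` class, its fields, the years and the sample records are hypothetical, and matching is a plain case-insensitive word overlap rather than the method of any particular system. It searches a set of records with query terms and returns the items that share at least one term with the query.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A hypothetical bibliographic record (fields chosen for illustration)."""
    author: str
    title: str
    year: int


def search(records, query):
    """Return records whose author or title shares a word with the query."""
    terms = set(query.lower().split())
    hits = []
    for rec in records:
        words = set(f"{rec.author} {rec.title}".lower().split())
        if terms & words:  # at least one query term occurs in the record
            hits.append(rec)
    return hits


# Made-up sample store of records (years are invented)
STORE = [
    Record("Galu", "Democracy in Nigeria", 2001),
    Record("Bello", "Computer Concepts", 1998),
    Record("Musa", "History of Nigeria", 2010),
]

results = search(STORE, "nigeria democracy")
for rec in results:
    print(rec.author, "-", rec.title)
```

Because a record is returned as soon as it shares any one term with the query, a broad query retrieves more items; a stricter rule (all terms must occur) would retrieve fewer. That trade-off is part of what a search strategy has to manage.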
In a computerized information system, as shown in Figure 1.2.3, a human with an information need interacts with an information retrieval system through an interface (normally hardware running software with many options). The user submits a query made up of search terms, and the system produces information items from the document store or database that match the user's information need.

Figure 1.2.3: Illustration explaining Information Retrieval (elements shown: information need; interface, i.e. hardware running software with many options; query made up of search terms; document store/database, where the user query is matched with items in the store; information items as the result)

2.2.1 Components of Information Retrieval
As mentioned in the preceding section, the information retrieval process consists of several components, which include the following:

1. The set of records or documents that needs to be searched
For any information to be retrieved, it must be stored somewhere, waiting to be found. This is where the set of records or store of documents comes in. Examples of document stores are databases in computerized systems; documents reachable through search engines such as Yahoo, Google and Google Scholar; institutional repositories; directories of open access journals or books, where electronic journals and books are stored; and specialized databases such as HINARI, JSTOR, ScienceDirect and so on.

2. The indexing and access method for the document set
Every document set is indexed and organised in some way for easy access. Depending on the indexing and access method for the document set, a user can retrieve information matching his/her information need(s). Examples are the Boolean method, truncation, natural language searching and so on.

3. The information need of the user
Every information searcher has a need. This need could be trivial or scholarly; either of the two requires you to search so as to be able to satisfy your information need.
If there is no information need, there can be no information retrieval.

4. Search Strategy
It is very important to be able to verbalize or express one's information need in a series of search statements. In order to retrieve information that matches a user's information need, the need must be broken down into search statements; this is known as the search strategy. A search strategy is a technique for carrying out a search in an information retrieval system. Expressing one's need well is essential in information retrieval, as it can determine the quantity, quality and relevance of the items retrieved.

5. Search Results
The sequence of information items presented as a result of a search carried out by a user with an information need, using a search strategy, is known as the search results. Once a user has posed a query to an information retrieval system in the form of a search statement, he/she expects a result in the form of information items that match his/her information need.

The diagram below illustrates the components of the Information Retrieval process.

Figure 1.2.4: Components of Information Retrieval (elements shown: information need; search statement (queries); method for search; set of records or document store; retrieved items (search results))

In-text Question
Define Information Retrieval

Answer
Information Retrieval is a process in which sets of records or documents are searched using queries or terms to find items which may help to satisfy the information need or interest of an individual or a group.

3.0 Study Session Summary and Conclusion
This study session has distinguished between information and documents. The characteristics of documents and their importance in information retrieval were also discussed. Several definitions and the components of information retrieval were enumerated and explained in detail with diagrammatic illustrations.
A tutor-marked assignment was given to test the students' ability to distinguish between information and documents. Self-assessment questions were provided for students' practice.

4.0 Self-Assessment Questions
1. Distinguish between information and documents
2. Clearly explain the term "Information Retrieval"
3. With relevant examples, explain each of the components of a retrieval process

5.0 Additional Activities (Videos, Animations & Out of Class activities)
a. Visit YouTube at https://bit.ly/2NGfUOk. Watch the video and summarise it in one paragraph
b. View the animation at https://bit.ly/30DPN0C and critique it in the discussion forum

6.0 References/Further Readings
Ricardo, B. and Berthier, R. (2015). Modern Information Retrieval. ACM Press New York. https://www.researchgate.net/publication/

STUDY SESSION 3
Information Retrieval System

Section and Subsection Headings:
Contents
1.0 Learning Outcomes
2.0 Main Contents
2.1- Information Retrieval System
2.1.1- Types of Information Retrieval System
2.2- Data/fact Retrieval System versus Information Retrieval System
3.0 Study Session Summary and Conclusion
4.0 Self-Assessment Questions and Answers
5.0 Additional Activities (Videos, Animations & Out of Class activities)
6.0 References/Further Readings

Introduction
You are welcome to study session 3. In the previous session, you learnt the distinction between information and documents, and were introduced to the concept and components of information retrieval. This session explains the meaning of an Information Retrieval System and its types. The differences between Data/fact Retrieval Systems and Information Retrieval Systems are also enumerated.

1.0 Study Session Learning Outcomes
After you have finished studying this session, you should be able to:
1. Explain the concept of an Information Retrieval System
2. Identify and explain the types of Information Retrieval System
3.
Identify the distinguishing factors of Data Retrieval and Information Retrieval Systems

2.0 Main Content

2.1 Information Retrieval System
An Information Retrieval (IR) system is a manual or computerized process for storing, organising and accessing information items. It can also be defined as any device which aids access to documents specified by subject, together with the operations associated with it (Vickery, n.d.). Hiemstra (2000) defined an IR system as a software programme that stores and manages information on documents. The documents can be books, journals, atlases and other records of thought, or any parts of such records: articles, chapters, sections, tables, diagrams or even particular words. The operations can range from simple visual scanning to the most detailed programming.

Fig 1.3.1: Information Retrieval System

An IR system is designed to retrieve any documents or information required by a user. An IR system stores units of information. Each unit is linked in the system to specifications of one or more documents or parts of documents. The user specifies a particular unit of information, and the system is designed to provide him with knowledge of the relevant items recorded in the system. This implies that an IR system does not change the knowledge of the user on the subject of his enquiry; it merely informs the user of the existence or non-existence and whereabouts of documents relating to the request (Aruleba, Akomolafe and Afeni, 2016). Retrieval is possible only for information that has been stored in the system.

The problem of information retrieval is basically that of selecting, from a collection of documents, those expected to be useful to the user. This is not much of a problem if the collection is small: a user can browse through the collection, examine each item and determine which one he wants.
The problem arises when the collection is so large that it must be organised for searching. The main issues of information retrieval are, first, the organisation of the collection and, second, retrieval. A searcher or requester must be able to describe the attributes of the documents he wishes to inspect, and the system in turn must be able to identify these attributes and search the collection based on them. In summary, an IR system does not inform the user about the subject of his enquiry. It only informs him of the existence and whereabouts of the documents relating to his request.

In an IR system, because the format and language used in storing documents are less constrained, the system can retrieve documents or document references, and can store a body of information which can be searched and retrieved in response to specific requests or needs. The ability of the system to do this depends to a large extent on how the documents have been stored. Indexing documents with controlled indexing terms allows some form of structure to be imposed on the organisation of documents in the system, and this is a convenient tool when searching is to be carried out.

A request put to an IR system that uses a controlled vocabulary is handled in the same way as an incoming document. The request is conceptually analysed and described by means of terms selected from the controlled vocabulary. The request, now in controlled language, is searched against the subject index, and a document is retrieved when a match or partial match occurs between the formalized, concise description of the document and the request. Searching is thus essentially a matching operation, and the subject index or controlled vocabulary is a tool that reduces the number of documents a searcher needs to look at in order to find one or more documents that will satisfy his/her information needs.
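The handling of a request described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a description of how any particular system works: the vocabulary, the subject index and the function names are all hypothetical. A request is first described with controlled terms, then matched against the subject index, and a document is retrieved on a full or partial match.

```python
# Hypothetical controlled vocabulary: maps free-text words to preferred terms.
CONTROLLED_VOCABULARY = {
    "cars": "automobiles",
    "automobile": "automobiles",
    "automobiles": "automobiles",
    "pollution": "air pollution",
    "smog": "air pollution",
    "engines": "engines",
}

# Hypothetical subject index: document id -> controlled terms assigned by an indexer.
SUBJECT_INDEX = {
    "doc1": {"automobiles", "air pollution"},
    "doc2": {"automobiles", "engines"},
    "doc3": {"libraries"},
}


def to_controlled_terms(request):
    """Describe a request using terms selected from the controlled vocabulary."""
    words = request.lower().split()
    return {CONTROLLED_VOCABULARY[w] for w in words if w in CONTROLLED_VOCABULARY}


def match(request):
    """Return (doc_id, number of shared terms) pairs, best match first."""
    query_terms = to_controlled_terms(request)
    hits = []
    for doc_id, doc_terms in SUBJECT_INDEX.items():
        shared = len(query_terms & doc_terms)
        if shared > 0:  # full or partial match
            hits.append((doc_id, shared))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(match("smog from cars"))
```

In this sketch the request "smog from cars" never mentions the indexer's own words, yet it still reaches doc1 because both sides are expressed in the same controlled terms. doc3 is never offered to the searcher, which is the reduction in documents to inspect that the text describes.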
The initial output from the search process may be in the form of citations from which the requester can make a selection. Subsequently, the requester can ask for the complete text of a selected item to be presented, if the system offers such a facility. IR systems can be categorised into two types, as discussed below.

2.1.1 Types of Information Retrieval System
There are usually two types of IR system: the Bibliographic Retrieval System and the Full Text Retrieval System.

Bibliographic Retrieval System: An IR system may retrieve the complete text of a document or a document surrogate such as an abstract or bibliographic citation. A system that retrieves surrogates can also be referred to as a Reference Retrieval System. This type of IR system normally contains references to the original documents, with details such as title, abstract, publisher, authors and sometimes the introduction. These are entered into the system and are usually searchable. The records sometimes also contain index terms assigned by human indexers. For example, a user (you) might request all bibliographic references dealing with "Information System Analysis and Design". The system will only bring out references matching the request, such as author name(s), year of publication and title, which can point the user to where the full text can be found.

Full Text Retrieval System: This type of IR system, on the other hand, contains the complete text of the documents, and the database is usually indexed based on the words present in the entire document. A system that ultimately provides the user or searcher with the full text of a document is properly called a Document Retrieval System.
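Indexing "based on the words present in the entire document" is commonly done with a structure known as an inverted index, which maps each word to the documents containing it. The sketch below shows one simple way to build and query such an index. The sample documents and function names are invented for illustration, and real systems usually add steps such as stemming and stop-word removal that are left out here.

```python
from collections import defaultdict

# Made-up full-text documents
DOCUMENTS = {
    1: "Information retrieval systems match queries to documents",
    2: "A database stores structured data",
    3: "Full text retrieval indexes every word in the documents",
}


def build_inverted_index(documents):
    """Map each word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def full_text_search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_inverted_index(DOCUMENTS)
print(sorted(full_text_search(INDEX, "retrieval documents")))
```

The index is built once, when documents are stored; each search then looks up only the query words instead of re-reading every document.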
The task of a full-text information retrieval (IR) system is to satisfy a user's information need by identifying the documents in a collection that contain the desired information (Brown, Callan, Croft, and Moss, 1993). A full-text search is distinguished from searches based on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Many websites and application programs (such as word processing software) provide full-text search capabilities. Some web search engines, such as AltaVista, employ full-text search techniques, while others index only a portion of the web pages examined by their indexing systems.

In-text Question
What is Information Retrieval System?

Answer
An Information Retrieval (IR) system is a manual or computerized process for storing, organizing and accessing information items.

2.2 Matching Document and Query
Suppose there is a store of documents and a person formulates a question (request or query) to which the answer is a set of documents satisfying the information need expressed by his question. He can obtain the set by reading all the documents in the store, retaining the relevant documents and discarding all the others. In a sense, this constitutes "perfect retrieval".

Fig 1.3.2: Matching Document and Query

However, an IR system is built on the concepts of query (q) and document (d). An IR system must match q to d, or d to q, depending on the user interface or the orientation of the system. Matching is what an IR system does: it matches user queries to documents stored in a database. Since matching q and d is the function of an IR system, all components of the system must support its matching ability.
Thus, an IR system must perform the following tasks:
1. Translating an information need into query terms (q)
2. Indexing documents into index terms (d)
3. Matching the query and documents
4. Ranking the results of matching
5. Managing the access to documents

2.3 Data/fact versus Information Retrieval System
There is a difference between a Data Retrieval (DR) system and an Information Retrieval (IR) system. The major difference is that a DR system retrieves the actual data desired, i.e. there must be an exact match between the query and the item retrieved. For example, today's date remains today's date and cannot be changed; likewise, the share price of Nigerian Breweries as of today, or the capital city of Nigeria as at today, has a single definite value. An IR system, by contrast, provides a set of information items that will most likely contain the information the user wants. The user then has to check the retrieved items to see whether they match his needs, which is a subjective judgement. There is always an element of uncertainty, because the documents searched lie on a continuum from irrelevant to relevant. A distinction is therefore needed, since the two kinds of system differ in the type of matching: exact match versus best match. This is illustrated in the table below.

Table 1.3.1: Distinction between Data Retrieval (DR) and Information Retrieval (IR) systems

1. DR: Deals with exact match, i.e. whether an item is present or absent in the system.
   IR: Is more concerned with partial and best match, i.e. items close to what the user desires.
2. DR: Probability is not considered; it focuses only on matching of items and thus is sensitive to small errors.
   IR: Focuses on relevance and thus is not significantly affected by small errors.
3. DR: Determines which documents contain the keywords given by the user, and therefore may not meet the information need of the user.
   IR: Retrieves information about a subject.
4. DR: A user of a DR system does not tolerate a single erroneous object among a thousand retrieved objects and perceives it as a total failure of the system.
   IR: Small errors could go unnoticed by the user of an IR system.
5. DR: Handles data with a well-defined structure and semantics.
   IR: Handles natural language texts, which are unstructured and ambiguous in semantics.
6. DR: Refers to the physical description of an item, which calls for objective indexing, like the name of an author.
   IR: Refers to the information contained or meant in an item, which calls for subjective indexing.
7. DR: In respect of the evaluation of retrieval output based on performance (precision and recall), DR is expected to have a performance of 100% recall and 100% precision.
   IR: IR could have a performance of 30% recall and 30% precision.

Note on item 7 of Table 1.3.1: Recall indicates the ability of the system to retrieve relevant items, while precision is the ability of the system to reject irrelevant items.

In-text Question
What are the functions of an IR system?

Answer
1. Translating an information need into query terms (q)
2. Indexing documents into index terms (d)
3. Matching the query and documents
4. Ranking the results of matching
5. Managing the access to documents

3.0 Study Session Summary and Conclusion
This study session has explained the Information Retrieval System and its types (bibliographic and full text). It also distinguished between Data Retrieval systems and Information Retrieval systems. A tutor-marked assignment was given to test the students. Self-assessment questions were provided for students' practice.

4.0 Self-Assessment Questions
1. What is Information Retrieval System?
2. Clearly explain Bibliographic and Full-text retrieval system.
3. Distinguish between Data/fact Retrieval System and Information Retrieval System

5.0 Additional Activities (Videos, Animations & Out of Class activities)
a. Visit YouTube at https://bit.ly/1bhYPOq. Watch the video and summarise it in one paragraph
b.
View the animation at https://bit.ly/30DPN0C and critique it in the discussion forum

6.0 References/Further Readings
Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley.
Chowdhury, G.G. (2003). Introduction to Modern Information Retrieval. Neal-Schuman.

MODULE 2
Information Retrieval Models

Contents:
Study Session 1: Information Retrieval Models: Introduction
Study Session 2: Boolean Model and Extended Boolean Model
Study Session 3: Vector Space Model and Fuzzy Model
Study Session 4: Probabilistic and Natural Language Model

STUDY SESSION 1
Information Retrieval Models: Introduction

Section and Subsection Headings:
Contents
1.0 Learning Outcomes
2.0 Main Contents
2.1- Information Retrieval Models at a glance
2.2- A Conceptual Model of Information Retrieval
3.0 Study Session Summary and Conclusion
4.0 Self-Assessment Questions and Answers
5.0 Additional Activities (Videos, Animations & Out of Class activities)
6.0 References/Further Readings

Introduction
You are welcome to module 2 of this course. In the previous session, you were introduced to Information Retrieval Systems, their types and the basic distinction between Data Retrieval Systems and Information Retrieval Systems. This session explains the concept of "Information Retrieval Models".

1.0 Study Session Learning Outcomes
After you have finished studying this session, you should be able to:
1. Explain Information Retrieval Models
2. Explain the concept of Information Retrieval Models

2.0 Main Content

2.1 Information Retrieval Models at a glance
The primary goal of information retrieval (IR) is to provide users (you) with those documents (including non-textual information, such as multimedia objects) that will satisfy their information need. You have to formulate your information need in a form that can be understood by the system.

Fig 2.1.1: Information Retrieval Models at a glance

Information seeking is a form of problem solving.
It proceeds according to the interaction among eight sub-processes:
1. Problem recognition and acceptance
2. Problem definition
3. Search system selection
4. Query formulation
5. Query execution
6. Examination of results (including relevance feedback)
7. Information extraction
8. Reflection/iteration/termination

To be able to perform effective searches, you have to develop the following expertise:
1. Knowledge about various sources of information
2. Skills in defining search problems and applying search strategies
3. Competence in using electronic search tools

Figure 2.1.1: Representation of General Model of Information Retrieval Process

The information need of the user can be understood as forming a pyramid, where only its peak is made visible by users in the form of a conceptual query (figure 3). The conceptual query captures the key concepts and the relationships among them. It is the result of a conceptual analysis that operates on the information need, which may be well or vaguely defined in the user's mind. This analysis can be challenging, because users are faced with the general "vocabulary problem" as they try to translate their information need into a conceptual query. The vocabulary problem refers to the fact that a single word can have more than one meaning and, conversely, the same concept can be described by surprisingly many different words. It has been shown that two people use the same main word to describe an object only 10 to 20% of the time. Further, the concepts used to represent the documents can be different from the concepts used by the user. The conceptual query can take the form of a natural language statement, a list of concepts that can have degrees of importance assigned to them, or a statement that coordinates the concepts using Boolean operators. Finally, the conceptual query has to be translated into a query surrogate that can be understood by the retrieval system.
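The three forms of conceptual query, and the translation into a query surrogate, can be illustrated with a small sketch. Everything in it is assumed for the example: the information need, the concept weights, the synonym table and the `to_boolean_surrogate` function are hypothetical, and the output is a generic Boolean string rather than the syntax of a specific retrieval system. Grouping synonyms with OR is one common way of easing the vocabulary problem.

```python
# Three hypothetical forms of the same conceptual query.
natural_language = "effects of air pollution on children's health"
weighted_concepts = {"air pollution": 1.0, "children": 0.8, "health": 0.5}
boolean_statement = "air pollution AND children AND health"

# Made-up synonym table: different words that describe the same concept.
SYNONYMS = {
    "air pollution": ["smog"],
    "children": ["child", "infants"],
}


def to_boolean_surrogate(concepts, synonyms):
    """Join concepts with AND; group each concept's synonyms with OR."""
    groups = []
    for concept in concepts:
        alternatives = [concept] + synonyms.get(concept, [])
        quoted = [f'"{term}"' for term in alternatives]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


surrogate = to_boolean_surrogate(weighted_concepts, SYNONYMS)
print(surrogate)
```

The weights in `weighted_concepts` are not used by this Boolean translation; a system that ranks its results could use them to favour documents matching the more important concepts.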
In-text Question
What is the goal of information retrieval?

Answer
The goal of information retrieval (IR) is to provide users with those documents (including non-textual information, such as multimedia objects) that will satisfy their information need.

2.2 A Conceptual Model of Information Retrieval
When you are dealing with text or multimedia documents, you should distinguish different views on these documents. Several subfields of computer science and related fields deal with documents, where most of the fields focus on one or two views and ignore the others. We will try to present an integration of the different perspectives. Four different views are distinguished below (http://www.is.informatik.uni-duisburg.de/bib/pdf):

External attributes: These comprise data that is not contained within the document, i.e. a user (you) looking at the document only may not see these values. External attributes contain information that is needed for certain types of processing of the document, e.g. the name of the creator of the document, access rights, or publication information. In digital libraries, this type of data is often called metadata.

Logical structure: The media data that is contained within the document, together with its internal structure, comprise the logical structure of a document. Usually, documents have a hierarchical structure (e.g. a book divided into chapters, chapters containing sections, sections consisting of subsections, which comprise paragraphs, images and tables). In this tree structure, the data is located in the leaves, where a leaf contains single media data only (e.g. text, graphics, images, audio, video, animation, 3D). Hypermedia links allow for non-hierarchical structures.

Layout structure: In order to show a document to a user, it must be presented on some kind of output medium (e.g. when a document is printed, we have a sequence of pages). Based on a so-called style sheet, the layout process maps the logical structure onto the output media.
The layout structure describes the spatial distribution of the data over the output media, e.g. the sequence of pages, which in turn are subdivided into rectangular areas (e.g. pageheader, footer, columns). This concept can be extended to time-dependent media (e.g. audio, video), where the layout structure describes the temporal and spatial distribution on an appropriate output device. Content deals with the meaning of a document (e.g.: What is the document about? What does it deal with?). The content is derived from the logical structure, in most cases by an automatic process. The content representation may have an internal structure, too, but often rather simple schemes are used. For instance, in text retrieval, content mostly is represented as a set of concepts. 54 When you want to perform information retrieval on multimedia documents, youhave to consider all these views, in order to allow for queries addressing each ofthese viewsseparately, as well as queries for combinations. Examples of querieswith respect tothe different views are: Give me all documents publishedlast month (attributes). Show me all books that have the string `XML' in the titleand contain more than 10 chapters (logical structure). Show me all articles thatare typeset in two columns, with a length of more than 15 pages (layout). Findall documents about image retrieval (content). Since IR focuses on content, youalso will prefer this view throughout this paper. However, since real applications Typically involve more than one view, there is a need for retrieval mechanisms that are not restricted to a single view. A variety of models have been proposed to make retrieval process more efficient. A model is an abstracts away from the real world, which uses a branch of mathematics and possibly a metaphor for searching. Each retrieval model is unique in itself and a classification based on only one component regardless of how major it could be, hides its true potentials. 
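The four views distinguished in this section, together with the example queries given for each, can be made concrete with a small data structure. The sketch below is only an illustration under stated assumptions: the class name `Document`, its field names and the three sample records are hypothetical and are not part of any particular retrieval system, and "last month" is fixed to May 2024 so that the example is reproducible.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Document:
    # External attributes (metadata): not visible in the document itself.
    creator: str
    published: date
    # Logical structure: reduced here to a type, a title and chapter titles.
    doc_type: str
    title: str
    chapters: list = field(default_factory=list)
    # Layout structure: how the document appears on the output medium.
    columns: int = 1
    pages: int = 0
    # Content: represented, as in text retrieval, by a set of concepts.
    concepts: set = field(default_factory=set)


# Hypothetical sample collection.
docs = [
    Document("A. Author", date(2024, 5, 10), "book", "XML in Practice",
             [f"Chapter {i}" for i in range(1, 13)], 1, 300,
             {"markup languages"}),
    Document("B. Writer", date(2024, 4, 2), "article", "Searching Pictures",
             [], 2, 18, {"image retrieval"}),
    Document("C. Editor", date(2024, 5, 21), "article", "Short Note on XML",
             [], 1, 4, {"markup languages", "image retrieval"}),
]

# Attributes: all documents published "last month" (assumed to be May 2024).
by_attributes = [d.title for d in docs
                 if (d.published.year, d.published.month) == (2024, 5)]

# Logical structure: books with 'XML' in the title and more than 10 chapters.
by_logical = [d.title for d in docs
              if d.doc_type == "book" and "XML" in d.title
              and len(d.chapters) > 10]

# Layout: articles typeset in two columns and longer than 15 pages.
by_layout = [d.title for d in docs
             if d.doc_type == "article" and d.columns == 2 and d.pages > 15]

# Content: all documents about image retrieval.
by_content = [d.title for d in docs if "image retrieval" in d.concepts]

# Combination of views: published in May 2024 AND about image retrieval.
combined = [t for t in by_attributes if t in by_content]
```

The last query shows why a retrieval mechanism should not be restricted to a single view: it combines an external attribute with the content view.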
IR models are few; as their limitations became apparent, they were extended and improved. The implementation of models varies even for similar data types, and this is a very critical issue in commercialized IR systems. The well-known information retrieval models include:
1. Boolean model
2. Extended Boolean Model
3. Document similarity
4. Probabilistic indexing
5. Vector space model
6. Probabilistic retrieval
7. Fuzzy set models
8. Inference networks
9. Natural Language models

However, only the Boolean and Extended Boolean Models, the Vector Space and Fuzzy Set Models, and the Probabilistic and Natural Language Models will be discussed, in study sessions 2, 3 and 4 of this module respectively.

In-text Question
What are the four different views distinguished in the conceptual model of information retrieval?

Answer
External attributes
Logical structure
Layout structure
Content

3.0 Study Session Summary and Conclusion

This study session has made the concept of the Information Retrieval Model easy to understand using the diagrammatic representation of the general model of the Information Retrieval Process. Different views regarding the conceptual model of Information Retrieval were also discussed. A tutor-marked assignment was given to test the students' understanding. Self-assessment questions were provided for students' practice.

4.0 Self-Assessment Questions
1. Explain the Information Retrieval Model.
2. Enumerate and clearly explain the different views on the conceptual model of Information Retrieval.
3. With the aid of a diagram, explain the general model of the Information Retrieval Process.

5.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube at https://bit.ly/2Uk2WZo/. Watch the video & summarise in 1 paragraph
b. View the animation at https://bit.ly/2Wk8WFx and critique it in the discussion forum

6.0 References/Further Readings
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley.
Chowdhury, G.G. (2003). Introduction to Modern Information Retrieval. Neal-Schuman.
Jones, K. S. & Willett, P. (1997). Readings in Information Retrieval. Morgan Kaufmann.
Kowalski, G. & Maybury, M.T. (2005). Information Storage and Retrieval Systems. Springer.
van Rijsbergen, C.J. (2004). The Geometry of Information Retrieval. Cambridge UP.

STUDY SESSION 2
Boolean Model and Extended Boolean Model

Section and Subsection Headings:
Contents
1.0 Learning Outcomes
2.0 Main Contents
2.1- Boolean Model
2.1.1- Advantages and Disadvantages of Boolean Model
2.2- Extended Boolean Model
3.0 Study Session Summary and Conclusion
4.0 Self-Assessment Questions and Answers
5.0 Additional Activities (Videos, Animations & Out of Class activities)
6.0 References/Further Readings

Introduction
You are welcome to study session 2. In the previous session, you were made to understand the Information Retrieval Model as a concept. This session and the subsequent ones in this module will discuss the different types of IR models. We will discuss the Boolean Model and the Extended Boolean Model, together with their advantages and disadvantages.

1.0 Study Session Learning Outcomes
After you have finished studying this session, you should be able to:
1. Explain Boolean Models with their illustrations
2. Explain Extended Boolean Models with their illustrations
3. Enumerate the advantages and disadvantages of both Models

2.0 Main Content

2.1 Boolean Model (BM)

The Boolean model of Information Retrieval is a classical Information Retrieval (IR) model and, at the same time, the first and most adopted one. The model, which dates from about 1950, is named after the mathematician George Boole, whose nineteenth-century algebra of logic it applies. BM is used by virtually all commercial information retrieval systems today.

Fig 2.2.1: Boolean Model (BM)

Boolean information retrieval is based on Boolean Logic and classical Set Theory, in that both the documents to be searched and the user's query are conceived as sets of terms.
The model uses the Union, Intersection and Negation operators (OR, AND, NOT). Retrieval is based on matching procedures in which the items retrieved in response to a query are either relevant or irrelevant. This is due to the extreme character of the conjunction and disjunction. Retrieval is based on whether or not the documents contain the query terms.

Boolean operators are devised to enhance recall and precision. Recall reflects the ability to retrieve relevant documents, and precision reflects the ability to reject irrelevant documents. AND is intended for precision while OR is intended for recall. BM is evaluated and tested using recall and precision against a binary scale of relevance. Boolean AND is applied between concepts, while OR is for terms within a concept. NOT is used in a cautious manner, to avoid missing relevant items, and is reserved for searches that strictly require the exclusion of terms. For example, your search may be on images and NOT any other components of multimedia such as video and audio.

Examples:

Boolean AND (Intersection) - all of the words: The Boolean AND of two logical statements x and y means that both x AND y must be satisfied. For a document to be retrieved, both terms must be included in the search statement, and both terms must occur in the item retrieved. Let us look at the diagram below, using the terms "Information" and "Retrieval" for searching.

Figure 2.2.2: Diagram representing Information AND Retrieval (two overlapping circles, Information and Retrieval, with the overlap labelled Information/Retrieval)

All documents in the IR system containing both Information and Retrieval (occurring together in one document) are retrieved as information items. That is, the shaded part in Figure 2.2.2.

Boolean OR (Union) - any of the words: The Boolean OR of these same two statements means that at least one of the statements must be satisfied. A document can be retrieved if either or both of the terms are present.
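Because the Boolean model treats documents and queries as sets of terms, the three operators map directly onto set intersection, union and difference. The following is a minimal sketch under stated assumptions: the four-document collection, the helper name `containing` and the relevance judgements in `relevant` are hypothetical, and the NOT example is included for completeness. Recall and precision are computed with their standard definitions against a binary scale of relevance.

```python
# Toy collection: each document is conceived as a set of terms (hypothetical).
docs = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"information", "science"},
    "d3": {"retrieval", "image"},
    "d4": {"library", "catalogue"},
}


def containing(term):
    """Return the ids of all documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


info = containing("information")
retr = containing("retrieval")

and_result = info & retr   # Information AND Retrieval (intersection)
or_result = info | retr    # Information OR Retrieval (union)
not_result = retr - info   # Retrieval NOT Information (set difference)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Assumed binary relevance judgements for the information need.
relevant = {"d1", "d3"}

for name, result in [("AND", and_result), ("OR", or_result)]:
    print(name, sorted(result),
          "precision=%.2f" % precision(result, relevant),
          "recall=%.2f" % recall(result, relevant))
```

With these assumed judgements, the AND query retrieves fewer documents but all of them are relevant, while the OR query retrieves every relevant document at the cost of one irrelevant one, which illustrates why AND is associated with precision and OR with recall.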
Figure 2.2.3: Diagram representing Information OR Retrieval (two overlapping circles, Information and Retrieval, with the overlap labelled Information/Retrieval)

The IR system looks for information items containing the terms Information, Information/Retrieval and Retrieval. That is, all the shaded parts in Figure 2.2.3.

Boolean NOT (Negation) - none of the words: This excludes the term that is not needed. As mentioned earlier, this operator should be used in a cautious manner in order not to miss relevant items.

Figure 2.2.4: Diagram representing Retrieval NOT Information (two overlapping circles, Information and Retrieval, with the overlap labelled Information/Retrieval)

The IR system looks for information items containing only "Retrieval". Information items that contain the term "Information" are not included in the search. That is, the un-shaded part of Figure 2.2.4.

In-text Question
Define the Boolean model.

Answer
The Boolean model of Information Retrieval is a classical Information Retrieval (IR) model and, at the same time, the first and most adopted one.

2.1.1 Advantages and Disadvantages of Boolean Model

There are several advantages and disadvantages of Boolean Models you should have in mind. These are enumerated below:

Advantages of Boolean Model
1. It is easy to implement and it is computationally efficient. Hence, it is the standard model for the current large-scale, operational retrieval systems, and many of the major on-line information services use it.
2. It enables users to express structural and conceptual constraints to describe important linguistic features. Users find that synonym specifications (reflected by OR-clauses) and phrases (represented by proximity relations) are useful in the formulation of queries.
3. The Boolean approach possesses great expressive power and clarity. Boolean retrieval is very effective if a query requires an exhaustive and unambiguous selection.
4. The Boolean method offers a multitude of techniques to broaden or narrow a query.
5. The Boolean approach can be especially effective in the later stages of the search process, because of the clarity and exactness with which relationships between concepts can be represented.

Disadvantages of Boolean Model
1. The major problem with BM is that it is very rigid. It looks for the exact word used in the search rather than variants of the word. Many researchers refer to it as a fact or data retrieval system rather than an information retrieval system.
2. It gives an "all or nothing" response. Too many or too few documents are retrieved, so the size of the result set is difficult to control.
3. It does not rank the output of the query or search, i.e. documents are not ranked in any order of presumed importance to the user.
4. Users find it difficult to construct effective Boolean queries for several reasons. The natural language words AND, OR and NOT have a different meaning when used in a query. Thus, users make errors when they form a Boolean query, because they resort to their knowledge of English.

In-text Question
What are the advantages of the Boolean Model?

Answer
1. It is easy to implement and it is computationally efficient. Hence, it is the standard model for the current large-scale, operational retrieval systems, and many of the major on-line information services use it.
2. It enables users to express structural and conceptual constraints to describe important linguistic features. Users find that synonym specifications (reflected by OR-clauses) and phrases (represented by proximity relations) are useful in the formulation of queries.
3. The Boolean approach possesses great expressive power and clarity. Boolean retrieval is very effective if a query requires an exhaustive and unambiguous selection.
4. The Boolean method offers a multitude of techniques to broaden or narrow a query.
5. The Boolean approach can be especially effective in the later stages of the search process, because of the clarity and exactness with which relationships between concepts can be represented.

2.2 Extended Boolean Model

To overcome the rigidity of the Boolean Model, the Extended Boolean Model (EBM) was developed by Edward Alan Fox in 1983. In the EBM, information items are not judged simply as relevant or irrelevant with respect to a given query but are given a relative degree of relevance. The EBM also provides proximity and adjacency operators to compare a query with document index terms.

The Boolean model does not consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights, as in the vector space model. It combines the characteristics of the Vector Space Model with the properties of Boolean algebra and ranks the similarity between queries and documents. This way, a document that matches only some of the queried terms may be considered somewhat relevant and will be returned as a result, whereas in the standard Boolean model it would not be. Thus, the EBM can be considered a generalisation of both the Boolean and vector space models; those two are special cases if suitable settings and definitions are employed. Further, research has shown that retrieval effectiveness improves relative to standard Boolean query processing. Other research has shown that relevance feedback and query expansion can be integrated with extended Boolean query processing.

The Extended Boolean Model also provides left and right truncation. For example, when searching for Computer Architecture, it will search for everything that begins with Comp# and Arch#. Thus, for Comp#, it may retrieve computer, computing, computation, etc. The drawback of this type of searching is the problem of false drops, e.g. when retrieving information on computers; to avoid a lot of false drops, you can use the full word computer for searching.
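Truncated searching and the false-drop problem can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the function name `truncation_match` and the small vocabulary are hypothetical, `#` is taken as the truncation symbol as in the examples above, and placing `#` at the front of a pattern to express left truncation is an assumption made for this illustration.

```python
def truncation_match(pattern, word):
    """Match `word` against a pattern that uses '#' as the truncation symbol.

    'comp#'         right truncation: the word must begin with 'comp'
    '#architecture' left truncation: the word must end with 'architecture'
    no '#'          the word must match the pattern exactly
    """
    pattern, word = pattern.lower(), word.lower()
    if pattern.endswith("#"):
        return word.startswith(pattern[:-1])
    if pattern.startswith("#"):
        return word.endswith(pattern[1:])
    return word == pattern


# Hypothetical index vocabulary.
vocabulary = ["computer", "computing", "computation", "company",
              "composition", "compare", "architecture", "microarchitecture"]

# Right truncation retrieves the wanted word forms and some unwanted ones.
matches = [w for w in vocabulary if truncation_match("comp#", w)]
wanted = {"computer", "computing", "computation"}
false_drops = [w for w in matches if w not in wanted]

# Left truncation: every word ending in 'architecture'.
left_matches = [w for w in vocabulary if truncation_match("#architecture", w)]

# Searching on the full word avoids the false drops.
exact = [w for w in vocabulary if truncation_match("computer", w)]

print("comp# matches:", matches)
print("false drops:", false_drops)
print("#architecture matches:", left_matches)
print("exact search:", exact)
```

The shorter the stem left before the truncation symbol, the more unrelated words share it, so the number of false drops grows as the stem shrinks.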
Another example is #Architecture, which retrieves anything ending in architecture. This can be referred to as Root or Left truncation, while the former (e.g. Comp#) can be referred to as Stem or Right truncation.

The drawback of this model is the problem of false drops. For example, if you want to retrieve a word like computer and you use comp#, you may have false drops such as company, composition, compare and so on. In order to avoid a lot of false drops, you can use the full word computer for searching.

In-text Question
What is the full meaning of EBM?

Answer
Extended Boolean Model

3.0 Study Session Summary and Conclusion

Two types of Information Retrieval Model have been explained in this study session. They are the Boolean Model and the Extended Boolean Model. A diagrammatic representation of the Boolean operators was presented for easy understanding. A tutor-marked assignment was given to test the students on their understanding of both models. Self-assessment questions were provided for students' practice.

4.0 Self-Assessment Questions
1. Write a short note on the Boolean Model of information retrieval. Explain using diagrams. Give its advantages and disadvantages.
2. Write a short note on the Extended Boolean Model of information retrieval. Give a detailed explanation using examples. Give its advantages and disadvantages.

5.0 Additional Activities (Videos, Animations & Out of Class activities) e.g.
a. Visit YouTube at https://bit.ly/1hxT0ot. Watch the video & summarise in 1 paragraph
b. View the animation at https://bit.ly/2YJyIk8 and critique it in the discussion forum

6.0 References/Further Readings
Croft, W.B. & Lafferty, J. (2003). Language Modeling for Information Retrieval. Springer.
Jones, K. S. & Willett, P. (1997). Readings in Information Retrieval. Morgan Kaufmann.
Kowalski, G. & Maybury, M.T. (2005). Information Storage and Retrieval Systems. Springer.
van Rijsbergen, C.J. (2004). The Geometry of Information Retrieval. Cambridge UP.
STUDY SESSION 3
Vector Space Model and Fuzzy Model

Section and Subsection Headings:
Contents
1.0 Learning Outcomes
2.0 Main Contents
2.1- Vector Space Model
2.1.1- Advantages and disadvantages of Vector Space Model
2.2- Fuzzy Model
3.0 Study Session Summary and Conclusion
4.0 Self-Assessment Questions and Answers
5.0 Additional Activities (Videos, Animations & Out of Class activities)
6.0 References/Further Readings

Introduction
You are welcome to study session 3. Your previous session started the discussion of the information retrieval models with two of them, the Boolean and Extended Boolean Models. This study session will discuss another two of the IR models, namely the Vector Space and Fuzzy Models.

1.0 Study Session Learning Outcomes
After you have finished studying this session, you should be able to:
1. Explain Vector Space Model
2. Explain Fuzzy Model

2.0 Main Content

2.1 Vector Space Model

The Vector Space Model was developed to overcome the rigidity of the Boolean model. Proposed in the 1970s, it is a statistical retrieval model that represents the documents and queries as vectors in a multidimensional space, whose dimensions are the terms used to build an index to represent the documents. The creation of an index involves lexical scanning to identify the significant terms, morphological analysis to reduce different word forms to common stems, and counting the occurrences of those stems.

Fig 2.3.1: Vector Space Model

In the simplest case, information items as well as queries are described as binary vectors whose nth component indicates the absence or presence of the nth indexing term. Query and information items are points in the vector space, and information items are ranked by the number of descriptors that an item and the query have in common. Mathematically, this is the inner product of the two vectors. Query and document surrogates are compared by comparing their vectors, using, for example, the cosine similarity measure.
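The binary-vector representation, the inner product and the cosine similarity measure just described can be worked through on a tiny example. The sketch below is illustrative only: the four index terms, the three information items and the helper names are hypothetical, and the cosine is computed with its standard definition (inner product divided by the product of the vector lengths).

```python
import math

# Hypothetical index: each term is one dimension of the vector space.
index_terms = ["information", "retrieval", "image", "library"]


def to_binary_vector(terms):
    """Component n is 1 if the nth index term is present, otherwise 0."""
    return [1 if term in terms else 0 for term in index_terms]


def inner_product(u, v):
    """For binary vectors this counts the descriptors the two have in common."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return inner_product(u, v) / (norm_u * norm_v)


query = to_binary_vector({"information", "retrieval"})

# Hypothetical information items.
items = {
    "d1": to_binary_vector({"information", "retrieval", "image"}),
    "d2": to_binary_vector({"information"}),
    "d3": to_binary_vector({"library"}),
}

# Rank the items by decreasing cosine similarity to the query.
ranking = sorted(items, key=lambda d: cosine(query, items[d]), reverse=True)

for doc_id in ranking:
    print(doc_id,
          "common descriptors:", inner_product(query, items[doc_id]),
          "cosine: %.3f" % cosine(query, items[doc_id]))
```

Unlike the Boolean AND, which would retrieve only d1, this ranking still returns d2 as a partial match, placed below d1 and above the unrelated d3.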
In this model, the terms of a query surrogate can be weighted to take into account their importance; the weights are computed using the statistical distributions of the terms in the collection and in the documents. The vector space model can assign a high ranking score to a document that contains only a few of the query terms if these terms occur infrequently in the collection but frequently in the document.

To make this clearer, consider the well-known similarity measure known as the cosine measure. It is given by the cosine of the angle between the query vector and the item vector. The cosine measure ranks the items according to their angle to the query vector. Vectors pointing in similar directions are thus taken to represent similar concepts. A process of clustering is then used to build up groups of items that are close to each other and can be represented by one most typical item.

Figure 2.3.2: Cosine of the angle between the query and the item vector (Q: query term; I: information items; θ: cosine measure)

2.1.1 Advantages and disadvantages of Vector Space Model

The vector space model, like all statistical retrieval models, has the following advantages:
1. They provide users with a relevance ranking of the retrieved documents. Hence, they enable users to control the output by setting a relevance threshold or by specifying a certain number of documents
