CE706-AU Information Retrieval Lecture 1 - PDF
Document Details
University of Essex
Richard Sutcliffe
Summary
This document provides an introduction to the CE706-AU Information Retrieval course at the University of Essex. It covers the course schedule and overview, office hours, assignment and assessment details, information about the lecturer, a discussion of the dissertation project, and some introductory questions about Information Retrieval.
Full Transcript
CE706-AU Information Retrieval
Richard Sutcliffe

Organisation

Overview
Lectures (Weeks 2-11): Monday 14h00-16h00, LTB03
2h Lab per student (Weeks 3, 5, 7, 9, 11): Tuesday 16h00-18h00, CSEE Lab 1
1h Class per student (Weeks 4, 6, 8, 10): Tuesday 16h00-17h00, LTB03
Progress Test (Week 10): Tuesday 17h00-18h00, Lab 1
Assignment: Out Week 4, Due Friday Week 7, 14h00
Exam: During the Summer Term, in the Exam Period

Overview - 2
Office Hours (Academic Support): Monday 16h00-17h00 and Wednesday 16h00-17h00, 4B.527
Moodle Page: https://moodle.essex.ac.uk/course/view.php?id=16260

Organisation of Labs
There is a lab sheet with detailed instructions explaining how to use the software etc. I am at the labs to help you. There are some simple exercises at the end. Attempt the exercises, save your answers in a file, and keep the file safe! You may need those answers to do the project or to help you revise for the progress test or the exam!

Organisation of Classes - 1
I will start by explaining what the class is about. Then there is an exercise sheet with detailed instructions. I am at the classes to help you. You write your answers on paper, using a black biro. Keep the answers safe; you will need them to revise for the progress test and the exam!

Organisation of Classes - 2
You must bring a black biro and some A4 paper to each class!

Progress Test
The progress test is a Moodle test taken in the designated CSEE lab under exam conditions. It is a closed-book test with no notes allowed and no access to the internet. It must be taken at exactly the time and place stated on the timetable. Questions very similar to those in the class sheets are likely to come up in the test!

Assignment
The assignment is a practical project which you do in the labs. The work to be done is written in a specification (which will be on Moodle in the appropriate week). You must submit your work to FASER (faser.essex.ac.uk) before the deadline.
FASER is quite slow, so do not try to submit five minutes before the deadline! Allow at least half an hour. Your work will also be examined in the lab.

Exam
The exam lasts 120 minutes. Questions will be based mainly on the assignment, the answers to lab tasks, and the answers to class exercises. You need to follow the class exercises each week and know how to make the calculations. You will not be able to pass the exam by looking at the lecture notes the night before!

Assessment Information - 1
30% Project
10% Progress Test
60% Exam
-------
100% Total

Assessment Information - 2
Note: in the module directory (https://www1.essex.ac.uk/modules/) it says:
25% of coursework is the Progress Test
75% of coursework is the Assignment
This is confusing! It means:
25% of the 40% coursework component is the Progress Test (i.e. 10% of the whole module mark).
75% of the 40% coursework component is the Assignment (i.e. 30% of the whole module mark).
So these figures are exactly the same as those stated above.

Questions about Organisation and Assessment
Any questions about module organisation?

Lecturer

Information about the Lecturer
First degree: St Andrews University, Scotland; Ph.D.: University of Essex.
Lectured at the University of Exeter, UK; University of Limerick, Ireland; University of Essex, UK; Northwest University, Xi'an, China.
Participant in TREC (English Question Answering), CLEF (French-English Question Answering) and NTCIR (Chinese Question Answering).
Organiser of multilingual QA evaluations at CLEF for ten years.
Organiser of the C@merata track on Music Information Retrieval at MediaEval for four years.

Information about the Lecturer - 2
Interests: Natural Language Processing, Information Retrieval, Computer Musicology.
In particular: Sentiment Analysis and Personality, using Machine Learning and Neural Network models.

Dissertation Project CE901 and Project Proposal CE902
All of you will do a Dissertation Project with a supervisor in CSEE.
It means:
Doing a project
Writing a dissertation
This means you need (1) a project and (2) a supervisor for the project.
Also, the Project Proposal you undertake in CE902 will be for the very same project.
The project selection process has not started yet, so do not panic!

Advertisement: Possible Dissertation Projects with Richard
Personality identification in social media text
Sentiment and personality analysis of social media text
Computational comparisons of personality models such as Big-5 and 5-Sentence
Prompt methods combined with NN methods for personality identification
Click-through data and profiling, practical experiments and data analysis
Analysis of North Indian (Hindustani) classical music data using neural network models
Analysis of Western classical music data using neural network models

What Do these Projects Involve?
Choosing or combining large datasets
Studying and adapting Neural Network models
Training the models on the datasets
Analysing the results
Also, possibly carrying out practical experiments (e.g. click-through)

Try Poll Everywhere
Question: Are you interested in some of Richard's projects?
PollEv.com/richardsutcliffe540

Books
Search Engines: Information Retrieval in Practice. W. Bruce Croft, Donald Metzler, Trevor Strohman. Pearson Education, 2015. http://ciir.cs.umass.edu/downloads/SEIRiP.pdf - on Moodle

Books cont.
Introduction to Information Retrieval. C. D. Manning, P. Raghavan, H. Schütze. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/
Information Retrieval. Keith van Rijsbergen. Butterworths, 1979! http://www.dcs.gla.ac.uk/Keith/Preface.html

Books cont.
An Introduction to Neural Information Retrieval. Bhaskar Mitra, Nick Craswell. Now Publishers, 2018 - on Moodle

Poll Everywhere
Question: What is your previous background?
PollEv.com/richardsutcliffe540

Now we will actually start the course...
1. Introduction to Information Retrieval

Some Questions
What is information retrieval (IR)?
What is a search engine?
Why is information retrieval important?
How do search engines work?
Start with some examples...

Example 1: Web Search
Example 2: More Web Search
Example 3: Question Answering
Example 4: Scholar Search
Example 5: Site Search
Example 6: Email Search

Poll Everywhere
Question: What search engines do you ever use? (You can tick more than one.)
PollEv.com/richardsutcliffe540

Search Engine Market Share: December 2023
Google       91.62
Bing          3.37
YANDEX        1.65   (Russian)
Yahoo!        1.12
Baidu         0.96   (Chinese)
DuckDuckGo    0.51
Naver         0.21   (Korean)
CocCoc        0.16   (Vietnamese)
Ecosia        0.10   (plants trees)
Haosou        0.09   (Chinese)
Source: https://gs.statcounter.com/search-engine-market-share

What is Information Retrieval?
During the last forty years, huge amounts of information have been stored on computers around the world. With the dawn of the WWW, this information is more readily available than ever before. People need information to solve problems. The central question of IR: how do you find the information you want?

What is Information Retrieval - Definition
"Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information." Gerard Salton, 1968
Salton, G. (1968). Automatic Information Organization and Retrieval. New York: McGraw-Hill.

What is a Search Engine?
Information is in the form of text stored in a number of separate files. An information need is specified as a short query comprising a text string. The system responds to the query with an ordered list of files which match the query. The user then looks through the files and hopefully finds the answer to their query.
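The query-to-ranked-list behaviour just described can be sketched minimally in Python. The file names and contents below are invented for illustration; a real engine would not scan every file like this, as the lecture goes on to explain.

```python
# Hypothetical collection: file name -> file contents.
docs = {
    "doc1.txt": "information retrieval finds relevant documents",
    "doc2.txt": "databases use exact match queries",
    "doc3.txt": "search engines rank documents by relevance",
}

def search(query, docs):
    """Return file names ordered by how many query terms each contains."""
    terms = query.lower().split()
    scored = []
    for name, text in docs.items():
        words = text.lower().split()
        # Count occurrences of each query term in this file
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scored.append((score, name))
    # Best-matching files come first in the ordered list
    return [name for score, name in sorted(scored, reverse=True)]

print(search("relevant documents", docs))
```

Here "doc1.txt" contains both query terms and is ranked above "doc3.txt", which contains only one; "doc2.txt" matches neither term and is not returned at all.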
Why is IR Important - Brief History
In the 1950s, IR was of interest mainly to people cataloguing libraries. In the 1970s, inverted indexing was developed by Gerry Salton at Cornell; this allowed large collections, but IR was still a fringe activity. In the 1990s, the Web started to become more important and search engines were created. In the 2010s, search engines became a big business; Google, Bing, Baidu etc. were very important organisations. In the 2020s, Information Retrieval is universally used.

Fields Related to IR
Question Answering (IBM Watson)
Input is a query: Who is the wife of Tom Cruise?
Output is exact: Katie Holmes (until 2012 anyway!)
Information Extraction (e.g. template filling)
Input: a document of a certain type (e.g. a company takeover)
Output: key facts - who took over what, when, why, and for how much

Database Queries (e.g. SQL) vs. Information Retrieval

                 Database Retrieval   Information Retrieval
Matching         Exact match          Partial or best match
Inference        Deduction            Induction
Model            Deterministic        Probabilistic
Query Lang       Artificial           Natural
Query Spec       Complete             Incomplete
Items wanted     Matching             Relevant
Error response   Sensitive            Insensitive

Based on van Rijsbergen, Manning et al.

Deduction vs. Induction
Inductive reasoning is a method of reasoning in which the premises are viewed as supplying some evidence, but not full assurance, for the truth of the conclusion. (https://en.wikipedia.org/wiki/Inductive_reasoning)
Deductive reasoning, also deductive logic, is the process of reasoning from one or more statements (premises) to reach a logical conclusion. (https://en.wikipedia.org/wiki/Deductive_reasoning)

Deterministic vs. Probabilistic
Determinism is the philosophical view that all events are determined completely by previously existing causes. (https://en.wikipedia.org/wiki/Determinism)
Determinism in IR or DBs: given certain inputs, we always get the same output. Database queries in SQL are like this.
Probabilistic reasoning is a method of representation of knowledge where the concept of probability is applied to indicate the uncertainty in knowledge. (https://study.com/academy/lesson/probabilistic-reasoning-artificial-intelligence.html)
Probabilistic reasoning in IR: keywords etc. in a text are evidence that it could be relevant, but this is not certain; the more keywords, the more certain.

DB vs. IR cont.
DB: convert data into structured form (normal forms etc.). Search precisely using the Relational Model. You cannot change the overall framework.
IR: do not convert data to structured form. Search less precisely using keywords etc. You can change the overall framework - just alter the search methods.

Hybrids of DB and IR
NoSQL databases are one trend. MongoDB is an example:
Text documents are structured
Queries are also structured (but not SQL)
Large amounts of data
No 'normalisation'-style guarantees (a normalised SQL database cannot be inconsistent)

Choice of Assignment Topic
Now we know what a search engine is. You will be implementing and evaluating a search engine. Each student will have a different topic. Topics must be approved by me. I am interested in what topic you might be interested in. In the following Poll Everywhere, please select your preferred topic area.
PollEv.com/richardsutcliffe540

How do Search Engines Work?
Based on two important principles:
Inverted Indexing
Term Weighting

Most Important Idea in IR: Inverted Indexing
Hints: a term is a word! A Docid is a unique identifier for a document.
Simple IR - Forward Index
Input: one or more query terms (words)
Output: documents containing the query terms
We start by going through each document and making a list of the words it contains. Then we go through each document's list in turn, seeing whether it contains the query terms. If it does, return the document. This is called a Forward Index: Docid → Term.
Problem: hopelessly slow! The time for retrieval goes up linearly with the number of docs!
It also goes up linearly with the number of words in the docs! Unworkable!

Inverted Indexing cont.
Solution - Inverted Index
Instead of Docid → Term, we have Term → Docid. In other words, we input a Term and in one operation we can produce a list of docs which contain that term. For multiple query terms, we must merge the resulting lists of Docids.
Advantage: very fast to produce the lists - the time does not increase (much) with the size of the doc collection. Merging the lists is slower, but does not need to be complete (see later).
Disadvantage: we need to create the inverted index from the forward index. This is slow, but only needs doing 'once'.

Second Most Important Idea in IR - Term Weighting
Simple IR - Keyword Match
If a query term is in a doc, it matches; otherwise it does not. Just return the docs which contain matches, and that is it. This is the basis of Boolean Retrieval.
Advantage: simplicity.
Disadvantage: it does not take account of how important the word is, and does not allow ordering of the matching docs (called Ranking).

Term Weighting
Associate a score with each term: the higher the score, the more important the term. We call the score of a term its weight. Remember: a weight is a number!
The use of term weights has advantages:
1. It dramatically improves the performance of IR in terms of the relevance of the results.
2. It makes the system no slower.
3. It allows ranking (i.e. ordering) of results - the most relevant come first. This is extremely important in large doc collections.
Nobody quite knows why it works!

TF*IDF - Fundamental Basis of Retrieval
We will explain all this in detail later.
Term Frequency: how often a Term occurs in a Doc.
Hypothesis: the more often a term appears, the more the doc is about the concept denoted by the term. E.g. the more often 'horse' appears in a doc, the more the doc is about horses!
Inverse Document Frequency: the reciprocal of how often the Term appears in the entire doc collection.
Hypothesis: the more often the term appears across the doc collection as a whole, the less useful it is for distinguishing between docs. E.g. if 'horse' appears in all the docs, not just one or two, then 'horse' is no good for distinguishing between docs, because all the docs will match!

Properties of TF and IDF
For Term Frequency, we say TF. For Inverse Document Frequency, we say IDF.
The higher the frequency of a term in a doc, the better the doc matches our query term. On the other hand, the higher the frequency of the term in docs generally, the worse it is as a search term. So we take 1/DF (i.e. IDF), because the term becomes worse for us as it becomes more frequent. We want the best 'combination' of these factors, hence we multiply TF by IDF to get an overall weight for the term. Call this TF*IDF.

Summary of Topic 1: Introduction to IR
IR returns an ordered list of matching documents in response to a keyword query.
Inverted Indexing makes fast searching possible on large collections. It is the basis of Google.
Term weighting is crucial to effective retrieval. TF*IDF is the dominant paradigm for term weighting.
IR has complementary properties to DB searching.
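The two key ideas summarised above - inverted indexing and TF*IDF weighting - can be sketched together in a few lines of Python. This is a toy illustration only: the three documents are invented, and the log-scaled IDF used here is a common variant of the simple reciprocal definition given in the slides.

```python
import math
from collections import defaultdict

# Hypothetical toy collection: Docid -> document text.
docs = {
    1: "the horse raced past the barn",
    2: "the horse pulled the cart",
    3: "the cart was full of apples",
}

# Build the inverted index (Term -> set of Docids) in one pass.
# This is the slow 'once-only' step; lookups afterwards are fast.
index = defaultdict(set)
for docid, text in docs.items():
    for term in text.lower().split():
        index[term].add(docid)

def tf_idf(term, docid):
    """TF*IDF weight of a term in one document.
    TF = raw count in the doc; IDF = log(N / document frequency),
    a widely used smoothed variant of the 1/DF idea."""
    words = docs[docid].lower().split()
    tf = words.count(term)
    df = len(index.get(term, ()))
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

print(sorted(index["horse"]))     # one lookup gives all docs with 'horse'
print(tf_idf("barn", 1))          # rare term: high weight
print(tf_idf("the", 1))           # term in every doc: weight 0.0
```

Note how 'the', which occurs in every document, gets a weight of zero under the log-scaled IDF: it is useless for distinguishing between docs, exactly as the hypothesis above predicts, while the rare term 'barn' scores highly.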