Evaluation - Computing & Information Sciences
Document Details
University of Strathclyde
W. H. Bell
Summary
This document provides an overview of evaluation approaches in computing and information sciences, including user-based and model-based evaluations. It covers search system metrics, like recall and precision, and methods for evaluating partial sets of results. The document also summarizes test collections, pooling, and other measures for evaluating the performance of different systems.
Full Transcript
Evaluation - Computing & Information Sciences - W. H. Bell

Evaluation
Evaluate performance against something; we need to decide what "good" is. Performance can be measured using comparison with a mathematical model, or against the expectation of users.

User-based evaluation
People are biased. The expectation of the analyst can cause false identification of features, and even with metrics for thought processes, biases may still occur. Apply user input for some high-level tests, but underpin this with a statistical model. It is safer to build metrics and then carry out blind analyses: decide what to test using a background or simulation, configure the test, run it and then take the result as it is. The analysis must not be tuned while looking at the final data sample.

User-based evaluation: Movie reviews
The average rating value of movies is biased. People watch movies that they like, so in general they assign scores that are systematically higher. These are real effects that have to be corrected for.

User-based evaluation: Search engines
A user types in 2-3 words and wants a specific page. User expectations differ: one person wants an academic article, another wants a magazine. PageRank and quality alone are not enough; recommender systems need to be applied to search engine results.

Model-based evaluation
A measurement is compared with a model, where the model provides a central value and an uncertainty band. Compare the observed to the expected: choose an observable and decide what is expected.

Evaluation of search systems
Quantify how well a search system is functioning. Is it returning the right documents? Are the documents returned in the correct order? Does it return the results quickly? Are the results displayed in the right way? The desired answer for each question needs to be determined, then a method constructed to perform the evaluation, and the potential systematic and statistical uncertainties of the method understood.

Search system metrics

Recall = |S_relevant ∩ S_retrieved| / N_relevant
A higher value implies that more relevant documents have been retrieved.
S_relevant - the set of relevant documents.
S_retrieved - the set of retrieved documents.
N_relevant - the number of documents in the set S_relevant.
This needs the complete set of relevant documents, so what is relevant has to be determined.

Precision = |S_relevant ∩ S_retrieved| / N_retrieved
A higher value implies that the fraction of relevant results returned is higher.
S_relevant - the set of relevant documents.
S_retrieved - the set of retrieved documents.
N_retrieved - the number of documents in the set S_retrieved.
This also needs the complete set of relevant documents, so what is relevant has to be determined.

Recall and precision

Query     N_retrieved  N_relevant  N_relevant&retrieved  Recall  Precision
Query 1   100          60          50                    0.83    0.5
Query 2   50           50          25                    0.5     0.5
Query 3   100          60          10                    0.167   0.1
Query 4   1000         80          60                    0.75    0.06

Evaluation of partial set
A query can be evaluated in rank order, since properties may be better for earlier results. Properties are presented at rank k:
Precision@k - the proportion of the top-k documents that are relevant.
Recall@k - the proportion of relevant documents that are in the top-k, using the total N_relevant from all documents in the calculation.
A value for the rank k needs to be chosen.

Rank  Document Id  Relevant  Recall        Precision
1     20                     0/7 = 0       0/1 = 0
2     7            True      1/7 = 0.143   1/2 = 0.500
3     18                     1/7 = 0.143   1/3 = 0.333
4     10           True      2/7 = 0.286   2/4 = 0.500
5     2                      2/7 = 0.286   2/5 = 0.400
6     12                     2/7 = 0.286   2/6 = 0.333
7     16                     2/7 = 0.286   2/7 = 0.286
8     6            True      3/7 = 0.429   3/8 = 0.375
9     17                     3/7 = 0.429   3/9 = 0.333
10    3                      3/7 = 0.429   3/10 = 0.300

There are a total of 7 relevant documents within this example.
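The following is a minimal Python sketch of these calculations, reproducing the recall and precision columns of the table above. The ranked list and relevance judgements are taken from the example (only the three relevant documents that appear in the top 10 are listed, with 7 relevant documents in the collection overall); the function names are illustrative.

# Minimal sketch of precision@k and recall@k for the ranked list above.
ranking = [20, 7, 18, 10, 2, 12, 16, 6, 17, 3]  # document IDs in rank order
relevant = {7, 10, 6}                           # documents judged relevant within the top 10
n_relevant_total = 7                            # relevant documents in the whole collection

def precision_at_k(ranking, relevant, k):
    """Proportion of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k

def recall_at_k(ranking, relevant, k, n_relevant_total):
    """Proportion of all relevant documents that appear in the top-k."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / n_relevant_total

for k in range(1, len(ranking) + 1):
    print(k,
          round(recall_at_k(ranking, relevant, k, n_relevant_total), 3),
          round(precision_at_k(ranking, relevant, k), 3))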
Evaluation of partial set
In general, as the rank increases the precision falls and the recall increases.
Precision = |S_relevant ∩ S_retrieved| / N_retrieved
Recall = |S_relevant ∩ S_retrieved| / N_relevant
Using the example data from the previous table, the precision is typically higher for the first ranks.

Precision vs recall
The precision-recall curve has a characteristic sawtooth shape. [Figure: precision vs recall for two queries. Query 1 returns 5 documents, where ranks 1 and 4 are relevant and there are 5 relevant documents in total; Query 2 uses the values from the previous figure and table.]

Understanding performance
We want to understand precision as a function of recall, giving the general performance of the search system for many queries. Calculate the average of several precision vs recall curves; a smooth distribution is needed to form the average, so interpolate between points and then form the average.

Interpolated precision
Interpolate for monotonic behaviour: use the maximum precision at that recall or any higher recall, i.e. take the precision from the point to the right. This is an optimistic interpolation.

Average interpolated curves
Average over many queries, compare the curves for different search systems, and discuss the observed performance.

Average precision

Rank  Relevant  Recall        Precision      P @ threshold
1               0/7 = 0       0/1 = 0        0
2     True      1/7 = 0.143   1/2 = 0.500    0.5
3               1/7 = 0.143   1/3 = 0.333    0
4     True      2/7 = 0.286   2/4 = 0.500    0.5
5               2/7 = 0.286   2/5 = 0.400    0
6               2/7 = 0.286   2/6 = 0.333    0
7               2/7 = 0.286   2/7 = 0.286    0
8     True      3/7 = 0.429   3/8 = 0.375    0.375
9               3/7 = 0.429   3/9 = 0.333    0
10              3/7 = 0.429   3/10 = 0.300   0

Average precision (AP)
As a total value or at a given rank:
AP = (1/n) * Σ_{i=1}^{n} p(τ_i)
p(τ_i) - the precision at recall threshold τ_i.
n - the number of thresholds considered.
Each term is non-zero when the recall changes, i.e. when a relevant document is found.

Mean average precision (MAP)
MAP = ( Σ_{i=1}^{N_queries} AP(q_i) ) / N_queries
MAP - the mean average precision.
AP(q_i) - the average precision of query q_i.
N_queries - the number of queries considered.
Average precision is evaluated at a given rank; mean average precision can be evaluated at rank k.

Mean average precision example (averaging over two queries):

Query 1 (relevant documents at ranks 1 and 4):
Rank  Relevance  P@k
1     Yes        1.0
4     Yes        0.5
AP = 0.75

Query 2 (relevant documents at ranks 2, 5, 7 and 10):
Rank  Relevance  P@k
2     Yes        0.5
5     Yes        0.4
7     Yes        0.429
10    Yes        0.4
AP = 0.432

Mean average precision = 0.591

Mean average precision
Provides a single value indicating performance, and assigns more weight to the beginning of the ranking: rank 1 is twice as important as rank 2. Normally the arithmetic mean is used. GMAP uses the geometric mean, which gives a lower value for any query with low performance:
GMAP = ( Π_{i=1}^{N_queries} AP(q_i) )^(1/N_queries)

Mean reciprocal rank (MRR)
MRR = (1/N_queries) * Σ_{i=1}^{N_queries} 1/k_i
N_queries - the number of queries considered.
k_i - the rank at which the relevant document is retrieved for query q_i.

Mean reciprocal rank example (documents are listed in order of retrieval):

Query     Document 1  Document 2  Document 3  Rank (k)  1/k
Query 1                           Relevant    3         1/3
Query 2               Relevant                2         1/2
Query 3   Relevant                            1         1/1

MRR = (1/3 + 1/2 + 1) / 3 = 0.61

Mean reciprocal rank
A higher MRR is a better outcome and suggests better performance overall. MRR is only defined if there is a single relevant answer; use mean average precision if there are more. The rank is not defined if the relevant document is not found, and an undefined rank cannot be included within the MRR.

Other measures
Bpref - relevant documents retrieved ahead of irrelevant documents.
Relative recall.
GMAP - the geometric mean of the average precision.
Statistical tests can be used to test whether one system is better than another.

Search system test data
Test data with tagged relevance is needed, with relevance tags per query. Run the search system and evaluate its performance: run all queries with tagged relevance information and form the average performance for the system.
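The following is a minimal Python sketch of AP, MAP and MRR as defined above. The relevance lists are chosen to reproduce the MAP and MRR examples; the query data and function names are illustrative.

# Minimal sketch of AP, MAP and MRR, assuming binary relevance flags per rank.
def average_precision(relevance):
    """AP: mean of precision@k over the ranks k at which a relevant document appears."""
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """MAP: arithmetic mean of AP over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)

def mean_reciprocal_rank(relevant_ranks):
    """MRR: mean of 1/k over the rank k of the relevant document for each query."""
    return sum(1.0 / k for k in relevant_ranks) / len(relevant_ranks)

query_1 = [1, 0, 0, 1, 0]                 # relevant at ranks 1 and 4 -> AP = 0.75
query_2 = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # relevant at ranks 2, 5, 7 and 10 -> AP ≈ 0.432
print(mean_average_precision([query_1, query_2]))  # ≈ 0.591
print(mean_reciprocal_rank([3, 2, 1]))             # ≈ 0.611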
Information retrieval evaluation
Early work used the Cranfield tests, the first tests on automated library systems, which defined the notion of a test set (or test collection). Most work has focussed on test collections; the Text REtrieval Conference (TREC) provides test sets (https://trec.nist.gov/). More recently, papers have used simulations built from TREC data, for example to simulate the selection of irrelevant documents.
https://doi.org/10.1108/eb050097 - Cranfield tests.
https://doi.org/10.1145/3331184.3331259 - Statistical approaches.

System evaluation: test collection
A selection of documents - text, images, videos, speech, etc.
Information requests - queries, topics, profiles; as many as possible, either real queries or from the collection creator.
Relevance judgements or tags - documents that are relevant to a specific query. Relevance can be topical or static. Relevance is normally binary, although some recent metrics look at non-binary relevance.

Overview of system evaluation
Test collection - contains data and queries.
Information requests - descriptions of information needs.
Queries - searchable versions of the requests.
Relevance judgements - documents tagged as relevant.
A test collection contains many queries, simulating many users searching with different requests from different people. A static test collection can be reused to test different systems.

CISI collection
1430 documents and 112 queries, with an average of 41 relevant documents per query (3%). All documents have been assessed.
http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/

Wall Street Journal (WSJ) collection
74,520 documents - full-text newspaper articles - and 50 topics with a detailed description of the information required, averaging 30 relevant documents per query (0.04%). Large collections take longer to classify: 50 topics x 74,520 documents = 3,726,000 assessments, and relevance may change over time.
https://trec.nist.gov/data/docs_eng.html

TREC
Provides many data sets from different tracks, and an evaluation tool to interface with the test data sets.
https://trec.nist.gov/data.html

Pooling
Used when manually assessing and tagging documents as relevant; for larger collections, pooling is used. Assessing a sample of the document collection:
A random sample is not useful when there are many irrelevant documents - around 2000 WSJ documents would have to be read to find 1 relevant document.
A biased sample chooses the part of the document collection containing more relevant documents.

Pooling: steps
Use several information retrieval systems with a topic. Each system retrieves its top n documents, and duplicates are removed to form the "pool". An assessor reads each document and tags its relevance; the output is a set of relevant documents. (A short sketch of the pool construction follows below.)

Pooling
The pool is a small set that is likely to contain relevant documents, maximising the chance of finding them, although it misses some relevant documents. The resulting collection can still be used for performance studies; this has been shown not to affect overall results when comparing system performance. The initial pooling should use many systems to reduce bias against new systems.
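As a minimal Python sketch of the pooling steps above: each system contributes its top-n documents, and duplicates are removed to form the pool. The system names, document IDs and value of n are illustrative.

# Minimal sketch of pool construction from several retrieval runs.
def build_pool(runs, n):
    """Merge the top-n documents from each system and remove duplicates."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:n])
    return pool

runs = {
    "system_a": [12, 7, 3, 44, 90],
    "system_b": [7, 8, 12, 2, 61],
    "system_c": [5, 7, 44, 13, 12],
}
pool = build_pool(runs, n=3)
print(sorted(pool))  # an assessor then reads each pooled document and tags its relevance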
Web as test collection
There are several billion web pages; queries comprise 1-3 words, and only a tiny percentage of pages are relevant to a query. Content is dynamic, so a snapshot needs to be created before applying pooling.
https://www.worldwidewebsize.com/

Test collections: Summary
A static set of documents, queries and relevance tags, used to run repeatable experiments and compare results from different systems. It is time consuming to build new collections: new systems and relevance need to be considered.

Evaluation of recommender systems

Test data
Static collections with user ratings:
MovieLens - movies, ratings and tags (https://grouplens.org/datasets/).
WikiLens (2008) - documents are rated for relevance.
Book-Crossing data set (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).
Jester - jokes and their ratings (http://www.ieor.berkeley.edu/~goldberg/jester-data/).

Training and testing
Split the data sample: train using a subsample (the training set) and test on the remaining data (the testing set).

K-fold cross validation
Using the same training data introduces bias, so train the recommender system on a different section of the data each time and repeat the training and testing to obtain an unbiased estimate. (A short sketch of a k-fold split is given at the end of this transcript.)

N-1 testing
Use the data for the active user. Select one value for the user, train on the data set omitting the selected value, and test the suggestion for the user against the withheld value. Repeat, leaving out different values from the training.

Confusion matrix

True value  Predicted: Yes       Predicted: No
Yes         True Positive (TP)   False Negative (FN)
No          False Positive (FP)  True Negative (TN)

A False Positive (FP) is known as a Type 1 error; a False Negative (FN) is known as a Type 2 error.

Confusion matrix measures
Precision = TP / (TP + FP) - tends to 1 as FP tends to zero.
Recall = TP / (TP + FN) - also known as sensitivity; tends to 1 as FN tends to zero.
Accuracy = (TP + TN) / (TP + FP + TN + FN) - larger when the state (P or N) is correctly identified.
F1 = TP / (TP + (FP + FN)/2) - harmonic mean; tends to 1 as FP and FN tend to zero.
Specificity = TN / (TN + FP) - true negative rate; tends to 1 as FP tends to zero.

Prediction accuracy
An error value needs to be defined. This is used when training recommender systems and for other machine learning methods, as feedback to allow training.

Error formulae
RMSE = sqrt( Σ_{i=1}^{N} (x̂_i - x_i)² / N ) - root-mean-square error.
MAE = Σ_{i=1}^{N} |x̂_i - x_i| / N - mean absolute error. Large errors affect MAE less than RMSE.
x̂_i - predicted value.
x_i - measured value.
N - the number of values being compared.
Error distributions ("pulls") may provide useful information when fitting fewer variables.

Example (a sketch of these calculations follows at the end of the transcript):

Number  User  Movie Id  Rating  Predicted  |x̂_i - x_i|  (x̂_i - x_i)²
1       1     123       5       4          1             1
2       1     45        4       4          0             0
3       2     56        3       5          2             4
4       2     789       5       2          3             9
6       3     654       2       2          0             0
7       3     11        4       3          1             1

RMSE = 1.581, MAE = 1.167

Significance testing
Is system A better than system B? Use many observables or trends. For each observable, the expected value and its uncertainty can be stated, and therefore whether an effect is statistically significant. Compute the likelihood using the performance, the observables and their uncertainties; systematic errors need to be estimated too.

A/B Testing
A controlled experiment with two answers - two-sample hypothesis testing. Divert a small number of users to a test environment and evaluate the impact of the new feature.

A/B Testing: Font colours example
Which colours are better? https://dl.acm.org/doi/pdf/10.1145/2623330.2623341
Bing ran experiments on font colours. Three colour changes were made, users were more successful in completing tasks, and there was a $10M improvement. The experiment was repeated with 32 million users and gave the same result - the improvement was still visible.

Other evaluation approaches
User evaluation - involves prototypical users, who are asked to complete tasks, to understand user satisfaction.
Operational evaluations - analysis of logs that have been accumulated.
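The following is a minimal Python sketch of the confusion-matrix measures and the error formulae described above. The rating and prediction lists reproduce the worked RMSE/MAE example; the confusion-matrix counts and function names are illustrative.

import math

# Minimal sketch of the confusion-matrix measures defined above.
def confusion_measures(tp, fp, tn, fn):
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),               # also known as sensitivity
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": tp / (tp + 0.5 * (fp + fn)),      # harmonic mean of precision and recall
        "specificity": tn / (tn + fp),          # true negative rate
    }

# Error formulae used for prediction accuracy.
def rmse(predicted, measured):
    """Root-mean-square error: large errors are penalised more than in MAE."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured))

def mae(predicted, measured):
    """Mean absolute error."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)

print(confusion_measures(tp=50, fp=10, tn=30, fn=10))  # illustrative counts

ratings   = [5, 4, 3, 5, 2, 4]  # measured ratings from the example table
predicted = [4, 4, 5, 2, 2, 3]  # predicted ratings from the example table
print(round(rmse(predicted, ratings), 3))  # 1.581
print(round(mae(predicted, ratings), 3))   # 1.167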
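As a minimal sketch of the k-fold cross-validation idea (train on a different section of the data each time and repeat), assuming the rating data can be indexed as a flat list; the fold count and data size are illustrative.

# Minimal sketch of a k-fold split: each fold is held out for testing in turn.
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs for k folds of n_items."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for fold in range(k):
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        train = [i for i in indices if i not in test]
        yield train, test

for train, test in k_fold_indices(n_items=10, k=5):
    # train the recommender on the training indices, test on the held-out fold,
    # then average the error (e.g. RMSE) over the k folds
    print("train:", train, "test:", test)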