06 Supervised Learning in TM – Metrics
NOVA IMS Information Management School, Universidade Nova de Lisboa
2024
Summary
This presentation from NOVA IMS Information Management School discusses supervised learning in text mining, with a focus on evaluation metrics. The slides cover supervised learning techniques (regression and classification), the traditional machine learning metrics, and "new" metrics such as MAP, MRR, and ROUGE, and include practical examples and exercises. The lecture is dated October 14, 2024.
Full Transcript
TEXT MINING – LCD 2024/25
06 Supervised Learning in TM – Metrics
October 14, 2024

AGENDA
- Supervised problems in TM
- The traditional metrics
- The "new" metrics

1 Supervised problems in Text Mining

Many supervised text mining tasks can be seen as machine learning problems applied to text: we have independent variables, a dependent variable, and a relationship between them that we want to explain or predict.

Supervised learning splits into two families:
- Regression: the outcome variable is numerical.
- Classification: the outcome variable is categorical.

Classification example — emails: is it SPAM (1) or not (0)?

Email | SPAM
"Congratulations! You've won a million dollars in our lottery!…" | 1
"Dear John, we appreciate your recent purchase. Here is the order…" | 0
"Hi John, your monthly newsletter from CoolText is here! Check…" | 0
"Lose 30 pounds in 30 days with this miracle weight loss pill!" | 1
"Dear John, thank you for your job application. We would like to…" | 0
"You've been selected for a free iPhone X! Just enter your personal…" | 1
"Get rich quick with this amazing opportunity! Earn $10,000…" | 1
"Hi John, I hope this email finds you well. I wanted to follow up on…" | 0
… | …

Regression example — social media posts: how much engagement will a post get?

Post Title | Engagement
"Check out our new product! It's amazing!" | 152
"Just posted a funny cat video. #CatsRule" | 234
"Learn how to bake the perfect chocolate cake!" | 483
"Exciting news coming soon! Stay tuned!" | 38
"Happy Friday! Enjoy the weekend everyone!" | 18
"Join us for our live webinar on AI today at 3 PM! #AIWebinar" | 156
"10 Tips for a Healthy Lifestyle." | 189
"Exciting announcement: Our annual summer sale is here!" | 322
… | …

METRICS IN TM
- The traditional metrics in ML: Accuracy, Precision, Recall, F1 Score, AUC, RMSE, MAPE
- Other metrics used in TM: MAP, MRR, ROUGE, BLEU, METEOR, Perplexity

Source: Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. O'Reilly Media.
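To make the spam classification example concrete, here is a minimal sketch of a supervised text classifier, assuming scikit-learn is available; the tiny dataset, the TF-IDF + logistic regression pipeline, and the variable names are illustrative assumptions, not part of the slides.

# Minimal sketch (assumes scikit-learn); the toy dataset below is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Congratulations! You've won a million dollars in our lottery!",
    "Dear John, we appreciate your recent purchase. Here is the order",
    "Lose 30 pounds in 30 days with this miracle weight loss pill!",
    "Dear John, thank you for your job application.",
]
labels = [1, 0, 1, 0]  # dependent variable: 1 = SPAM, 0 = not SPAM

# Turn the raw text (independent variables) into TF-IDF features, then fit a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["You've been selected for a free iPhone X!"]))

The same pipeline with a regressor instead of a classifier would cover the engagement-prediction example.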
2 The traditional Metrics

METRICS IN TM — the traditional metrics in ML, split by task:
- Classification: Accuracy, Precision, Recall, F1 Score, AUC
- Regression: RMSE, MAPE

Metrics in classification are built on the confusion matrix:

                   True class +      True class −
Predicted +        True Positive     False Positive
Predicted −        False Negative    True Negative

Worked example used for all the classification metrics below: TP = 124, FP = 36, FN = 16, TN = 102.

2.1 Classification – Accuracy
Use it when you have a balanced class distribution and all classes have equal importance.
Accuracy = (TP + TN) / (TP + FP + TN + FN) = (124 + 102) / 278 = 226 / 278 ≈ 0.813 = 81.3%

2.2 Classification – Precision
Use it when false positives are costly.
Precision = TP / (TP + FP) = 124 / (124 + 36) = 124 / 160 = 0.775 = 77.5%

2.3 Classification – Recall
Use it when false negatives are costly.
Recall = TP / (TP + FN) = 124 / (124 + 16) = 124 / 140 ≈ 0.886 = 88.6%

2.4 Classification – F1 Score
Use it when you want to combine precision and recall in a single number.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.775 × 0.886) / (0.775 + 0.886) ≈ 0.827 = 82.7%

2.5 Classification – AUC
Use it when you want to measure the quality of a model independently of the prediction threshold, and to find the optimal threshold for a classification task.
[Figure: ROC curve plotting Sensitivity against 1 − Specificity, with AUC = 0.8172 in the slide's example.]
Sensitivity = Recall; Specificity = TNR = TN / (TN + FP).

2.6 Regression – RMSE (Root Mean Squared Error)
Use it when you want the square root of the mean of the squared errors over all data points.
RMSE = sqrt( (1/n) × Σ (ŷ_i − y_i)² )
Example (actual engagement y = 152, 234, 483, 38, 18; predicted ŷ = 168, 212, 472, 36, 18):
RMSE = sqrt( ((168 − 152)² + (212 − 234)² + (472 − 483)² + (36 − 38)² + (18 − 18)²) / 5 ) = sqrt(865 / 5) = sqrt(173) ≈ 13.2

2.7 Regression – MAPE (Mean Absolute Percentage Error)
Use it when you want the average absolute percentage error over all data points.
MAPE = (1/n) × Σ |y_i − ŷ_i| / y_i
Example (same data):
MAPE = ( |152 − 168|/152 + |234 − 212|/234 + |483 − 472|/483 + |38 − 36|/38 + |18 − 18|/18 ) / 5 ≈ 0.055 = 5.5%
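As a quick check of the worked examples above, the following minimal sketch (Python standard library only) recomputes the classification metrics from the slide's confusion matrix and the regression metrics from the five engagement predictions.

import math

# Classification metrics from the slide's confusion matrix.
TP, FP, FN, TN = 124, 36, 16, 102
accuracy = (TP + TN) / (TP + FP + FN + TN)          # ≈ 0.813
precision = TP / (TP + FP)                          # = 0.775
recall = TP / (TP + FN)                             # ≈ 0.886
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.827

# Regression metrics from the slide's engagement example.
y_true = [152, 234, 483, 38, 18]   # actual engagement
y_pred = [168, 212, 472, 36, 18]   # predicted engagement
n = len(y_true)
rmse = math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)  # sqrt(173) ≈ 13.2
mape = sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / n           # ≈ 0.055

print(accuracy, precision, recall, f1, rmse, mape)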
3 The "new" Metrics

METRICS IN TM — other metrics used in TM, grouped by the task they evaluate: MAP and MRR (information retrieval), ROUGE (summarization), BLEU and METEOR (machine translation), and Perplexity (language modelling).

Information retrieval — what is it?
"Information retrieval is a subdiscipline of computer science that is concerned with developing accurate algorithms for retrieving information from databases of documents or textual information." — Ambert, K. H., & Cohen, A. M. (2012). Text-mining and neuroscience. International Review of Neurobiology, 103, 109-132.
"Information retrieval (IR) deals with searching for information as well as recovery of textual information from a collection of resources." — Kaushik, S., Baloni, P., & Midha, C. K. (2019). Text mining resources for bioinformatics.

3.1 MAP – Mean Average Precision

[Figure: a query ("Capital of Portugal") is issued against a collection of documents D1–D5.]

Use it when you want the mean of the average precision calculated across queries:
mAP = (1/|Q|) × Σ_{i=1..|Q|} AP_i
To understand the calculation of mAP, we first need three building blocks: precision, precision at k, and average precision.

1. Precision
precision (in information retrieval) = |relevant documents ∩ retrieved documents| / |retrieved documents|
Example: if 2 of the 5 retrieved documents D1–D5 are relevant, precision = 2/5 = 0.4.

2. Precision at k
The user submits a query ("Capital of Portugal") and the database returns D1–D5, each with a similarity score computed by the model:
D1 = 0.5, D2 = 0.1, D3 = 0.4, D4 = 0.3, D5 = 0.8.
The documents are sorted by score so that the model's most confident selections are evaluated first: D5 (0.8), D1 (0.5), D3 (0.4), D4 (0.3), D2 (0.1).
The relevant documents are then identified: D5 = Y, D1 = N, D3 = Y, D4 = N, D2 = N.
P(k) = (number of true positives in the top k) / (number of documents in the top k)
P(1) = 1/1 = 1, P(2) = 1/2 = 0.5, P(3) = 2/3 ≈ 0.67, P(4) = 2/4 = 0.5, P(5) = 2/5 = 0.4

3. Average precision
AP = (1/RD) × Σ_{k=1..N} P(k) · r(k), where RD is the number of relevant documents and r(k) = 1 if the document at rank k is relevant, 0 otherwise.
For the ranking above, P(k)·r(k) = 1, 0, 0.67, 0, 0, so
AP = (1 + 0 + 0.67 + 0 + 0) / 2 = 0.835

4. Mean average precision
mAP = (1/|Q|) × Σ AP_i over all queries.
Example with four queries whose APs are 0.835, 0.92, 0.74 and 0.96:
mAP = (0.835 + 0.92 + 0.74 + 0.96) / 4 = 0.864

MAP – pros
- Provides a comprehensive measure of the precision of the retrieval system across all relevant documents: it takes into account not just the rank of the first relevant document (as MRR does) but the ranks of all relevant documents.
- Gives more weight to errors that happen high up in the recommended list, and less weight to errors that happen deeper in the list.

MAP – cons
- Calculating MAP can be computationally expensive and complex, especially with a large number of queries and documents: it requires calculating the precision at multiple points in the ranking for each query and then averaging these precisions.
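A compact sketch of the calculation above, assuming the relevance labels of each ranked result are known; the function names are illustrative, and the small rounding difference versus the slide comes from the slide rounding P(3) to 0.67 before averaging.

def average_precision(relevances):
    """relevances: 0/1 flags in ranked order (1 = relevant at that rank)."""
    num_relevant = sum(relevances)
    if num_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            score += hits / k   # P(k), counted only at relevant positions
    return score / num_relevant

def mean_average_precision(relevance_lists):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)

# Ranking from the slides: D5, D1, D3, D4, D2 -> relevant? Y, N, Y, N, N
print(average_precision([1, 0, 1, 0, 0]))   # ≈ 0.833 (slide reports 0.835 after rounding)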
3.2 MRR – Mean Reciprocal Rank

[Figure: the query "Capital of Portugal" issued against documents D1–D5.]

Mean Reciprocal Rank (MRR) measures how far down the ranking the first relevant document is. If MRR is close to 1, relevant results are close to the top of the search results, which is what we want. Lower MRR indicates poorer search quality, with the right answer farther down in the results.

Use it when you want to evaluate retrieved responses by the rank of the first correct one. It is the mean of the reciprocals of the ranks of the retrieved results:
MRR = (1/|Q|) × Σ_{q=1..|Q|} 1 / rank_q

Example: the user submits a query ("Capital of Portugal") and the results are returned ranked 1 to 5.

Query | Results | Relevant result | Rank | Reciprocal rank
Q1: Capital of Portugal | Madrid, Porto, Lisboa | Lisboa | 3 | 1/3
Q2: Gravity on Earth | 9.8 m/s², 12.4 m/s², 3.7 m/s² | 9.8 m/s² | 1 | 1/1
Q3: Christmas day | 12/07, 25/12, 18/01 | 25/12 | 2 | 1/2

What is the MRR of this model?
MRR = (1/3) × (1/3 + 1/1 + 1/2) ≈ 0.61

MRR – pros
- Simple to compute and easy to interpret.
- Puts a high focus on the first relevant element of the list, so it is best suited for targeted searches such as a user asking for the "best item for me".
- Good for known-item search, such as navigational queries or looking up a fact.

MRR – cons
- Does not evaluate the rest of the recommended list; it focuses on a single item.
- It gives a list with a single relevant item just as much weight as a list with many relevant items. That is fine if that is the target of the evaluation.
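A minimal sketch of the MRR calculation, assuming we already know the rank of the first relevant result for each query; the function name is illustrative.

def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: 1-based rank of the first relevant result per query."""
    return sum(1.0 / rank for rank in first_relevant_ranks) / len(first_relevant_ranks)

# Ranks from the slides' three queries: Lisboa at rank 3, 9.8 m/s² at rank 1, 25/12 at rank 2.
print(mean_reciprocal_rank([3, 1, 2]))   # ≈ 0.61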
Summarization — what is it?
"Summarization systems need to produce a concise and fluent summary conveying the key information in the input." — Kumar, G. R., Basha, S. R., & Rao, S. B. (2020). A summarization on text mining techniques for information extracting from applications and issues. Journal of Mechanics of Continua and Mathematical Sciences, Special Issue (5).
"Automatic summarization involves reducing a text document or a larger corpus of multiple documents into a short set of words or paragraph that conveys the main meaning of the text." — Al-Hashemi, R. (2010). Text Summarization Extraction System (TSES) Using Extracted Keywords. Int. Arab J. e-Technol., 1(4), 164-168.

3.3 ROUGE – Recall-Oriented Understudy for Gisting Evaluation

Use it when you want to compare the quality of generated text to reference text, measuring recall.
Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74-81). Source: https://aclanthology.org/W04-1013.pdf

Example — Game of Thrones review:
Source review: "I gave 5 stars to Game of Thrones because the acting is superb, the cinematography and settings are flawless. At the very least it is a visual spectacle. The actors are superb in their delivery."
Machine-generated summary: "I really loved watching Game of Thrones."
Human reference summaries: "I loved watching Game of Thrones."; "The Game of Thrones is a great show. I loved it."

ROUGE compares a generated summary to one or more reference summaries. The basic idea: assign a single numerical score to a summary that tells us how "good" it is compared to the reference summaries.

Two main variants:
- ROUGE-N: calculates the overlap of n-grams between the generated text and the reference text.
- ROUGE-L: measures the longest common subsequence between the generated text and the reference text.

ROUGE-N — the steps
1. Tokenization: break the text into tokens and create the list of n-grams for both the generated and reference texts.
2. Counting overlap: count the number of common n-grams between the generated text and the reference text.
3. Precision, recall and F1-score: compute them for each n-gram size.
   - Precision measures how many of the generated n-grams are also in the reference.
   - Recall measures how many of the reference n-grams are in the generated text.
   - F1-score is the harmonic mean of precision and recall.
4. Averaging: average the precision, recall or F1-score over all reference texts to get the final ROUGE score.

ROUGE-1
How can we measure the quality of a generated summary automatically? Compare the n-grams of the generated summary to the n-grams of the reference.
Machine-generated summary: "I really loved watching Game of Thrones" (7 words)
Human reference summary: "I loved watching Game of Thrones" (6 words)
6 unigrams (words) match.
ROUGE-1 Recall = (number of word matches) / (number of words in reference) = 6/6 = 1
A recall of 1 means that every word in the reference summary appears in the generated one. It is also useful to measure ROUGE-1 precision, which tells us how much of the generated summary is relevant:
ROUGE-1 Precision = (number of word matches) / (number of words in generated) = 6/7 ≈ 0.86
And then we compute the F1-score:
ROUGE-1 F1 = 2 × (precision × recall) / (precision + recall)

What if we get this generated summary instead?
Machine-generated summary: "I really really really really loved watching Game of Thrones" (10 words)
Human reference summary: "I loved watching Game of Thrones"
It also has a ROUGE-1 recall of 1 (6/6), but it seems a worse summary. Measuring the precision exposes the problem:
ROUGE-1 Precision = (number of word matches) / (number of words in generated) = 6/10 = 0.6
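A small sketch of the ROUGE-N counts used above (Python standard library only); it clips repeated words with a Counter intersection, so the "really really really" case is handled as in the slides. The function name and tokenization by whitespace are simplifying assumptions.

from collections import Counter

def rouge_n(generated, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    gen, ref = ngrams(generated), ngrams(reference)
    overlap = sum((gen & ref).values())               # clipped n-gram matches
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "I loved watching Game of Thrones"
print(rouge_n("I really loved watching Game of Thrones", reference, n=1))   # precision 6/7, recall 6/6
print(rouge_n("I really really really really loved watching Game of Thrones", reference, n=1))  # precision drops to 0.6

The same function with n=2 reproduces the ROUGE-2 numbers in the next part of the section.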
ROUGE-2
We can also use bigrams.
Machine-generated summary bigrams: "I really", "really loved", "loved watching", "watching Game", "Game of", "of Thrones" (6 bigrams)
Human reference summary bigrams: "I loved", "loved watching", "watching Game", "Game of", "of Thrones" (5 bigrams)
4 bigrams match.
ROUGE-2 Recall = (number of bigram matches) / (number of bigrams in reference) = 4/5 = 0.8
ROUGE-2 Precision = (number of bigram matches) / (number of bigrams in generated) = 4/6 ≈ 0.67
ROUGE-2 F1 = 2 × (precision × recall) / (precision + recall)

ROUGE-N — combining the scores
ROUGE Recall Score = (ROUGE-1 Recall + ROUGE-2 Recall) / 2 = (1 + 0.8) / 2 = 0.9
The ROUGE recall score for this example is 0.9 (90%). The same can be done for precision and F1-score.

ROUGE-L
ROUGE-L does not compare n-grams; it treats each summary as a sequence of words and looks for the longest common subsequence (LCS). A subsequence is a sequence that appears in the same relative order but is not necessarily contiguous.
Machine-generated summary: "I really loved watching Game of Thrones"
Human reference summary: "I loved watching Game of Thrones"
LCS: "I loved watching Game of Thrones" (length 6)
ROUGE-L Recall = LCS(gen, ref) / (number of words in reference) = 6/6 = 1
ROUGE-L Precision = LCS(gen, ref) / (number of words in generated) = 6/7 ≈ 0.86

RECAP
ROUGE-N measures the overlap of n-grams (typically unigrams, bigrams, trigrams, …) between the generated text and the reference text.
When to use? Best suited for scenarios where maintaining exact word sequences is important, for instance when comparing a generated summary with a reference to see how well the model captured word pairings or single-word accuracy.
ROUGE-L measures the Longest Common Subsequence (LCS) between the candidate and reference text. It evaluates the ability of the generated summary to capture the longest matching sequence of words in the reference summary, and it does not depend on consecutive n-gram matches.
When to use? When the global structure and order of words are more important than precise n-gram matching, especially in tasks where fluency and coherence are critical.
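ROUGE-L needs an LCS length rather than n-gram counts. A minimal dynamic-programming sketch (standard library only) for the example above; the function name is illustrative.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

generated = "I really loved watching Game of Thrones".split()
reference = "I loved watching Game of Thrones".split()
lcs = lcs_length(generated, reference)               # 6
print(lcs / len(reference), lcs / len(generated))    # ROUGE-L recall = 1.0, precision ≈ 0.86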
3.4 BLEU – BiLingual Evaluation Understudy

Machine translation — what is it?
"… the automatic translation of written text from one natural language into another …" — Stahlberg, F. (2020). Neural machine translation: A review. Journal of Artificial Intelligence Research, 69, 343-418.
"Automatic translation from one human language to another using computers, better known as machine translation …" — Al-Onaizan, Y., Curin, J., Jahr, M., Knight, K., Lafferty, J., Melamed, D., … & Yarowsky, D. (1999). Statistical machine translation. In Final Report, JHU Summer Workshop (Vol. 30, pp. 98-157).

BLEU
Use it when you want to capture the amount of n-gram overlap between the output sentence and the reference ground-truth sentence. Initially developed for machine translation tasks, it has become a standard metric for evaluating generated text across various NLP domains.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). Source: BLEU: a Method for Automatic Evaluation of Machine Translation.

Example
Source text (Portuguese): "O meu cão tem quatro anos"
Machine-generated translation: "My dog has four years"
Human reference translations: "My dog is four years old" …
BLEU compares a generated translation to one or more reference translations. The basic idea: assign a single numerical score to a translation that tells us how "good" it is compared to one or more reference translations.

Unigram precision
How can we measure the quality of a generated translation automatically? Compare the n-grams of the generated translation to the n-grams of the reference.
Generated "My dog has four years" has 4 unigrams (words) that match the reference "My dog is four years old".
Unigram precision = (number of word matches) / (number of words in generation) = 4/5 = 0.8
Precision ranges from 0 to 1; a higher precision score means a better translation.

One problem with unigram precision: translation models sometimes get stuck in repetitive patterns and repeat the same word several times.
Generated "four four four four": all 4 words match the reference, so plain unigram precision = 4/4 = 1.
How to handle this? Apply a modified (clipped) unigram precision, where each reference word can only be matched as many times as it appears in the reference:
Modified unigram precision = clip(number of word matches) / (number of words in generation) = 1/4

Another problem with unigram precision: it does not take into account the order in which the words appear in the translations.
Generated "years four has dog My": 4 words match, so modified unigram precision = 4/5, but this is not what we want.
How to solve it? BLEU computes the precision for several different n-gram sizes and then averages the results. If we compare 4-grams:
4-gram precision = clip(number of 4-gram matches) / (number of 4-grams in generation) = 0/2 = 0

Computing BLEU for the example
Machine-generated translation: "My dog has four years"; human reference translation: "My dog is four years old".
Step 1 — Calculate the precision for each n-gram size (1-grams, 2-grams, …) in the generated text compared to the reference text:
Precision(1-gram) = 4/5 = 0.8
Precision(2-gram) = 2/4 = 0.5
Note: in this case we only compute up to 2-grams, but it is common to include 3-grams and 4-grams as well.
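A sketch of the clipped (modified) n-gram precision described above, using a Counter intersection for the clipping; standard library only, with an illustrative function name.

from collections import Counter

def clipped_ngram_precision(candidate, reference, n=1):
    """Modified n-gram precision: each reference n-gram can be matched at most
    as many times as it occurs in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    return sum((cand & ref).values()) / max(sum(cand.values()), 1)

reference = "My dog is four years old"
print(clipped_ngram_precision("My dog has four years", reference, n=1))  # 0.8
print(clipped_ngram_precision("My dog has four years", reference, n=2))  # 0.5
print(clipped_ngram_precision("four four four four", reference, n=1))    # 0.25 after clipping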
Step 2 — Calculate the brevity penalty, to account for generated texts that are too short:
Brevity Penalty = min(1, exp(1 − (number of words in reference) / (number of words in generation))) = min(1, exp(1 − 6/5)) = min(1, 0.82) = 0.82
What is the brevity penalty? A penalty to discourage the generation of excessively short output, which could otherwise score too high simply by having matching words, even if they don't cover the full meaning of the reference sentence. If the candidate translation is shorter than the reference (generation < reference), the penalty is an exponential decay function, so the BLEU score is reduced in proportion to how much shorter the candidate is.

Step 3 — Calculate the geometric mean of the precisions:
Geometric Mean = (p1 × p2)^(1/2) = (0.8 × 0.5)^(1/2) ≈ 0.63

Step 4 — Calculate the BLEU score:
BLEU Score = Brevity Penalty × Geometric Mean = 0.82 × 0.63 ≈ 0.52
The BLEU score for this example is approximately 52% when calculating precision for 1-grams and 2-grams.

BLEU – advantages
- Simplicity
- Ease of computation
- Widely used

BLEU – disadvantages
- Does not consider the semantics / meaning of the words
- Does not incorporate sentence structure
- Struggles with non-English languages
Reiter, E. (2018). A structured review of the validity of BLEU. Computational Linguistics, 44(3), 393-401.
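Putting the pieces together for the worked example, under the slide's simplification of using only 1-gram and 2-gram precision; a minimal sketch reusing the clipped-precision idea. Note that standard implementations (e.g. NLTK's sentence_bleu) typically use up to 4-grams and a weighted log-average, so the numbers here match the slide rather than library defaults.

import math
from collections import Counter

def clipped_precision(candidate_tokens, reference_tokens, n):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    return sum((cand & ref).values()) / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    geo_mean = math.prod(precisions) ** (1 / max_n)            # geometric mean of the precisions
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))     # brevity penalty
    return brevity * geo_mean

print(bleu("My dog has four years", "My dog is four years old"))   # ≈ 0.52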
3.5 METEOR – Metric for Evaluation of Translation with Explicit ORdering

Use it when you want to measure the quality of generated text. It is a more robust alternative to BLEU.
Banerjee, S., & Lavie, A. (2005, June). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65-72). Source: https://aclanthology.org/W05-0909.pdf

METEOR – advantages
- Considers synonyms, paraphrases and the reordering of words
- More robust for evaluating text generated by models with linguistic variation

METEOR – disadvantages
- Can be computationally intensive
- May require a large amount of training data to produce accurate results

METEOR vs. BLEU
- Word matching — BLEU: exact n-gram matches only. METEOR: handles exact matches, stemming and synonyms.
- Precision vs. recall — BLEU: precision-focused (how much of the candidate matches the reference). METEOR: balances precision and recall (how much of the reference is covered by the candidate).
- Handling of word order — BLEU: no explicit handling of word order beyond n-gram matching. METEOR: applies a penalty for badly ordered or fragmented matches.
- Use case — BLEU is suitable for short texts where exact phrase matching is important. METEOR is better for longer texts or more flexible language use, such as translations or summaries that involve synonyms and paraphrasing.

3.6 Perplexity

Use it when you want to measure how "confused" a text mining model is; it is derived from the cross-entropy of a next-word prediction task. A lower perplexity indicates a better language model. See the document provided on Moodle for a more detailed explanation of perplexity; a tiny numeric sketch also follows at the end of this transcript.

Practical class: Metrics in Supervised Learning
Next week: Sentiment Analysis

Obrigada! (Thank you!)
Morada (Address): Campus de Campolide, 1070-312 Lisboa, Portugal
Tel: +351 213 828 610 | Fax: +351 213 828 611
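Appendix — relating back to the perplexity section above: a minimal numeric sketch, assuming we have the model's probability for each observed next word; perplexity is the exponential of the average negative log-probability (the cross-entropy). The probability values below are hypothetical.

import math

# Hypothetical per-token probabilities a language model assigned to the observed next words.
token_probs = [0.2, 0.05, 0.1, 0.3]

cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(cross_entropy)   # lower perplexity = a less "confused" model
print(cross_entropy, perplexity)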