NLP Evaluation Methods PDF

Document Details


University of Rijeka

Sanda Martinčić-Ipšić

Tags

NLP, classification, evaluation metrics, machine learning, text analysis

Summary

This document provides a detailed overview of methods for evaluating Natural Language Processing (NLP) classification models, focusing on accuracy, precision, and recall. It examines various scenarios and metrics with practical examples to illustrate these evaluation methods.

Full Transcript


Classification Evaluation
Sanda Martinčić-Ipšić, Full Professor, [email protected]

Evaluating classification methods
- Predictive accuracy
- Efficiency: time to construct the model, time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understandability of and insight provided by the model
- Compactness of the model: size of the tree, or the number of rules

Classification measures: Accuracy
Accuracy is one measure, paired with error: error = 1 - accuracy. Accuracy is not suitable in some applications. In text mining, we may only be interested in the documents of a particular topic, which are only a small portion of a big document collection. In classification involving skewed or highly imbalanced data (network intrusion and financial fraud detection), we are interested only in the minority class. High accuracy does not mean any intrusion is detected; for example, with 1% intrusions we can achieve 99% accuracy by doing nothing. The class of interest is commonly called the positive class, and the rest are the negative classes.

Evaluation: Accuracy
Why don't we use accuracy as our metric? Imagine we saw 1 million tweets: 100 of them talked about Delicious Pie Co., and 999,900 talked about something else. We could build a dumb classifier that just labels every tweet "not about pie". It would get 99.99% accuracy, but it is useless: it doesn't return the comments we are looking for. That's why we use precision and recall instead.

Evaluation Example
Let's consider just binary text classification tasks. Imagine you're the CEO of Delicious Pie Company and you want to know what people are saying about your pies, so you build a "Delicious Pie" tweet detector. Positive class: tweets about Delicious Pie Co. Negative class: all other tweets.

Text Classification Evaluation: Precision, Recall, and F measure

Evaluation II
Use a test collection where you have: a set of documents, a set of queries, and a set of relevance judgments that tell you which documents are relevant to each query.

Precision and Recall
(Here a document is one test example, or instance.)
Precision is the percentage of things you find (return) that are right:
Precision = #relevant docs returned / #docs returned
Recall is the percentage of right things out there that you found (returned):
Recall = #relevant docs returned / #relevant docs total

The 2-by-2 confusion matrix

Evaluation: Precision
The percentage of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels).

Evaluation: Recall
The percentage of items actually present in the input that were correctly identified by the system.

Why precision and recall: Evaluation Example
Our dumb pie classifier just labels nothing as "about pie". Accuracy = 99.99%, but recall = 0 (it doesn't get any of the 100 pie tweets). Precision and recall, unlike accuracy, emphasize true positives: finding the things that we are supposed to be looking for.

A combined measure: F
The F measure is a single number that combines P and R. We almost always use the balanced F1 (i.e., β = 1).
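The F measure, like precision and recall, can be computed directly from the four cells of the 2-by-2 confusion matrix. Below is a minimal Python sketch of that computation (not from the slides); the function name and the pie-detector counts are hypothetical, chosen only to match the 1-million-tweet scenario above.

```python
def binary_metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall, and F-beta from 2-by-2 confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F1 (beta = 1) is the harmonic mean of precision and recall.
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_beta

# Hypothetical pie detector evaluated on 1,000,000 tweets: it finds 60 of the
# 100 pie tweets and wrongly flags 40 non-pie tweets.
print(binary_metrics(tp=60, fp=40, fn=40, tn=999_860))
# -> accuracy 0.99992, precision 0.6, recall 0.6, F1 0.6
```

Note how the near-perfect accuracy coexists with mediocre precision and recall, which is exactly the point the slides make about imbalanced classes.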
Text Classification Evaluation: ROC - AUROC

ROC
The receiver operating characteristic (ROC) curve is a plot of the true positive rate against the false positive rate.
The true positive rate (TPR) is the fraction of actual positive cases that are correctly classified:
TPR = TP / (TP + FN)
The false positive rate (FPR) is the fraction of actual negative cases that are classified to the positive class:
FPR = FP / (TN + FP)

ROC II
Sensitivity is the recall of the positive class:
sensitivity = TPR = TP / (TP + FN)
Specificity is the recall of the negative class, the true negative rate (TNR):
specificity = TNR = TN / (TN + FP)
FPR = 1 - specificity = 1 - TN / (TN + FP)

Tom Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27(8), 2006, pp. 861-874, ISSN 0167-8655, https://doi.org/10.1016/j.patrec.2005.10.010
https://arize.com/blog/what-is-auc/

Evaluation methods: Holdout set
The available data set D is divided into two disjoint subsets: the training set Dtrain (for learning a model) and the test set Dtest (for testing the model), e.g., 70% : 30%. Important: the training set should not be used in testing and the test set should not be used in learning; an unseen test set provides an unbiased estimate of accuracy. The test set is also called the holdout set. The examples in the original data set D are all labeled with classes. This method is mainly used when the data set D is large.

Evaluation methods: Cross-validation
n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets. Use each subset as the test set and combine the remaining n-1 subsets as the training set to learn a classifier. The procedure is run n times, which gives n accuracies. The final estimated accuracy of learning is the average of the n accuracies. 10-fold and 5-fold cross-validation are commonly used. This method is used when the available data is not large.

Evaluation methods: Leave-one-out
Leave-one-out cross-validation is used when the data set is very small. It is a special case of cross-validation in which each fold has only a single test example and all the rest of the data is used in training. If the original data has m examples, this is m-fold cross-validation.

Evaluation methods: Validation set
The available data is divided into three subsets: a training set, a validation set, and a test set. A validation set is frequently used for estimating parameters in learning algorithms; in such cases, the values that give the best accuracy on the validation set are used as the final parameter values. Cross-validation can be used for parameter estimation as well.

Development Test Sets ("Devsets") and Cross-validation
Training set | Development test set | Test set. Train on the training set, tune on the devset, report on the test set. This avoids overfitting ("tuning to the test set") and gives a more conservative estimate of performance. But there is a paradox: we want as much data as possible for training, and as much as possible for dev; how do we split? Cross-validation: make multiple splits, pool results over the splits, and compute pooled dev performance.

Confusion Matrix: Evaluation with more than two classes, Micro and Macro Averaging
Given a confusion matrix for 3-class classification, how do we combine P/R from the 3 classes to get one metric?
- Macroaveraging: compute the performance for each class, and then average over classes.
- Microaveraging: collect the decisions for all classes into one confusion matrix and compute precision and recall from that table.

Macroaveraging and Microaveraging
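As an illustration of the two averaging schemes just defined, here is a minimal Python sketch over a hypothetical 3-class confusion matrix (the counts are invented for illustration and are not from the slides).

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = gold class, columns = system decision.
cm = np.array([[ 5, 10,  5],
               [15, 60,  5],
               [ 5, 10, 85]])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp   # predicted as class c, but gold label is another class
fn = cm.sum(axis=1) - tp   # gold class c, but predicted as another class

# Macroaveraging: per-class precision/recall, then average over the classes.
macro_p = (tp / (tp + fp)).mean()
macro_r = (tp / (tp + fn)).mean()

# Microaveraging: pool all decisions into one table, then compute precision/recall once.
micro_p = tp.sum() / (tp.sum() + fp.sum())
micro_r = tp.sum() / (tp.sum() + fn.sum())

print(macro_p, macro_r, micro_p, micro_r)
```

Microaveraging is dominated by the frequent classes, while macroaveraging weights every class equally, which is why the two numbers can differ noticeably on imbalanced data.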
Statistical Significance Testing
How do we know if one classifier is better than another?
Given classifiers A and B and a metric M, where M(A, x) is the performance of A on test set x, let δ(x) be the performance difference between A and B on x:
δ(x) = M(A, x) - M(B, x)
We want to know whether δ(x) > 0, meaning A is better than B; δ(x) is called the effect size. Suppose we look and see that δ(x) is positive. Are we done? No! This might be just an accident of this one test set, or a circumstance of the experiment. Instead: statistical hypothesis testing.

Statistical Hypothesis Testing
Consider two hypotheses:
- Null hypothesis H0: A isn't better than B
- H1: A is better than B
We want to rule out H0. We create a random variable X ranging over test sets and ask: how likely, if H0 is true, is it that among these test sets we would see the δ(x) we did see? This is formalized as the p-value.

Statistical Hypothesis Testing
In our example, the p-value is the probability that we would see δ(x) assuming H0 (that is, assuming A is not better than B). If H0 is true but δ(x) is huge, that is surprising and has very low probability. A very small p-value means that the difference we observed is very unlikely under the null hypothesis, and we can reject the null hypothesis. "Very small" is typically .05 or .01. A result (e.g., "A is better than B") is statistically significant if the δ we saw has a probability below the threshold, and we therefore reject the null hypothesis.

Statistical Hypothesis Testing
How do we compute this probability? In NLP, we don't tend to use parametric tests (like t-tests). Instead, we use non-parametric tests based on sampling: artificially creating many versions of the setup. For example, suppose we had created zillions of test sets x'. We measure the value of δ(x') on each test set, which gives us a distribution. Now set a threshold (say .01): if we see that in 99% of the test sets δ(x) > δ(x'), we conclude that our original test-set delta was a real delta and not an artifact.

Statistical Hypothesis Testing
Two common approaches: approximate randomization and the bootstrap test.
Paired tests: comparing two sets of observations in which each observation in one set can be paired with an observation in the other. For example, when looking at systems A and B on the same test set, we can compare the performance of systems A and B on each same observation xi.

The Paired Bootstrap Test
The bootstrap test (Efron and Tibshirani, 1993) can be applied to any metric (accuracy, precision, recall, F1, etc.). Bootstrap means to repeatedly draw large numbers of smaller samples with replacement (called bootstrap samples) from an original larger sample.
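Here is a minimal Python sketch of the paired bootstrap test as just described, with accuracy as the metric. The function name, the per-example 0/1 correctness lists, and the sample count are assumptions for illustration; the worked example on the following slides traces the same counting by hand.

```python
import random

def paired_bootstrap(a_correct, b_correct, b_samples=10_000, seed=0):
    """Paired bootstrap p-value for H0: 'A is not better than B' (metric: accuracy).

    a_correct, b_correct: 0/1 correctness of systems A and B on the same test set,
    so each observation for A is paired with one for B.
    """
    rng = random.Random(seed)
    n = len(a_correct)
    delta_x = (sum(a_correct) - sum(b_correct)) / n   # observed effect size on x

    exceed = 0
    for _ in range(b_samples):
        idx = [rng.randrange(n) for _ in range(n)]    # draw n cells with replacement
        delta_i = sum(a_correct[j] - b_correct[j] for j in idx) / n
        # The bootstrap samples come from a set biased by delta_x in favor of A, so we
        # count how often delta_i exceeds the expected value delta_x by delta_x or more.
        if delta_i >= 2 * delta_x:
            exceed += 1
    return exceed / b_samples

# Hypothetical toy test set of 10 documents (1 = correct, 0 = wrong): A is right on 7,
# B on 5, so delta(x) = .20 as in the example that follows.
a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
b = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]
print(paired_bootstrap(a, b))
```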
Bootstrap example
Consider a baby text classification example with a test set x of 10 documents, using accuracy as the metric. The results of systems A and B on x fall into 4 outcomes per document: A & B both right, A & B both wrong, A right and B wrong, A wrong and B right. On this test set A is correct on 7 of the 10 documents and B on 5, so A% = .70, B% = .50, and δ(x) = .20.

Bootstrap example
Now we create many, say b = 10,000, virtual test sets x(i), each of size n = 10. To make each x(i), we randomly select a cell from row x, with replacement, 10 times. For example, one such sample x(1) gives A% = .60, B% = .60, δ = .00, and another, x(2), gives A% = .60, B% = .70, δ = -.10, and so on up to x(b).

Bootstrap example
Now we have a distribution! We can check how often A has an accidental advantage, to see whether the original δ(x) we saw was very common. Assuming H0, we would normally expect δ(x') = 0, so we would just count how many times the δ(x') we found exceeds the expected 0 value by δ(x) or more.

Bootstrap example
Alas, it's slightly more complicated. We didn't draw these samples from a distribution with 0 mean; we created them from the original test set x, which happens to be biased (by .20) in favor of A. So to measure how surprising our observed δ(x) is, we actually compute the p-value by counting how often δ(x') exceeds the expected value of δ(x) by δ(x) or more.

Bootstrap example
Suppose we have 10,000 test sets x(i) and a threshold of .01, and in only 47 of the test sets do we find that δ(x(i)) ≥ 2δ(x). The resulting p-value is .0047. This is smaller than .01, indicating that δ(x) is indeed sufficiently surprising, so we reject the null hypothesis and conclude that A is better than B.

Paired bootstrap example
After Berg-Kirkpatrick et al. (2012).

Road Map: Classification
- Basic concepts
- Decision tree induction
- Evaluation of classifiers
- Naïve Bayesian classification
- Naïve Bayes for text classification
- Support vector machines
- K-nearest neighbor
- Summary

Evaluating classification methods
- Predictive accuracy
- Efficiency: time to construct the model, time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understandability of and insight provided by the model
- Compactness of the model: size of the tree, or the number of rules

Evaluation methods: Holdout set
The available data set D is divided into two disjoint subsets: the training set Dtrain (for learning a model) and the test set Dtest (for testing the model), e.g., 70% : 30%. Important: the training set should not be used in testing and the test set should not be used in learning; an unseen test set provides an unbiased estimate of accuracy. The test set is also called the holdout set. The examples in the original data set D are all labeled with classes. This method is mainly used when the data set D is large.

Evaluation methods: Cross-validation
n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets. Use each subset as the test set and combine the remaining n-1 subsets as the training set to learn a classifier. The procedure is run n times, which gives n accuracies. The final estimated accuracy of learning is the average of the n accuracies. 10-fold and 5-fold cross-validation are commonly used. This method is used when the available data is not large.

Evaluation methods: Leave-one-out
Leave-one-out cross-validation is used when the data set is very small. It is a special case of cross-validation in which each fold has only a single test example and all the rest of the data is used in training. If the original data has m examples, this is m-fold cross-validation.

Evaluation methods: Validation set
The available data is divided into three subsets: a training set, a validation set, and a test set. A validation set is frequently used for estimating parameters in learning algorithms; in such cases, the values that give the best accuracy on the validation set are used as the final parameter values. Cross-validation can be used for parameter estimation as well.
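To make the n-fold procedure above concrete, here is a minimal Python sketch of cross-validated accuracy. The helper names are mine, and the "classifier" is a deliberately trivial majority-class placeholder standing in for whichever learner (decision tree, Naïve Bayes, SVM, ...) is actually being evaluated.

```python
import random

def cross_validate(examples, labels, n_folds=10, seed=0):
    """Estimate accuracy as the average over n disjoint test folds."""
    data = list(zip(examples, labels))
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]   # n roughly equal-size subsets

    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]

        # Placeholder learner: predict the majority class of the n-1 training folds.
        train_labels = [y for _, y in train]
        majority = max(set(train_labels), key=train_labels.count)

        accuracies.append(sum(1 for _, y in test if y == majority) / len(test))
    return sum(accuracies) / n_folds

# Toy usage with hypothetical data: 100 labeled examples, 30% positive.
xs = list(range(100))
ys = [1] * 30 + [0] * 70
print(cross_validate(xs, ys, n_folds=10))
```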
Classification measures: Accuracy
Accuracy is one measure, paired with error: error = 1 - accuracy. Accuracy is not suitable in some applications. In text mining, we may only be interested in the documents of a particular topic, which are only a small portion of a big document collection. In classification involving skewed or highly imbalanced data (network intrusion and financial fraud detection), we are interested only in the minority class. High accuracy does not mean any intrusion is detected; for example, with 1% intrusions we can achieve 99% accuracy by doing nothing. The class of interest is commonly called the positive class, and the rest are the negative classes.

Evaluation II
Use a test collection where you have: a set of documents, a set of queries, and a set of relevance judgments that tell you which documents are relevant to each query.

Precision and Recall
(Here a document is one test example, or instance.)
Precision is the percentage of things you find (return) that are right:
Precision = #relevant docs returned / #docs returned
Recall is the percentage of right things out there that you found (returned):
Recall = #relevant docs returned / #relevant docs total

Confusion matrix
precision: p = TP / (TP + FP)
recall: r = TP / (TP + FN)
TP: the number of correct classifications of positive examples (true positives)
FN: the number of incorrect classifications of positive examples (false negatives)
FP: the number of incorrect classifications of negative examples (false positives)
TN: the number of correct classifications of negative examples (true negatives)

F-score in practice
High precision is almost always achieved at the expense of recall, and high recall at the expense of precision, so we need a single measure to compare different classifiers. The F1-score is the harmonic mean of precision and recall:
F1 = 2pr / (p + r) = 2 / (1/p + 1/r)
The precision-recall breakeven point is the point where the precision and the recall are equal.

ROC
The receiver operating characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. The true positive rate (TPR) is the fraction of actual positive cases that are correctly classified:
TPR = TP / (TP + FN)
The false positive rate (FPR) is the fraction of actual negative cases that are classified to the positive class:
FPR = FP / (TN + FP)

ROC II
Sensitivity is the recall of the positive class:
sensitivity = TPR = TP / (TP + FN)
Specificity is the recall of the negative class, the true negative rate (TNR):
specificity = TNR = TN / (TN + FP)
FPR = 1 - specificity = 1 - TN / (TN + FP)
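Since the next slides turn to classifiers that output scores, it may help to see how an ROC curve is traced from scores in practice. The sketch below (function names, scores, and labels are all made up for illustration) sweeps a decision threshold from high to low, records a (FPR, TPR) point at each step, and approximates the AUROC with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a decision threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])  # decreasing score
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:          # lowering the threshold admits one example at a time
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # FPR = FP/(TN+FP), TPR = TP/(TP+FN)
    return points

def auroc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical probability estimates and gold labels (1 = positive class).
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    0,    1,    0]
pts = roc_points(scores, labels)
print(pts)
print(auroc(pts))   # 0.75 for these made-up scores
```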
Another evaluation method: Scoring and ranking
Scoring is related to classification. We are interested in a single class (the positive class), e.g., the buyers class in a marketing database. Instead of assigning each test instance a definite class, scoring assigns a probability estimate (PE) to indicate the likelihood that the example belongs to the positive class.

Ranking and lift analysis
Ranking: the classifier gives a score (+ or -) to all examples, all examples are ranked by decreasing score, and the data is divided into n (say 10) equally sized bins. A lift curve can then be drawn according to how many positive examples fall in each bin; this is called lift analysis. Classification systems can be used for scoring as long as they produce a probability estimate; in decision trees, we can use the confidence value at each leaf node as the score.

An example
We want to send promotion materials to potential customers to sell a watch. Each package costs $0.50 to send (material and postage), and if a watch is sold we make $5 profit. Suppose we have a large amount of past data for building a predictive/classification model, and a large list of potential customers. How many packages should we send, and who should we send them to?

An example
Assume that the test set has 10,000 instances, of which 500 are positive cases. After the classifier is built, we score each test instance, rank the test set by decreasing score, and divide it into 10 bins (10% of the data each), so each bin has 1,000 test instances. Bin 1 has 210 actual positive instances, Bin 2 has 120, Bin 3 has 60, ..., and Bin 10 has 5.

Lift curve
Bin:                    1     2     3     4      5      6      7      8      9     10
Positives:            210   120    60    40     22     18     12      7      6      5
Percent of total:     42%   24%   12%    8%   4.4%   3.6%   2.4%   1.4%   1.2%     1%
Cumulative percent:   42%   66%   78%   86%  90.4%    94%  96.4%  97.8%    99%   100%
The lift curve plots the cumulative percent of total positive cases against the percent of testing cases, compared with the random baseline (the diagonal).
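The bin counts above are enough to reproduce the lift table and, under the cost and profit figures stated in the example, to decide how far down the ranked list it pays to mail. The sketch below uses the slide's numbers; applying the $0.50 cost and $5 profit figures per bin of 1,000 packages is my own illustrative extension of the example, not part of the slides.

```python
# Lift analysis for the watch-promotion example (bin counts from the slides).
bin_size = 1000
positives_per_bin = [210, 120, 60, 40, 22, 18, 12, 7, 6, 5]
total_positives = sum(positives_per_bin)            # 500

cumulative = 0
for i, pos in enumerate(positives_per_bin, start=1):
    cumulative += pos
    pct = 100 * pos / total_positives               # percent of total positives in the bin
    cum_pct = 100 * cumulative / total_positives    # the value plotted on the lift curve
    profit = 5.0 * pos - 0.5 * bin_size             # assumed: $5 per sale, $0.50 per package
    print(f"bin {i:2d}: {pct:5.1f}%  cumulative {cum_pct:6.2f}%  profit ${profit:7.2f}")

# Under these assumptions only the first two bins are profitable ($550 and $100),
# so we would mail to roughly the top 20% of the ranked customer list.
```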
