Lecture 7bis: Performance Metrics for Classifiers

Document Details


Uploaded by HumbleUnakite3397

Politecnico di Torino

G.C. Calafiore

Tags

machine learning, classification, performance metrics, data analysis, optimization

Summary

This document discusses performance metrics for classifiers, particularly in machine learning contexts. It explores concepts such as accuracy, precision, and sensitivity (recall), introduces ROC curves, and provides a detailed explanation of the metrics and their applications in classification.

Full Transcript


LECTURE 7bis: Performance Metrics for Classifiers
G.C. Calafiore (Politecnico di Torino)

Outline
1. Evaluating a classifier's performance
2. Estimating the performance metrics: F1, ROC, etc.

Performance metrics

In regression we can use, for instance, the average prediction error (on the test set) to evaluate the performance of a particular prediction algorithm. In classification, the counterpart of the average prediction error would be the so-called accuracy, which simply counts the relative number of correct predictions:

Accuracy = (number of correct predictions) / (total number of predictions made).

However, accuracy hardly tells the full story about a classification algorithm's performance. Consider a binary classification problem, with y = 1 for the positive class and y = 0 for the negative class. Misclassification errors can have very different consequences!

Type I and Type II errors

Type I errors are the so-called False Positives (FP), that is, individuals that are classified as positive while they are negative in reality. Type II errors are the so-called False Negatives (FN), that is, individuals that are classified as negative while they are positive in reality.

Examples:
- A missile pointing device classifies a target as positive when it is an enemy, and negative otherwise. An FP (type I) error may lead to destroying a civilian or allied target.
- A medical test for HIV is positive if the patient may have contracted HIV. An FN (type II) error may lead to a missed therapy opportunity and/or to continuing dangerous behaviour.

Why is accuracy a misleading indicator?

In many situations the sample population is highly unbalanced, that is, the negative class vastly outnumbers the positive class. For example, the global prevalence of HIV infection among adults is about 1%: the sample population is 99% in the negative class and only 1% in the positive class. Suppose we build a very "stupid" HIV test that, for any patient, always returns the result "negative." Such a "stupid" test, on average, will be correct 99% of the time, i.e., its accuracy is 99%, which is very high indeed! This example makes a case for the fact that accuracy alone may be a very poor indicator of a classifier's performance.

We need to capture false positives and false negatives, and we can use the corresponding error rates (evaluated on the test set) as performance criteria. Among all samples:
- TP (true positives): samples detected as positive that are indeed positive in reality;
- FP (false positives): samples detected as positive that are actually negative in reality;
- FN (false negatives): samples detected as negative that are actually positive in reality;
- TN (true negatives): samples detected as negative that are indeed negative in reality.
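As a quick numerical illustration of these four counts and of the accuracy pitfall above, here is a small Python sketch (not part of the original slides; the 1% prevalence and the population size are the illustrative numbers of the HIV example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population of 100,000 adults with 1% prevalence: x = 1 positive, x = 0 negative.
x_true = (rng.random(100_000) < 0.01).astype(int)

# The "stupid" HIV test: always predict negative.
y_pred = np.zeros_like(x_true)

# The four outcome counts.
TP = np.sum((y_pred == 1) & (x_true == 1))
FP = np.sum((y_pred == 1) & (x_true == 0))
FN = np.sum((y_pred == 0) & (x_true == 1))
TN = np.sum((y_pred == 0) & (x_true == 0))

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(f"Accuracy = {accuracy:.3f}, TP = {TP}")  # accuracy ~0.99, yet TP = 0: every positive case is missed
```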
Precision and sensitivity

We let x denote the hidden (true and unknown) state, with x = 0 for a negative individual and x = 1 for a positive individual. We let y denote the classifier output, with y = 0 for a negative-classified individual and y = 1 for a positive-classified individual. We define:
- Precision (or positive predictive value, PPV): p = Prob{x = 1 | y = 1}, the probability that the true state is positive, given that the sample is classified as positive.
- Sensitivity (true positive rate TPR, or recall): r = Prob{y = 1 | x = 1}, the probability that the classifier returns positive, given that the sample is positive in reality.

Type I and II error rates

Specificity (true negative rate TNR): s = Prob{y = 0 | x = 0}, the probability that the classifier returns negative, given that the sample is negative in reality.

Since Prob{y = 1 | x = 0} + Prob{y = 0 | x = 0} = 1, we have that

FPR = Prob{y = 1 | x = 0} = 1 − Prob{y = 0 | x = 0} = 1 − TNR,

that is, the False Positive Rate (or Type I error rate) is the complement of the specificity (TNR). Since Prob{y = 1 | x = 1} + Prob{y = 0 | x = 1} = 1, we have that

FNR = Prob{y = 0 | x = 1} = 1 − Prob{y = 1 | x = 1} = 1 − TPR,

that is, the False Negative Rate (or Type II error rate) is the complement of the sensitivity (TPR).

Relation between precision and sensitivity

Precision and sensitivity (recall, r, or TPR) are related via Bayes' rule:

p = Prob{x = 1 | y = 1}
  = Prob{y = 1 | x = 1} Prob{x = 1} / Prob{y = 1}
  = TPR · Prob{x = 1} / (Prob{y = 1 | x = 0} Prob{x = 0} + Prob{y = 1 | x = 1} Prob{x = 1})
  = TPR · Prob{x = 1} / (FPR · (1 − Prob{x = 1}) + TPR · Prob{x = 1}),

where Prob{x = 1} is the so-called baseline risk.

Estimating the performance metrics

The confusion matrix is a 2 × 2 matrix that suitably arranges the numbers of samples that fall into the four possibilities (TP, FP, FN, TN):

                              state positive (x = 1)   state negative (x = 0)
classified positive (y = 1)   TP                       FP                       CP = TP + FP
classified negative (y = 0)   FN                       TN                       CN = FN + TN
                              P = TP + FN              N = FP + TN

By building the confusion matrix on test sample data, we can obtain estimates of the main classification performance criteria:
- Precision p: the number of TP divided by the number of all classified-positive results, p = Prob{x = 1 | y = 1} = TP / (TP + FP).
- Recall r: the number of TP divided by the total number of actual positives, TPR = r = Prob{y = 1 | x = 1} = TP / (TP + FN).
- Specificity s: the number of TN divided by the total number of actual negatives, TNR = s = Prob{y = 0 | x = 0} = TN / (FP + TN).

Multi-criteria metrics

A classifier's performance cannot be evaluated on a single criterion only. Suitable mixtures of performance criteria have been proposed, notably (a short code sketch of how these quantities are estimated follows this list):
- The F1 score, F1 = 2 / (1/p + 1/r) = 2pr / (p + r), the harmonic mean of precision and recall.
- The balanced accuracy, BA = (TPR + TNR) / 2, the average of TPR and TNR.
- The ROC curve, which plots TPR vs. FPR (see next).
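A minimal Python sketch of how these estimates and the Bayes relation above can be computed, assuming the four confusion-matrix counts are already available; the function names and numbers are illustrative, not from the slides:

```python
def classification_metrics(TP, FP, FN, TN):
    """Estimates of the main criteria from the four confusion-matrix counts
    (assumes each denominator is nonzero)."""
    p = TP / (TP + FP)          # precision (PPV)
    r = TP / (TP + FN)          # recall / sensitivity (TPR)
    s = TN / (FP + TN)          # specificity (TNR)
    f1 = 2 * p * r / (p + r)    # harmonic mean of precision and recall
    ba = (r + s) / 2            # balanced accuracy
    return {"precision": p, "recall": r, "specificity": s, "F1": f1, "BA": ba}


def precision_from_rates(tpr, fpr, prevalence):
    """Bayes' rule: precision from TPR, FPR and the baseline risk Prob{x = 1}."""
    return tpr * prevalence / (fpr * (1 - prevalence) + tpr * prevalence)


# Made-up counts for a population of 10,000 with 1% prevalence,
# 95% sensitivity and a 5% false positive rate.
print(classification_metrics(TP=95, FP=495, FN=5, TN=9405))
print(precision_from_rates(tpr=0.95, fpr=0.05, prevalence=0.01))
```

For these numbers both routes give a precision of about 0.16, which makes concrete how a low baseline risk drags precision down even for a test that is both sensitive and specific.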
Receiver Operating Characteristic (ROC)

In binary classification, the class prediction for each instance is often made based on a continuous random variable X, which is a "score" computed for the instance (e.g., the estimated probability in logistic regression). Given a threshold parameter T, the instance is classified as "positive" if X > T, and "negative" otherwise. The ROC curve is created by plotting the true positive rate (recall, TPR) against the false positive rate (FPR) at various threshold settings:

TPR = Prob{predicted pos. | is positive} = TP / (TP + FN) = TP / P,
FPR = Prob{predicted pos. | is negative} = FP / (FP + TN) = FP / N.

In other words, the ROC curve shows the ratio between true alarms (hit rate) and false alarms.

X follows a probability density f1(x) if the instance actually belongs to the "positive" class, and f0(x) otherwise. Therefore, the true positive rate is given by

TPR(T) = ∫_T^∞ f1(x) dx,

and the false positive rate is given by

FPR(T) = ∫_T^∞ f0(x) dx.

The ROC curve plots parametrically TPR(T) versus FPR(T), with T as the varying parameter.

Example (Wikipedia): suppose that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL, respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (the black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold results in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.

[Figure: the two class-conditional score densities with the decision threshold, and the resulting ROC curve.]

Area Under the Curve (AUC)

The AUC (Area Under the ROC Curve), also known as AUROC, is an evaluation metric for checking a classification model's performance. AUC represents the degree, or measure, of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0.5, which means it has no capacity to discriminate between the positive class and the negative class (0.5 is the AUC of a uniform random number generator used as a predictor).
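The parametric construction of the ROC and of its AUC can be sketched directly in Python (illustrative code, not from the slides): the threshold T is swept over the observed scores, and the area is obtained with the trapezoidal rule. The toy scores mimic the blood-protein example, with unit-variance Gaussian densities centred at 1 and 2.

```python
import numpy as np

def roc_points(scores, labels):
    """TPR(T) and FPR(T) as the threshold T sweeps over the observed scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    # From "classify nothing as positive" (T = +inf) to "classify everything" (T = -inf).
    thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1], [-np.inf]))
    tpr = np.array([np.sum((scores > t) & (labels == 1)) / P for t in thresholds])
    fpr = np.array([np.sum((scores > t) & (labels == 0)) / N for t in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule (fpr is non-decreasing here)."""
    return np.trapz(tpr, fpr)

# Toy data: class-conditional scores are Gaussian with means 1 (healthy) and
# 2 (diseased) and unit standard deviation, as in the example above.
rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])
fpr, tpr = roc_points(scores, labels)
print(f"AUC = {auc(fpr, tpr):.2f}")  # close to the theoretical 0.76 for these densities
```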
ROC example with a logistic model

Consider a (binary) logistic regression model. For a given input vector x of features, the logistic model returns the probability p(x) of x being in the positive class (as well as the complementary probability 1 − p(x)). The actual decision is based on comparing this probability with a threshold γ ∈ [0, 1]: we decide that x is in the positive class iff p(x) > γ. For any given γ we can test the corresponding classifier on validation data and estimate the TPR and FPR. The ROC is then obtained as the (parametric) plot of TPR vs. FPR.

Clearly, for γ = 1 we have TPR = 0 and FPR = 0, since no sample is ever classified as positive. For γ = 0 we have TPR = 1 and FPR = 1, since all samples are classified as positive. Intermediate values of γ produce the "interesting" part of the ROC curve. A suitable value for the threshold γ can be inferred graphically, by choosing the point where the ROC curve has a "knee."

Consider the height and weight data for the Male and Female populations used in Practice 1. We train a logistic regression model on 70% of the individuals, chosen at random, and then test the prediction performance on the remaining 30% of the data. Letting Male be the positive class, we estimate the TPR and FPR on validation data for γ values in the range (0, 1) and plot the ROC. The curve shows excellent predictive performance, with an AUC of approximately 0.97. See roc example.m.

[Figure: ROC curve (TPR vs. FPR) for the logistic model, with the point corresponding to γ = 0.5 marked; AUC = 0.97.]
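The slides point to the MATLAB script roc example.m for this experiment; the sketch below reproduces the same workflow in Python with scikit-learn on synthetic height/weight data (the data, seeds and resulting AUC are illustrative stand-ins, not the Practice 1 dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the height/weight data; Male is the positive class (y = 1).
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
height = np.where(y == 1, rng.normal(175, 7, n), rng.normal(162, 6, n))  # cm
weight = np.where(y == 1, rng.normal(78, 10, n), rng.normal(63, 9, n))   # kg
X = np.column_stack([height, weight])

# 70% training / 30% validation split, as in the slides.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_va = model.predict_proba(X_va)[:, 1]  # p(x): estimated probability of the positive class

# Sweep the threshold over the estimated probabilities and collect TPR/FPR, then the AUC.
fpr, tpr, gammas = roc_curve(y_va, p_va)
print(f"AUC = {roc_auc_score(y_va, p_va):.2f}")
```

The threshold values returned by roc_curve play the role of γ here; the "knee" of the plotted curve can then be read off to pick an operating point.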
