Chapter 8: Supervised Learning - Model Performance Evaluation

Learning Objectives

Previous chapters mentioned the concept of "model evaluation," specifically in the context of cross-validation and the fine-tuning of hyperparameters. This chapter formalizes these concepts by introducing metrics that can be used to evaluate the individual performance of a model or for comparison across models. A distinction is made between the measures used to evaluate the performance of a model when the output is a continuous variable or a discrete variable. After completing this chapter, you should be able to:

- Discuss metrics used to evaluate the performance of a model when the outcome variable is continuous.
- Evaluate the performance of a classification model using a confusion matrix and related metrics.
- Explain the relationship between true and false positive rates and how this trade-off can be illustrated using the receiver operating curve (ROC).

8.1 Model Evaluation When the Output Is Continuous

When the output is a continuous variable (e.g., a return or yield forecast), the most common measure of the predictive ability of a model is the mean squared error (MSE), sometimes referred to as the mean squared forecast error (MSFE). This is obtained by computing the average of the squares of the differences between the observed values and the predicted ones:

\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2    (8.1)

where yi is the true value of the outcome for observation i, ŷi is its model prediction, and N is the number of observations in the test sample, i.e., the number of predictions. Another measure of predictive ability is the root mean squared error (RMSE), which is the square root of the MSE:

\text{RMSE} = \sqrt{\text{MSE}}    (8.2)

Because the forecast errors are squared, the MSE scales with the square of the units of the data, whereas the RMSE scales with the units of the data and so is easier to interpret.

Nothing prevents us from computing the MSE over the training sample. However, we are generally more interested in computing performance measures over the validation sample (to tune the model's hyperparameters) or over the test sample (to evaluate the model's performance on unseen instances) than over the training sample. These are also known as out-of-sample forecasts. Models often overfit the training sample, so examining a model's accuracy in predicting observations that it has already seen, and used to determine the parameters, is not a true test of its performance. As this chapter is about model performance, we will generally assume that we are computing the measures of forecast accuracy over the test sample, which comprises observations not used in determining the parameter estimates or tuning the hyperparameters.

The MSE measures how close, on average, the predictions are to the actual data. The errors are squared to remove the sign from negative distances, which occur when the prediction overestimates the target value; squaring ensures that positive and negative forecast errors do not cancel out. The closer the predictions are to the target, the more accurate the model is. Therefore, among a set of different predictive models, the most accurate is the one with the lowest MSE in the test sample.
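As a minimal sketch of Equations (8.1) and (8.2), the Python snippet below computes the MSE and RMSE for a small set of hypothetical actual and predicted values; the numbers and the helper function names (mse, rmse) are illustrative and are not taken from the chapter.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, Equation (8.1): average of the squared forecast errors."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error, Equation (8.2): square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

# Hypothetical test-sample outcomes and model predictions (illustrative only)
y_true = [1.2, 0.8, 1.5, 0.4]
y_pred = [1.0, 0.9, 1.7, 0.1]

print(f"MSE:  {mse(y_true, y_pred):.4f}")
print(f"RMSE: {rmse(y_true, y_pred):.4f}")
```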
To understand the mechanics of the MSE computation, we reconsider the regression model estimated in Chapter 7 to predict the salary of a sample of notional bank employees based on their experience:

\hat{y}_i = 16.88 + 1.06\, x_i    (8.3)

We now evaluate the performance of this model using the test sample reported in Table 8.1. Note that this is a test sample of observations not used to estimate the model parameters. First, we obtain the differences between the observed values and the predictions, reported in the fifth column of the table. Then, we square those differences to obtain the values in the sixth column. An MSE equal to 43.1 is obtained as the sum of the observations in the sixth column divided by ten (the number of observations in the test sample). Sometimes the RMSE (in this case equal to √43.1 = 6.57) is reported instead of the MSE, as the former is on the same scale as the prediction.

Table 8.1: Illustration of the calculation of MSE for a test sample of bank employees

Observation, i   Experience, xi   Actual salary, yi   Predicted salary, ŷi   yi − ŷi   (yi − ŷi)²
 1                4                20.15               21.12                  −0.97       0.94
 2                6                21.48               23.24                  −1.76       3.10
 3                9                31.88               26.42                   5.46      29.81
 4               10                32.88               27.48                   5.40      29.16
 5               21                49.92               39.14                  10.78     116.21
 6               25                57.46               43.38                  14.08     198.25
 7                3                21.13               20.06                   1.07       1.14
 8                1                12.18               17.94                  −5.76      33.18
 9                2                19.47               19.00                   0.47       0.22
10                7                28.66               24.30                   4.36      19.01

Although it remains the most widely used measure of performance in practice (partly because of its connection with the least squares estimation technique, as discussed in Chapter 7), the MSE has some drawbacks. The most notable one is that, because it squares the distances between the actual data and the predictions, the MSE overweights large deviations from the observed values. This implies that, when the MSE is used to pick the best model, a model that is generally very accurate but displays a few occasional large deviations from the observed values might be less preferable (i.e., have a higher MSE) than a model that is in general less accurate but never largely under- or overestimates the true value. In other words, the MSE is highly sensitive to outliers in the test sample. To overcome this limitation, we could use the mean absolute error (MAE), which is like the MSE but, instead of squaring the distances between the predictions and the actual data, uses their absolute values:

\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|    (8.4)

The RMSE and MAE are both unnormalized measures. In the salary example above, this means that if we underestimate the salary by $2,000, this will contribute to the RMSE and MAE in the same way irrespective of the level of salary that we are predicting. In other words, predicting a salary of $148,000 when it is $150,000 is treated symmetrically to predicting a salary of $10,000 when it is $12,000. However, in some practical situations, the first error might be considered much less important than the second one, as it represents about 1% of the total salary, whereas in the second situation the error is more than 10% of the salary. A measure that considers the importance of errors relative to the scale of the variable being predicted is the mean absolute percentage error (MAPE). This can be computed as:

\text{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|    (8.5)

where the multiplication by 100 expresses the metric as a percentage of the actual value. Similarly, the MSE or RMSE can be expressed as a percentage of the actual value by dividing the squared forecast errors by the corresponding actual values.
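As a cross-check on the calculations described above, the following sketch recomputes the MSE, RMSE, MAE, and MAPE from the Table 8.1 test sample (salaries in $1,000s). It is illustrative code rather than code from the original source, but it should reproduce the reported values of roughly 43.1, 6.57, 5.01, and 16%.

```python
import numpy as np

# Test sample from Table 8.1: actual and predicted salaries (in $1,000s)
y_true = np.array([20.15, 21.48, 31.88, 32.88, 49.92, 57.46, 21.13, 12.18, 19.47, 28.66])
y_pred = np.array([21.12, 23.24, 26.42, 27.48, 39.14, 43.38, 20.06, 17.94, 19.00, 24.30])

errors = y_true - y_pred                       # fifth column of Table 8.1
mse = np.mean(errors ** 2)                     # Equation (8.1)
rmse = np.sqrt(mse)                            # Equation (8.2)
mae = np.mean(np.abs(errors))                  # Equation (8.4)
mape = 100 * np.mean(np.abs(errors / y_true))  # Equation (8.5)

print(f"MSE:  {mse:.1f}")    # approximately 43.1
print(f"RMSE: {rmse:.2f}")   # approximately 6.57
print(f"MAE:  {mae:.2f}")    # approximately 5.01
print(f"MAPE: {mape:.0f}%")  # approximately 16%
```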
Table 8.2 extends the calculations in Table 8.1 to the case of the MAE and MAPE.

Table 8.2: Illustration of the calculation of MAE and MAPE for a test sample of bank employees

Observation, i   Experience, xi   Actual salary, yi   Predicted salary, ŷi   yi − ŷi   |yi − ŷi|   (yi − ŷi)/yi   |yi − ŷi|/yi
 1                4                20.15               21.12                  −0.97      0.97        −0.05           0.05
 2                6                21.48               23.24                  −1.76      1.76        −0.08           0.08
 3                9                31.88               26.42                   5.46      5.46         0.17           0.17
 4               10                32.88               27.48                   5.40      5.40         0.16           0.16
 5               21                49.92               39.14                  10.78     10.78         0.22           0.22
 6               25                57.46               43.38                  14.08     14.08         0.25           0.25
 7                3                21.13               20.06                   1.07      1.07         0.05           0.05
 8                1                12.18               17.94                  −5.76      5.76        −0.47           0.47
 9                2                19.47               19.00                   0.47      0.47         0.02           0.02
10                7                28.66               24.30                   4.36      4.36         0.15           0.15

The MAE is computed by averaging the absolute differences reported in the sixth column. The calculations yield an MAE of 5.01. The MAPE is computed by averaging the values in the last column, which have been obtained by dividing the differences by the actual salary value and taking the absolute value of this ratio. The calculations yield a value of 0.16, which is multiplied by 100 to obtain a MAPE of 16%. Of all the forecast accuracy measures introduced so far in this chapter, the MAPE is the most intuitive. The figure of 16% can be interpreted as implying that the average forecast error is 16% of the actual salary.

When used to evaluate alternative models, the MSE, MAE, and MAPE can yield different model rankings. The choice of the most appropriate measure ultimately depends on the problem at hand. For instance, a risk manager is likely to evaluate the expected loss in units of dollars. By contrast, an index manager is interested in controlling the deviation of a portfolio's percentage return (either positive or negative) from the index it is tracking.

8.1.1 An Example of Continuous Variable Model Performance Comparison

The performance metrics discussed above can be used to compare alternative models. In this example, we demonstrate how they can be used to compare two different models employed to predict the house price per unit area: a simple linear regression (Model A) and a tree-based regression (Model B).¹ The sample consists of 414 observations, equally split between a training and a test sample. The features are house age, the distance from the closest metropolitan train station, the number of convenience stores in the area, and the longitude and latitude coordinates. The performance metrics, calculated by estimating each model on the training sample and using it to predict the output values in the test sample, are reported in Table 8.3. The best performing model according to each metric is denoted with an asterisk.

Table 8.3: Performance measures for two alternative models for house price prediction

           MSE      MAE     MAPE
Model A    94.85    6.56    18.25
Model B    77.35*   5.41*   14.45*

Notably, the non-linear, tree-based regression is more accurate at predicting the house price per unit area according to all performance measures in this case, because the values in the second row are all smaller than the corresponding ones in the first.

¹ Please refer to Section 4.1 for a discussion of this model.
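To make a comparison like the one in Table 8.3 concrete, the sketch below fits two candidate models on a common training sample and scores them on a test sample with scikit-learn. The LinearRegression and DecisionTreeRegressor estimators and the synthetic data are illustrative stand-ins for Model A and Model B, not the models or dataset used in the chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

# Synthetic stand-in for the house price data: 414 observations, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(414, 5))
y = 30 + X @ np.array([1.5, -2.0, 3.0, 0.5, -1.0]) + 5 * np.sin(X[:, 0]) + rng.normal(scale=2.0, size=414)

# Equal split between training and test samples, as in the chapter's example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

models = {"Model A (linear regression)": LinearRegression(),
          "Model B (tree-based regression)": DecisionTreeRegressor(max_depth=4, random_state=0)}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: "
          f"MSE={mean_squared_error(y_test, pred):.2f}, "
          f"MAE={mean_absolute_error(y_test, pred):.2f}, "
          f"MAPE={100 * mean_absolute_percentage_error(y_test, pred):.2f}%")
```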
8.2 Model Evaluation: Classification

Classification models generally yield two types of predictions: a continuous value, which could be a score or a probability, and a discrete value, which is the predicted class. The continuous value is typically transformed into a discrete one (the class to which an instance is assigned) using a threshold Z. For instance, suppose that we had to predict whether a firm will pay a dividend the following year (positive outcome) or not (negative outcome). We could use the logistic regression model introduced in Chapter 3 to obtain predicted probabilities of dividend payment, p̂, between 0 and 1. We then predict a positive outcome if p̂ ≥ Z and a negative outcome if p̂ < Z. Although an obvious and popular choice for the threshold Z is 0.5, it does not need to be, and a different value might be more appropriate, as discussed below.

Although metrics such as the MSE discussed above could be used to evaluate the continuous prediction (of the probability), we are often interested in evaluating the discrete prediction of the outcome derived from the probability. Similarly, it is often the case that continuous predictions are distilled into discrete values for use in a decision rule. For instance, a hedge fund analyst might have used a neural network model to make time-series forecasts of future returns but wishes to turn these into an automated trading rule. The approach would be the same as for predicted probabilities: establish a threshold and generate a binary dummy variable indicating whether the prediction is above or below the threshold. In this case, the threshold might be zero, so that a buy signal (1) is generated if the return is predicted to be positive and a sell signal (0) if the forecast is negative.

To evaluate discrete (class) predictions, different performance measures are necessary. When the outcome is a class, a common way to evaluate the model is through calculations based on a confusion matrix, which is a simple cross-tabulation of the observed and the predicted classes. The main diagonal (from top left to bottom right) elements of the matrix denote cases where the correct class has been predicted, and the off-diagonal elements illustrate all the possible cases of misclassification. It is easiest to illustrate how the confusion matrix is constructed with reference to the case in which the outcome is binary, where the confusion matrix is 2 × 2. For example, suppose that we constructed a model to calculate the probability that a firm will pay a dividend in the following year or not, based on a sample of 1,000 firms, of which 600 did pay and 400 did not. We could then set up a confusion matrix such as the following:

                                Prediction
                                Firm will pay dividend   Firm will not pay
Outcome   Pays dividend         432 (43.2%) - TP         168 (16.8%) - FN
          No dividend           121 (12.1%) - FP         279 (27.9%) - TN

The confusion matrix would have the same structure however many features were involved in the model, whatever the sample size, and whatever the model, so long as the outcome variable was binary. We identify the four elements of the table as follows:

1. True positive (TP): The model predicted a positive outcome, and it was indeed positive.
2. False negative (FN): The model predicted a negative outcome, but it was positive.
3. False positive (FP): The model predicted a positive outcome, but it was negative.
4. True negative (TN): The model predicted a negative outcome, and it was indeed negative.

Based on these four elements, it is possible to compute a simple performance metric, accuracy (with calculations using the dividend example numbers in the confusion matrix above, expressed as percentages):

\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} = 43.2\% + 27.9\% = 71.1\%    (8.6)

or, alternatively, using the raw counts:

\text{Accuracy} = \frac{432 + 279}{1{,}000} = 0.711    (8.7)

Accuracy, the sum of the diagonal elements of the matrix divided by the sum of all its elements, is straightforward to interpret, as it simply reflects the agreement between the predicted and observed classes or, equivalently, the percentage of all predictions that were correct. In this case, 71.1% of the dividend predictions were correct and 28.9% were incorrect. Similarly, the error rate is simply one minus accuracy, which is the proportion of instances that were misclassified.
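A minimal sketch of the accuracy and error-rate calculations from the dividend confusion matrix above; the cell counts are those given in the text, while the variable names are illustrative.

```python
# Cell counts from the dividend-prediction confusion matrix (1,000 firms)
TP, FN = 432, 168   # firms that paid: predicted to pay / predicted not to pay
FP, TN = 121, 279   # firms that did not pay: predicted to pay / predicted not to pay

total = TP + FN + FP + TN

accuracy = (TP + TN) / total   # Equations (8.6)/(8.7)
error_rate = 1 - accuracy      # proportion of instances misclassified

print(f"Accuracy:   {accuracy:.1%}")    # 71.1%
print(f"Error rate: {error_rate:.1%}")  # 28.9%
```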
Accuracy and error rate are very intuitive classification performance measures, but both metrics ignore the type of error that has been made. In other words, they do not distinguish between type I and type II errors, where the former occurs when an outcome is predicted to be true when it is false (a false positive), and the latter when an outcome is predicted to be false when it is true (a false negative). In practical situations, the cost of committing each type of error is typically not the same. For instance, for a bank extending credit, misclassifying a borrower as solvent (who then goes on to default, losing the bank a lot of money) is typically more costly than the opposite (losing out on the profit from having made an additional loan to a borrower who would not have defaulted).

Additionally, error rate and accuracy are problematic as performance measures when the classes are largely unbalanced. For instance, suppose again that we want to classify a pool of borrowers as solvent (positive outcome) or insolvent (negative outcome). If 98% of the borrowers were solvent, a model that classifies 100% of the borrowers as solvent would have only a 2% error rate; however, this model would be practically useless.

To overcome the limitations discussed above, two other metrics are available, precision and recall:

\text{Precision} = \frac{TP}{TP + FP}    (8.8)

\text{Recall} = \frac{TP}{TP + FN}    (8.9)

Precision is the number of correctly classified positives among all the instances that have been classified as positive. In other words, precision is an estimate of the probability that the model is right when labelling an outcome as positive. Recall (also known as sensitivity) is the true positive rate, that is, the number of correctly classified positives over the total number of positives. It is also possible to compute the true negative rate (also known as specificity), which is the proportion of negative outcomes that were correctly predicted as negative. Either precision or recall can be more useful depending on the context. For instance, a bank predicting whether a borrower is solvent (positive outcome) might be more interested in precision. Conversely, an analyst predicting whether a company will pay a dividend might be more interested in recall.

A further measure, known as the F1 score, combines precision and recall (technically, it takes their harmonic mean) into a single measure:

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (8.10)

The F1 score is bounded between 0 and 1, and it will be closer to 1 if the model has both high precision and high recall, but a low score can arise from poor precision, poor recall, or both. Like most other performance measures, the F1 score is sensitive to severe class imbalances.
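Continuing the dividend example, the following sketch applies Equations (8.8) through (8.10) to the confusion-matrix counts given earlier; the rounded outputs shown in the comments are implied by those counts rather than quoted from the text.

```python
# Cell counts from the dividend-prediction confusion matrix
TP, FN, FP, TN = 432, 168, 121, 279

precision = TP / (TP + FP)          # Equation (8.8)
recall = TP / (TP + FN)             # Equation (8.9), also the true positive rate (sensitivity)
specificity = TN / (TN + FP)        # true negative rate
f1 = 2 * precision * recall / (precision + recall)  # Equation (8.10), harmonic mean

print(f"Precision:   {precision:.3f}")    # about 0.781
print(f"Recall:      {recall:.3f}")       # 0.720
print(f"Specificity: {specificity:.3f}")  # about 0.698
print(f"F1 score:    {f1:.3f}")           # about 0.749
```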
More generally, there is a trade-off between the true and false positive rates when setting the decision threshold, Z. As the true positive rate increases, the false positive rate also increases. For instance, taking the example above, we can identify more dividend-paying firms at the cost of also misclassifying more non-dividend-paying firms as dividend-paying ones. In other words, we can increase recall at the cost of decreasing precision, and vice versa. A way to illustrate this trade-off is the receiver operating curve (ROC), which plots the true and false positive rates against different choices of the threshold Z, as illustrated in Figure 8.1.

Considering again the example of dividend-paying firms, if we had chosen Z = 0, we would have classified all the firms as dividend paying. This implies that all the positives are accurately classified but all the negatives are misclassified, so both the true and false positive rates are 100%. At the other extreme, if Z = 1, all the firms are classified as non-dividend paying; therefore, both the true and false positive rates are 0%. For values of Z between 0 and 1, the true positive rate increases as the false positive rate increases.

Figure 8.1: A sample receiver operating curve

The greater the area under the ROC curve (referred to as the area under the curve, or AUC), the better the predictions from the model. A completely accurate set of predictions gives an AUC of 1. An AUC of 0.5 corresponds to the dashed line in Figure 8.1 and indicates that the model has no predictive ability. An AUC of less than 0.5 indicates that the model performs worse than random guessing.
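The sketch below illustrates how an ROC curve and its AUC could be traced out by sweeping the threshold Z over a set of predicted probabilities; the labels, probabilities, and the roc_points helper are made up for illustration and are not taken from the chapter.

```python
import numpy as np

def roc_points(y_true, p_hat, thresholds):
    """True and false positive rates for each threshold Z (predict positive when p_hat >= Z)."""
    y_true, p_hat = np.asarray(y_true), np.asarray(p_hat)
    tpr, fpr = [], []
    for z in thresholds:
        pred = (p_hat >= z).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        tpr.append(tp / np.sum(y_true == 1))
        fpr.append(fp / np.sum(y_true == 0))
    return np.array(fpr), np.array(tpr)

# Hypothetical labels (1 = pays dividend) and predicted probabilities
y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
p_hat  = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]

thresholds = np.linspace(1.0, 0.0, 101)  # sweep Z from 1 down to 0
fpr, tpr = roc_points(y_true, p_hat, thresholds)

# Area under the traced curve via the trapezoid rule
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC: {auc:.3f}")  # about 0.84 for these made-up values
```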
The formulae for the performance metrics described above implicitly assumed only two possible outcomes (e.g., 0 or 1). Such cases are more intuitive and make for more straightforward examples. However, the formulae can be extended naturally to situations where there are several classes, such as when credit ratings are being predicted. For instance, in the multi-class case, the precision measure would be calculated by summing all the true positive classifications and dividing by the sum of all the true positive and all the false positive classifications. Likewise, the other performance evaluation metrics would be generalized in a similar fashion.

8.2.1 An Example of Classification Model Performance Comparison

Suppose that we had to build a model to classify loans in terms of whether they turn out to default or repay. Several different models can be used to this end, and we want to test whether a single hidden layer feedforward neural network (see Chapter 4) outperforms a simple logistic regression (see Chapter 3). The neural network contains ten units in the hidden layer and a logistic activation function. The confusion matrices for both models are given in Table 8.4. The first panel presents the matrix for the logistic regression estimated on the training sample; the second panel presents the matrix for the logistic regression predictions on the test sample; the third panel presents the matrix for the neural network estimated on the training sample; and the fourth panel presents the matrix for the neural network predictions on the test sample.

Interpreting or evaluating a neural network model is harder than for more conventional econometric models. It is possible to examine the fitted weights, looking for very strong or weak connections or for estimates that are offsetting (one large positive and another large negative), which would be indicative of overfitting. However, in the spirit of machine learning, the focus is on how useful the specification is for classification using a validation sample. Given that the same data and features have been employed for both the logistic regression and the neural network, the results from the models can be compared in Table 8.5. For simplicity, a threshold of 0.5 is employed, so that for any predicted probability of default greater than or equal to 0.5 the fitted value is a default, whereas if the probability is less than 0.5 the fitted value is no default.

Table 8.4: Confusion matrices for predicting defaults on personal loans

Logistic regression, training sample
                             Prediction
                             No default   Default
Outcome   No default         400           11
          Default             68           21

Logistic regression, test sample
                             Prediction
                             No default   Default
Outcome   No default         104           10
          Default             38           15

Neural network, training sample
                             Prediction
                             No default   Default
Outcome   No default         406            5
          Default             74           15

Neural network, test sample
                             Prediction
                             No default   Default
Outcome   No default          97           17
          Default             33           20

Table 8.5: Comparison of logistic regression and neural network performance for a sample of loans from the LendingClub

             Training sample (500 data points)      Test sample (167 data points)
Measure      Logistic regression   Neural network   Logistic regression   Neural network
Accuracy     0.842                 0.842            0.713                 0.701
Precision    0.656                 0.750            0.600                 0.541
Recall       0.236                 0.169            0.283                 0.377

The performance summary measures show that, as expected, the fit of the models is somewhat weaker on the test data than on the training data. This result could be interpreted as slight overfitting, and it might be worth removing some of the least empirically relevant features or applying regularization to the fitted models. The confusion matrices in Table 8.4 show that the classifications from the two models are more divergent than the summary measures suggest. The logistic regression predicts more defaults for the training sample, whereas the neural network predicts more defaults for the test sample. Hence the logistic regression has a higher true positive rate but a lower true negative rate for the training data, whereas the situation is the other way around for the test data. (Note that the "positive" event here is a default, because a default is the outcome both models are set up to predict.)

When comparing the logistic regression and neural network approaches, there is very little to choose between them. On the training sample, their accuracies are identical, and although the neural network performs better in terms of precision, its recall is weaker. When applied to the test sample, the logistic regression does better on accuracy and precision grounds, but worse on recall. Overall, these contradictory indicators illustrate the importance of fitting the evaluation metric to the problem at hand.
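As a final cross-check, the sketch below recomputes the Table 8.5 summary measures directly from the Table 8.4 confusion matrices, treating default as the positive class; the code itself is illustrative rather than taken from the source.

```python
# Confusion-matrix cells from Table 8.4, with default as the positive class:
# (TP, FN, FP, TN) = (defaults predicted as default, defaults predicted as no default,
#                     non-defaults predicted as default, non-defaults predicted as no default)
matrices = {
    ("Logistic regression", "training"): (21, 68, 11, 400),
    ("Logistic regression", "test"):     (15, 38, 10, 104),
    ("Neural network", "training"):      (15, 74, 5, 406),
    ("Neural network", "test"):          (20, 33, 17, 97),
}

for (model, sample), (tp, fn, fp, tn) in matrices.items():
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"{model}, {sample} sample: "
          f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

Running this reproduces the accuracy, precision, and recall figures reported in Table 8.5 for both models and both samples.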