Lecture 9: Machine learning

Maxim Massenkoff*
[email protected]
May 13, 2023

* This document cannot be circulated without the instructor's permission. These materials build on earlier versions by Marigee Bacolod.

Contents

1 Causal vs. Predictive Questions
2 Machine learning
  2.1 Overfitting
  2.2 Intuitive overfitting example
  2.3 Quantitative overfitting example
  2.4 Cross validation
3 Describing performance
  3.1 The confusion matrix, accuracy, and error
  3.2 Sensitivity and specificity
  3.3 The ROC curve
    3.3.1 Which cutoff to use?
    3.3.2 A menu of options
    3.3.3 Making ROC curves in R and Stata
    3.3.4 The AUC
    3.3.5 In the wild: Detecting breast cancer
4 In the wild: Predicting attrition
5 Code example: predicting attrition
  5.1 Worth it? Back-of-the-envelope calculations
  5.2 Optional: Making a train-test split in Stata
  5.3 Optional: Comparing ROC curves in Stata
6 Advanced: Logistic regression with lasso
  6.1 How logistic regression works
  6.2 Introduction to the lasso
  6.3 Choosing λ with cross-validation
7 Advanced: Integration with Python

1 Causal vs. Predictive Questions

Say we find that veterans who do yoga have better health outcomes. I don't have the data, but it's probably true! Veterans who do yoga have lower mortality, lower cancer incidence, and so on.

If we're curious about the causal effect of yoga, we want to know whether doing yoga actually improves health. Would these same veterans be less healthy if they hadn't done yoga? This could factor into a cost-benefit analysis of providing such instruction to veterans. A lot of the methods we discuss in this class attempt to isolate causal effects by trying to remove the influence of omitted variables. It's typically very hard if you do not have a randomized trial.

If, on the other hand, we're concerned only with prediction, we don't really care whether the relationship is causal or merely a correlation. Yoga may just be a tag, correlated with other healthy behaviors we cannot see, and that's fine. For example, we may be trying to predict the future health expenditures of the Veterans Health Administration. In this case, variables are useful if they make our predictions more accurate. With this goal in mind, a yoga variable could enhance our model even if yoga has no causal effect on health. Machine learning optimizes this process.
2 Machine learning

Machine learning is interested primarily in predictive problems: using some variables X_1, ..., X_k to predict some output y. Let's start with some key terms.

Typically we have called our right-hand side variables (X_1, ..., X_k) controls, predictors, or covariates. In machine learning, they are usually called features. The outcome we are trying to predict, y, is sometimes called the label.

The training set is a sample of observations where we know both the features and the outcome. The test set consists of new examples that were not used for fitting the model. In other words, these are observations that were left out when we estimated our coefficients. We gauge our performance based on the accuracy of our predictions on the test set. The practice of splitting our data in this way, which we would not use for a causal study, is important.

Train/Test
• Training set: used only for training prediction functions. (In linear regression, we could call this estimating our regression coefficients.)
• Test set: used only for assessing performance, i.e., seeing how closely the predicted values match the true values.

In causal work, there is no need to use a test set because our focus is on the coefficients (the income gain from having a college degree, for example). In machine learning our implicit goal is to make accurate predictions on new data. The best way to judge ourselves on this goal is to "hide" some data from our modeling process so that we get a clear estimate of how the model would perform on cases with unknown outcomes. We do this to avoid overfitting.

2.1 Overfitting

Overfitting
Overfitting is when we find chance occurrences in our data that do not generalize to unseen data, although it's important not to overcorrect for this either. (This is also discussed as part of the bias-variance tradeoff.)

Figure 1 shows the issues visually. We want to be in the Goldilocks zone in the middle: not overly simple but not overly complex. The underfitted model uses just one linear term and fails to capture the obvious U-shape in the data. The overfitted model uses many predictor variables and matches every zig and zag in the data. It's unlikely that unseen cases would follow the exact pattern suggested by the overfitted line. The fit in the middle strikes a balance between the two.

When we test our model on the same data used to estimate the model, we could trivially get high accuracy metrics. The overfitted model in Figure 1 would clearly have the highest accuracy within that same dataset, but not on new observations, and that's a problem. We should judge our model using unseen cases; otherwise we may be fooling ourselves.

[Figure 1: Overfitting]

2.2 Intuitive overfitting example

This XKCD comic gives a nice real-world example of overfitting. Political junkies will be familiar with claims like "a Catholic will never be elected president" or "No Republican can win without Vermont." At some point, these rules of thumb accurately fit the past history of presidential elections, but they made for poor predictions.

2.3 Quantitative overfitting example

Let's see overfitting in action. Recall that in Lecture 1, we showed that we can attain an R-squared of 1 if we keep adding random variables to our model. The problem is that there will always be correlations due to random chance, but these will not extrapolate to new data.
The same lesson applies to machine learning: as long as you're always using the same data, you can always maximize the accuracy of your model by adding a bunch of nonsense variables. The model won't work on any new cases, however. So having a train/test split is essential.

    set seed 415
    sysuse auto, clear
    gen test_data = runiform() < 0.5
    * Generate 34 entirely random variables
    forvalues i=1/34 {
        gen random`i' = runiform()
    }
    regress foreign random* if test_data==0
    predict foreign_hat

The preceding code makes a bunch of random variables and uses them to predict the binary outcome foreign in the training data. We know that all correlations are just due to random chance because the variables are entirely random. However, regression will still combine these random variables in a way that minimizes the squared error, and we attain 100% accuracy in the training data.

The code below first takes the fitted probability and makes it into a dummy, equal to 1 if the predicted probability exceeds 0.5, otherwise 0. Then, to check if the model was right, we ask how often the 0/1 prediction matches the true 0/1 value of foreign. The accuracy is the percent of times these two dummies are equal to each other. The model in the training data is 100% accurate:

    gen predict_foreign = foreign_hat > 0.5     // We predict 1 if the fitted prob. > 0.5
    gen correct = foreign == predict_foreign    // Correct if predicted=actual
    sum correct if test_data == 0               // See fraction correctly predicted in training data

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         correct |         41           1           0          1          1

The issue is that our model is not picking up on any durable relationships between foreign and the predictor variables, just random noise in the training set. When we score our predictions on the test set, our accuracy tanks. About 1/3 of observations in the test set have foreign=1, so we would have attained 66.7% accuracy by just predicting 0 for all cases. This is actually better than we got with our random noise model, so our predictions were worse than useless:

    sum correct if test_data == 1

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         correct |         33    .5757576    .5018904          0          1

R code here. For more on this, see Section 2.2.1, Measuring the Quality of Fit, in James et al. (2013).

2.4 Cross validation

Models often use cross-validation in order to avoid being tricked by overfitting. In cross-validation, we divide up our labeled data (i.e., data where we know the answers) into k partitions called folds. Usually we use k = 5 or 10 folds. A visual of this process for the case of k = 5 is given in Figure 2 (from slide 40 here).

[Figure 2: Cross validation]

At the first step, we arbitrarily divide up the data into five folds. Next, we train five different models. The first model only has access to the data in folds 2-5. Then its predictions are tested only on fold 1, the data that was not used to build the model. A similar process is used to build and test the remaining four models. To get an estimate of our accuracy, we would average the calculated accuracies from the five folds:

    Accuracy = (Accuracy_1 + ... + Accuracy_5) / 5

Using this average is much better than considering either accuracy (i) within the training data or (ii) in just one holdout sample. Cross-validation is a core tool of machine learning, improving the accuracy of our predictions on new data.
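To make the mechanics concrete, here is a minimal sketch of 5-fold cross-validation for a simple logit in Stata. It reuses the auto dataset from Section 2.3 rather than anything fancier, and the variable names (fold, cv_accuracy) and the three predictors are purely illustrative.

    * Minimal sketch: 5-fold cross-validated accuracy for a logit (illustrative only)
    set seed 415
    sysuse auto, clear
    gen fold = ceil(5*runiform())               // randomly assign each observation to folds 1-5
    gen cv_accuracy = .
    forvalues k = 1/5 {
        quietly logit foreign price mpg weight if fold != `k'    // train on the other four folds
        quietly predict p`k' if fold == `k'                      // predict only on the held-out fold
        quietly gen correct`k' = (p`k' > 0.5) == foreign if fold == `k'
        quietly sum correct`k'
        quietly replace cv_accuracy = r(mean) in `k'             // store this fold's accuracy
    }
    sum cv_accuracy in 1/5     // the mean of the five fold accuracies is the cross-validated estimate

Averaging the five fold-level accuracies, as in the formula above, gives a single number we can use to compare candidate models.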
When we leave out 20% of the data while fitting our model, that holdout data remains in some sense "new." Our accuracy on holdout samples gives us the best sense of how our algorithm would perform if we were confronted with an actually new set of cases that needed predictions. The lasso logit, described below, chooses coefficients in part by gauging accuracy in the hold-out sample. This is in contrast to a regular logit, which would only try to maximize the likelihood with the data used to estimate the model.

3 Describing performance

3.1 The confusion matrix, accuracy, and error

How do we assess and communicate the quality of our predictions? For a binary classification problem, i.e., predicting whether some value is 0 or 1, we can summarize the results using a confusion matrix, which is displayed in Table 1. We refer to the values as Class 0 and Class 1, meant to stand for the two possible values of a binary outcome. For instance, in cancer diagnosis, Class 0 could be benign and Class 1 could be malignant.

                         Prediction
                     Class 0    Class 1
    Truth  Class 0     TN         FP
           Class 1     FN         TP

    Table 1: Confusion matrix

In the table:
• True Negatives (TN) are cases from Class 0 that the model predicted correctly as Class 0
• False Negatives (FN) are cases from Class 1 that the model predicted incorrectly as Class 0
• False Positives (FP) are cases from Class 0 that the model predicted incorrectly as Class 1
• True Positives (TP) are cases from Class 1 that the model predicted correctly as Class 1

The language around true/false negatives/positives should be familiar from hypothesis testing, where we can represent conclusions and true states of the world in a similar table. For more intuition, say our outcome is whether someone passes an exam. A true positive would be someone who is predicted by the model to pass their exam and does. A false negative is someone who is predicted by the model to fail but actually passes. And so on. With this framework in mind, we can discuss some key accuracy metrics.

Accuracy and error
Accuracy is the fraction of correct predictions:

    Accuracy = (TP + TN) / (TP + TN + FP + FN) = Correct predictions / Total predictions

The error is the fraction of incorrect predictions:

    Error = (FP + FN) / (TP + TN + FP + FN) = Incorrect predictions / Total predictions

3.2 Sensitivity and specificity

The error measure lumps together the two kinds of mistakes, false positives and false negatives. But it often makes sense to have a preference between these two. In cancer diagnosis, a false negative can be very costly. We failed to identify someone's cancer, giving it time to spread. In contrast, a false negative in an email client's spam filter is less costly: it's just one extra email in someone's inbox. The next terms refer to the different kinds of errors (unfortunately there are several synonyms).

Sensitivity and specificity
• Sensitivity or true positive rate or recall: the share of positive cases that are identified as positive.

    Sensitivity = TP / (TP + FN) = True Positives / Total Positives

  E.g., what is the share of attriters who are predicted to attrite by our model? Or, of the truly cancerous images in our database, what percent are classified as positive?

• Specificity or true negative rate: the share of negative cases that are identified as negative.

    Specificity = TN / (TN + FP) = True Negatives / Total Negatives

  E.g., what is the share of non-attriters that are predicted correctly by our model to not attrite? Or, of the truly benign (non-cancerous) images in our database, what percent are classified as negative?
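For a concrete check on these definitions, the sketch below computes the confusion matrix and all three metrics by hand for the random-noise model from Section 2.3 (predict_foreign versus the true foreign, test data only); the local macro names are just illustrative. After a logit, Stata's estat classification reports the same table and rates directly.

    * Sketch: confusion matrix, accuracy, sensitivity, and specificity by hand
    tab foreign predict_foreign if test_data==1        // the 2x2 confusion matrix
    quietly count if foreign==1 & predict_foreign==1 & test_data==1
    local TP = r(N)
    quietly count if foreign==0 & predict_foreign==0 & test_data==1
    local TN = r(N)
    quietly count if foreign==0 & predict_foreign==1 & test_data==1
    local FP = r(N)
    quietly count if foreign==1 & predict_foreign==0 & test_data==1
    local FN = r(N)
    display "Accuracy    = " (`TP' + `TN') / (`TP' + `TN' + `FP' + `FN')
    display "Sensitivity = " `TP' / (`TP' + `FN')
    display "Specificity = " `TN' / (`TN' + `FP')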
Note that we could always maximize one of these metrics. If we always guess positive, the sensitivity would be 1. All positive cases would be identified because we only ever guessed positive. This of course is not a useful algorithm. It would not be worth it to submit every possible cancer patient to chemotherapy. The task is to strike a balance between true positives and true negatives.

3.3 The ROC curve

3.3.1 Which cutoff to use?

We can adjust the sensitivity and specificity by tweaking our cutoff points (also called a discrimination threshold). There is a tradeoff: we can increase one quality metric at the cost of decreasing the other.

Say we have some machine learning algorithm that outputs a set of predictions p̂ which give the probability that the outcome is 1. For practical purposes, our probability needs to be made into a binary 0/1 variable. We often assume that if p̂ > 0.5, our prediction is 1, and 0 otherwise. But we could use other cutoffs too. For instance, with cancer, a false negative is arguably more costly than a false positive. We could make our test more sensitive by making the cutoff 0.3 instead of 0.5. Now a patient with p̂ > 0.3 will be labeled as positive. This could be the right way to take into account the costs of a missed diagnosis. With this cutoff, we will have lower specificity because more truly negative cases will be labeled as positive, i.e., more healthy people labeled as having cancer.
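As a small sketch of this tradeoff, Stata's estat classification can report the sensitivity and specificity implied by any cutoff after a logit. The example below reuses the loanapp logit that appears in Section 3.3.3; the 0.3 cutoff mirrors the more-sensitive screening discussed above.

    * Sketch: sensitivity and specificity at two different cutoffs after a logit
    set seed 415
    bcuse loanapp, clear
    gen test = runiform() < 0.5
    logit approve male married dep if test==0
    estat classification if test==1, cutoff(0.3)   // more sensitive, less specific
    estat classification if test==1, cutoff(0.5)   // the usual default cutoff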
3.3.2 A menu of options

We're necessarily going to have some mix of false positive and false negative errors. We can actually calculate the sensitivity and specificity for all potential cutoff points, ranging from 0 to 1. This allows us to see the whole menu of error mixes available to us.

The ROC curve shows the results of this exercise, and it's the most common way that machine learning researchers will summarize their models. It plots the sensitivity (true positive rate) against the false positive rate (1 − specificity) for the range of cutoff points. (ROC stands for Receiver Operating Characteristic and allegedly comes from radar operators in the military.) A diagram is given in Figure 3.

[Figure 3: The ROC curve]

The ROC curves are shown in green, orange, and blue. If you follow any of the lines from the bottom left corner, you'll see that as the false positive rate (or 1 − specificity) increases, so does the true positive rate. Why? As we move along the x-axis, we are casting a wider net with our model. More false positives (x-axis) are getting ensnared, but we're also catching more of the true positives (y-axis).

Notice that the lines always hit the points (0,0) and (1,1). At the lowest cutoff, we predict positive every time. Our false positive rate is 1 (we could not have done worse!) because all negative cases are predicted to be positive. Our true positive rate is also 1 (all positive cases are predicted to be positive), giving us (1,1) in that plane. This is equivalent to saying that everyone has cancer! No false negatives but a whole lot of false positives. Similarly, if our cutoff is higher than any p̂, we always predict negative. We get no false positives but we also don't catch any of the true positives, putting us at (0,0). This is like saying no one has cancer. Obviously, these extreme cases are not useful at all.

The red line shows what a random classifier would get you. Imagine a bunch of randomly generated p̂'s, completely unrelated to the data. We could still make a confusion matrix using those p̂'s while varying the cutoff points. Since p̂ is unrelated to the data, the true positive rate and the false positive rate will be roughly equal at every cutoff. That's why the random classifier line occurs where the two measures are equal.

As our predictions get better than the imaginary random guess, the ROC curve arches up and away from the red line. For instance, say we were dealing with cancer and willing to accept a high false positive rate of 0.8. I mark this with the vertical pink line. Then the green model would give us a true positive rate of about 0.82. In other words, if we allow 80% of the truly negative cases to be flagged as positive, we will catch 82% of actual cancer cases. The blue line offers a better deal: with the same false positive rate of 80%, it catches closer to 90% of the true positive cases. For any amount of false positives that we're willing to accept, the model represented by the blue line will catch a higher share of true positives. So the blue line is the best model in this picture.

3.3.3 Making ROC curves in R and Stata

The code in these examples runs two simple logit models, where the 2nd model does a better job at predicting the outcome. We predict loan approval in the dataset loanapp (from the bcuse/wooldridge package). We can see from the plot that the 2nd model is strictly better for any level of specificity.

    * Stata code
    set seed 415
    bcuse loanapp, clear
    gen test = runiform()<0.5
    logit approve male married dep if test==0
    predict p1
    logit approve male married dep i.race price liq pubrec if test==0
    predict p2
    roccomp approve p1 p2 if test==1, graph xline(.66) ///
        graphregion(color(white)) ///
        plot1opts(msize(vtiny) lp(-)) ///
        plot2opts(lw(thin) msize(vtiny)) ///
        ytitle(True positive rate) ///
        xtitle(False positive rate) ///
        legend(ring(0) rows(3) position(11))

R code here for estimating two different logit models in this data and comparing the ROC curves. It makes the plot below.

3.3.4 The AUC

As we have been discussing, the better models have ROC curves that arch up and to the left. How can we summarize this visual take-away? We use the area under the ROC curve, abbreviated as AUC. It is possibly the most widely used performance metric in machine learning. It is literally the area under the ROC curve: imagine shading the area under the lines in Figure 3 or the two coding examples above and calculating that area. Remember that each point in the ROC curve is from trying out a different cutoff. So the AUC gives us an aggregate measure of performance across all possible thresholds (source).

As the models get better, the shaded area increases, so a higher AUC is better. A perfect predictor would run along the left and top border of the plot, and the area below it would be 1 × 1 = 1.0. A random classifier (i.e., the worst possible model) achieves an AUC of 0.5. In the Stata figure above, the legend tells us that the first model was just barely better than random with an AUC of 0.53. The 2nd model scored a more respectable 0.70.

How do we put this into plain English? The AUC has several equivalent interpretations. One way to interpret the numerical value is as the probability that a randomly chosen positive case has a higher prediction than a randomly chosen negative case. If the guesses are random, then a positive case is equally likely to be ranked above or below a negative case: the AUC is 0.50. If the guesses are very good, then most of the positive cases will be given higher probabilities than the negative cases. The AUC would be above 0.50. In the case of a perfect predictor, it's guaranteed that the positive cases were ranked higher than the negative cases and the AUC = 1.0.
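If you want the AUC for a single model rather than the comparison that roccomp produces, a minimal sketch using the fitted values from the Stata code in Section 3.3.3 is:

    * Sketch: the AUC (ROC area) for one model at a time, on the test data
    roctab approve p2 if test==1, graph    // reports the ROC area and plots the curve; should match the roccomp legend for p2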
3.3.5 In the wild: Detecting breast cancer

McKinney et al. (2020) built an AI system for detecting breast cancer in mammogram images. Their main results are conveyed in Figure 4. The ROC for the AI system is the purple line. In the caption they write: "The ROC curve of the AI system on the UK screening data. The AUC is 0.889 (95% CI 0.871, 0.907; n = 25,856 patients)."

[Figure 4: ROC curve from McKinney et al.]

How does this compare to human radiologists? The inset box zooms in on the area in the grey square so the reader can better see differences between the AI system and radiologists. (By the way, you might be wondering how the authors established "ground truth" in this setting. By definition, a malignant cancer will eventually become obvious, so they tracked patients long enough to observe their outcomes. Positives were always confirmed with a tissue biopsy, while negatives lacked any cancer diagnosis within the multi-year follow-up period.)

Radiologists interpret each case as positive or negative, 1 or 0. This contrasts with the AI system, which gives a probability of malignancy. Thus, the radiologist data cannot generate sensitivity and specificity for multiple cutoff points; the authors just have a single count of true positives and false negatives. This is why the radiologists' performance is represented by single dots instead of a curve.

The first reader is the green dot. (This is not a particular person but rather the first radiologist that happened to judge an image.) Based on the values on the x- and y-axes, they achieve a 7% false positive rate (x-axis) but catch only about 63% of the true breast cancers (sensitivity, the y-axis). The inset shows that the purple line is always above and to the left of the green dot: the algorithm clearly outperforms the first reader. With the same share of false positives (7%), it would catch 66% of the cancers rather than 63%. But it does not surpass the second reader (who has access to the first reader's opinion, so their diagnosis is generally better) or the consensus, a prediction that combines decisions by the two readers.

This research, which was published in the journal Nature, is on the frontier of computer vision and AI-assisted diagnosis. For more, see this New York Times story. State-of-the-art AI models will typically be evaluated in this way.

4 In the wild: Predicting attrition

Marrone (2020) studies attrition in different branches of the military. His study actually never calculates the machine learning performance metrics on hold-out samples, but this is forgivable because it's clear that the performance of his probit model was quite bad even within the data used to train it. I reproduce his results in Figure 5. This is another useful visual, showing what the fitted probabilities were for people who did and did not attrite, although these probably should have been the test cases from a standard train-test split.

[Figure 5: Bad predictions: The RAND report on attrition]

The clear problem is that these densities are lumped on top of each other. The predicted probabilities do a poor job of distinguishing the two groups, and whether we set the cutoff at 0.10 or 0.25, we would catch a bunch of non-attriters in our positive classifications.
These fitted probabilities would have too many false positives to be useful. It's worth noting that this analysis did not use more modern methods described above like the lasso and k-fold validation. These techniques, meant to attain predictive accuracy, may have had an edge, although it's hard to say for certain.

5 Code example: predicting attrition

Let's evaluate a similar model used to predict attrition in the army dataset. I used Stata's lasso logit command (glmnet in R) using all plausible features to predict attrition. In Figure 6 I make similar histograms as in Figure 5 using the predicted probabilities, except using only the held-out test sample, containing only observations that were not used to train the model, in contrast to Marrone (2020). The code to make this figure is below. Note that the model is estimated only using the training data (where test==0) and evaluated on the test data (where test==1).

    gen test = runiform() > 0.50
    lasso logit early_attrite afqt accession_date female hs ///
        nohi married year month day ed_level if test==0
    predict phat
    twoway (hist phat if early_attrite==1 & test==1, color(red%30) freq) ///
        (hist phat if early_attrite==0 & test==1, color(green%20) freq), ///
        legend(order(1 "Attrited Early" 2 "Did not attrite early")) ///
        xline(0.4) xline(0.6) graphregion(color(white))

R code here

[Figure 6: Predicting attrition in the Army. (a) Stata (b) R]

In contrast to Figure 5 from RAND, the model achieves more differentiation between the soldiers who did and did not attrite early. This is clear whether you look at the Stata or R plot in Figure 6 (the procedures generating them were slightly different). The vertical lines show two potential cutoff points, 0.4 and 0.6. No matter what the cutoff point is, we see that a mass of early attriters (the red bars) is to the left of the red vertical line. These would be false negatives; they have escaped detection. Similarly, some of the green bars lie above the cutoffs; these are false positives, enlistees that might be needlessly screened out. There is no perfect classifier here.

Still, there's a decision to make. The two potential cutoffs strike a different balance between the sources of error. In the Stata results, with a cutoff of 0.4, we have 7% false positives and catch 58% of the attriters. With a cutoff of 0.6, only 1% of the non-attriters are mis-labeled as attriters but we only catch 20% of attriters. In other words, the red bars below the vertical line at 0.6 constitute 80% of attriters. To figure out which cutoff would be best, and whether the exercise is useful at all, we need to have some rough numbers about how costly the two different kinds of mistakes are.
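As a rough sketch of where numbers like these come from (the variable names follow the Stata code above), the cutoff-specific rates can be tabulated directly from the test sample:

    * Sketch: true and false positive rates at the two candidate cutoffs
    foreach c in 0.4 0.6 {
        quietly gen byte flag = phat > `c' if test==1 & phat < .
        quietly sum flag if early_attrite==1 & test==1
        display "Cutoff `c': share of attriters flagged (TPR)     = " %5.3f r(mean)
        quietly sum flag if early_attrite==0 & test==1
        display "Cutoff `c': share of non-attriters flagged (FPR) = " %5.3f r(mean)
        drop flag
    }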
Let’s say that recent formal cost-benefit analyses estimate that, all things considered, the cost of training an enlistee who will soon drop out is $30,000, while the typical enlistee who makes it past boot camp provides $150,000 in value. To work with a round number, let’s say that 500 enlistees will come our way during the period in question. Based on past data, the early attrition rate is about 20%. So out of the of 500 enlistees, about 100 are expected to separate early. Based on the results above, the 0.6 classifier would screen out 20% or 20 of them, saving the Army 20 ∗ $30, 000 = $600, 000. But it would also screen out some good apples. One percent of the 400 non-separators would be targeted by the model, so 4 enlistees. Rejecting them would lose the Army 4 ∗ $150, 000 = $600, 000. In other words: Benefit from screening out early attriters: 20 ∗ $30, 000 = $600, 000 Cost from screening out non-attriters: 4 ∗ $150, 000 = $600, 000 We have a tie. And if we were actually adopting this machine learning algorithm, we would probably want it to be justified by a wide margin since we should assume model will do a little worse on the truly new data (i.e., new recruits who would not have been in any of our data). (Note that this only takes two quality metrics: the true positive rate and the false positive rate. Why? Those are the cases that we’re acting on. Anyone labeled as negative will not get any differential treatment by the system because we’re assuming that no screening algorithm is in place.) 12 5.2 Optional: Making a train-test split in Stata Here’s a simple example on making a train-test split. I find it useful to use the runiform() command, which asks Stata to draw a random number between 0 and 1 from the uniform interval. If we want to randomly assign 20 percent of the auto observations to be test data, we can write the following: sysuse auto, clear gen test_data = runiform() < 0.2 sum test_data Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------test_data | 74 .2027027 .404757 0 1 The sum command shows that this worked. test data is equal to 1 about 20 percent of the time. Next, we want to fit a regression but leave out the test data. In a sense, we are hiding the test data from the model. The if command accomplishes this for us. regress foreign price mpg if test_data == 0 Source | SS df MS Number of obs = 59 ... -----------------------------------------------------------------------------foreign | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------price | .0000494 .0000217 2.27 0.027 5.87e-06 .000093 mpg | .0385008 .0115076 3.35 0.001 .0154484 .0615533 _cons | -.8334385 .3321088 -2.51 0.015 -1.498732 -.1681447 -----------------------------------------------------------------------------As proof that the if restriction worked, note that the number of observations in the regression is 59. We ignored observations with test data equal to 1. 5.3 Optional: Comparing ROC curves in Stata The standard way to compare predictive models is by looking at their ROC curves and calculating the area under the curve (AUC) for each—all on the test set, not the training set! The code below gives a self-contained example that visualizes the ROC curves and reports their confidence intervals. The first two lines load the data and then create a binary dummy variable, test, that is equal to 1 with 50% probability. Next, we store the list of features in three different globals. 
5.3 Optional: Comparing ROC curves in Stata

The standard way to compare predictive models is by looking at their ROC curves and calculating the area under the curve (AUC) for each, all on the test set, not the training set! The code below gives a self-contained example that visualizes the ROC curves and reports their confidence intervals. The first two lines load the data and then create a binary dummy variable, test, that is equal to 1 with 50% probability. Next, we store the list of features in three different globals. The number of features increases at every step. Next, a for loop runs a logit model for each of the three feature sets, restricting to the training data with the if condition. Finally, the roccomp command is used to first graph the ROC curves (with the graph option) and then calculate the confidence intervals for the models. Link to code here

    bcuse loanapp, clear
    gen test = runiform() < .50
    global features1 emp dep married
    global features2 emp dep married appinc loanamt unit
    global features3 emp dep married appinc loanamt unit pubrec multi self liq lines mortg
    forvalues i=1/3 {
        logit approve ${features`i'} if test==0
        predict phat`i'
        estat classification
    }
    roccomp approve phat1 phat2 phat3 if test==1, ///
        graph plot1opts(msize(tiny) lw(vthin)) ///
        plot2opts(msize(tiny) lw(vthin)) ///
        plot3opts(msize(tiny) lw(vthin)) ///
        graphregion(color(white)) ///
        xtitle(False positive rate) ytitle(True positive rate)
    roccomp approve phat1 phat2 phat3 if test==1

                                  ROC                    Asymptotic normal
                      Obs         area     Std. err.   [95% conf. interval]
    -------------------------------------------------------------------------
    phat1             965       0.5606       0.0288      0.50413    0.61711
    phat2             965       0.5923       0.0308      0.53190    0.65263
    phat3             965       0.6996       0.0279      0.64492    0.75435
    -------------------------------------------------------------------------

6 Advanced: Logistic regression with lasso

6.1 How logistic regression works

Recall that logistic regressions model a binary outcome y_i as follows:

    prob(y_i = 1) = 1 / (1 + e^{-(b_0 + b_1 X_{1i} + ... + b_k X_{ki})})        (1)

In order to come up with estimates of the coefficients b_0, ..., b_k, Stata essentially "tries out" different sets of coefficients to see which values would make the realized outcomes most likely. This is called maximum likelihood estimation.

Here's an example. If our model with some set of b̂'s predicts a 0.9 probability of cancer for those with cancer and a 0.2 probability for those without cancer, then our likelihood in a sample with 2 positive and 2 negative cases would be:

    L = 0.9 × 0.9 × (1 − 0.2) × (1 − 0.2) = 0.5184

where the first two factors come from the 2 positive cases and the last two from the 2 negative cases. This gives the probability that the string of diagnoses (1, 1, 0, 0) would happen, according to the model's predictions. These predictions are not perfect. The theoretically best predicted probabilities would be 1, 1, 0, 0, matching the true diagnoses perfectly. The likelihood would then be

    L = 1 × 1 × (1 − 0) × (1 − 0) = 1

Of course, our predictions are never perfect. Say we tweak the coefficients and now have a model that gives a probability of 0.95 to the first cancer patient and 0.1 to the 2nd non-cancer patient, but otherwise makes the same predictions. Then our likelihood would be:

    L = 0.95 × 0.9 × (1 − 0.2) × (1 − 0.1) = 0.6156

This new model has an increased likelihood, which means our predictions are a better fit for the data. Stata and other software packages proceed this way to estimate the coefficients b_k in the logit, trying out potential values of the coefficients until it's not possible to increase the likelihood L any further.

6.2 Introduction to the lasso

In a lasso, we penalize our model for having coefficients that are too numerous or large. Instead of simply maximizing the likelihood L, Stata maximizes L minus a penalization term that depends on the values of the coefficients:

    L − λ Σ_k |b_k|

where λ is always positive.
So if λ is equal to 0.005 and we are considering some coefficients b̂_1 = 2 and b̂_2 = −7, our equation above would be

    L − 0.005 × (2 + 7)

If Stata can get the exact same likelihood L by setting b̂_2 = −3, it will do so because this will give it a higher score:

    L − 0.005 × (2 + 3)

Since L did not change between those two sets of coefficients, the algorithm will prefer b̂_2 = −3.

The innovation in lasso regression is to add the λ term to the numeric score that our models attempt to maximize. Now our methods will strike a balance between two goals: maintaining predictive accuracy in our sample (the likelihood L) and not having excessively large coefficients (the penalty term λ Σ_k |b_k|).

While we focus on the logit here, note that there is also lasso regression, where, instead of minimizing only the squared residuals, the model works to minimize the squared residuals plus the penalty term:

    Σ_i (y_i − ŷ_i)² + λ Σ_k |b_k|

If we set λ = 0, we would get regular old ordinary least squares, which is what we get from our regress command. Instead, and just like with the logit above, we feed the algorithm an additional incentive to keep a lid on the coefficients.

Why do we "penalize" our model for having large coefficients? It protects against overfitting. Models with many large b_k's will be more like the third panel in Figure 1: zigging and zagging to match the idiosyncrasies unique to the training data, and therefore more likely to strike out when presented with the new cases in the test data.

6.3 Choosing λ with cross-validation

One problem remains, which is that we do not know what value of λ is best. The reason why Stata's lasso logit command takes so long is that it's trying out different values of λ and then using cross validation to assess how the model performs on hold-out samples. To be more specific, the algorithm proceeds as follows, illustrated for the case of 10-fold validation:

(1) Start with some potential value of λ.
(2) Estimate a logistic regression using 90% of the sample.
(3) Calculate an accuracy metric on the remaining 10%.
(4) Repeat steps (2) and (3) 9 times with the 9 remaining folds and take the average of the accuracy metric across the ten folds.
(5) Now the current value of λ has an associated accuracy level; return to (1) using a different λ to see if it yields better accuracy.

Stata proceeds in this way until it has found the λ that maximizes the cross-validation accuracy. Thus the final output of the lasso logit command is always a set of coefficients b_k and the λ that was used to arrive at these coefficients (and which maximizes out-of-sample accuracy).
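To see these pieces working together, here is a minimal sketch of a cross-validated lasso logit on the loanapp data used earlier (the attrition data from Section 5 are not included with these notes); it assumes Stata 16 or newer, where the lasso suite and its postestimation commands are built in.

    * Sketch: lasso logit with 10-fold cross-validation, plus postestimation checks
    set seed 415
    bcuse loanapp, clear
    lasso logit approve emp dep married appinc loanamt unit pubrec self liq, selection(cv, folds(10))
    lassoknots    // the candidate lambdas and the cross-validation function at each one
    lassocoef     // which features keep nonzero coefficients at the CV-chosen lambda
    cvplot        // plot of the cross-validation function against lambda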
7 Advanced: Integration with Python

Stata 17 allows you to switch to the Python programming language within your dofile or command window. To access Python, you simply type "python." To check if it's working, you can try these simple lines in Stata:

    python
    print("Hello, this is python code being run")

If it doesn't work, you may have to install Python or switch to a newer version of Stata. The addition of this Python feature makes it easy to run more advanced machine learning models from within Stata. The example here gives a brief example where the RandomForestClassifier model from Python's scikit-learn library is used to predict approval decisions from the loanapp dataset. Here is a more developed example which pits Stata's lasso logit against Python's random forests.

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, Vol. 112, Springer, 2013.

Marrone, James V., "Predicting 36-Month Attrition in the US Military," Technical Report, RAND National Defense Research Institute, Santa Monica, CA, 2020.

McKinney, Scott Mayer, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian, Trevor Back, Mary Chesus, Greg S. Corrado, Ara Darzi et al., "International evaluation of an AI system for breast cancer screening," Nature, 2020, 577 (7788), 89-94.