Chapter 7: Classification Trees and Forests

Contents
7.0 Introduction
7.1 The Tree Algorithm
7.2 Tree Models Tutorial
7.3 Random Forests

7.0 Introduction

Jargon Alert. Classification trees are one type of "Decision Tree". The other type is a regression tree. The difference is simply that a classification tree is for categorical target variables and a regression tree is for continuous target variables. This parallels the difference between ordinary and logistic regression, which are similar to each other, but one is for continuous target variables and the other for categorical target variables. The machine learning world has adopted the convention of using "regression" as an adjective to describe algorithms for continuous targets, unlike the statistical world, where "regression" is also applied to algorithms for categorical variables, as in "logistic regression." And to be perfectly clear, note that machine-learning decision trees are entirely different creatures from the decision trees used in operations management, which are used to calculate expected values to assist with decision-making under uncertainty.

The tree structure. Decision trees generate a set of if-then rules that allow the use of known predictor variables to predict target variables. As a simple example of a classification decision tree that uses this type of decision rule, consider a rule for a credit card company that determines whether to issue a Platinum Card to an applicant:

If the monthly mortgage to income ratio is less than 28%, and months posted late is less than 1, and salary is greater than $45,000, then issue a platinum card.

The set of if-then rules for making the decision to issue an applicant a credit card can be graphically portrayed using a tree representation, as shown in Figure 7.1.

Figure 7.1: A Tree Representation of the Decision to Issue a Platinum Credit Card

A natural question to ask is how we can generate good if-then rules. Ideally, as in our previous models, they would be derived from historical data on other customers, and would be based on the relation between characteristics of those customers and whether or not a customer generated a profit for the credit card company. This is precisely what the tree algorithm does. Specifically, these methods infer the set of if-then rules (known as splits or splitting rules) that separate profitable from unprofitable customers.

Why trees? While classification trees have some advantages over logistic regression, they have one disadvantage: they are not good at finding weak predictors once the stronger predictors are identified. A common approach to solving that problem is to train many slightly different models to pick up the weaker predictors. Such ensemble methods are commonly used to pick up weak predictors. At the end of this chapter we will introduce one method for generating many trees, the random forest [1]. Before we can dive into the more sophisticated world of forests, we will need to make sure we have a solid understanding of classification trees.

[1] A more powerful, and more complex, algorithm using an ensemble of trees is gradient boosting. If you want to go on in machine learning, gradient boosting is the next predictive modeling algorithm to learn.

7.1 The Tree Algorithm

Like other predictive models, the classification tree algorithm uses a data set with historical information on both predictor and target variables.
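For example, a few rows of such a historical data set for the credit card application might look like the following sketch (the variable names and values here are made up for illustration; they are not data from the chapter):

# A toy version of the credit card history data: three predictors and one target
past.applicants <- data.frame(
  MortgageToIncome = c(0.21, 0.35, 0.27, 0.40),      # monthly mortgage to income ratio
  MonthsLate       = c(0, 2, 0, 1),                  # months posted late
  Salary           = c(52000, 38000, 61000, 47000),  # annual salary
  Outcome          = c("Good", "Bad", "Good", "Bad") # the target variable
)
past.applicants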
The target variable is nominal (binary in our example), and might take the value "Bad" (to indicate that "this customer has proven to be a costly cardholder") or "Good" (to indicate that "this customer has proven to be a profitable cardholder"). The algorithm searches for decision rules based on the predictor variables (mortgage to income ratio, months late, salary) that progressively divide the cases in the data into smaller groups. The criterion used to determine the split value at each branch is that each of the two smaller groups has a higher concentration of either "Bad" or "Good" than the preceding larger group. (Jargon alert: each subset is more homogeneous, or pure, than its parent set.) The final groups, called leaves, will have either mostly good or mostly bad cardholders. The sequence of rules can then be used to sort new applicants who currently do not have a platinum card into likely good or likely bad customers, based only on the predictor variables, providing useful information for the decision of whether or not to issue a card.

It may be helpful to compare tree models with regression, and classification trees with logistic regression:

1. Both classical regression and tree models require historical data containing both predictor and target variables.

2. Both regression and tree algorithms use the historical data to create (i.e., estimate or train) a model that summarizes the relation between the predictor and target variables.

3. A regression model is an algebraic formula, while the tree model is a sequence of if-then rules.

4. Either model, once generated, can use the predictor variables from a new case to "predict" the likely value of the target variable.

5. The logistic regression model calculates the probability of the target variable taking the value "Good" for a particular case (e.g., customer). A threshold probability (for example, but not necessarily, p = 0.5) can then be used to split the data into two groups using only the predictor variables. If the probability is greater than the threshold (e.g., 0.5), we put the customer into the "Good" category; otherwise, into the "Bad" category. Similarly, each leaf in the classification tree model has a proportion of Good/Bad past customers, which, for new customers, can be considered the probability that a customer with the preceding characteristics will be "Good" or "Bad". A threshold probability can then be selected to assign customers to a category in the same way as with logistic probabilities [2].

[2] In the credit card application, the probability would be used to calculate expected returns, and the threshold would be based on those rather than on the raw probability.

Tree model algorithms vary in a number of characteristics. We will use the rpart (which stands for recursive partitioning) algorithm (Venables and Ripley, 2002), and confine ourselves to binary target variables. The rpart model is also known as the Classification and Regression Tree (CART or C&RT) model (Breiman et al., 1984).

Tree models have both advantages and disadvantages relative to other types of predictive models. Interpretation of a trained tree model is very simple and intuitive, which makes it particularly useful for presentations to non-technical audiences. A tree model is also more flexible than classic regression models in its assumptions about model structure (e.g., it does not assume linearity or diminishing returns), easily handles large numbers of variables, and can tolerate missing values.
These features make it a good tool for initial exploration of a database. However, a logistic regression model will usually pick up weaker (but still meaningful) effects; and with enough data, the right variable transformations, and careful management of missing values, it will fit and predict better than a tree model.

7.1.1 Training the Tree

The first step of the rpart algorithm is a search of all possible split points of all possible predictor variables to select the one split of one variable that creates a pair of subgroups such that the target variable concentrations are the most homogeneous (or "pure") of all possible pairs. This requires a measure of purity so that all possible splits of all variables can be compared automatically. The core requirement for this measure is that it must indicate that a subgroup with half positives and half negatives (say, donors and non-donors), a 50/50 mix, is the least pure, and that a subgroup with either all positives or all negatives is the most pure. Several measures are possible. The measure we will use is the Gini index.

The following example illustrates the training process. Suppose a bicycle shop has collected information on adult customers entering the store. This information includes the customer's age, the number of children he or she has, and whether or not he or she purchased a bicycle. The owner of the shop wishes to quantify the relation between a bicycle purchase (the target) and age and children (the predictors). The data set has 289 buyers and 911 non-buyers; 24% of the total were buyers. In a spreadsheet, this information is contained in the variable "Buy", which takes the values "Yes" or "No."

We start by generating trial splits. For example, a trial split at customer age 40 produces 400 customers who are under 40 and 800 who are over 40, with the following proportions of buyers and non-buyers:

                       Age < 40      Age ≥ 40
Buy (Yes/No):          189/211       100/700
Total in branch:       400           800
                       (47% Yes)     (12.5% Yes)

The starting mix was 24%/76%. This trial split gives the under-40 branch a 47%/53% mix, which is less pure. The over-40 mix is 12.5%/87.5%, which is more pure.

To automate the selection of the best of all possible splits, we need a single number that we can use to compare the purity of the above split with all possible splits. Recall that this measure needs to identify a 50-50 split as the most impure a subgroup can be, and a 0-100 (or 100-0) split as the most pure. We then need to combine the purities of both subgroups to get a single measure of purity for the split. We should also recognize that the over-40 group has twice as many cases (800 vs. 400) by giving it more weight. The Gini index provides such a measure of the change in "impurity" resulting from a split of the data.

The Gini index is calculated as follows for the bicycle shop example. Define PL and PR as the proportions of the total cases (1200) in the original node that are split to the left and right nodes by the Age predictor set to 40, resulting in

PL = 400/1200 = 0.333
PR = 800/1200 = 0.667

Define the proportion of one type (Yes) within the left and right subnodes as pL and pR, which gives

pL = 189/400 = 0.4725
pR = 100/800 = 0.125

Define the Gini index as

PL × pL × (1 − pL) + PR × pR × (1 − pR) = (0.333 × 0.4725 × 0.5275) + (0.667 × 0.125 × 0.875) ≈ 0.156

We can calculate this single number for all possible splits on all possible predictor variables.
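To make the calculation concrete, here is a small R sketch (an illustrative helper, not code from the tutorial) that computes the Gini index for the age-40 trial split from the counts above:

# Gini index for a candidate binary split
# nL.yes, nL.no: Yes/No counts in the left node; nR.yes, nR.no: counts in the right node
gini.split <- function(nL.yes, nL.no, nR.yes, nR.no) {
  n  <- nL.yes + nL.no + nR.yes + nR.no  # total cases in the parent node
  PL <- (nL.yes + nL.no) / n             # share of cases sent left
  PR <- (nR.yes + nR.no) / n             # share of cases sent right
  pL <- nL.yes / (nL.yes + nL.no)        # proportion of Yes in the left node
  pR <- nR.yes / (nR.yes + nR.no)        # proportion of Yes in the right node
  PL * pL * (1 - pL) + PR * pR * (1 - pR)
}

# The Age < 40 vs. Age >= 40 trial split for the bicycle shop
gini.split(189, 211, 100, 700)  # approximately 0.156

Applying the same calculation to every candidate split and keeping the smallest value is, in essence, what rpart automates at each node.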
To build some intuition for this index, note that when pL or pR is 0.5 (the most impure, 50-50, split), the terms pL(1 − pL) and pR(1 − pR) are largest, having a value of 0.5 × 0.5 = 0.25. When pL or pR is one or zero, the most pure cases, these terms are smallest, having a value of zero. These terms are then averaged. But wait (you say), the sizes are different (400 and 800) on each side; shouldn't the big one count for more? Yes, indeed: we weight each side by PL and PR to give the Gini index. The smallest Gini index is then taken as the best split.

For the bicycle shop, there are many possible splits on the age variable, and one possible split on the kids variable. The Gini indices for six trial splits, five for the age variable and the one possible split for the kids variable, are:

Trial Split                        Gini Index Value
Kids = "no", Kids = "yes"          0.213
Age = ...

Figure 7.6: The CCS Tree Diagram with Uniform Branch Length

You can also redo the plots with each leaf showing the proportions of the two values of the target variable rather than the counts, by changing extra= from 2 to 4. Try that now. By using the arrows in the RStudio Plots pane, move between this plot and the previous plot to see how the proportions information is derived from the count information.

Exercise 7.2.5: Copy the proportions tree into a Word document. In one sentence, what is the meaning of the two proportions given in a node?

6. Even more info. With type = 0, only the counts in the leaves were shown. We can also show the counts (or proportions) in all intermediate nodes, with type = 1. The plot becomes extra messy, so for presentation purposes we would not normally show it. It may be useful, though, to see how each node is developing. Go ahead and give it a try. Set type = 1 and extra = 2 (to get back to counts). Reduce the number of digits back to 2 (to reduce clutter) as well. Note that the root node is now labelled No and it has 403/800 No individuals (thus 397 Yes). That is not quite the 50/50 requested in the original selection of the 800, which reassures us that the selection process is actually random.

Exercise 7.2.6: Copy and paste your plot into a Word document. How many individuals had an Average Donation Amount less than $12? Of those, how many belonged to the Monthly Giving program?

In the next two steps we will explore tree models' structural flexibility and the effect of correlated predictors. Following that we will learn how to assess and control overfitting in tree models, before going on to comparing models using lift charts.

7. Structural Flexibility: In logistic regression, a natural logarithm transformation of the AveDonAmt variable improved the fit and lift because there is a concave "decreasing returns" relationship, and because the untransformed model makes a strong assumption about the structural relation between target probability and predictor variables (that it is a simple S-shape) and is not able to capture a concave relationship well. Tree models make weaker assumptions, allowing trees to handle the nonlinear (specifically, concave) relation between AveDonAmt and MonthGive without the need for a transformation. To demonstrate that the log transformation is not necessary, we will run the model with the transformed variable.
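In code, the step described next might look roughly like the following sketch (it assumes the tree in Figure 7.6 was fit with rpart() on the full CCS predictor list, with the rpart and dplyr packages already loaded as elsewhere in the tutorial; the tutorial's exact call may differ):

# Create the log-transformed donation variable, as in the logistic regression tutorial
CCS$Log.AveDonAmt <- log(CCS$AveDonAmt)

# Refit the tree, swapping Log.AveDonAmt in for AveDonAmt; other arguments unchanged
trlogADA <- rpart(MonthGive ~ AdultAge + Age20t29 + Age20t39 + Age60pls + Age70pls +
                    Age80pls + AveIncEA + DonPerYear + DwelValEA + EngPrmLang +
                    FinUnivP + hh1mem + hh1t2mem + NewDonor + Region + SomeUnivP +
                    YearsGive + Log.AveDonAmt + LastDonAmt,
                  data = filter(CCS, Sample == "Estimation"), model = TRUE)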
Create a new variable, Log.AveDonAmt, the log of AveDonAmt, as in the logistic regression tutorial. Rerun the tree with the same parameters as before, except replacing AveDonAmt with Log.AveDonAmt. Note that you can copy and paste the previous tree model function, edit the model name to trlogADA, and change AveDonAmt to Log.AveDonAmt. Plot the new tree. The result is shown in Figure 7.7. Use the arrow keys on the plot to go back and forth between the two trees. Note that all that changes in Figure 7.7 is the value of the first split: it is the log of the value of the first split in Figure 7.6. The same individuals end up in the same leaves.

Figure 7.7: The Log-Transformed-Average-Donation-Amount Tree

Generally, any monotonic transformation (one which does not change the ordering of the values) of a variable will not change the resulting tree model. While this flexibility in handling nonlinear relations between predictor and target is a strength of tree models, particularly for exploring data, it is difficult to interpret nonlinearities beyond the first or second split, so the various plotting methods and analyses discussed in previous tutorials remain very useful.

Delete the log-transformed variable, as we will no longer need it and do not want to accidentally include it later on.

# Delete Log.AveDonAmt
CCS$Log.AveDonAmt <- NULL

Figure 7.8: Last Donation Amount Tree

In the logistic regression tutorial we found a similar effect in the sequence of logistic regression models. In some models, one of these two variables (AveDonAmt and LastDonAmt) did not even appear as significant, even though they are both individually quite strongly related to the dependent variable. Even in the final model, where both appear, neither is indicated as extremely strongly significant (in fact, AveDonAmt appeared to be insignificant). This is a general consequence of including correlated predictor variables in the same model: they reduce each other's apparent impact on the target. In regression models, the variances of the correlated variables' estimated coefficients are inflated (a phenomenon known as "variance inflation"), lowering their significance, as indicated by increased p-values. In tree models, whichever variable is used first (e.g., AveDonAmt) takes up much of the effect of the second variable, so the second is not used. Highly correlated predictors mainly cause difficulty in interpreting models and in explaining to management why a predictor that they KNOW matters is not in your model. As in regression, including variables that are correlated with each other will have little effect on predictive ability, as long as the trained model is applied to new data from a similar context (and hence structure) as the historical data on which the model was estimated.

9. Overfitting. As indicated earlier, letting a tree grow too large will result in overfitting the estimation sample. The first method to avoid this is to have a stopping rule that prevents us from fitting noise. For trees, the approach used by rpart is to grow a large (overfit) tree, and then prune it back using a cross-validation measure.
This time we want to print the cross-validation or "pruning" table and associated plot. We will start with a large, complex tree by setting a small cp value. Run the tree algorithm, naming the model tree001; use all the original (not log-transformed) variables and set cp = 0.001 to generate a large, complex tree. Plot with uniform branches, and the leaves still on the tree (not fallen).

# Create tree model with cp of 0.001
tree001 <- rpart(MonthGive ~ AdultAge + Age20t29 + Age20t39 + Age60pls + Age70pls +
                   Age80pls + AveIncEA + DonPerYear + DwelValEA + EngPrmLang +
                   FinUnivP + hh1mem + hh1t2mem + NewDonor + Region + SomeUnivP +
                   YearsGive + AveDonAmt + LastDonAmt,
                 data = filter(CCS, Sample == "Estimation"), model = TRUE, cp = 0.001)

The resulting plot is difficult to make sense of, or even read. If you can read it, you can see that some of the leaves have very small numbers. It is very likely overfitting. We want to rerun this model with progressively fewer leaves ("pruning") and print out the cross-validation results. The printcp() command does this, and also prints the model formula.

# formula and pruning table
printcp(tree001)

##
## Classification tree:
## rpart(formula = MonthGive ~ AdultAge + Age20t29 + Age20t39 +
##     Age60pls + Age70pls + Age80pls + AveIncEA + DonPerYear +
##     DwelValEA + EngPrmLang + FinUnivP + hh1mem + hh1t2mem + NewDonor +
##     Region + SomeUnivP + YearsGive + AveDonAmt + LastDonAmt,
##     data = filter(CCS, Sample == "Estimation"), model = TRUE,
##     cp = 0.001)
##
## Variables actually used in tree construction:
##  AdultAge   Age20t29   Age20t39   Age60pls   Age80pls   AveDonAmt  AveIncEA
##  DonPerYear DwelValEA  FinUnivP   hh1t2mem   LastDonAmt Region     SomeUnivP
##  YearsGive
##
## Root node error: 397/800 = 0.49625
##
## n= 800
##
##           CP nsplit rel error  xerror     xstd
## 1  0.3853904      0   1.00000 1.07557 0.035541
## 2  0.0352645      1   0.61461 0.61965 0.032877
## 3  0.0125945      3   0.54408 0.58690 0.032369
## 4  0.0109152      4   0.53149 0.59698 0.032531
## 5  0.0067170      7   0.49874 0.61461 0.032802
## 6  0.0062972     12   0.46348 0.65239 0.033336
## 7  0.0050378     23   0.36272 0.65239 0.033336
## 8  0.0041982     25   0.35264 0.67254 0.033596
## 9  0.0025189     28   0.34005 0.68010 0.033689
## 10 0.0010000     31   0.33249 0.68766 0.033779

Note that the Root node error above is 397/800. If you go back to the tree with the intermediate nodes labelled, you'll see that the root node was labelled No. That is the "prediction" for that node: the no-model prediction, before any predictors are used and any modelling is applied.
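As a quick check, the root-node counts can be reproduced directly from the estimation sample (a one-line sketch; it assumes dplyr is loaded for filter(), as elsewhere in the tutorial):

# Counts of the target variable in the estimation sample: 403 No, 397 Yes
table(filter(CCS, Sample == "Estimation")$MonthGive)

# The root node error is the proportion of the minority ("Yes") class
397 / 800  # 0.49625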
Also note that there were 397 Yes; thus if No is the prediction (the best you could do without the model) for new data, it would be wrong 397/800 times, almost half the time [5].

[5] The root node error is the same information as the "null deviance" used in calculating McFadden R2: it is how well you would do knowing only the initial proportion of responses, with no predictors, and it provides a starting level against which to see how much modelling with various predictors reduces the fitting error. The last two columns, xerror and xstd, will vary somewhat with every run, so yours may not be the same as in this table.

The table shows the number of splits (nsplit) for various values of the complexity parameter CP, from no splits down to the level of cp you set (in this case 0.001, and 31 splits). Counterintuitively, the complexity of the tree increases as cp decreases. As variables are added at each split, the error in prediction is recalculated, and as the nodes become more pure, the error decreases. The decrease is given as a fraction of the root node error, under rel error, for easier interpretation (analogous to McFadden R2 relying on the fraction that the model deviance is of the null deviance). The relative error of the root node (the nsplit = 0 case, in the first line) is thus 1.0 (0.49625/0.49625). As this is calculated only on the internal training samples, it always decreases as more leaves are added. This progression should remind you of the continual increase in R2 in a multiple linear regression model, or of McFadden R2 in logistic regression: fit always increases, error always decreases, overfitting will occur at some point, and we need another measure to know where to stop.

xerror: To check for overfitting, the tree routine also automatically and repeatedly splits the input data set into its own estimation and validation samples, re-estimates the model on the new (smaller) estimation sample, and then uses the model to predict the target in the validation sample. The error between the known and predicted target in these internal validation samples is averaged over several iterations, and is called the cross-validation error, labelled xerror. The table shows a common pattern for fit measures that compensate for overfitting: xerror is always larger than rel error, and xerror decreases, levels out, and then increases, with the minimum indicating approximately where overfitting starts. The table has a minimum xerror value of about 0.59 (yours may have a slightly different value) at nsplit = 3 and cp = 0.0125945. Not all splits are reported; intermediate splits that could not possibly be candidates for the optimal tree are left out.

We can also visualize the table (other upper labels are possible, see the Help file).

plotcp(tree001, upper = "splits")

Figure 7.9: The CCS Tree Cross-Validation Pruning Plot (cross-validated relative error versus cp and number of splits)

Figure 7.9 shows xerror as a function of the size (number of splits) or complexity of the tree [6]. Recall that the precise value of xerror depends on how the algorithm divides the data into estimation and validation samples during cross-validation. Since this is done randomly, each run will give slightly different results, and your plot may be slightly different [7]. An obvious choice for the best cutoff is at 3 splits. Since each xerror value is an estimate, it comes with a standard deviation, xstd, shown in the last column of the table.

[6] Note that the cp scale at the bottom of the plot doesn't match the cp values in the table. It is the geometric mean of the cp value at that split level and the previous cp value, and can be used to set a new cp cutoff in pruning (coming up in a moment).

[7] A good way to get some sense of how much variation the random selection of cross-validation samples produces is to use the up-arrow key in the Console to recall and re-execute previous functions. Hit the up arrow twice to bring the rpart() function down and Enter to run it, then twice again to bring the plotcp() function down and Enter to run it. Repeatedly doing this will produce a sequence of slightly different plots. The minimum at 4 leaves (3 splits) remains fairly stable in most.
Any xerror values within one standard deviation of the minimum xerror are statistically indistinguishable. From the pruning table we see that the minimum xerror is about 0.58, with an xstd of 0.032. The table is shown again here for reference. Note that rpart() returns many "variables" to tree001 (see the Help file). Since we only want to look at the cross-validation table, we can select the variable cptable:

tree001$cptable

##             CP nsplit rel error    xerror       xstd
## 1  0.385390428      0 1.0000000 1.0755668 0.03554126
## 2  0.035264484      1 0.6146096 0.6196474 0.03287660
## 3  0.012594458      3 0.5440806 0.5869018 0.03236935
## 4  0.010915197      4 0.5314861 0.5969773 0.03253066
## 5  0.006717045      7 0.4987406 0.6146096 0.03280173
## 6  0.006297229     12 0.4634761 0.6523929 0.03333596
## 7  0.005037783     23 0.3627204 0.6523929 0.03333596
## 8  0.004198153     25 0.3526448 0.6725441 0.03359570
## 9  0.002518892     28 0.3400504 0.6801008 0.03368870
## 10 0.001000000     31 0.3324937 0.6876574 0.03377934

Thus any splits with xerror less than 0.58 + 0.03, about 0.61, are equally good on the basis of cross-validation. This level is indicated by the dotted line on the plot. The interpretation is that any values below the line are statistically equivalent. In this case, 3 and 4 splits are equally good, and 1 and 7 splits almost as good. One criterion we could consider is to choose the most parsimonious of the equivalent trees, the one with a single split, which will be the easiest to interpret and the most conservative for the purpose of avoiding overfitting. And easy to interpret is also easy to explain. In this case, however, the 3- or 4-split tree is still very easy to interpret, so we would not use the one-split, two-leaf model.

10. Pruning the tree: Use the pruning table to identify the cp value associated with any desired tree. The most conservative tree, at nsplit = 1, is at cp = 0.0352645. To generate the conservative tree, rerun the model with any cp cutoff between that cp value and the next larger value (0.3853904) in the pruning table; let's choose 0.036. Name the model tree036 and plot it. The next two more complex trees occur at cp values of 0.0125945 and 0.0109152. Setting cp to 0.013 and 0.012, values slightly larger than those two, will give us those trees. Run two more models with those values, naming them tree013 and tree012. Plot the trees and check that they have the expected number of splits. Note how the tree grows with more nodes and leaves as cp decreases.

Exercise 7.2.10: Copy and paste the three trees into a Word document. Generate the pruning plots for each of the three, and copy and paste them into the document as well. Describe any differences between the pruning plot for the large tree in Figure 7.9 and any of the new pruning plots.
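As an aside, both the cutoff level and the smaller trees in step 10 can be obtained programmatically from the large fitted tree. Here is a sketch using rpart's prune() function (an alternative to re-running the model; it gives the same tree structures, although the cross-validation columns of each run's table will still vary):

# The "within one standard deviation of the minimum xerror" level, from the pruning table
cpt  <- tree001$cptable
best <- which.min(cpt[, "xerror"])
cpt[best, "xerror"] + cpt[best, "xstd"]

# Pruning the large tree back with the chosen cp cutoffs
tree036 <- prune(tree001, cp = 0.036)  # 1 split
tree013 <- prune(tree001, cp = 0.013)  # 3 splits
tree012 <- prune(tree001, cp = 0.012)  # 4 splits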
Keep in mind that each run will produce slightly different results for the cross-validation error, with the differences becoming more noticeable as the tree becomes larger and small, non-repeatable differences between the internal estimation and validation samples are picked up.

11. The champion model. First we will generate lift charts to see the improvement in fit on our estimation sample. Enter 0.01 for the true response rate, "Yes" for the target, set the Subset expression to Sample == "Estimation", and use "Estimation" for the subtitle. As a reminder, here is the lift.chart() function with the required arguments.

# Create lift charts to compare models on the "Estimation" sample
lift.chart(modelList = c("tree001", "tree012", "tree013", "tree036"),
           data = filter(CCS, Sample == "Estimation"),
           targLevel = "Yes", trueResp = 0.01,
           type = "cumulative", sub = "Estimation")

Figure 7.10: Comparison of Model Fit for Progressively More Complex Trees (weighted cumulative response captured, estimation sample)

As expected, the more branches in the tree (the smaller its complexity parameter), the better the fit to the estimation sample. However, we can be sure that at least the big 31-split tree (tree001) is overfitting, based on our previous cross-validation xerror results. To pick a model for prediction we will assess the models on the validation sample. Make the necessary changes to the lift chart arguments to generate the following chart.

Figure 7.11: Comparison of Model Prediction on the Validation Sample for Progressively More Complex Trees

The sequence of improvement stays the same up to tree012 (4 splits, 5 leaves), which is just slightly better than tree013 (3 splits, 4 leaves). The relations between predictors and target that it found in the estimation data also exist in the validation data. But tree001, the big tree, does worse: its impressive results on the estimation data set are not replicable on new data. It is capturing relations in the estimation data that don't exist in other data. Capturing unrepeatable relations, or noise, is overfitting. We will take tree012 as our champion model.

Exercise 7.2.11: How consistent is the model comparison information that you get from Figure 7.11 with the information from xerror in the large pruning table for tree001? (Note that the pruning tables for the smaller trees are essentially the same as the tree001 table with the bottom rows missing.) Which column in the pruning table gives similar model comparison information to that in Figure 7.10? Copy the table and lift charts into your document to support your answer.

7.3 Random Forests

1. A common approach in machine learning is to combine many models to produce better predictions. Such ensemble methods are particularly useful for combining weak models to produce a strong result. Tree models are weak predictors compared to logistic regression, but there are ways to generate many different trees and combine their predictions to produce a better outcome. A popular ensemble method for trees is the random forest. Many different trees are created, each using only a subset of the cases, and only a subset of all the available predictors at each node.
The details of the case selection process are beyond the scope of this short tutorial, but the key point is that each tree leaves out different cases to be used in cross-validation, just as the internal cross-validation process does with single trees. These holdout cases, which are not used in training each tree so that they can be used for validation, are referred to as "out of bag" or OOB. Unlike with a single tree, the optimization is done automatically, without analyst intervention.

Then, at each node, only a random subset of predictors is selected to test for splits. For example, we could start by randomly selecting four of the 19 predictors in the CCS data, and find the best first split using only those predictors. In the next nodes, we use a different randomly selected set of four predictors to make the next split. In each node, the split search is limited to four randomly selected predictors. This process is repeated for hundreds of trees. The final prediction for an individual case (a donor in our example) is determined by a "vote" among all the trees: whichever outcome (Yes or No) occurred most frequently for that donor among all the trees is the prediction for that individual. As usual, that predicted outcome can be compared with the observed outcome in validation sets to generate hit rates, confusion matrices, ROC curves, or lift charts to compare and select models.

If you look back at the section on correlated predictors, you can see how using only a subset of predictors helps. In that section you saw that, although both LastDonAmt and AveDonAmt should be strong predictors, a single tree cannot pick up both. Over a few hundred trees using subsets of predictor variables, you would get the effect of one of the predictors in some trees, and the other in other trees. Besides generating a stronger predictive model than an individual tree, random forests solve the correlated predictor problem.

All of the advantages of classification trees are retained with the random forest, except easy interpretation. The nice if-then structure of an individual tree that provides the details of the process is lost. The random forest is a bit of a black box, although we can extract the relative importance of individual variables. This is a big loss, particularly for presentation purposes.

As well as the number of variables selected in each node and the number of trees to grow, additional "tuning parameters" are available as arguments to control the size of individual trees (see nodesize and maxnodes in the randomForest Help tab), but their default values usually work well. The default rule is to stop splitting when there are fewer than 5 cases in a node, so very complex trees will be grown. The usual way to find good machine learning models is to search through a range of values of the many tuning parameters available. Once again, fast computers allow such rather inelegant but effective brute-force searches. While we won't delve into the tools available for searching for optimal tuning parameters, you are encouraged at the end of the tutorial to experiment with changing the number of variables per node and the number of trees.

2. For our first random forest we will generate 500 trees (ntree = 500) and, in each node of each tree, randomly select 4 variables (mtry = 4) to work with. These are the default values, so they do not need to be specified, but we will put them in for now. We will also request the importance values (importance = TRUE), which is not the default.
Call the model Forest4:

# Create a Random Forest Model - see ?randomForest::randomForest
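# A sketch of the call, based on the parameters just described (ntree = 500, mtry = 4,
# importance = TRUE). The formula and data arguments are assumed to mirror the earlier
# tree models and may differ from the tutorial's exact code.
library(randomForest)

Forest4 <- randomForest(MonthGive ~ AdultAge + Age20t29 + Age20t39 + Age60pls +
                          Age70pls + Age80pls + AveIncEA + DonPerYear + DwelValEA +
                          EngPrmLang + FinUnivP + hh1mem + hh1t2mem + NewDonor +
                          Region + SomeUnivP + YearsGive + AveDonAmt + LastDonAmt,
                        data = filter(CCS, Sample == "Estimation"),
                        ntree = 500, mtry = 4, importance = TRUE)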