AI 305: Introduction to Machine Learning - Tree-Based Methods
Summary
These lecture notes introduce tree-based machine learning methods and ensemble learning, including decision tree learning, its advantages and disadvantages, and algorithms such as bagging, random forests, and boosting. The notes also cover entropy, information gain, and related splitting criteria.
Full Transcript
Introduction to Machine Learning (AI 305): Tree-based Methods and Ensemble Learning

Contents
o Decision Tree Learning
◦ Classification Decision Trees
◦ Regression Decision Trees
◦ Decision Trees vs. Linear Models
◦ Advantages and Disadvantages
o Ensemble Learning
◦ Bagging
◦ Random Forests
◦ Boosting

Tree-based Methods
Here we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions (rectangles). Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these approaches are known as decision-tree methods.

Tree-based Methods – cont.
Tree-based methods are:
o Fast, simple, and useful for interpretation (easy to explain).
o Invariant under scaling/translation, and robust to irrelevant features.
o Able to work well with both categorical and quantitative features.
o Capable of arbitrarily complicated decision boundaries that can perfectly classify any training set.
[Figure: a feature space partitioned into rectangular regions R1, R2, R3.]

The Basics of Decision Trees
Decision trees can be applied to both regression and classification problems. We first consider classification problems, and then move on to regression.

Terminology for Decision Trees
o In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal nodes or leaf nodes. A leaf node specifies the prediction h(x).
o Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
o The points along the tree where the predictor space is split are referred to as internal nodes.

Classification Decision Tree
o Decision trees are trained in a greedy, recursive fashion, downward from the root.
o After creating a node with some associated split, the children of that node are constructed by the same tree-growing procedure.
o Simply put, we consider tests of the form "Is feature j of this point less than the value v?"
◦ These tests carve up the feature space in a nested rectangular fashion.
o The data used to train the left subtree are only those points satisfying x_j < v, and similarly the right subtree is grown with only those points with x_j >= v.
o Eventually we must reach a base case where no further splits are made, and a prediction (rather than a split) is associated with the node.

Classification Decision Tree – Algorithm
o Let S = {1, 2, ..., n} be the set of sample point indices.

Decision Tree – How to choose the best split?
o Try all splits (all features, and all splits within a feature).
o For a set S, let J(S) be the cost of S.
o Choose the split that minimizes $J(S_l) + J(S_r)$; or
o the split that minimizes the weighted average $\frac{|S_l| J(S_l) + |S_r| J(S_r)}{|S_l| + |S_r|}$.
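As an illustration of the greedy split search just described, here is a minimal sketch in Python, assuming a generic cost function J (such as the entropy or Gini impurity discussed below); the function and variable names are illustrative, not taken from the slides.

```python
import numpy as np

def best_split(X, y, cost):
    """Try every feature j and threshold v; return the split minimizing the
    weighted-average cost (|Sl|*J(Sl) + |Sr|*J(Sr)) / (|Sl| + |Sr|)."""
    n, d = X.shape
    best_feature, best_value, best_cost = None, None, np.inf
    for j in range(d):
        for v in np.unique(X[:, j]):
            left = X[:, j] < v            # the test "is feature j less than v?"
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue                  # skip degenerate splits
            w_cost = (left.sum() * cost(y[left]) + right.sum() * cost(y[right])) / n
            if w_cost < best_cost:
                best_feature, best_value, best_cost = j, v, w_cost
    return best_feature, best_value, best_cost
```

The same search is applied recursively to the points with x_j < v (left subtree) and x_j >= v (right subtree) until a stopping criterion is reached.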
Decision Tree – How to choose the cost J(S)?
Poor cost function: misclassification rate.
o Label S with the class C that labels the most examples in S.
◦ J(S) = number of examples in S not in class C.
◦ Problem: J(S_l) + J(S_r) = 10 for both splits, but the left split is much better. [The weighted average even prefers the right split!]

Decision Tree – How to choose the cost J(S)? (2)
Good cost function: entropy.
o We can use a cost function based on information theory, by measuring the entropy.
o Let Y be a random class variable, and suppose P(Y = C) = p_C.
o The surprise of Y being class C is $-\log_2 p_C$ (nonnegative).
◦ An event with probability 1 gives us zero surprise.
◦ An event with probability 0 gives us infinite surprise!

Entropy
o In information theory, the surprise measures the number of bits of information we need to transmit to tell a recipient who knows the probabilities of the events which event happened.
o The entropy of a set S is the average surprise
  $H(S) = -\sum_C p_C \log_2 p_C$,
  where $p_C$ is the proportion of examples in S that are in class C:
  $p_C = |\{i \in S : y_i = C\}| / |S|$.

Entropy – cont.
o If all examples in S belong to the same class: $H(S) = -1 \log_2 1 = 0$.
o If half are of class C and half of class D: $H(S) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$.
o If there are n examples, all of different classes: $H(S) = -\log_2 \frac{1}{n} = \log_2 n$.

Entropy – cont.
o The entropy is the expected number of bits of information we need to transmit to identify the class of a sample point in S chosen uniformly at random.
o It makes sense that it takes 1 bit to specify C or D when each class is equally likely.
o And it makes sense that it takes $\log_2 n$ bits to specify one of n classes when each class is equally likely.

Information Gain
o Choose the split that maximizes the information gain
  $I(S) = H(S) - H_{after}$,
  where $H_{after}$ is the weighted average entropy after the split:
  $H_{after} = \frac{|S_l| H(S_l) + |S_r| H(S_r)}{|S_l| + |S_r|}$.

Entropy vs. Misclassification rate
o Suppose we pick two points on the entropy curve, then draw a line segment connecting them.
o Because the entropy curve is strictly concave, the interior of the line segment is strictly below the curve.
o Any point on that segment represents a weighted average of the two entropies for suitable weights.
o If you unite the two sets into one parent set, the parent set's value p_C is the weighted average of the children's p_C's.
o Therefore, the point directly above that point on the curve represents the parent's entropy.
o The information gain is the vertical distance between them.
o So the information gain is positive unless the two child sets both have exactly the same p_C and lie at the same point on the curve.

Entropy vs. Misclassification rate – cont.
o Plotting the % misclassified instead, if we draw a line segment connecting two points on the curve, the segment might lie entirely on the curve.
o In that case, uniting the two child sets into one, or splitting the parent set into two, changes neither the total number of misclassified sample points nor the weighted average of the % misclassified.
o The bigger problem, though, is that many different splits will get the same weighted average cost; this test does not distinguish the quality of different splits well.

Entropy vs. Misclassification rate – cont.
o Both curves are concave, but the one for misclassification rate is not strictly concave;
◦ it has only two distinct tangent lines.
o Because the quantity after the split is a convex combination (weighted by probabilities/proportions) of the children's values, a strictly concave cost always yields positive information gain as long as the children's distributions are not identical to the parent's.
o We have no such guarantee in either linear region of the misclassification-rate curve;
◦ any convex combination of points on a line also lies on the line, yielding zero information gain.

Gini impurity
o Another way to assess the quality of a split is the Gini impurity (also strictly concave).
o It measures how often a randomly chosen element of the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set:
  $G(S) = \sum_C p_C (1 - p_C)$.
o Empirically, the Gini impurity produces very similar results to entropy, and it is slightly faster to compute because there is no need to take logs.
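A short sketch of these impurity measures and the information gain in Python (illustrative code, not from the slides); for a half-C, half-D set, the entropy comes out to 1 bit, matching the example above.

```python
import numpy as np

def entropy(y):
    """H(S) = -sum_C p_C log2 p_C, with p_C the class proportions in S."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    """G(S) = sum_C p_C (1 - p_C)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def information_gain(y, y_left, y_right):
    """I(S) = H(S) minus the weighted average entropy of the two children."""
    n_l, n_r = len(y_left), len(y_right)
    h_after = (n_l * entropy(y_left) + n_r * entropy(y_right)) / (n_l + n_r)
    return entropy(y) - h_after

print(entropy(np.array(["C", "C", "D", "D"])))   # 1.0 bit
```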
Stopping criteria
o We mentioned earlier that sufficiently deep decision trees can represent arbitrarily complex decision boundaries,
◦ but this will lead to overfitting if we are not careful: shallow trees tend to underfit and deep trees tend to overfit the data.
o There are a number of heuristics we may consider to decide when to stop splitting:
◦ Fixed depth: don't split if the node is beyond some fixed depth in the tree.
◦ Node purity: don't split if the proportion of training points in some class is sufficiently high.
◦ Information gain criterion: don't split if the gained information/purity is sufficiently close to zero.
o Note that these are not mutually exclusive, and the thresholds can be tuned with validation.
o As an alternative (or addition) to stopping early, you can prune a fully-grown tree by re-combining splits if doing so reduces validation error.

Fixed Tree Depth
o Increasing the maximal depth does not necessarily result in better performance.
o [Figure: ten-fold cross-validation loss as a function of the maximal tree depth for a classification problem.]
o The optimal maximal tree depth here is 6.

Pruning
o Grow the tree too large, then greedily remove each split whose removal improves validation performance.
o This is more reliable than stopping early.
o Pruning often works better than stopping early because a split that doesn't seem to make much progress is often followed by a split that makes a lot of progress.
◦ If you stop early, you'll never find out.
o Pruning is a simple idea, and it is highly recommended when you have enough time to build and prune the tree.

Regression Decision Tree
The goal is to find regions (boxes) R_1, ..., R_J that minimize the residual sum of squares (RSS), given by
  $\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$,
o where $\hat{y}_{R_j}$ is the mean response for the training observations within the j-th box.

[Figure: Baseball salary data, with salary color-coded from low (blue, green) to high (yellow, red), together with the corresponding tree regions.]

Regression Decision Tree – cont.
We predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs. A five-region example of this approach is shown in the next slide.

[Figure, four panels: a partition of two-dimensional feature space that could not result from recursive binary splitting (top left); the output of recursive binary splitting on a two-dimensional example (top right); a tree corresponding to the partition in the top right panel (bottom left); and a perspective plot of the prediction surface corresponding to that tree (bottom right).]
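As a sketch of the ideas above (a depth limit tuned by cross-validation, pruning a fully-grown tree, and a regression tree that predicts the mean response per region), here is illustrative Python assuming scikit-learn and a synthetic dataset; note that scikit-learn prunes by cost-complexity (ccp_alpha) rather than the greedy reduced-error pruning described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                           # two synthetic features
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

# Fixed depth: choose the maximal depth with the lowest 10-fold CV loss.
depths = range(1, 11)
cv_mse = [-cross_val_score(DecisionTreeRegressor(max_depth=d), X, y,
                           cv=10, scoring="neg_mean_squared_error").mean()
          for d in depths]
best_depth = list(depths)[int(np.argmin(cv_mse))]

# Pruning: grow a large tree, then weaken it by cost-complexity pruning;
# in practice ccp_alpha would be chosen on validation data, not arbitrarily.
full_tree = DecisionTreeRegressor().fit(X, y)
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas
pruned = DecisionTreeRegressor(ccp_alpha=alphas[len(alphas) // 2]).fit(X, y)
```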
Tree vs. Linear Models
o A two-dimensional classification example in which the true decision boundary is linear, indicated by the shaded regions.
o A classical approach that assumes a linear boundary (left) will outperform a decision tree that performs splits parallel to the axes (right).

Tree vs. Linear Models – cont.
o Here the true decision boundary is non-linear.
o A linear model is unable to capture the true decision boundary (left), whereas a decision tree is successful (right).

Advantages of Trees
▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen earlier.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy variables.

Disadvantages of Trees
▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
▼ Additionally, trees can be very non-robust:
◦ a small change in the data can cause a large change in the final estimated tree (high variance).
However, by aggregating many decision trees, the predictive performance of trees can be substantially improved. We introduce these concepts next.

Ensemble Learning
o Tree-based methods are fast, simple, and useful for interpretation; they are invariant under scaling/translation and robust to irrelevant features.
o But they are not the best at prediction accuracy (compared to other methods):
◦ they can produce high variance, even though we can achieve very low bias.
o Suppose we take a training data set, split it into two halves, and train two decision trees, one on each half of the data.
o It's not uncommon for the two trees to turn out very different.
◦ In particular, if the two trees pick different features for the very first split at the root of the tree, then it's quite common for the trees to be completely different.
o So decision trees tend to have high variance.

Ensemble Learning – cont.
o Can we reduce the high variance of decision trees?
o Simple answer: yes,
◦ by taking the average answer of a bunch of decision trees.
o The main idea is that sometimes the average opinion of a bunch of idiots is better than the opinion of one expert.
o And so it is with learning algorithms.
o We call a learning algorithm a weak learner if it does better than guessing randomly, and we combine a bunch of weak learners to get a strong one.

Ensemble Learning – cont.
o So, how can we compute such an average?
◦ Average the outputs of different learning algorithms.
◦ Average the same learning algorithm trained on many training sets (if we have tons of data).
◦ Average the same learning algorithm trained on many random subsamples of one training set (bagging).
◦ Average randomized decision trees trained on random subsamples (random forests).

Ensemble Learning – cont.
o The Netflix Prize was an open competition (ended in 2009) for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings.
o The winners used an extreme ensemble method that took an average of many different learning algorithms.
o In fact, a couple of top teams combined into one team so they could combine their methods.
◦ They said, "Let's average our models and split the money."

Ensemble Learning – cont.
o An ensemble method is an approach that combines many simple "building block" models in order to obtain a single and potentially very powerful model.
o These simple building block models are sometimes known as weak learners, since they may lead to mediocre predictions on their own.
o There are several approaches:
◦ Bagging
◦ Random forests
◦ Boosting
o Ensembling is also another way to combat overfitting.
◦ Predictions are typically combined by plurality vote in the case of classification and by averaging in the case of regression.
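To make the combination rule concrete, here is a minimal sketch of averaging the outputs of different learning algorithms by plurality vote, assuming scikit-learn; the choice of base models and dataset is arbitrary. For regression, the analogue is simple averaging of the predicted values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Plurality ("hard") vote over three unrelated classifiers.
ensemble = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("logreg", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    voting="hard",
).fit(X, y)
print(ensemble.score(X, y))
```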
Bagging
o Bootstrap AGGregatING, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
o Recall that given a set of n independent observations Z_1, ..., Z_n, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is $\sigma^2 / n$. In other words, averaging a set of observations reduces variance.
o So a natural idea would be to train the method on many separate training sets and average the resulting predictions. Of course, this is not practical because we generally do not have access to multiple training sets.

Bagging – continued
o Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
o In this approach we generate B different bootstrapped training data sets.
o We then train our method on the b-th bootstrapped training set in order to get $\hat{f}^{*b}(x)$, the prediction at a point x, and average all the predictions to obtain
  $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$.
o This is called bagging.

Bagging classification trees
o The above prescription applies to regression trees.
o For classification trees: for each test observation, we record the class predicted by each of the B trees and take a majority vote; the overall prediction is the most commonly occurring class among the B predictions.

Out-of-Bag Error Estimation
o It turns out that there is a very straightforward way to estimate the test error of a bagged model.
o Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that, on average, each bagged tree makes use of around two-thirds of the observations.
o The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.
o We can predict the response for the i-th observation using each of the trees in which that observation was OOB. This yields around B/3 predictions for the i-th observation, which we average (regression) or take a majority vote over (classification).
o This estimate is essentially the leave-one-out (LOO) cross-validation error for bagging, if B is large.

Random Forests
o In bagging we use random sampling, but it isn't random enough: often the decision trees look very similar.
o One really strong predictor leads to the same feature being split at the top of every tree.
◦ For example, if you're building decision trees to identify spam, the first split might always be "lottery".
◦ Random sampling might not change that.
o If the trees are very similar, then taking their average doesn't reduce the variance much.

Random Forests – cont.
o Random forests provide an improvement over bagged trees by decorrelating the trees. This reduces the variance when we average the trees.
o As in bagging, we build a number of decision trees on bootstrapped training samples.
o But when building these decision trees, each time a split in a tree is considered, a random selection of m predictors (features) is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
◦ Per-classifier bagging plus per-split feature randomization.
o A fresh selection of m predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$; that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (4 out of the 13 for the Heart data).

Example: the Heart data
o These data contain a binary outcome HD for 303 patients who presented with chest pain.
o An outcome value of Yes indicates the presence of heart disease based on an angiographic test, while No means no heart disease.
o There are 13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements.

Results: the Heart data
o [Figure: bagging and random forest results for the Heart data.] The test error (black and orange) is shown as a function of B, the number of bootstrapped training sets used. Random forests were applied with $m = \sqrt{p}$.
o The dashed line indicates the test error resulting from a single classification tree.
o The green and blue traces show the OOB error, which in this case is considerably lower.
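A sketch of bagging versus a random forest, assuming scikit-learn and a synthetic dataset of roughly the Heart data's size; with bootstrap sampling enabled, the OOB error estimate described above comes essentially for free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in roughly the size of the Heart data (303 patients, 13 predictors).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# Bagging is the special case m = p: every predictor is a split candidate at every split.
bagging = RandomForestClassifier(n_estimators=500, max_features=None,
                                 oob_score=True, random_state=0).fit(X, y)

# Random forest: a fresh random subset of m ~ sqrt(p) predictors at each split.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)

print(bagging.oob_score_, forest.oob_score_)   # OOB accuracy, roughly LOO CV for large B
```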
Example: gene expression data
o We applied random forests to a high-dimensional biological data set consisting of expression measurements of 4,718 genes measured on tissue samples from 349 patients.
o There are around 20,000 genes in humans, and individual genes have different levels of activity, or expression, in particular cells, tissues, and biological conditions.
o Each of the patient samples has a qualitative label with 15 different levels: either normal or one of 14 different types of cancer.
o We use random forests to predict cancer type based on the 500 genes that have the largest variance in the training set.
o We randomly divided the observations into a training and a test set, and applied random forests to the training set for three different values of the number of splitting variables m.

Results: gene expression data
o [Figure: results from random forests for the fifteen-class gene expression data set with p = 500 predictors.] The test error is displayed as a function of the number of trees. Each colored line corresponds to a different value of m, the number of predictors available for splitting at each interior tree node.
o Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7%.
o As with bagging, random forests will not overfit as we increase B; in practice we use a value of B sufficiently large for the error rate to have settled down.

Boosting
o Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification.
o Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.
◦ Each tree is built on a bootstrap data set, independent of the other trees.
o Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.

Boosting – cont.
o The key idea: to improve our combined model, we should focus on finding learners that correctly predict the points which the overall boosted model is currently predicting inaccurately.
o Boosting algorithms implement this idea by associating a weight with each training point and iteratively reweighting so that mispredicted points have relatively high weights.
o Intuitively, some points are "harder" to predict than others, so the algorithm should focus its efforts on those.

AdaBoost
o AdaBoost (adaptive boosting), the most popular version of boosting, is a method for binary classification.

AdaBoost – cont.
o The final classifier combines the weak learners $h_m$ as
  $H(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$,
  where $\alpha_m = \frac{1}{2} \ln \frac{1 - err_m}{err_m}$ and $err_m$ is the weighted training error of $h_m$.
o Note that this choice of $\alpha_m$ gives high weight to classifiers that have low error.

Boosting for regression
o Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.
o Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals.
o Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm.
o By fitting small trees to the residuals, we slowly improve $\hat{f}$ in areas where it does not perform well.
o The shrinkage parameter $\lambda$ slows the process down even further, allowing more and differently shaped trees to attack the residuals.

Boosting for regression – cont.
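A minimal from-scratch sketch of the procedure just described (repeatedly fit a small tree with d splits to the current residuals and add it with shrinkage λ), assuming scikit-learn's DecisionTreeRegressor for the base trees; the function names and default values here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, B=1000, lam=0.01, d=1):
    """Start from f_hat = 0; at each step fit a small tree to the residuals
    and add it to the model with shrinkage lam."""
    residual = np.asarray(y, dtype=float).copy()
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # d splits -> d + 1 leaves
        tree.fit(X, residual)
        residual -= lam * tree.predict(X)   # update the residuals
        trees.append(tree)
    return trees

def boosted_predict(trees, X, lam=0.01):
    """f_hat(x) = sum over b of lam * f_b(x)."""
    return lam * sum(tree.predict(X) for tree in trees)
```

With d = 1 each base tree is a stump, so the boosted model is additive in the individual predictors, which connects to the tuning discussion that follows.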
Tuning parameters for boosting
1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all.
◦ We use cross-validation to select B.
2. The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns (the step size).
◦ Typical values are 0.01 or 0.001, and the right choice can depend on the problem.
◦ Very small λ can require using a very large value of B in order to achieve good performance.
3. The number of splits d in each tree, which controls the complexity of the boosted ensemble.
◦ Often d = 1 works well, in which case each tree is a stump, consisting of a single split and resulting in an additive model.
◦ More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.

Another regression example
[Figure only.]

Another classification example
[Figure only.]

Variable importance measure
o Bagging and random forests typically result in improved accuracy over prediction using a single tree, but at the cost of interpretability.
o However, we can obtain an overall summary of the importance of each predictor using the RSS (for regression trees) or the Gini index (for classification trees).

Variable importance measure – cont.
o For bagged/RF regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees.
o Similarly, for bagged/RF classification trees, we add up the total amount that the entropy/Gini measure is decreased by splits over a given predictor, averaged over all B trees.
o A large value indicates an important predictor.
o [Figure: variable importance plot for the Heart data.]

Summary
o Decision trees are simple and interpretable models for regression and classification.
o However, they are often not competitive with other methods in terms of prediction accuracy.
o Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
o The latter two methods, random forests and boosting, are among the state-of-the-art methods for supervised learning. However, their results can be difficult to interpret.

End