Tree-Based Methods

Summary

This document provides an overview of tree-based methods in machine learning. It discusses decision trees, regression trees, and their applications in various contexts. The document also explores concepts like boosting, bagging, and random forests.

Full Transcript

12/3/24 TREE-BASED METHODS

Reading:
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
G. James, D. Witten, T. Hastie, R. Tibshirani, and J. Taylor. An Introduction to Statistical Learning with Applications in Python. Springer, July 2023.

Moving beyond linearity

The truth is never linear! Or almost never! But often the linearity assumption is good enough. When it is not, polynomials, step functions, splines, local regression, and generalized additive models offer a lot of flexibility without losing the ease and interpretability of linear models. Other methods: tree-based methods, support vector machines, neural networks.

Tree-based methods

Here we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods.

Pros and cons

Tree-based methods are simple and useful for interpretation. However, they are typically not competitive with the best supervised learning approaches in terms of prediction accuracy. Hence, we also discuss bagging, random forests, and boosting. These methods grow multiple trees, which are then combined to yield a single consensus prediction. Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretability.

The basics of decision trees

Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification.

Baseball salary data

How would you stratify it?
Salary is color-coded from low (blue, green) to high (yellow, red).

Decision tree

For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year. At a given internal node, the label (of the form Xj < tk) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to Xj ≥ tk. For instance, the split at the top of the tree results in two large branches: the left-hand branch corresponds to Years < 4.5, and the right-hand branch corresponds to Years ≥ 4.5. The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.

Results

Overall, the tree stratifies or segments the players into three regions of predictor space: R1 = {X | Years < 4.5}, R2 = {X | Years ≥ 4.5, Hits …
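The three-region stratification described above can be sketched with a small regression tree in scikit-learn. The data below is synthetic (the actual Hitters dataset is not loaded), and the salary rule and the Hits threshold of 120 are invented for illustration; constraining the tree to three leaves mimics the two-internal-node, three-leaf tree from the lecture.

```python
# Sketch only: synthetic Years/Hits data standing in for the Hitters dataset.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 500
years = rng.integers(1, 25, size=n)   # years played in the major leagues
hits = rng.integers(0, 240, size=n)   # hits made in the previous year

# Invented log-salary rule: veterans earn more, and among veterans
# a high hit count earns a further premium (thresholds are arbitrary).
log_salary = (5.1
              + 0.9 * (years >= 4.5)
              + 0.6 * (years >= 4.5) * (hits >= 120)
              + rng.normal(0, 0.3, size=n))

X = np.column_stack([years, hits])
# max_leaf_nodes=3 forces a three-region stratification, as in the slide.
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, log_salary)
print(export_text(tree, feature_names=["Years", "Hits"]))
```

Each leaf's printed value is the mean response of the training observations falling into that region, which is exactly the number shown in each leaf of the lecture's tree diagram.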
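The ensemble idea raised in the pros-and-cons discussion (grow many trees, then combine them into a single consensus prediction) can be sketched as follows. The data, sample sizes, and hyperparameters here are arbitrary choices for illustration, not values from the lecture; bagging, random forests, and boosting are each fit with their scikit-learn implementations and compared against a single unpruned tree.

```python
# Sketch: many trees averaged (bagging / random forest) or fit sequentially
# (boosting) versus one deep tree. Data is synthetic and illustrative.
import numpy as np
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(600, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# BaggingRegressor's default base learner is a decision tree.
bagged = BaggingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
boost = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

for name, model in [("single tree", single), ("bagging", bagged),
                    ("random forest", forest), ("boosting", boost)]:
    print(f"{name:>13}: test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

On noisy data like this, the single deep tree overfits, and averaging many trees typically cuts the test error substantially, at the cost of losing the single interpretable tree, just as the slide notes.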
