Full Transcript

Decision Trees
• Decision Tree
• How to grow a tree!
• Pruning
• Ensemble Learning
• Bagging
• Random Forest
• Boosting

The idea: a flowchart-like structure
• A supervised learning algorithm.
• The idea of a decision tree is to split the data set into smaller data sets based on the features, until you reach a set small enough that its data points all fall under one label - i.e. segmenting the predictor space into a number of simple regions, based on the most significant splitter at each step.
• Each internal (parent) node corresponds to a feature of the data set being tested, and the leaf (child) nodes represent the outcomes. The decision on which feature to split on is made based on the resulting entropy reduction, or information gain, from the split.
• Continuous features are turned into categorical variables (i.e. less than or greater than a certain value) before a split at a node. Because there could be infinitely many boundaries for a continuous variable, the choice is made depending on which boundary results in the most information gain (see the sketch after this section).
• Terminology: root, split, sub-tree, decision node, terminal node, parent - child.
• The goal of the decision tree is to produce sets that minimize impurity. Shannon's entropy is a computational measure of the impurity of the elements in a set.
• Pruning is a method of limiting tree depth to reduce overfitting in decision trees.
• Creating ensembles involves aggregating the results of different models.
• Bagging involves creating multiple decision trees, each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the data in each sample is left out of the corresponding tree.
• A bag of decision trees that uses subspace sampling is referred to as a random forest.
• Boosting involves aggregating a collection of weak learners (regression trees) to form a strong predictor. A boosted model is built over time by adding a new tree that reduces the error made by the previous learners. This is done by fitting the new tree on the residuals of the previous trees.

SUPERVISED LEARNING - CLASSIFICATION

Decision Tree
How it works: A decision tree lets you predict responses to data by following the decisions in the tree from the root (beginning) down to a leaf node. A tree consists of branching conditions where the value of a predictor is compared to a trained weight. The number of branches and the values of the weights are determined in the training process. Additional modification, or pruning, may be used to simplify the model.
Best used:
• When you need an algorithm that is easy to interpret and fast to fit, when you want to minimize memory usage, and when high predictive accuracy is not a requirement.

Bagged and Boosted Decision Trees
How it works: In these ensemble methods, several "weaker" decision trees are combined into a "stronger" ensemble. A bagged decision tree consists of trees that are trained independently on data that is bootstrapped from the input data. Boosting involves creating a strong learner by iteratively adding "weak" learners and adjusting the weight of each weak learner to focus on misclassified examples.
Best used:
• When predictors are categorical (discrete) or behave nonlinearly.
• When the time taken to train a model is less of a concern.
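To make the boundary-selection idea concrete, here is a minimal sketch (not from the slides; the feature values, labels and function names are illustrative) that scans candidate thresholds for a continuous feature and keeps the one with the highest information gain:

# A minimal sketch of the threshold search described above: for a continuous
# feature, evaluate candidate boundaries and keep the one with the highest
# information gain. Feature values and labels here are made up.
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    # Try the midpoints between consecutive sorted feature values as boundaries
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent = entropy(y)
    best_gain, best_t = 0.0, None
    for t in (x[:-1] + x[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child          # information gain of this boundary
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical continuous feature
y = np.array([0, 0, 0, 1, 1, 1])               # hypothetical labels
print(best_threshold(x, y))                    # boundary at 3.5, gain = 1.0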
How trees are formed?
Decision trees are formed by a collection of rules based on variables in the modeling data set:
• Rules based on variables' values are selected to get the best split, i.e. to differentiate observations based on the dependent variable.
• Once a rule is selected and splits a node into two, the same process is applied to each "child" node (i.e. it is a recursive procedure).
• Splitting stops when the tree detects that no further gain can be made, or some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and the tree is later pruned.)
Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.

Classification and regression trees (CART) are a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively.
• Decision trees can be applied to both regression and classification problems.
• Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
• Classification tree analysis is when the predicted outcome is the class to which the data belongs.
• Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
• A loss function is minimized to fit a decision tree; the algorithm usually looks for the best variable and the best splitting value among all possibilities. The loss function can be defined as the impurity of the child nodes, measured by the Gini index or entropy.
Criteria can be used to ensure the tree is interpretable and to prevent overfitting:
• Max depth: setting a maximum depth for the tree.
• Node size: requiring at least N observations in each node.
One can also build a large tree with many branches and then prune it by combining the subtrees with the lowest trade-off in goodness of fit.

The explanatory variables are a set of technical indicators; each indicator represents a well-documented market behavior. In order to reduce the noise in the data and to try to identify robust relationships, each independent variable is given a binary outcome (a pandas sketch of these indicators follows this list).
• Volatility (VAR1): High volatility is usually associated with a down market and low volatility with an up market. Volatility is defined as the 20-day raw ATR (Average True Range) spread to its moving average (MA). If raw ATR > MA then VAR1 = 1, else VAR1 = -1.
• Short-term momentum (VAR2): The equity market exhibits short-term momentum behavior, captured here by a 5-day simple moving average (SMA). If Price > SMA then VAR2 = 1, else VAR2 = -1.
• Long-term momentum (VAR3): The equity market exhibits long-term momentum behavior, captured here by a 50-day simple moving average (LMA). If Price > LMA then VAR3 = 1, else VAR3 = -1.
• Short-term reversal (VAR4): This is captured by the CRTDR, which stands for Close Relative To Daily Range. If CRTDR > 0.5 then VAR4 = 1, else VAR4 = -1.
• Autocorrelation regime (VAR5): The equity market tends to go through periods of negative and positive autocorrelation. If the autocorrelation of returns over the last 5 days > 0 then VAR5 = 1, else VAR5 = -1.
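The following is a rough pandas sketch of these binary indicators, not from the slides. It assumes a DataFrame df with daily 'High', 'Low' and 'Close' columns; CRTDR is taken as (Close - Low) / (High - Low), the MA window for the ATR comparison is set to 20 as a placeholder (the slide does not specify it), and the autocorrelation is the lag-1 autocorrelation of returns over a rolling 5-day window.

# Sketch only: column names, window lengths and the CRTDR formula are assumptions.
import numpy as np
import pandas as pd

def binary_indicators(df):
    close, high, low = df['Close'], df['High'], df['Low']
    prev_close = close.shift(1)

    # VAR1: 20-day ATR versus its moving average
    true_range = pd.concat([high - low,
                            (high - prev_close).abs(),
                            (low - prev_close).abs()], axis=1).max(axis=1)
    atr = true_range.rolling(20).mean()
    var1 = np.where(atr > atr.rolling(20).mean(), 1, -1)

    # VAR2 / VAR3: price versus 5-day and 50-day simple moving averages
    var2 = np.where(close > close.rolling(5).mean(), 1, -1)
    var3 = np.where(close > close.rolling(50).mean(), 1, -1)

    # VAR4: close relative to the daily range
    crtdr = (close - low) / (high - low)
    var4 = np.where(crtdr > 0.5, 1, -1)

    # VAR5: sign of the autocorrelation of returns over the last 5 days
    returns = close.pct_change()
    autocorr = returns.rolling(5).apply(lambda r: pd.Series(r).autocorr(), raw=False)
    var5 = np.where(autocorr > 0, 1, -1)

    return pd.DataFrame({'VAR1': var1, 'VAR2': var2, 'VAR3': var3,
                         'VAR4': var4, 'VAR5': var5}, index=df.index)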
In the tree above, the path to reach node #4 is: VAR3 >= 0 (Long-term Momentum >= 0) and VAR4 >= 0 (CRTDR >= 0). The red rectangle indicates this is a DOWN leaf (i.e. a terminal node) with a probability of 58% (1 - 0.42). In market terms this means that if Long-term Momentum is up and CRTDR is > 0.5, then the probability of a positive return next week is 42%, based on the in-sample data. The 18% indicates the proportion of the data set that falls into that terminal node (leaf).

What is a random forest?
• A forest of random trees!
• A random forest is like bootstrapping with decision trees, with randomly selected features.
What is bootstrapping?
• E.g. if we have 1,000 observations with 10 variables, a random forest builds multiple trees, each with a different sample and a different initial set of variables - say, a random sample of 100 observations and 5 randomly chosen initial variables per tree. Repeat the process (say) 10 times and then make a final prediction for each observation. This final prediction can simply be the mean of the individual predictions.

A toy example
• 30 stocks with three variables: PB, Vol and Momentum. 15 out of these 30 go up. I want to create a model to predict which stocks will go up. In this problem, we need to segregate the stocks that go up based on the most significant of the three input variables. For the sake of simplicity we will assume each stock falls into one of two categories, high or low, for each variable.
• Let's see how the Gini index, chi-square, information gain and reduction in variance work in deciding which feature gives the best split (the calculations are reproduced in the sketch after this section).

Gini index
• Performs binary splits.
• The higher the value of Gini, the higher the homogeneity.
• Calculate Gini for each sub-node as the sum of squares of the probabilities of success and failure (p^2 + q^2).
• Calculate Gini for the split as the weighted Gini score of the nodes of that split.
• Split by PB: say 10 stocks (2 go up, 20%) and 20 stocks (13 go up, 65%).
• Split by Vol: say 14 stocks (6 go up, 43%) and 16 stocks (9 go up, 56%).
• Gini for the low-PB sub-node = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
• Gini for the high-PB sub-node = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
• Weighted Gini for the split on PB = (10/30)*0.68 + (20/30)*0.55 = 0.59
• For the split by Vol it comes to 0.51, so the algorithm chooses PB (the more homogeneous split).

Chi-square
• An algorithm to find the statistical significance of the differences between the sub-nodes and the parent node.
• The higher the value of chi-square, the higher the statistical significance of the differences between a sub-node and the parent node.
• Chi-square = ((Actual - Expected)^2 / Expected)^(1/2), summed over the cells of the split.
• Split by PB: say 10 stocks (2 go up, 8 go down) and 20 stocks (13 go up, 7 go down). Expected = 50% in each node, since 15 of the 30 go up.
• Actual - expected = 2 - 5 = -3 and 8 - 5 = 3 for the first node (and 13 - 10 = 3, 7 - 10 = -3 for the second); the total chi-square = 4.58.
• For the split by Vol the chi-square comes to 1.46, so again PB wins.

Information gain
• Which node requires more information to describe? A less impure node requires less information to describe it, and a more impure node requires more information.
• Information theory gives a measure of this degree of disorganization in a system, known as entropy. If the sample is completely homogeneous, the entropy is zero.
• Entropy for the parent node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1. Here 1 shows that it is a maximally impure node.
• Entropy for the low-PB node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72, and for the high-PB node = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93.
• Entropy for the split by PB = weighted entropy of the sub-nodes = (10/30)*0.72 + (20/30)*0.93 = 0.86.
• Entropy for the split by Momentum = 0.99 (try it!).
• The entropy for the split on PB is the lowest, so it gives the largest information gain.
• In the case of variance reduction (for regression), calculate the variance of each node, and the variance of each split as the weighted average of the node variances.
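The following small script (my own, not from the slides) reproduces the toy-example numbers above: the weighted Gini, entropy and chi-square for the split on PB.

# Reproduces the PB-split numbers from the toy example above.
import math

def gini_score(p_up):
    # "Gini" as used above: p^2 + q^2 (higher = more homogeneous)
    return p_up**2 + (1 - p_up)**2

def entropy(p_up):
    # Shannon entropy in bits; 0*log2(0) is treated as 0
    terms = [p for p in (p_up, 1 - p_up) if p > 0]
    return -sum(p * math.log2(p) for p in terms)

# Split on PB: 10 stocks with 2 up, 20 stocks with 13 up (15 of 30 up overall)
nodes = [(10, 2), (20, 13)]
total = sum(n for n, _ in nodes)

weighted_gini = sum(n / total * gini_score(up / n) for n, up in nodes)
weighted_entropy = sum(n / total * entropy(up / n) for n, up in nodes)

# Chi-square: expected counts are 50/50 in each node (15 of 30 go up overall)
chi_square = 0.0
for n, up in nodes:
    expected = n / 2
    for actual in (up, n - up):                  # "up" cell and "down" cell
        chi_square += math.sqrt((actual - expected) ** 2 / expected)

print(round(weighted_gini, 2))     # 0.59
print(round(weighted_entropy, 2))  # 0.86
print(round(chi_square, 2))        # 4.58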
• Sklearn has criterion options such as gini, entropy, etc.

When to stop? Constraints on tree size
• Minimum samples for a node split: the minimum number of observations required in a node for it to be considered for splitting.
• Minimum samples for a terminal node (leaf): the minimum number of observations required in a terminal node or leaf.
• Maximum depth of the tree (vertical depth).
• Maximum number of terminal nodes: since binary trees are created, a depth of n would produce a maximum of 2^n leaves.
• Maximum features to consider for a split: the number of features to consider while searching for the best split.

Pruning
• First grow the decision tree to a large depth. Then start at the bottom and remove the leaves that give negative returns when compared from the top.
• Example from ISL: a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year - a regression tree problem.
• The number in each leaf is the mean of the response for the observations that fall there.

Example from ISL - Regression tree
• The regions R1, R2 and R3 are known as terminal nodes.
• The points along the tree where the predictor space is split are referred to as internal nodes.
• The two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.
• Years is the most important factor in determining Salary.
• Among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary, and players who made more Hits last year tend to have higher salaries.
• We divide the predictor space - that is, the set of possible values for X1, X2, ..., Xp - into J distinct and non-overlapping regions R1, R2, ..., RJ. For every observation that falls into region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.
• Do you see how this works better for non-linear relationships?
• But it is computationally infeasible to consider every possible partition of the feature space into J boxes, so we take a top-down, greedy approach known as recursive binary splitting (a minimal sketch of one such split follows this list).
• It begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down the tree.
• It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
• We first select the predictor Xj and the cut point s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj >= s} leads to the greatest possible reduction in RSS.
• We repeat the process, looking for the best predictor and best cut point in order to split the data further so as to minimize the RSS within each of the resulting regions.
• However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions.
• Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
• We predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.
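Here is a minimal sketch (not from ISL; data and names are illustrative) of a single step of recursive binary splitting: scan every predictor j and cut point s and keep the pair that most reduces the RSS.

# One greedy RSS split, as described in the bullets above.
import numpy as np

def rss(y):
    # Residual sum of squares around the region mean
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_rss_split(X, y):
    best = (None, None, np.inf)                  # (predictor index, cut point, RSS)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            total = rss(left) + rss(right)
            if total < best[2]:
                best = (j, float(s), total)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                             # two illustrative predictors
y = np.where(X[:, 0] < 4.5, 5.0, 6.5) + rng.normal(0, 0.1, 50)   # step in X1 near 4.5
print(best_rss_split(X, y))                                      # expect j = 0, cut point near 4.5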
• The process described above may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test-set performance.
• A smaller tree with fewer splits (that is, fewer regions R1, ..., RJ) might lead to lower variance and better interpretation, at the cost of a little bias.
• One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold.
• This strategy will result in smaller trees, but it is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split - that is, a split that leads to a large reduction in RSS later on.
• A better strategy is to grow a very large tree T0 and then prune it back in order to obtain a subtree.
• Cost complexity pruning, also known as weakest link pruning, is used to do this.
• We consider a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there corresponds a subtree T ⊂ T0 such that

  Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi - ŷRm)^2 + α|T|

  is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, Rm is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and ŷRm is the mean of the training observations in Rm.
• The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data.
• We select an optimal value of α using cross-validation (a scikit-learn sketch of this follows this list).
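A minimal scikit-learn sketch of cost-complexity pruning (my own illustration, with synthetic data): compute the candidate alphas for a large tree, then pick the alpha with the best cross-validated score.

# Sketch only: the data set and cv settings are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Grow a large tree T0 and get the sequence of effective alphas (weakest links)
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Cross-validate a pruned tree for each alpha and keep the best one
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha)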
Ensemble Learning
• A very simple way to create an even better classifier is to aggregate the predictions of several classifiers and predict the class that gets the most votes.
• Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.

Bagging
• Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees. We use the idea that averaging a set of observations reduces variance.
• We can bootstrap by taking repeated samples from the (single) training data set.
• We generate different bootstrapped training data sets, train our method on each bootstrapped training set, and then average all the predictions.
• For classification trees: for each test observation, we record the class predicted by each of the B trees and take a majority vote; the overall prediction is the most commonly occurring class among the B predictions.

Random Forests
• Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees.
• As in bagging, we build a number of decision trees on bootstrapped training samples.
• But when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p - that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors.
• Assume the number of cases in the training set is N. Then a sample of these N cases is taken at random but with replacement. This sample will be the training set for growing the tree.
• If there are M input variables, a number m < M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.
• Each tree is grown to the largest extent possible and there is no pruning.
• Predict new data by aggregating the predictions of the n trees (i.e. majority vote for classification, average for regression).

Boosting
• Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. We only discuss boosting for decision trees.
• Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.
• Notably, each tree is built on a bootstrap data set, independent of the other trees.
• Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.
• Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
• The latter two methods - random forests and boosting - are among the state-of-the-art methods for supervised learning. However, their results can be difficult to interpret.

from sklearn import tree
model = tree.DecisionTreeClassifier(criterion='gini')
# for regression, use: model = tree.DecisionTreeRegressor()
model.fit(X, y)
predicted = model.predict(x_test)
#####
from sklearn.ensemble import RandomForestClassifier
# use RandomForestRegressor for regression problems
model = RandomForestClassifier(n_estimators=500)
model.fit(X, y)
predicted = model.predict(x_test)
#####
from sklearn.ensemble import GradientBoostingClassifier  # for classification
from sklearn.ensemble import GradientBoostingRegressor   # for regression
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
clf.fit(X_train, y_train)
#####
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
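The last snippet above stops at the imports. A minimal completion (my own sketch; the base estimator and parameter values are illustrative, and X, y, x_test follow the placeholder convention of the snippets above) would be:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# n_estimators=500 is illustrative; bootstrap=True (the default) draws
# bootstrap samples of the training data, giving bagged trees.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, bootstrap=True)
model.fit(X, y)
predicted = model.predict(x_test)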
