Tree-Based Methods Lecture Notes PDF
Summary
These lecture notes cover tree-based methods in machine learning, focusing on decision trees, including algorithms like ID3 and CART, and their associated concepts.
TREE-BASED METHODS

Decision Tree Algorithm
Decision trees can be used for solving both regression and classification problems. The goal of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (the training data).
In decision trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record's attribute; on the basis of this comparison, we follow the branch corresponding to that value and jump to the next node.

Important Terminology Related to Decision Trees
1. Root Node: Represents the entire population or sample; it further gets divided into two or more homogeneous sets.
2. Splitting: The process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
5. Pruning: Removing sub-nodes of a decision node; you can think of it as the opposite of splitting.
6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of that parent node.

How do Decision Trees work?
The choice of strategic splits heavily affects a tree's accuracy, and the decision criteria are different for classification and regression trees. Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
The choice of algorithm also depends on the type of target variable. Here are some algorithms used in decision trees:
ID3 → Iterative Dichotomiser 3
C4.5 → successor of ID3
CART → Classification And Regression Tree
CHAID → Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees
MARS → Multivariate Adaptive Regression Splines

The ID3 algorithm builds decision trees using a top-down, greedy search through the space of possible branches, with no backtracking. A greedy algorithm, as the name suggests, always makes the choice that seems best at that moment. ID3 begins with the original set S as the root node. On each iteration, the algorithm goes through every unused attribute of the set S and calculates the entropy H and the information gain IG of that attribute. It then selects the attribute which has the smallest entropy (equivalently, the largest information gain). The set S is then split by the selected attribute to produce subsets of the data, and the algorithm continues to recurse on each subset, considering only attributes never selected before.

Steps in ID3 Algorithm
1. Begin with the original set S as the root node.
2. On each iteration, calculate the entropy and information gain of every unused attribute of S.
3. Select the attribute with the largest information gain (smallest entropy).
4. Split S by the selected attribute to produce subsets of the data.
5. Recurse on each subset, considering only attributes never selected before.

Attribute Selection Measures
If the dataset consists of N attributes, then deciding which attribute to place at the root or at different levels of the tree as internal nodes is a complicated step. Just randomly selecting any node to be the root cannot solve the issue; if we follow a random approach, it may give us bad results with low accuracy.
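As a quick illustration of these ideas, here is a minimal scikit-learn sketch (not part of the original notes; the Iris dataset and all parameter values are illustrative assumptions) that fits a small decision tree and prints its learned decision rules, so you can see how a prediction follows attribute comparisons from the root down to a leaf.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small, well-known dataset (illustrative choice).
data = load_iris()
X, y = data.data, data.target

# criterion="entropy" picks splits by information gain, in the spirit of ID3/C4.5;
# the default criterion="gini" mirrors CART. max_depth=3 keeps the tree readable.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Each printed rule is a comparison against one attribute; following the branches
# from the root down to a leaf yields the predicted class for a record.
print(export_text(tree, feature_names=list(data.feature_names)))
print(tree.predict(X[:2]))  # predicted class labels for the first two records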
For solving this attribute selection problem, researchers devised several solutions. They suggested using criteria such as:
Entropy
Information Gain
Gini Index
Gain Ratio
Reduction in Variance
Chi-Square
These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree by following that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root. While using information gain as a criterion, we assume attributes to be categorical, and for the Gini index, attributes are assumed to be continuous.

Entropy
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an example of an action that provides information that is random. For a set S with class proportions p_i, entropy is
\mathrm{Entropy}(S) = -\sum_i p_i \log_2 p_i

Information Gain
Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding the attribute that returns the highest information gain and the smallest entropy. Information gain is a decrease in entropy: it computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset, based on the given attribute values. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain. Mathematically, IG is represented as
\mathrm{IG} = \mathrm{Entropy}(\text{before}) - \sum_{j=1}^{K} w_j \,\mathrm{Entropy}(j, \text{after})
where "before" is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and w_j is the fraction of the records that fall into subset j.

Let us calculate the information for the event of tossing a coin. It has two possible values, head (p1) or tail (p2). In the case of an unbiased coin, the probability of head and of tail is 0.5 each. Thus the information is
\mathrm{entropy}(p) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 0.5 + 0.5 = 1 \quad [\text{since } \log_2(0.5) = -1]
But if the coin is biased and has heads on both sides, then the probability of head is 1 while the probability of tail is 0. Thus the total information in tossing this coin is
\mathrm{entropy}(p) = -1 \cdot \log_2(1) - 0 \cdot \log_2(0) = 0 \quad [\text{since } \log_2(1) = 0]
You can clearly observe that tossing a biased coin carries no information, while tossing an unbiased coin carries information of 1 bit.

Gini Index
You can understand the Gini index as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared probabilities of each class from one:
\mathrm{Gini} = 1 - \sum_i p_i^2
It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values. The Gini index works with a categorical target variable such as "Success" or "Failure", and it performs only binary splits. A higher value of the Gini index implies higher inequality and higher heterogeneity. CART (Classification and Regression Tree) uses the Gini index method to create split points.
[Figure: Gini Index representing perfect equality]

Gain Ratio
Information gain is biased towards choosing attributes with a large number of values as root nodes; that is, it prefers attributes with many distinct values. C4.5, an improvement of ID3, uses the gain ratio, a modification of information gain that reduces this bias and is usually the best option. Gain ratio overcomes the problem by taking into account the number of branches that would result before making the split; it corrects information gain by taking the intrinsic information of a split into account.
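The attribute selection measures above can be made concrete with a short sketch. The following Python functions (the toy data and the helper names entropy, gini, information_gain, and gain_ratio are my own illustrative choices, not from the notes) compute each measure for a single candidate split.

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i) over the class proportions p_i.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(S) = 1 - sum_i p_i^2.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # IG = H(parent) - weighted average entropy of the subsets after the split.
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

def gain_ratio(parent_labels, subsets):
    # Gain ratio = IG / split information; penalizes many-valued attributes (C4.5).
    n = len(parent_labels)
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)
    ig = information_gain(parent_labels, subsets)
    return ig / split_info if split_info > 0 else 0.0

# Toy example: 14 records split by a hypothetical attribute into two subsets.
parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print("entropy:", round(entropy(parent), 3), " gini:", round(gini(parent), 3))
print("IG:", round(information_gain(parent, split), 3),
      " gain ratio:", round(gain_ratio(parent, split), 3))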
Reduction in Variance
Reduction in variance is an algorithm used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split; the split with the lower variance is selected as the criterion to split the population:
\mathrm{Variance} = \frac{\sum (X - \bar{X})^2}{n}

CHAID
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods. It finds the statistical significance of the differences between sub-nodes and the parent node, measured by the Chi-square statistic: the sum of squares of the standardized differences between observed and expected frequencies of the target variable. CHAID works with a categorical target variable such as "Success" or "Failure", and it can perform two or more splits. The higher the value of Chi-square, the higher the statistical significance of the differences between a sub-node and the parent node. The resulting tree is called a CHAID tree.

Pros & Cons
Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy. Hence, we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction. Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretability.

How to avoid/counter OVERFITTING in Decision Trees?
Two common remedies are pruning decision trees and random forests.

Pruning Decision Trees
The splitting process results in fully grown trees until the stopping criteria are reached, but a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. In pruning, you trim off branches of the tree, i.e., remove decision nodes starting from the leaf nodes, such that the overall accuracy is not disturbed. This is done by segregating the actual training set into two sets: a training data set D and a validation data set V. Prepare the decision tree using the segregated training data set D, then continue trimming the tree so as to optimize the accuracy on the validation data set V.
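The pruning procedure just described can be sketched with scikit-learn. Note that scikit-learn exposes cost-complexity pruning (the ccp_alpha parameter) rather than node-by-node reduced-error pruning, so this minimal sketch chooses the pruning strength on a held-out validation set V after growing the tree on the training set D; the dataset and parameters are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Segregate the data into a training set D and a validation set V.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths for a tree grown on the training set D.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = tree.score(X_val, y_val)  # accuracy on the validation set V
    if acc >= best_acc:             # on ties, prefer the more heavily pruned tree
        best_alpha, best_acc = alpha, acc

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("chosen alpha:", best_alpha, " leaves:", pruned.get_n_leaves(),
      " validation accuracy:", round(best_acc, 3))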
ENSEMBLE CLASSIFIERS
Ensemble learning is a way of generating various base classifiers from which a new classifier is derived that performs better than any constituent classifier. These base classifiers may differ in the algorithm used, the hyperparameters, the representation, or the training set. The key objective of ensemble methods is to reduce bias and variance.
[Figure: basic outline of ensemble techniques]

Types of Ensemble Learning

Bagging
Bagging, also known as bootstrap aggregation, is an ensemble learning method that is commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is selected with replacement, meaning that the individual data points can be chosen more than once. After several data samples are generated, these weak models are trained independently, and, depending on the type of task (regression or classification, for example), the average or majority of those predictions yields a more accurate estimate. As a note, the random forest algorithm is considered an extension of the bagging method, using both bagging and feature randomness to create an uncorrelated forest of decision trees.

How Bagging Works
In 1996, Leo Breiman introduced the bagging algorithm, which has three basic steps:
1. Bootstrapping: Bagging leverages a bootstrapping sampling technique to create diverse samples. This resampling method generates different subsets of the training dataset by selecting data points at random and with replacement. This means that each time you select a data point from the training dataset, you are able to select the same instance multiple times; as a result, a value/instance may be repeated twice (or more) in a sample.
2. Parallel training: These bootstrap samples are then trained independently and in parallel with each other using weak or base learners.
3. Aggregation: Finally, depending on the task (regression or classification), an average or a majority of the predictions is taken to compute a more accurate estimate. In the case of regression, an average is taken of all the outputs predicted by the individual models; this is known as soft voting. For classification problems, the class with the highest majority of votes is accepted; this is known as hard voting or majority voting.

Algorithm for the Bagging classifier: see the code sketch at the end of this section.

Benefits of Bagging
Ease of implementation: Python libraries such as scikit-learn (also known as sklearn) make it easy to combine the predictions of base learners or estimators to improve model performance. Their documentation lays out the available modules that you can leverage in your model optimization.
Reduction of variance: Bagging can reduce the variance within a learning algorithm. This is particularly helpful with high-dimensional data, where missing values can lead to higher variance, making the model more prone to overfitting and preventing accurate generalization to new datasets.

Challenges of Bagging

Applications of Bagging
Healthcare: Bagging has been used to form medical data predictions. For example, research shows that ensemble methods have been used for an array of bioinformatics problems, such as gene and/or protein selection to identify a specific trait of interest. More specifically, such research has examined its use to predict the onset of diabetes based on various risk predictors.
IT: Bagging can also improve the precision and accuracy of IT systems, such as network intrusion detection systems. Research has looked at how bagging can improve the accuracy of network intrusion detection and reduce the rates of false positives.
Environment: Ensemble methods, such as bagging, have been applied within the field of remote sensing; for example, research shows how they have been used to map the types of wetlands within a coastal landscape.
Finance: Bagging has also been leveraged with deep learning models in the finance industry, automating critical tasks including fraud detection, credit risk evaluation, and option pricing. Research demonstrates how bagging, among other machine learning techniques, has been leveraged to assess loan default risk, and how bagging helps to minimize risk by preventing credit card fraud within banking and financial institutions.
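Referring back to the "Algorithm for the Bagging classifier" above, here is a minimal scikit-learn sketch (the dataset and parameter values are illustrative assumptions, not from the notes) of bagging with decision trees, compared against a single tree and against a random forest, which adds feature randomness on top of bagging.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: 100 decision trees (scikit-learn's default base learner), each fit on a
# bootstrap sample drawn with replacement; their predictions are combined across
# the ensemble.
bagging = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)

# Random forest = bagging + feature randomness: only a random subset of features
# is considered at each split, which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagged trees": bagging,
    "random forest": forest,
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()  # 5-fold cross-validated accuracy
    print(f"{name}: {score:.3f}")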
Random Forest
The random forest algorithm is an extension of the bagging method, as it utilizes both bagging and feature randomness to create an uncorrelated forest of decision trees. Feature randomness, also known as feature bagging or "the random subspace method," generates a random subset of features, which ensures low correlation among the decision trees. This is a key difference between decision trees and random forests: while decision trees consider all the possible feature splits, random forests only select a subset of those features.
Random forest algorithms have three main hyperparameters which need to be set before training: node size, the number of trees, and the number of features sampled. From there, the random forest classifier can be used to solve regression or classification problems.

Classification in Random Forest
[Figure: classification in a random forest]

Disadvantages of Random Forest
There is a difficult tradeoff between the training time (and space) and the number of trees. Increasing the number of trees can improve the accuracy of prediction, but random forests often require more time and space to train the model as a larger number of trees is involved. Assume that the maximum number of weak learners (decision trees) in the random forest is n. Generally speaking, if n is too small, underfitting is likely to occur; if n is too large, the amount of computation increases, and once n reaches a certain size the model improves very little with increasing n. A moderate value should therefore be found, and the tree number n and the learning efficiency should be considered together; the tradeoff between them is a difficult problem to solve.
According to one example, when selecting stocks from the CSI 300 Index components, random forest may lose effectiveness. According to the importance scores of the factors, market value and reversal factors greatly affect the random forest, which can generally achieve good prediction results under the premise that the environment does not change a lot, but it is sensitive to parameters, noise, environmental changes, and other factors. Therefore, in this example, random forest for stock selection performed well from 2011 to 2016 but had poor performance since 2017.

Boosting
Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. In boosting, a random sample of data is selected, fitted with a model, and then trained sequentially; that is, each model tries to compensate for the weaknesses of its predecessor. With each iteration, the weak rules from each individual classifier are combined to form one strong prediction rule.

Types of Boosting
Adaptive boosting (AdaBoost): Yoav Freund and Robert Schapire are credited with the creation of the AdaBoost algorithm. This method operates iteratively, identifying misclassified data points and adjusting their weights to minimize the training error. The model continues to optimize in a sequential fashion until it yields the strongest predictor.
Gradient boosting: Building on the work of Leo Breiman, Jerome H. Friedman developed gradient boosting, which works by sequentially adding predictors to an ensemble, each one correcting for the errors of its predecessor. However, instead of changing the weights of data points as AdaBoost does, gradient boosting trains on the residual errors of the previous predictor. The name gradient boosting is used since it combines the gradient descent algorithm and the boosting method.
Extreme gradient boosting (XGBoost): XGBoost is an implementation of gradient boosting that is designed for computational speed and scale. XGBoost leverages multiple cores on the CPU, allowing learning to occur in parallel during training.
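A minimal scikit-learn sketch of the two boosting flavours just described (the dataset and hyperparameters are illustrative assumptions, not values from the notes): AdaBoost re-weights misclassified points, while gradient boosting fits each new tree to the residual errors of the ensemble built so far.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sequential ensemble of shallow trees; each round up-weights the points that the
# previous rounds misclassified.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)

# Sequential ensemble in which each new tree is fit to the pseudo-residuals (the
# gradient of the loss) of the ensemble built so far.
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                                 random_state=0)

for name, model in [("AdaBoost", ada), ("Gradient boosting", gbt)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # held-out accuracy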
Other Types of Boosting Techniques
Stochastic Gradient Boosting: A variant of gradient boosting that introduces randomness by subsampling the data before each iteration, helping prevent overfitting.
LightGBM (Light Gradient Boosting Machine): A variation of gradient boosting designed for high efficiency and scalability, especially with large datasets. It uses a leaf-wise tree growth strategy for better accuracy.
CatBoost: A boosting method that is particularly good with categorical data and avoids the need for extensive pre-processing, such as one-hot encoding.
HistGradientBoosting: A variation of gradient boosting that uses histogram-based techniques for splitting data, making it faster and more memory-efficient; implemented in libraries like scikit-learn.
(A brief scikit-learn sketch of two of these variants appears at the end of this section.)

Benefits of Boosting
Ease of implementation: Boosting can be used with several hyperparameter tuning options to improve fitting. No data preprocessing is required, and some boosting algorithms have built-in routines to handle missing data. In Python, the scikit-learn ensemble module (sklearn.ensemble) makes it easy to implement popular boosting methods such as AdaBoost and gradient boosting; XGBoost is available as a separate library.
Reduction of bias: Boosting algorithms combine multiple weak learners in a sequential method, iteratively improving upon observations. This approach can help to reduce the high bias commonly seen in shallow decision trees and logistic regression models.
Computational efficiency: Since boosting algorithms select only the features that increase their predictive power during training, they can help to reduce dimensionality as well as increase computational efficiency.

Challenges of Boosting
Overfitting: There is some dispute in the research about whether boosting helps reduce overfitting or exacerbates it. We include it under challenges because, in the instances where overfitting does occur, predictions cannot be generalized to new datasets.
Intense computation: Sequential training in boosting is hard to scale up. Since each estimator is built on its predecessors, boosting models can be computationally expensive, although XGBoost seeks to address the scalability issues seen in other types of boosting methods. Boosting algorithms can also be slower to train than bagging, as a large number of parameters can influence the behavior of the model.

Applications of Boosting
Healthcare: Boosting is used to lower errors in medical data predictions, such as predicting cardiovascular risk factors and cancer patient survival rates. For example, research shows that ensemble methods significantly improve the accuracy of identifying patients who could benefit from preventive treatment of cardiovascular disease while avoiding unnecessary treatment of others. Likewise, another study found that applying boosting to multiple genomics platforms can improve the prediction of cancer survival time.
IT: Gradient boosted regression trees are used in search engines for page rankings, while the Viola-Jones boosting algorithm is used for image retrieval. As noted by Cornell, boosted classifiers allow the computations to be stopped sooner when it is clear which way a prediction is headed. This means that a search engine can stop the evaluation of lower-ranked pages, while image scanners will only consider images that actually contain the desired object.
Finance: Boosting is used with deep learning models to automate critical tasks, including fraud detection, pricing analysis, and more. For example, boosting methods in credit card fraud detection and financial product pricing analysis improve the accuracy of analyzing massive data sets to minimize financial losses.
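As promised above, here is a brief scikit-learn sketch of two of the boosting variants listed earlier: stochastic gradient boosting (row subsampling before each iteration) and histogram-based gradient boosting. LightGBM, XGBoost, and CatBoost live in separate libraries and are not shown; the dataset and parameter values here are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# subsample < 1.0 turns ordinary gradient boosting into stochastic gradient boosting:
# each tree sees a random 50% of the rows, which acts as a regularizer.
stochastic_gbt = GradientBoostingClassifier(subsample=0.5, n_estimators=200, random_state=0)

# Histogram-based boosting bins the features, making training faster and more
# memory-efficient on large datasets; it also handles missing values natively.
hist_gbt = HistGradientBoostingClassifier(max_iter=200, random_state=0)

for name, model in [("stochastic GBT", stochastic_gbt), ("hist GBT", hist_gbt)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))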
Bagging vs Boosting
Bagging and boosting are the two main types of ensemble learning methods. As highlighted in one study, the main difference between these learning methods is the way in which they are trained. In bagging, weak learners are trained in parallel, but in boosting, they learn sequentially. This means that a series of models is constructed and, with each new model iteration, the weights of the data misclassified by the previous model are increased. This redistribution of weights helps the algorithm identify the parameters it needs to focus on to improve its performance. AdaBoost, which stands for "adaptive boosting," is one of the most popular boosting algorithms, as it was one of the first of its kind. Other types of boosting algorithms include XGBoost, GradientBoost, and BrownBoost.
Another way in which bagging and boosting differ is the scenarios in which they are used. For example, bagging methods are typically used on weak learners which exhibit high variance and low bias, whereas boosting methods are leveraged when low variance and high bias are observed.

Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning. Various weak learners are ensembled in parallel in such a way that, by combining them with a meta-learner, we can produce better predictions for the future. This ensemble technique works by feeding the combined predictions of multiple weak learners to a meta-learner so that a better output prediction model can be achieved. In stacking, an algorithm takes the outputs of the sub-models as input and attempts to learn how to best combine these input predictions to make a better output prediction.
Stacking is also known as stacked generalization and is an extended form of the model averaging ensemble technique, in which the sub-models participate according to their performance weights to build a new model with better predictions. This new model is stacked on top of the others, which is the reason it is named stacking.

Architecture of Stacking
The architecture of the stacking model consists of two or more base (learner) models and a meta-model that combines the predictions of the base models. The base models are called level-0 models, and the meta-model is known as the level-1 model. The stacking ensemble method therefore includes the original (training) data, the primary-level models, the primary-level predictions, the secondary-level model, and the final prediction. The basic architecture of stacking is shown in the image below.
[Figure: architecture of stacking]
Original data: This data is divided into n folds and is also considered as the test data or training data.
Base models: These models are also referred to as level-0 models. They use the training data and provide compiled predictions (level-0) as output.
Level-0 predictions: Each base model is triggered on some training data and provides different predictions, which are known as level-0 predictions.
Meta-model: The stacking architecture contains one meta-model, which helps to best combine the predictions of the base models. The meta-model is also known as the level-1 model.
Level-1 prediction: The meta-model learns how to best combine the predictions of the base models and is trained on the different predictions made by the individual base models; that is, data not used to train the base models is fed to the meta-model, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model.
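A minimal scikit-learn sketch of the stacking architecture just described, with level-0 base models and a level-1 meta-model; the choice of base learners, dataset, and parameters are illustrative assumptions, not taken from the notes.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Level-0 base models: their out-of-fold predictions (generated internally with
# cv=5) become the training input of the meta-model.
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Level-1 meta-model: a logistic regression stacked on top of the base predictions.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)

print("stacked model accuracy:", round(cross_val_score(stack, X, y, cv=5).mean(), 3))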
Summary of Stacking Ensemble
Stacking is an ensemble method that enables a model to learn how to combine the predictions of learner models with a meta-model and to prepare a final model with accurate predictions. The main benefit of a stacking ensemble is that it can harness the capabilities of a range of well-performing models to solve classification and regression problems; further, it helps to prepare a better model with better predictions than any of the individual models.

Example Case Studies
Stacking: A single training dataset is given to multiple models and trained. The training set is further divided using k-fold validation, and the resultant model is formed.
Ensemble bagging classifiers (bootstrap aggregation): In this method, n samplings of training data are generated by picking various data items from the training data with replacement.
Boosting: Boosting is a self-learning technique. It learns by assigning a weight to various items in the data.

TO SUMMARIZE…

Uncertainty in Supervised Learning
The accuracy a model exhibits during testing does not necessarily translate to its performance on samples classified after deployment. There are a number of reasons why this is the case.
First, it is assumed that the distribution of the data used to train the model approximates the actual distribution of the domain of interest. However, the training data may be too sparse, or unevenly focused on sub-regions of the sample space compared to the actual domain, to reasonably reflect its true distribution.
It is also possible that the characteristics of the domain have drifted after the data was collected. For example, when classifying websites that spread malicious content, hackers will rapidly change the URLs they use to deliver their attacks in order to evade detection. Some time after the training data is collected, its validity will have lessened; when presented with a current URL collected from the internet, the model's output may or may not still be valid for the point in feature space that the URL occupies.
Ideally, training data should always be collected or re-sampled as recently as possible. This, however, can be too time-consuming or expensive to do in practice, or, in some situations, it may be impossible. An evaluation of uncertainty may also provide the means to determine when it is necessary to re-collect data.

Difference between Error and Uncertainty
Error describes differences between predictions and actual observations given a fixed model. Uncertainty arises from many sources, including the data, model selection, preprocessing and regularization, the model parameterization algorithm, inference algorithms, and the decision context itself. The sources of uncertainty give rise to a variety of possible data models, each with its own errors.

No Free Lunch Theorem
Given that there is no single best machine learning algorithm across all possible prediction problems, we need to continue to develop new learning algorithms and to better understand the algorithms that have already been developed.
“As a consequence of the no free lunch theorem, we need to develop many different types of models, to cover the wide variety of data that occurs in the real world. And for each model, there may be many different algorithms we can use to train the model, which make different speed-accuracy-complexity tradeoffs.” — Pages 24-25, Machine Learning: A Probabilistic Perspective, 2012.
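To make the no free lunch point concrete, here is a small sketch (the datasets, models, and parameters are illustrative assumptions) that cross-validates several of the models discussed in these notes on two different datasets; comparing the rankings across datasets is a simple way to check, rather than assume, which model suits a given problem.

from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Evaluate every model on every dataset with 5-fold cross-validation.
for dataset_name, loader in [("breast_cancer", load_breast_cancer), ("wine", load_wine)]:
    X, y = loader(return_X_y=True)
    scores = {name: round(cross_val_score(m, X, y, cv=5).mean(), 3) for name, m in models.items()}
    print(dataset_name, scores)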