Questions and Answers
What is the entropy value when all examples in a set belong to the same class?
Entropy increases when examples are evenly distributed across multiple classes.
True
What formula represents information gain?
I(S) = H(S) - H_after
The expected number of bits needed to transmit information to identify a class in S is referred to as ___.
What happens to information gain if two child sets have the same probability distribution?
Match the terms with their correct definitions:
What is the entropy value when there are two equally likely classes?
If there are n different classes, the entropy is calculated using the formula ___ = log2(n).
What determines the split for the left subtree in a decision tree?
The cost function J(S) is calculated by simply summing the number of misclassified examples.
What is used to measure the surprise of a class variable Y in decision trees?
The entropy of a set S is calculated using the formula H(S) = -Σ_C p_C log2(p_C), where p_C is the _____ in S that are in class C.
Match the following components of decision tree learning with their descriptions:
Which aspect of splits does a good cost function improve upon compared to a poor cost function?
Weighted averages can sometimes prefer a split that is not optimal for decision trees.
What is the outcome when no further splits can be made in a decision tree?
What method involves combining multiple learning algorithms to improve model performance?
What is a disadvantage of decision trees?
Bagging is primarily used to increase the bias of a model.
Ensemble learning can help reduce the high variance found in decision trees.
What is the main idea behind using ensemble methods in decision trees?
What is the main goal of ensemble methods?
In bagging, the final prediction for classification is determined by a __________ vote.
The Netflix Prize aimed to develop the best __________ algorithm for predicting user ratings.
Match the following ensemble methods to their descriptions:
Match the following terms with their descriptions:
Which of the following is NOT a characteristic of tree-based methods?
Which of the following is NOT a benefit of using ensemble methods?
Small changes in the dataset have little effect on the final estimated tree.
In bagging, each bootstrapped training set is created by sampling with replacement.
What is the formula for the average prediction in bagging?
What is meant by 'high variance' in decision trees?
What is indicated by the dashed line in the heart data results?
Random forests always lead to a higher error rate than bagging.
How many patients were tissue samples collected from in the gene expression data study?
Random forests are applied with ___ predictors for splitting at each node.
Match the following terms to their definitions:
Which of the following best describes the purpose of using 500 genes with the largest variance in the training set?
Increasing the number of trees in random forests will always decrease the error rate.
What levels were used to classify the patients in the gene expression data set?
What is the primary goal of pruning in decision trees?
Pruning is more reliable than stopping early in decision tree construction.
What does RSS stand for in the context of regression decision trees?
The mean response for the training observations within the j-th box is represented as ŷ_Rj. Therefore, this represents the predicted response for a given test observation using the _____ of the training observations.
Match the following techniques with their characteristics:
In what situation can decision trees outperform linear models?
Decision trees are more complicated to explain than linear regression models.
What is one key reason people might prefer decision trees over regression models?
Study Notes
Introduction to Machine Learning AI 305
- This course covers tree-based methods and ensemble learning.
- Topics include decision tree learning, classification and regression decision trees, decision trees vs. linear models, advantages and disadvantages, ensemble learning, bagging, random forest and boosting.
Contents
- Decision Tree Learning
  - Classification decision trees
  - Regression decision trees
  - Decision trees vs. linear models
  - Advantages and disadvantages
- Ensemble Learning
  - Bagging
  - Random Forest
  - Boosting
Tree-based Methods
- Tree-based methods segment the predictor space into simple regions (rectangles).
- These are also known as decision-tree methods due to the summary of splitting rules within a tree structure.
- These methods are used for both regression and classification problems.
Classification Decision Tree
- Decision trees are trained in a greedy recursive fashion, starting from the root and progressing downwards.
- After each split, the children of the node are constructed by the same procedure.
- The test questions are in the form "Is feature j of this point less than the value v?"
- Splits carve up the feature space into nested rectangular areas.
- Left subtree contains points satisfying xᵢⱼ < v, right subtree contains points satisfying xᵢⱼ ≥ v.
- A base case is reached when no further splits are possible, and a prediction, instead of a split, is assigned to the node.
Classification Decision Tree - Algorithm
- Input: S, a set of sample point indices.
- If all points in S belong to the same class C, return a new leaf node with label C.
- Otherwise:
- Choose the best splitting feature j and splitting value β.
- Create two child nodes from S₁ = {i ∈ S : xᵢⱼ < β} and S₂ = {i ∈ S : xᵢⱼ ≥ β}.
- Recursively call the algorithm on S₁ and S₂.
- Return the new node (j, β, GrowTree(S₁), GrowTree(S₂)).
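As an illustration only, here is a minimal Python sketch of GrowTree. The Node class, the exhaustive best_split search (which implements the split selection described in the next section), and the misclassification-count cost are assumptions chosen for brevity; entropy and Gini impurity, discussed below, are the usual cost choices.

```python
import numpy as np

class Node:
    """Internal node (feature j, threshold beta, two children) or leaf (class label)."""
    def __init__(self, j=None, beta=None, left=None, right=None, label=None):
        self.j, self.beta, self.left, self.right, self.label = j, beta, left, right, label

def cost(y_sub):
    """Misclassification count: examples not in the majority class (a simple stand-in cost)."""
    if len(y_sub) == 0:
        return 0
    _, counts = np.unique(y_sub, return_counts=True)
    return len(y_sub) - counts.max()

def best_split(X, y, S):
    """Try every feature j and threshold beta; keep the split with the smallest total child cost."""
    best_j, best_beta, best_cost = None, None, np.inf
    for j in range(X.shape[1]):
        for beta in np.unique(X[S, j])[1:]:          # thresholds strictly above the minimum value
            S1, S2 = S[X[S, j] < beta], S[X[S, j] >= beta]
            c = cost(y[S1]) + cost(y[S2])
            if c < best_cost:
                best_j, best_beta, best_cost = j, beta, c
    return best_j, best_beta

def grow_tree(X, y, S):
    """GrowTree(S): return a leaf if S is pure, otherwise split and recurse on S1 and S2."""
    classes, counts = np.unique(y[S], return_counts=True)
    if len(classes) == 1:                            # base case: all points share one class
        return Node(label=classes[0])
    j, beta = best_split(X, y, S)
    if j is None:                                    # no split possible: predict the majority class
        return Node(label=classes[np.argmax(counts)])
    S1, S2 = S[X[S, j] < beta], S[X[S, j] >= beta]
    return Node(j, beta, grow_tree(X, y, S1), grow_tree(X, y, S2))

# Usage: tree = grow_tree(X, y, np.arange(len(y))) for a feature matrix X and label vector y.
```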
Decision Tree - How to Choose Best Split
- Try all possible splits (all features and all possible split values within each feature).
- Choose the split that minimizes the cost function of the resulting child nodes, or a weighted average of their costs.
Decision Tree - How to Choose Cost J(S)
- Misclassification rate:
  - Assign to the set S the class that labels the most examples in S (the majority class).
  - J(S) = the number of examples in S not belonging to the assigned class.
- Entropy:
  - A measure based on information theory.
  - Surprise of Y being class C = -log₂(p_C) (non-negative).
  - Entropy of set S = average surprise: H(S) = -Σ_C p_C log₂(p_C), where p_C is the proportion of examples in S that belong to class C.
  - If all examples in S belong to the same class: H(S) = 0.
  - If half the examples are in class C and half in class D: H(S) = 1.
  - If all n classes are equally likely (e.g., every example is in a different class): H(S) = log₂(n).
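A small sketch (the function name and the toy label lists are illustrative, not from the slides) that checks the three special cases just listed:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_C p_C log2(p_C), where p_C is the proportion of examples in class C."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy(["C", "C", "C", "C"]))   # 0.0 -- all examples in the same class
print(entropy(["C", "C", "D", "D"]))   # 1.0 -- two equally likely classes
print(entropy(["a", "b", "c", "d"]))   # 2.0 -- n = 4 equally likely classes, log2(4)
```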
Information Gain
- Choose a split that maximizes information gain.
- I(S) = H(S) - H_after, where H_after is the weighted-average entropy of the child sets after the split.
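Written out explicitly, with S₁ and S₂ the two child sets produced by the split:

```latex
H_{\text{after}} = \frac{|S_1|\,H(S_1) + |S_2|\,H(S_2)}{|S_1| + |S_2|},
\qquad
I(S) = H(S) - H_{\text{after}}
```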
Entropy vs. Misclassification Rate
- Entropy is strictly concave: the chord between any two points on the curve lies strictly below the curve.
- The parent's class distribution is the weighted average of the children's distributions, so H(S) sits on the curve while H_after sits on the chord below it.
- Information is therefore gained (I(S) > 0) whenever the two child distributions are not the same.
- The misclassification rate is not strictly concave, so many useful splits show no apparent improvement; this is its disadvantage.
Gini Impurity
- Measures how often a randomly chosen element from the set would be incorrectly labeled if randomly labeled according to the distribution of labels.
- G(S) = Σ_C p_C (1 - p_C).
- Empirically produces results very similar to entropy, but slightly faster to compute.
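For comparison with the entropy sketch above, a minimal Gini impurity function (illustrative only):

```python
import numpy as np

def gini(labels):
    """G(S) = sum_C p_C (1 - p_C)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * (1.0 - p)).sum())

print(gini(["C", "C", "C", "C"]))   # 0.0 -- pure set
print(gini(["C", "C", "D", "D"]))   # 0.5 -- two equally likely classes
```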
Stopping Criteria
- Trees that are grown too deep overfit, so splitting is stopped before that point.
- Several heuristics, not mutually exclusive, help decide when to stop:
  - Fixed depth
  - Node purity
  - Information gain criteria
- Thresholds for these stopping criteria may need to be adjusted.
- Pruning may be used as an alternative: grow the full tree, then remove splits that do not reduce validation error.
Fixed Tree Depth
- Increasing the maximal depth does not always lead to better performance.
- Using ten-fold cross-validation to find the optimal maximal tree depth.
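One way to run this search with scikit-learn; the synthetic dataset and the depth range 1 to 10 are arbitrary stand-ins, not values from the course:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

# Ten-fold cross-validation over a grid of maximal tree depths.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=10,
)
search.fit(X, y)
print(search.best_params_)   # the maximal depth with the best cross-validated accuracy
```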
Pruning
- Growing a complete tree and then removing splits to improve validation performance.
- More reliable than stopping early: a split that improves little on its own may be followed by splits that improve a great deal, and pruning can recognize and keep such splits, whereas stopping early would never reach them.
Regression Decision Tree
- The goal is to find regions (boxes) that minimize the residual sum of squares (RSS)
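In the usual notation, with R₁, …, R_J the boxes and ŷ_{R_j} the mean response of the training observations falling in the j-th box, the quantity being minimized is:

```latex
\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
```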
Ensemble Learning
- Tree-based methods are fast, simple, good for interpretation, and invariant under scaling/translation.
- Not the best at prediction accuracy, but can have low bias.
- Taking averages of different trees to reduce the variance.
- Generating random subsamples of the training data to build each tree, and averaging the outputs.
- Taking an average of differing learning algorithms, or a learning algorithm on multiple training sets (if enough data are available).
- Netflix Prize: used average results from several algorithms.
Bagging
- Bootstrap Aggregating, or bagging
- General purpose procedure for reducing the variance of a statistical method
- We sample repeatedly, with replacement, from the training set to generate B bootstrapped training sets.
- We train the method on the b-th bootstrapped training set and obtain the prediction f̂*b(x).
- We then average all B predictions to obtain f̂_bag(x) (see the formula below).
- This technique is called bagging
- Bagging is more reliable and typically does much better than individual models.
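With B bootstrapped training sets and f̂*b the model fit to the b-th of them, the bagged prediction referred to above is:

```latex
\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)
```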
Bagging Classification Trees
- For classification trees: we record the class predicted by each of the B trees for each test observation.
- We then take a majority vote to determine the overall prediction.
- The overall prediction is the most commonly occurring class among the B predictions.
Out-of-Bag Error Estimation
- Useful for estimating the test error of a bagged model.
- On average, each bagged tree uses about two-thirds of the observations; the remaining one-third are the out-of-bag (OOB) observations for that tree.
- For each observation, we predict its response using only the trees for which it was OOB, and combine these predictions (averaging for regression, majority vote for classification) to compute the OOB error.
- When B is large, the OOB error is a good approximation of the cross-validation error.
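In scikit-learn the OOB estimate can be requested directly; this is a sketch on synthetic stand-in data, and the number of trees is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

# Bag 200 decision trees (the default base estimator) and keep the out-of-bag predictions.
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)   # OOB accuracy; 1 - oob_score_ approximates the test error
```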
Random Forests
- An improvement over bagging that decorrelates the trees, which further reduces the variance when the results are averaged.
- Build multiple trees on bootstrapped samples; at each split, only a random subset of the predictors is considered rather than all of them.
- Combines per-classifier bagging (a bootstrap sample for each tree) with per-split feature randomization.
- The number of predictors considered at each split is typically chosen to be about √p, where p is the total number of predictors.
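A scikit-learn sketch of the same idea; the data are a synthetic stand-in, and max_features="sqrt" implements the √p rule mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=13, random_state=0)  # stand-in data

# Each tree is grown on a bootstrap sample; each split considers only ~sqrt(p) random predictors.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0).fit(X, y)
```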
Example: The Heart Data
- Contains a binary outcome (HD), indicating the presence or absence of heart disease, for 303 patients who presented with chest pain, along with 13 predictors (Age, Sex, Chol, etc.).
Results: The Heart Data
- Results from bagging and random forests on the Heart data, with test errors shown as a function of B, the number of bootstrapped training sets used.
- Random forests (m < p) lead to a greater improvement than bagging.
- The dashed line shows the test error of a single classification tree, for comparison.
- Increasing the number of trees improves the accuracy of both the bagged and random forest methods, although the test error eventually levels off.
Example: Gene Expression Data
- High-dimensional biological data with 349 patients and 20,000+ genes.
- Gene expression measurements for 4,718 genes; each patient is labeled with one of 15 levels (normal or one of 14 cancer types).
- Using random forests to predict cancer type.
Results: Gene Expression Data
- Results from random forests for fifteen-class gene expression data, with p = 500 predictors, showing test errors as a function of the number of trees.
- Using m = √p leads to a slight improvement over bagging (m = p).
- A single classification tree has an error rate of around 45.7%.
- A sufficiently large number of trees B is needed for the random forest to give accurate classifications.
Boosting
- A general approach to regression and classification in which many decision trees are combined; unlike bagging, the trees are grown on repeatedly modified (reweighted or residual-adjusted) versions of the training data rather than on independent bootstrap copies.
- Trees are grown sequentially, each using information from the previously grown trees.
- Aims to improve the combined model by making new learners focus on the points that the current model mispredicts, for example by associating weights with the training data points.
Adaboost
- Adaptive Boosting: a popular method for binary classification that repeatedly fits decision trees to reweighted versions of the training data and combines them into a weighted overall model.
- Greater weights are assigned to training data points that were misclassified by previously fitted decision trees.
- Weights are normalized in order to sum to 1.
- Gives greater weight to estimators that have a lower error rate.
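scikit-learn provides an implementation of adaptive boosting; the sketch below uses synthetic stand-in data and an arbitrary number of estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

# Sequentially fits shallow trees, reweighting misclassified points after each round
# and giving a larger say to estimators with lower weighted error.
ada = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)

print(ada.estimator_weights_[:5])   # per-estimator weights; lower-error estimators weigh more
```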
Boosting for Regression
- Uses the residuals from the model to improve the prediction.
- Builds a decision tree to predict the residuals (error terms)
- Each new tree is fit to the current residuals and added to the running model (scaled by the shrinkage parameter), iteratively improving predictions in the regions where the previous model is still poor.
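A minimal sketch of this residual-fitting loop, written from the description above; the shrinkage value, tree depth, and synthetic data are illustrative choices, not the course's settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(400, 1))                 # stand-in regression data
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)

shrinkage, n_trees = 0.1, 200
prediction = np.zeros_like(y)
residuals = y.copy()
trees = []

for _ in range(n_trees):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # fit a small tree to the residuals
    prediction += shrinkage * tree.predict(X)                    # add a shrunken copy to the model
    residuals = y - prediction                                   # recompute the residuals
    trees.append(tree)

print(float(np.mean(residuals ** 2)))   # training MSE decreases as trees are added
```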
Tuning Parameters for Boosting
- Number of trees and shrinkage parameter (controlling the rate at which boosting learns).
- Splitting criteria or number of splits in each tree.
Other Regression Example
- Demonstrates the use of decision trees for prediction on the California Housing dataset.
Another Classification Example
- Demonstrates the use of boosting models to improve classification accuracy on a spam dataset.
Variable Importance Measures
- Although bagging and random forests improve prediction accuracy over a single tree, they are harder to interpret; a summary of each predictor's importance can still be obtained.
- For regression trees, importance is measured by the total decrease in RSS due to splits over that predictor, averaged over all trees; for classification trees, the total decrease in the Gini index is used instead.
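In scikit-learn this summary is exposed as feature_importances_ (a sketch on synthetic stand-in data; for a classifier the values reflect mean decrease in Gini impurity, normalized to sum to 1):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity decrease attributable to splits on each predictor, averaged over all trees.
for j, importance in enumerate(rf.feature_importances_):
    print(f"feature {j}: {importance:.3f}")
```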
Summary
- Decision trees are straightforward and easy to interpret, but they can be less accurate than other methods in some circumstances.
- Bagging, Random Forests, and Boosting are excellent methods for improving the accuracy of decision trees.
- They combine multiple simpler trees, called weak learners, into a stronger model.
- These methods are state-of-the-art methods in supervised learning; however, their results can be difficult to interpret.
Description
Test your understanding of entropy and information gain in decision trees. This quiz covers key concepts like class distribution, information transmission, and the calculations of entropy. Perfect for students studying machine learning and data classification.