Questions and Answers
What is the entropy value when all examples in a set belong to the same class?
Entropy increases when examples are evenly distributed across multiple classes.
True
What formula represents information gain?
I(S) = H(S) - H_after
The expected number of bits needed to transmit information to identify a class in S is referred to as ___.
What happens to information gain if two child sets have the same probability distribution?
Match the terms with their correct definitions:
What is the entropy value when there are two equally likely classes?
If there are n different classes, the entropy is calculated using the formula ___ = log2(n).
What determines the split for the left subtree in a decision tree?
The cost function J(S) is calculated by simply summing the number of misclassified examples.
What is used to measure the surprise of a class variable Y in decision trees?
The entropy of a set S is calculated using the formula H(S) = -Σ_C p_C log2(p_C), where p_C is the _____ in S that are in class C.
Match the following components of decision tree learning with their descriptions:
Which aspect of splits does a good cost function improve upon compared to a poor cost function?
Weighted averages can sometimes prefer a split that is not optimal for decision trees.
What is the outcome when no further splits can be made in a decision tree?
What method involves combining multiple learning algorithms to improve model performance?
What is a disadvantage of decision trees?
Bagging is primarily used to increase the bias of a model.
Ensemble learning can help reduce the high variance found in decision trees.
What is the main idea behind using ensemble methods in decision trees?
What is the main goal of ensemble methods?
In bagging, the final prediction for classification is determined by a __________ vote.
The Netflix Prize aimed to develop the best __________ algorithm for predicting user ratings.
Match the following ensemble methods to their descriptions:
Match the following terms with their descriptions:
Which of the following is NOT a characteristic of tree-based methods?
Which of the following is NOT a benefit of using ensemble methods?
Small changes in the dataset have little effect on the final estimated tree.
In bagging, each bootstrapped training set is created by sampling with replacement.
What is the formula for the average prediction in bagging?
What is meant by 'high variance' in decision trees?
What is indicated by the dashed line in the heart data results?
Random forests always lead to a higher error rate than bagging.
How many patients were tissue samples collected from in the gene expression data study?
Random forests are applied with ___ predictors for splitting at each node.
Match the following terms to their definitions:
Which of the following best describes the purpose of using 500 genes with the largest variance in the training set?
Increasing the number of trees in random forests will always decrease the error rate.
What levels were used to classify the patients in the gene expression data set?
What is the primary goal of pruning in decision trees?
Pruning is more reliable than stopping early in decision tree construction.
What does RSS stand for in the context of regression decision trees?
The mean response for the training observations within the j-th box is represented as ŷ_Rj. Therefore, this represents the predicted response for a given test observation using the _____ of the training observations.
Match the following techniques with their characteristics:
In what situation can decision trees outperform linear models?
Decision trees are more complicated to explain than linear regression models.
What is one key reason people might prefer decision trees over regression models?
Study Notes
Introduction to Machine Learning AI 305
- This course covers tree-based methods and ensemble learning.
- Topics include decision tree learning, classification and regression decision trees, decision trees vs. linear models, advantages and disadvantages, ensemble learning, bagging, random forest and boosting.
Contents
- Decision Tree Learning
  - Classification decision trees
  - Regression decision trees
  - Decision trees vs. linear models
  - Advantages and disadvantages
- Ensemble Learning
  - Bagging
  - Random Forest
  - Boosting
Tree-based Methods
- Tree-based methods segment the predictor space into simple regions (rectangles).
- These are also known as decision-tree methods due to the summary of splitting rules within a tree structure.
- These methods are used for both regression and classification problems.
Classification Decision Tree
- Decision trees are trained in a greedy recursive fashion, starting from the root and progressing downwards.
- After each split, the children of the node are constructed by the same procedure.
- The test questions are in the form "Is feature j of this point less than the value v?"
- Splits carve up the feature space into nested rectangular areas.
- Left subtree contains points satisfying xᵢⱼ < v, right subtree contains points satisfying xᵢⱼ ≥ v.
- A base case is reached when no further splits are possible, and a prediction, instead of a split, is assigned to the node.
Classification Decision Tree - Algorithm
- Input: S, a set of sample point indices.
- If all points in S belong to the same class C, return a new leaf node with label C.
- Otherwise:
- Choose the best splitting feature j and splitting value β.
- Create two child nodes from S₁ = {i ∈ S : xᵢⱼ < β} and S₂ = {i ∈ S : xᵢⱼ ≥ β}.
- Recursively call the algorithm on S₁ and S₂.
- Return the new node (j, β, GrowTree(S₁), GrowTree(S₂)).
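As an illustration only, here is a minimal Python sketch of GrowTree. The Node class, the exhaustive best_split search (which implements the split selection described in the next section), and the misclassification-count cost are assumptions chosen for brevity; entropy and Gini impurity, discussed below, are the usual cost choices.

```python
import numpy as np

class Node:
    """Internal node (feature j, threshold beta, two children) or leaf (class label)."""
    def __init__(self, j=None, beta=None, left=None, right=None, label=None):
        self.j, self.beta, self.left, self.right, self.label = j, beta, left, right, label

def cost(y_sub):
    """Misclassification count: examples not in the majority class (a simple stand-in cost)."""
    if len(y_sub) == 0:
        return 0
    _, counts = np.unique(y_sub, return_counts=True)
    return len(y_sub) - counts.max()

def best_split(X, y, S):
    """Try every feature j and threshold beta; keep the split with the smallest total child cost."""
    best_j, best_beta, best_cost = None, None, np.inf
    for j in range(X.shape[1]):
        for beta in np.unique(X[S, j])[1:]:          # thresholds strictly above the minimum value
            S1, S2 = S[X[S, j] < beta], S[X[S, j] >= beta]
            c = cost(y[S1]) + cost(y[S2])
            if c < best_cost:
                best_j, best_beta, best_cost = j, beta, c
    return best_j, best_beta

def grow_tree(X, y, S):
    """GrowTree(S): return a leaf if S is pure, otherwise split and recurse on S1 and S2."""
    classes, counts = np.unique(y[S], return_counts=True)
    if len(classes) == 1:                            # base case: all points share one class
        return Node(label=classes[0])
    j, beta = best_split(X, y, S)
    if j is None:                                    # no split possible: predict the majority class
        return Node(label=classes[np.argmax(counts)])
    S1, S2 = S[X[S, j] < beta], S[X[S, j] >= beta]
    return Node(j, beta, grow_tree(X, y, S1), grow_tree(X, y, S2))

# Usage: tree = grow_tree(X, y, np.arange(len(y))) for a feature matrix X and label vector y.
```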
Decision Tree - How to Choose Best Split
- Try all possible splits (all features and all possible split values within each feature).
- Choose the split that minimizes the cost function of the resulting child nodes, or a weighted average of their costs.
Decision Tree - How to Choose Cost J(S)
- Misclassification rate:
  - Assign to the set S the class that labels the most examples in S (the majority class).
  - J(S) = the number of examples in S not belonging to the assigned class.
- Entropy:
  - A measure based on information theory.
  - Surprise of Y being class C = -log₂(p_C) (non-negative).
  - Entropy of set S = average surprise: H(S) = -Σ_C p_C log₂(p_C), where p_C is the proportion of examples in S that belong to class C.
  - If all examples in S belong to the same class: H(S) = 0.
  - If half the examples are in class C and half in class D: H(S) = 1.
  - If all n classes are equally likely (e.g., every example is in a different class): H(S) = log₂(n).
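A small sketch (the function name and the toy label lists are illustrative, not from the slides) that checks the three special cases just listed:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_C p_C log2(p_C), where p_C is the proportion of examples in class C."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy(["C", "C", "C", "C"]))   # 0.0 -- all examples in the same class
print(entropy(["C", "C", "D", "D"]))   # 1.0 -- two equally likely classes
print(entropy(["a", "b", "c", "d"]))   # 2.0 -- n = 4 equally likely classes, log2(4)
```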
Information Gain
- Choose a split that maximizes information gain.
- I(S) = H(S) - H_after, where H_after is the weighted-average entropy of the child sets after the split.
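Written out explicitly, with S₁ and S₂ the two child sets produced by the split:

```latex
H_{\text{after}} = \frac{|S_1|\,H(S_1) + |S_2|\,H(S_2)}{|S_1| + |S_2|},
\qquad
I(S) = H(S) - H_{\text{after}}
```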
Entropy vs. Misclassification Rate
- Entropy is strictly concave: the chord between any two points on the curve lies strictly below the curve.
- The parent's class distribution is the weighted average of the children's distributions, so H(S) sits on the curve while H_after sits on the chord below it.
- Information is therefore gained (I(S) > 0) whenever the two child distributions are not the same.
- The misclassification rate is not strictly concave, so many useful splits show no apparent improvement; this is its disadvantage.
Gini Impurity
- Measures how often a randomly chosen element from the set would be incorrectly labeled if randomly labeled according to the distribution of labels.
- G(S) = Σ_C p_C (1 - p_C).
- Empirically produces results very similar to entropy, but slightly faster to compute.
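For comparison with the entropy sketch above, a minimal Gini impurity function (illustrative only):

```python
import numpy as np

def gini(labels):
    """G(S) = sum_C p_C (1 - p_C)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * (1.0 - p)).sum())

print(gini(["C", "C", "C", "C"]))   # 0.0 -- pure set
print(gini(["C", "C", "D", "D"]))   # 0.5 -- two equally likely classes
```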
Stopping Criteria
- Trees that are grown too deep overfit, so splitting is stopped before that point.
- Several heuristics, not mutually exclusive, help decide when to stop:
  - Fixed depth
  - Node purity
  - Information gain criteria
- Thresholds for these stopping criteria may need to be adjusted.
- Pruning may be used as an alternative: grow the full tree, then remove splits that do not reduce validation error.
Fixed Tree Depth
- Increasing the maximal depth does not always lead to better performance.
- Using ten-fold cross-validation to find the optimal maximal tree depth.
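One way to run this search with scikit-learn; the synthetic dataset and the depth range 1 to 10 are arbitrary stand-ins, not values from the course:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

# Ten-fold cross-validation over a grid of maximal tree depths.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=10,
)
search.fit(X, y)
print(search.best_params_)   # the maximal depth with the best cross-validated accuracy
```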
Pruning
- Growing a complete tree and then removing splits to improve validation performance.
- More reliable than stopping early: a split that improves little on its own may be followed by splits that improve a great deal, and pruning can recognize and keep such splits, whereas stopping early would never reach them.
Regression Decision Tree
- The goal is to find regions (boxes) that minimize the residual sum of squares (RSS)
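In the usual notation, with R₁, …, R_J the boxes and ŷ_{R_j} the mean response of the training observations falling in the j-th box, the quantity being minimized is:

```latex
\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
```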
Ensemble Learning
- Tree-based methods are fast, simple, good for interpretation, and invariant under scaling/translation.
- Not the best at prediction accuracy, but can have low bias.
- Taking averages of different trees to reduce the variance.
- Generating random subsamples of the training data to build each tree, and averaging the outputs.
- Taking an average of differing learning algorithms, or a learning algorithm on multiple training sets (if enough data are available).
- Netflix Prize: used average results from several algorithms.
Bagging
- Bootstrap Aggregating, or bagging
- General purpose procedure for reducing the variance of a statistical method
- We sample repeatedly, with replacement, from the training set to generate B bootstrapped training sets.
- We train the method on the b-th bootstrapped training set and obtain the prediction f̂*b(x).
- We then average all B predictions to obtain f̂_bag(x) (see the formula below).
- This technique is called bagging
- Bagging is more reliable and typically does much better than individual models.
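With B bootstrapped training sets and f̂*b the model fit to the b-th of them, the bagged prediction referred to above is:

```latex
\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)
```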
Bagging Classification Trees
- For classification trees: we record the class predicted by each of the B trees for each test observation.
- We then take a majority vote to determine the overall prediction.
- The overall prediction is the most commonly occurring class among the B predictions.
Out-of-Bag Error Estimation
- Useful for estimating the test error of a bagged model.
- On average, each bagged tree uses about two-thirds of the observations; the remaining one-third are the out-of-bag (OOB) observations for that tree.
- For each observation, we predict its response using only the trees for which it was OOB, and combine these predictions (averaging for regression, majority vote for classification) to compute the OOB error.
- When B is large, the OOB error is a good approximation of the cross-validation error.
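In scikit-learn the OOB estimate can be requested directly; this is a sketch on synthetic stand-in data, and the number of trees is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

# Bag 200 decision trees (the default base estimator) and keep the out-of-bag predictions.
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)   # OOB accuracy; 1 - oob_score_ approximates the test error
```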
Random Forests
- An improvement over bagging that decorrelates the trees, which further reduces the variance when the results are averaged.
- Build multiple trees on bootstrapped samples; at each split, only a random subset of the predictors is considered rather than all of them.
- Combines per-classifier bagging (a bootstrap sample for each tree) with per-split feature randomization.
- The number of predictors considered at each split is typically chosen to be about √p, where p is the total number of predictors.
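A scikit-learn sketch of the same idea; the data are a synthetic stand-in, and max_features="sqrt" implements the √p rule mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=13, random_state=0)  # stand-in data

# Each tree is grown on a bootstrap sample; each split considers only ~sqrt(p) random predictors.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0).fit(X, y)
```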
Example: The Heart Data
- Contains a binary outcome (HD), indicating the presence or absence of heart disease, for 303 patients who presented with chest pain, along with 13 predictors (Age, Sex, Chol, etc.).
Results: The Heart Data
- Results from bagging and random forests on the Heart data, with test errors shown as a function of B, the number of bootstrapped training sets used.
- Random forests (m < p) lead to a greater improvement than bagging.
- The dashed line shows the test error of a single classification tree, for comparison.
- Increasing the number of trees improves the accuracy of both the bagged and random forest methods, although the test error eventually levels off.
Example: Gene Expression Data
- High-dimensional biological data with 349 patients and 20,000+ genes.
- Gene expression measurements for 4,718 genes; each patient is labeled with one of 15 levels (normal or one of 14 cancer types).
- Using random forests to predict cancer type.
Results: Gene Expression Data
- Results from random forests for fifteen-class gene expression data, with p = 500 predictors, showing test errors as a function of the number of trees.
- Using m = √p leads to a slight improvement over bagging (m = p).
- A single classification tree has an error rate of around 45.7%.
- A sufficiently large number of trees B is needed for the random forest to give accurate classifications.
Boosting
- A general approach to regression and classification in which many decision trees are combined; unlike bagging, the trees are grown on repeatedly modified (reweighted or residual-adjusted) versions of the training data rather than on independent bootstrap copies.
- Trees are grown sequentially, each using information from the previously grown trees.
- Aims to improve the combined model by making new learners focus on the points that the current model mispredicts, for example by associating weights with the training data points.
Adaboost
- Adaptive Boosting: a popular method for binary classification that repeatedly fits decision trees to reweighted versions of the training data and combines them into a weighted overall model.
- Greater weights are assigned to training data points that were misclassified by previously fitted decision trees.
- Weights are normalized in order to sum to 1.
- Gives greater weight to estimators that have a lower error rate.
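scikit-learn provides an implementation of adaptive boosting; the sketch below uses synthetic stand-in data and an arbitrary number of estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data

# Sequentially fits shallow trees, reweighting misclassified points after each round
# and giving a larger say to estimators with lower weighted error.
ada = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)

print(ada.estimator_weights_[:5])   # per-estimator weights; lower-error estimators weigh more
```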
Boosting for Regression
- Uses the residuals from the model to improve the prediction.
- Builds a decision tree to predict the residuals (error terms)
- Each new tree is fit to the current residuals and added to the running model (scaled by the shrinkage parameter), iteratively improving predictions in the regions where the previous model is still poor.
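A minimal sketch of this residual-fitting loop, written from the description above; the shrinkage value, tree depth, and synthetic data are illustrative choices, not the course's settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(400, 1))                 # stand-in regression data
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)

shrinkage, n_trees = 0.1, 200
prediction = np.zeros_like(y)
residuals = y.copy()
trees = []

for _ in range(n_trees):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # fit a small tree to the residuals
    prediction += shrinkage * tree.predict(X)                    # add a shrunken copy to the model
    residuals = y - prediction                                   # recompute the residuals
    trees.append(tree)

print(float(np.mean(residuals ** 2)))   # training MSE decreases as trees are added
```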
Tuning Parameters for Boosting
- Number of trees and shrinkage parameter (controlling the rate at which boosting learns).
- Splitting criteria or number of splits in each tree.
Other Regression Example
- Demonstrates the use of decision trees for prediction on the California Housing dataset.
Another Classification Example
- Demonstrates the use of boosting models to improve classification accuracy on a spam dataset.
Variable Importance Measures
- Although bagging and random forests improve prediction accuracy over a single tree, they are harder to interpret; a summary of each predictor's importance can still be obtained.
- For regression trees, importance is measured by the total decrease in RSS due to splits over that predictor, averaged over all trees; for classification trees, the total decrease in the Gini index is used instead.
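In scikit-learn this summary is exposed as feature_importances_ (a sketch on synthetic stand-in data; for a classifier the values reflect mean decrease in Gini impurity, normalized to sum to 1):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity decrease attributable to splits on each predictor, averaged over all trees.
for j, importance in enumerate(rf.feature_importances_):
    print(f"feature {j}: {importance:.3f}")
```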
Summary
- Decision trees are straightforward and easy to interpret, but they can be less accurate than other methods in some circumstances.
- Bagging, Random Forests, and Boosting are excellent methods for improving the accuracy of decision trees.
- They combine multiple simpler trees, called weak learners, into a stronger model.
- These methods are state-of-the-art methods in supervised learning; however, their results can be difficult to interpret.
Description
Test your understanding of entropy and information gain in decision trees. This quiz covers key concepts like class distribution, information transmission, and the calculations of entropy. Perfect for students studying machine learning and data classification.