Decision Trees and Entropy Concepts
48 Questions

Questions and Answers

What is the entropy value when all examples in a set belong to the same class?

  • 0.5
  • log2(n)
  • 1
  • 0 (correct)

Entropy increases when examples are evenly distributed across multiple classes.

True

What formula represents information gain?

I(S) = H(S) − H_after

The expected number of bits needed to transmit information to identify a class in S is referred to as ___.

entropy

What happens to information gain if two child sets have the same probability distribution?

It becomes zero

Match the terms with their correct definitions:

  • Entropy = Measure of uncertainty
  • Information Gain = Reduction in entropy after a split
  • Misclassification Rate = Percentage of incorrectly classified points
  • Uniform Distribution = Equal likelihood of each class

What is the entropy value when there are two equally likely classes?

1

If there are n different classes, the entropy is calculated using the formula ___ = log2(n).

H(S)

What determines the split for the left subtree in a decision tree?

All points satisfying xⱼ < v

The cost function J(S) is calculated by simply summing the number of misclassified examples.

False

What is used to measure the surprise of a class variable Y in decision trees?

The logarithm of the probability of the class

The entropy of a set S is calculated using the formula H(S) = −Σ_C p_C log₂ p_C, where p_C is the _____ in S that are in class C.

proportion of examples

Match the following components of decision tree learning with their descriptions:

  • Left subtree = Uses points where xⱼ < v
  • Right subtree = Uses points where xⱼ ≥ v
  • Cost function = Measures the performance of a split
  • Entropy = Quantifies surprise based on probabilities

Which aspect of splits does a good cost function improve upon compared to a poor cost function?

It provides a more accurate measure of left and right splits.

Weighted averages can sometimes prefer a split that is not optimal for decision trees.

True

What is the outcome when no further splits can be made in a decision tree?

A prediction associated with the node

What method involves combining multiple learning algorithms to improve model performance?

Ensemble learning

What is a disadvantage of decision trees?

They generally have low predictive accuracy.

Bagging is primarily used to increase the bias of a model.

False

Ensemble learning can help reduce the high variance found in decision trees.

True

What is the main idea behind using ensemble methods in decision trees?

To combine multiple weak learners to create a strong learner.

What is the main goal of ensemble methods?

To combine multiple models to produce a more accurate and robust prediction.

In bagging, the final prediction for classification is determined by a __________ vote.

majority

The Netflix Prize aimed to develop the best __________ algorithm for predicting user ratings.

collaborative filtering

Match the following ensemble methods to their descriptions:

  • Bagging = Reduces variance by averaging predictions from multiple models
  • Random forests = Uses multiple decision trees for classification or regression
  • Boosting = Sequentially combines weak models to improve accuracy
  • Ensemble learning = Combines multiple learning algorithms for better performance

Match the following terms with their descriptions:

  • Weak Learner = A learning algorithm that performs better than random guessing.
  • Bagging = Using random subsets of a training set to create multiple models.
  • Random Forests = Multiple randomized decision trees created from subsamples.
  • Variance = The tendency of decision trees to produce different results with small changes in data.

Which of the following is NOT a characteristic of tree-based methods?

They always produce high accuracy.

Which of the following is NOT a benefit of using ensemble methods?

Increased computational speed

Small changes in the dataset have little effect on the final estimated tree.

False

In bagging, each bootstrapped training set is created by sampling with replacement.

True

What is the formula for the average prediction in bagging?

f_bag(x) = (1/B) Σ_{b=1}^{B} f*_b(x)

What is meant by 'high variance' in decision trees?

High variance means that small changes in the training data can lead to large changes in the model output.

What is indicated by the dashed line in the heart data results?

Test error from a single classification tree

Random forests always lead to a higher error rate than bagging.

False

How many patients were tissue samples collected from in the gene expression data study?

349

Random forests are applied with ___ predictors for splitting at each node.

m

Match the following terms to their definitions:

  • Bagging = Creates multiple copies of the original data set using the bootstrap
  • Random Forests = Combines results of multiple decision trees for better predictions
  • OOB error = Error estimate based on bootstrap samples not used in training
  • Boosting = Building models sequentially to correct errors of previous models

Which of the following best describes the purpose of using the 500 genes with the largest variance in the training set?

To maximize prediction of cancer type

Increasing the number of trees in random forests will always decrease the error rate.

False

What levels were used to classify the patients in the gene expression data set?

Normal or one of 14 different types of cancer

What is the primary goal of pruning in decision trees?

To improve validation performance

Pruning is more reliable than stopping early in decision tree construction.

True

What does RSS stand for in the context of regression decision trees?

Residual Sum of Squares

The mean response for the training observations within the j-th box is represented as ŷ_Rj. Therefore, this represents the predicted response for a given test observation using the _____ of the training observations.

mean

Match the following techniques with their characteristics:

  • Pruning = Improves validation performance by removing unnecessary splits
  • Regression Decision Tree = Minimizes residual sum of squares
  • Decision Tree = Mimics human decision-making
  • Linear Model = Assumes a linear decision boundary

In what situation can decision trees outperform linear models?

When the true decision boundary is non-linear

Decision trees are more complicated to explain than linear regression models.

False

What is one key reason people might prefer decision trees over regression models?

They mirror human decision-making

    Study Notes

    Introduction to Machine Learning AI 305

    • This course covers tree-based methods and ensemble learning.
    • Topics include decision tree learning, classification and regression decision trees, decision trees vs. linear models, advantages and disadvantages, ensemble learning, bagging, random forest and boosting.

    Contents

    • Decision Tree Learning
      • Classification decision trees
      • Regression decision trees
      • Decision trees vs. linear models
      • Advantages and disadvantages
    • Ensemble Learning
      • Bagging
      • Random Forest
      • Boosting

    Tree-based Methods

    • Tree-based methods segment the predictor space into simple regions (rectangles).
    • These are also known as decision-tree methods due to the summary of splitting rules within a tree structure.
    • These methods are used for both regression and classification problems.

    Classification Decision Tree

    • Decision trees are trained in a greedy recursive fashion, starting from the root and progressing downwards.
    • After each split, the children of the node are constructed by the same procedure.
    • The test questions are in the form "Is feature j of this point less than the value v?"
    • Splits carve up the feature space into nested rectangular areas.
    • Left subtree contains points satisfying xᵢⱼ < v, right subtree contains points satisfying xᵢⱼ ≥ v.
    • A base case is reached when no further splits are possible, and a prediction, instead of a split, is assigned to the node.

    Classification Decision Tree - Algorithm

    • Input: S, a set of sample point indices.
    • If all points in S belong to the same class C, return a new leaf node with label C.
    • Otherwise:
      • Choose the best splitting feature j and splitting value β.
      • Create two child nodes from S₁ = {i ∈ S : xᵢⱼ < β} and S₂ = {i ∈ S : xᵢⱼ ≥ β}.
      • Recursively call the algorithm on S₁ and S₂.
    • Return the new node (j, β, GrowTree(S₁), GrowTree(S₂)).
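
A minimal Python sketch of GrowTree under these rules, assuming numeric features in a NumPy array X and class labels in y. The helper names (choose_best_split, grow_tree, predict) are illustrative, and the split search uses the entropy-based weighted cost introduced in the next sections rather than the course's exact implementation.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum_C p_C log2 p_C over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def choose_best_split(X, y):
    """Brute force: try every feature j and threshold beta, and keep the pair
    minimizing the weighted average entropy of the two children."""
    best = None                                      # (cost, j, beta)
    n = len(y)
    for j in range(X.shape[1]):
        for beta in np.unique(X[:, j])[1:]:          # candidate split values
            left = X[:, j] < beta
            cost = (left.sum() * entropy(y[left])
                    + (~left).sum() * entropy(y[~left])) / n
            if best is None or cost < best[0]:
                best = (cost, j, beta)
    return None if best is None else (best[1], best[2])

def grow_tree(X, y):
    """GrowTree(S): return a leaf if S is pure, otherwise split and recurse."""
    if len(set(y)) == 1:                             # base case: all one class C
        return ("leaf", y[0])
    split = choose_best_split(X, y)
    if split is None:                                # no usable split: majority-class leaf
        return ("leaf", Counter(y).most_common(1)[0][0])
    j, beta = split
    left = X[:, j] < beta                            # S1 = {i : x_ij < beta}
    return ("node", j, beta,
            grow_tree(X[left], y[left]),             # left subtree on S1
            grow_tree(X[~left], y[~left]))           # right subtree on S2

def predict(node, x):
    """Follow the splits for a single test point x until a leaf is reached."""
    while node[0] == "node":
        _, j, beta, left_child, right_child = node
        node = left_child if x[j] < beta else right_child
    return node[1]
```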

    Decision Tree - How to Choose Best Split

    • Try all possible splits (all features and all possible split values within each feature).
    • Choose the split that minimizes the cost function of the resulting child nodes, or a weighted average of their costs.

    Decision Tree - How to Choose Cost J(S)

    • Misclassification rate:
• Assign to the set S the class that labels the most examples in S.
      • J(S) = # of examples in S not belonging to the assigned class.
    • Entropy:
      • Measure based on information theory.
  • Surprise of Y being class C = −log₂ p_C (always non-negative).
  • Entropy of a set S = the average surprise, H(S) = −Σ_C p_C log₂ p_C,
    • where p_C is the proportion of examples in S belonging to class C.
  • If all examples in S belong to the same class: H(S) = 0.
  • If half of the examples are in class C and half in class D: H(S) = 1.
  • If all n examples belong to different classes: H(S) = log₂ n.

    Information Gain

    • Choose a split that maximizes information gain.
• I(S) = H(S) − H_after, where H_after is the weighted average entropy of the children after the split (a small sketch follows).
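
A short sketch of these two formulas (illustrative function names, NumPy assumed):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_C p_C log2 p_C, with p_C the proportion of S in class C."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p)) + 0.0   # +0.0 turns -0.0 into 0.0 for pure sets

def information_gain(parent, left, right):
    """I(S) = H(S) - H_after, with H_after the weighted average child entropy."""
    n = len(parent)
    h_after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - h_after

print(entropy([1, 1, 1, 1]))                             # 0.0 (pure set)
print(entropy([0, 0, 1, 1]))                             # 1.0 (two equally likely classes)
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))    # 1.0 (a perfect split)
```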

    Entropy vs. Misclassification Rate

• Entropy is strictly concave: the chord between any two points on the entropy curve lies strictly below the curve.
• The entropy after a split is the weighted average of the children's entropies, which lies on such a chord.
• Information is therefore gained whenever the two child distributions are not the same.
• The misclassification rate is not strictly concave, so a split can appear to give no improvement even when the children differ; this is a disadvantage (see the numeric check below).
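
A quick numeric check of this point. The class counts below are made up for illustration: a parent with 30 class-C and 10 class-D points, split into one mixed child (10/10) and one pure child (20/0).

```python
import numpy as np

def entropy(p):
    """Binary entropy (in bits) of a node whose class-C proportion is p."""
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def misclass(p):
    """Misclassification rate when the node predicts its majority class."""
    return min(p, 1 - p)

p_parent, p_left, p_right = 30 / 40, 10 / 20, 20 / 20    # class-C proportions
w_left, w_right = 20 / 40, 20 / 40                       # child weights

gain_H = entropy(p_parent) - (w_left * entropy(p_left) + w_right * entropy(p_right))
gain_M = misclass(p_parent) - (w_left * misclass(p_left) + w_right * misclass(p_right))

print(round(gain_H, 3))   # ~0.311 bits: entropy rewards the split
print(round(gain_M, 3))   # 0.0: the misclassification rate sees no improvement at all
```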

    Gini Impurity

    • Measures how often a randomly chosen element from the set would be incorrectly labeled if randomly labeled according to the distribution of labels.
    • G(S) = Σc pc(1-pc).
• Empirically produces results very similar to entropy, but is slightly faster to compute (sketch below).
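
A matching helper, following the same conventions as the entropy sketch above:

```python
import numpy as np

def gini_impurity(labels):
    """G(S) = sum_C p_C (1 - p_C) = 1 - sum_C p_C^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(gini_impurity([0, 0, 0, 0]))   # 0.0 (pure set)
print(gini_impurity([0, 0, 1, 1]))   # 0.5 (two equally likely classes)
```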

    Stopping Criteria

• Trees that grow too deep overfit, so we stop splitting before that happens.
• Several heuristics, not mutually exclusive, help us decide when to stop:
  • Fixed depth
  • Node purity
  • Information gain criteria
• The thresholds for these stopping criteria may need to be adjusted.
• Pruning can be used as an alternative: grow the full tree, then remove splits that do not reduce validation error.

    Fixed Tree Depth

    • Increasing the maximal depth does not always lead to better performance.
• Ten-fold cross-validation can be used to find the optimal maximal tree depth, as in the sketch below.
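
A minimal scikit-learn sketch of this tuning step; load_breast_cancer is only a stand-in dataset, not the one used in the lecture.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Score each candidate maximal depth with 10-fold cross-validation and keep
# the depth with the best mean validation accuracy.
depths = range(1, 11)
scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                          X, y, cv=10).mean()
          for d in depths]
print(depths[int(np.argmax(scores))], max(scores))
```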

    Pruning

    • Growing a complete tree and then removing splits to improve validation performance.
• A more robust approach than stopping early, because it can keep splits that improve little on their own but are followed by other splits that improve greatly.
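
One way to do this in scikit-learn is cost-complexity pruning: grow the full tree, compute its pruning path, and keep the pruning strength (ccp_alpha) that validates best. A sketch, again on a stand-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Each ccp_alpha on the path corresponds to one pruned subtree of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = tree.score(X_val, y_val)        # validation accuracy of this pruned subtree
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)
```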

    Regression Decision Tree

• The goal is to find regions (boxes) R_j that minimize the residual sum of squares, RSS = Σ_j Σ_{i∈R_j} (y_i − ŷ_Rj)², where ŷ_Rj is the mean response of the training observations in box R_j (see the split-search sketch below).
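
A tiny sketch of the RSS-minimizing split search for a single numeric feature, on made-up data (the function name and data are illustrative):

```python
import numpy as np

def best_rss_split(x, y):
    """Find the threshold v minimizing
    RSS = sum_{x_i < v} (y_i - mean_left)^2 + sum_{x_i >= v} (y_i - mean_right)^2."""
    best_v, best_rss = None, np.inf
    for v in np.unique(x)[1:]:                       # candidate thresholds
        left, right = y[x < v], y[x >= v]
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best_rss:
            best_v, best_rss = v, rss
    return best_v, best_rss

# Toy data with a jump in the response at x = 5: the chosen threshold lands near 5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.1, 200)
print(best_rss_split(x, y))
```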

Ensemble Learning

• Tree-based methods are fast, simple, good for interpretation, and invariant under scaling/translation of the features.
• They are not the best at prediction accuracy: they have low bias but high variance.
• Averaging the predictions of many different trees reduces this variance.
• One approach is to generate random subsamples of the training data, build a tree on each, and average the outputs.
• Alternatively, average different learning algorithms, or the same algorithm trained on multiple training sets (if enough data are available).
    • Netflix Prize: used average results from several algorithms.

    Bagging

    • Bootstrap Aggregating, or bagging
    • General purpose procedure for reducing the variance of a statistical method
• We sample repeatedly, with replacement, from the training set to generate B bootstrapped training sets.
• We train the method on the b-th training set to obtain the prediction f*_b(x).
• We then average all the predictions to obtain f_bag(x) = (1/B) Σ_{b=1}^{B} f*_b(x).
• This technique is called bagging (a hand-rolled sketch follows).
• Bagging is more reliable and typically does much better than the individual models.
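
A hand-rolled sketch of bagging for regression, using scikit-learn trees on a synthetic dataset (sklearn.ensemble.BaggingRegressor packages the same idea):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Draw B bootstrap samples (with replacement), fit a tree f*_b to each,
# then average the B predictions to get f_bag(x).
rng = np.random.default_rng(0)
B = 100
trees = []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))       # bootstrap sample of row indices
    trees.append(DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx]))

f_bag = np.mean([tree.predict(X) for tree in trees], axis=0)   # (1/B) * sum_b f*_b(x)
print(f_bag[:3])
```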

    Bagging Classification Trees

    • For classification trees: we record the class predicted by each of the B trees for each test observation.
    • We then take a majority vote to determine the overall prediction.
    • The overall prediction is the most commonly occurring class among the B predictions.
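
The same idea for classification, as a tiny vote-counting sketch (the prediction matrix below is made up; in practice each row would come from one bagged tree):

```python
import numpy as np

# preds[b, i] = class predicted by tree b for test point i
preds = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [1, 1, 0]])
majority = [int(np.bincount(preds[:, i]).argmax()) for i in range(preds.shape[1])]
print(majority)   # [0, 1, 0] -- the most commonly occurring class per test point
```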

    Out-of-Bag Error Estimation

• Useful for estimating the test error of a bagged model without cross-validation.
• Each bagged tree uses roughly two-thirds of the observations; the remaining one-third are that tree's out-of-bag (OOB) observations.
• Each observation is predicted using only the trees for which it was out of bag, and these predictions are averaged (or voted) to obtain the OOB error.
• When B is sufficiently large, the OOB error is a good approximation to the cross-validation error (sketched below).
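
scikit-learn exposes this directly through oob_score=True; a sketch on a stand-in dataset (the default base learner of BaggingClassifier is already a decision tree):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each observation is scored only by the trees whose bootstrap sample missed it,
# so 1 - oob_score_ behaves like an estimate of the test error.
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(1 - bag.oob_score_)
```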

    Random Forests

• An improvement over bagging that decorrelates the trees, which further reduces the variance when we average their results.
• Build multiple trees on bootstrapped samples; each time a split is considered, only a random subset of the predictors is eligible, rather than all of them.
• Per-classifier bagging plus per-split feature randomization.
• The number of predictors m considered at each split is typically about √p, where p is the total number of predictors (see the sketch below).
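
A scikit-learn sketch of the m ≈ √p rule, again on a stand-in dataset; max_features="sqrt" is what distinguishes this from plain bagging, which would consider all p predictors at every split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# max_features="sqrt" considers about sqrt(p) randomly chosen predictors per split,
# which decorrelates the trees; max_features=None would reproduce bagging.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print(1 - rf.oob_score_)   # OOB error estimate for the forest
```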

    Example: The Heart Data

• Contains a binary outcome HD (presence or absence of heart disease) for 303 patients who presented with chest pain, along with 13 predictors (Age, Sex, Chol, etc.).

    Results: The Heart Data

• Results from bagging and random forests on the Heart data, with test errors shown as a function of B, the number of bootstrapped training sets used.
• Random forests (m < p) lead to an improvement over bagging.
• The dashed line shows the test error of a single classification tree.
• Increasing the number of trees improves the accuracy of both the bagged and random forest models, up to a point.

    Example: Gene Expression Data

• High-dimensional biological data: tissue samples from 349 patients (the human genome contains 20,000+ genes).
• Expression measurements for 4,718 genes; each patient is labeled with one of 15 levels (normal or one of 14 different types of cancer).
    • Using random forests to predict cancer type.

    Results: Gene Expression Data

    • Results from random forests for fifteen-class gene expression data, with p = 500 predictors, showing test errors as a function of the number of trees.
• Using m = √p leads to a slight improvement over bagging (m = p).
• A single classification tree has an error rate of around 45.7%.
• A sufficiently large number of trees B gives a more accurate random forest classifier.

    Boosting

• A general approach to regression and classification that combines many decision trees; unlike bagging, the trees are not fit to bootstrap copies of the data but are grown on modified versions of the original training set.
• Trees are grown sequentially, using information from previously grown trees.
• Aims to improve the combined model by focusing new learners on the points that the current model mispredicts, for example by associating weights with the training data points.

    Adaboost

• Adaptive Boosting: a popular boosting method for binary classification that repeatedly fits decision trees to reweighted versions of the training data and combines them into an overall model.
• Larger weights are assigned to the training data points that were misclassified by previously fitted trees.
    • Weights are normalized in order to sum to 1.
    • Gives greater weight to estimators that have a lower error rate.
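
A scikit-learn sketch of AdaBoost with depth-1 trees (stumps) as the weak learners, assuming a recent scikit-learn where the base-learner argument is named estimator; the dataset and hyperparameter values are only illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each boosting round reweights the points the current ensemble misclassifies
# and fits the next stump to the reweighted data.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, learning_rate=0.5, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```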

    Boosting for Regression

• Uses the residuals of the current model to improve the prediction.
• At each step, a decision tree is fit to the residuals (error terms) of the current model.
• The new tree is then added to the current model and the residuals are updated; iterating in this way improves predictions in the regions where previous models were poor (a hand-rolled sketch follows).
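
A hand-rolled sketch of this residual-fitting loop on synthetic data (the shrinkage value and tree depth are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

shrinkage, n_trees = 0.1, 200
prediction = np.zeros_like(y, dtype=float)
trees = []
for _ in range(n_trees):
    residuals = y - prediction                        # what the current model still misses
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += shrinkage * tree.predict(X)         # add the new tree, scaled down
    trees.append(tree)

print(np.mean((y - prediction) ** 2))                 # training MSE after boosting
```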

    Tuning Parameters for Boosting

• The number of trees B, and the shrinkage parameter controlling the rate at which boosting learns.
• The splitting criterion and the number of splits in each tree (its depth), which control the complexity of each weak learner.
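
These knobs map directly onto scikit-learn's GradientBoostingRegressor; a sketch with illustrative values on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# n_estimators = number of trees, learning_rate = shrinkage,
# max_depth = number of splits (complexity) of each weak learner.
gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=2, random_state=0)
print(cross_val_score(gbr, X, y, cv=5, scoring="r2").mean())
```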

    Other Regression Example

• Demonstrates the use of decision trees for prediction on the California Housing dataset.

    Another Classification Example

• Demonstrates the use of boosting models to improve classification accuracy on a spam dataset.

    Variable Importance Measures

• Bagging and random forests improve prediction accuracy over a single tree, but at the cost of interpretability; variable importance measures recover a summary of how useful each predictor is.
• For regression trees, importance is measured by the total reduction in RSS due to splits on that predictor, averaged over all trees.
• In classification trees, the total reduction in the Gini index is used instead (see the sketch below).
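
In scikit-learn this summary is exposed as feature_importances_; a sketch on a stand-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(data.data, data.target)

# feature_importances_ averages each predictor's total impurity (Gini) reduction
# over all trees; larger values indicate more useful predictors.
ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```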

    Summary

    • Decision trees are straightforward and easy to interpret, but they can be less accurate than other methods in some circumstances.
    • Bagging, Random Forests, and Boosting are excellent methods for improving the accuracy of decision trees.
    • They combine multiple simpler trees, called weak learners, into a stronger model.
• These are among the state-of-the-art methods in supervised learning; however, their results can be difficult to interpret.


    Description

    Test your understanding of entropy and information gain in decision trees. This quiz covers key concepts like class distribution, information transmission, and the calculations of entropy. Perfect for students studying machine learning and data classification.
