Decision trees – a classic ML algorithm
Szymon Jaroszewicz
Summary
These lecture slides cover the decision tree algorithm and the basics of information theory on which its splitting criteria are built.
Full Transcript
Decision trees – a classic ML algorithm
Basics of Information Theory
Szymon Jaroszewicz

Decision trees – what are they?
Leaves of the tree contain decisions; non-leaf nodes contain tests. To classify a new instance (case):
1. start at the root of the tree
2. if the root is a leaf, return the decision
3. perform the test at the root
4. move down a branch according to the test's outcome
5. repeat the procedure recursively (go to step 2)

[Figure: an example decision tree with tests X1 < 5, X2 > −3.7, a categorical test X3 with outcomes A/B/C, and X1 < 8.14, leading to Good/Bad leaves.]

Decision trees – background
The first Machine Learning algorithm (1970s), and still used today: decent prediction accuracy (though newer methods are often better), many implementations available (every analytics package has them), and good implementations are efficient, work with numeric and categorical attributes, handle missing data, etc. It was a very active research area in the 80s and 90s, with hundreds (thousands?) of research papers; the topic is now mostly abandoned. However, decision trees are the basis of powerful modern ensemble methods, in which several trees are combined together (bagging, boosting, random forests), and they are also coming back in explainable machine learning.

Learning decision trees – a high-level algorithm
Procedure BuildTree
Input: dataset D
Output: decision tree T
1. If all records in D have the same class y:
2. return a leaf with decision y
3. Choose a test A for the root of the tree
4. Split D into datasets Da depending on the outcomes of A: Da = {(x, y) ∈ D : A(x) = a}
5. For each outcome a build a subtree Ta recursively: Ta = BuildTree(Da)
6. Return a tree with test A in the root and the Ta's as subtrees of the root
(A minimal code sketch of this procedure is given at the end of this section.)

How do decision trees partition the sample space?
Tests involve single variables, so the sample space is split into several (hyper)rectangles. [Figure: successive axis-parallel splits of the sample space X.]

Choosing a test in the root of the tree
The algorithm above is just a sketch; many important details were left out. How are tests chosen in the root of the tree? Many methods are possible (this used to be a popular research topic): pick a test which gives the most information about Y, which gives the largest decrease in uncertainty about the value of Y, or which is most correlated with Y. We need to precisely define information and uncertainty.
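The BuildTree pseudocode above can be turned into running code almost line for line. Below is a minimal, illustrative Python sketch for categorical tests only; the helpers entropy and information_gain are simplified versions of the quantities defined in the information-theory part of these notes, and the toy data are made up. It is a sketch of the general scheme, not a reproduction of any particular algorithm from the lecture.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """H(Y) - H(Y | attr), estimated from the data."""
    total, n, cond = entropy(labels), len(labels), 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        cond += len(subset) / n * entropy(subset)
    return total - cond

def build_tree(rows, labels, attributes):
    """Recursive tree construction following the BuildTree sketch above."""
    # Stop: all records share one class (or no attributes are left) -> leaf.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Choose the test for the root: highest information gain (one possible criterion).
    attr = max(attributes, key=lambda a: information_gain(rows, labels, a))
    children = {}
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        children[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attributes if a != attr])
    return {"test": attr, "children": children}

# Toy usage: rows are dicts of categorical attributes.
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"}, {"outlook": "rain",  "windy": "yes"}]
labels = ["Good", "Bad", "Good", "Bad"]
print(build_tree(rows, labels, ["outlook", "windy"]))
```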
Basics of information theory
Shannon's Information Theory defines the notion of entropy (used earlier in physics). For a random variable X with distribution P(X) = (p1, ..., pk), the entropy of X is defined as

    H(X) = −∑_{i=1}^{k} p_i log p_i

Entropy measures the amount of information X carries, or equivalently the amount of uncertainty about the value of X. It is a very important and useful concept.

Properties of entropy
0 ≤ H(X) ≤ log k.
H(X) = 0 if one of the probabilities is one and all others are zero, P(X) = (0, ..., 1, ..., 0). This makes sense: in this case uncertainty is minimal (in practice we know the value of X), and X carries no information that we didn't know before observing its value.
H(X) = log k when P(X) is uniform, P(X) = (1/k, ..., 1/k). This also makes sense: all values of X are equally probable, we can say nothing before actually observing the value of X, and uncertainty is maximal.
Interpretation: the average number of bits learned by observing the outcome of X (for base-2 logarithms).
Example: a fair coin, P = (1/2, 1/2), H(X) = 1 bit.
Example: a coin with two heads, P = (1, 0), H(X) = 0 bits.
Example: a loaded coin, P = (3/4, 1/4), H(X) ≈ 0.81128 bits (somewhat harder to interpret).

Joint entropy
For two variables X and Y, the joint entropy of X and Y is the entropy of their joint distribution P(X, Y):

    H(X, Y) = −∑_x ∑_y P(x, y) log P(x, y)

H(X, Y) ≤ H(X) + H(Y), with H(X, Y) = H(X) + H(Y) iff (if and only if) X and Y are independent.

Mutual information
The difference is used to define an important concept, the mutual information of X and Y:

    I(X, Y) = H(X) + H(Y) − H(X, Y)

It measures how much information X and Y share with each other. If X ⊥ Y then I(X, Y) = 0: independent variables share no information, and knowing one gives no information about the other. I(X, X) = H(X): equal variables share all information, and knowing one gives all information about the other. I(X, Y) = I(Y, X): symmetry.

Conditional entropy
For two variables X and Y, if we know that Y = y:

    H(X | Y = y) = −∑_x P(x | Y = y) log P(x | Y = y)

So if we learn the value of Y, the remaining uncertainty about X is

    H(X | Y) = ∑_y P(Y = y) H(X | Y = y)

This is the conditional entropy of X given Y.

Conditional entropy – properties
If X ⊥ Y then H(X | Y) = H(X) and H(Y | X) = H(Y).
H(X | Y) ≤ H(X): learning the value of Y cannot increase the uncertainty about X.
H(X | Y) = H(X, Y) − H(Y).
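To make these definitions concrete, here is a small, self-contained Python check on a toy joint distribution (the numbers are illustrative, not from the lecture). It verifies numerically that I(X, Y) = H(X) + H(Y) − H(X, Y) and H(X | Y) = H(X, Y) − H(Y).

```python
import numpy as np

# Toy joint distribution P(X, Y) over X in {0,1}, Y in {0,1} (illustrative numbers).
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def H(p):
    """Shannon entropy in bits of a probability vector/matrix (0 log 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Px, Py = P.sum(axis=1), P.sum(axis=0)      # marginal distributions P(X), P(Y)
H_X, H_Y, H_XY = H(Px), H(Py), H(P)        # H(X), H(Y), H(X, Y)
I_XY = H_X + H_Y - H_XY                    # mutual information I(X, Y)
H_X_given_Y = H_XY - H_Y                   # conditional entropy H(X | Y)

print(f"H(X)={H_X:.4f}  H(Y)={H_Y:.4f}  H(X,Y)={H_XY:.4f}")
print(f"I(X,Y)={I_XY:.4f}  H(X|Y)={H_X_given_Y:.4f}")
```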
Back to decision trees – test selection criteria
Pick a test which gives the most information about Y, i.e. which gives the largest decrease in uncertainty about Y. We can now say this formally.

Information gain
Class entropy before the split: H(Y). Class entropy after the split on a test T: H(Y | T) (note that T is a random variable). The information gain is

    Hgain(T) = H(Y) − H(Y | T)

Splitting criterion: pick a test with the largest information gain.
Properties:
Hgain(T) ≥ 0: making a test cannot increase the uncertainty about Y (not really true in practice!).
Hgain(T) = 0 iff Y ⊥ T.
Hgain(T) is maximal iff Y is a function of T (no uncertainty is left after the split).

[Figure: a root test X1 < 5 with class entropy H(Y) before the split and entropies H(Y | X < 5), H(Y | X ≥ 5) in the two branches.] For this test,

    H(Y | T) = P(X ≥ 5) H(Y | X ≥ 5) + P(X < 5) H(Y | X < 5)
    Hgain(T) = H(Y) − H(Y | T)

Problems with information gain
Suppose one of the attributes is PESEL (the Polish personal ID number) or SSN, so each record has a unique value of this attribute.
Question: what is H(Y | PESEL)? Answer: H(Y | PESEL) = 0, so the information gain is maximal.
Question: is PESEL a good choice for a test? Answer: it is completely useless.
Problem: information gain is biased towards tests with many outcomes, and may select useless attributes if they have many values.

Tests with many outcomes, solution 1: gain ratio
Invented by Quinlan (1980s). The gain ratio is defined as

    Hratio(T) = Hgain(T) / H(T)

Tests with many outcomes will have high entropy (up to log k), so dividing by their entropy should correct the bias.

Gain ratio – a problem
A side effect is that if H(T) is low (when does this happen?), a test may be selected regardless of its predictive value.
Solution: compute the average information gain of all tests, consider only tests whose Hgain is above average, and out of those pick the one with the highest gain ratio (Hratio).
It's an inelegant hack... but it works in practice. There are many such hacks in decision tree learning... but they work in practice.
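As an illustration of the many-outcomes bias and the gain-ratio correction, here is a small Python sketch (the data are made up): a PESEL-like unique-ID attribute reaches maximal information gain, while its gain ratio is far lower than that of a genuinely informative two-valued attribute.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    """Hgain(T) = H(Y) - H(Y | T), estimated from data."""
    n = len(labels)
    cond = sum(
        feature.count(v) / n * entropy([y for x, y in zip(feature, labels) if x == v])
        for v in set(feature)
    )
    return entropy(labels) - cond

def gain_ratio(feature, labels):
    """Hratio(T) = Hgain(T) / H(T)."""
    return info_gain(feature, labels) / entropy(feature)

labels    = ["Good", "Bad", "Good", "Bad", "Good", "Bad", "Good", "Bad"]
useful    = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]  # predictive, 2 outcomes
unique_id = [str(i) for i in range(len(labels))]                  # PESEL-like, all values unique

for name, feat in [("useful", useful), ("unique_id", unique_id)]:
    print(name, "gain:", round(info_gain(feat, labels), 3),
          "gain ratio:", round(gain_ratio(feat, labels), 3))
```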
Tests with many outcomes, solution 2: use only binary splits
Used by Breiman et al. in their CART algorithm. Use only binary splits/tests: for numerical attributes, splits are always binary, X < a; for categorical attributes, use tests of the form X ∈ B, where B ⊂ Dom(X). B is found by checking all possible subsets, with time complexity O(2^|Dom(X)|), but |Dom(X)| is usually small.

Other splitting criteria – the Gini index
Let X be a random variable with P(X) = (p1, ..., pk). The Gini index is defined as

    Gini(X) = 1 − ∑_{i=1}^{k} p_i²

0 ≤ Gini(X) ≤ 1 − 1/k. Gini(X) = 0 iff P(X) = (0, ..., 0, 1, 0, ..., 0), and it is maximal for the uniform distribution.
Interpretation: the classification error made when picking the target value at random from its distribution.
In fact the Gini index behaves similarly to entropy. [Plot: H(X) and Gini(X) as functions of p1 for P(X) = (p1, p2).]
The conditional Gini index and Gini gain are defined in the same way as for entropy, and have similar properties to conditional entropy and information gain. In practice both criteria work equally well; Gini is slightly easier to compute and has better statistical properties for p ≈ 0 and p ≈ 1, while entropy has a nicer interpretation.

Other splitting criteria – the χ² test
Used by Kass in his CHAID algorithm. For a given test T, apply the χ² test for independence between Y and T, and pick the test with the lowest p-value. This is a well-known statistical procedure and gives a natural way to stop tree construction early (although the problem of multiple tests makes it more difficult).

Other splitting criteria
There are several other splitting criteria available in the literature; a nice one is the permutation tests used in the party package in R. In practice most of them work equally well. The pruning method (our next topic) has a much larger impact on tree quality than the splitting criterion.

Overfitting
A model adapts to a specific dataset but does not generalize to future data. In other words, it models random fluctuations of the specific training set, not true relationships between attributes.
Extreme example: a decision tree with a test based on PESEL in the root achieves 100% accuracy on the training set but is useless on future data.
Preventing overfitting is the most important problem of Machine Learning.
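The PESEL example can be reproduced with any off-the-shelf tree learner. The sketch below uses scikit-learn on random data (so the numbers are illustrative, not from the lecture): an unpruned tree fitted to a purely random target reaches perfect training accuracy while staying near chance level on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))          # 5 random, uninformative features
y = rng.integers(0, 2, size=n)       # random binary class: there is nothing to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)   # fully grown, no pruning or early stopping
tree.fit(X_tr, y_tr)

print("training accuracy:", tree.score(X_tr, y_tr))             # 1.0: the tree memorizes the noise
print("test accuracy:    ", round(tree.score(X_te, y_te), 3))   # ~0.5: no generalization
```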
Statistical issues in decision tree learning
Recall that when a decision tree is built, the training set is split at every step. The deeper we are in the tree, the smaller the training sample becomes, and at lower levels of the tree decisions are made pretty much at random. This will result in overfitting!
There are two ways to counteract this phenomenon: early stopping criteria and pruning. The second strategy helps a lot. Still, the reduction in sample size at lower levels of the tree is a problem of the approach, and newer classification methods are often more accurate.

Early stopping criteria
Idea: stop building the tree early, i.e. stop when we are no longer able to choose a test reliably. Leaves will then contain more examples, so they can be reliably labeled.
Stopping rules:
1. the height of the tree is greater than some user-specified value
2. the number of training cases in the current node is < n0
3. Hgain < ε, for a user-specified threshold ε
4. a statistical test shows no significant correlation between T and Y
Simple early stopping criteria are commonly used in methods based on ensembles of decision trees.

Pruning decision trees

Pruning
Early stopping criteria were believed to work poorly for single decision trees, and pruning was advised instead. Now the topic is largely abandoned (I marked it optional). Pruning is still useful if you want interpretable models.

Early stopping criteria – why they may not work well
Exercise: find a joint probability distribution P(A, B, Y) such that:
1. A and Y are statistically independent
2. B and Y are statistically independent
3. A and B taken together determine the value of Y
Even if no single variable significantly improves prediction, their combination may give significant improvements; most early stopping criteria would halt too early. (A numerical check of one such distribution follows below.)
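One classical answer to the exercise (given here as an illustration, not necessarily the one intended in the lecture) is the XOR construction: A and B are independent fair coins and Y = A XOR B. The sketch below checks numerically that A and B individually carry zero information about Y, while together they determine it completely.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution P(A, B, Y) with A, B independent fair coins and Y = A XOR B.
P = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        P[a, b, a ^ b] = 0.25

P_AY = P.sum(axis=1)   # marginal P(A, Y)
P_BY = P.sum(axis=0)   # marginal P(B, Y)
P_A, P_B, P_Y = P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))

I_AY = H(P_A) + H(P_Y) - H(P_AY)          # mutual information I(A, Y)
I_BY = H(P_B) + H(P_Y) - H(P_BY)          # mutual information I(B, Y)
H_Y_given_AB = H(P) - H(P.sum(axis=2))    # H(Y | A, B) = H(A, B, Y) - H(A, B)

print("I(A,Y)   =", I_AY)           # 0.0: A alone tells us nothing about Y
print("I(B,Y)   =", I_BY)           # 0.0: B alone tells us nothing about Y
print("H(Y|A,B) =", H_Y_given_AB)   # 0.0: A and B together determine Y
```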
Pruning decision trees
A better idea: pruning. How does it work? Build a complete tree, then remove the parts of the tree which do not bring improvements in accuracy. Pruning significantly improves accuracy on future data, and the pruning strategy typically has a much bigger impact on the quality of a decision tree than the test selection criterion.

Pruning – overview
Pruning a subtree rooted at node N:
1. remove all nodes below N
2. replace N with a leaf labeled with the majority class in the training data falling into N (majority class = most frequent class)
Overall algorithm:
1. iterate over all nodes N level by level, bottom to top
2. if keeping the subtree rooted at N does not result in a decrease of classification error:
3. prune the subtree rooted at N
How to estimate the classification error? The error on the training set is no good.

Pruning – major approaches
Three main approaches to estimating the classification error: based on a validation set, based on confidence intervals for the error, and cost-complexity pruning (with a validation set or cross-validation). The last approach also affects the pruning order.

Validation error based pruning
Split the original training set into two parts: a training set for building trees and a validation set for calculating the error during pruning.
The algorithm:
1. iterate over all nodes N level by level, bottom to top
2. compute the error of the whole subtree rooted at N on the validation set
3. compute the error of the single leaf N on the validation set
4. if the subtree's validation error is ≥ that of the single leaf:
5. prune the subtree rooted at N
Advantages: easy to implement, and good statistical properties (the validation set is an independent sample). Disadvantage: a large part of the available data (e.g. 30%) is not used for training, which is a problem for small datasets.

Pruning based on confidence intervals
A method proposed by Quinlan in 1987. Classification errors in leaves are estimated using confidence intervals. The number nE of misclassified records in a leaf follows a binomial distribution with parameters n (the total number of records in the leaf) and E (the probability of an incorrect classification, i.e. the error). We can obtain a confidence interval (L, U) for E, using either the exact binomial interval or the normal approximation (which does not work well for probabilities close to 0 or 1).
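For illustration, the upper end of an exact (Clopper-Pearson) binomial interval can be computed as below. This is a generic sketch of the idea of using a pessimistic upper bound on the leaf error, not necessarily the exact formula used in C4.5; the function name and the choice of a one-sided bound are assumptions made for this example.

```python
from scipy.stats import beta

def upper_error_bound(errors, n, cf=0.25):
    """Exact binomial (Clopper-Pearson) upper bound on a leaf's true error rate:
    the error rate U for which observing `errors` or fewer mistakes out of `n`
    records has probability `cf`."""
    if errors >= n:
        return 1.0
    return beta.ppf(1.0 - cf, errors + 1, n - errors)

# A leaf with 1 error out of 12 training records: the pessimistic estimate
# is noticeably larger than the observed error rate 1/12 ≈ 0.083.
print(upper_error_bound(1, 12, cf=0.25))
```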
What did Quinlan propose? Use only the original training dataset, obtain 25% confidence intervals for the errors of the pruned and unpruned subtree, and use the upper ends of the intervals as error estimates. The rest is just the standard pruning algorithm.
Advantages: only the training set is used, so all data goes into model building, and it works well in practice. Disadvantages are theoretical: why 25%? (because it worked for Quinlan), and multiple tests are not taken into account. But again, it works well in practice.

Cost-complexity pruning
Proposed by Breiman and Stone in 1978, and later extended. The cost of a tree is its misclassification error on the training set (this can easily be extended to real costs); the complexity of a tree is its number of leaves; α ≥ 0 is the complexity parameter. The cost-complexity measure of a tree T is

    Rα(T) = Cost(T) + α · Complexity(T)

Let T_max be the fully built tree which we want to prune, and let Tα be the smallest tree such that Tα is obtained from T_max by pruning and Rα(Tα) is minimal. By changing the value of α we get a whole family of trees: α = 0 gives the largest tree, α = ∞ gives just a single leaf, and intermediate values of α give trees of decreasing complexity.
Note that the tree changes only at a few values of α. Breiman et al. give an algorithm for finding these α's, in which each successive tree in the sequence is obtained by cutting the "weakest link", and they prove that the Tα form a nested sequence. Note that more than one node can be pruned in one step. So we have a nested sequence of trees of decreasing complexity.

Cost-complexity pruning – the 1-SE rule
Q: How do we pick the best tree from this sequence? A: Use a validation set or cross-validation (next lecture). The accuracy of neighboring trees is often similar, so which one should we choose?
The 1-SE rule:
1. pick the tree T* with the smallest validation error
2. compute the standard deviation of the error of this tree
3. pick the smallest tree whose error is less than the error of T* plus one standard deviation

Cost-complexity pruning – example
Diabetes data from the UCI Machine Learning Repository: data on Indian women from the Pima tribe, who either tested positive or negative for diabetes.
The initial tree is fairly big, hard to understand, and not much more accurate than simpler trees. [Figure: cost-complexity plot.] The pruned tree is much smaller, easier to understand, and its accuracy on new data is similar to or better than that of bigger trees.
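Cost-complexity pruning is available in common libraries. The sketch below uses scikit-learn (with a bundled dataset rather than the Pima diabetes data shown in the lecture): cost_complexity_pruning_path returns the sequence of effective α values at which the pruned tree changes, and refitting with each ccp_alpha yields the nested family of trees from which one can be chosen on a validation set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# The alphas at which the optimally pruned tree changes (the nested sequence).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    acc = tree.score(X_val, y_val)      # validation accuracy guides the choice
    if acc >= best_acc:                 # ties go to larger alpha, i.e. the smaller tree
        best_alpha, best_acc = alpha, acc

print(f"chosen alpha={best_alpha:.5f}, validation accuracy={best_acc:.3f}")
```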
Other aspects of decision tree learning
Two further issues: numerical attributes and missing values.

Numerical attributes
The tests have the form X < a, but how do we pick a? Sort the values of X in the dataset and pick the midpoint between each two consecutive different values of X as a potential splitting point. Each numerical attribute therefore introduces many possible tests, and the test with the highest value of the splitting criterion (among all attributes) is picked. (A short code sketch of this procedure is given at the end of this section.)

Missing values
Data often contains missing values, which are a problem for many Machine Learning algorithms. One can remove records or attributes with missing values, but this leads to data loss. There are two ways to handle missing values in decision trees: weighting (Quinlan) and surrogate splits (Breiman et al.).

Missing values – weighting
Splitting criteria are computed by simply ignoring the missing values. When the data is split, a record with a missing value in the test attribute is sent down all branches, with weights proportional to the distribution of test outcomes in the training sample. When a new record with a missing value in the test attribute is classified, the record is sent down all branches of the tree and the predictions from all branches are combined using the same weights (weighted voting).

Missing values – surrogate splits
When a tree is being built and we select a test for the root of the tree, Breiman et al. suggest also finding a number of tests which are most similar to the selected one but split on different attributes; these are called surrogate splits. When using the tree, if we cannot perform a test because of a missing value, we try each of the surrogate splits in turn.
Surrogate splits also help in tree interpretation. We may try to explain the data by looking at the tests in the non-leaf nodes of the tree, but this can be misleading: there may be many correlated attributes, and the one which ended up in the root is not necessarily causally related to the class. Solution: look at the surrogate splits; they show which attributes may be equally important.

Concrete algorithms
Some important concrete algorithms described in the literature and available in software packages: ID3, CHAID, CART, and C4.5.

ID3 (Iterative Dichotomizer)
Developed by Quinlan; historically the first. Splitting is based on information gain; there is no pruning or early stopping. Not used anymore.

CHAID (Chi-square Automatic Interaction Detector)
Developed by Kass (1980). Not popular anymore, but still used by some business software. The splitting criterion is based on the chi-square test, and the stopping criterion on the same test. It follows more or less the way a statistician analyzes data: test hypotheses by performing statistical tests and stop when no more significant relationships can be found. It does not correct for multiple tests.

CART (Classification and Regression Trees)
Developed by Breiman, Friedman, Olshen and Stone (1984); the second most popular algorithm. Splitting criterion: Gini gain. Cost-complexity pruning using cross-validation. Missing values handled with surrogate splits.

C4.5
Developed by Quinlan (1986); probably the most popular algorithm. Splitting criterion: gain ratio. Pruning based on confidence intervals. Missing values handled with weighting. A new version, C5.0, is available, but it is a commercial product whose details are not published ⇒ nobody uses it.
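Returning to the numerical-attribute splits described earlier in this section, here is a minimal sketch (toy data, no library assumptions) of how candidate thresholds are generated: sort the observed values and take the midpoint between each pair of consecutive distinct values.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values of a numerical attribute.
    Each midpoint a defines one candidate test X < a."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]

# Toy column of a numerical attribute (repeated values collapse to one threshold).
print(candidate_thresholds([2.0, 7.5, 3.0, 7.5, 1.0]))   # [1.5, 2.5, 5.25]
```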
Decision trees – further developments
Regression trees, more complex leaves (model trees), more complex splits, and unbiased splits.

Regression trees
Decision trees can also be used when the target variable Y is numerical. In Breiman et al.'s approach each leaf contains a constant value and the splitting criterion is based on variance. Other approaches put regression models in the leaves (and not just for regression, see the next slide).

More complex leaves – model trees
Idea: each leaf contains a whole classification or regression model. Some examples: M5 regression trees by Quinlan, where each leaf contains a regression model; NBTree, where each leaf contains a Naive Bayes classifier (next lecture); and LMT (Logistic Model Trees), where each leaf contains a logistic regression model.

More complex splits
When two classes can be separated by a slanted hyperplane, a single linear model will do, but a large decision tree is necessary. Idea: do not just split on single variables; split on several variables simultaneously. For example, Breiman et al. proposed tests of the form w1 X1 + w2 X2 + ... + wl Xl > θ for numerical attributes, and conjunctions of (possibly negated) binary variables. This is not popular in practice: when the trick works, regression will often work just as well.

Unbiased decision trees
The splitting criteria described so far are "biased". They are biased towards tests with a large number of outcomes (using the gain ratio does not completely remove the bias), and towards numerical attributes: each such attribute introduces many tests, so there is more chance that one of them will be selected. Idea: use p-values of statistical tests to pick attributes. QUEST was the first such approach; it uses p-values based on ANOVA and is not completely unbiased.
A better (at least more unbiased) approach: Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006), "Unbiased Recursive Partitioning: A Conditional Inference Framework", Journal of Computational and Graphical Statistics, 15(3), 651–674. http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
They use splitting criteria based on permutation tests, implemented in R in the party package.

Literature
1. Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone, Classification and Regression Trees, Chapman & Hall, 1984.
2. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
3. David MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003. http://www.inference.phy.cam.ac.uk/itprnn/book.pdf