Decision Trees
Tom Mitchell
Summary
This document provides an overview of decision trees, a machine learning algorithm used for classification tasks. It covers the decision tree representation and its expressiveness, entropy and information gain, the ID3 induction algorithm, and the gain ratio used in C4.5, with worked examples from the restaurant domain.
Full Transcript
Decision Trees

Function Approximation: Problem Setting
Set of possible instances X
Set of possible labels Y
Unknown target function f : X → Y
Set of function hypotheses H = {h | h : X → Y}
Input: training examples {(x^(i), y^(i))}_{i=1}^n = {(x^(1), y^(1)), …, (x^(n), y^(n))} of the unknown target function f
Output: hypothesis h ∈ H that best approximates f
Based on slide by Tom Mitchell

Sample Dataset
Columns denote features X_i; rows denote labeled instances (x^(i), y^(i)).
The class label denotes whether a tennis game was played.

Decision Tree
A possible decision tree for the data:
– Each internal node: tests one attribute X_i
– Each branch from a node: selects one value for X_i
– Each leaf node: predicts Y (or p(Y | x ∈ leaf))
Based on slide by Tom Mitchell

Decision Tree
A possible decision tree for the data: what prediction would we make for … ?
Based on slide by Tom Mitchell

Decision Tree
If features are continuous, internal nodes can test the value of a feature against a threshold.

Decision Tree Learning
Problem Setting:
– Set of possible instances X: each instance x in X is a feature vector
– Unknown target function f : X → Y, where Y is discrete-valued
– Set of function hypotheses H = {h | h : X → Y}: each hypothesis h is a decision tree; the tree sorts x to a leaf, which assigns y
Slide by Tom Mitchell

Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {(x^(i), y^(i))}_{i=1}^n
Assumes each x^(i) ∼ D(X) with y^(i) = f_target(x^(i))
Train the model: model ← classifier.train(X, Y)
Apply the model to new data: given a new unlabeled instance x ∼ D(X), y_prediction ← model.predict(x)

Example Application: A Tree to Predict Caesarean Section Risk
Based on example by Tom Mitchell

Decision Tree Induced Partition
[Figure: an example tree that splits on Color, Shape, and Size, together with the partition of the feature space it induces]

Decision Tree – Decision Boundary
Decision trees divide the feature space into axis-parallel (hyper-)rectangles.
Each rectangular region is labeled with one label – or a probability distribution over labels.

Expressiveness
Decision trees can represent any boolean function of the input attributes.
Truth table row → path to leaf.
In the worst case, the tree will require exponentially many nodes.

Expressiveness
Decision trees have a variable-sized hypothesis space: as the number of nodes (or the depth) increases, the hypothesis space grows.
– Depth 1 ("decision stump"): can represent any boolean function of one feature
– Depth 2: any boolean function of two features; some involving three features (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3))
– etc.
Based on slide by Pedro Domingos

Another Example: Restaurant Domain (Russell & Norvig)
Model a patron's decision of whether to wait for a table at a restaurant.
~7,000 possible cases.

A Decision Tree from Introspection
Is this the best decision tree?
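To make the tree representation above concrete, here is a minimal Python sketch of a decision tree and of the "sort x to a leaf" prediction step. The nested-dict encoding, the predict helper, and the tennis-style attributes (Outlook, Humidity, Wind) are illustrative assumptions, not the exact tree shown on the slides.

# A hand-built tree: internal nodes test one attribute, branches select one value,
# and leaves store the predicted label Y. Attribute names are assumptions.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",                     # leaf node: predict Y directly
        "Rain":     {"attribute": "Wind",
                     "branches": {"Weak": "Yes", "Strong": "No"}},
    },
}

def predict(node, x):
    """Sort instance x (a dict mapping attribute -> value) down the tree to a leaf."""
    while isinstance(node, dict):              # internal node: follow the matching branch
        node = node["branches"][x[node["attribute"]]]
    return node                                # leaf: the predicted label

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # -> Yes

A threshold test on a continuous feature could be encoded the same way, with the branch chosen by comparing the feature value against a stored threshold.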
Preference bias: Ockham's Razor
Principle stated by William of Ockham (1285-1347): "non sunt multiplicanda entia praeter necessitatem" – entities are not to be multiplied beyond necessity.
AKA Occam's Razor, the Law of Economy, or the Law of Parsimony.
Idea: the simplest consistent explanation is the best.
Therefore, the smallest decision tree that correctly classifies all of the training examples is best.
Finding the provably smallest decision tree is NP-hard, so instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small.

Basic Algorithm for Top-Down Induction of Decision Trees [ID3, C4.5 by Quinlan]
node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes.
How do we choose which attribute is best?

Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples on. Some possibilities are:
– Random: select any attribute at random
– Least-Values: choose the attribute with the smallest number of possible values
– Most-Values: choose the attribute with the largest number of possible values
– Max-Gain: choose the attribute that has the largest expected information gain, i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children
The ID3 algorithm uses the Max-Gain method of selecting the best attribute.

Choosing an Attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
Which split is more informative: Patrons? or Type?
Based on slide from M. desJardins & T. Finin

ID3-Induced Decision Tree
Based on slide from M. desJardins & T. Finin

Compare the Two Decision Trees
Based on slide from M. desJardins & T. Finin

Information Gain
Which test is more informative: a split over whether Balance exceeds 50K (less than or equal to 50K vs. over 50K), or a split over whether the applicant is employed (unemployed vs. employed)?
Based on slide by Pedro Domingos

Information Gain
Impurity/entropy (informal): measures the level of impurity in a group of examples.
Based on slide by Pedro Domingos

Impurity
[Figure: three groups of examples – a very impure group, a less impure group, and a group with minimum impurity]
Based on slide by Pedro Domingos

Entropy: a common way to measure impurity
Entropy H(X) of a random variable X with n possible values:
H(X) = − ∑_{i=1}^{n} P(X = i) log2 P(X = i)
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
Why? Information theory: the most efficient code assigns −log2 P(X = i) bits to encode the message X = i, so the expected number of bits to code one random X is ∑_i P(X = i) · (−log2 P(X = i)).
Slide by Tom Mitchell
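As a small illustration of the entropy definition above, the following Python sketch computes H(X) from the empirical distribution of a list of labels. The function name and the example label lists are assumptions made for illustration, not taken from the slides.

import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_i P(X = i) * log2 P(X = i), using empirical probabilities."""
    n = len(values)
    probs = [count / n for count in Counter(values).values()]
    return sum(-p * math.log2(p) for p in probs) + 0.0   # + 0.0 avoids printing -0.0 for pure groups

print(entropy(["+", "+", "-", "-"]))   # maximally impure 50/50 group -> 1.0
print(entropy(["+", "+", "+", "+"]))   # pure group -> 0.0 (minimum impurity)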
Example: Huffman code
In 1952, MIT student David Huffman devised, in the course of doing a homework assignment, an elegant coding scheme which is optimal in the case where all symbols' probabilities are integral powers of 1/2.
A Huffman code can be built in the following manner:
– Rank all symbols in order of probability of occurrence.
– Successively combine the two symbols of lowest probability to form a new composite symbol; eventually we will build a binary tree where each node is the probability of all nodes beneath it.
– Trace a path to each leaf, noting the direction at each node.
Based on slide from M. desJardins & T. Finin

Huffman code example
M    code    length    prob     length × prob
A    000     3         0.125    0.375
B    001     3         0.125    0.375
C    01      2         0.250    0.500
D    1       1         0.500    0.500
Average message length: 1.750
[Figure: the Huffman tree for A, B, C, D, with 0/1 branch labels and internal-node probabilities 0.25, 0.5, and 1]
If we use this code to send many messages (A, B, C, or D) with this probability distribution, then, over time, the average bits/message should approach 1.75.
Based on slide from M. desJardins & T. Finin

2-Class Cases
Entropy: H(X) = − ∑_{i=1}^{n} P(X = i) log2 P(X = i)
What is the entropy of a group in which all examples belong to the same class?
– entropy = −1 · log2 1 = 0 (minimum impurity): not a good training set for learning
What is the entropy of a group with 50% in either class?
– entropy = −0.5 · log2 0.5 − 0.5 · log2 0.5 = 1 (maximum impurity): a good training set for learning
Based on slide by Pedro Domingos

Sample Entropy
Slide by Tom Mitchell

Information Gain
We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
Information gain tells us how important a given attribute of the feature vectors is.
We will use it to decide the ordering of attributes in the nodes of a decision tree.
Based on slide by Pedro Domingos

From Entropy to Information Gain
Entropy H(X) of a random variable X: H(X) = − ∑_i P(X = i) log2 P(X = i)
Specific conditional entropy H(X | Y = v) of X given Y = v: H(X | Y = v) = − ∑_i P(X = i | Y = v) log2 P(X = i | Y = v)
Conditional entropy H(X | Y) of X given Y: H(X | Y) = ∑_v P(Y = v) H(X | Y = v)
Mutual information (aka information gain) of X and Y: I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
Slide by Tom Mitchell

Information Gain
Information gain is the mutual information between input attribute A and target variable Y.
Information gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A: Gain(S, A) = H_S(Y) − H_S(Y | A)
Slide by Tom Mitchell

Calculating Information Gain
Information gain = entropy(parent) − [average entropy(children)]
Entire population (30 instances): parent entropy = −(14/30) log2(14/30) − (16/30) log2(16/30) = 0.996
Child 1 (17 instances): child entropy = −(13/17) log2(13/17) − (4/17) log2(4/17) = 0.787
Child 2 (13 instances): child entropy = −(1/13) log2(1/13) − (12/13) log2(12/13) = 0.391
(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615
Information gain = 0.996 − 0.615 = 0.38
Based on slide by Pedro Domingos
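The worked example above can be reproduced with a short Python sketch. The helper name entropy_from_counts is an assumption, and the class counts 14/16 for the parent and 13/4 and 1/12 for the two children are read off the slide; note that the code gives 0.997 for the parent entropy where the slide rounds to 0.996.

import math

def entropy_from_counts(*counts):
    """Entropy of a group given its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

parent   = entropy_from_counts(14, 16)                # parent entropy, 30 instances
child_1  = entropy_from_counts(13, 4)                 # 17-instance child
child_2  = entropy_from_counts(1, 12)                 # 13-instance child
children = (17 / 30) * child_1 + (13 / 30) * child_2  # weighted average entropy of children

print(round(parent, 3), round(child_1, 3), round(child_2, 3))  # 0.997 0.787 0.391
print(round(parent - children, 2))                             # information gain -> 0.38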
Entropy-Based Automatic Decision Tree Construction
Training set X: x1 = (f11, f12, …, f1m), x2 = (f21, f22, …, f2m), …, xn = (fn1, fn2, …, fnm)
Node 1: what feature should be used, and what values?
Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy.
Based on slide by Pedro Domingos

Using Information Gain to Construct a Decision Tree
Choose the attribute A with the highest information gain for the full training set X at the root of the tree.
Construct child nodes for each value v1, v2, …, vk of A. Each child has an associated subset of vectors in which A has a particular value, e.g., X′ = {x ∈ X | value(A) = v1}.
Repeat recursively (till when?).
Disadvantage of information gain: it prefers attributes with a large number of values that split the data into small, pure subsets.
Quinlan's gain ratio uses normalization to improve this.
Based on slide by Pedro Domingos

Restaurant example
Random: Patrons or Wait-time; Least-Values: Patrons; Most-Values: Type; Max-Gain: ???
[Figure: the twelve restaurant examples split by the Type variable (French: Y N; Italian: Y N; Thai: N Y N Y; Burger: N Y N Y) and by the Patrons variable (Empty, Some, Full)]

Computing Information Gain
I(T) = ?   I(Pat, T) = ?   I(Type, T) = ?
Gain(Pat, T) = ?   Gain(Type, T) = ?
Based on slide from M. desJardins & T. Finin

Computing Information Gain
I(T) = −(0.5 log 0.5 + 0.5 log 0.5) = 0.5 + 0.5 = 1
I(Pat, T) = 2/12 · (0) + 4/12 · (0) + 6/12 · (−(4/6 log 4/6 + 2/6 log 2/6)) = 1/2 · (2/3 · 0.6 + 1/3 · 1.6) ≈ 0.47
I(Type, T) = 2/12 · (1) + 2/12 · (1) + 4/12 · (1) + 4/12 · (1) = 1
Gain(Pat, T) = 1 − 0.47 = 0.53
Gain(Type, T) = 1 − 1 = 0
Based on slide from M. desJardins & T. Finin

Splitting Examples by Testing Attributes

Decision Tree Applet
http://webdocs.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?
ID3 performs a heuristic search through the space of decision trees.
It stops at the smallest acceptable tree. Why?
Occam's razor: prefer the simplest hypothesis that fits the data.
Slide by Tom Mitchell

The ID3 algorithm builds a decision tree, given a set of non-categorical attributes C1, C2, …, Cn, the class attribute C, and a training set T of records:

function ID3(R: input attributes, C: class attribute, S: training set) returns decision tree;
  If S is empty, return a single node with value Failure;
  If every example in S has the same value for C, return a single node with that value;
  If R is empty, then return a single node with the most frequent of the values of C
    found in the examples of S;   # causes errors -- improperly classified records
  Let D be the attribute with the largest Gain(D, S) among the attributes in R;
  Let {dj | j = 1, 2, …, m} be the values of attribute D;
  Let {Sj | j = 1, 2, …, m} be the subsets of S consisting of records with value dj for attribute D;
  Return a tree with root labeled D and arcs labeled d1, …, dm going to the trees
    ID3(R − {D}, C, S1), …, ID3(R − {D}, C, Sm);
Based on slide from M. desJardins & T. Finin

How well does it work?
Many case studies have shown that decision trees are at least as accurate as human experts.
– A study on diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly.
– British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms that replaced an earlier rule-based expert system.
– Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.
Based on slide from M. desJardins & T. Finin
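Below is a runnable Python sketch of the ID3 pseudocode above, assuming examples are dicts keyed by attribute name with a parallel list of class labels. The entropy and gain helpers and their names are mine, not from the slides; the structure of the recursion follows the pseudocode, including the majority-class fallback when no attributes remain.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(records, labels, attr):
    """Information gain of splitting (records, labels) on attribute attr."""
    remainder = 0.0
    for v in set(r[attr] for r in records):
        subset = [y for r, y in zip(records, labels) if r[attr] == v]
        remainder += (len(subset) / len(records)) * entropy(subset)
    return entropy(labels) - remainder

def id3(attributes, records, labels):
    if not records:
        return "Failure"                              # S is empty
    if len(set(labels)) == 1:
        return labels[0]                              # every example has the same class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority class (may misclassify)
    best = max(attributes, key=lambda a: gain(records, labels, a))
    tree = {"attribute": best, "branches": {}}
    for v in set(r[best] for r in records):           # one subtree per value of the best attribute
        rows = [i for i, r in enumerate(records) if r[best] == v]
        tree["branches"][v] = id3([a for a in attributes if a != best],
                                  [records[i] for i in rows],
                                  [labels[i] for i in rows])
    return tree

The returned tree uses the same nested-dict format as the earlier prediction sketch, so the two pieces can be combined.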
Extensions of ID3
– Using gain ratios
– Real-valued data
– Noisy data and overfitting
– Generation of rules
– Setting parameters
– Cross-validation for experimental validation of performance
C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on.
Based on slide from M. desJardins & T. Finin

Using gain ratios
The information gain criterion favors attributes that have a large number of values: if we have an attribute A that has a distinct value for each record, then Info(X, A) is 0 and thus Gain(X, A) is maximal.
To compensate for this, Quinlan suggests using the following ratio instead of Gain:
GainRatio(X, A) = Gain(X, A) / SplitInfo(X, A)
SplitInfo(X, A) is the information due to the split of X on the basis of the value of the categorical attribute A:
SplitInfo(X, A) = I(|X1|/|X|, |X2|/|X|, …, |Xm|/|X|), where {X1, X2, …, Xm} is the partition of X induced by the value of A.
Based on slide from M. desJardins & T. Finin

Computing gain ratio
I(X) = 1, I(Pat, X) = 0.47, I(Type, X) = 1
Gain(Pat, X) = 0.53, Gain(Type, X) = 0
SplitInfo(Pat, X) = −(1/6 log 1/6 + 1/3 log 1/3 + 1/2 log 1/2) = 1/6 · 2.6 + 1/3 · 1.6 + 1/2 · 1 = 1.47
SplitInfo(Type, X) = −(1/6 log 1/6 + 1/6 log 1/6 + 1/3 log 1/3 + 1/3 log 1/3) = 1/6 · 2.6 + 1/6 · 2.6 + 1/3 · 1.6 + 1/3 · 1.6 = 1.93
GainRatio(Pat, X) = Gain(Pat, X) / SplitInfo(Pat, X) = 0.53 / 1.47 = 0.36
GainRatio(Type, X) = Gain(Type, X) / SplitInfo(Type, X) = 0 / 1.93 = 0
Based on slide from M. desJardins & T. Finin

Gain Ratio
Gain Ratio = Gain_split / SplitInfo, where SplitInfo = − ∑_{i=1}^{k} (n_i / n) log2(n_i / n)
The parent node p is split into k partitions (children), and n_i is the number of records in child node i.
– Adjusts information gain by the entropy of the partitioning (SplitInfo): higher-entropy partitioning (a large number of small partitions) is penalized.
– Used in the C4.5 algorithm.
– Designed to overcome the disadvantage of information gain.
Example splits on the CarType attribute:
Split                            C1         C2         Gini    SplitInfo
{Family} / {Sports} / {Luxury}   1 / 8 / 1  3 / 0 / 7  0.163   1.52
{Sports, Luxury} / {Family}      9 / 1      7 / 3      0.468   0.72
{Sports} / {Family, Luxury}      8 / 2      0 / 10     0.167   0.97
From Introduction to Data Mining, 2nd Edition
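As a final sketch, the SplitInfo normalization above can be computed directly; the partition sizes (2, 4, 6) for Patrons and (2, 2, 4, 4) for Type, and the gain value 0.53, are taken from the earlier slides, while the function name split_info is an assumption. The small differences from the slide values come from the slides rounding each log term.

import math

def split_info(partition_sizes):
    """SplitInfo = -sum_i (n_i / n) * log2(n_i / n) over the partition sizes."""
    n = sum(partition_sizes)
    return sum(-(ni / n) * math.log2(ni / n) for ni in partition_sizes)

print(round(split_info([2, 4, 6]), 2))         # SplitInfo(Patrons) -> 1.46 (1.47 on the slide)
print(round(split_info([2, 2, 4, 4]), 2))      # SplitInfo(Type)    -> 1.92 (1.93 on the slide)
print(round(0.53 / split_info([2, 4, 6]), 2))  # GainRatio(Patrons) -> 0.36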