Lecture B2 - Decision trees
Decision trees

What are decision trees?

What is a decision tree?
- A decision tree represents a decision procedure: you start with one question; the answer to that question determines which question you should ask next, and so on, until you reach a decision.
- We will usually call the questions "tests", and the decision a "prediction".

Example decision tree
- Do I want to play tennis today, or not? The tree below represents one possible decision procedure (which takes the weather conditions into account).
- Outlook?
  - Sunny → Humidity? (High → No; Normal → Yes)
  - Overcast → Yes
  - Rainy → Wind? (Strong → No; Weak → Yes)

What does the tree represent?
- Assume given a set of input attributes X1, X2, …, Xn and a target attribute Y.
- The tests can involve any Xi; the decision assigns a value to Y.
- We will denote the input space by X; points in X are vectors x = (x1, x2, …, xn).
- Any x is mapped to exactly one value y for Y. Hence, the tree represents a function from X to Y.

Illustration
- Here, X1 = Outlook, with possible values {Sunny, Overcast, Rainy}; X2 = Humidity (High, Normal); X3 = Wind (Strong, Weak); Y = Tennis (Yes, No).
- The tree above represents a function Outlook × Humidity × Wind → Tennis (a small code sketch of this tree appears at the end of this section).

Representation power
- Assume that the input attributes X1, X2, …, Xn and the output attribute Y are boolean (true/false). Trees then represent boolean functions.
- Can every imaginable boolean function be represented? Try to write down a tree for:
  - ∨, the OR function: X1 ∨ X2 = true if X1, or X2, or both are true
  - ∧, the AND function: X1 ∧ X2 = true if X1 and X2 are both true
  - ⊕, the XOR function: X1 ⊕ X2 = true if exactly one of X1 and X2 is true
  - (A ∧ B) ∨ (C ∧ ¬D ∧ E) (¬ means NOT: ¬D = true if D = false)

Continuous input attributes
- What if one of the input attributes is continuous? E.g., X1 is a number between 0 and 1.
- We cannot make a different child node for each possible value: tests must have a finite number of possible outcomes.
- Solution: use comparative tests, e.g., X1 < 0.5 (two possible outcomes: yes, no).

Types of trees
- When Y is nominal, the tree is called a classification tree.
- When Y is numerical, the tree is called a regression tree.
- (There are other types of trees: see later.)

Why trees?
- If we want to learn a predictive function f : X → Y, why would we want to learn it in the form of a decision tree? There are many alternatives: neural networks, support vector machines, probabilistic models, …
- Main reasons (as we will see later on):
  - learning and using trees is very efficient
  - they tend to have good predictive accuracy, even better when combined into ensembles (see elsewhere)
  - they are interpretable: we can understand what the tree says, and we can explain its predictions
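To make the "tree = function from X to Y" idea concrete, here is a minimal Python sketch (not part of the original slides): it encodes the tennis tree above as a nested dictionary and follows tests until a leaf is reached. The dict-based representation and the name predict are illustrative choices, not the lecture's notation.

```python
# Illustrative sketch (not from the slides): the tennis tree above as a nested dict.
# An internal node is {"test": attribute, "branches": {outcome: subtree, ...}};
# a leaf is simply a value for Y.
tennis_tree = {
    "test": "Outlook",
    "branches": {
        "Sunny":    {"test": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy":    {"test": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def predict(tree, x):
    """Map an input vector x (a dict of attribute values) to exactly one value of Y."""
    while isinstance(tree, dict):                     # keep answering tests...
        tree = tree["branches"][x[tree["test"]]]
    return tree                                       # ...until a leaf (the prediction) is reached

print(predict(tennis_tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```

A continuous test such as X1 < 0.5 fits the same scheme: the test simply has the two outcomes True and False, each with its own branch.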
How can we learn decision trees from data?

The basic learning algorithm

Learning decision trees
- A decision tree represents a function T : X → Y. Predictive learning is about learning such functions from data.
- We assume given: the existence of some unknown function f : X → Y, and a dataset D containing "examples" (x, f(x)) with x ∈ X.
- We want to find: a tree T : X → Y that fulfills some criterion C (C depends on the task).

Learning decision trees: task 1
- Find the smallest tree T such that ∀(x, f(x)) ∈ D : T(x) = f(x) (= the smallest tree consistent with the data).
- Useful: it uncovers the link between X and Y in the data set and tells us under what conditions Y takes certain values; this provides insight into the data (smallest tree = simplest explanation).
- Mostly of theoretical interest: often, D is only a sample from some larger distribution 𝒟, and we want T to be accurate over 𝒟, not only over D.

Learning decision trees: task 2
- Find the tree T such that, for x drawn from 𝒟, T(x) is (on average) maximally similar to f(x).
- This is formally expressed using a loss function ℓ : Y × Y → ℝ; ℓ(y1, y2) indicates how bad it is to predict y1 when you should have predicted y2 (the higher the number, the worse).
- The risk R of T, relative to f, is R(T) = E_{x∼𝒟}[ℓ(T(x), f(x))] (the expected value of ℓ(T(x), f(x)), with x drawn from 𝒟).
- Task, restated: find the tree with minimal risk.
- In practice, we usually do not know 𝒟, so we cannot compute the risk of a tree T, let alone find the tree with minimal risk.
- Practical algorithms for learning decision trees therefore use heuristics to find, using only the data in D, a tree that hopefully is small (so it provides insight into the connection between X and Y) and generalizes well (so it can be used to make predictions outside D).

Hardness of decision tree learning
- Known result from complexity theory: "learning decision trees is NP-hard".
- What is meant is that finding the smallest tree consistent with the data is NP-hard. In practice, that is not what learners try to do.
- We will see later on that practical algorithms for learning decision trees are in fact very efficient.

Learning DTs: the basic principle
- This approach is known as "top-down induction of decision trees" (TDIDT), or as "recursive partitioning" (a code sketch of this loop appears below, after the cocktail example):
  - Start with the full data set D.
  - Find a test such that examples in D with the same outcome for the test tend to have the same value of Y.
  - Split D into subsets, one for each outcome of that test.
  - Repeat this procedure on each subset that is not yet sufficiently "pure" (meaning: not all of its elements have the same Y).
  - Keep repeating until no further splits are possible.

Example: cocktail drinking
- Will I get sick if I drink this, or not?
- The data set D: 8 examples, in tabular form:

  Shape      Color   Content  Sick?
  Cylinder   Orange  25cl     No
  Cylinder   Black   25cl     No
  Coupe      White   10cl     No
  Trapezoid  Green   15cl     No
  Coupe      Yellow  15cl     No
  Trapezoid  Orange  15cl     Yes
  Coupe      Orange  15cl     Yes
  Coupe      Orange  10cl     Yes

- Observation 1: shape is important (figure: the examples grouped by Shape).
- Observation 2: sometimes, color is important too (figure: the non-cylindrical examples grouped by Color).

The decision tree
- Shape?
  - cylindrical → 😀
  - other → Color? (orange → 🤢; non-orange → 😀)

Rule representation of the tree
- Trees can be written as if-then rules, and the rules can be simplified.
- Rules read off the tree:
  - if shape = cylindrical then 😀
  - if shape ≠ cylindrical and color = orange then 🤢
  - if shape ≠ cylindrical and color ≠ orange then 😀
- Simplified:
  - if shape = cylindrical then 😀
  - else if color = orange then 🤢
  - else 😀

Two main questions…
- The basic principle has just been illustrated. Two important details need to be filled in:
  - How to choose the "best" test
  - When to stop splitting nodes
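The recursive-partitioning loop described above can be written down compactly. The sketch below is an illustrative Python skeleton, not the lecture's own code: it reuses the earlier predict() representation, leaves predict the majority Y value, and the heuristic for picking the "best" test (the first of the two questions above) is passed in as the parameter choose_test.

```python
from collections import Counter

def tdidt(D, target, tests, choose_test):
    """Top-down induction of decision trees (recursive partitioning).

    D           -- non-empty list of examples, each a dict of attribute values
    target      -- name of the target attribute Y
    tests       -- list of attribute names that may still be used as tests
    choose_test -- heuristic: (D, target, tests) -> the attribute to split on
    """
    labels = [e[target] for e in D]
    if len(set(labels)) == 1 or not tests:           # pure node, or nothing left to test
        return Counter(labels).most_common(1)[0][0]  # leaf: (majority) value of Y

    best = choose_test(D, target, tests)             # e.g. the test with highest information gain
    subsets = {}
    for e in D:                                      # one subset per outcome of the chosen test
        subsets.setdefault(e[best], []).append(e)

    return {"test": best,
            "branches": {outcome: tdidt(S, target,
                                        [t for t in tests if t != best],
                                        choose_test)
                         for outcome, S in subsets.items()}}
```

For classification, choose_test would typically pick the attribute with the highest information gain, which is exactly what the next section defines.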
How to choose the tests?
- Think of the game "20 questions", where you need to guess which person another player is thinking of by asking questions that can only be answered with "yes" or "no". What is a good criterion for choosing the questions?

Choosing a test: classification
- We focus on classification trees (Y is nominal).
- A good test is a test that carries much information about the class.
- What is information? Information theory defines "entropy", or "missing information": the number of bits needed, on average, to convey a piece of information, if an optimal encoding is used.

Information
- Assume I need to tell you, for 100 objects, what their class is, out of four possibilities (A, B, C, D).
- I could encode A as 00, B as 01, C as 10, D as 11: 2 bits per object needed.
- Can we do better? If all classes are equally likely: no. If some classes are frequent and others are rare: yes! Use a shorter encoding for the more frequent classes, e.g. A: 0, B: 10, C: 110, D: 111.
  - 25% each: on average (1+2+3+3)/4 = 2.25 bits (worse than 2)
  - 50% A, 30% B, 10% C, 10% D: 0.5·1 + 0.3·2 + 0.1·3 + 0.1·3 = 1.7 bits
- More clever encodings can be devised. Given a set of values c1, c2, …, ck with respective probabilities p1, p2, …, pk, an encoding exists that uses, on average, e bits for representing a randomly drawn value, where
  e = − Σ_{i=1..k} p_i log2(p_i)
  (with log2 the binary logarithm: a = log2(b) ⇔ b = 2^a).
- This number e reflects the minimal number of bits you need, on average, to encode one value. It is the inherent information content, or entropy.

Class entropy
- The class entropy of a set S of objects (x, y), where y can be any of k classes c_i, is defined as
  CE(S) = − Σ_{i=1..k} p_i log2(p_i), with p_i = |{(x, y) ∈ S | y = c_i}| / |S|
  (the proportion of elements in S with class c_i).
- It measures how much uncertainty there is about the class of a particular instance.

Entropy: example
- S contains 6 elements with classes A, A, B, B, B, C: CE(S) = −1/3 log2(1/3) − 1/2 log2(1/2) − 1/6 log2(1/6) ≈ 1.46
- S contains 16 elements, 8×A and 8×B: CE(S) = −1/2 log2(1/2) − 1/2 log2(1/2) = 1
- S contains 16 elements, 15×A and 1×B: CE(S) = −15/16 log2(15/16) − 1/16 log2(1/16) ≈ 0.34
- High entropy = "many possibilities, all equally likely". Low entropy = "few possibilities", or possibly "many possibilities, but most are highly unlikely". Entropy measures uncertainty.

Information gain
- Entropy reflects missing information. The information gain of a question is the expected reduction of entropy obtained by learning the answer to the question.
- For classification trees: the expected reduction of class entropy,
  IG(S, t) = CE(S) − E[CE(S_i)] = CE(S) − Σ_{i=1..o} (|S_i| / |S|) · CE(S_i)
  with t a test, o the number of possible outcomes of t, and S_i the subset of S for which the i-th outcome was obtained.

Example
- Tennis: assume the set S has 9 "yes" (+) and 5 "no" (−) examples: S = [9+, 5−], CE(S) = 0.940.
- Test Humidity: High → [3+, 4−] (CE = 0.985), Normal → [6+, 1−] (CE = 0.592).
  IG(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
- Test Wind: Weak → [6+, 2−] (CE = 0.811), Strong → [3+, 3−] (CE = 1.0).
  IG(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
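The entropy and information-gain formulas are easy to check numerically. The following is a small illustrative Python sketch (not from the slides) that reproduces the tennis numbers above; the slight difference for Humidity (0.152 vs. the slide's 0.151) is only a rounding difference.

```python
from math import log2
from collections import Counter

def class_entropy(labels):
    """CE(S) = -sum_i p_i * log2(p_i), with p_i the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """IG(S, t) = CE(S) minus the size-weighted average class entropy of the subsets S_i."""
    n = len(labels)
    return class_entropy(labels) - sum(len(s) / n * class_entropy(s) for s in subsets)

S = ["+"] * 9 + ["-"] * 5                      # the tennis set: [9+, 5-]
humidity = [["+"] * 3 + ["-"] * 4,             # High:   [3+, 4-]
            ["+"] * 6 + ["-"] * 1]             # Normal: [6+, 1-]
wind     = [["+"] * 6 + ["-"] * 2,             # Weak:   [6+, 2-]
            ["+"] * 3 + ["-"] * 3]             # Strong: [3+, 3-]

print(f"{class_entropy(S):.3f}")               # 0.940
print(f"{information_gain(S, humidity):.3f}")  # 0.152 (0.151 on the slide, a rounding difference)
print(f"{information_gain(S, wind):.3f}")      # 0.048
```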
Choosing a test: regression
- Now assume Y is numerical. How would you choose the test now?
- (figure: the Y values {1, 3, 4, 7, 8, 12}, split by candidate test A1 into {1, 4, 12} and {3, 7, 8}, and by candidate test A2 into {1, 3, 4} and {7, 8, 12})

Variance reduction
- Try to create subsets such that the examples in one subset have similar Y values.
- Instead of information gain: variance reduction.
- The variance of Y in a set S of instances (x, y) is
  Var(S) = Σ_{(x,y)∈S} (y − ȳ)² / (|S| − 1), with ȳ = (Σ_{(x,y)∈S} y) / |S|
- Variance reduction is defined similarly to information gain (a small numeric sketch follows below):
  VR(S, t) = Var(S) − Σ_{i=1..o} (|S_i| / |S|) · Var(S_i)

When to stop splitting nodes?
- In principle, keep splitting until all instances in a subset have the same Y value.
- This is useful for Task 1 (no generalization needed), but less useful for Task 2: risk of overfitting.

Tree with tests Y of the form X
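To close the part on regression trees, here is a small illustrative Python sketch (not from the slides) of the variance-reduction criterion, applied to the example with Y values {1, 3, 4, 7, 8, 12}: the candidate split into {1, 3, 4} and {7, 8, 12} keeps similar Y values together and indeed scores much better.

```python
def variance(ys):
    """Var(S) = sum over S of (y - mean)^2, divided by |S| - 1."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)

def variance_reduction(ys, subsets):
    """VR(S, t) = Var(S) - sum_i (|S_i| / |S|) * Var(S_i)."""
    n = len(ys)
    return variance(ys) - sum(len(s) / n * variance(s) for s in subsets)

S = [1, 3, 4, 7, 8, 12]                                          # Y values of the examples
print(round(variance_reduction(S, [[1, 4, 12], [3, 7, 8]]), 2))  # A1: -3.9 (weighted variance even increases)
print(round(variance_reduction(S, [[1, 3, 4], [7, 8, 12]]), 2))  # A2: 11.1 (much better split)
```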