
Document Details


Ajou University

So Yeon Kim

Tags

decision trees, machine learning, artificial intelligence, data analysis

Summary

This presentation explains decision trees, a machine learning algorithm. It covers examples with discrete inputs, how to measure uncertainty using entropy and information gain, and how trees are built by greedily choosing splits. The document is for an undergraduate course.

Full Transcript

Decision Trees
의료 인공지능 (Medical Artificial Intelligence)
So Yeon Kim
*Slides adapted from Roger Grosse

Decision Trees
Decision trees make predictions by recursively splitting on different attributes according to a tree structure.

Example with Discrete Inputs
What if the attributes are discrete?
The tree decides whether to wait (T) or not (F).

Decision Trees
Internal nodes test attributes.
Branching is determined by attribute value.
Leaf nodes are outputs (predictions).

Decision Trees
Each path from the root to a leaf defines a region R_m of input space.
Let {(x_{m1}, t_{m1}), ..., (x_{mk}, t_{mk})} be the training examples that fall into R_m.
Classification tree (discrete output): the leaf value y_m is typically set to the most common value in {t_{m1}, ..., t_{mk}}.
Regression tree (continuous output): the leaf value y_m is typically set to the mean value in {t_{m1}, ..., t_{mk}}.

Expressiveness
Discrete-input, discrete-output case: decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row maps to a path from root to leaf.
Continuous-input, continuous-output case: decision trees can approximate any function arbitrarily closely.
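
To make this structure concrete, here is a minimal sketch (not from the slides) of a tree as a data structure and of prediction by recursive splitting on attribute values. The Node class, the predict helper, and the Patrons/Hungry attributes are illustrative assumptions loosely based on the classic "wait or not" example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class Node:
    attribute: Optional[str] = None                 # attribute tested at an internal node
    children: Optional[Dict[Any, "Node"]] = None    # one child per attribute value
    leaf_value: Any = None                          # prediction stored at a leaf

def predict(node: Node, x: Dict[str, Any]) -> Any:
    """Follow the path from the root to a leaf, branching on attribute values."""
    if node.children is None:                       # leaf node: output its stored prediction
        return node.leaf_value
    child = node.children[x[node.attribute]]        # branch determined by the attribute value
    return predict(child, x)

# A tiny hand-built tree for the "wait (T) or not (F)" example; the attribute
# names Patrons and Hungry are assumptions, not copied from the slides.
tree = Node(attribute="Patrons", children={
    "None": Node(leaf_value=False),
    "Some": Node(leaf_value=True),
    "Full": Node(attribute="Hungry", children={
        True: Node(leaf_value=True),
        False: Node(leaf_value=False),
    }),
})

print(predict(tree, {"Patrons": "Full", "Hungry": True}))   # -> True
```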

Learning Decision Trees
How do we construct a useful decision tree?
Learning the simplest (smallest) decision tree is an NP-complete problem, so we resort to a greedy heuristic:
Start from an empty decision tree.
Split on the "best" attribute.
Recurse.
Which attribute is the "best"?

Choosing a Good Split
Is this split good?
Idea: use the counts at the leaves to define probability distributions, so we can measure uncertainty. We will use techniques from information theory.

Quantifying Uncertainty
Entropy is a measure of expected "surprise": H(X) = -sum_x p(x) log2 p(x).
It measures the information content of each observation; the unit is bits.
A fair coin flip has 1 bit of entropy.

Entropy
"High entropy": the variable has a uniform-like distribution, its histogram is flat, and values sampled from it are less predictable.
"Low entropy": the distribution has distinct peaks and valleys, its histogram has many highs and lows, and values sampled from it are more predictable.
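
A short sketch (mine, not part of the slides) that computes entropy in bits and reproduces the fair-coin example:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit (high entropy, least predictable)
print(entropy([0.9, 0.1]))    # peaked distribution: ~0.47 bits (more predictable)
print(entropy([0.25] * 4))    # uniform over four values: 2.0 bits
```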

Entropy of a Joint Distribution
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}.

Conditional Entropy
What is the entropy of cloudiness Y, given that it is raining? This is the specific conditional entropy H(Y | X = raining).
The expected conditional entropy: H(Y|X) = sum_x p(x) H(Y | X = x).
What is the entropy of cloudiness, given the knowledge of whether or not it is raining? This is H(Y|X).

Conditional Entropy
Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X).
If X and Y are independent, then X doesn't tell us anything about Y: H(Y|X) = H(Y).
But Y tells us everything about Y: H(Y|Y) = 0.
By knowing X, we can only decrease uncertainty about Y: H(Y|X) <= H(Y).

Information Gain
How much information about cloudiness do we get by discovering whether it is raining?
IG(Y|X) = H(Y) - H(Y|X). This is called the information gain in Y due to X, or the mutual information of Y and X.
If X is completely uninformative about Y: IG(Y|X) = 0.
If X is completely informative about Y: IG(Y|X) = H(Y).

Information Gain
Information gain measures the informativeness of a variable.
What is the information gain of this split?
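
As an illustration of these quantities (and of measuring the informativeness of a split), here is a small sketch that computes H(Y), H(Y|X), and IG(Y|X) from a joint distribution over rain and cloudiness. The numbers in the table are assumptions for illustration, since the slides' actual figures are not reproduced in this transcript.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative joint distribution P(X, Y); the values are assumptions.
joint = {
    ("raining", "cloudy"): 0.24,
    ("raining", "not cloudy"): 0.01,
    ("not raining", "cloudy"): 0.25,
    ("not raining", "not cloudy"): 0.50,
}

# Marginals p(x) and p(y) from the joint distribution.
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

h_y = entropy(p_y.values())                     # H(Y)

# Expected conditional entropy H(Y|X) = sum_x p(x) H(Y | X = x).
h_y_given_x = sum(
    px * entropy([joint[(x, y)] / px for y in p_y]) for x, px in p_x.items()
)

info_gain = h_y - h_y_given_x                   # IG(Y|X) = H(Y) - H(Y|X)
print(f"H(Y) = {h_y:.3f}, H(Y|X) = {h_y_given_x:.3f}, IG(Y|X) = {info_gain:.3f}")
```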

Constructing Decision Trees
At each level, one must choose:
1. Which variable to split on.
2. Possibly where to split it.
Choose them based on how much information we would gain from the decision: pick the attribute that gives the highest information gain.

Decision Tree Construction Algorithm
A simple, greedy, recursive approach that builds up the tree node by node:
1. Pick an attribute to split on at a non-terminal node.
2. Split the examples into groups based on the attribute value.
3. For each group: if there are no examples, return the majority class from the parent; else if all examples are in the same class, return that class; else loop back to step 1.

What Makes a Good Tree?
Not too small: it needs to handle important but possibly subtle distinctions in the data.
Not too big: for computational efficiency (avoid redundant, spurious attributes), to avoid over-fitting the training examples, and for human interpretability.
We desire small trees with informative nodes near the root.

Summary
Advantages:
Good when there are lots of attributes but only a few are important.
Good with discrete attributes.
Easily deals with missing values (just treat "missing" as another value).
Robust to the scale of the inputs.
Fast at test time.
Interpretable.

Summary
Problems:
You have exponentially less data at lower levels of the tree.
Too big a tree can overfit the data.
Greedy algorithms don't necessarily yield the global optimum.
Handling continuous attributes: split based on a threshold, chosen to maximize information gain.
Decision trees can also be used for regression on real-valued outputs: choose splits to minimize squared error rather than maximize information gain.
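
The following is a minimal sketch, under assumed names (build_tree, information_gain), of this greedy recursive procedure for discrete attributes; it is not the course's reference implementation. Continuous attributes would instead be split on a threshold chosen to maximize the gain, and a regression variant would choose splits to minimize squared error.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy (in bits) of the empirical class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """H(labels) minus the expected entropy after splitting on `attribute`."""
    h_after = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [t for ex, t in zip(examples, labels) if ex[attribute] == value]
        h_after += (len(subset) / len(labels)) * entropy_of_labels(subset)
    return entropy_of_labels(labels) - h_after

def build_tree(examples, labels, attributes, parent_majority=None):
    """Greedy recursive construction: split on the highest-gain attribute."""
    if not examples:                          # no examples: return majority from parent
        return parent_majority
    if len(set(labels)) == 1:                 # all examples in the same class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                        # nothing left to split on
        return majority
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        children[value] = build_tree([examples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     remaining, majority)
    return (best, children)                   # internal node: attribute + one subtree per value

# Made-up toy data reusing the illustrative Patrons/Hungry attributes.
examples = [{"Patrons": "Some", "Hungry": False}, {"Patrons": "Full", "Hungry": True},
            {"Patrons": "None", "Hungry": False}, {"Patrons": "Full", "Hungry": False}]
labels = [True, True, False, False]
print(build_tree(examples, labels, ["Patrons", "Hungry"]))
```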

Thank You! Q&A