Lecture 3: Bias, Rules (2024)
KU Leuven
2024
Summary
This document covers linear regression versus decision trees, inductive bias, and rule learners, with examples from a 2024 machine learning lecture at KU Leuven.
Full Transcript
Lecture 3:
- Decision trees vs. linear regression
- Inductive bias
- Rule learners

Briefly revisiting Linear Regression

Linear regression
"Linear model": Y = a + b1X1 + b2X2 + ... + bkXk
Usually fit such that the sum of squared vertical deviations from the line is minimal (the "least squares" method).
[Figure: scatter plot of Petal.Width vs. Petal.Length for iris flowers.] To predict petal width from petal length, the best possible linear predictor (in the sense of "least squares") is width = 0.416*length - 0.363.

Using linear regression
What can we use such a model for?
- For predicting Y, given the Xi, using the formula
- For understanding how well Y can be predicted from the Xi (i.e., how informative the Xi are)
- For understanding what the effect of each Xi is on Y
- For visualizing the connection between the Xi and Y

Interpreting linear regression
Y = a + b1X1 + b2X2 + ... + bkXk
- bi tells us how much Y goes up if Xi increases by 1 and all Xj, j≠i, remain the same.
- For Y = a + bX (only one predictive variable), the linear correlation r tells us how well the points fit a line, and whether the line is ascending or descending. -1 ≤ r ≤ 1.
- The coefficient of determination R² tells us to what extent Y is determined by the Xi. R² varies between 0 (Y is completely independent of the Xi) and 1 (Y is completely determined).

Interpreting linear regression
Careful with the interpretation of coefficients!
L = 80 - 2C - 1W
with L = life expectancy (years), C = packs of cigarettes / day, W = glasses of wine / day.
Which of the following statements holds?
- "smoking affects life expectancy more strongly than drinking"
- "in the dataset from which this model was derived, non-drinkers lived longer (on average) than people who had a glass per day"
Answer: neither!
- Coefficients are not scale-free (#cigarettes vs. #packs).
- "Multicollinearity": correlations among the Xi.

Important assumptions
This model implicitly assumes:
- The effect of each variable on the target is constant (it does not depend on the other variables).
- The effects of different variables are cumulative (they add up).
(+ some more technical assumptions)
In statistics terminology: "no interaction" (the effects do not interact).

Complex terms
We are free to introduce terms that represent functions of other variables, e.g. X1², sin(X2), X1X2, ...
Example: Y = a + b1X1 + b2X2 + b12X1X2
- b12X1X2 is an "interaction term".
- The "overall" coefficient of X2 is (b2 + b12X1): the effect of X2 on Y depends on X1.

Nominal variables
What if input variables are not numerical, but symbolic (nominal)?
For a nominal Xi with k values, introduce k-1 "0/1" variables, called "indicator" or "dummy" variables:

          blue  red  green  yellow
  Xi,b     1     0     0      0
  Xi,r     0     1     0      0
  Xi,g     0     0     1      0

Trees vs. linear regression & Inductive bias

Assumptions of tree learners
- After splitting on Xi, each branch is developed entirely independently.
- Xj may have a positive effect on Y in the left branch, but a negative one in the right branch.
- There is no assumption whatsoever that "Xj has the same effect everywhere".
From this point of view, decision trees are quite the opposite of linear regression!
(Note: consecutive splitting quickly reduces the size of the training set available for each branch...)
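The contrast between these two biases can be made concrete in code. Below is a minimal sketch (not part of the lecture) using scikit-learn on hypothetical synthetic data with a pure interaction effect; the data, model settings, and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: Y depends on an interaction between X1 and X2,
# which violates the "cumulative effects" assumption of a plain linear model.
X = rng.uniform(-1, 1, size=(500, 2))
y = 3 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=500)

lin = LinearRegression().fit(X, y)                    # bias: additive effects, same everywhere
tree = DecisionTreeRegressor(max_depth=6).fit(X, y)   # bias: per-branch, piecewise-constant effects

print("linear R^2:", round(lin.score(X, y), 2))   # low: a sum a + b1*X1 + b2*X2 cannot represent X1*X2
print("tree   R^2:", round(tree.score(X, y), 2))  # higher: successive splits approximate the interaction

# Adding an explicit interaction term (as on the "Complex terms" slide)
# restores the linear model's fit:
X_int = np.column_stack([X, X[:, 0] * X[:, 1]])
print("linear + X1*X2 term R^2:", round(LinearRegression().fit(X_int, y).score(X_int, y), 2))
```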
Trees vs. linear regression
A, S, C, J: Age, Sex, Country, Job
Y = a + b1A + b2S + b3C + b4J
The linear regression learner: "On average, professor salaries are 500$ higher than postdoc salaries. I'll assume the same holds for each country separately."
Skeptic: "That's silly! You don't know that. It may be different in different countries."
The linear regression learner: "Nah... I don't think so. Until very strong evidence to the contrary is presented, I'll stick to this assumption."
Skeptic: "Wow... you're quite biased!"

Trees vs. linear regression
Y = a + b1A + b2S + b3C + b4J
The tree learner: "Professors earn 500$ more than postdocs if we look at males aged 30-40; males aged 40-50; males aged 50-60; females aged 30-40; females aged 50-60. But what about females aged 40-50? I have no data!"
Skeptic: "Well, surely that'll be the same, no?"
The tree learner: "I'll believe that when I see it with my own eyes, not before."
Skeptic: "You're kind of paranoid!"

Trees vs. linear regression
- Which method will perform best largely depends on your problem.
- Each learning approach has a "bias": the implicit assumptions it makes.
- Learners whose bias fits your problem will perform better.
- Unfortunately, the problem is often not sufficiently understood to decide in a principled way (neither are the learners, in fact!).
- "Cumulative effects" vs. "interaction" is a good start.

Removing all bias?
We can theoretically prove that bias-free learning is impossible.
Wolpert's "no free lunch" theorems:
- For each problem where method A works better than B, there is a problem where B works better than A (using a probabilistic characterization).
- There is no single best method for learning.
Mitchell's argument, in the Version Spaces context:
- The bias of VS is that it assumes conjunctive concepts.
- What happens if we drop that bias and allow any concept?
- Then for any h ∈ VS that predicts + for a new instance x, there is an h' ∈ VS that predicts - for the same x.

Mitchell's proof
- "Allow any concept" means: for any subset S of X, there is a hypothesis that says everything in S (and nothing else) is positive.
- Now take a hypothesis h that corresponds to some S, and h' that corresponds to S ∪ {x}, with x an unseen example (not in the training set).
- Since h' and h differ only on x, and x is not in the training set, h and h' must be both consistent, or both inconsistent, with the training set; so h ∈ VS if and only if h' ∈ VS.
- This means that for each unseen instance x, exactly half of VS predicts +, and exactly half predicts -.
Without bias, no generalization.

Other methods
- We will see many more learning methods.
- They can be used for all kinds of modeling, including predictive modeling.
- Examples: Naive Bayes, probabilistic graphical models, discriminant analysis, ...
- All of them have their own bias = implicit assumptions about what properties the true model has.
The next slide illustrates this for classification.

Classifiers illustrated in 2-D
[Figure: decision boundaries of various classifiers on 2-D datasets; source code: Scikit-Learn documentation.]

Choices to make
Modeling your problem as a prediction task:
- What is the input, what is the output (target value)?
- Regression, classification, probability prediction?
Choosing a learning approach:
- Efficiency of the learning/prediction phase
- Bias (linear, interaction effects?)
- Interpretability of the returned models

Learning if-then rules

Rule sets
- A rule set is a set of rules of the form "if ... then ...".
- Rule sets can be ordered (rule i only applies if rules 1 to i-1 did not apply) or unordered (each rule always applies).
- An ordered rule set is also called a decision list.
- Ordered rule sets really have "if - then - else if ..." semantics.
Example: definition of leap years (years with 366 days).
Examples of leap years: 2000, 1992, 2004, ...
Examples of non-leap years: 1993, 1900, 2011, 2018, ...

  If year is a multiple of 400 then leap
  else if year is a multiple of 100 then not leap
  else if year is a multiple of 4 then leap
  else not leap
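As an aside (not on the slides), this ordered rule set maps directly onto an if/elif/else chain. The sketch below is a minimal Python rendering of the decision-list semantics, in which each rule is only reached if the previous rules did not fire.

```python
def is_leap(year: int) -> bool:
    """Decision-list (ordered rule set) version of the leap-year rules."""
    if year % 400 == 0:      # rule 1: multiple of 400 -> leap
        return True
    elif year % 100 == 0:    # rule 2: only reached if rule 1 did not apply
        return False
    elif year % 4 == 0:      # rule 3: only reached if rules 1-2 did not apply
        return True
    else:                    # default rule
        return False

# Matches the examples on the slide:
print([is_leap(y) for y in (2000, 1992, 2004)])        # [True, True, True]
print([is_leap(y) for y in (1993, 1900, 2011, 2018)])  # [False, False, False, False]
```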
Rules with exceptions: rule sets vs. decision lists
Decision lists can be more compact than rule sets, but individual rules are more difficult to interpret: context is needed.
Compare:

  If year is a multiple of 400 then leap
  else if year is a multiple of 100 then not leap
  else if year is a multiple of 4 then leap
  else not leap

(each if-then part is correct only within a certain context)

  If year is a multiple of 400 then leap
  If year is a multiple of 4 but not of 100 then leap

(each rule is correct by itself)

Another illustration
[Figure: a concept (the gray area) covered either by four unordered rules R1-R4, each "if ... then pos", or by two ordered rules: R1: if ... then neg, R2: if ... then pos (implicit else if).]
If a single rule can only cover a rectangle, how do we represent this concept?

Learning rule sets
Note: a decision tree can be turned into a set of rules! This is one way to learn sets of rules.
Example:

  If Outlook=sunny and Humidity=normal then yes
  If Outlook=overcast then yes
  If Outlook=rainy and Wind=weak then yes
  [Otherwise: no]

These sets of rules tend to have many conditions in common and no overlap.
Here we look at some other approaches to learning rules.

Sequential covering
Or: "separate-and-conquer" (as opposed to trees, which are "divide-and-conquer").
A rule covers an instance if the instance fulfils its conditions (= the rule makes a prediction for that instance).
Principle:
- Learn one rule at a time; ideally, the rule has
  - high accuracy: when it makes a prediction, it should be correct;
  - reasonable coverage: it need not make a prediction for each instance, but the more, the better.
- Mark the examples covered by the rule.
- Repeat for the unmarked examples, until all examples are taken care of.

Learning one rule
Typically done using a greedy search in a generality lattice; can be top-down or bottom-up.
Top-down:
- Start with the maximally general rule (maximal coverage but low accuracy).
- Add conditions one by one.
- Gradually maximize accuracy without sacrificing coverage (using some heuristic).
Bottom-up:
- Start with a maximally specific rule (minimal coverage but maximal accuracy).
- Remove conditions one by one.
- Gradually maximize coverage without sacrificing accuracy (using some heuristic).

Illustration on the "cocktails" dataset
Set of conditions to consider: Shape = cylinder, Shape = coupe, Shape = trapezoid, Color = orange, Color = black, Color = white, Color = green, Color = yellow, Content = 25cl, Content = 15cl, Content = 10cl.
(We could also treat Content as continuous and have, e.g., Content < c and Content > c for some threshold c. In this example, we treat it as nominal.)

  Shape      Color   Content  Sick?
  Cylinder   Orange  25cl     No
  Cylinder   Black   25cl     No
  Coupe      White   10cl     No
  Trapezoid  Green   15cl     No
  Coupe      Yellow  15cl     No
  Trapezoid  Orange  15cl     Yes
  Coupe      Orange  15cl     Yes
  Coupe      Orange  10cl     Yes
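To see where the counts such as 3/8 and 3/4 in the refinement steps below come from, here is a small illustrative Python sketch (not from the slides) that scores a candidate rule on the cocktails table by the number of covered positives p and covered examples p+n; the helper names are hypothetical.

```python
# Cocktails data from the slide: (Shape, Color, Content, Sick?)
data = [
    ("Cylinder",  "Orange", "25cl", "No"),
    ("Cylinder",  "Black",  "25cl", "No"),
    ("Coupe",     "White",  "10cl", "No"),
    ("Trapezoid", "Green",  "15cl", "No"),
    ("Coupe",     "Yellow", "15cl", "No"),
    ("Trapezoid", "Orange", "15cl", "Yes"),
    ("Coupe",     "Orange", "15cl", "Yes"),
    ("Coupe",     "Orange", "10cl", "Yes"),
]
attrs = {"Shape": 0, "Color": 1, "Content": 2}

def score(conditions):
    """Return (p, p+n) for 'IF conditions THEN Sick?=yes' on the data."""
    covered = [row for row in data
               if all(row[attrs[a]] == v for a, v in conditions)]
    p = sum(1 for row in covered if row[3] == "Yes")
    return p, len(covered)

print(score([]))                                          # (3, 8): IF true THEN yes
print(score([("Color", "Orange")]))                       # (3, 4): best single condition
print(score([("Color", "Orange"), ("Shape", "Coupe")]))   # (2, 2): 100% accurate
```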
Illustration on the "cocktails" dataset
Start: IF true THEN Sick?=yes  (3/8)
Candidate refinements:
  IF Shape=cylinder THEN Sick?=yes   0/2
  IF Shape=coupe THEN Sick?=yes      2/4
  IF Shape=trapezoid THEN Sick?=yes  1/2
  IF Color=orange THEN Sick?=yes     3/4
  IF Color=black THEN Sick?=yes      0/1
  IF Color=white THEN Sick?=yes      0/1
  IF Color=green THEN Sick?=yes      0/1
  IF Color=yellow THEN Sick?=yes     0/1
  IF Content=25cl THEN Sick?=yes     0/2
  IF Content=15cl THEN Sick?=yes     2/4
  IF Content=10cl THEN Sick?=yes     1/2

Illustration on the "cocktails" dataset
Current: IF Color=orange THEN Sick?=yes  (3/4)
Candidate refinements:
  IF Color=orange and Shape=cylinder THEN Sick?=yes   0/1
  IF Color=orange and Shape=coupe THEN Sick?=yes      2/2
  IF Color=orange and Shape=trapezoid THEN Sick?=yes  1/1
  IF Color=orange and Content=25cl THEN Sick?=yes     0/1
  IF Color=orange and Content=15cl THEN Sick?=yes     2/2
  IF Color=orange and Content=10cl THEN Sick?=yes     1/1
The rule (IF Color=orange and Shape=coupe THEN Sick?=yes) is accurate (100%), but does not cover all positives yet.
Hence, mark the covered examples and learn another rule, focusing on what is not yet covered.

Illustration on the "cocktails" dataset
Start: IF true THEN Sick?=yes  (1/6)
Candidate refinements:
  IF Shape=cylinder THEN Sick?=yes   0/2
  IF Shape=coupe THEN Sick?=yes      0/2
  IF Shape=trapezoid THEN Sick?=yes  1/2
  IF Color=orange THEN Sick?=yes     1/2
  IF Color=black THEN Sick?=yes      0/1
  IF Color=white THEN Sick?=yes      0/1
  IF Color=green THEN Sick?=yes      0/1
  IF Color=yellow THEN Sick?=yes     0/1
  IF Content=25cl THEN Sick?=yes     0/2
  IF Content=15cl THEN Sick?=yes     1/3
  IF Content=10cl THEN Sick?=yes     0/1

Illustration on the "cocktails" dataset
Current: IF Shape=trapezoid THEN Sick?=yes  (1/2)
Candidate refinements:
  IF Shape=trapezoid and Color=orange THEN Sick?=yes   1/1
  IF Shape=trapezoid and Color=black THEN Sick?=yes    0/0
  IF Shape=trapezoid and Color=white THEN Sick?=yes    0/0
  IF Shape=trapezoid and Color=green THEN Sick?=yes    0/1
  IF Shape=trapezoid and Color=yellow THEN Sick?=yes   0/0
  IF Shape=trapezoid and Content=25cl THEN Sick?=yes   0/0
  IF Shape=trapezoid and Content=15cl THEN Sick?=yes   1/2
  IF Shape=trapezoid and Content=10cl THEN Sick?=yes   0/0
The second rule is accurate (100%) and covers all remaining positives - done!

A simple algorithm
The training set D is partitioned into Pos (instances of the class we want to predict) and Neg (all other instances).

  function LearnRuleSet(Pos, Neg):
      RuleSet = ∅
      while Pos not empty:
          R = LearnOneRule(Pos, Neg)
          if R does not meet the acceptance criteria: break
          add R to RuleSet
          remove the instances covered by R from Pos
      return RuleSet

  function LearnOneRule(Pos, Neg):
      Rule = "if true then positive"
      while Rule covers elements of Neg:
          C* = argmax_{C ∈ CandidateConditions} heuristic(refine(Rule, C), Pos, Neg)
          Rule = refine(Rule, C*)
      return Rule

  function refine(Rule, C):
      let Rule = "if conditions then positive"
      return "if conditions and C then positive"

(This assumes we go for 100% accuracy; other stopping criteria are possible.)
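Below is a minimal runnable Python sketch of this algorithm, as one possible reading of the pseudocode rather than the course's reference implementation. It assumes plain accuracy p/(p+n) as the heuristic, 100% accuracy as the stopping criterion, and the cocktails data from the earlier slide; it produces a small rule set covering all positives, though not necessarily the exact rules shown above.

```python
# Illustrative sequential covering ("separate-and-conquer") on the cocktails data.
data = [
    {"Shape": "Cylinder",  "Color": "Orange", "Content": "25cl", "Sick": "No"},
    {"Shape": "Cylinder",  "Color": "Black",  "Content": "25cl", "Sick": "No"},
    {"Shape": "Coupe",     "Color": "White",  "Content": "10cl", "Sick": "No"},
    {"Shape": "Trapezoid", "Color": "Green",  "Content": "15cl", "Sick": "No"},
    {"Shape": "Coupe",     "Color": "Yellow", "Content": "15cl", "Sick": "No"},
    {"Shape": "Trapezoid", "Color": "Orange", "Content": "15cl", "Sick": "Yes"},
    {"Shape": "Coupe",     "Color": "Orange", "Content": "15cl", "Sick": "Yes"},
    {"Shape": "Coupe",     "Color": "Orange", "Content": "10cl", "Sick": "Yes"},
]

def covers(rule, x):
    return all(x[a] == v for a, v in rule)

def heuristic(rule, pos, neg):
    p = sum(covers(rule, x) for x in pos)
    n = sum(covers(rule, x) for x in neg)
    return p / (p + n) if p + n > 0 else 0.0   # plain accuracy; an m-estimate would be safer

def learn_one_rule(pos, neg, conditions):
    rule = []                                   # "if true then positive"
    while any(covers(rule, x) for x in neg):    # refine until no negatives are covered
        rule = max((rule + [c] for c in conditions if c not in rule),
                   key=lambda r: heuristic(r, pos, neg))
    return rule

def learn_rule_set(pos, neg, conditions):
    rule_set = []
    while pos:                                  # until all positives are covered
        rule = learn_one_rule(pos, neg, conditions)
        rule_set.append(rule)
        pos = [x for x in pos if not covers(rule, x)]   # mark the covered examples
    return rule_set

conditions = sorted({(a, x[a]) for a in ("Shape", "Color", "Content") for x in data})
pos = [x for x in data if x["Sick"] == "Yes"]
neg = [x for x in data if x["Sick"] == "No"]
for rule in learn_rule_set(pos, neg, conditions):
    print("IF", " and ".join(f"{a}={v}" for a, v in rule), "THEN Sick=yes")
```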
Can you think of a good heuristic to select rules?
Note: we need a heuristic that favors (1) high accuracy and (2, less important) reasonably high coverage.

Heuristics for rule learners
The accuracy of a rule that predicts class C is p / (p+n), with p, n = the number of positives (class C), resp. negatives (not class C), covered by the rule = what proportion of the predictions made by this rule are correct.
Rule A scores 2/3, rule B scores 20/30. Which rule is most likely to ultimately lead to a rule with high accuracy?
1) A might have got lucky: a rule with 50% accuracy can accidentally score 2 out of 3. For B, that is much less likely.
2) B has better coverage: that is good in itself (we will need fewer rules!).
3) B has more potential of eventually (after further specialization) leading to a good rule.
That is three reasons why B should score higher.

Heuristics for rule learners
The m-estimate of a rule is

  m-estimate(m, q) = (p + mq) / (p + n + m)

with p, n as before; q a prior estimate of the accuracy; and m a weight for that prior estimate.
It is a conservative estimate of accuracy:
- when p+n is small, the estimate is closer to the given prior estimate q;
- it converges to the accuracy as p+n grows;
- a larger m gives a more conservative (closer to the prior) m-estimate.
(For m = 0 the m-estimate equals the accuracy p/(p+n); for m >> p+n it approaches q.)
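A small illustrative sketch (not from the slides) of the m-estimate as defined above, applied to the rule A (2/3) vs. rule B (20/30) comparison; the prior q and weight m below are arbitrary assumptions (q = 3/8, i.e. the overall fraction of positives in the cocktails data, and m = 2).

```python
def m_estimate(p: int, n: int, m: float, q: float) -> float:
    """m-estimate of rule accuracy: (p + m*q) / (p + n + m)."""
    return (p + m * q) / (p + n + m)

m, q = 2, 0.375
print(m_estimate(2, 1, m, q))    # rule A (2/3):  0.55, pulled strongly towards the prior
print(m_estimate(20, 10, m, q))  # rule B (20/30): ~0.65, close to its raw accuracy of 2/3
print(m_estimate(2, 1, 0, q))    # m = 0 gives the plain accuracy: ~0.667

# Although A and B have the same raw accuracy, the m-estimate scores B higher,
# matching the intuition on the previous slide.
```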
Example-driven top-down rule induction
Originally introduced in Michalski's AQ algorithms.
Works like the regular top-down approach, except:
- Pick a not-yet-covered example.
- Use as hypothesis space only the rules that cover this example.
- Search within this (much smaller!) hypothesis space.
More efficient: it searches a smaller search space.
Less robust to noise: what if the initial example was noisy? May need to restart several times.

Cocktails: example-driven
Start: IF true THEN Sick?=yes  (3/8)
Only candidate refinements that cover the chosen example (here the positive example Trapezoid, Orange, 15cl) are considered:
  IF Shape=trapezoid THEN Sick?=yes  1/2
  IF Color=orange THEN Sick?=yes     3/4
  IF Content=15cl THEN Sick?=yes     2/4

Current: IF Color=orange THEN Sick?=yes  (3/4)
  IF Color=orange and Shape=trapezoid THEN Sick?=yes  1/1
  IF Color=orange and Content=15cl THEN Sick?=yes     2/2

RIPPER
A well-known implementation of rule learning, perhaps the most effective rule set learner to date.
Separate-and-conquer approach, with key modifications:
- Prune each rule after learning it, using a separate pruning set (= "reduced error pruning", see also tree learning).
- Learn rules for one class at a time, starting with the smallest classes (hence, an ordered rule set).
- Optimize the rule set afterwards, by re-learning each rule (in the order first learned) within the context of the other rules, and replacing the original rule by the new one when the new one is better.
- Plus carefully chosen heuristics and stopping criteria.
See Cohen, 1995, "Fast Effective Rule Induction", for details.

Rule learning in Weka
The Weka implementation of RIPPER is called JRip.
Below: a comparison of J48 and JRip on the "contact lenses" dataset. Try it yourself on, e.g., the Soybean dataset.

JRIP rules:
  (tear-prod-rate = normal) and (astigmatism = yes) => contact-lenses=hard (6.0/2.0)
  (tear-prod-rate = normal) => contact-lenses=soft (6.0/1.0)
  => contact-lenses=none (12.0/0.0)

J48 tree:
  tear-prod-rate = reduced: none (12.0)
  tear-prod-rate = normal
  |  astigmatism = no: soft (6.0/1.0)
  |  astigmatism = yes
  |  |  spectacle-prescrip = myope: hard (3.0)
  |  |  spectacle-prescrip = hypermetrope: none (3.0/1.0)

Link: Association rules
Association rules look somewhat similar to classification rules, but have a very different purpose: they are descriptive rules that indicate patterns in data.
For that reason, the algorithm for finding them is very different (e.g., it looks for all rules that satisfy the given criteria, rather than for a minimal subset).
Covered in the course on Data Mining.

  Client  cheese  bread  butter  wine  jam  ham
  1       yes     yes    yes     yes   no   yes
  2       yes     no     yes     no    no   no
  3       no      yes    yes     no    no   yes
  ...     ...     ...    ...     ...   ...  ...

  IF bread & butter THEN cheese    confidence: 50%    support: 5%
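As a closing illustration (not from the slides), here is a minimal Python sketch of how the confidence and support of an association rule such as "IF bread & butter THEN cheese" would be computed from a table of client transactions. Only the three clients shown above are used, so the toy table is far too small to reproduce the 5% support quoted on the slide; the numbers are purely illustrative.

```python
# Toy transactions (client -> items bought), mirroring the first three rows of the table.
transactions = [
    {"cheese", "bread", "butter", "wine", "ham"},   # client 1
    {"cheese", "butter"},                           # client 2
    {"bread", "butter", "ham"},                     # client 3
]

def support(itemset):
    """Fraction of all transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head):
    """Of the transactions containing the body, the fraction that also contain the head."""
    return support(body | head) / support(body)

rule_body, rule_head = {"bread", "butter"}, {"cheese"}
print("support:   ", support(rule_body | rule_head))    # fraction buying bread, butter and cheese
print("confidence:", confidence(rule_body, rule_head))  # P(cheese | bread & butter) in the data
```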