Basic Methods
Summary
This document covers basic methods in machine learning: simple structures in data, the 1R method, discretization, and the simplicity-first idea behind "very simple classification rules". It also covers Naïve Bayes, the derivation of Bayes' rule, and the handling of missing values, and goes on to decision trees, rule learning, association rules, and instance-based learning.
Full Transcript
Basic Methods
This chapter focuses on basic methods; more advanced (industrial-strength) versions of the algorithms will be discussed in Chapter 6. Datasets often exhibit simple structures:
1. One attribute may do all the work.
2. Attributes may contribute equally and independently to the outcome.
3. There may be a simple logical structure that can be captured naturally by a decision tree.
4. Perhaps a few independent rules suffice.
5. Perhaps there are dependencies among attributes.
6. Perhaps there is a linear dependence among numeric attributes.
7. Classifications appropriate to different parts of the space may be governed by the distance between instances.
8. Perhaps no class is provided.
A data-mining tool that is looking for one class of structure may completely miss regularities of a different kind. The result could be a dense, opaque concept description of one kind instead of a simple, elegant, immediately comprehensible structure of another.

The 1R Method
1R builds a one-level rule on a single attribute: for each attribute, predict the majority class of each of its values, and keep the attribute whose rule makes the fewest errors on the training set. Very simple, yet sometimes surprisingly effective (when one attribute dominates). Missing is treated as just another legal value. Numeric attributes are discretized. (A code sketch of the 1R procedure appears at the end of this section.)
What about the two 72s, one yes and one no? Perhaps move the 72 breakpoint up to 73.5.

Discretization and Overfitting
The procedure above tends to produce an excessive number of categories, and 1R will naturally gravitate towards an attribute that splits into many categories. The extreme case is an attribute that has a different value for each instance (a kind of ID code). Such a highly branching attribute will generally do poorly on test data (overfitting).

Discretization and 1R
For 1R, overfitting is likely to occur when an attribute has a large number of possible values. Thus, when discretizing, a minimum (e.g. 3) is imposed on the number of examples of the majority class in each partition.

Discretization and Weather
This rule has only 3 errors on the training set. What about missing values and discretization?

Simple Simplicity
"Very simple classification rules perform well on most commonly used datasets" (Holte, 1993): a comprehensive study of 1R on 16 datasets with cross-validation, with the minimum number of examples per partition set to 6. 1R was just a few percentage points below state-of-the-art decision tree methods on almost all datasets, and 1R has smaller and much simpler output than decision trees.

Don't Discount Simple
1R is often a viable alternative to more complex structures. Simplicity-first methodology: first establish baseline performance with rudimentary techniques before progressing to more sophisticated techniques, which will usually have less transparent output.

Naïve Bayes
A simple method that assumes all attributes are equally important and independent of each other. Unrealistic, but often effective.
Bayes Rule: consider empirical probability estimates for the Weather dataset. For a new day that is sunny and cool, with high humidity and windy:
L(yes) = Pr(sunny | yes) × Pr(cool | yes) × Pr(high | yes) × Pr(windy | yes) × Pr(yes)
Normalizing L(yes) against L(no) gives Pr(yes | observations). Here H is the class (hypothesis) and E is the observations (evidence), and
Pr(H | E) = Pr(E | H) × Pr(H) / Pr(E)

Derivation of Bayes Rule
Pr(A | B) = Pr(A ∩ B) / Pr(B)   (definition of conditional probability)
Pr(B | A) = Pr(A ∩ B) / Pr(A)   (definition of conditional probability)
Pr(A ∩ B) = Pr(A | B) × Pr(B)
Pr(A ∩ B) = Pr(B | A) × Pr(A)
Pr(A | B) × Pr(B) = Pr(B | A) × Pr(A)
Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B)
Pr(H | E) = Pr(E | H) × Pr(H) / Pr(E)
Pr(class | observations) = Pr(observations | class) × Pr(class) / Pr(observations)

Naïve Bayes
Naïve because it assumes independence (it is only valid to multiply probabilities if they are independent).
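Returning to the 1R method described earlier in this section, here is a minimal sketch of the idea. The helper name one_r, the dictionary representation of instances, and the five-row data fragment are illustrative assumptions, not taken from these notes.

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """1R: for each attribute, build a one-level rule that predicts the
    majority class of each attribute value; keep the attribute whose rule
    makes the fewest errors on the training set."""
    best = None                                   # (errors, attribute, rules)
    for attr in attributes:
        counts = defaultdict(Counter)             # attribute value -> class counts
        for row in instances:
            counts[row[attr]][row[class_attr]] += 1
        rules, errors = {}, 0
        for value, class_counts in counts.items():
            majority_class, majority_n = class_counts.most_common(1)[0]
            rules[value] = majority_class
            errors += sum(class_counts.values()) - majority_n
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best

# Illustrative fragment only (not the full 14-instance weather data).
toy = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
print(one_r(toy, ["outlook", "windy"], "play"))
```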
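And a minimal sketch of the Naïve Bayes likelihood computation described above for the new day (sunny, cool, high humidity, windy). The conditional counts are the relative frequencies usually quoted for the standard 14-instance weather dataset (9 yes / 5 no; for yes vs. no respectively: sunny 2/9 vs. 3/5, cool 3/9 vs. 1/5, high 3/9 vs. 4/5, windy 3/9 vs. 3/5); they are assumed from the standard version of the dataset rather than read off a table in these notes.

```python
# Per-class relative-frequency estimates for the new day
# (outlook = sunny, temperature = cool, humidity = high, windy = true).
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "windy": 3/9, "prior": 9/14},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "windy": 3/5, "prior": 5/14},
}

# Likelihood L(class) = product of the conditionals times the class prior.
likelihood = {
    c: p["sunny"] * p["cool"] * p["high"] * p["windy"] * p["prior"]
    for c, p in cond.items()
}

# Normalize so the two likelihoods sum to 1, giving Pr(class | observations).
total = sum(likelihood.values())
posterior = {c: l / total for c, l in likelihood.items()}
print(posterior)   # roughly {'yes': 0.205, 'no': 0.795}
```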
Despite this, it works very effectively when tested on actual datasets, particularly when combined with attribute selection (which removes redundant, and hence non-independent, attributes).
NB "crashes" if a particular attribute value never occurs in the training set in conjunction with some class value, giving it a probability of 0. That zero will cause anything it is multiplied by to be washed out: probabilities that are zero hold a veto. What to do? The Laplace correction (aka Laplace estimator): add 1 to each count of an attribute value within a class. In practice, the amount added does not have to be 1: for an attribute with three legal values we can add μ/3 to each count, where μ provides a weight for how influential the a priori values should be; adding equal amounts implicitly assumes equal prior probabilities for the three legal values of the attribute. It is also possible not to assign the prior probabilities equally, adding μp1, μp2 and μp3 instead. Very rigorous, but there is a problem: we usually don't have an intelligent way to assign p1, p2, p3. In practice, most people follow Laplace and simply add 1 to each count.

Missing Values
Missing values are handled nicely in NB. For example, if outlook were missing, it would simply be omitted from the calculations; after normalization, we get Pr(yes) = 41%, Pr(no) = 59%. If a value is missing in the training set, it is simply not included in the counts.

Naïve Bayes and Numeric Attributes
Numeric attributes are handled by assuming that they follow a "normal" (Gaussian) distribution. Calculate the mean and standard deviation for each attribute and each class (not including missing values). Suppose we have a temperature of 68 and are considering the likelihood of a "yes" outcome: we plug 68 into the Gaussian density for temperature given yes. (A sketch of the Laplace correction and the Gaussian density appears at the end of this section.)

Brand New Day
Outlook = overcast, Temperature = 82, Humidity = 75, Windy = false.
Naïve Bayes is a simple approach that can achieve impressive results. People are often surprised to find that NB rivals, and indeed outperforms, more sophisticated techniques on many datasets. Moral: always try simple things first! Many times people have obtained good results with sophisticated schemes, only to discover later that simple methods, such as 1R or NB, can do just as well, or even better. Of course, there are many datasets on which NB performs poorly, and it is easy to see why: NB treats attributes as though they are independent given the class, so adding redundant attributes skews the learning process. Extreme example: suppose you were to add another attribute to Weather with the same values as temperature. The effect of temperature would be multiplied (all its probabilities would be squared), giving it a great deal more influence in the decision. If you were to add 10 such attributes, the decisions would effectively be made on temperature alone. Dependencies among attributes weaken NB; attribute selection helps.

Constructing Decision Trees
The task of constructing a decision tree can be expressed recursively. First, select an attribute (called the pivot) to place at the root node, and make one branch for each possible value. This partitions the example set, one partition for each value of the attribute. Repeat the process recursively on each partition. Stop when all instances in a partition have the same class.

How to Select the Pivot
Consider the weather dataset. There are four possibilities for the first pivot.

Choosing a Pivot
The number of yes and no classes is shown at each leaf. Any leaf with only one class will not have to be split further (good!). Because we seek small trees, we want this to happen as soon as possible. If we had a measure of "purity", we could choose the attribute that produces the purest daughter nodes.
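Returning to the two Naïve Bayes refinements above, here is a minimal sketch of a Laplace-smoothed estimate and of the Gaussian density used for numeric attributes. The function names are hypothetical, and the mean and standard deviation in the last line are illustrative values, not computed from a table in these notes.

```python
import math

def laplace_estimate(count, total, n_values, mu=1.0, prior=None):
    """Smoothed estimate of Pr(value | class).
    count    : occurrences of this attribute value within the class
    total    : number of training instances of the class
    n_values : number of legal values for the attribute
    mu       : weight given to the a-priori estimate (mu = n_values with
               equal priors reproduces the classic add-one correction)
    prior    : prior probability of this value (defaults to 1/n_values)"""
    if prior is None:
        prior = 1.0 / n_values
    return (count + mu * prior) / (total + mu)

def gaussian_density(x, mean, std):
    """Density used for a numeric attribute under the normality assumption."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Zero-frequency repair: outlook = overcast never occurs with class "no"
# in the weather data (0 out of 5), so the raw estimate would be 0.
print(laplace_estimate(0, 5, n_values=3, mu=3))      # 1/8 instead of 0

# Numeric attribute: contribution of temperature = 68 to the "yes" likelihood,
# assuming an illustrative mean of 73 and standard deviation of 6.2.
print(gaussian_density(68, mean=73.0, std=6.2))
```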
Measure of Purity
The measure of purity we will use is called the information (or entropy) and is measured in units called bits. Associated with each node of the tree, it represents the expected amount of information that would be needed to specify whether a new instance should be classified as yes or no, given that it reached that node. The expected amount of information is usually a fraction of a bit, and is calculated from the number of yes and no classes at the node.

Information (or Entropy)
Suppose the class distribution in a leaf node is a good estimate of the true distribution of instances that end up at that node, and suppose we have a new unlabeled instance which ends up at that node. Now suppose the label of the new instance is revealed. How much information (measured in bits) is conveyed by this revelation?

Entropy
Entropy quantifies the amount of uncertainty in a probability distribution. Suppose we flip a biased coin, where Pr(heads) = 0.75 and Pr(tails) = 0.25. How much information is conveyed when the outcome of the coin toss is revealed? Entropy (or information) can be thought of as the expected value of the surprisal.

Surprisal
If the coin comes up heads, we are not very surprised, since the coin is biased. If the coin comes up tails, we are more surprised. The more unfair the coin, the greater the surprisal for tails and the lesser for heads. Definition of surprisal: surprisal(outcome) = log2(1 / Pr(outcome)). Entropy is the weighted average surprisal, weighted by the probability of each outcome.

Entropy
Entropy is the expected value of surprisal:
H = Σ_i Pr(outcome_i) × log2(1 / Pr(outcome_i))
For the biased coin with Pr(heads) = 0.75 and Pr(tails) = 0.25:
H = Pr(heads) × log2(1 / Pr(heads)) + Pr(tails) × log2(1 / Pr(tails)) = 0.75 × log2(4/3) + 0.25 × log2(4) ≈ 0.81
For a fair coin:
H = 0.5 × log2(2) + 0.5 × log2(2) = log2(2) = 1

Entropy for Biased Coin
[Figure: entropy of a biased coin plotted against Pr(heads), ranging from 0 to 1 bit.]

Entropy for Uniform Distribution
For a fair 6-sided die:
H = 6 × (1/6) × log2(6) = log2(6) ≈ 2.585
In general, when there are m outcomes and all outcomes are equally likely, H = log2(m).

Entropy for Decision Tree Node
Suppose a particular node x has 10 yes and 2 no. Using these counts as an estimate of the true distribution yields Pr(yes | x) = 10/12 and Pr(no | x) = 2/12. Suppose we have a new unlabeled instance that reached x. What is the expected surprisal when its label is revealed? If the label is revealed to be yes, the surprisal is log2(12/10) ≈ 0.263 bits (and this happens with probability 83%). With probability 17%, the surprisal is log2(12/2) ≈ 2.585 bits. Thus H ≈ 0.263 × 0.83 + 2.585 × 0.17 ≈ 0.65 bits.

How to Select the Pivot
Consider the weather dataset, with four possibilities for the first pivot. The root has 9 yes and 5 no, so H(root) = H([9,5]) = 0.94.
Splitting on outlook: H(sunny) = H([2,3]) = 0.971, H(overcast) = H([4,0]) = 0, H(rainy) = H([3,2]) = 0.971.
H(split on outlook) = H([2,3], [4,0], [3,2]) = 5/14 × H(sunny) + 4/14 × H(overcast) + 5/14 × H(rainy) = 0.694
IG(split on outlook) = H(root) − H(split on outlook) = 0.94 − 0.694 = 0.247
Splitting on temperature: H(hot) = H([2,2]) = 1, H(mild) = H([4,2]) = 0.918, H(cool) = H([3,1]) = 0.81.
H([2,2], [4,2], [3,1]) = 4/14 × H(hot) + 6/14 × H(mild) + 4/14 × H(cool) = 0.911
IG(temperature) = H(root) − H([2,2], [4,2], [3,1]) = 0.94 − 0.911 = 0.029

Information Gain
Information gain for a split x = Information(root) − Information(x). Goal: maximize the information gain, which corresponds to minimizing the entropy of the resulting partitions.
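Before continuing with the weather example, a minimal sketch of the entropy and information-gain calculations described above; entropy and info_gain are hypothetical helper names, and the class counts in the examples are the ones quoted in these notes.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts,
    i.e. the expected surprisal: sum_i p_i * log2(1 / p_i)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def info_gain(parent_counts, child_counts):
    """Information gain = H(parent) - weighted average H(children)."""
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

print(entropy([3, 1]))                               # biased 75/25 coin: ~0.81 bits
print(entropy([1, 1]))                               # fair coin: 1 bit
print(entropy([1] * 6))                              # fair die: log2(6) ~ 2.585 bits
print(entropy([10, 2]))                              # node with 10 yes, 2 no: ~0.65 bits
print(entropy([9, 5]))                               # weather root: ~0.94 bits
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # split on outlook: ~0.247
print(info_gain([9, 5], [[2, 2], [4, 2], [3, 1]]))   # split on temperature: ~0.029
```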
Recapping: the root has 9 yes and 5 no, so H(root) = 0.94. IG(split on outlook) = H(root) − (5/14 × H(sunny) + 4/14 × H(overcast) + 5/14 × H(rainy)) = 0.94 − 0.694 = 0.247, whereas IG(temperature) = 0.94 − 0.911 = 0.029, so outlook gives the larger gain.

Next Split
After choosing outlook as our pivot, we need to choose the next pivot. We consider each value of outlook in turn; first, let us consider outlook = sunny. We remove from the dataset all instances where outlook is not sunny, which leaves 5 instances: 2 yes and 3 no. We then choose the next pivot from temperature, windy and humidity based only on this smaller dataset of 5 instances.
H(no further split) = H([2,3]) = 0.971
Temperature: H(hot) = H([0,2]) = 0, H(mild) = H([1,1]) = 1, H(cool) = H([1,0]) = 0.
gain(temperature) = 0.971 − (2/5 × 0 + 2/5 × 1 + 1/5 × 0) = 0.971 − 0.4 = 0.571
Windy: H(false) = H([1,2]) = 0.918, H(true) = H([1,1]) = 1.
gain(windy) = 0.971 − (3/5 × 0.918 + 2/5 × 1) = 0.020
Humidity: H(high) = 0, H(normal) = 0.
gain(humidity) = 0.971 − 0 = 0.971

Highly-Branching Attributes
This method gives undue preference to attributes that have a large number of possible values, giving rise to many child nodes. Consider the extreme case of an ID field.

Information Gain Ratio
To avoid this bias, we use the information gain ratio, which is the information gain divided by the intrinsic value of the split. The intrinsic value is calculated from how the training instances distribute among the child nodes, without any consideration of the class value.

Intrinsic Value
Suppose an attribute splits the 20 training instances into 3 nodes containing [4, 10, 6]:
IV([4, 10, 6]) = (4/20) × log2(20/4) + (10/20) × log2(20/10) + (6/20) × log2(20/6)
Consider a new unlabeled instance x. IV is the expected surprisal when it is revealed which node x ends up in (not x's class). (A sketch appears at the end of this section.)

ID3 and C4.5
What we have described so far (without the gain-ratio adjustment) corresponds to the ID3 algorithm. ID3 was further enhanced to become C4.5, which adds handling of numeric attributes and missing values, among other improvements. C4.5 is considered a state-of-the-art algorithm.

Top 10 Algorithms in Data Mining

Constructing Rules
Constructing rules follows a covering approach: we take each class in turn and seek to cover all instances in it. Example: the first rule splits vertically at x = 1.2, but this first rule covers many b's as well, so we can add a second condition at y = 2.6. This leaves a single a mislabeled. It is probably better to leave it at that (to avoid overfitting the training set), but if we insisted on covering it, we could add a further rule for that instance.

Rules
We could have used the same process to produce rules for the b's instead; again, a single a is mislabeled, and to exclude it we would need to add more rules.

Rules vs Trees
Constructing rules and trees may seem superficially similar, and they often give similar results; for the same dataset, the tree that would be produced captures essentially the same concept. But in many situations there is a difference. We have previously discussed the replicated subtree problem. Intuitively, when choosing a pivot while constructing a tree, we take into account the purity of all classes, whereas rule methods concentrate on only one class at a time, disregarding what happens to the others.
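Returning to the gain-ratio adjustment described above, a minimal sketch that builds on the same entropy helper; intrinsic_value and gain_ratio are hypothetical names, and the final line applies the ratio to the outlook split of the weather data.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a distribution given as counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def intrinsic_value(child_sizes):
    """Intrinsic value: entropy of how instances distribute among the
    child nodes, ignoring class values entirely."""
    return entropy(child_sizes)

def gain_ratio(parent_counts, child_counts):
    """Information gain divided by the intrinsic value of the split."""
    n = sum(parent_counts)
    remainder = sum(sum(c) / n * entropy(c) for c in child_counts)
    gain = entropy(parent_counts) - remainder
    return gain / intrinsic_value([sum(c) for c in child_counts])

print(intrinsic_value([4, 10, 6]))                    # IV([4, 10, 6]) ~ 1.49 bits
# Outlook split on the weather data: children [2,3], [4,0], [3,2].
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.247 / 1.577 ~ 0.157
```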
Covering
Covering algorithms operate by adding tests (attribute-value pairs) to the rule under construction so as to maximize the probability of the desired classification. Every new term added restricts the coverage of the rule. The idea is to include as many instances of the desired class, and exclude as many instances of other classes, as possible.

Adding Terms
Suppose the new rule (after adding the next term) will cover t instances, of which p are positive examples of the desired class and (t − p) are in other classes. We choose the new term to maximize the ratio p/t. Consider the contact-lens problem, starting with the "hard" class. Choosing the best term yields the first rule. Perhaps we should stop here (why? to avoid overfitting). If we insist on going further, the possibilities for the next term are evaluated; there is a tie between the 1st and 4th terms, so we choose the one with greater coverage. The resulting rule covers 3 out of the 4 hard recommendations, so we delete these 3 (already covered) instances from the dataset and start again. Once all the hard-lens cases are covered, the next step is to proceed with the soft-lens cases in the same way. Finally, we generate rules for the none class, unless we are seeking a rule set with a default rule, in which case explicit rules are not needed for the final class.

PRISM Method
What we have described is the essence of the PRISM method. It generates only correct ("perfect") rules: it measures the success of a rule by p/t, treats any rule with less than 100% accuracy as incorrect, and continues adding clauses to each rule until the rule is perfect. (This is a drawback, because it leads to overfitting, which hurts generalization.)

Summary of PRISM Method

Association Rules
A brute-force divide-and-conquer approach to association rule induction is not practical (there are too many possibilities!). Instead, we capitalize on the fact that we are only interested in rules with high coverage. We seek attribute-value pairs (called items) that together have a pre-specified minimum coverage; that is, we are seeking high-coverage item sets. We find the 1-item sets first, then the 2-item sets, then 3-item sets, and so on. Once we have the item sets with minimum coverage, the next step is to turn each one into a rule, or a set of rules, with at least the specified minimum accuracy. Some item sets will produce more than one rule; others will produce none.
Accuracy = (# instances that satisfy both antecedent and consequent) / (# instances that satisfy the antecedent)

Generating Item Sets Efficiently
How do we generate the item sets with the specified minimum coverage? The first stage generates all 1-item sets with minimum coverage; the next stage generates the 2-item sets, the stage after that the 3-item sets, and so on. One pass over the data is needed to increase k by 1. After each pass, the surviving item sets are stored in a hash table, giving O(1) insert and lookup. Candidate 2-item sets are simply all of the surviving 1-item sets taken in pairs; then the coverage of each is computed and some are eliminated. A 3-item set can only have the minimum coverage if all three of its 2-item subsets have minimum coverage as well. (Are you sure? Yes: every instance that contains the 3-item set also contains each of its 2-item subsets, so each subset's coverage is at least as large.) Similarly, a 4-item set can only have minimum coverage if all four of its 3-item subsets have minimum coverage, and so on.
Suppose there are 5 surviving 3-item sets, sorted in lexical order: (A B C), (A B D), (A C D), (A C E), (B C D). The union of the first two, (A B C D), is a candidate: all of its 3-item subsets are listed. We only need to consider the union of pairs with two members in common. For example, we do not consider the union of (A C D) and (B C D). Why? If (A B C) and (A B D) are both candidates, then (A B C D) would already have been generated from them; if they are not, then (A B C D) cannot be a candidate.
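A minimal sketch of the candidate-generation step described above (and continued just below), assuming the surviving k-item sets are kept as sorted tuples in a Python set standing in for the hash table; generate_candidates is a hypothetical name.

```python
def generate_candidates(frequent_k):
    """Given the surviving k-item sets (as sorted tuples, in lexical order),
    form (k+1)-item candidates by joining pairs that share their first k-1
    items, then prune any candidate with a k-item subset that did not survive."""
    frequent = set(frequent_k)
    items = sorted(frequent_k)
    candidates = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a[:-1] != b[:-1]:         # must agree on all but the last item;
                break                    # later sets cannot share the prefix either
            union = a + (b[-1],)
            # Prune: every k-item subset must itself have minimum coverage.
            if all(union[:j] + union[j + 1:] in frequent for j in range(len(union))):
                candidates.append(union)
    return candidates

surviving_3 = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
print(generate_candidates(surviving_3))   # [('A', 'B', 'C', 'D')]
```

The single surviving candidate, (A B C D), still has to have its coverage counted against the data; the prune step only discards sets that cannot possibly reach the minimum coverage.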
Continuing with candidate generation: when comparing each 3-item candidate, we only need to consider its potential union with 3-item candidates that follow it in lexical order. Why? With (A B C), (A B D), (A C D), (A C E), (B C D), only two pairs need to be considered: (A B C) & (A B D), whose union is a candidate and must have its coverage computed (the hard way), and (A C D) & (A C E), whose union is not a candidate. How can we check quickly that (A C D E) is not a candidate? Remove each item in turn and check whether the remaining 3-item set is in the hash table.

Generating Rules
Once we have the high-coverage item sets, the next step is to generate rules from each. If we sought only rules with a single condition in the consequent, the process would be simple: remove each item in turn, use it as the consequent, and compute the accuracy as the coverage of the entire set divided by the coverage of the antecedent (both coverages are found in the hash table). But we want to allow rules with multiple conditions in the consequent. We could consider each possible subset as a consequent (brute force), but this gets inefficient very quickly. There is a better way: if a double-consequent rule holds (with minimum coverage and accuracy), then both of its single-consequent rules must also hold (with minimum coverage and accuracy). Conversely, if one or the other of the single-consequent rules does not hold, there is no point in checking the double-consequent rule. Thus, we can build up from single-consequent rules to double-consequent rules, then triple-consequent rules, and so on. This process of building up rules is similar to building up the high-coverage item sets.
Association rules are often used in market basket analysis. In such applications, attributes are often binary and sparse. In many situations we have a target number of rules (e.g. 50): we seek the 50 rules with the greatest possible coverage, subject to a specified minimum accuracy. One way to do this is to begin by specifying a rather high minimum coverage; then, if the number of rules found is less than the goal, the entire process is repeated with a smaller minimum coverage.

Instance-Based Learning
In IBL, the training examples are stored verbatim, and a distance function is used to determine which member (or which k members) of the training set is closest to an unknown test instance. How do we choose the distance function? Most IBL methods use Euclidean distance:
d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xk − yk)² )
Of course, when we only compare distances, the square root need not be computed.

Distance Function Options
Manhattan distance uses the absolute value of each difference instead of the square. Powers higher than 2 can also be used; higher powers increase the influence of large differences at the expense of small differences. Generally, Euclidean distance is a good compromise and a good default, but other metrics may be more appropriate in some applications.
Different attributes are often measured on different scales, so with raw Euclidean distance the effect of some attributes might be dwarfed by others. Thus, it is common to normalize all attribute values to the range [0, 1]:
a_i = (v_i − min v_i) / (max v_i − min v_i)
Attribute weighting is also possible.
What about categorical attributes? The distance is taken to be 0 if the values match and 1 if they do not. How do we handle missing values? For categorical attributes, assume that a missing value is maximally distant (i.e. distance 1) from all other values (even other missing values). For numeric attributes, the distance is 1 if both values are missing; if only one value is missing, the distance is taken to be the maximum possible given the known value.
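Returning to the rule-generation step described above, a minimal sketch for single-consequent rules; the coverage dictionary stands in for the hash table of item-set counts, and the particular item set and counts are meant to match the usual 14-instance weather data (an assumption, not a table from these notes).

```python
def single_consequent_rules(item_set, coverage, min_accuracy):
    """For each item in the set, use it as the consequent and the rest as the
    antecedent; accuracy = coverage(whole set) / coverage(antecedent)."""
    rules = []
    whole = coverage[frozenset(item_set)]
    for consequent in item_set:
        antecedent = frozenset(item_set) - {consequent}
        accuracy = whole / coverage[antecedent]
        if accuracy >= min_accuracy:
            rules.append((tuple(sorted(antecedent)), consequent, accuracy))
    return rules

# Illustrative coverage counts (looked up in the hash table in practice).
coverage = {
    frozenset({"humidity=normal", "windy=false", "play=yes"}): 4,
    frozenset({"humidity=normal", "windy=false"}): 4,
    frozenset({"humidity=normal", "play=yes"}): 6,
    frozenset({"windy=false", "play=yes"}): 6,
}
item_set = ("humidity=normal", "windy=false", "play=yes")
for rule in single_consequent_rules(item_set, coverage, min_accuracy=1.0):
    print(rule)   # only humidity=normal, windy=false -> play=yes reaches 100%
```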
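Finally, a minimal sketch of the normalized distance computation for instance-based learning just described, including the stated conventions for categorical attributes and missing values. The function names, the use of None to mark a missing value, and the example ranges and instances are illustrative assumptions.

```python
def normalize(value, lo, hi):
    """Scale a numeric value into [0, 1] using the training-set min and max."""
    return (value - lo) / (hi - lo)

def attribute_distance(a, b, numeric_range=None):
    """Per-attribute difference following the conventions in the notes:
    - categorical: 0 on a match, 1 on a mismatch or any missing value
    - numeric: absolute difference of normalized values; 1 if both are
      missing; if only one is missing, the largest possible difference."""
    if numeric_range is None:                      # categorical attribute
        if a is None or b is None:
            return 1.0
        return 0.0 if a == b else 1.0
    lo, hi = numeric_range
    if a is None and b is None:
        return 1.0
    if a is None or b is None:
        known = normalize(b if a is None else a, lo, hi)
        return max(known, 1.0 - known)
    return abs(normalize(a, lo, hi) - normalize(b, lo, hi))

def squared_distance(x, y, numeric_ranges):
    """Squared Euclidean distance; the square root is unnecessary when we
    only compare distances."""
    return sum(attribute_distance(x[k], y[k], numeric_ranges.get(k)) ** 2 for k in x)

# Illustrative ranges and instances (attribute names echo the weather data).
ranges = {"temperature": (64, 85), "humidity": (65, 96)}
a = {"outlook": "sunny", "temperature": 75, "humidity": 70, "windy": "false"}
b = {"outlook": "rainy", "temperature": None, "humidity": 96, "windy": "false"}
print(squared_distance(a, b, ranges))
```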