C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling
Chris Drummond and Robert C. Holte
Summary
This paper examines two sampling schemes (over-sampling and under-sampling) that adapt machine learning algorithms for datasets with imbalanced classes and misclassification costs. The authors use a performance analysis technique called 'cost curves' to investigate how these sampling methods interact with the decision tree learner C4.5. The research highlights under-sampling's generally better cost sensitivity compared to over-sampling.
Full Transcript
C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling

Chris Drummond ([email protected])
Institute for Information Technology, National Research Council Canada, Ottawa, Ontario, Canada, K1A 0R6

Robert C. Holte ([email protected])
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8

Abstract

This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. C4.5 was chosen as, when combined with one of the sampling schemes, it is quickly becoming the community standard when evaluating new cost-sensitive learning algorithms. This paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison. But it is recommended that the least cost classifier be part of that standard, as it can be better than under-sampling for relatively modest costs. Over-sampling, however, shows little sensitivity; there is often little difference in performance when misclassification costs are changed.

1. Introduction

In this paper, we experimentally study the two most common sampling schemes which are used to adapt machine learning algorithms to imbalanced classes and misclassification costs. We look at under-sampling and over-sampling, which decrease and increase respectively the frequency of one class in the training set to reflect the desired misclassification costs. These schemes are attractive as the only change required is to the training data rather than to the algorithm itself. It is hard to justify a more sophisticated algorithm if it cannot outperform existing learners using one of these simple sampling schemes. Here, we study the sampling schemes and how they affect the decision tree learner C4.5 (Quinlan, 1993). We chose C4.5 not only because it is one of the most commonly used algorithms in the machine learning and data mining communities but also because it has become a de facto community standard against which every new algorithm is judged. More recently, as research into cost sensitivity and class imbalance has become more prevalent, C4.5 combined with under-sampling or over-sampling is quickly becoming the accepted baseline to beat (Domingos, 1999; Pazzani et al., 1994).

Using our own performance analysis technique, called cost curves (Drummond & Holte, 2000a), discussed briefly in the next section, we show that under-sampling produces a reasonable sensitivity to changes in misclassification costs and class distribution. On the other hand, over-sampling is surprisingly ineffective, often producing little or no change in performance as the training set distribution is changed. We go on to explore which aspects of C4.5 result in under-sampling being so effective and why they fail to be useful when over-sampling. We have previously shown that the splitting criterion has relatively little effect on cost sensitivity (Drummond & Holte, 2000b). Breiman et al. (1984, p. 94) observed that costs and class distribution primarily affect pruning. Still, we did not find that this was the main cause of the difference between the two sampling schemes. Over-sampling tends to reduce the amount of pruning that occurs. Under-sampling often renders pruning unnecessary: by removing instances from the training set, it stunts the growth of many branches before pruning can take effect. We find that over-sampling can be made cost-sensitive if the pruning and early stopping parameters are set in proportion to the amount of over-sampling that is done. But the extra computational cost of using over-sampling is unwarranted, as the performance achieved is, at best, the same as under-sampling.
2. Cost Curves

In this section, we discuss cost curves at an intuitive level, hopefully sufficient for the reader to understand the experimental results. We refer the interested reader to our paper (Drummond & Holte, 2000a) for a more in-depth discussion of these curves. The bold continuous line in Figure 1 is an example of a cost curve produced when under-sampling the training set. Ignoring the outer (parenthetical) labels on the axes at present, this represents the error rate of C4.5 for the full range of class distributions. The x-value of each point on the curve indicated by a black circle is the fraction of the training set that is positive, P(+). The y-value is the expected error rate of the decision tree grown on each of these training sets. To estimate error rates for other class distributions, for intermediate x-values, linear interpolation between points is used.

If a confusion matrix is generated for each training set distribution, we can determine the error rate on the positive and negative instances separately. Knowing these error rates allows the performance of each decision tree to be assessed across all class distributions. The performance of each individual tree is represented in Figure 1 by a dotted straight line. The performance of the tree at the training set distribution is indicated by a black circle. But this is just one point on the line; other points give the classifier's performance for quite different distributions. From this perspective a learning algorithm's cost sensitivity has two components: (1) producing a good range of classifiers suitable for different class distributions; (2) selecting the right classifier for the particular distribution. The dotted lines in Figure 1 have a wide range of gradients, indicating a wide variety of different trade-offs in the number of positive and negative examples correctly classified. So under-sampling has produced a good range of classifiers. We can decide whether it has chosen the right classifier by seeing if the line with the minimum error rate at any particular probability has been chosen. [Footnote 1: As emphasized in Drummond and Holte (2000a), it is not necessarily optimal to have the class distribution in the training set identical to the class distribution that will be used in testing. Although this paper ignores this important issue, the conclusions drawn apply in the more general case.]

Generally this has happened, but we do notice that there is a better classifier from a probability of 0 to 0.2 and another one from a probability of 0.8 to 1. The first is the classifier that always predicts a negative, so has zero error rate when the probability of positive is zero (all instances are negative), and an error rate of one when the probability of positive is one (all instances are positive). It is represented by the diagonal continuous line going from (0,0) to (1,1). The second is the classifier that always predicts positive and forms the opposite diagonal. The triangle underneath these two lines is the majority classifier that chooses the most common class. We feel it is important to take particular note of how the learning algorithm performs with respect to this classifier. Using an algorithm with a performance that is worse than such a trivial classifier is of doubtful merit. For this data set, C4.5 is only useful when one class is less than four times more frequent than the other class.

Figure 1. An Example Cost Curve. (x-axis: Probability of Positive P(+), relabeled as Probability Cost Function PCF(+); y-axis: Error Rate, relabeled as Normalized Expected Cost.)
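To make the error-rate reading of Figure 1 concrete, the short sketch below (an illustration added here, not code from the paper; the function names are invented) shows how a fixed tree's expected error rate varies linearly with the fraction of positives, and how the two trivial classifiers form the diagonals that bound the majority classifier's triangle.

    # Illustrative sketch, not the authors' code: once a tree's error rates on the
    # positive and negative instances are known, its expected error rate is a
    # straight line in the fraction of positives P(+).

    def error_rate_line(fn_rate, fp_rate, p_pos):
        """Expected error rate at class distribution P(+) = p_pos.
        fn_rate: error rate on positive instances; fp_rate: error rate on negatives."""
        return fn_rate * p_pos + fp_rate * (1.0 - p_pos)

    # The two trivial classifiers form the diagonals of Figure 1 ...
    always_negative = lambda p: error_rate_line(1.0, 0.0, p)  # line from (0,0) to (1,1)
    always_positive = lambda p: error_rate_line(0.0, 1.0, p)  # the opposite diagonal

    # ... and the majority classifier is their lower envelope (the triangle).
    majority_classifier = lambda p: min(always_negative(p), always_positive(p))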
It is now commonplace to deal with class imbalances much more extreme than 4:1. But it is not class distribution alone which must be considered; it is also the cost of misclassification. By applying a different interpretation to the x- and y-axes in the cost curve plot, it can be seen to represent the expected cost of a classifier across all possible choices of misclassification costs and class distributions. We begin by relabeling the y-axis in Figure 1 so that instead of representing error rate it represents expected cost. But like error rate it is normalized so that the best performance is zero and the worst performance is one. This we call the normalized expected cost. The x-axis is also relabeled to include misclassification costs. We multiply the original value, P(+), by the cost of misclassifying a positive instance as negative, and normalize so that x ranges from 0 to 1. We call the normalized version of C(−|+) ∗ P(+) the "probability cost function", PCF(+).

There are two simplifications worthy of note. When misclassification costs are equal, C(−|+) = C(+|−), this definition of x reduces to P(+), our original measure of class distribution. When the class frequencies are equal, P(+) = P(−), this definition reduces to C(−|+)/(C(−|+) + C(+|−)). Here the misclassification costs are normalized so that, like the probabilities, they sum to one. In the more general case, we must consider variation in both costs and class frequencies at the same time. So in Figure 1, even if the class distribution is worse than 4:1, if the misclassification cost for the minority class is greater than that for the majority class this will tend to pull us back towards the center of the diagram. If the misclassification costs exactly balance the class distribution, we would be at the center and have potentially the maximum advantage over the majority, or least cost, classifier.
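The relabeled axes follow directly from these definitions. The sketch below (again only an illustration; the names are invented) computes PCF(+) and a classifier's normalized expected cost from its per-class error rates, using the cost-curve formulation of Drummond and Holte (2000a), and checks the two simplifications noted above.

    # Illustrative sketch of the relabeled axes; cost_fn denotes C(-|+) and
    # cost_fp denotes C(+|-), following the notation above.

    def pcf_positive(p_pos, cost_fn, cost_fp):
        """Probability cost function PCF(+): the normalized version of C(-|+) * P(+)."""
        return (p_pos * cost_fn) / (p_pos * cost_fn + (1.0 - p_pos) * cost_fp)

    def normalized_expected_cost(fn_rate, fp_rate, pcf_pos):
        """A classifier's cost-curve line: 0 is the best possible performance, 1 the worst."""
        return fn_rate * pcf_pos + fp_rate * (1.0 - pcf_pos)

    # Equal misclassification costs: PCF(+) reduces to P(+).
    assert abs(pcf_positive(0.3, 1.0, 1.0) - 0.3) < 1e-12
    # Equal class frequencies: PCF(+) reduces to C(-|+) / (C(-|+) + C(+|-)).
    assert abs(pcf_positive(0.5, 4.0, 1.0) - 4.0 / (4.0 + 1.0)) < 1e-12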
To explore the differences between the two sampling schemes we generate curves based on decision trees grown on training sets with the class distribution changed by adding or deleting instances. For under-sampling, a fraction of the instances of one class are randomly deleted to form the new training set. Under-sampling is done on each training set produced by 10-fold cross-validation, which is repeated 50 times to reduce sampling variance. For over-sampling, instances of one of the classes are duplicated up to the floor of the desired ratio. The remaining fraction is chosen randomly without replacement from the training data. Thus if we had a ratio of 4.35 to 1, the instances of one class would be duplicated 4 times and then a random sample would make up the remaining 0.35. For over-sampling, 10-fold cross-validation is repeated only 10 times as the sampling variance is smaller.
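A minimal sketch of the two sampling procedures just described (an illustration, not the authors' implementation; the function names and the use of Python's random module are assumptions):

    import math
    import random

    def under_sample(instances, keep_fraction, rng=random):
        """Randomly delete instances of one class, keeping only `keep_fraction` of them."""
        return rng.sample(instances, int(round(keep_fraction * len(instances))))

    def over_sample(instances, ratio, rng=random):
        """Duplicate one class up to the floor of `ratio`, then draw the fractional
        remainder randomly without replacement, e.g. ratio = 4.35 -> 4 full copies
        plus a random 35% sample."""
        whole = int(math.floor(ratio))
        remainder = ratio - whole
        new_instances = list(instances) * whole
        new_instances += rng.sample(instances, int(round(remainder * len(instances))))
        return new_instances

As in the experiments above, such sampling would be applied inside each cross-validation fold, with the repetition counts (50 for under-sampling, 10 for over-sampling) controlling sampling variance rather than the schemes themselves.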
3. Comparing the Sampling Schemes

In this section, we compare the performance of under- and over-sampling on 4 data sets, three from the UCI Irvine collection (Blake & Merz, 1998) and one from earlier work by one of the authors (Kubat et al., 1997). We chose these data sets as they produced cost curves that captured all the qualitative features we observed in a larger set of experiments (including other UCI data sets: vote, hepatitis, labor, letter-k and glass2). For these data sets, under-sampling combined with C4.5 is a useful baseline to evaluate other algorithms. Over-sampling, on the other hand, is not to be recommended when used with C4.5. Even with large changes to the training set distribution it often produced classifiers little different in performance to the one trained on the original distribution. Thus it largely fails to achieve its very purpose.

In the following figures, we have scaled the y-axis differently for each data set to improve clarity and do not show the performance of individual classifiers for under-sampling, only for over-sampling. We also include a vertical dashed line at the x-value corresponding to the fraction of positive examples in the data set.

The bold dashed line in Figure 2 shows the performance of C4.5 using under-sampling on the sonar data set. Sonar has 208 instances, 111 mines and 97 rocks, with 60 real valued attributes. Under-sampling produces a cost curve that is reasonably cost sensitive; it is quite smooth and largely within the lower triangle that represents the majority, or least cost, classifier. The bold continuous curve represents over-sampling. What is most striking about it is that it is almost straight. The performance varies little from that at the data set's original frequency, indicated by the vertical dotted line. If we look at the dotted lines, representing the performance of the trees generated using over-sampled training sets, we see remarkably little difference between them.

Figure 2. Sonar: Comparing Sampling Schemes

Figure 3 shows the different sampling schemes for the Japanese credit data set. It has 690 instances, 307 positive and 383 negative, with 15 attributes, 6 real and 9 nominal. For under-sampling, again the curve is reasonably smooth and this time remains completely within the triangle representing the majority, or least cost, classifier. It is still noticeable, however, for a PCF(+) value of 0 to 0.1 and 0.9 to 1.0 that it is only marginally better. So for class or cost ratios of greater than 9:1 there is little to be gained over using a trivial classifier. Here the bold curve for over-sampling shows some sensitivity to costs. The dotted lines, representing individual trees, show a reasonable diversity. But overall the expected cost, particularly when misclassification costs or class frequencies are severely imbalanced, is a lot worse than when under-sampling.

Figure 3. Credit: Comparing Sampling Schemes

Figure 4 shows the under-sampling scheme for the breast cancer data set from the Institute of Oncology, Ljubljana. It has 286 instances, 201 non-recurrences and 85 recurrences, with 9 nominal attributes. For this data set, C4.5 only just outperforms the least cost classifier at the original frequency of the data set. There is a reasonable diversity in the individual trees, but this produces little real advantage in using C4.5, except perhaps if the misclassification costs balance the difference in class frequencies. Still, under-sampling does not misbehave badly; the curve mainly stays within the triangle for the least cost classifier.

Figure 4. Breast Cancer: Under-sampling

Figure 5 compares the cost curves for under- and over-sampling on this data set. The bold over-sampling curve tracks that for under-sampling when PCF(+) exceeds the frequency in the data set. However, when PCF(+) is less than the original frequency the curve quickly flattens out and the dotted lines representing individual trees become bunched.

Figure 5. Breast Cancer: Comparing Sampling Schemes

Figure 6 shows the different sampling schemes for the sleep data set. It has 840 instances, 700 1's and 140 2's, with 15 real valued attributes. Here, under-sampling produces a reasonably cost sensitive curve that is smooth and stays within the triangle of the least cost classifier. But there is little diversity in the individual trees produced when over-sampling and the resulting cost curve is quite straight. Interestingly, though, when the costs or class frequencies become extremely imbalanced we suddenly start to see some improvement in expected cost.

Figure 6. Sleep: Comparing Sampling Schemes

4. Investigating Over-sampling Curves

In the previous section, it was apparent that the cost sensitivity for over-sampling was much less than that for under-sampling. At least with hindsight this is not surprising. We might expect over-sampling to limit the amount of pruning. Increasing the number of instances of the already larger class on a particular branch should make pessimistic pruning reluctant to remove that branch. For the credit data set, though, over-sampling did show some degree of cost sensitivity; pruning and the early stopping criterion were still having some effect. Figure 7 shows the cost curve and lines for individual trees once pruning is disabled. There is still some variability in the performance of individual trees but it is much reduced, resulting in a much straighter cost curve. The variability is further reduced when the stopping criterion is decreased to 1, as shown in Figure 8. The cost curve is now almost straight except at the far right hand side where the expected cost decreases appreciably.

Figure 7. Credit: Disabling Pruning

Figure 8. Credit: Disabling Stopping Criterion
The over-sampling cost curve for the sleep data set also exhibited this same tendency, being largely straight except when PCF(+) is much less than the original frequency of the data set. This curvature was traced to the special stopping criterion for continuous attributes included in C4.5. The code for this criterion is shown in Figure 9. As the number of known items increases, the minimum number of instances allowed on each side of the split increases from the normal stopping criterion to a maximum of 25. As we are looking at two class problems (MaxClass = 1) and we have set the stopping criterion to 1 (MINOBJS = 1), this has no effect until the number of known items reaches 20. But as more and more samples are added through over-sampling this minimum number increases, thus preventing the growth of some branches of the tree. If this feature is disabled, making the minimum number 1, we remove the down turn of the cost curve, as shown for the sleep data set in Figure 10.

Figure 9. Special Stopping Criterion:

    MinSplit = 0.10 * KnownItems / (MaxClass + 1);
    if ( MinSplit > 25 ) MinSplit = 25;

Figure 10. Sleep: Disabling Special Stopping Criterion
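To see why over-sampling trips this criterion, here is a rough Python rendering of the Figure 9 fragment. The lower clamp to MINOBJS is our reading of "increases from the normal stopping criterion" rather than a quotation of the C4.5 source, so treat the details as indicative only.

    MINOBJS = 1  # early stopping criterion, set to 1 in these experiments

    def min_split(known_items, max_class=1):
        """Minimum number of instances required on each side of a candidate split
        on a continuous attribute (two-class problems have max_class = 1)."""
        m = 0.10 * known_items / (max_class + 1)
        if m < MINOBJS:
            m = MINOBJS
        elif m > 25:
            m = 25
        return m

    # With MINOBJS = 1 the criterion has no effect until about 20 known items;
    # over-sampling inflates known_items and so raises the threshold, preventing
    # some branches from being grown at all.
    print(min_split(20))    # 1.0
    print(min_split(200))   # 10.0
    print(min_split(1000))  # 25 (capped)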
Generally, for data sets where there was appreciable pruning on the tree at the original frequency, over-sampling produced some overall cost sensitivity. This was true for both the credit and breast cancer data sets. Turning off pruning and the standard stopping criterion resulted in almost flat cost curves. For data sets with continuous attributes (all bar the breast cancer data set) disabling the special stopping criterion also removed the small additional sensitivity shown at the ends of the curves. Surprisingly, though, for the sonar data set, where the cost curve was initially flat, removing this special stopping criterion actually caused the curve to turn up at the end. In this case, with many extra instances added by over-sampling, the trees appeared to be much more complex but somewhat less accurate across the whole distribution. Why this occurred is the subject of future work.

Throughout this paper we have not shown end points for the cost curves for over-sampling, values when PCF(+) is one or zero. For under-sampling, the limit at either end is to have a single class. In this case C4.5 will simply choose that class and therefore have a normalized expected cost of zero. For over-sampling, the minority class never disappears entirely; the ratio just gets larger and larger. Experiments increasing the degree of over-sampling were stopped due to the excessive size of the training set. Using very large internal weights (c. 100,000,000) does produce just the majority classifier. All the instances appear to be of one class, an error made by trying to add two numbers with vastly different exponents but both represented as floats.

5. Investigating Under-sampling Curves

If disabling pruning and the early stopping criterion resulted in straighter curves for over-sampling, it would be tempting to think it might have the same effect when under-sampling. But this turns out not to be the case.

Figure 11 shows cost curves for the sonar data set using under-sampling as the different features of C4.5 are disabled. Surprisingly, we see no real change when we disable pruning, nor do we see a change when the early stopping criterion is reduced to one. All the curves are indistinguishable from the original curve discussed in Section 3. The apparently bold line, the lowest curve in Figure 11, is these three curves overlapping. We do get a change by disabling the special stopping criterion, the dashed line, but it is very small. We get a reasonably large change when we disable the threshold for continuous attributes (log(# distinct instances)/# instances) added in release 8 (Quinlan, 1996). But the major change is observed within the triangle defined by the least cost classifier. Although the curves stray further outside this triangle, they still maintain roughly the same shape and certainly do not become as straight as those produced when over-sampling.

Figure 11. Sonar: Under-sampling (curves shown for "Threshold off" and "Special Stop Off")

Pruning has a much larger effect on classifier performance for the Japanese credit data set. When pruning is turned off, we get the dashed line shown in Figure 12. Setting the stopping criterion to one also has a sizeable effect. But again both effects show up mainly while the curve is inside the least cost classifier triangle and the basic shape of the curves is maintained. For this data set, disabling the special stopping criterion and the release 8 threshold produce no appreciable effect. So disabling these features did not produce the same performance for under-sampling as over-sampling. But if we represent misclassification costs and class frequencies by means of internal weights within C4.5, disabling these features does seem to make the difference. Up-weighting, analogous to over-sampling, increases the weight of one of the classes keeping the weight of the other class at one. Down-weighting, analogous to under-sampling, decreases the weight of one of the classes keeping the weight of the other class at one.

Figure 12. Credit: Under-sampling (curves shown for "Stop = 1" and "Pruning Off")
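As a small illustration of the weighting schemes just described (not the authors' code): up-weighting multiplies one class's weight by the desired ratio, while the down-weight of 1/ratio shown here is an assumption chosen to mirror the fraction kept by under-sampling; C4.5 itself applies such weights to its internal instance counts.

    def class_weights(ratio, scheme):
        """Weights (adjusted_class, other_class) for a desired ratio between classes.
        Up-weighting raises one class's weight; down-weighting lowers it; the other
        class's weight stays at one in both cases."""
        if scheme == "up":          # analogous to over-sampling
            return ratio, 1.0
        if scheme == "down":        # analogous to under-sampling (assumed 1/ratio)
            return 1.0 / ratio, 1.0
        raise ValueError("scheme must be 'up' or 'down'")

    print(class_weights(4.35, "up"))    # (4.35, 1.0)
    print(class_weights(4.35, "down"))  # approximately (0.23, 1.0)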
Figure 13 compares the performance of the sampling schemes (the continuous lines) to the weighting (the dashed lines) for the sonar data set. The curve for up-weighting is very close to that for over-sampling, perhaps not surprising as in over-sampling we duplicate instances. The curve for down-weighting is close to, but sometimes better than, that for under-sampling. This was also true for the other data sets.

Figure 13. Sonar: Comparing Weighting and Sampling

Figure 14 shows the effect of disabling various factors using the down-weighting scheme. To produce this set of curves we first disabled the threshold added in release 8 as this had a strong interaction with down-weighting. Now we can see that for the sonar data set turning off pruning and then the stopping criterion does produce a curve that is very straight, although, as we noted before when over-sampling the sonar data set, if the special stopping criterion is removed the line curves up at the ends.

Figure 14. Sonar: Disabling Factors when Weighting

If the performance curves of under-sampling and down-weighting are similar and these internal factors have the required effect when down-weighting, why do they seem not to have the same effect when under-sampling? The answer seems to be that much of the cost sensitivity when under-sampling comes from the actual removal of instances. When we turned off many of the factors when down-weighting, the branch was still grown and the region still labeled. When the instances are removed from the training set this cannot happen.

6. Improving Over-sampling

We have seen that pruning and the stopping criterion often have a large impact on cost sensitivity. But simply disabling these factors did not make under-sampling as ineffective as over-sampling. Here we show that increasing the amount of pruning and the influence of the early stopping criterion in relation to the number of duplicates in the training set when over-sampling does have a strong effect. Figure 15 illustrates the effect on the sonar data set.

Our changes to the default values are based on the ratio used when over-sampling the training set. The early stopping criterion (the -m switch in C4.5) is set to 2 times this ratio and the pruning confidence factor (the -c switch in C4.5) is set to 0.25 divided by the ratio. So, if one class is over-sampled by 2.5 times then the stopping criterion is 5 and the confidence factor is 0.1. This produces the bold continuous curve in Figure 15 for the sonar data. This is remarkably similar to the dashed curve for under-sampling, although there is some difference on the right hand side of the figure. On the other data sets the difference is much smaller. In fact, on the credit data set, as shown in Figure 16, the difference is negligible.

Figure 15. Sonar: Changing Defaults for Over-sampling

Figure 16. Credit: Changing Defaults for Over-sampling
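The parameter adjustment of this section is straightforward to compute from the over-sampling ratio. The sketch below reproduces the 2.5-times example from the text; the command line at the end is only a hypothetical illustration of the -m and -c switches.

    def adjusted_c45_switches(oversampling_ratio, default_m=2, default_c=0.25):
        """Scale C4.5's early stopping (-m) and pruning confidence (-c) defaults
        in proportion to how much one class was over-sampled."""
        return default_m * oversampling_ratio, default_c / oversampling_ratio

    m, c = adjusted_c45_switches(2.5)
    print(m, c)  # 5.0 0.1, matching the example above
    print("c4.5 -f sonar -m %g -c %g" % (m, c))  # hypothetical invocation

With these settings a heavily over-sampled class both stops tree growth earlier and is pruned more aggressively, which is exactly the compensation this section describes.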
7. Discussion

Generally we found that using under-sampling established a reasonable baseline for algorithmic comparison. However, one problem with under-sampling is that it introduces non-determinism into what is otherwise a deterministic learning process. The values obtained from cross-validation give the expected performance of a classifier based on a random subsample of the data set. With a deterministic learning process any variance from the expected performance that occurs when deploying the classifier is largely due to testing on a limited sample of the true data. But for under-sampling there is also variance due to non-determinism of the under-sampling process. If our measure of success is purely the difference in the expectations then this is not important. But the choice between two classifiers might also depend on the variance, and then using under-sampling might be less desirable.

8. Related Work

Most relevant to this paper are previous works comparing under-sampling and over-sampling. This survey will focus on the results for variants of C4.5 since they are most closely related to the present paper. Domingos (1999) reports that, on 2-class problems, C4.5-Rules produces lower cost (better) classifiers using under-sampling than it did using over-sampling. Ling and Li (1998) compare over-sampling and under-sampling for boosted C4.5 (with certainty factors added) on three different direct-marketing data sets and report that under-sampling produces a larger (better) lift index, although extreme over-sampling performs almost as well.

Japkowicz and Stephen (2002) compare random and systematic methods of over-sampling and under-sampling. In the artificial domains studied, under-sampling was ineffective at reducing error rate. Over-sampling was effective, but most effective was supplying an appropriate misclassification cost matrix. The reason for this study coming to the opposite conclusion of all other studies is not clear.

9. Conclusions

In this paper, we have used cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. We have shown that under-sampling produces a reasonable sensitivity to changes in misclassification costs and class distribution. On the other hand, over-sampling is surprisingly ineffective, often producing little or no change in performance. But it is noteworthy that representing these changes internally by down-weighting gives the best performance overall.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, CA. www.ics.uci.edu/~mlearn/MLRepository.html.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.

Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (pp. 155–164).

Drummond, C., & Holte, R. C. (2000a). Explicitly representing expected cost: An alternative to ROC representation. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (pp. 198–207).

Drummond, C., & Holte, R. C. (2000b). Exploiting the cost (in)sensitivity of decision tree splitting criteria. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 239–246).

Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6.

Kubat, M., Holte, R. C., & Matwin, S. (1997). Learning when negative examples abound: One-sided selection. Proceedings of the Ninth European Conference on Machine Learning (pp. 146–153).

Ling, C., & Li, C. (1998). Data mining for direct marketing: Problems and solutions. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 73–79).

Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., & Brunk, C. (1994). Reducing misclassification costs. Proceedings of the Eleventh International Conference on Machine Learning (pp. 217–225).

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90.