MODULE 4 PREDICTIVE AND CLASSIFICATION METHODS
SUBTOPIC 1: PREDICTIVE AND CLASSIFICATION METHODS 1

At the end of the topic, the learner should be able to identify and understand the different predictive and classification methods:
Multiple Linear Regression
k-Nearest Neighbors (k-NN)
Naïve Bayes Classifier
Classification and Regression Trees

MULTIPLE LINEAR REGRESSION
This model is used to fit a relationship between a numerical outcome variable Y (also called the response, target, or dependent variable) and a set of predictors X1, X2, ..., Xp (also referred to as independent variables, input variables, regressors, or covariates). The assumption is that the following function approximates the relationship between the predictors and the outcome variable:

Y = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε

where β0, ..., βp are coefficients and ε is the noise or unexplained part. Data are then used to estimate the coefficients and to quantify the noise. In predictive modeling, the data are also used to evaluate model performance.

Regression modeling means not only estimating the coefficients but also choosing which predictors to include and in what form. Choosing the right form depends on domain knowledge, data availability, and the needed predictive power. Multiple linear regression is applicable to numerous predictive modeling situations. The two popular but different objectives behind fitting a regression model are:
Explaining or quantifying the average effect of inputs on an outcome (explanatory or descriptive task, respectively)
Predicting the outcome value for new records, given their input values (predictive task)

The regression model estimated from a sample is an attempt to capture the average relationship in the larger population. When the causal structure is unknown, the model quantifies the degree of association between the inputs and the outcome variable, and the approach is called descriptive modeling. In predictive analytics, however, the focus is typically on the second goal: predicting new individual records. Here we are not interested in the coefficients themselves, nor in the "average record," but rather in the predictions that the model can generate for new records. The two goals differ in four main ways:
1. A good explanatory model is one that fits the data closely, whereas a good predictive model is one that predicts new records accurately. Choices of input variables and their form can therefore differ.
2. In explanatory models, the entire dataset is used for estimating the best-fit model, to maximize the amount of information that we have about the hypothesized relationship in the population. When the goal is to predict outcomes of new individual records, the data are typically split into a training set and a validation set. The training set is used to estimate the model, and the validation (or holdout) set is used to assess this model's predictive performance on new, unobserved data.
3. Performance measures for explanatory models assess how closely the data fit the model (how well the model approximates the data) and how strong the average relationship is, whereas in predictive models performance is measured by predictive accuracy (how well the model predicts new individual records).
4. In explanatory models the focus is on the coefficients (β), whereas in predictive models the focus is on the predictions (ŷ).
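To make the predictive workflow above concrete (training/validation split, coefficient estimation, and prediction of new records), here is a minimal Python sketch using scikit-learn. The file name, predictor columns, and outcome column are hypothetical placeholders, not part of the module's examples.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical data: numerical outcome 'price' and a few numerical predictors.
df = pd.read_csv("houses.csv")              # assumed file
X = df[["sqft", "bedrooms", "age"]]         # assumed predictor columns
y = df["price"]                             # assumed outcome column

# Split into training and validation (holdout) sets, as in predictive modeling.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

# Estimate the coefficients b0, ..., bp on the training set only.
model = LinearRegression().fit(X_train, y_train)
print("Intercept (b0):", model.intercept_)
print("Coefficients (b1..bp):", model.coef_)

# Predictive focus: generate predictions (y-hat) for new (validation) records
# and measure predictive accuracy on them, not on the training data.
y_pred = model.predict(X_valid)
print("Validation MAE :", mean_absolute_error(y_valid, y_pred))
print("Validation RMSE:", mean_squared_error(y_valid, y_pred) ** 0.5)
```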
k-NEAREST NEIGHBORS (k-NN)

Classification Algorithm
The k-nearest-neighbors algorithm is a classification method that does not make assumptions about the form of the relationship between the class membership (Y) and the predictors X1, X2, ..., Xp. This is a nonparametric method because it does not involve estimation of parameters in an assumed functional form, such as the linear form assumed in linear regression.

Classification Rule
Using the single nearest neighbor to classify records can be very powerful when we have a large number of records in our training set. It turns out that the misclassification rate of the 1-nearest-neighbor scheme is no more than twice the error rate we would obtain if we knew exactly the probability density function of each class. The idea of the 1-nearest neighbor can be extended to k > 1 neighbors as follows: find the nearest k neighbors to the record to be classified, and then use a majority decision rule to classify the record, where the record is classified as a member of the majority class of the k neighbors.

Choosing k
The advantage of choosing k > 1 is that higher values of k provide smoothing that reduces the risk of overfitting due to noise in the training data. Generally speaking, if k is too low, we may be fitting to the noise in the data. However, if k is too high, we will miss out on the method's ability to capture the local structure in the data, one of its main advantages. So how is k chosen? We choose the k that has the best classification performance: we use the training data to classify the records in the validation data, then compute error rates for various choices of k.

The main advantage of k-NN methods is their simplicity and lack of parametric assumptions. There are three difficulties with the practical exploitation of the power of the k-NN approach:
1. The time to find the nearest neighbors in a large training set can be prohibitive. A number of ideas have been implemented to overcome this difficulty: reduce the time taken to compute distances by working in a reduced dimension, using dimension reduction techniques such as principal components analysis; or use sophisticated data structures such as search trees to speed up identification of the nearest neighbor. The latter approach often settles for an "almost nearest" neighbor to improve speed.
2. The number of records required in the training set to qualify as large increases exponentially with the number of predictors p.
3. k-NN is a "lazy learner": the time-consuming computation is deferred to the time of prediction. For every record to be predicted, we compute its distances from the entire set of training records only at the time of prediction.
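The "choosing k" procedure above can be sketched as follows in Python with scikit-learn; the dataset, column names, and the range of k values tried are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Hypothetical data: a categorical outcome and two numerical predictors.
df = pd.read_csv("mowers.csv")                 # assumed file
X = df[["income", "lot_size"]]                 # assumed predictor columns
y = df["ownership"]                            # assumed class column

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

# Distances are scale-sensitive, so standardize predictors using training statistics.
scaler = StandardScaler().fit(X_train)
X_train_s, X_valid_s = scaler.transform(X_train), scaler.transform(X_valid)

# Try several values of k and keep the one with the best validation accuracy
# (equivalently, the lowest validation error rate).
results = {}
for k in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    results[k] = accuracy_score(y_valid, knn.predict(X_valid_s))

best_k = max(results, key=results.get)
print("validation accuracy by k:", results)
print("best k:", best_k)
```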
THE NAÏVE BAYES CLASSIFIER
The naive Bayes method (and, indeed, an entire branch of statistics) is named after the Reverend Thomas Bayes (1702-1761).

Naïve Bayes Method
For each new record to be classified:
1. Find all of the training records with the same predictor profile (i.e., records having the same predictor values).
2. Determine what classes those records belong to and which class is most prevalent.
3. Assign that class to the new record.

Cutoff Probability Method
For each new record to be classified:
1. Establish a cutoff probability for the class of interest (C1) above which we consider that a record belongs to that class.
2. Find all the records with the same predictor profile as the new record (i.e., records having the same predictor values).
3. Determine the probability that these records belong to the class of interest.
4. If this probability is above the cutoff probability, assign the new record to the class of interest.

Conditional Probability
Both procedures incorporate the concept of conditional probability, the probability of event A given that event B has occurred, P(A|B). In general, for a response with m classes C1, C2, ..., Cm and predictor values x1, x2, ..., xp, we want to compute P(Ci | x1, ..., xp). The Bayesian classifier is the only classification or prediction method that is especially suited for (and limited to) categorical predictor variables.
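A minimal sketch of the cutoff-probability idea, using scikit-learn's CategoricalNB as one possible naive Bayes implementation for categorical predictors; the dataset, column names, and the 0.5 cutoff value are assumptions for illustration only.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# Hypothetical data: categorical predictors and a binary class of interest.
df = pd.read_csv("flights.csv")                       # assumed file
X_raw = df[["carrier", "day_of_week", "origin"]]      # assumed categorical predictors
y = (df["status"] == "delayed").astype(int)           # assumed class of interest C1

# Encode the categorical predictors as integer codes, as CategoricalNB expects.
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

nb = CategoricalNB().fit(X_train, y_train)

# Propensity P(C1 | x1, ..., xp) for each validation record, then classify
# by comparing the propensity to a cutoff probability chosen by the analyst.
propensity = nb.predict_proba(X_valid)[:, 1]
cutoff = 0.5
predicted_class = (propensity >= cutoff).astype(int)
print(predicted_class[:10])
print(propensity[:10])
```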
References:
Shmueli, G., et al. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd ed. John Wiley & Sons, Inc.
Han, J., et al. Data Mining: Concepts and Techniques, 3rd ed. Elsevier Inc., 2012.
Bruce, P., et al. Data Mining for Business Analytics: Concepts, Techniques and Applications. John Wiley & Sons, Inc., 2020.

MODULE 4 PREDICTIVE AND CLASSIFICATION METHODS
SUBTOPIC 3: COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING

At the end of the topic, the learner should be able to learn and understand the approaches to combining methods, and to understand the use of ensemble models and uplift modeling.

Ensembles played a major role in the million-dollar Netflix Prize contest that started in 2006. At the time, Netflix, the largest DVD rental service in the United States, wanted to improve its movie recommendation system (from www.netflixprize.com). In a bold move, the company decided to share a large amount of data on movie ratings by its users and set up a contest, open to the public, aimed at improving the recommendation system. During the contest, an active leaderboard showed the results of the competing teams. An interesting behavior started appearing: different teams joined forces to create combined, or ensemble, predictions, which proved more accurate than the individual predictions. The winning team, called "BellKor's Pragmatic Chaos," combined results from the "BellKor" and "Big Chaos" teams alongside additional members. In a 2010 article in Chance magazine, the Netflix Prize winners described the power of their ensemble approach.

Why Ensembles Can Improve Predictive Power
In predictive modeling, "risk" is equivalent to variation in prediction error: the more variable our prediction errors, the more volatile our predictive model. Let e1,i be the prediction error of the ith observation by method 1 and e2,i the prediction error for the same observation by method 2, and suppose each method's errors have zero mean. If for each observation we take an average of the two predictions, then the expected average error will also be zero. Using an average of two predictions can therefore potentially lead to smaller error variance, and hence better predictive power.

Simple Averaging
The simplest approach for creating an ensemble is to combine the predictions, classifications, or propensities from multiple models. For example, we might have a linear regression model, a regression tree, and a k-NN algorithm. We use all three methods to score, say, a test set, and then combine the three sets of results (a short code sketch of this appears after the ensemble discussion below).
1. Combining Predictions. In prediction tasks, where the outcome variable is numerical, we can combine the predictions from the different methods by simply taking an average.
2. Combining Classifications. Combining the results from multiple classifiers can be done using "voting": for each record we have multiple classifications, and a simple rule is to choose the most popular class among these classifications.
3. Combining Propensities. Similar to predictions, propensities can be combined by taking a simple (or weighted) average.

Bagging
Another form of ensembles is based on averaging across multiple random data samples. Bagging, short for "bootstrap aggregating," comprises two steps:
1. Generating multiple random samples (bootstrap sampling)
2. Running an algorithm on each sample and producing scores
Bagging improves the performance stability of a model and helps avoid overfitting by separately modeling different data samples and then combining the results.

Boosting
Boosting is a slightly different approach to creating ensembles. The goal is to directly improve areas in the data where our model makes errors. The steps in boosting are:
1. Fit a model to the data.
2. Draw a sample from the data so that misclassified observations (or observations with large prediction errors) have higher probabilities of selection.
3. Fit the model to the new sample.
4. Repeat steps 2-3 multiple times.

Combining scores from multiple models is aimed at generating more precise predictions (lowering the prediction error variance). Ensembles can combine results using simple averaging, weighted averaging, voting, medians, and so forth. Ensembles have become a major strategy for participants in data mining contests, where the goal is to optimize some predictive measure. Ensembles also provide an operational way to obtain solutions with high predictive power quickly, by engaging multiple teams of "data crunchers" working in parallel and combining their results.

The main weakness of ensembles is the resources that they require: computationally, as well as in terms of software availability and the analyst's skill and time investment. Implementing ensembles that combine results from different algorithms requires developing each of the models and evaluating them. Boosting-type ensembles and bagging-type ensembles do not require such effort, but they do have a computational cost. Ensembles that rely on multiple data sources require collecting and maintaining multiple data sources. Finally, ensembles are "black-box" methods, in that the relationship between the predictors and the outcome variable usually becomes nontransparent.
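As a minimal illustration of simple averaging (the first ensemble approach described above), the following sketch scores a hypothetical test set with three different models and averages their numerical predictions; the models chosen, the data file, and the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical data with a numerical outcome.
df = pd.read_csv("sales.csv")                      # assumed file
X = df[["price", "advertising", "month"]]          # assumed predictors
y = df["units_sold"]                               # assumed outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Three different models, each fit on the same training data.
models = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=5, random_state=1),
    KNeighborsRegressor(n_neighbors=5),
]
predictions = [m.fit(X_train, y_train).predict(X_test) for m in models]

# Ensemble prediction: simple average of the three sets of predictions.
ensemble_pred = np.mean(predictions, axis=0)

for name, pred in zip(["linear", "tree", "knn", "ensemble"], predictions + [ensemble_pred]):
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name:9s} RMSE: {rmse:.2f}")
```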
UPLIFT MODELING
Uplift modeling is a causal learning approach for estimating an experiment's individual treatment effect. Using experimental data, the end user can calculate the incremental impact of a treatment (such as a direct marketing action) on an individual's behavior.

Long before the advent of the Internet, sending messages directly to individuals (i.e., direct mail) held a big share of the advertising market. Direct marketing affords the marketer the ability to invite and monitor direct responses from consumers. This, in turn, allows the marketer to learn whether the messaging is paying off. A message can be tested with a small section of a large list and, if it pays off, the message can be rolled out to the entire list. With predictive modeling, we have seen that the rollout can be targeted to that portion of the list that is most likely to respond or behave in a certain way. None of this was possible with media advertising (television, radio, newspaper, magazine). Direct response also made it possible to test one message against another and find out which does better.

A-B Testing
A-B testing is the marketing industry's term for a standard scientific experiment in which results can be tracked for each individual. The idea is to test one treatment against another, or a treatment against a control. An important element of A-B testing is randomization: the treatments are assigned or delivered to individuals randomly. This way, any difference between treatment A and treatment B can be attributed to the treatment (unless it is due to chance).

Uplift
An A-B test tells you which treatment does better on average but says nothing about which treatment does better for which individual. Uplift modeling is used mainly in marketing and, more recently, in political campaigns. It has two main purposes:
To determine whether to send someone a persuasion message, or just leave them alone
When a message is definitely going to be sent, to determine which message, among several possibilities, to send

References:
Shmueli, G., et al. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd ed. John Wiley & Sons, Inc.
Han, J., et al. Data Mining: Concepts and Techniques, 3rd ed. Elsevier Inc., 2012.
Bruce, P., et al. Data Mining for Business Analytics: Concepts, Techniques and Applications. John Wiley & Sons, Inc., 2020.

MODULE 3 PERFORMANCE AND EVALUATION

At the end of the topic, the learner should be able to learn and understand the different techniques used in evaluating predictive performance, and to identify and understand the differences among these techniques for the performance evaluation of predictive analytics.

Three main types of outcomes of interest are:
Predicted numerical value: when the outcome variable is numerical (e.g., house price)
Predicted class membership: when the outcome variable is categorical (e.g., buyer/nonbuyer)
Propensity: the probability of class membership, when the outcome variable is categorical (e.g., the propensity to default)

For assessing prediction performance, several measures are used. In all cases, the measures are based on the validation set, which serves as more objective ground than the training set for assessing predictive accuracy.

Naïve Benchmark: The Average
The benchmark criterion in prediction is using the average outcome value. In other words, the prediction for a new record is simply the average across the outcome values of the records in the training set.

Prediction Accuracy Measures
The prediction error for record i is defined as the difference between its actual outcome value and its predicted outcome value: ei = yi − ŷi. A few popular numerical measures of predictive accuracy are (a short code sketch follows this list):
MAE (mean absolute error/deviation). This gives the magnitude of the average absolute error.
Mean Error. This measure is similar to MAE except that it retains the sign of the errors, so that negative errors cancel out positive errors of the same magnitude.
MPE (mean percentage error). This gives the percentage score of how predictions deviate from the actual values (on average), taking into account the direction of the error.
MAPE (mean absolute percentage error). This measure gives a percentage score of how predictions deviate from the actual values.
RMSE (root mean squared error). This is similar to the standard error of estimate in linear regression, except that it is computed on the validation data rather than on the training data.
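Assuming the actual and predicted outcome values of the validation records are already available (the numbers below are made up), these measures can be computed as follows:

```python
import numpy as np

# Hypothetical actual and predicted outcome values for validation records.
y_actual = np.array([230.0, 310.0, 150.0, 480.0, 275.0])
y_pred   = np.array([250.0, 300.0, 170.0, 450.0, 260.0])

e = y_actual - y_pred                       # prediction errors e_i = y_i - yhat_i

mae  = np.mean(np.abs(e))                   # mean absolute error
me   = np.mean(e)                           # mean error (signed, errors can cancel)
mpe  = np.mean(e / y_actual) * 100          # mean percentage error
mape = np.mean(np.abs(e / y_actual)) * 100  # mean absolute percentage error
rmse = np.sqrt(np.mean(e ** 2))             # root mean squared error

print(f"MAE={mae:.2f}  ME={me:.2f}  MPE={mpe:.2f}%  MAPE={mape:.2f}%  RMSE={rmse:.2f}")
```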
Errors that are based on the training set tell us about model fit, whereas those that are based on the validation set (called "prediction errors") measure the model's ability to predict new data (predictive performance). We expect training errors to be smaller than the validation errors (because the model was fitted using the training set), and the more complex the model, the greater the likelihood that it will overfit the training data (indicated by a greater difference between the training and validation errors). In an extreme case of overfitting, the training errors would be zero (a perfect fit of the model to the training data), while the validation errors would be non-zero and non-negligible.

A natural criterion for judging the performance of a classifier is the probability of making a misclassification error. Misclassification means that the record belongs to one class but the model classifies it as a member of a different class. A classifier that makes no errors would be perfect, but we do not expect to be able to construct such classifiers in the real world, due to "noise" and to not having all the information needed to classify records precisely.

Figure 5.2 Cumulative gains chart (a) and decile lift chart (b) for continuous outcome variable (sales of Toyota cars)

The following aspects of classifier evaluation are discussed below:
Benchmark: The Naïve Rule
Class Separation
The Confusion (Classification) Matrix
Using the Validation Data
Accuracy Measures
Propensities and Cutoff for Classification

A very simple rule for classifying a record into one of m classes, ignoring all predictor information (x1, x2, ..., xp) that we may have, is to classify the record as a member of the majority class. In other words, "classify as belonging to the most prevalent class." The naive rule is used mainly as a baseline or benchmark for evaluating the performance of more complicated classifiers. Clearly, a classifier that uses external predictor information (on top of the class membership allocation) should outperform the naive rule.

If the classes are well separated by the predictor information, even a small dataset will suffice in finding a good classifier, whereas if the classes are not separated at all by the predictors, even a very large dataset will not help.

Figure 5.3 High (a) and low (b) levels of separation between two classes, using two predictors

The confusion matrix summarizes the correct and incorrect classifications that a classifier produced for a certain dataset. Rows and columns of the confusion matrix correspond to the predicted and true (actual) classes, respectively. The confusion matrix gives estimates of the true classification and misclassification rates.

Table 5.2 Confusion matrix based on 3000 records and two classes

Different accuracy measures can be derived from the classification matrix. Consider a two-class case with classes C1 and C2 (e.g., buyer/non-buyer). The schematic confusion matrix in Table 5.3 uses the notation ni,j to denote the number of records that are class Ci members and were classified as Cj members. Of course, if i ≠ j, these are counts of misclassifications. The total number of records is n = n1,1 + n1,2 + n2,1 + n2,2. The main accuracy measure is the estimated misclassification rate, also called the overall error rate. It is given by

err = (n1,2 + n2,1) / n

where n is the total number of cases in the validation dataset.

Table 5.3 Confusion matrix: Meaning of Each Cell
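Assuming the actual and predicted classes of the validation records are available (the labels below are made up), a minimal sketch of the confusion matrix and the overall error rate:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical actual and predicted classes for validation records (C1 = 1, C2 = 0).
y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred   = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# Note: scikit-learn puts actual classes in rows and predicted classes in columns,
# the transpose of the convention described in the text above.
cm = confusion_matrix(y_actual, y_pred, labels=[1, 0])
print(cm)

# Overall error rate: misclassified records (off-diagonal counts) divided by n.
n = cm.sum()
error_rate = (n - np.trace(cm)) / n
print("overall error rate:", error_rate)
print("accuracy:", accuracy_score(y_actual, y_pred))
```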
Propensities are typically used either as an interim step for generating predicted class membership (classification) or for rank-ordering the records by their probability of belonging to a class of interest. If overall classification accuracy (involving all the classes) is of interest, the record can be assigned to the class with the highest probability. In many cases, a single class is of special interest, so we focus on that particular class and compare the propensity of belonging to that class to a cutoff value set by the analyst. This approach can be used with two classes or with more than two classes, though in the latter case it may make sense to consolidate classes so that you end up with two: the class of interest and all other classes.

Table 5.4 24 records with their actual class and the probability (propensity) of them being class "owner" members, as estimated by a classifier
Figure 5.6 Classification metrics based on cutoffs of 0.5, 0.25, and 0.75 for the riding mower data

Lift Curves
Lift curves (also called lift charts, gains curves, or gains charts) are used with models involving categorical outcomes. The lift curve helps us determine how effectively we can "skim the cream" by selecting a relatively small number of cases and getting a relatively large portion of the responders.

It is often the case that the rarer events are the more interesting or important ones: responders to a mailing, those who commit fraud, defaulters on debt, and the like. When one class is rare, a stratified sampling procedure that oversamples the rare class in the training data is often used; this procedure is sometimes called weighted sampling or undersampling, the latter referring to the fact that the more plentiful class is undersampled relative to the rare class. The procedure is as follows (a code sketch appears after the references at the end of this module):
Step 1. The response and nonresponse data are separated into two distinct sets, or strata.
Step 2. Records are randomly selected for the training set from each stratum. Typically, one might select half the (scarce) responders for the training set, then an equal number of nonresponders.
Step 3. The remaining responders are put in the validation set.
Step 4. Nonresponders are randomly selected for the validation set in sufficient numbers to maintain the original ratio of responders to nonresponders.
Step 5. If a test set is required, it can be taken randomly from the validation set.

References:
Shmueli, G., et al. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd ed. John Wiley & Sons, Inc.
Bruce, P., et al. Data Mining for Business Analytics: Concepts, Techniques and Applications. John Wiley & Sons, Inc., 2020.
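As a minimal sketch of the stratified sampling steps above, assuming a hypothetical pandas DataFrame with a binary 'response' column (1 = responder, the rare class):

```python
import pandas as pd

# Hypothetical data: 'response' is 1 for the rare responders, 0 for nonresponders.
df = pd.read_csv("mailing.csv")                     # assumed file
responders    = df[df["response"] == 1]
nonresponders = df[df["response"] == 0]

# Step 2: half of the responders, plus an equal number of nonresponders, go to training.
train_resp    = responders.sample(frac=0.5, random_state=1)
train_nonresp = nonresponders.sample(n=len(train_resp), random_state=1)
train = pd.concat([train_resp, train_nonresp])

# Step 3: the remaining responders go to the validation set.
valid_resp = responders.drop(train_resp.index)

# Step 4: sample nonresponders so the validation set keeps the original
# responder-to-nonresponder ratio of the full dataset.
orig_ratio = len(nonresponders) / len(responders)
remaining_nonresp = nonresponders.drop(train_nonresp.index)
valid_nonresp = remaining_nonresp.sample(
    n=min(int(len(valid_resp) * orig_ratio), len(remaining_nonresp)), random_state=1
)
valid = pd.concat([valid_resp, valid_nonresp])

print("training:", train["response"].value_counts().to_dict())
print("validation:", valid["response"].value_counts().to_dict())
```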