Cross Validation (BAIS 3500)
Document Details
Uploaded by: SafeSweetPea6338
School: Iowa
Instructor: Hassan Rafique
Summary
This document provides lecture notes on cross-validation for a data mining course (BAIS 3500). It covers assessing model performance, training and testing, test error, and different cross-validation techniques (the validation set approach, k-fold CV, and LOOCV). It also addresses the limitations of these techniques, the right choice of the parameter k, and model selection and assessment.
Full Transcript
BAIS 3500: Data Mining
Cross Validation
Hassan Rafique, Department of Business Analytics

Assessing Model Performance
Why is it necessary to introduce so many different data mining approaches, rather than just a single best method?
‒ We have learned KNN, Logistic Regression, Decision Trees, Random Forest, and Boosting.
There is no free lunch in statistics (data mining): no one method dominates all others over all possible data sets. One specific method may work best on a particular data set, but another method may work better on a similar but different data set. Selecting the best approach can be one of the most challenging parts of performing data mining in practice.

Training and Testing

Training vs Test Error
Note that there will be a training error and a testing error.
‒ Ideally, we want both the training and testing errors to be small. Both play a significant role.
‒ We want the training error to be small so that the learned model f̂ has a small error (f − f̂)² with respect to the true model f.
‒ And we want the testing error to be small as well, meaning that the learned model does well on new (test) data. This is the primary focus of our attention: we want the model to generalize well.
Estimating the test error is very important for model selection!
‒ The training error rate is often quite different from the test error rate; in particular, the former can dramatically underestimate the latter.

Test Error
Recall that the test error is the main performance measure of a model. How do we measure (approximate) the test error rate?
‒ A relatively large designated test set can be used to directly estimate the test error rate (usually we do not have such a test set).
‒ Otherwise, a number of techniques can be used to estimate this quantity using only the available training data: when we don't have a test set, when tuning hyperparameters for model selection, or when there is not enough data altogether.
Validation-based approaches are a class of methods that estimate the test error by
‒ holding out a subset of the training observations from the fitting process,
‒ and then applying the fitted model to those held-out observations.

Cross Validation (CV)
Cross-validation is a technique used to estimate the test error. CV can be used
‒ to select the appropriate level of flexibility for a model type, which usually involves tuning hyperparameters, e.g., choosing a decision tree with depth 3 or depth 10;
‒ to compare the performance of different model types, e.g., logistic regression vs. decision trees.

Validation Set Approach
Here we randomly divide the available set of samples into two parts: a training set and a validation (or hold-out) set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error (see the code sketch below).

Validation Set Example
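As a concrete illustration, here is a minimal sketch of the validation set approach in Python, assuming scikit-learn; the dataset (the built-in breast cancer data) and the classifier (logistic regression) are placeholder choices, not from the slides:

```python
# Validation set approach: a minimal sketch.
# Assumptions (not from the slides): scikit-learn, the built-in breast
# cancer dataset, and logistic regression as the model.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Randomly divide the available samples into a training set and a
# validation (hold-out) set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training set only, then predict on the hold-out set.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# The hold-out error rate is our estimate of the test error rate.
val_error = 1 - accuracy_score(y_val, model.predict(X_val))
print(f"Validation-set error estimate: {val_error:.3f}")
```

Rerunning this sketch with a different random_state typically produces a noticeably different error estimate, which is exactly the variability discussed next.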
Limitations of Validation Set Approach
The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set. Moreover, in the validation approach only the training set is used to fit the model. Since data mining methods tend to perform worse when trained on fewer observations, this suggests that
‒ the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
Next we discuss the cross-validation approach, which fixes the above issues.

K-fold Cross Validation (CV)

K-fold CV Example

Leave One Out CV (LOOCV)
What's the k value for LOOCV?

CV Example

Exercise
How do we go about choosing k for k-fold CV? What are the effects of a smaller or larger value of k?

Right Choice for k in CV
LOOCV (k = n) is sometimes useful, but typically doesn't shake up the data enough. The estimates from each fold are highly correlated, and hence their average can have high variance.
‒ Recall that in LOOCV, n − 1 data points are used in training.
‒ The relatively large training set leads to low bias but higher variance.
For a medium value such as k = 5 or k = 10 (in general k < n), we are averaging the outputs of k fitted models that are somewhat less correlated with each other, since the overlap between the training sets of each model is smaller.
‒ Recall that in k-fold CV, (k − 1)n/k data points are used in training. So there will be some bias present in our estimate, as the training set is not as large as in LOOCV.
‒ However, the lower correlation between the fitted models leads to lower variance.
This makes k-fold CV (k = 5 or k = 10) a more balanced approach (see the sketch below).
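A minimal sketch of k-fold CV and LOOCV, again assuming scikit-learn with placeholder dataset and model choices not taken from the slides:

```python
# k-fold CV and LOOCV: a minimal sketch.
# Assumptions (not from the slides): scikit-learn, the built-in breast
# cancer dataset, and a depth-3 decision tree as the model.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# k-fold CV with k = 5: each fold trains on (k-1)n/k observations and
# is evaluated on the remaining n/k held-out observations.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=kfold)
print(f"5-fold CV error estimate: {1 - cv_scores.mean():.3f}")

# LOOCV is the special case k = n: each of the n folds trains on
# n - 1 observations and tests on the single remaining one.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOOCV error estimate: {1 - loo_scores.mean():.3f}")
```

Note the cost trade-off this makes visible: the 5-fold estimate requires 5 model fits, while LOOCV requires n fits (one per observation).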
Model Selection and Model Assessment
Model Selection: the process of using cross-validation metrics to choose a model out of many different options.
‒ We use cross-validation because it helps us estimate test performance (on unseen data) while using only the training data.
Model Assessment: the process of evaluating your best model on the test set. (An end-to-end sketch combining both steps follows at the end of these notes.)

Visual Explanation: Cross Validation
https://mlu-explain.github.io/cross-validation/

Reading
Data Science for Business, Ch. 5 (pp. 126–128)
Intro to Stat Learning, Ch. 5
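Putting the two steps together, a minimal end-to-end sketch: model selection by CV on the training data only, then model assessment of the chosen model on a held-out test set. The depth candidates echo the depth-3-vs-10 example from the CV slide; the library and dataset are illustrative assumptions:

```python
# Model selection via CV, then model assessment on a test set.
# Assumptions (not from the slides): scikit-learn, the built-in breast
# cancer dataset, and decision trees with candidate depths 3 and 10.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model selection: pick the depth with the best 5-fold CV score,
# using only the training data (the test set stays untouched).
best_depth, best_score = None, -1.0
for depth in [3, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_depth, best_score = depth, score

# Model assessment: refit the chosen model on all training data and
# evaluate it once on the held-out test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
print(f"Chosen depth: {best_depth}, test accuracy: "
      f"{final_model.score(X_test, y_test):.3f}")
```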