
Statistical Learning and Data Mining
Lecture 6: Practical Methodology
Discipline of Business Analytics

A framework for machine learning projects

Think iteratively!
1. Business understanding.
2. Data collection and pre-processing.
3. Exploratory data analysis.
4. Feature engineering.
5. Machine learning.
6. Evaluation.
7. Deployment and monitoring.

Learning objectives

- Model selection.
- Hyperparameter optimisation.
- Model stacking.
- Model assessment.

Outline

1. Model selection
2. Hyperparameter optimisation
3. Model stacking
4. Model assessment
5. Developing successful machine learning projects

Model selection

Given the training data, each learned model results from the combination of:
- Learning algorithm.
- Hyperparameter values.
- Features.
- Random numbers.

Model selection methods estimate the generalisation performance of a model from training data. We use the estimates to:
- Guide experimentation and iteration.
- Select hyperparameters.
- Select features.
- Combine predictions from different models.
- Select a final model for prediction.

Example: predicting house prices

In the house price prediction application, we still need to select:
- The features for linear regression.
- The number of neighbours, distance function, and features for kNN.
- The final model for prediction.

There are three approaches to model selection:
- Validation set.
- Cross-validation.
- Analytical criteria.

Validation set

In the validation set approach, we randomly split the training data into a training set and a validation set. We estimate the models on the training set and compute predictions for the validation set. We select the model with the best metrics on the validation set.

▲ Simple and convenient.
▼ There may not be enough validation cases to reliably estimate performance.
▼ The metrics can have high variability over random splits.
▼ Biased estimation of performance since we fit the models with less than the full training data.

Example: polynomial regression

[Figure from ISL: validation set mean squared error against the degree of the polynomial.]
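To make the validation set approach concrete, here is a minimal sketch for the polynomial regression example, assuming scikit-learn; the simulated data, the split proportion, and the candidate degrees are illustrative choices, not taken from the lecture.

```python
# Validation set approach: choose the polynomial degree with the lowest
# validation MSE. A sketch assuming numpy and scikit-learn; the data below
# simply stands in for a real training sample.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                     # illustrative inputs
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)     # illustrative target

# Randomly split the training data into a training part and a validation part.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

val_mse = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_mse[degree] = mean_squared_error(y_val, model.predict(X_val))

best_degree = min(val_mse, key=val_mse.get)   # model with the best validation metric
```

Because the comparison depends on a single random split, it can vary noticeably across splits, which motivates the cross-validation methods discussed next.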
Cross-validation

Cross-validation methods are based on multiple training-validation splits. Cross-validation methods predict every data point at least once, which uses the data more efficiently than the validation set approach.

K-fold cross-validation

Figure by ethen8181 on Github.

1. Randomly split the training sample into K folds of roughly equal size.
2. For each k ∈ {1, ..., K}, train the model using all folds other than k combined, and use fold k as the validation set.
3. The cross-validation metric is the average metric across the K validation sets.

Types of cross-validation

- 5-fold and 10-fold CV. The most common choices are K = 5 or K = 10.
- Leave-one-out CV (LOOCV). If we set K = n, this is called leave-one-out cross-validation. We use all other observations to predict each observation i.
- Repeated K-fold CV. We repeat the K-fold CV algorithm with multiple random splits, which decreases variance. This is especially helpful when the dataset is not large.

Number of folds

- The higher the K, the higher the computational cost.
- The higher the K, the lower the bias for estimating performance (because we want to estimate the performance of the model when trained with all n examples).
- Increasing K may increase variance (because the training sets become more similar to each other).

K-fold cross-validation

▲ More accurate than the validation set approach.
▼ The estimate depends on the random split (but we can reduce this source of variability with repeated K-fold).
▼ Biased estimation of performance since we fit the models with less than the full training data.

Leave-one-out cross-validation

▲ Approximately unbiased.
▲ No random splitting.
▲ Tends to have low variance with stable estimators such as linear regression.
▲ There are formulas available for special cases such as linear regression.
▲ Statistically efficient for model selection (n → ∞).
▼ Very high computational cost when there is no formula.
▼ High variance in some cases.

Example: polynomial regression

[Figure from ISL: LOOCV (left panel) and 10-fold CV (right panel), mean squared error against the degree of the polynomial.]

Analytical criteria

Analytical criteria have the form:

criterion = training error + penalty for the number of parameters

Akaike information criterion

The Akaike information criterion (AIC) applies to models estimated by maximum likelihood:

AIC = -2ℓ(θ̂) + 2d,

where ℓ(θ̂) is the maximised log-likelihood and d is the number of estimated parameters. We select the model with the lowest AIC. The AIC is equivalent to LOOCV when n → ∞, under theoretical assumptions.

Bayesian information criterion

The Bayesian information criterion (BIC) also applies to models estimated by maximum likelihood:

BIC = -2ℓ(θ̂) + log(n) d.

The BIC is more conservative than the AIC.

Analytical criteria

▲ No computational cost, we just need to compute a formula.
▼ Relies on theoretical assumptions.
▼ Less applicable than the validation set approach and cross-validation.

Hyperparameter optimisation

Hyperparameter optimisation methods optimise a model selection criterion as a function of the hyperparameters. The most common approaches are:
- Hand-tuning.
- Grid search.
- Random search.
- Bayesian optimisation.
- Multi-fidelity methods.
- Metaheuristic methods.

Grid search

In the grid search approach, we specify a list of values for each hyperparameter and evaluate every possible configuration. This is only computationally feasible if the number of hyperparameters and possible values is not too high.
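To illustrate grid search combined with K-fold cross-validation, here is a minimal sketch for a kNN regressor, in the spirit of the kNN example that follows. It assumes scikit-learn; the dataset, grid values, and scoring choice are illustrative rather than taken from the lecture.

```python
# Grid search over kNN hyperparameters, each configuration evaluated by
# 5-fold cross-validation. A sketch assuming scikit-learn; the synthetic
# dataset and grid values are illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),              # kNN is sensitive to feature scales
    ("knn", KNeighborsRegressor()),
])

param_grid = {
    "knn__n_neighbors": [1, 5, 10, 25, 50],       # number of neighbours
    "knn__metric": ["euclidean", "manhattan"],    # distance function
}

search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_, -search.best_score_)  # best configuration and its CV MSE
```

scikit-learn's RandomizedSearchCV offers the same interface for the random search approach described next, sampling configurations from specified distributions instead of enumerating a grid.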
Example: k-Nearest Neighbours

Random search

In a random search, we specify a statistical distribution for each hyperparameter and randomly sample configurations to evaluate until the procedure exhausts the computational budget.

Figure by Sydney Firmin.

▲ More efficient than a grid search.
▼ Wastes computation on configurations that are unlikely to perform well based on past trials.
▼ Not guaranteed to find good hyperparameter values within the time allowed by the computational budget.

Bayesian optimisation

Bayesian optimisation (BO) methods perform model-based optimisation. At each iteration, the algorithm selects a promising hyperparameter configuration to evaluate based on previous trials.

Multi-fidelity optimisation

Multi-fidelity methods attempt to increase efficiency by combining full evaluations with trials based on subsets of the data or model. HyperBand is a popular multi-fidelity optimisation method that balances the number of hyperparameter configurations and the allocated computational budget for each trial. Bayesian Optimisation HyperBand (BOHB) is a state-of-the-art method that combines Bayesian optimisation and HyperBand.

Metaheuristic methods

Metaheuristic refers to a large class of algorithms designed to find near-optimal solutions to difficult optimisation problems. Evolutionary optimisation methods, which are inspired by the theory of natural selection, are commonly used for hyperparameter optimisation.

Model stacking

Choosing the final model for prediction

We typically explore multiple learning algorithms, feature engineering strategies, and hyperparameters until we obtain one or more models that seem to perform well. Ultimately, we need to choose a final model for prediction, that is, a candidate model for deployment in a business production system.

Ensemble learning

In ensemble learning, we combine predictions from multiple learning algorithms as our final model. Ensemble learning, rather than selecting a single best model, usually achieves the best generalisation performance.

Model averaging

A simple method is to compute a weighted average

f_ave(x) = Σ_{m=1}^{M} w_m f_m(x),

where f_1(x), ..., f_M(x) are predictions from M different models and w_1, ..., w_M are the model weights.

How can we choose the weights? One option is to simply pick the model weights, say w_m = 1/M for a simple average of the models:

f_ave(x) = (1/M) Σ_{m=1}^{M} f_m(x).

This approach has the advantage of not adding variability through the choice of the weights, but can lead to sub-optimal predictions.

Another approach is to select the weights by optimisation, for example

ŵ = argmin_{w_1, ..., w_M} (1/n) Σ_{i=1}^{n} L(y_i, Σ_{m=1}^{M} w_m f_m(x_i)),

where we often impose the restriction that the weights are non-negative and sum to one. This method does not work well if based on the training set. In practice, it places too much weight on the most complex models.

A better approach is to select the weights using a validation set or cross-validation. In the validation set approach, we obtain the model weights as

ŵ = argmin_{w_1, ..., w_M} (1/n_val) Σ_{i=1}^{n_val} L(y_i^val, Σ_{m=1}^{M} w_m f̂_m(x_i^val)),

where f̂_m(x_i^val) are the validation set predictions from model m.
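Here is a minimal sketch of selecting the model weights on a validation set by minimising the validation MSE under the constraints that the weights are non-negative and sum to one. It assumes numpy and scipy; the helper fit_average_weights and the simulated validation predictions are hypothetical names and data introduced for illustration.

```python
# Choose model-averaging weights on a validation set: minimise validation MSE
# subject to w_m >= 0 and sum(w_m) = 1. A sketch using numpy and scipy;
# val_predictions holds each fitted model's predictions on the validation set.
import numpy as np
from scipy.optimize import minimize

def fit_average_weights(val_predictions, y_val):
    """val_predictions: array of shape (n_val, M), one column per model."""
    n_val, M = val_predictions.shape

    def val_mse(w):
        return np.mean((y_val - val_predictions @ w) ** 2)

    w0 = np.full(M, 1.0 / M)                     # start from a simple average
    result = minimize(
        val_mse, w0, method="SLSQP",
        bounds=[(0.0, 1.0)] * M,                 # non-negative weights
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # sum to one
    )
    return result.x

# Illustrative example with predictions from M = 3 hypothetical models:
rng = np.random.default_rng(0)
y_val = rng.normal(size=100)
val_predictions = np.column_stack(
    [y_val + rng.normal(scale=s, size=100) for s in (0.5, 1.0, 2.0)]
)
weights = fit_average_weights(val_predictions, y_val)
```

Cross-validated predictions can be stacked into the same matrix and used in place of a single validation set.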
Which models to average?

Ideally, you should combine models that are as accurate and diverse as possible. Think of it in analogy to portfolio optimisation, where we obtain better gains from diversification by combining assets that are less correlated.

Model averaging

▲ Better generalisation performance than selecting the best individual model (in most cases).
▲ It is useful to interpret the model weights.
▼ Risk of overfitting the validation set.
▼ Higher computational cost than using individual models.
▼ The predictions are harder to interpret.

Model stacking

In model stacking, we go beyond model averaging by specifying a meta-model that takes predictions from different models as inputs.

Source: https://www.kdnuggets.com/2017/02/stacking-models-imropved-predictions.html

The model averaging procedure that we have just described is a special case of model stacking where the meta-model is a linear model. More generally, the meta-model can be any learning algorithm. We fit the meta-model using the validation set or cross-validation.

▲ Highest potential for maximising performance.
▼ The higher the complexity of the meta-model, the higher the risk of overfitting the validation set.
▼ Higher computational cost than using individual models.
▼ The predictions are harder to interpret.

Model assessment

Model assessment is the process of evaluating the performance of your final model to ensure that it meets the requirements of the project. Because model selection and experimentation can overfit the available data, we need to introduce another level of reserved data for the specific purpose of assessment: the test set.

The most important concept in this unit is that the fundamental goal of supervised learning is to generalise to future data. The biggest source of failure in machine learning projects is to overestimate how well the models will perform on new data. The best way to avoid such failures is to hold out a test set that is never seen or used in any way except to evaluate the final model instance.

Training, validation and test split

- The training set is for training models.
- The validation set is for model selection.
- The test set is for model assessment.

Model selection vs. model assessment

Model selection: uses the validation set; involves experimentation, hyperparameter optimisation, feature selection, and model stacking; is iterative; gives a biased estimate of performance; serves a statistical objective.

Model assessment: uses the test set; involves no optimisation of any kind; is one-shot; gives an unbiased estimate of performance; serves business goals.

Data drift

The fundamental assumption in machine learning is that the data used to train and evaluate the model is representative of the future data that we want the model to generalise to. Deploying a machine learning model in a production system is subject to data drift. Therefore, machine learning teams need to continuously monitor the performance of their systems.

Developing successful machine learning projects

Some principles to follow:
1. It's generalisation that counts.
2. More data beats cleverer algorithms.
3. Data understanding.
4. Develop an end-to-end system as quickly as possible.
5. Experimentation.
6. Rapid iteration.
7. Meaningful baselines.
8. Learn multiple models.
9. Robust evaluation.
10. Don't just optimise your metrics, eliminate ways your model can fail.
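To close, here is a compact sketch of the overall workflow described in this lecture: hold out a test set for model assessment, run model selection by cross-validation on the remaining data (including a small stacked ensemble), and evaluate the chosen model exactly once on the test set. It assumes scikit-learn; the dataset and candidate models are illustrative, not taken from the lecture.

```python
# End-to-end sketch: train/test split, cross-validated model selection with a
# stacked ensemble among the candidates, then a single (one-shot) assessment
# on the held-out test set. Assumes scikit-learn; data and models are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=0)

# The test set is reserved for model assessment only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate models, including a stacked ensemble whose linear meta-model is
# fitted on cross-validated predictions from the base models.
candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=10),
    "stack": StackingRegressor(
        estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
                    ("knn", KNeighborsRegressor(n_neighbors=10))],
        final_estimator=Ridge(),
        cv=5,
    ),
}

# Model selection: compare cross-validation MSE on the training data only.
cv_mse = {name: -cross_val_score(model, X_train, y_train, cv=5,
                                 scoring="neg_mean_squared_error").mean()
          for name, model in candidates.items()}
best_name = min(cv_mse, key=cv_mse.get)

# Model assessment: refit the chosen model on all training data, then evaluate once.
final_model = candidates[best_name].fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
```

The test MSE is computed once, after all selection decisions have been made, mirroring the one-shot nature of model assessment described above.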
