Learning from Data Lecture 4
Dr Marcos Oliveira
University of Exeter

Summary
This document is a university lecture on Learning from Data, covering regression models, polynomial regression, and data splits. It includes syntax examples, graphs and definitions.
Full Transcript
Learning from Data Lecture 4
Dr Marcos Oliveira
Other regression models, comparing models, and data splits

Linear regression: The Syntax

Import the class containing the regression method:

    from sklearn.linear_model import LinearRegression

Create an instance of the class:

    LR = LinearRegression()

Fit the instance on the data and then predict the expected value:

    LR = LR.fit(X_train, Y_train)
    y_predict = LR.predict(X_test)

Reference: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, Christoph Molnar. https://christophm.github.io/interpretable-ml-book/limo.html

Recap from Lecture 3

We determined the slope of the regression line (0.914) and the Y-intercept of the regression line (1.16).

[Figure: three scatter plots of box office vs movie budget, showing the fitted regression line alongside two panels with "another pair of" candidate parameter values.]

Learning objectives

To describe different regression models.
To demonstrate understanding of polynomial regression.
To demonstrate understanding of data splits and cross validation.

Outline

Introduction to the different regression models: extending and enhancing the linear model; polynomial features.
Introduction to the polynomial regression model: the form of the polynomial regression; Bayes Information Criterion (BIC).
Introduction to data splits and cross validation: training and testing samples; k-fold cross validation and stratified sampling.

Enhancing the linear model

One way to enhance the linear model is to use polynomials in our linear regression.
We might be able to predict better: polynomials help to fit the curvature of the actual data.
We might be able to explain better: polynomials help to find variables that explain variations in the data better.

[Figure: three scatter plots of box office vs movie budget with fitted curves of increasing curvature.]

Polynomials

A polynomial has the general form y = β0 + β1x + β2x^2 + … + βdx^d. For a polynomial, you do not need to have every single x term: for example, y = β0 + β1x + β3x^3 is still a polynomial even though the x^2 term is absent. The order of the polynomial refers to the largest exponent in any of the terms.

Polynomial regression

1st order polynomial regression: y = β0 + β1x.
2nd order polynomial regression: y = β0 + β1x + β2x^2.

[Figure: two scatter plots of Y vs X, one with a 1st order fit and one with a 2nd order fit.]

We capture higher order features of the data by adding polynomial features. Note that the resulting algorithm is still linear regression, since the outcome is still a linear combination of features. A non-linear relationship between one feature and another does not make the algorithm non-linear; the model itself is still a linear combination of features.

Which model should we use? How complex should it be, and how do we measure that? We perform model order selection using the Bayes Information Criterion (BIC):

    BIC = n ln(SSE / n) + p ln(n)

where p is the number of parameters, n is the number of observations, ln is the natural logarithm, and SSE is the sum of squared errors.

[Figure: BIC (roughly 90 to 140) plotted against the number of parameters (2 to 10); the "best model" is the one that minimises the BIC.]
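To make the model order selection concrete, here is a minimal sketch (not the lecture's own code) that fits polynomial regressions of increasing order with scikit-learn and computes the BIC above for each; the synthetic data, seed and variable names are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    # Illustrative synthetic data: a curved relationship plus noise.
    rng = np.random.default_rng(42)
    X = rng.uniform(0, 30, size=(100, 1))
    y = 0.05 * X[:, 0] ** 2 + rng.normal(0, 5, size=100)

    n = len(y)
    for order in range(1, 7):
        # Expand x into [x, x^2, ..., x^order]; the model stays linear in these features.
        X_poly = PolynomialFeatures(degree=order, include_bias=False).fit_transform(X)
        model = LinearRegression().fit(X_poly, y)
        sse = float(np.sum((y - model.predict(X_poly)) ** 2))
        p = X_poly.shape[1] + 1  # coefficients plus the intercept
        bic = n * np.log(sse / n) + p * np.log(n)
        print(f"order {order}: BIC = {bic:.1f}")  # choose the order with the smallest BIC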
Extending the linear model

In addition to polynomial features, there are several additional variants of the standard models, many usable for both regression and classification:

Logistic regression (Week 4).
Support Vector Machines (Week 8).
Deep learning approaches such as MLP, CNN and LSTM (Weeks 4 and 5).

For each of these, we have the same idea of finding the balance between complexity and generalisability, between overfitting and underfitting the model. This is called the bias-variance trade-off (next week).

We can also include interaction terms, i.e. products of features such as x1·x2 (see the sketch below). How is the correct functional form chosen?

Check the relationship of each variable with the outcome.
Check previous works for potential proven interactions; we would have a preset idea of which variables to try.
Create all combinations of features.
Use feature selection techniques.
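As an illustration of interaction terms, this minimal sketch adds the cross-product feature with scikit-learn's PolynomialFeatures; the budget/run-time values and the outcome are hypothetical numbers for illustration only.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    # Hypothetical features: movie budget and run time (illustrative values only).
    X = np.array([[13.0, 146.0],
                  [20.0, 129.0],
                  [15.0, 108.0],
                  [76.0, 96.0],
                  [10.0, 91.0],
                  [16.0, 130.0]])
    y = np.array([42.0, 40.9, 36.0, 89.4, 27.9, 28.7])  # illustrative revenues

    # interaction_only=True adds the cross term x0*x1 but not the squares x0^2, x1^2.
    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_inter = poly.fit_transform(X)  # columns: x0, x1, x0*x1
    model = LinearRegression().fit(X_inter, y)
    print(poly.get_feature_names_out(), model.coef_)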
Data splits & test

Suppose we fit our model in such a way that it fits exactly to the data. The model might then perform poorly on unseen data: it will not generalise well. It is therefore important to have a holdout set to see how well the model performs on unseen data.

Date        Title                          Budget      Movie revenue  Run time
2013-11-22  The Hunger Games: Mocking Jay  13,000,000  42,000,000     146
2013-12-24  Iron Man 3                     20,000,000  40,900,000     129
2013-11-22  Frozen                         15,000,000  36,003,023     108
2013-04-09  Despicable Me 2                76,020,023  89,356,596     96
2013-10-04  Gravity                        10,000,000  27,888,000     91
2013-06-21  Fast & furious 6               16,000,000  28,686,355     130

We can split the data into a training set and a testing set (in the slide, the upper rows of the table above are marked as the training set and the remaining rows as the test set). First, we use the training set to learn the optimal parameters for the model. Then, we apply the learned model to the testing set to see how well the model performs on the test set. This aims to ensure that the model generalises well.

Data leakage can occur: knowledge of the test set leaks into the training set. See https://en.wikipedia.org/wiki/Leakage_(machine_learning)

Beyond a single test set

[Figure: two scatter plots of Y vs X, one for the training set and one for the test set.]

We will use the best model learned from the training data. We can then measure, via the lines showing the distance from each predicted value to the actual value, how well the model is doing on the unseen data.

Using training and test data: on the training set, fit the model; on the test set, measure performance by predicting labels with the model, comparing them with the actual values, and measuring the error.

Fitting training and test data: workflow

Training set: model.fit(X_train, Y_train) produces the fitted model.
Test set: Y_predict = model.predict(X_test).
Compare with the actuals: error_metric(Y_test, Y_predict) gives the test error.

Train test split: Python syntax

Import the train and test split function:

    from sklearn.model_selection import train_test_split

Split the data and put 40% into the test set:

    train, test = train_test_split(data, test_size=0.4)

If we have X = features and y = outcome variables, we split these into four different sets:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
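Putting the split, fit, predict and error steps together, here is a minimal runnable sketch; the synthetic data, seed and the choice of mean squared error as the error metric are illustrative assumptions rather than the lecture's own code.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Illustrative data: a noisy linear relationship.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 30, size=(100, 1))
    y = 2.0 * X[:, 0] + rng.normal(0, 4, size=100)

    # Hold out 40% of the data as the test set.
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.4)

    model = LinearRegression().fit(X_train, Y_train)  # fit on the training set only
    Y_predict = model.predict(X_test)                 # predict on the unseen test set
    print("test MSE:", mean_squared_error(Y_test, Y_predict))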
Beyond a single test set

Instead of using a single training and test set, we use cross validation to calculate the error across multiple training and test sets. With cross validation, we split the data into multiple pairs of training and test sets and average the error across the test-set errors. The resulting performance measure is statistically more reliable.

How do we get multiple training and test sets from one historical dataset? From the table above we can, for example, form training set 1 and test set 1, and then a second split whose test set 2 has no overlap with test set 1.

First, evaluate the error for each test set, then calculate the average of the errors across all these test sets for a specific model; this average is the cross validation result.

The reason for doing this is that, if we compare multiple models using only one training/test split, one model may perform well simply because of that particular split. By performing cross validation over multiple training and test splits, the model that performs best overall is more likely to perform well on a test set that you have not seen.

Cross validation approach

k-fold cross validation: a single parameter called k is the number of groups the data sample is to be split into. Example: k = 10 gives 10-fold cross validation.

Process:
1. Shuffle the dataset.
2. Split the dataset into k groups.
3. For each unique group:
   a) Use the group as a test set.
   b) Use the remaining groups as a training set.
   c) Fit the model on the training set, then evaluate the model on the test set.
   d) Store the evaluation score, then discard the model.
4. Summarise model performance using the model evaluation scores.

k-fold cross validation: Python syntax

Import the cross validation function:

    from sklearn.model_selection import cross_val_score

Perform cross validation with a given model:

    cross_val = cross_val_score(model, X_data, y_data, cv=10, scoring="neg_mean_squared_error")

scikit-learn's convention is to maximise the scoring value; to minimise the MSE we therefore maximise the negative MSE, which is equivalent.

Balancing data sets

Random sampling? Suppose the population is 90% group A and 10% group B. How do we ensure that this proportion will be preserved in a random sample?

Stratified sampling

Stratified sampling is a sampling technique where the samples are selected in the same proportion as they appear in the population. For example, if the population of interest has 60% CEOs and 40% CTOs, then we divide the population into two groups and choose 60% from the CEO group and 40% from the CTO group.

Implementing stratified sampling in cross validation ensures that the training and test sets have the same proportion of the feature of interest as the original dataset. By doing this with the target feature, we ensure that the cross validation result is a close approximation of the generalisation error.

Let's generate a synthetic classification dataset with 500 records, 3 features and 3 classes. (Example from: https://towardsdatascience.com/what-is-stratified-cross-validation-in-machine-learning-8844f3e7ae8e)

Hold out cross validation without stratified sampling: using the train_test_split method, which returns the training and test sets, we implement hold out cross validation. Because we haven't used stratified sampling here, the proportion of the target variable varies between the original dataset, the training set and the test set.

Hold out cross validation with stratified sampling: we implement hold out cross validation with stratified sampling so that the training and test sets have the same proportion of the target variable. We use the stratify argument of the train_test_split method and set it to the target variable (the feature of interest).

k-fold cross validation without stratified sampling: if we implement 3-fold cross validation without stratified sampling, the proportion of the target variable is inconsistent among the original dataset, training and test data across splits.

k-fold cross validation with stratified sampling: we implement k-fold cross validation with stratified sampling by using the StratifiedKFold class of scikit-learn (see the sketch below).
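The code for these steps is not in the transcript, so here is a minimal sketch of both stratified variants; the make_classification call, split sizes, seeds and the proportions() helper are illustrative assumptions, not the lecture's own code.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, train_test_split

    # Synthetic dataset as described above: 500 records, 3 features, 3 classes.
    X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                               n_redundant=0, n_classes=3, random_state=42)

    def proportions(labels):
        # Share of each class; used to check that a split preserves the balance.
        return np.round(np.bincount(labels) / len(labels), 3)

    # Hold out cross validation with stratified sampling on the target variable.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
    print("train:", proportions(y_train), "test:", proportions(y_test))

    # 3-fold cross validation with stratified sampling.
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        print(f"fold {fold}:", proportions(y[train_idx]), proportions(y[test_idx]))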
The proportion of the target variable is then consistent among the original dataset, training and test data across splits.

Lessons learned

We listed different regression models and described how to select the "best" polynomial regression model.
We described the process for splitting data into training and test sets and fitting on the training and test data.
We described the k-fold cross validation approach and showed how to implement k-fold cross validation with stratified sampling.