Model Selection and Regularization PDF

Summary

This document provides an introduction to model selection and regularization techniques in statistical learning. It covers different methods such as Subset Selection, Shrinkage, and Dimension Reduction, along with examples. The document emphasizes the importance of these techniques in data analysis, especially when dealing with large datasets.

Full Transcript


MODEL SELECTION AND REGULARIZATION (11/28/24)

Reading:
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
G. James, D. Witten, T. Hastie, R. Tibshirani, and J. Taylor. An Introduction to Statistical Learning with Applications in Python. Springer, July 2023.

Linear Model Selection and Regularization

Recall the linear model

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon.$

In the next lectures, we will consider some approaches for extending the linear model framework. First, we generalize the linear model in order to accommodate nonlinear, but still additive, relationships. Later we will consider even more general nonlinear models.

In praise of linear models!

Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance. Hence, we discuss in this lecture some ways in which the simple linear model can be improved by replacing ordinary least squares fitting with alternative fitting procedures.

Why consider alternatives to least squares?

Prediction accuracy: especially when p > n, to control the variance.
Model interpretability: by removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is easier to interpret. We will present some approaches for automatically performing feature selection.

Three classes of methods

Subset Selection. Identify a subset of the p predictors that we believe to be related to the response, then fit a model using least squares on the reduced set of variables.
Shrinkage. Fit a model involving all p predictors, but with the estimated coefficients shrunken towards zero relative to the least squares estimates. This shrinkage (regularization) reduces variance and can also perform variable selection.
Dimension Reduction. Project the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. The M projections are then used as predictors to fit a linear regression model by least squares.

Subset Selection: best subset selection

Example: Credit data set. Each model contains a subset of the 10 predictors in the Credit data set; RSS and R² are displayed, and the red frontier tracks the best model for a given number of predictors. The x-axis ranges from 1 to 11, because one of the variables is categorical and takes on three values, which requires the creation of two dummy variables.

Extensions to other models

Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression. The deviance, negative two times the maximized log-likelihood, plays the role of RSS for a broader class of models.

Stepwise Selection

For computational reasons, best subset selection cannot be applied with very large p. Why not? Best subset selection may also suffer from statistical problems when p is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they may have no predictive power on future data. An enormous search space can thus lead to overfitting and high variance of the coefficient estimates. For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
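The exhaustive search behind best subset selection is easy to spell out in code, and doing so makes the $2^p$ cost visible. The sketch below is my own illustration on synthetic data (not the Credit data set), using scikit-learn; it records, for each model size, the subset with the lowest training RSS, i.e. the red frontier described above.

```python
# Minimal sketch of best subset selection (feasible only for small p,
# since the number of candidate models grows as 2^p).
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 6                      # small p: 2^6 - 1 = 63 candidate subsets
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0])
y = X @ beta + rng.normal(size=n)

def rss(model, X_sub, y):
    """Residual sum of squares of a fitted linear model."""
    resid = y - model.predict(X_sub)
    return float(resid @ resid)

best_per_size = {}                  # k -> (rss, subset of column indices)
for k in range(1, p + 1):
    for subset in combinations(range(p), k):
        X_sub = X[:, subset]
        fit = LinearRegression().fit(X_sub, y)
        score = rss(fit, X_sub, y)
        if k not in best_per_size or score < best_per_size[k][0]:
            best_per_size[k] = (score, subset)

for k, (score, subset) in best_per_size.items():
    print(f"best model with {k} predictor(s): columns {subset}, RSS = {score:.1f}")
```

Choosing among the resulting models of different sizes is a separate step, taken up below with Cp, BIC, adjusted R², and cross-validation.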
Forward Stepwise Selection

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

The computational advantage over best subset selection is clear. However, forward stepwise selection is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the p predictors. Why not? Give an example.

Credit data example. The first four models selected by best subset selection and by forward stepwise selection on the Credit data set: the first three models are identical, but the fourth models differ.

Backward Stepwise Selection

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.

Like forward stepwise selection, the backward selection approach searches through only $1 + p(p+1)/2$ models, and so can be applied in settings where p is too large for best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors. Backward selection requires that the number of samples n be larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise selection can be used even when n < p, and so is the only viable subset method when p is very large.

Choosing the best model

The model containing all the predictors will always have the smallest RSS and the largest R², since these quantities are related to the training error. We wish to choose a model with low test error, not a model with low training error, and training error is usually a poor estimate of test error. Therefore, RSS and R² are not suitable for selecting the best model among a collection of models with different numbers of predictors.

Estimating test error: two approaches

We can indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting. Alternatively, we can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures. We illustrate both approaches next.

Cp, AIC, BIC, and Adjusted R²

These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables. The next figure displays Cp, BIC, and adjusted R² for the best model of each size produced by best subset selection on the Credit data set.

Credit data example (figure).

Definitions: Mallow's Cp and AIC

Mallow's Cp is defined as

$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat\sigma^2\right),$

where $d$ is the total number of parameters used and $\hat\sigma^2$ is an estimate of the variance of the error $\varepsilon$ associated with each response measurement.

The AIC criterion is defined for a large class of models fit by maximum likelihood:

$\mathrm{AIC} = -2\log L + 2d,$

where $L$ is the maximized value of the likelihood function for the estimated model. In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and $C_p$ and AIC are equivalent. Prove this.
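Before moving on to BIC and adjusted R², here is a minimal sketch of the greedy forward stepwise search described above, again on synthetic data of my own rather than the Credit data set.

```python
# Minimal sketch of forward stepwise selection: start from the null model and
# greedily add the predictor that most reduces the training RSS.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(size=n)

def rss(cols):
    """Training RSS of the least squares fit on the given columns."""
    if not cols:
        return float(((y - y.mean()) ** 2).sum())     # null model
    fit = LinearRegression().fit(X[:, cols], y)
    return float(((y - fit.predict(X[:, cols])) ** 2).sum())

selected = []                                          # forward stepwise path
remaining = list(range(p))
while remaining:
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    print(f"step {len(selected)}: added column {best}, RSS = {rss(selected):.1f}")
```

The path visits only $1 + p(p+1)/2$ models rather than $2^p$; choosing among the p + 1 models on the path still requires a criterion such as Cp, BIC, adjusted R², or cross-validation.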
Definition: BIC

$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\,\hat\sigma^2\right).$

Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Notice that BIC replaces the $2\,d\,\hat\sigma^2$ used by $C_p$ with a $\log(n)\,d\,\hat\sigma^2$ term, where $n$ is the number of observations. Since $\log n > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than $C_p$. See the Credit data figure above.

Definition: Adjusted R²

For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as

$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)},$

where TSS is the total sum of squares. Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error.

Maximizing the adjusted $R^2$ is equivalent to minimizing $\mathrm{RSS}/(n - d - 1)$. While RSS always decreases as the number of variables in the model increases, $\mathrm{RSS}/(n - d - 1)$ may increase or decrease, due to the presence of $d$ in the denominator. Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model. See the Credit data figure above.

Validation and Cross-validation

Each of the selection procedures returns a sequence of models $\mathcal{M}_k$ indexed by model size $k = 0, 1, 2, \ldots$. Our job here is to select $\hat{k}$; once selected, we return the model $\mathcal{M}_{\hat{k}}$. We compute the validation set error or the cross-validation error for each model $\mathcal{M}_k$ under consideration, and then select the $k$ for which the resulting estimated test error is smallest.

This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that it provides a direct estimate of the test error and does not require an estimate of the error variance $\sigma^2$. It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.

Credit data example (figure).

Details of the figure. The validation errors were calculated by randomly selecting three quarters of the observations as the training set and the remaining quarter as the validation set. The cross-validation errors were computed using k = 10 folds. In this case, the validation and cross-validation methods both result in a six-variable model. However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors. In this setting, we can select a model using the one-standard-error rule: we first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
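The following is a minimal sketch, not from the lecture, of how the adjustment criteria above can be computed for a family of fitted models. It assumes a hypothetical dictionary `models_by_size` mapping each model size d to the chosen column indices (for example, the `best_per_size` output of the best-subset sketch earlier), and it estimates σ² from the full least squares model, which is one common choice.

```python
# Minimal sketch: Cp, BIC, and adjusted R^2 for a sequence of models,
# following the formulas above. d is taken as the number of predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

def selection_criteria(X, y, models_by_size):
    """models_by_size: dict mapping size d -> tuple of column indices."""
    n, p = X.shape
    full_fit = LinearRegression().fit(X, y)
    resid_full = y - full_fit.predict(X)
    sigma2_hat = (resid_full @ resid_full) / (n - p - 1)   # error-variance estimate
    tss = float(((y - y.mean()) ** 2).sum())

    results = {}
    for d, cols in models_by_size.items():
        X_sub = X[:, cols]
        fit = LinearRegression().fit(X_sub, y)
        resid = y - fit.predict(X_sub)
        rss = float(resid @ resid)
        cp = (rss + 2 * d * sigma2_hat) / n
        bic = (rss + np.log(n) * d * sigma2_hat) / n
        adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
        results[d] = {"Cp": cp, "BIC": bic, "adj_R2": adj_r2}
    return results
```

A smaller Cp or BIC (or a larger adjusted R²) then points to the preferred size; cross-validation, as described above, estimates the test error directly instead and needs no σ² estimate.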
Shrinkage methods: ridge regression and the lasso

The subset selection methods use least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

Ridge regression

Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize

$\mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2.$

In contrast, the ridge regression coefficient estimates $\hat\beta^R_\lambda$ are the values that minimize

$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2,$

where $\lambda \ge 0$ is a tuning parameter, to be determined separately.

As with least squares, ridge regression seeks coefficient estimates that fit the data well by making the RSS small. However, the second term, $\lambda\sum_j \beta_j^2$, called a shrinkage penalty, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero. The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficient estimates. Selecting a good value for $\lambda$ is critical; cross-validation is used for this.

Credit data example (figure). In the left-hand panel, each curve corresponds to the ridge regression coefficient estimate for one of the 10 variables, plotted as a function of $\lambda$. The right-hand panel displays the same ridge coefficient estimates, but instead of displaying $\lambda$ on the x-axis, we display $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$, where $\hat\beta$ denotes the vector of least squares coefficient estimates. The notation $\|\beta\|_2$ denotes the $\ell_2$ norm of a vector, defined as $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2}$.

Ridge regression: scaling of predictors

The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j\hat\beta_j$ remains the same. In contrast, the ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant, due to the sum-of-squared-coefficients term in the penalty part of the ridge objective. Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}}.$

Ridge regression vs. least squares

Simulated data with n = 50 observations and p = 45 predictors, all having nonzero coefficients. The figure shows squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions, as a function of $\lambda$ and of $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$. The horizontal dashed lines indicate the minimum possible MSE, and the purple crosses indicate the ridge regression models for which the MSE is smallest.
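Because the ridge solution is not scale equivariant, a standard workflow is to standardize the predictors and then fit for several values of the tuning parameter. A minimal sketch with scikit-learn, on synthetic data of my own (scikit-learn calls the tuning parameter `alpha` rather than λ):

```python
# Minimal sketch: ridge regression on standardized predictors.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 50, 45                       # mimics the simulated setting: p close to n
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)           # all coefficients nonzero
y = X @ beta + rng.normal(size=n)

for lam in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(), Ridge(alpha=lam))
    model.fit(X, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"lambda={lam:>7}: ||beta_hat||_2 = {np.linalg.norm(coefs):.3f}")
```

As λ grows, the ℓ2 norm of the coefficient vector shrinks toward zero, which is the behaviour traced by the coefficient-path figure described above.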
The Lasso

Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models involving just a subset of the variables, ridge regression includes all p predictors in the final model. The least absolute shrinkage and selection operator (lasso) is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients $\hat\beta^L_\lambda$ minimize

$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|.$

In statistical parlance, the lasso uses an $\ell_1$ penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_1 = \sum_j |\beta_j|$.

As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection. We say that the lasso yields sparse models, that is, models that involve only a subset of the variables. As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice.

Example: Credit data set (figure).

Variable selection property of the lasso

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? One can show that the lasso and ridge regression coefficient estimates solve the problems

$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le s$

and

$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2 \le s,$

respectively.

Lasso: error and constraints (figure).

Comparing the lasso and ridge regression

Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on the simulated data set used above (n = 50, p = 45, all coefficients nonzero). Right: comparison of squared bias, variance, and test MSE between the lasso (solid) and ridge (dashed). Both are plotted against $R^2$ on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.

In a second comparison, the simulated data are similar, but now only two predictors are related to the response. Left: squared bias, variance, and test MSE of the lasso. Right: comparison of squared bias, variance, and test MSE between the lasso (solid) and ridge (dashed), again plotted against the training $R^2$. The crosses indicate the lasso model for which the MSE is smallest.

Conclusions

These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors. However, the number of predictors related to the response is never known a priori for real data sets. A technique such as cross-validation can be used to determine which approach is better on a particular data set.

Tuning ridge regression and the lasso

As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best, that is, a method for selecting a value of the tuning parameter $\lambda$ or, equivalently, of the constraint $s$. Cross-validation provides a simple way to tackle this problem: we choose a grid of $\lambda$ values and compute the cross-validation error for each value of $\lambda$. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all available observations and the selected value of the tuning parameter.

Credit data example. Left: cross-validation errors that result from applying ridge regression to the Credit data set with various values of $\lambda$. Right: the coefficient estimates as a function of $\lambda$. The vertical dashed lines indicate the value of $\lambda$ selected by cross-validation.

Simulated data example. Left: 10-fold cross-validation MSE for the lasso, applied to the sparse simulated data set described above. Right: the corresponding lasso coefficient estimates. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
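A minimal sketch of this tuning recipe (grid of λ values, cross-validation, then a final re-fit on all observations), using scikit-learn's LassoCV on synthetic data of my own with a sparse truth:

```python
# Minimal sketch: choosing lambda for the lasso by 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 50, 45
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [4.0, -3.0]              # sparse truth: only two predictors matter
y = X @ beta + rng.normal(size=n)

# LassoCV builds its own grid of lambda values (`alphas`), runs 10-fold CV,
# and then re-fits on the full data at the selected value.
model = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("selected lambda:", lasso.alpha_)
print("nonzero coefficients:", np.flatnonzero(lasso.coef_))
```

With a sparse truth like this, the selected model typically zeroes out most of the irrelevant coefficients, which is the variable-selection behaviour discussed above.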
Dimension reduction methods

The methods that we have discussed so far have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors $X_1, X_2, \ldots, X_p$. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We refer to these techniques as dimension reduction methods.

Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of the original $p$ predictors. That is,

$Z_m = \sum_{j=1}^{p}\phi_{mj}X_j$

for some constants $\phi_{m1}, \ldots, \phi_{mp}$, $m = 1, \ldots, M$. We can then fit the linear regression model

$y_i = \theta_0 + \sum_{m=1}^{M}\theta_m z_{im} + \varepsilon_i, \qquad i = 1, \ldots, n,$

using ordinary least squares. Here the regression coefficients are given by $\theta_0, \theta_1, \ldots, \theta_M$. If the constants $\phi_{m1}, \ldots, \phi_{mp}$ are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.

Notice that, from the definition of $Z_m$,

$\sum_{m=1}^{M}\theta_m z_{im} = \sum_{m=1}^{M}\theta_m\sum_{j=1}^{p}\phi_{mj}x_{ij} = \sum_{j=1}^{p}\beta_j x_{ij}, \qquad\text{where}\quad \beta_j = \sum_{m=1}^{M}\theta_m\phi_{mj}.$

Hence the model for $y_i$ can be seen as a special case of the original linear regression model. Dimension reduction serves to constrain the estimated $\beta_j$ coefficients, and this constrained form can have a better bias-variance tradeoff.

Principal Components Regression

Here we apply principal components analysis (PCA) to define the linear combinations of the predictors for use in the regression. The first principal component is the (normalized) linear combination of the variables with the largest variance. The second principal component has the largest variance subject to being uncorrelated with the first, and so on. Hence, with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.

PCA for the advertising data. The population size (pop) and advertising spending (ad) for 100 different cities are shown as purple circles; the green solid line indicates the first principal component and the blue dashed line the second principal component. In a second figure, the left panel shows the first principal component, chosen to minimize the sum of the squared perpendicular distances to each point (represented by the black dashed line segments), and the right panel shows the same plot rotated so that the first principal component lies on the x-axis. Plots of the first principal component scores $z_{i1}$ versus pop and ad show strong relationships, while plots of the second principal component scores $z_{i2}$ versus pop and ad show weak relationships.

Application to Principal Components Regression

PCR was applied to two simulated data sets. Black, green, and purple lines correspond to squared bias, variance, and test mean squared error, respectively. Left: the simulated data with all 45 coefficients nonzero. Right: the simulated data in which only two predictors are related to the response.

Choosing the number of directions M

Left: PCR standardized coefficient estimates on the Credit data set for different values of M. Right: the 10-fold cross-validation MSE obtained using PCR, as a function of M.
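A minimal PCR sketch, not from the lecture: standardize the predictors, compute principal components, regress the response on the first M score vectors, and choose M by 10-fold cross-validation. The pipeline and parameter names below are ordinary scikit-learn usage on synthetic data of my own.

```python
# Minimal sketch: principal components regression (PCR) with M chosen by CV.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.normal(size=n)

pcr = Pipeline([
    ("scale", StandardScaler()),      # PCA directions depend on predictor scale
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Cross-validate over the number of principal components M.
search = GridSearchCV(pcr, {"pca__n_components": range(1, 11)}, cv=10,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print("selected M:", search.best_params_["pca__n_components"])
```

The selected M is driven entirely by how well the components predict y through the downstream regression; the PCA step itself, as the text notes next, never looks at the response.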
Partial Least Squares

PCR identifies linear combinations, or directions, that best represent the predictors $X_1, X_2, \ldots, X_p$. These directions are identified in an unsupervised way, since the response $Y$ is not used to help determine the principal component directions; that is, the response does not supervise the identification of the principal components. Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

Like PCR, partial least squares (PLS) is a dimension reduction method: it first identifies a new set of features $Z_1, \ldots, Z_M$ that are linear combinations of the original features, and then fits a linear model via OLS using these M new features. But unlike PCR, PLS identifies these new features in a supervised way; that is, it makes use of the response $Y$ to identify new features that not only approximate the old features well, but also are related to the response. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

After standardizing the p predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{1j}$ equal to the coefficient from the simple linear regression of $Y$ onto $X_j$. One can show that this coefficient is proportional to the correlation between $Y$ and $X_j$. Hence, in computing $Z_1 = \sum_{j=1}^{p}\phi_{1j}X_j$, PLS places the highest weight on the variables that are most strongly related to the response. Subsequent directions are found by taking residuals and then repeating this prescription.

Summary

Model selection methods are an essential tool for data analysis, especially for big datasets involving many predictors. Research into methods that give sparsity, such as the lasso, is an especially hot area. Later, we will return to sparsity in more detail and describe related approaches such as the elastic net.
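As a closing illustration of the PLS construction described above, here is a minimal sketch on synthetic data of my own: the first direction's weights $\phi_{1j}$ are computed by hand from simple regressions of the response onto each standardized predictor, and scikit-learn's PLSRegression is shown alongside for reference (its first weight vector should point in the same direction, up to sign and scale).

```python
# Minimal sketch: the first PLS direction places weight on each predictor in
# proportion to its simple-regression coefficient with the response.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)       # standardize the predictors
yc = y - y.mean()

# phi_1j: coefficient from the simple linear regression of y onto X_j,
# which for standardized predictors is proportional to corr(y, X_j).
phi1 = np.array([Xs[:, j] @ yc / (Xs[:, j] @ Xs[:, j]) for j in range(p)])
print("hand-computed first direction (normalized):", phi1 / np.linalg.norm(phi1))

# For comparison, scikit-learn's first PLS weight vector (same direction,
# possibly with opposite sign).
pls = PLSRegression(n_components=2).fit(Xs, yc)
w1 = pls.x_weights_[:, 0]
print("PLSRegression first weight vector:        ", w1 / np.linalg.norm(w1))
```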
