Questions and Answers
Which of the following is a primary advantage of using linear models in statistical analysis?
- They offer ease of interpretation and often demonstrate good predictive performance. (correct)
- They are computationally intensive, ensuring a thorough exploration of the feature space.
- They can effortlessly handle complex, non-additive relationships without modification.
- They always provide the highest prediction accuracy compared to non-linear models.
In the context of model selection, why might one prefer alternatives to ordinary least squares (OLS) fitting?
- To decrease model complexity regardless of the number of predictors.
- To simplify calculations, even at the cost of increasing variance.
- To improve prediction accuracy, especially when the number of predictors exceeds the number of observations, and to enhance model interpretability. (correct)
- To ensure the model includes all possible features, thereby eliminating bias.
Which of the following is NOT a class of methods for linear model selection?
- Subset Selection
- Feature Expansion (correct)
- Dimension Reduction
- Shrinkage
In best subset selection, how is the 'best' model typically determined for a given number of predictors?
In the context of subset selection, what role does the 'null model' play?
What statistical measure replaces the role of Residual Sum of Squares (RSS) for a broader class of models beyond least squares regression?
Why might best subset selection not be suitable for datasets with a very large number of predictors (p)?
What is a primary concern with best subset selection when dealing with a large number of predictors?
Which statement accurately describes forward stepwise selection?
In forward stepwise selection, how is the predictor added at each step determined?
What is a key computational advantage of forward stepwise selection over best subset selection?
Which statement correctly describes backward stepwise selection?
What condition is required for backward stepwise selection to be applicable?
Why are Residual Sum of Squares (RSS) and R-squared not suitable for choosing the best model among a collection of models with different numbers of predictors?
What are two general approaches to estimating test error in model selection?
How do AIC, BIC, and adjusted R-squared aid in model selection?
Which of the following is true regarding the interpretation of AIC, BIC, and adjusted R²?
How does the Bayesian Information Criterion (BIC) typically compare to Mallow's $C_p$ in model selection?
What is the 'one-standard-error' rule used for in the context of model selection?
What is the primary role of shrinkage methods in linear regression?
What is the purpose of the tuning parameter (λ) in ridge regression?
How does ridge regression address the issue of multicollinearity?
Why is it typically important to standardize predictors before applying ridge regression?
How does the Lasso differ from ridge regression in shrinking coefficient estimates?
What is the term used to describe models generated by the Lasso that involve only a subset of the variables?
In the context of the Lasso, what is the effect of increasing the tuning parameter λ?
What is a key difference between how Lasso and ridge regression handle multicollinearity?
Which of the following statements best describes Principal Components Regression (PCR)?
What is a primary advantage of using Partial Least Squares (PLS) over Principal Components Regression (PCR)?
In Partial Least Squares (PLS), by what mechanism are the created components related to the response variable?
What is the main goal of model selection methods?
Why is there significant research interest in creating sparsity, in methods similar to the Lasso?
Which method involves fitting a model with all predictors, but shrinks coefficient estimates toward zero?
Which method involves identifying a subset of predictors believed to be related to the response, then fitting a model using only this subset?
Which method projects predictors into a lower-dimensional subspace?
What is the main goal of variable selection in the context of model building?
In which situation would you expect the lasso to perform better?
Why is cross-validation used for selecting a tuning parameter?
A researcher wants to build the best linear model to predict housing prices in a city. In the dataset, many variables are strongly correlated. Should the researcher use Subset Selection, Ridge Regression, or The Lasso?
Flashcards
Advantages of Linear Models
The linear model is advantageous due to its distinct interpretability and good predictive performance.
Why Alternatives to Least Squares?
Alternatives to least squares are considered to improve prediction accuracy, especially when p > n, and to enhance model interpretability through feature selection.
Subset Selection
Subset selection identifies a subset of predictors related to the response and fits a model using least squares on the reduced set.
Shrinkage (Regularization)
Fit a model containing all p predictors, but shrink the estimated coefficients toward zero relative to the least squares estimates; this reduces variance and can also perform variable selection.
Dimension Reduction
Project the p predictors into an M-dimensional subspace (M < p) by forming M linear combinations of the variables, then fit a linear regression model on these M projections by least squares.
Best Subset Selection
For each k from 1 to p, fit all models containing exactly k predictors and keep the best (smallest RSS, largest R²) as Mk; then choose among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
Forward Stepwise Selection
Begin with the null model and add, one at a time, the predictor that gives the greatest additional improvement to the fit, until all predictors are in the model.
Backward Stepwise Selection
Begin with the full least squares model containing all p predictors and iteratively remove the least useful predictor, one at a time.
Why not use RSS and R²?
RSS and R² are related to training error, so the model containing all predictors always looks best; they cannot be used to compare models with different numbers of predictors.
Estimating Test Error
Either indirectly, by adjusting the training error for the bias due to overfitting, or directly, using a validation set or cross-validation.
Cp, AIC, BIC, Adjusted R²
Criteria that adjust the training error for model size, allowing comparison of models with different numbers of variables; small Cp, AIC, and BIC (or a large adjusted R²) indicate a low estimated test error.
Validation and Cross-Validation
Compute the validation set or cross-validation error for each model Mk and select the k with the smallest estimated test error; this gives a direct estimate of the test error.
Shrinkage methods
Ridge regression and the lasso: fit a model containing all p predictors while constraining (regularizing) the coefficient estimates toward zero.
Ridge Regression
Minimizes RSS plus an ℓ2 shrinkage penalty λ Σj βj²; the tuning parameter λ controls the amount of shrinkage.
Scaling Predictors
Ridge estimates can change substantially when a predictor is rescaled, so predictors should be standardized before applying ridge regression.
Why Ridge Regression?
Shrinking the coefficients trades a small increase in bias for a reduction in variance, which can lower test error.
Lasso Regression
Minimizes RSS plus an ℓ1 penalty λ Σj |βj|; for sufficiently large λ some coefficients become exactly zero, so the lasso performs variable selection and yields sparse models.
Selecting Tuning Parameter
Compute the cross-validation error over a grid of λ values, choose the λ with the smallest error, then re-fit the model on all observations using that λ.
Dimension Reduction Methods
Transform the predictors into M < p linear combinations Z1, ..., ZM and then fit a least squares model using the transformed variables.
How to calculate dimension reduction?
Zm = Σj φmjXj for constants φm1, ..., φmp; then fit yi = θ0 + Σm θmzim + εi by ordinary least squares.
Principal Components Regression (PCR)
Use principal components analysis to construct the linear combinations; a small number of components capturing the joint variation of correlated predictors replaces them in the regression.
PCR vs PLS
PCR identifies directions in an unsupervised way (without using Y), whereas PLS chooses directions in a supervised way, using the response to find features related to both the predictors and the response.
Study Notes
- Linear Model Selection and Regularization is the main topic
Linear Model
- Recall the linear model equation: Y = β0 + β1X1 + ... + βpXp + ε
- Subsequent lectures cover approaches that extend the linear model framework
- Linear models can adapt to non-linear relationships
- Chapter 8 covers even more general non-linear models
Praise of Linear Models
- Despite their simplicity, linear models have distinct advantages in interpretability
- Linear models often show good predictive performance
- Replacing ordinary least squares with alternative fitting procedures can improve the simple linear model
Alternatives to Least Squares
- Alternatives to least squares are important for prediction accuracy when p > n
- Alternatives to least squares are important to control the variance
- Model interpretability can be improved by removing irrelevant features and performing feature selection
- Setting corresponding coefficient estimates to zero improves model interpretation
Three Classes of Methods
- Consider three classes of methods: subset selection, shrinkage, and dimension reduction
- Subset Selection identifies a subset of p predictors believed to relate to the response
- Fit a model using least squares on the reduced set of variables
- Shrinkage involves fitting a model with all p predictors
- Estimated coefficients are shrunken toward zero relative to least squares estimates
- Also known as regularization, shrinkage reduces variance and performs variable selection
- Dimension Reduction projects p predictors into an M-dimensional subspace, where M < p
- M different linear combinations or projections of the variables are computed
- These M projections are predictors to fit a linear regression model by least squares
Subset Selection
- Subset selection includes best subset and stepwise model selection procedures
Best Subset Selection
- M0 denotes the null model containing no predictors
- Null model predicts the sample mean for each observation
- For k ranging from 1 to p:
- Fit all models that contain exactly k predictors
- Pick the best among these models and call it Mk
- Best is defined as having the smallest RSS, or equivalently largest R2
- A single best model is selected from among all models using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2
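As a minimal illustrative sketch (not part of the original notes), best subset selection for a small number of predictors could be implemented as follows, assuming a numeric predictor matrix X and response vector y (NumPy assumed):

```python
import itertools
import numpy as np

def best_subset_selection(X, y):
    """For each model size k, fit every k-predictor OLS model and keep the
    one with the smallest RSS (the model M_k). Feasible only for small p,
    since 2^p models are fit in total."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_subset = np.inf, None
        for subset in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, subset]])  # add an intercept
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_subset = rss, subset
        best[k] = (best_subset, best_rss)
    return best  # the final choice among M_1..M_p uses CV, Cp, BIC, or adjusted R²
```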
Credit Data Set Example
- The red frontier tracks the best model for a given number of predictors
- Best model decided according to RSS and R²
- The x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, which leads to the creation of two dummy variables
Extensions to Other Models
- Ideas from Best subset selection apply to other model types like logistic regression
- Deviance plays the role of Residual Sum of Squares, RSS, for a broader class of models
- Deviance represents negative two times the maximized log-likelihood
Stepwise Selection
- Best subset selection is often not applied with very large p due to computational reasons
- Best subset selection may also suffer from statistical problems when p is large
- Enormous search space can lead to overfitting and high variance of the coefficient estimates
- Stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection
Forward Stepwise Selection
- Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model
- At each step, the variable that gives the greatest additional improvement to the fit is added to the model
Forward Stepwise Selection Detailed
- M0 denotes the null model, which contains no predictors
- For k = 0, ..., p − 1: consider all p − k models that augment the predictors in Mk with one additional predictor
- Choose the best among these p − k models, and call it Mk+1; best is defined as having the smallest RSS or highest R²
- A single best model is selected from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R²
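A hedged sketch of the forward stepwise procedure, under the same assumptions as the best subset sketch above (illustrative names, NumPy assumed):

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor whose inclusion gives the
    largest drop in RSS. Returns the sequence of selected index sets
    M_1, ..., M_p; the final model is then chosen by CV, Cp, BIC, or adjusted R²."""
    n, p = X.shape

    def rss(cols):
        Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        return float(np.sum((y - Xk @ beta) ** 2))

    selected, path, remaining = [], [], list(range(p))
    while remaining:
        best_j = min(remaining, key=lambda j: rss(selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))
    return path
```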
More on Forward Stepwise Selection
- Clear computational advantage over best subset selection
- It is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors
Credit Data Example: Best Subset vs Forward Stepwise
- The first three selected models are the same for the two methods, but the fourth models differ
Backward Stepwise Selection
- Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection
- However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time
Backward Stepwise Selection: Details
- Mp denotes the full model, which contains all p predictors
- For k = p, p − 1, ..., 1: consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors
- Choose the best among these k models, and call it Mk−1; best is defined as having the smallest RSS or highest R²
- A single best model is selected from among all models using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2
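By analogy with the forward sketch above, an illustrative backward stepwise sketch (requires n > p so that the full model can be fit; NumPy assumed):

```python
import numpy as np

def backward_stepwise(X, y):
    """Start from the full model and repeatedly drop the predictor whose
    removal increases RSS the least. Returns M_p, M_{p-1}, ..., M_1."""
    n, p = X.shape

    def rss(cols):
        Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        return float(np.sum((y - Xk @ beta) ** 2))

    selected = list(range(p))
    path = [list(selected)]
    while len(selected) > 1:
        drop_j = min(selected, key=lambda j: rss([c for c in selected if c != j]))
        selected.remove(drop_j)
        path.append(list(selected))
    return path  # the final model is chosen by CV, Cp, BIC, or adjusted R²
```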
More on Backward Stepwise Selection
- The backward selection approach searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large to apply best subset selection
- Backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors
- Backward selection requires the number of samples, n, to be larger than the number of variables, p, so that the full model can be fit
- Forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large
Choosing the Optimal Model
- The model containing all of the predictors will always have the smallest RSS and the largest R2, since these quantities are related to the training error
- Choose a model with low test error, not a model with low training error
- Training error is usually a poor estimate of test error
- RSS and R2 are not suitable for selecting the best model among a collection of models with different numbers of predictors
Estimating Test Error: Two Approaches
- Test error can be indirectly estimated by making an adjustment to the training error to account for the bias due to overfitting
- Test error can be directly estimated using validation set or cross-validation approaches
Cp, AIC, BIC, and Adjusted R2
- These techniques adjust the training error for the model size
- Can be used to select among a set of models with different numbers of variables
Mallow's Cp
- Cp = (1/n)(RSS + 2dσ̂²)
- d is the total number of parameters
- σ̂² is an estimate of the variance of the error ε associated with each response measurement
AIC
- The AIC criterion is defined for a large class of models fit by maximum likelihood
- AIC = -2 log L + 2 * d, where L is the maximized value of the likelihood function for the estimated model
- In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and Cp and AIC are equivalent
BIC
- BIC = (1/n)(RSS + log(n)dσ̂²)
- Like Cp, a small BIC value indicates a model with a low estimated test error
- BIC replaces the 2dσ̂² used by Cp with a log(n)dσ̂² term, where n is the number of observations
- Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp
Adjusted R2
- Adjusted R² = 1 - (RSS/(n - d - 1))/(TSS/(n - 1))
- TSS is the total sum of squares
- Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R2 indicates a model with a small test error
- Maximizing the adjusted R² is equivalent to minimizing RSS/(n-d-1)
- While RSS always decreases as the number of variables in the model increases, RSS/(n-d-1) may increase or decrease, due to the presence of d in the denominator
- Unlike the R² statistic, the adjusted R² statistic pays a price for the inclusion of unnecessary variables in the model
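The adjustment formulas above can be computed directly; the following is a small sketch (function name and argument names are illustrative):

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Cp, BIC, and adjusted R² as defined in the notes.
    rss: residual sum of squares of the candidate model
    tss: total sum of squares of the response
    n: number of observations, d: number of predictors in the model
    sigma2_hat: estimate of the error variance (e.g. from the full model)."""
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2
```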
Validation and Cross-Validation
- Each of the procedures returns a sequence of models Mk indexed by model size k = 0, 1, 2, ....
- Compute the validation set error or the cross-validation error for each model Mk under consideration, then select the k for which the resulting estimated test error is smallest and return model Mk
- This provides a direct estimate of the test error
- It does not require an estimate of the error variance σ2
- This can be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom or estimate the error variance σ2
Previous Figure Details
- Validation errors were calculated by selecting three-quarters of the observations as the training set, and the remainder as the validation set
- Cross-validation errors were computed using k = 10 folds
- Validation and cross-validation methods both result in a six-variable model in this case
- All three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors
- Select a model using the one-standard-error rule
- First calculate the standard error of the estimated test MSE for each model size, then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve
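A minimal sketch of the one-standard-error rule, assuming a matrix of per-fold cross-validation errors indexed by model size (names are illustrative):

```python
import numpy as np

def one_standard_error_rule(cv_errors):
    """cv_errors: array of shape (n_model_sizes, n_folds), where row k holds the
    cross-validation error of the size-k model in each fold, and model sizes are
    ordered from smallest to largest. Returns the index of the smallest model
    whose mean CV error is within one standard error of the lowest mean CV error."""
    mean_err = cv_errors.mean(axis=1)
    se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = int(np.argmin(mean_err))
    threshold = mean_err[best] + se_err[best]
    return int(np.flatnonzero(mean_err <= threshold)[0])
```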
Shrinkage Methods
- The two best-known shrinkage methods are ridge regression and the lasso; the subset selection methods above use least squares to fit a linear model that contains a subset of the predictors
- As an alternative, fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or shrinks the coefficient estimates towards zero
Ridge Regression
- The least squares fitting procedure estimates β0, β1, ..., βp by minimizing RSS = Σi (yi − β0 − Σj βjxij)²
- The ridge regression coefficient estimates β̂R are the values that minimize RSS + λ Σj βj²
- λ > 0 is a tuning parameter determined separately
Ridge Regression: Continued
- As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small
- The second term, λ Σj βj², is a shrinkage penalty; it is small when β1, ..., βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero
- The tuning parameter λ controls the relative impact of these two terms on the regression coefficient estimates
- Selecting a good value of λ is critical; cross-validation is used for this
Ridge Regression: Scaling of Predictors
- Standard least squares coefficient estimates are scale equivariant
- Multiplying Xj by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c
- Regardless of how the jth predictor is scaled, Xjβ̂j will remain the same
- In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function
- Standardize the predictors before applying ridge regression, dividing each by its standard deviation so that all predictors are on the same scale
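A minimal sketch using scikit-learn (assumed available; the data are simulated, loosely mirroring the n = 50, p = 45 setting below), showing standardization followed by ridge regression:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 45))              # n = 50 observations, p = 45 predictors
y = X @ rng.normal(size=45) + rng.normal(size=50)

# Standardize the predictors, then fit ridge regression with a fixed tuning
# parameter (lambda is called alpha in scikit-learn).
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_[:5])  # shrunken coefficient estimates
```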
Bias-Variance Tradeoff
- Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients
- Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ and of ||β̂λR||2/||β̂||2
- Horizontal dashed lines indicate the minimum possible MSE
- Purple crosses indicate the ridge regression models for which the MSE is smallest
The Lasso
- Unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model
- The Lasso overcomes this disadvantage
- The lasso coefficient estimates β̂L minimize the quantity
- Σi (yi − β0 − Σj βjxij)² + λ Σj |βj| = RSS + λ Σj |βj|
- The lasso uses an ℓ1 (pronounced "ell 1") penalty instead of an ℓ2 penalty
- The ℓ1 norm of a coefficient vector β is given by ||β||1 = Σj |βj|
The Lasso: Continued
- Lasso shrinks the coefficient estimates towards zero
- The ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large
- The lasso therefore performs variable selection
- The lasso yields sparse models that involve only a subset of the variables
- Selecting a good value of λ is critical; cross-validation is again the method of choice
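An illustrative sketch (simulated data, scikit-learn assumed) showing how larger values of λ force more lasso coefficients to be exactly zero:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.5]            # response depends on only 3 predictors
y = X @ true_beta + rng.normal(size=100)

for lam in (0.01, 0.1, 1.0):                # lambda is called alpha in scikit-learn
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=lam))
    lasso.fit(X, y)
    n_nonzero = int(np.sum(lasso.named_steps["lasso"].coef_ != 0))
    print(f"lambda={lam}: {n_nonzero} nonzero coefficients")
```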
Variable Selection Property of the Lasso
- Unlike ridge regression, lasso results in coefficient estimates that are exactly equal to zero
- One can show that the lasso and ridge regression solve the constrained problems: minimize RSS subject to Σj |βj| ≤ s, and minimize RSS subject to Σj βj² ≤ s, respectively
Conclusions
- Neither ridge regression nor the lasso will universally dominate the other
- The lasso tends to perform better when the response is a function of only a relatively small number of predictors
- The number of predictors that is related to the response is never known a priori for real data sets
- Cross-validation can be used in order to determine which approach is better on a particular data set.
Tuning Parameter Selection for Ridge Regression/Lasso
- Need a method to determine which of the models under consideration is best
- Need a method for selecting a value of the tuning parameter λ or, correspondingly, of the constraint s
- Cross-validation provides a simple way to tackle this problem
- Compute the cross-validation error rate for each value of λ
- The tuning parameter value for which the cross-validation error is smallest is selected
- The model is then re-fit using all available observations and the selected value of the tuning parameter
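A hedged sketch of this tuning procedure using scikit-learn's LassoCV (data and settings are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=100)

# 10-fold cross-validation over a grid of lambda values; after selecting the
# lambda with the smallest CV error, the model is re-fit on all observations.
model = make_pipeline(StandardScaler(), LassoCV(cv=10, n_alphas=100))
model.fit(X, y)
print("selected lambda:", model.named_steps["lassocv"].alpha_)
```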
Dimension Reduction Methods
- Methods have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, X1, X2, ..., Xp
- Approaches that transform the predictors and then fit a least squares model using the transformed variables
- Will refer to these techniques as dimension reduction methods
Dimension Reduction Methods: Details
- Z1, Z2, ..., ZM represent M < p linear combinations of our original p predictors: Zm = Σj φmjXj, for some constants φm1, ..., φmp
- Then fit the linear regression model yi = θ0 + Σm θmzim + εi, i = 1, ..., n, using ordinary least squares
- In this model the regression coefficients are θ0, θ1, ..., θM; if the constants φm1, ..., φmp are chosen wisely, such dimension reduction approaches can often outperform OLS regression
- Since Σm θmzim = Σj (Σm θmφmj)xij, the model can be thought of as a special case of the original linear regression model
- Dimension reduction serves to constrain the estimated βj coefficients, since they must now take the form βj = Σm θmφmj
- This constraint can win in the bias-variance tradeoff
Principal Components Regression
- Apply principal components analysis (PCA) to define the linear combinations of the predictors, for use in regression
- The first principal component is that linear combination of the variables with the largest variance
- The second principal component has largest variance, subject to being uncorrelated with the first
- With correlated original variables, replace them with a small set of principal components that capture their joint variation
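A minimal PCR sketch with scikit-learn (assumed available); in practice the number of components M is chosen by cross-validation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)   # two strongly correlated predictors
y = X[:, 0] + rng.normal(size=100)

# PCR: standardize, replace the predictors with M principal components, then OLS.
for m in (1, 2, 5):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"M={m}: CV MSE={mse:.3f}")
```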
Partial Least Squares
- PCR identifies linear combinations, directions, that best represent the predictors X1, ..., Xp
- These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions
- The response does not supervise the identification of the principal components
- There is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response
Partial Least Squares: Continued
- Like PCR, PLS is a dimension reduction method: it first identifies a new set of features Z1, ..., ZM that are linear combinations of the original features, and then fits a linear model via OLS using these M new features
- Unlike PCR, PLS identifies these new features in a supervised way: it makes use of the response Y to identify new features that not only approximate the old features well, but are also related to the response
- The PLS approach attempts to find directions that help explain both the response and the predictors
Details of Partial Least Squares
- After standardizing the p predictors, PLS computes the first direction Z1 by setting each φ1j equal to the coefficient from the simple linear regression of Y onto Xj
- This coefficient is proportional to the correlation between Y and Xj
- Hence, in computing Z1, PLS places the highest weight on the variables that are most strongly related to the response
- Subsequent directions are found by taking residuals and then repeating the above prescription.
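A brief PLS sketch using scikit-learn's PLSRegression (data are simulated; M would normally be chosen by cross-validation):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(size=100)

# PLS with M = 2 supervised components; the directions make use of Y, unlike PCR.
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=2))
mse = -cross_val_score(pls, X, y, cv=5, scoring="neg_mean_squared_error").mean()
print(f"PLS (M=2) CV MSE: {mse:.3f}")
```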
Summary
- Model selection methods are an essential tool for data analysis, especially for big datasets involving many predictors
- Research into methods that give sparsity, such as the lasso, is an especially important area
- Will also return to sparsity in more detail, and will describe related approaches such as the elastic net