CH 3

Questions and Answers

Explain the problem with RSS and R² as model selection measures.

They are merely goodness-of-fit measures of a linear model to the training data, with no explicit regard to its prediction performance. As we add more predictors to a linear model, the RSS will always decrease and R² will always increase. So when we choose between linear models with different numbers of predictors, they will always choose the most complex model.

Explain why AIC and BIC are considered indirect measures of prediction performance.

They adjust a goodness-of-fit measure on the training set such as training RSS or training log likelihood, to account for model complexity and prevent overfitting.

Explain the rationale behind and the difference between the AIC and BIC.

Both the AIC and BIC are composed of two quantities: goodness of fit to the training data, measured by -2l, and model complexity. Ideally, we would like both of these quantities to be small, but they are typically in conflict with each other. AIC and BIC bring these two quantities together, with the second acting as a penalty term that penalizes an overfitted model. The difference between AIC and BIC lies in their penalty terms: BIC is more stringent than AIC and tends to result in a simpler final model when used as the model selection criterion.

Describe the properties of residuals of a linear model if the model assumptions hold.

No special patterns: The residuals should cluster around zero in a random fashion, both on their own and when plotted against fitted values. Homoscedasticity: The residuals should possess approximately the same variance. Normality: The residuals should be approximately normally distributed.

Explain the advantages and disadvantages of polynomial regression.

Pros: Polynomial regression can capture substantially more complex relationships between the target and the predictors than linear regression. Cons: The regression coefficients in a polynomial model are much more difficult to interpret, and there is no simple rule for choosing the value of m, the order of the polynomial, which is a hyperparameter.

Explain how categorical predictors are handled by a linear model.

They are handled through binarization. Binarization turns a given categorical predictor into a collection of artificial “binary” variables (also called dummy or indicator variables), each of which serves as an indicator of one and only one level of the categorical predictor. These dummy variables then serve as predictors in the model equation.

Explain the meaning of interaction.

An interaction arises if the association between one predictor and the target depends on the value (or level) of another predictor.

Explain the problems with collinear variables in a linear model.

The presence of collinear variables means that some of the variables do not bring much additional information, because their values can be largely deduced from the values of other variables, leading to redundancy. (Note: An exact linear relationship among some of the features, i.e. exact collinearity, leads to a rank-deficient model. In practice, R can detect exact collinearity; the real issue arises when features are very strongly, but not perfectly, related.)

Explain how best subset selection works and its limitations.

This involves fitting a separate linear model for each possible combination of available features and selecting the 'best subset' of features to form the best model. It is computationally difficult to implement because for p features you need to fit 2^p models.

Explain how stepwise selection works and how it addresses the limitations of best subset selection.

Stepwise selection does not search through all possible combinations of features, but instead determines the best model from a carefully restricted list of candidate models by sequentially adding or dropping features one at a time. With stepwise selection, the maximum number of models that need to be fitted for p features is 1 + p(p+1)/2, far fewer than with best subset selection.

Explain why it is not a good idea to add or drop multiple features at a time when doing stepwise selection.

It is not a good idea to add or drop multiple features at a time because the significance of a feature can be strongly affected by the presence or absence of other features due to their correlations. For example, a feature can be significant on its own, but become highly insignificant in the presence of another feature.

Explain two differences between forward and backward stepwise selections.

Forward stepwise selection starts with an intercept-only model, while backward stepwise selection starts with the full model. Forward selection is more likely to produce a simpler and more interpretable model because its starting model is simpler.

Explain how regularization works.

Regularization generates coefficient estimates by minimizing a modified objective function. Using the RSS as the starting point, regularization adds a penalty term that reflects the complexity of the model and minimizes the augmented objective function. λ is the regularization parameter that controls the extent of regularization, and f_R(β) is the penalty function that captures the size of the regression coefficients. The product of the regularization parameter and the penalty function defines the regularization penalty.

Describe an important modification to the variables before fitting a regularized regression model.

It is important to standardize the predictors by dividing them by their standard deviation. This puts all predictors and their coefficients on the same scale.

Explain how the regularization parameter λ affects a regularized model.

When λ = 0, the regularization penalty vanishes and the coefficient estimates are identical to the OLS estimates. As λ increases, the effect of regularization becomes more severe. There is increasing pressure for the coefficient estimates to be closer and closer to zero, and the flexibility of the model drops.

Explain why λ and α are hyperparameters of a regularized model and how they are typically selected.

They are hyperparameters because they are pre-specified inputs that go into the model fitting process and are not determined as part of the optimization process. They are typically selected as follows:
1. Set up a fine grid of values of (λ, α) in advance.
2. Compute the CV error for each pair of candidate values of (λ, α).
3. Choose the pair that produces the lowest CV error.

Explain the two differences between stepwise selection and regularization.

1. How the model is chosen: The model selected by stepwise selection will have certain predictors dropped entirely, while regularization in general shrinks the coefficient estimates toward zero.
2. The parameter indexing model complexity: Stepwise selection uses the number of features as a direct measure of model complexity, while regularization uses λ (lambda) as an indirect measure of model complexity.

Explain the one-standard-error rule for selecting the value of the regularization parameter of an elastic net.

The idea is that regularized regression models whose CV error is within “one standard error” of the minimum CV error have more or less the same predictive performance, so we might as well opt for the simplest and most interpretable model among them.

Flashcards

RSS and R² problems as model selection

RSS and R² measure goodness-of-fit to training data, ignoring prediction performance; adding predictors always decreases RSS and increases R², favoring complex models.

AIC and BIC as indirect measures

AIC and BIC adjust goodness-of-fit (e.g., training RSS) to account for model complexity and prevent overfitting by adding a penalty term.

AIC and BIC rationale and difference

Both balance goodness-of-fit and complexity but differ in their penalty term; BIC is more stringent, resulting in simpler models.

Properties of residuals in a linear model

Residuals should cluster randomly around zero, have constant variance (homoscedasticity), and be normally distributed.

Advantages/disadvantages of polynomial regression

Polynomial regression captures complex relationships but coefficients are hard to interpret, and choosing the polynomial order (m) is difficult.

Handling categorical predictors in linear models

Categorical predictors are converted into binary (dummy) variables, each indicating one level of the category, and used as predictors in the model.

Meaning of interaction

An interaction occurs when the association between one predictor and the target variable depends on the value of another predictor.

Problems with collinear variables

Collinearity means variables provide redundant information, as their values can be predicted from others, potentially leading to a rank-deficient model.

Best subset selection

It fits a linear model for every possible feature combination and selects the 'best subset', but is computationally difficult with many features (2^p models).

How stepwise selection addresses limitations

It sequentially adds/drops features, building from a restricted list of candidate models, making it computationally cheaper than best subset selection.

Why not add/drop multiple features at once?

Significance of a feature can change depending on the presence of correlated features.

Differences between forward/backward stepwise

Forward starts with an intercept-only model, backward starts with the full model; forward tends to yield simpler models.

How regularization works

Regularization adds a penalty term to the objective function (e.g., RSS) to penalize model complexity, controlled by the regularization parameter λ.

Modification before regularized regression

Standardize predictors by dividing by their standard deviation to put them and their coefficients on the same scale before regularization.

How λ affects regularized model

As λ increases, the regularization penalty becomes stronger, pushing coefficient estimates closer to zero and reducing model flexibility.

Why λ and α are hyperparameters

They are pre-specified inputs not determined during optimization; selected by testing values (λ, α) in advance and picking the pair with the lowest CV error.

Differences: stepwise versus regularization

Stepwise drops predictors entirely, while regularization shrinks coefficients towards zero; stepwise uses feature count, regularization uses λ.

One-standard-error rule

Choose the simplest model with CV error within one standard error of the minimum error; aim for simplicity and interpretability.

Study Notes

Problems with RSS and R² as Model Selection Measures

  • RSS and R² are goodness-of-fit measures of a linear model to the training data
  • These measures don't explicitly consider prediction performance
  • Adding more predictors to a linear model always decreases RSS and increases R²
  • When choosing between linear models with different numbers of predictors, these measures therefore always favor the most complex model
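
A minimal R sketch of this point using simulated data (the variable names and values are illustrative, not from the lesson): even a pure-noise predictor lowers the training RSS and raises R².

    # R^2 never decreases and the training RSS never increases when a
    # predictor is added, even if the added predictor is pure noise.
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- rnorm(n)                 # unrelated to y
    y  <- 1 + 2 * x1 + rnorm(n)
    fit1 <- lm(y ~ x1)
    fit2 <- lm(y ~ x1 + x2)
    summary(fit1)$r.squared        # baseline R^2
    summary(fit2)$r.squared        # never smaller, despite x2 being noise
    sum(resid(fit1)^2)             # training RSS
    sum(resid(fit2)^2)             # never larger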

AIC and BIC as Indirect Measures of Prediction Performance

  • AIC and BIC adjust a goodness-of-fit measure on the training set, like training RSS or log likelihood
  • They account for model complexity and prevent overfitting

Rationale and Difference Between AIC and BIC

  • Both AIC and BIC consist of goodness of fit to training data (-2l) and complexity.
  • Ideally both quantities would be small, but they are typically in conflict with each other
  • AIC and BIC combine these quantities, using the second quantity to penalize an overfitted model
  • AIC and BIC differ in their penalty term: BIC is more stringent than AIC
  • BIC tends to yield a simpler final model when used as the model selection criterion
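
In formula terms, AIC = -2l + 2p and BIC = -2l + p·log(n), so the per-parameter BIC penalty exceeds the AIC penalty once n > 7. A short R sketch follows; the built-in mtcars data and the chosen predictors are illustrative only.

    # AIC() and BIC() compute these criteria for a fitted lm object.
    small <- lm(mpg ~ wt, data = mtcars)
    big   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
    AIC(small); AIC(big)
    BIC(small); BIC(big)           # BIC penalizes the larger model more heavily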

Properties of Residuals in a Linear Model

  • Residuals should cluster around zero randomly, both on their own and when plotted against fitted values
  • Homoscedasticity: Residuals should possess approximately the same variance
  • Normality: Residuals should be approximately normally distributed
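
A quick way to check these properties in R is the default diagnostic plots of a fitted lm object (mtcars is used purely as illustrative data).

    fit <- lm(mpg ~ wt + hp, data = mtcars)
    par(mfrow = c(2, 2))
    plot(fit)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
    # Residuals vs fitted reveals patterns and heteroscedasticity;
    # the normal Q-Q plot checks approximate normality of the residuals.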

Advantages and Disadvantages of Polynomial Regression

  • Polynomial regression handles more complex relationships between the target and predictors than linear regression
  • Polynomial model regression coefficients are more difficult to interpret
  • There isn't a simple rule for choosing the value of m (the order of the polynomial term), which is a hyperparameter
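
A sketch of fitting polynomial terms in R with poly(); the data (mtcars) and the order m = 3 are arbitrary choices for illustration.

    # m = 3 is a hyperparameter choice; there is no simple rule for picking it.
    fit_poly <- lm(mpg ~ poly(hp, degree = 3), data = mtcars)
    summary(fit_poly)
    # raw = TRUE uses ordinary powers hp, hp^2, hp^3 instead of orthogonal
    # polynomials; the fitted values are identical either way.
    fit_raw <- lm(mpg ~ poly(hp, degree = 3, raw = TRUE), data = mtcars)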

Handling Categorical Predictors in a Linear Model

  • Categorical predictors are handled via binarization.
  • Binarization converts a categorical predictor into a collection of artificial "binary" variables (dummy or indicator variables)
  • Each binary variable indicates one level of the categorical predictor
  • These dummy variables serve as predictors in the model equation
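
model.matrix() makes the binarization visible; the toy factor below is invented for illustration.

    # A factor with k levels becomes k - 1 dummy variables; one level is
    # absorbed into the intercept as the baseline.
    df <- data.frame(colour = factor(c("red", "green", "blue", "green")))
    model.matrix(~ colour, data = df)
    # Columns: (Intercept), colourgreen, colourred -- "blue" (the first level
    # alphabetically) is the baseline.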

Meaning of Interaction

  • Interaction occurs when the association between one predictor and the target depends on the value (or level) of another predictor
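
In an R model formula, an interaction is specified with : or *; the sketch below (on mtcars, chosen only for illustration) lets the association between wt and mpg depend on hp.

    # x1:x2 adds only the interaction term; x1*x2 expands to x1 + x2 + x1:x2.
    fit_int <- lm(mpg ~ wt * hp, data = mtcars)
    summary(fit_int)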

Problems with Collinear Variables in a Linear Model

  • Collinear variables may not provide much additional information because their values can be deduced from other variables' values
  • This leads to redundancy
  • Exact collinearity results in a rank-deficient model, which R can detect
  • Real collinearity issues arise when features are strongly related but not perfectly collinear
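
A small simulated sketch of exact collinearity (all names and values are illustrative):

    # x3 is an exact linear combination of x1 and x2, so the model matrix is
    # rank deficient and lm() reports an NA (aliased) coefficient for x3.
    set.seed(2)
    x1 <- rnorm(50); x2 <- rnorm(50)
    x3 <- x1 + x2                      # exactly collinear
    y  <- 1 + x1 - x2 + rnorm(50)
    coef(lm(y ~ x1 + x2 + x3))         # coefficient for x3 is NA
    # Near-collinearity is subtler: coefficients remain estimable but become
    # unstable, with inflated standard errors.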

How Best Subset Selection Works and Its Limitations

  • Best subset selection fits a separate linear model for each possible combination of available features
  • It selects the "best subset" of features to form the best model
  • For p features, fitting 2^p models makes it computationally difficult to implement
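
One common implementation is regsubsets() from the leaps package; this is an assumed tool choice, not something prescribed by the lesson. With the 10 predictors in mtcars there are already 2^10 = 1024 candidate models.

    # install.packages("leaps") if needed
    library(leaps)
    best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
    summary(best)$bic                  # BIC of the best model of each size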

Stepwise Selection as an Alternative to Best Subset Selection

  • Stepwise selection determines the best model from a restricted list of candidate models by sequentially adding/dropping features
  • For p features, the maximum number of models to fit is 1 + p(p+1)/2, far fewer than the 2^p required by best subset selection

Drawbacks of Adding/Dropping Multiple Features in Stepwise Selection

  • Adding/dropping multiple features can significantly affect a feature's significance because of correlations with others
  • A feature that is significant on its own may become insignificant in the presence of another feature

Differences Between Forward and Backward Stepwise Selections

  • Forward stepwise selection starts with an intercept-only model, while backward stepwise selection starts with the full model
  • Forward selection is more likely to result in a simpler and more interpretable model because it starts simpler
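
A sketch of both directions with base R's step(), which selects by AIC by default (pass k = log(n) to select with BIC instead); mtcars is illustrative data.

    null_fit <- lm(mpg ~ 1, data = mtcars)            # intercept-only start
    full_fit <- lm(mpg ~ ., data = mtcars)            # full-model start
    fwd <- step(null_fit,
                scope = list(lower = ~ 1, upper = formula(full_fit)),
                direction = "forward")                # forward stepwise
    bwd <- step(full_fit, direction = "backward")     # backward stepwise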

Regularization

  • Regularization generates coefficient estimates by minimizing a modified objective function
  • Using RSS as the starting point, regularization adds a penalty term reflecting model complexity
  • Regularization minimizes the augmented objective function
  • λ (lambda) is the regularization parameter that controls regularization extent
  • f_R(β) is the penalty function that captures the size of the regression coefficients
  • The product of the regularization parameter and the penalty function is the regularization penalty
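
A sketch using the glmnet package (an assumed tool choice); in glmnet's parameterization the penalized objective is RSS/(2n) + λ[(1 − α)/2 · Σβ² + α · Σ|β|], so α = 0 gives ridge regression and α = 1 gives the lasso.

    # install.packages("glmnet") if needed
    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # drop the intercept column
    y <- mtcars$mpg
    fit_ridge <- glmnet(X, y, alpha = 0)              # ridge penalty
    fit_lasso <- glmnet(X, y, alpha = 1)              # lasso penalty
    # glmnet standardizes the predictors internally (standardize = TRUE by
    # default) and reports coefficients back on the original scale.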

Modification Before Fitting a Regularized Regression Model

  • Standardize predictors by dividing them by their standard deviation
  • This puts all predictors and coefficients on the same scale

How the Regularization Parameter Affects a Regularized Model

  • When λ = 0, the regularization penalty vanishes and coefficient estimates are identical to OLS estimates
  • As λ increases, regularization becomes more severe
  • Increased pressure forces coefficient estimates closer to zero, reducing model flexibility
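
Continuing the glmnet sketch (same illustrative setup), the coefficient path shows the shrinkage as λ grows.

    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
    y <- mtcars$mpg
    fit_lasso <- glmnet(X, y, alpha = 1)
    coef(fit_lasso, s = 0.01)    # small lambda: estimates roughly match OLS
    coef(fit_lasso, s = 1)       # larger lambda: heavy shrinkage, some exact zeros
    plot(fit_lasso, xvar = "lambda", label = TRUE)    # full coefficient path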

Hyperparameters of a Regularized Model

  • λ and α are hyperparameters since they are pre-specified inputs not determined during optimization
    • A fine grid of (λ, α) values is set up in advance
    • CV error is computed for each pair of candidate values of (λ, α)
    • The pair producing the lowest CV error is chosen
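
A sketch of the grid search with cv.glmnet() (assumed tooling; the α grid and seed are arbitrary). cv.glmnet() builds its own λ grid, so only α needs an explicit loop.

    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
    y <- mtcars$mpg
    alphas <- seq(0, 1, by = 0.25)
    set.seed(3)
    cv_fits  <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a, nfolds = 10))
    best_err <- sapply(cv_fits, function(cv) min(cv$cvm))   # best CV error per alpha
    best <- which.min(best_err)
    alphas[best]                        # chosen alpha
    cv_fits[[best]]$lambda.min          # chosen lambda (lowest CV error)
    # For a strictly fair comparison across alpha values, pass a common
    # foldid vector to every cv.glmnet() call.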

Differences Between Stepwise Selection and Regularization

  • Model selection: Stepwise selection drops certain predictors entirely, while regularization pushes coefficient estimates towards zero
  • Parameter indexing: Stepwise selection uses the number of features as a measure of model complexity, while regularization uses λ as an indirect measure

The One-Standard-Error Rule

  • Regularized regression models whose CV error is within “one standard error” of the minimum CV error have more or less the same predictive performance
  • Selecting the simplest and most interpretable model among these is preferred.
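
cv.glmnet() reports this directly: lambda.1se is the largest (most heavily regularizing) λ whose CV error is within one standard error of the minimum (same illustrative setup as above).

    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
    y <- mtcars$mpg
    set.seed(4)
    cv <- cv.glmnet(X, y, alpha = 1)
    cv$lambda.min                  # lambda with the lowest CV error
    cv$lambda.1se                  # simplest model within one SE of that minimum
    coef(cv, s = "lambda.1se")     # coefficients under the one-standard-error rule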
