CH 3

Questions and Answers

Explain the problem with RSS and R² as model selection measures.

They are merely goodness-of-fit measures of a linear model to the training data, with no explicit regard to its prediction performance. As we add more predictors to a linear model, the RSS will always decrease and R² will always increase. So when we choose between linear models with different numbers of predictors, they will always choose the most complex model.

Explain why AIC and BIC are considered indirect measures of prediction performance.

They adjust a goodness-of-fit measure on the training set such as training RSS or training log likelihood, to account for model complexity and prevent overfitting.

Explain the rationale behind and the difference between the AIC and BIC.

Both the AIC and BIC are composed of two quantities: goodness of fit to the training data, measured by -2l, and model complexity. Ideally, we would like both of these quantities to be small, but they are typically in conflict with each other. AIC and BIC bring these two quantities together, with the second acting as a penalty term that penalizes an overfitted model. The difference between AIC and BIC lies in their penalty terms: BIC is more stringent than AIC and tends to result in a simpler final model when used as the model selection criterion.

Describe the properties of residuals of a linear model if the model assumptions hold.

No special patterns: The residuals should cluster around zero in a random fashion, both on their own and when plotted against fitted values. Homoscedasticity: The residuals should possess approximately the same variance. Normality: The residuals should be approximately normally distributed.

Explain the advantages and disadvantages of polynomial regression.

Pros: Polynomial regression can capture substantially more complex relationships between the target and the predictors than linear regression. Cons: The regression coefficients in a polynomial model are much more difficult to interpret, and there is no simple rule for choosing the value of m, the order of the polynomial, which is a hyperparameter.

Explain how categorical predictors are handled by a linear model.

They are handled through binarization. Binarization turns a given categorical predictor into a collection of artificial “binary” variables (also called dummy or indicator variables), each of which serves as an indicator of one and only one level of the categorical predictor. These dummy variables then serve as predictors in the model equation.

Explain the meaning of interaction.

An interaction arises if the association between one predictor and the target depends on the value (or level) of another predictor.

Explain the problems with collinear variables in a linear model.

The presence of collinear variables means that some of the variables do not bring much additional information, because their values can be largely deduced from the values of other variables, leading to redundancy. (Note: An exact linear relationship among some of the features, i.e. exact collinearity, leads to a rank-deficient model. In practice, R can detect exact collinearity; the real issue arises when features are very strongly, but not perfectly, related.)

Explain how best subset selection works and its limitations.

This involves fitting a separate linear model for each possible combination of available features and selecting the 'best subset' of features to form the best model. It is computationally difficult to implement because for p features you need to fit 2^p models.

Explain how stepwise selection works and how it addresses the limitations of best subset selection.

Stepwise selection does not search through all possible combinations of features, but instead determines the best model from a carefully restricted list of candidate models by sequentially adding or dropping features one at a time. With stepwise selection, the maximum number of models that need to be fitted for p features is 1 + p(p+1)/2, far fewer than with best subset selection.

Explain why it is not a good idea to add or drop multiple features at a time when doing stepwise selection.

It is not a good idea to add or drop multiple features at a time because the significance of a feature can be strongly affected by the presence or absence of other features due to their correlations. For example, a feature can be significant on its own, but become highly insignificant in the presence of another feature.

Explain two differences between forward and backward stepwise selections.

Forward stepwise selection starts with an intercept-only model, while backward stepwise selection starts with the full model. Forward selection is more likely to produce a simpler and more interpretable model because its starting model is simpler.

Explain how regularization works.

Regularization generates coefficient estimates by minimizing a modified objective function. Using the RSS as the starting point, regularization adds a penalty term that reflects the complexity of the model and minimizes the augmented objective function. λ is the regularization parameter that controls the extent of regularization, and f_R(β) is the penalty function that captures the size of the regression coefficients. The product of the regularization parameter and the penalty function defines the regularization penalty.

Describe an important modification to the variables before fitting a regularized regression model.

It is important to standardize the predictors by dividing them by their standard deviation. This puts all predictors and their coefficients on the same scale.

Explain how the regularization parameter λ affects a regularized model.

When λ = 0, the regularization penalty vanishes and the coefficient estimates are identical to the OLS estimates. As λ increases, the effect of regularization becomes more severe. There is increasing pressure for the coefficient estimates to be closer and closer to zero, and the flexibility of the model drops.

Explain why λ and α are hyperparameters of a regularized model and how they are typically selected.

They are hyperparameters because they are pre-specified inputs that go into the model fitting process and are not determined as part of the optimization process. They are typically selected as follows:
1. Set up a fine grid of values of (λ, α) in advance.
2. Compute the CV error for each pair of candidate values of (λ, α).
3. Choose the pair that produces the lowest CV error.

Explain the two differences between stepwise selection and regularization.

1. How the model is chosen: The model selected by stepwise selection will have certain predictors dropped entirely, while regularization in general shrinks the coefficient estimates toward zero.
2. The parameter indexing model complexity: Stepwise selection uses the number of features as a direct measure of model complexity, while regularization uses λ (lambda) as an indirect measure of model complexity.

Explain the one-standard-error rule for selecting the value of the regularization parameter of an elastic net.

The idea is that regularized regression models whose CV error is within “one standard error” of the minimum CV error have more or less the same predictive performance, so we might as well opt for the simplest and most interpretable model among them.

Flashcards

RSS and R² problems as model selection

RSS and R² measure goodness-of-fit to training data, ignoring prediction performance; adding predictors always decreases RSS and increases R², favoring complex models.

AIC and BIC as indirect measures

AIC and BIC adjust goodness-of-fit (e.g., training RSS) to account for model complexity and prevent overfitting by adding a penalty term.

AIC and BIC rationale and difference

Both balance goodness-of-fit and complexity but differ in their penalty term; BIC is more stringent, resulting in simpler models.

Properties of residuals in a linear model

Residuals should cluster randomly around zero, have constant variance (homoscedasticity), and be normally distributed.

Advantages/disadvantages of polynomial regression

Polynomial regression captures complex relationships but coefficients are hard to interpret, and choosing the polynomial order (m) is difficult.

Handling categorical predictors in linear models

Categorical predictors are converted into binary (dummy) variables, each indicating one level of the category, and used as predictors in the model.

Meaning of interaction

An interaction occurs when the association between one predictor and the target variable depends on the value of another predictor.

Problems with collinear variables

Collinearity means variables provide redundant information, as their values can be predicted from others, potentially leading to a rank-deficient model.

Best subset selection

It fits a linear model for every possible feature combination and selects the 'best subset', but is computationally difficult with many features (2^p models).

How stepwise selection addresses limitations

It sequentially adds/drops features, building from a restricted list of candidate models, making it computationally cheaper than best subset selection.

Why not add/drop multiple features at once?

Significance of a feature can change depending on the presence of correlated features.

Differences between forward/backward stepwise

Forward starts with an intercept-only model, backward starts with the full model; forward tends to yield simpler models.

How regularization works

Regularization adds a penalty term to the objective function (e.g., RSS) to penalize model complexity, controlled by the regularization parameter λ.

Modification before regularized regression

Standardize predictors by dividing by their standard deviation to put them and their coefficients on the same scale before regularization.

How λ affects regularized model

As λ increases, the regularization penalty becomes stronger, pushing coefficient estimates closer to zero and reducing model flexibility.

Why λ and α are hyperparameters

They are pre-specified inputs not determined during optimization; selected by testing values (λ, α) in advance and picking the pair with the lowest CV error.

Differences: stepwise versus regularization

Stepwise drops predictors entirely, while regularization shrinks coefficients towards zero; stepwise uses feature count, regularization uses λ.

One-standard-error rule

Choose the simplest model with CV error within one standard error of the minimum error; aim for simplicity and interpretability.

Study Notes

Problems with RSS and R² as Model Selection Measures

  • RSS and R² are goodness-of-fit measures of a linear model to the training data
  • These measures don't explicitly consider prediction performance
  • Adding more predictors to a linear model always decreases RSS and increases R²
  • When choosing between linear models with different numbers of predictors, these measures therefore always favor the most complex model
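
A minimal R sketch of this point using simulated data (the variable names and values are illustrative, not from the lesson): even a pure-noise predictor lowers the training RSS and raises R².

    # R^2 never decreases and the training RSS never increases when a
    # predictor is added, even if the added predictor is pure noise.
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- rnorm(n)                 # unrelated to y
    y  <- 1 + 2 * x1 + rnorm(n)
    fit1 <- lm(y ~ x1)
    fit2 <- lm(y ~ x1 + x2)
    summary(fit1)$r.squared        # baseline R^2
    summary(fit2)$r.squared        # never smaller, despite x2 being noise
    sum(resid(fit1)^2)             # training RSS
    sum(resid(fit2)^2)             # never larger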

AIC and BIC as Indirect Measures of Prediction Performance

  • AIC and BIC adjust a goodness-of-fit measure on the training set, like training RSS or log likelihood
  • They account for model complexity and prevent overfitting

Rationale and Difference Between AIC and BIC

  • Both AIC and BIC consist of goodness of fit to training data (-2l) and complexity.
  • Ideally both quantities would be small, but they are typically in conflict with each other
  • AIC and BIC combine these quantities, using the second quantity to penalize an overfitted model
  • AIC and BIC differ in their penalty term: BIC is more stringent than AIC
  • BIC tends to yield a simpler final model when used as the model selection criterion
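
In formula terms, AIC = -2l + 2p and BIC = -2l + p·log(n), so the per-parameter BIC penalty exceeds the AIC penalty once n > 7. A short R sketch follows; the built-in mtcars data and the chosen predictors are illustrative only.

    # AIC() and BIC() compute these criteria for a fitted lm object.
    small <- lm(mpg ~ wt, data = mtcars)
    big   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
    AIC(small); AIC(big)
    BIC(small); BIC(big)           # BIC penalizes the larger model more heavily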

Properties of Residuals in a Linear Model

  • Residuals should cluster around zero randomly, both on their own and when plotted against fitted values
  • Homoscedasticity: Residuals should possess approximately the same variance
  • Normality: Residuals should be approximately normally distributed
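
A quick way to check these properties in R is the default diagnostic plots of a fitted lm object (mtcars is used purely as illustrative data).

    fit <- lm(mpg ~ wt + hp, data = mtcars)
    par(mfrow = c(2, 2))
    plot(fit)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
    # Residuals vs fitted reveals patterns and heteroscedasticity;
    # the normal Q-Q plot checks approximate normality of the residuals.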

Advantages and Disadvantages of Polynomial Regression

  • Polynomial regression handles more complex relationships between the target and predictors than linear regression
  • Polynomial model regression coefficients are more difficult to interpret
  • There isn't a simple rule for choosing the value of m (the order of the polynomial term), which is a hyperparameter
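
A sketch of fitting polynomial terms in R with poly(); the data (mtcars) and the order m = 3 are arbitrary choices for illustration.

    # m = 3 is a hyperparameter choice; there is no simple rule for picking it.
    fit_poly <- lm(mpg ~ poly(hp, degree = 3), data = mtcars)
    summary(fit_poly)
    # raw = TRUE uses ordinary powers hp, hp^2, hp^3 instead of orthogonal
    # polynomials; the fitted values are identical either way.
    fit_raw <- lm(mpg ~ poly(hp, degree = 3, raw = TRUE), data = mtcars)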

Handling Categorical Predictors in a Linear Model

  • Categorical predictors are handled via binarization.
  • Binarization converts a categorical predictor into a collection of artificial "binary" variables (dummy or indicator variables)
  • Each binary variable indicates one level of the categorical predictor
  • These dummy variables serve as predictors in the model equation
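
model.matrix() makes the binarization visible; the toy factor below is invented for illustration.

    # A factor with k levels becomes k - 1 dummy variables; one level is
    # absorbed into the intercept as the baseline.
    df <- data.frame(colour = factor(c("red", "green", "blue", "green")))
    model.matrix(~ colour, data = df)
    # Columns: (Intercept), colourgreen, colourred -- "blue" (the first level
    # alphabetically) is the baseline.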

Meaning of Interaction

  • Interaction occurs when the association between one predictor and the target depends on the value (or level) of another predictor
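
In an R model formula, an interaction is specified with : or *; the sketch below (on mtcars, chosen only for illustration) lets the association between wt and mpg depend on hp.

    # x1:x2 adds only the interaction term; x1*x2 expands to x1 + x2 + x1:x2.
    fit_int <- lm(mpg ~ wt * hp, data = mtcars)
    summary(fit_int)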

Problems with Collinear Variables in a Linear Model

  • Collinear variables may not provide much additional information because their values can be deduced from other variables' values
  • This leads to redundancy
  • Exact collinearity results in a rank-deficient model, which R can detect
  • Real collinearity issues arise when features are strongly related but not perfectly collinear
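
A small simulated sketch of exact collinearity (all names and values are illustrative):

    # x3 is an exact linear combination of x1 and x2, so the model matrix is
    # rank deficient and lm() reports an NA (aliased) coefficient for x3.
    set.seed(2)
    x1 <- rnorm(50); x2 <- rnorm(50)
    x3 <- x1 + x2                      # exactly collinear
    y  <- 1 + x1 - x2 + rnorm(50)
    coef(lm(y ~ x1 + x2 + x3))         # coefficient for x3 is NA
    # Near-collinearity is subtler: coefficients remain estimable but become
    # unstable, with inflated standard errors.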

How Best Subset Selection Works and Its Limitations

  • Best subset selection fits a separate linear model for each possible combination of available features
  • It selects the "best subset" of features to form the best model
  • For p features, fitting 2^p models makes it computationally difficult to implement
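
One common implementation is regsubsets() from the leaps package; this is an assumed tool choice, not something prescribed by the lesson. With the 10 predictors in mtcars there are already 2^10 = 1024 candidate models.

    # install.packages("leaps") if needed
    library(leaps)
    best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
    summary(best)$bic                  # BIC of the best model of each size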

Stepwise Selection as an Alternative to Best Subset Selection

  • Stepwise selection determines the best model from a restricted list of candidate models by sequentially adding/dropping features
  • For p features, the maximum number of models to fit is 1 + p(p+1)/2, far fewer than the 2^p required by best subset selection

Drawbacks of Adding/Dropping Multiple Features in Stepwise Selection

  • Adding/dropping multiple features can significantly affect a feature's significance because of correlations with others
  • A feature that is significant on its own may become insignificant in the presence of another feature

Differences Between Forward and Backward Stepwise Selections

  • Forward stepwise selection starts with an intercept-only model, while backward stepwise selection starts with the full model
  • Forward selection is more likely to result in a simpler and more interpretable model because it starts simpler
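
A sketch of both directions with base R's step(), which selects by AIC by default (pass k = log(n) to select with BIC instead); mtcars is illustrative data.

    null_fit <- lm(mpg ~ 1, data = mtcars)            # intercept-only start
    full_fit <- lm(mpg ~ ., data = mtcars)            # full-model start
    fwd <- step(null_fit,
                scope = list(lower = ~ 1, upper = formula(full_fit)),
                direction = "forward")                # forward stepwise
    bwd <- step(full_fit, direction = "backward")     # backward stepwise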

Regularization

  • Regularization generates coefficient estimates by minimizing a modified objective function
  • Using RSS as the starting point, regularization adds a penalty term reflecting model complexity
  • Regularization minimizes the augmented objective function
  • λ (lambda) is the regularization parameter that controls regularization extent
  • f_R(β) is the penalty function that captures the size of the regression coefficients
  • The product of the regularization parameter and the penalty function is the regularization penalty
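
A sketch using the glmnet package (an assumed tool choice); in glmnet's parameterization the penalized objective is RSS/(2n) + λ[(1 − α)/2 · Σβ² + α · Σ|β|], so α = 0 gives ridge regression and α = 1 gives the lasso.

    # install.packages("glmnet") if needed
    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # drop the intercept column
    y <- mtcars$mpg
    fit_ridge <- glmnet(X, y, alpha = 0)              # ridge penalty
    fit_lasso <- glmnet(X, y, alpha = 1)              # lasso penalty
    # glmnet standardizes the predictors internally (standardize = TRUE by
    # default) and reports coefficients back on the original scale.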

Modification Before Fitting a Regularized Regression Model

  • Standardize predictors by dividing them by their standard deviation
  • This puts all predictors and coefficients on the same scale

How the Regularization Parameter Affects a Regularized Model

  • When λ = 0, the regularization penalty vanishes and coefficient estimates are identical to OLS estimates
  • As λ increases, regularization becomes more severe
  • Increased pressure forces coefficient estimates closer to zero, reducing model flexibility
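
Continuing the glmnet sketch (same illustrative setup), the coefficient path shows the shrinkage as λ grows.

    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
    y <- mtcars$mpg
    fit_lasso <- glmnet(X, y, alpha = 1)
    coef(fit_lasso, s = 0.01)    # small lambda: estimates roughly match OLS
    coef(fit_lasso, s = 1)       # larger lambda: heavy shrinkage, some exact zeros
    plot(fit_lasso, xvar = "lambda", label = TRUE)    # full coefficient path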

Hyperparameters of a Regularized Model

  • λ and α are hyperparameters since they are pre-specified inputs not determined during optimization
    • A fine grid of (λ, α) values is set up in advance
    • CV error is computed for each pair of candidate values of (λ, α)
    • The pair producing the lowest CV error is chosen
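
A sketch of the grid search with cv.glmnet() (assumed tooling; the α grid and seed are arbitrary). cv.glmnet() builds its own λ grid, so only α needs an explicit loop.

    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
    y <- mtcars$mpg
    alphas <- seq(0, 1, by = 0.25)
    set.seed(3)
    cv_fits  <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a, nfolds = 10))
    best_err <- sapply(cv_fits, function(cv) min(cv$cvm))   # best CV error per alpha
    best <- which.min(best_err)
    alphas[best]                        # chosen alpha
    cv_fits[[best]]$lambda.min          # chosen lambda (lowest CV error)
    # For a strictly fair comparison across alpha values, pass a common
    # foldid vector to every cv.glmnet() call.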

Differences Between Stepwise Selection and Regularization

  • Model selection: Stepwise selection drops certain predictors entirely, while regularization pushes coefficient estimates towards zero
  • Parameter indexing: Stepwise selection uses the number of features as a measure of model complexity, while regularization uses λ as an indirect measure

The One-Standard-Error Rule

  • Regularized regression models whose CV error is within “one standard error” of the minimum CV error have more or less the same predictive performance
  • Selecting the simplest and most interpretable model among these is preferred.
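
cv.glmnet() reports this directly: lambda.1se is the largest (most heavily regularizing) λ whose CV error is within one standard error of the minimum (same illustrative setup as above).

    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
    y <- mtcars$mpg
    set.seed(4)
    cv <- cv.glmnet(X, y, alpha = 1)
    cv$lambda.min                  # lambda with the lowest CV error
    cv$lambda.1se                  # simplest model within one SE of that minimum
    coef(cv, s = "lambda.1se")     # coefficients under the one-standard-error rule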
