Questions and Answers
What is the primary goal of feature selection in data mining?
Which method treats feature selection as part of the training process?
Which of the following is NOT a benefit of feature selection?
What distinguishes an irrelevant feature from a redundant feature?
What is a common limitation of feature screening?
In best subset selection, how are features chosen?
Which measure of dependence might be used in feature screening?
Which statement accurately reflects feature selection?
What is the total number of models that can be estimated with 3 predictors?
What is a significant drawback of the brute-force search method in best subset selection?
Which of the following correctly describes forward stepwise selection?
What is a benefit of using best subset selection?
Which of the following statements about stepwise selection is true?
What happens when the number of predictors p is large, such as 30?
What is a limitation of the best subset selection method mentioned in the content?
Which algorithmic process does forward selection employ?
What does the variable λ balance in model selection methods?
Which part of the equation describes the loss function in supervised learning?
In the context of linear regression, what is the purpose of the regulariser C(β)?
What characteristic of ridge regression qualifies it as ℓ2 regularisation?
What does the term argmin represent in the model equations?
In the ridge regression formula, what is being minimized?
Which of the following is NOT required to create a learning algorithm from the supervised learning equation?
The regularised estimation problem in ridge regression includes which specific penalty term?
What happens to the coefficients in ridge regression as the value of λ increases?
Which statement correctly describes what occurs when λ equals 0 in ridge regression?
What kind of solution does lasso regularization favor?
What form does the ridge regression estimator take?
What is the key difference in the constraint formulations between lasso and ridge regression?
In ridge regression, what is the role of the regularizer?
Which of the following statements about the coefficient profiles for lasso is accurate?
What happens to the model's coefficients when λ approaches infinity in ridge regression?
What is the implication of having a higher predictive accuracy with lasso compared to OLS?
How do orthonormal columns of the design matrix X affect ridge regression?
In the context of the ridge penalty, how does it compare to the lasso penalty?
Which loss function is minimized in ridge regression?
Which condition must be true for two vectors to be considered orthonormal?
What characterizes the solutions of β = (0, 1) and β′ = (1/2, 1/2) in relation to the ridge penalty?
How are the regions of constraints described for lasso and ridge regression?
Why is the lasso said to reduce variance?
Study Notes
Feature Selection
- Involves choosing a subset of available features for model building, enhancing model performance.
- Aims to avoid overfitting, improve interpretability, and reduce computational costs.
- Three main methods:
- Filter methods: Select features before applying learning algorithms.
- Wrapper methods: Evaluate learning algorithms with various subsets of features.
- Embedded methods: Perform feature selection during model training.
Feature Importance
- Irrelevant features have a weak relationship with the response; redundant features add little information once the other features are taken into account (i.e. they are weak conditional on the rest).
- A feature that looks unimportant on its own can still be predictive in combination with others, so marginal checks alone can be misleading.
Feature Screening
- Removes features with weak bivariate relationships using measures of dependence like mutual information or Pearson correlation.
- Benefits: Low computational cost; effective in high-dimensional spaces.
- Limitations: May exclude features that are significant in combination; does not address redundancy.
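Bivariate screening can be sketched in a few lines of plain Python. The helper names (`pearson`, `screen_features`) and the 0.8 cutoff are illustrative choices, not part of the original notes:

```python
import math

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen_features(X_cols, y, threshold=0.8):
    """Keep indices of columns whose |correlation| with y exceeds the threshold."""
    return [j for j, col in enumerate(X_cols)
            if abs(pearson(col, y)) > threshold]

# Feature 0 tracks y closely; feature 1 is noise.
X_cols = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5]]
y = [1.1, 2.0, 2.9, 4.2]
print(screen_features(X_cols, y))  # → [0]: only feature 0 survives screening
```

Note that exactly as the limitation above says, a screen like this inspects each feature in isolation, so it can discard features that only matter jointly.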
Best Subset Selection
- Involves fitting all possible models and selecting the best based on model selection criteria.
- Faces a combinatorial explosion: with p features there are 2^p candidate models (including the empty model), so 30 features already give 2^30 ≈ 1.07 billion models, making exhaustive search computationally impractical.
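The 2^p count can be verified directly by enumerating subsets; this quick sketch (function name is our own) counts the empty model as one of the candidates:

```python
from itertools import combinations

def count_subsets(p):
    """Number of candidate models in best subset selection: one per feature subset."""
    return sum(1 for k in range(p + 1) for _ in combinations(range(p), k))

print(count_subsets(3))  # → 8, i.e. 2**3 models for 3 predictors
print(2 ** 30)           # → 1073741824: over a billion models for p = 30
```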
Lasso vs. Ridge Regression
- Lasso regression tends to yield sparse solutions by employing an L1 penalty, promoting feature selection by shrinking unnecessary coefficients to zero.
- Ridge regression uses an L2 penalty, shrinking coefficients but typically retaining all features.
- Lasso can outperform Ordinary Least Squares (OLS) in predictive accuracy by controlling variance without excessive bias.
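The contrast between the two penalties is easiest to see in the special case of an orthonormal design, where both estimators act coordinate-wise on the OLS coefficients. Assuming the convention L(β) = ‖y − Xβ‖² + λC(β), ridge rescales every coefficient by 1/(1 + λ), while lasso soft-thresholds at λ/2 (the helper names below are illustrative):

```python
def ridge_shrink(beta_ols, lam):
    """Ridge with orthonormal X: every coefficient is scaled by 1/(1 + lam)."""
    return [b / (1.0 + lam) for b in beta_ols]

def lasso_shrink(beta_ols, lam):
    """Lasso with orthonormal X: soft-thresholding zeroes out small coefficients."""
    out = []
    for b in beta_ols:
        mag = abs(b) - lam / 2.0
        out.append(0.0 if mag <= 0 else (mag if b > 0 else -mag))
    return out

beta_ols = [3.0, 0.4, -2.0]
print(ridge_shrink(beta_ols, lam=1.0))  # → [1.5, 0.2, -1.0]: all shrink, none are zero
print(lasso_shrink(beta_ols, lam=1.0))  # → [2.5, 0.0, -1.5]: the small one is dropped
```

This is exactly the sparsity contrast in the notes: ridge keeps every feature with smaller weights, while lasso performs feature selection by setting weak coefficients to exactly zero.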
Supervised Learning Framework
- The general formulation is β̂ = argmin_β L(β) + λC(β): the loss L measures how well the model fits the data, the regulariser C penalises model complexity, and λ balances the two.
Regularization
- Ridge regression penalizes the squared L2 norm of coefficients to prevent overfitting.
- Ridge estimator formula: β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy.
- As λ increases, coefficient shrinkage towards zero intensifies.
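The closed-form estimator can be checked on a toy two-feature problem. This sketch hand-codes the 2×2 inverse so it needs no libraries (the function name is our own):

```python
def ridge_2d(X, y, lam):
    """Closed-form ridge for two features: (X^T X + lam*I)^{-1} X^T y."""
    # Entries of X^T X + lam*I (symmetric 2x2: [[a, b], [b, d]])
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    # Entries of X^T y
    u = sum(r[0] * t for r, t in zip(X, y))
    v = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
for lam in (0.0, 1.0, 10.0):
    # lam = 0 recovers OLS, [1.0, 2.0]; larger lam shrinks both towards zero
    print(lam, ridge_2d(X, y, lam))
```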
Stepwise Selection Methods
- Can add or remove features iteratively to find near-optimal subsets while reducing computational load compared to brute-force search.
- Forward selection starts from the empty model and, at each step, adds the single predictor whose inclusion most improves the fit (e.g. gives the lowest training error among the candidates); this fits on the order of p² models rather than all 2^p subsets.
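The greedy loop can be sketched generically. This version takes any scoring function where higher is better (in the notes' setting, the score would be negative training error); the names and the toy score are illustrative assumptions:

```python
def forward_selection(features, score, max_size=None):
    """Greedy forward selection: repeatedly add the single feature that
    most improves the score (higher is better) of the current subset."""
    selected, remaining = [], list(features)
    best_score = score(selected)
    while remaining and (max_size is None or len(selected) < max_size):
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        cand_score = score(selected + [candidate])
        if cand_score <= best_score:
            break  # no single feature improves the model further
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected

# Toy score: only 'x1' and 'x3' contribute anything; 'x2' adds nothing.
useful = {"x1": 2.0, "x3": 1.0}
score = lambda subset: sum(useful.get(f, 0.0) for f in subset)
print(forward_selection(["x1", "x2", "x3"], score))  # → ['x1', 'x3']
```

Because each step is greedy, the result is only near-optimal: a subset that is best overall can be missed if no single addition along the way looks best.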
Importance of Regularization Parameters
- The selection of λ is critical for balancing model complexity and fit quality; a higher λ increases regularization, leading to simpler models with lower variance.
Description
This quiz focuses on the concepts of feature selection and importance in model building, highlighting methods like filter, wrapper, and embedded techniques. Additionally, it discusses the distinction between irrelevant and redundant features and the process of feature screening. Enhance your understanding of how to improve model performance and interpretability.