Statistical Learning and Data Mining Lecture 7
The University of Sydney Business School
Summary
This document is a lecture on statistical learning and data mining, focusing on feature selection, regularisation, and interpretability. It covers filter, wrapper, and embedded feature selection methods, ridge regression, the lasso, the elastic net, and SHAP values.
Full Transcript
Statistical Learning and Data Mining
Lecture 7: Feature Selection and Regularisation
Discipline of Business Analytics

Learning objectives
Feature selection.
Regularisation.
Interpretability.

Outline
1. Feature selection
2. Regularisation
3. Interpretability

Feature selection
Feature selection, variable selection, or subset selection methods choose a subset of the available features to be used by the model.

Feature selection helps to:
Avoid overfitting.
Improve interpretability.
Reduce the computational cost.

There are three types of feature selection methods:
Filter methods select the features before running the learning algorithm.
Wrapper methods evaluate the learning algorithm with different subsets of features.
Embedded methods are learning algorithms that perform feature selection as part of training.

It is useful to make the following distinction:
An irrelevant feature is one that has a very weak relationship with the response, whether in isolation or conditional on other features.
A redundant feature is one that is potentially related to the response, but has a very weak relationship with it conditional on other features.

"Features that look irrelevant in isolation may be relevant in combination." – Pedro Domingos

Feature screening
In feature screening, we remove features that have a weak bivariate relationship with the response according to a suitable measure of dependence. Common measures of dependence include the mutual information, the φk correlation, and the Pearson correlation. We treat the dependence threshold for excluding features as a hyperparameter.

▲ Low computational cost.
▲ Helpful when there are many irrelevant features.
▼ Excludes features that look irrelevant in isolation but are relevant in combination.
▼ Does not remove redundant features.

Best subset selection
In best subset selection, we fit all possible models and select the best specification according to the model selection criterion. With p predictors there are 2^p possible models to choose from.

Example: linear regression
If p = 3, we estimate 2^3 = 8 models:
k = 0: f(x) = β0
k = 1: f(x) = β0 + β1 x1,  f(x) = β0 + β2 x2,  f(x) = β0 + β3 x3
k = 2: f(x) = β0 + β1 x1 + β2 x2,  f(x) = β0 + β1 x1 + β3 x3,  f(x) = β0 + β2 x2 + β3 x3
k = 3: f(x) = β0 + β1 x1 + β2 x2 + β3 x3

Algorithm: subset selection via brute-force search
1: Estimate the null model F_0, which contains the intercept only.
2: for k = 1, 2, ..., p do
3:   Fit all \binom{p}{k} possible models with exactly k predictors.
4:   Pick the model with the lowest training error and call it F_k.
5: end for
6: Select the best model among F_0, F_1, ..., F_p according to a model selection method.

Combinatorial explosion
The brute-force search method suffers from a problem of combinatorial explosion, since it requires the estimation of 2^p different models. For example, for p = 30 we would need to fit over a billion models! Brute-force search has a very high computational cost and is infeasible in practice unless p is small.

Best subset selection
▲ Higher predictive accuracy relative to the model based on the full set of features.
▲ Higher interpretability.
▼ High computational cost.
▼ Variability in the choice of predictors.
▼ Making binary decisions to include or exclude predictors is not optimal in many settings.
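To make the brute-force algorithm concrete, here is a minimal Python sketch with scikit-learn. The dataset (scikit-learn's diabetes data), the use of 5-fold cross-validated MSE as the model selection method, and the helper cv_mse are illustrative assumptions rather than part of the lecture.

```python
# A minimal sketch of best subset selection by brute-force search, following
# the algorithm above: within each size k, pick the subset with the lowest
# training error, then compare the finalists F_0, ..., F_p by cross-validation.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True, as_frame=True)
features = list(X.columns)

finalists = [()]                                      # F_0: intercept-only model
for k in range(1, len(features) + 1):
    best_rss, best_subset = np.inf, None
    for subset in combinations(features, k):          # all C(p, k) subsets of size k
        model = LinearRegression().fit(X[list(subset)], y)
        rss = np.sum((y - model.predict(X[list(subset)])) ** 2)
        if rss < best_rss:                            # lowest training error wins
            best_rss, best_subset = rss, subset
    finalists.append(best_subset)

def cv_mse(subset):
    # Model selection step: 5-fold cross-validated MSE of each finalist.
    Xk = np.ones((len(y), 1)) if not subset else X[list(subset)]
    return -cross_val_score(LinearRegression(), Xk, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

best = min(finalists, key=cv_mse)
print("Selected features:", best)
```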
Stepwise selection
Stepwise selection methods search for near-optimal subsets by adding or removing one feature at a time. This modification dramatically reduces the computational cost compared to brute-force search. Stepwise selection is an algorithmic approximation to best subset selection.

Algorithm: forward selection
1: Estimate the null model F_0, which contains only the intercept.
2: for k = 1, 2, ..., p do
3:   Fit all p − k + 1 models that add one predictor to F_{k−1}.
4:   Choose the model with the lowest training error and call it F_k.
5: end for
6: Select the best model among F_0, F_1, ..., F_p according to a model selection method.

Forward stepwise selection
Compared to brute-force search, the forward selection algorithm reduces the number of fitted models from 2^p to 1 + p(p + 1)/2. For example, for p = 30 the number of fitted models is 466, compared to 1,073,741,824 for brute-force search. The disadvantage is that the final model selected by stepwise selection is not guaranteed to optimise any selection criterion among the 2^p possible models.

Forward selection
▲ Tends to perform similarly to best subset selection.
▲ Much lower computational cost than best subset selection.
▼ Not guaranteed to find the best subset.

Recursive feature elimination
Recursive feature elimination (RFE) works as follows:
Train the model with all the features and identify the weakest feature according to a measure of feature importance.
Train the model again with that feature removed and identify the weakest feature in the new specification.
Repeat until the model has the desired number of features.

RFE is similar to backward selection, but uses a measure of feature importance rather than the training error to select which feature to remove at each step. RFE has a lower computational cost than stepwise selection, since it only fits the model once per iteration. As usual, we treat the number of features as a hyperparameter.

Embedded methods
Two examples of learning algorithms that perform variable selection are:
The lasso, to be discussed in the next section.
Tree-based methods, to be discussed later in the unit.

Regularisation
Regularisation is a general term that refers to any technique that encourages a learning algorithm to build a less complex model. The most common type of regularisation is to add a complexity penalty to the learning rule.

Building blocks of a learning algorithm
Model.
Estimator.
Computational algorithm.

Empirical risk minimisation
In empirical risk minimisation (ERM), we fit the model by minimising the training error,

\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i; \theta)\big).

Example: linear regression
For a linear regression under the squared error loss, ERM corresponds to the OLS method for parameter estimation. We learn the parameters as

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2,

where β = (β0, β1, ..., βp), leading to the estimated predictive function

\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_p x_p.

Regularised risk minimisation
ERM typically leads to overfitting. In regularised risk minimisation, we add a complexity penalty to the objective function,

\hat{\theta} = \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i; \theta)\big) + \lambda\, C(\theta) \Big\},

where λ ≥ 0 is a hyperparameter and C(θ) measures the complexity of the model as a function of the parameters.
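As a rough illustration of this template, the following sketch minimises a penalised squared-error objective with scipy.optimize.minimize on simulated data. The data, the value of λ, and the choice of C(θ) as the squared ℓ2 norm (anticipating ridge regression below) are all assumptions made for the example.

```python
# Regularised empirical risk minimisation, sketched with scipy:
# minimise (1/n) * sum of squared errors + lambda * C(beta),
# with C(beta) = sum of squared coefficients (the intercept is not penalised).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                      # simulated features (assumption)
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(size=n)

lam = 1.0                                        # hyperparameter lambda

def objective(theta):
    beta0, beta = theta[0], theta[1:]
    residuals = y - beta0 - X @ beta
    training_error = np.mean(residuals ** 2)     # empirical risk (squared loss)
    complexity = np.sum(beta ** 2)               # C(theta): squared l2 norm
    return training_error + lam * complexity

theta_hat = minimize(objective, x0=np.zeros(p + 1)).x
print("Estimated coefficients:", np.round(theta_hat, 3))
```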
Regularisation

\hat{\theta} = \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i; \theta)\big) + \lambda\, C(\theta) \Big\}

When λ = 0, the method is equivalent to empirical risk minimisation. If λ → ∞, the learning algorithm fits the simplest possible model, such as a model that makes a constant prediction. For λ > 0, the higher λ, the higher the weight on the complexity term, and therefore the lower the complexity of the learned model.

The advantage of this method comes from the bias-variance trade-off: if we increase λ, we decrease the variance by inducing lower model complexity. On the other hand, decreasing model complexity by increasing λ leads to higher bias. We use model selection methods to find the value of λ that optimally balances the two sources of error.

Supervised learning in one equation
The equation above describes most supervised learning methods. Choose a model f, loss function L, and regulariser C, and you have a learning algorithm.

Regularised linear models
How do we specify the regulariser C? To make the discussion more concrete, we focus on linear regression:

\hat{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda\, C(\beta) \Big\}

for a choice of C(β).

Ridge regression
The ridge regression method solves the regularised estimation problem

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}.

We refer to this approach as ℓ2 regularisation, since the method penalises the squared ℓ2 (Euclidean) norm of β.

This is a special case of regularised risk minimisation with:
Loss function L(y, f(x)) = (y − f(x))^2.
Model f(x_i; β) = β0 + \sum_{j=1}^{p} βj x_{ij}.
Regulariser C(β) = \sum_{j=1}^{p} βj^2.

If λ = 0, the method is equivalent to OLS. If λ → ∞, the penalty dominates the cost function and the method sets all coefficients to zero. Therefore, the ridge regression method shrinks the coefficients towards zero for λ > 0. Increasing λ induces more shrinkage (smaller coefficients).

We can show that the ridge estimator has the formula

\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.

We can better understand the ridge regression method by assuming that the columns of the design matrix X are orthonormal. We say that two vectors u and v are orthonormal when they are orthogonal to each other and have norm one:

\|u\| = \sqrt{u^\top u} = 1, \quad \|v\| = \sqrt{v^\top v} = 1, \quad u^\top v = 0.

When the columns of X are orthonormal, the ridge estimate is just a scaled version of the OLS estimate:

\hat{\beta}^{\text{ridge}} = (I + \lambda I)^{-1} X^\top y = \frac{1}{1 + \lambda}\, \hat{\beta}^{\text{OLS}}.

In more general situations, we can say that the ridge regression method shrinks the coefficients of correlated inputs towards each other.

Define the ridge shrinkage factor as

s(\lambda) = \frac{\|\hat{\beta}^{\text{ridge}}\|_2}{\|\hat{\beta}^{\text{OLS}}\|_2}

for a given λ. As λ decreases from an infinitely large value down to zero, s(λ) increases from zero to one. The next figure illustrates the effect of varying the shrinkage factor on the estimated parameters.

Ridge coefficient profiles
[Figure: ridge coefficient profiles plotted against the shrinkage factor.]
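A small sketch of how such a profile can be traced with scikit-learn's Ridge on simulated data; the data and the grid of λ values are assumptions, and scikit-learn's alpha parameter plays the role of λ.

```python
# Sketch: tracing ridge coefficient profiles on simulated data. As lambda
# grows, the coefficients shrink towards zero (but are not set exactly to
# zero), and the coefficients of the two correlated inputs shrink together.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)      # two highly correlated inputs
beta_true = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)

for lam in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    coef = Ridge(alpha=lam).fit(X, y).coef_
    s = np.linalg.norm(coef) / ols_norm           # ridge shrinkage factor s(lambda)
    print(f"lambda={lam:>7}: s(lambda)={s:.2f}, coefficients={np.round(coef, 2)}")
```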
Ridge regression
▲ Higher predictive accuracy than OLS. We reduce the variance without letting the bias increase too much.
▲ Robust to the presence of highly correlated predictors.
▲ Computationally efficient, with much lower computational cost than subset selection.
▼ Includes all features in the model, which may not be optimal if there is a large number of irrelevant or weak features.

The lasso
The lasso method solves the penalised estimation problem

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}

for a hyperparameter λ. We refer to this approach as ℓ1 regularisation.

The lasso has two key properties:
Shrinkage. As with ridge regression, the lasso shrinks the coefficients towards zero. However, the nature of this shrinkage is different for the lasso, as we illustrate below.
Variable selection. Some coefficients may be exactly zero in the lasso solution for λ sufficiently large. Therefore, we say that the lasso can produce a sparse solution.

Why does the lasso yield a sparse solution?
Consider two candidate solutions, β = (0, 1) and β′ = (1/√2, 1/√2). The squared ℓ2 norm is the same in both cases, ∥β∥₂² = ∥β′∥₂² = 1, so the ridge penalty does not distinguish them. In contrast, ∥β∥₁ = 1 and ∥β′∥₁ = √2, so the lasso regulariser favours the sparse solution.

The lasso has an equivalent formulation as a constrained minimisation problem:

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.

The corresponding formulation for ridge regression is

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t.

[Figure: lasso (left) versus ridge regression (right) with two predictors. The contours in red represent the objective function, while the regions in blue represent the constraints.]

Define the lasso shrinkage factor as

s(\lambda) = \frac{\|\hat{\beta}^{\text{lasso}}\|_1}{\|\hat{\beta}^{\text{OLS}}\|_1}.

The next figure illustrates the effect of varying the shrinkage factor on the estimated parameters.

Lasso coefficient profiles
[Figure: lasso coefficient profiles plotted against the shrinkage factor.]

The lasso
▲ Higher predictive accuracy than OLS. We reduce the variance without letting the bias increase too much.
▲ Variable selection, an important advantage over ridge regression.
▲ Computationally efficient, with much lower computational cost than subset selection.
▲ Sparse models are more interpretable.
▼ Does not handle highly correlated predictors as well as ridge regression.
▼ Implicitly assumes sparsity.

Ridge regression vs. the lasso
Neither method universally outperforms the other. The lasso tends to perform better when a small number of features have substantial coefficients, while the rest have coefficients that are zero or close to zero. Ridge regression tends to perform better when there are many relevant predictors with coefficients of similar size.

Elastic net
The elastic net is a compromise between ridge regression and the lasso:

\hat{\beta}^{\text{EN}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big) \Big\}

for λ ≥ 0 and 0 ≤ α ≤ 1. The penalty term involving |βj| encourages sparse solutions, as in the lasso. The penalty term involving βj² improves the handling of highly correlated predictors, as in ridge regression.

Regularised linear models
Practical details for all the previous methods, illustrated in the sketch below:
We do not penalise the intercept (β0).
It is essential to scale all the features before running the learning algorithm.
Unlike with OLS, one-hot encoding is applicable and often performs better.
We use a model selection method combined with a grid search to select λ.
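One possible way to implement these practical details with scikit-learn: a pipeline standardises the features inside each cross-validation fold and a grid search selects λ (called alpha in scikit-learn) for the lasso. The dataset and the λ grid are assumptions for illustration.

```python
# Sketch: scaling + grid search over lambda for the lasso, following the
# practical details above.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),          # scale features before regularisation
    ("lasso", Lasso(max_iter=10_000)),    # the intercept is fitted but not penalised
])

param_grid = {"lasso__alpha": np.logspace(-3, 2, 30)}   # grid of lambda values
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

print("Selected lambda:", search.best_params_["lasso__alpha"])
print("Coefficients:", np.round(search.best_estimator_["lasso"].coef_, 2))
```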
Types of regularisation
To summarise, the three most common types of regularisation in machine learning are:
ℓ2 regularisation: C(w) = ∥w∥₂²
ℓ1 regularisation: C(w) = ∥w∥₁ (induces sparsity)
Elastic net regularisation: C(w) = α∥w∥₂² + (1 − α)∥w∥₁

Interpretability
Understanding why a machine learning model makes certain predictions can be important for many reasons:
It allows us to check whether we can trust the model.
It can provide insights to improve the model.
It can provide business insights.
It can help to support human decisions based on the model's predictions.
It may be essential to meet regulatory requirements, such as in credit scoring.

Useful questions from Howard and Gugger (2019):
How confident are we in a certain prediction?
For a particular prediction, what are the most important factors, and how did they influence that prediction?
Which inputs are strongest, and which can we ignore?
Which inputs are redundant?
How do the predictions vary as we vary the inputs?

Local vs. global explanation
Local explanation refers to interpreting an individual prediction or group of predictions. Global explanation refers to interpreting the model's behaviour on the entire data.

Accuracy vs. interpretability
A simple approach to interpretability is to use simple and interpretable models such as linear regression, logistic regression, and decision trees. However, it is often beneficial to use complex models for prediction, leading to a trade-off between accuracy and interpretability. Complex methods tend to be less interpretable than simpler methods.
[Figure: interpretability versus flexibility. From most to least interpretable: subset selection and the lasso, least squares, generalised additive models, trees, bagging and boosting, support vector machines.]

Interpretability
Models such as linear regression and decision trees are said to have intrinsic interpretability. Model-agnostic tools can help us interpret the predictions of "black-box" algorithms.

SHAP values
The SHAP (Shapley additive explanations) framework assigns each feature an importance value for a particular prediction. The feature attribution is called a SHAP value. SHAP views any explanation of a model's prediction as a model in itself, called the explanation model.

SHAP values form an additive explanation,

f(x) = \phi_0 + \sum_{j=1}^{p} \phi_j,

where f(x) is the original prediction, φ0 is a base value, and φj is the SHAP value for feature j. In words, the SHAP values add up exactly to the difference between the prediction and the base value.

Example: linear regression
The best explanation model for an interpretable model such as a linear regression is the model itself,

f(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p.

In this case, we can directly see that the contribution of input j to the prediction is βj xj. The SHAP value is

\phi_j = \beta_j (x_j - \bar{x}_j),

where \bar{x}_j is the sample mean of feature j.

SHAP values
SHAP values are based on results from cooperative game theory that ensure that the feature attributions satisfy desirable theoretical properties. The theory and implementation are quite advanced, so we do not cover them here.

Implementations
Kernel SHAP applies to any model, but is computationally expensive. Linear SHAP, Tree SHAP, and Deep SHAP are model-specific implementations.
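Before moving on, a small numpy sketch of the linear-regression example above: it computes φj = β̂j(xj − x̄j) directly, takes the prediction at the feature means as the base value φ0, and checks the additivity property. The simulated data are an assumption.

```python
# Sketch: SHAP values for a fitted linear regression, computed directly from
# phi_j = beta_j * (x_j - mean(x_j)); the base value phi_0 is the prediction
# at the feature means.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=500)

model = LinearRegression().fit(X, y)
x = X[0]                                                   # observation to explain

phi_0 = model.predict(X.mean(axis=0, keepdims=True))[0]    # base value
phi = model.coef_ * (x - X.mean(axis=0))                   # SHAP values

print("SHAP values:", np.round(phi, 3))
# Additivity check: base value + attributions equals the prediction.
print(np.isclose(phi_0 + phi.sum(), model.predict(x.reshape(1, -1))[0]))
```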
SHAP values
[Figures: example SHAP explanation plots for individual predictions.]

SHAP feature importance
We can measure the global importance of a feature by computing

I_j = \sum_{i=1}^{n} \big| \phi_j^{(i)} \big|,

where \phi_j^{(i)} is the SHAP value of input j for prediction i.

[Figure: SHAP feature importance plot.]

SHAP values
▲ Additive explanations are easy to understand and communicate.
▲ Solid theoretical basis.
▼ Kernel SHAP is very computationally expensive.
▼ The practical implementation assumes that the inputs are independent.
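To close, a brief sketch of the global importance measure above for a linear regression, reusing the attribution formula φj = β̂j(xj − x̄j); the simulated data are an assumption, and the absolute value prevents positive and negative contributions from cancelling.

```python
# Sketch: global SHAP feature importance I_j = sum_i |phi_j^(i)| for a linear
# regression, using phi_j^(i) = beta_j * (x_ij - mean(x_j)).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, 0.0, -1.5, 0.5]) + rng.normal(size=500)

model = LinearRegression().fit(X, y)

phi = model.coef_ * (X - X.mean(axis=0))     # n x p matrix of SHAP values
importance = np.abs(phi).sum(axis=0)         # I_j: aggregate absolute attributions

for j in np.argsort(importance)[::-1]:       # rank features by global importance
    print(f"feature {j}: importance {importance[j]:.1f}")
```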