Learning from Data Lecture 5 PDF
University of Exeter
Dr Marcos Oliveira
Summary
This lecture from the University of Exeter covers model complexity, the bias-variance tradeoff, and regularisation techniques, including LASSO and Ridge regression. It outlines the key concepts and gives Python syntax examples to illustrate their practical application.
Full Transcript
Learning from Data
Lecture 5
Dr Marcos Oliveira

Model complexity and bias-variance tradeoff

Learning objectives
- To demonstrate understanding of the tradeoff between model complexity and error.
- To demonstrate understanding of the bias and variance of a model.
- To demonstrate understanding of regularisation as an approach to overfitting.

Outline of lecture
Introduction to model complexity and bias-variance tradeoff:
- Define the terms bias and variance and discuss their relationship to model complexity.
- Discuss different sources of model error and how they relate to bias and variance.
- Discuss the bias-variance tradeoff, how it relates to complexity, and how to find the optimal balance between bias, variance and complexity vs. error.
Introduction to LASSO and Ridge regression:
- Explain how to tune a model with more granularity using regularisation.
- Discuss the relationship between regularisation and feature selection.

Recap from Lecture 4

Model complexity vs. error
How error metrics relate to the complexity of the model.
[Figure: three panels (polynomial degree = 1, 4 and 14), each plotting Y against X and showing the model, the true function and the samples.]
[Figure: for each degree, the fitted polynomial (Y against X) beside a plot of error against complexity, with curves for the cross validation error Jcv(θ) and the training error Jtrain(θ).]
- Underfitting (polynomial degree = 1): training and cross validation errors are both high.
- Overfitting (polynomial degree = 14): training error is low, cross validation error is high.
- Just right (polynomial degree = 4): training and cross validation errors are both low.
In practice, it is unlikely that we can plot the model and see that it is very close to the underlying function, so we rely on the error curves instead.
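The recap above can be reproduced numerically. The following is a minimal sketch, not the lecture's own code: it assumes a hypothetical true function (a cosine), 30 noisy samples and 5-fold cross validation, and it uses NumPy's `Polynomial.fit` for the least-squares fits. It prints the training error and the cross validation error for polynomial degrees 1, 4 and 14.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_function(x):
    # Hypothetical underlying function, chosen only for illustration.
    return np.cos(1.5 * np.pi * x)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


def train_and_cv_error(x, y, degree, n_folds=5, seed=0):
    """Return (training MSE, k-fold cross validation MSE) for one degree."""
    train_error = mse(Polynomial.fit(x, y, degree), x, y)

    # Shuffle the indices once, then hold out each fold in turn.
    indices = np.random.default_rng(seed).permutation(x.size)
    fold_errors = []
    for held_out in np.array_split(indices, n_folds):
        kept = np.setdiff1d(indices, held_out)
        model = Polynomial.fit(x[kept], y[kept], degree)
        fold_errors.append(mse(model, x[held_out], y[held_out]))
    return train_error, float(np.mean(fold_errors))


# Assumed data: 30 noisy samples of the hypothetical true function.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=30)
y = true_function(x) + rng.normal(0.0, 0.1, size=x.size)

errors = {degree: train_and_cv_error(x, y, degree) for degree in (1, 4, 14)}
for degree, (train_error, cv_error) in errors.items():
    print(f"degree={degree:2d}  J_train={train_error:.4f}  J_cv={cv_error:.4f}")
```

Under these assumptions the training error can only fall as the degree grows, so it is the cross validation error that shows whether the extra complexity helps. The exact numbers depend on the random seed and on the assumed function.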
Good approach: as soon as the cross validation error starts to increase, we stop making the model more complex.

Choosing the level of complexity
[Figure: three panels plotting Y against X for polynomial degree = 1, 4 and 14.]
- Polynomial degree = 1: poor at both training and predicting.
- Polynomial degree = 4: just right for both training and predicting.
- Polynomial degree = 14: good at training, poor at predicting.

Intuition: Bias and variance
[Figure: intuitive view of bias and variance, a grid of outcomes with columns for low and high variance and rows for low and high bias.]
Bias is a tendency to miss.
Variance is a tendency to be inconsistent.
Ideally, we want the top left outcome (low bias, low variance), i.e., highly consistent predictions that are close to perfect on average.

Three sources of model error
- Bias (being wrong): the model does not capture the relationship between the features and the outcome variable. Predictions are consistent, but poor model choices lead to wrong predictions.
- Variance (being unstable): the model identifies the relationship between the features and the outcome variable perfectly. The model incorporates random noise besides the underlying function.
- Irreducible error (unavoidable randomness): real world data will always contain some randomness in the data points. The aim is a model that finds the actual relationship and avoids incorporating random noise.

Three sources of model error: Bias
Tendency of predictions to miss true values when predicting.
High bias can be the result of:
- the model misrepresenting the data given missing information.
- an overly simple model, i.e., bias to the simplicity of the model.
Thus, the model misses the real patterns, causing it to underfit the data.
High bias is associated with underfitting the training data.

Three sources of model error: Variance
Tendency of predictions to fluctuate.
- Characterised by high sensitivity of output to small changes in input data.
- Often due to overly complex or poorly fit models, e.g., the polynomial degree 14 model.
High variance is associated with overfitting the training data.

Three sources of model error: Irreducible error
Intrinsic uncertainty or randomness in the data.
It is impossible to perfectly model the majority of real world data. Thus, we have to be comfortable with some measure of error.
This error is present even in the best model.

Bias-variance tradeoff
Summary of the bias-variance tradeoff:
- Model adjustments that decrease bias often increase variance, and vice versa.
- The bias-variance tradeoff is analogous to a complexity tradeoff.
- Finding the best model means choosing the right level of complexity.

Visualising the complexity tradeoff
We want a model complex enough not to underfit, but not so exceedingly complex that it overfits.
We search for a model that describes the feature-target relationship but is not so complex that it fits spurious patterns.
Reference: Scott Fortmann-Roe, "Accurately Measuring Model Prediction Error", https://scott.fortmann-roe.com/docs/MeasuringError.html

Bias-variance tradeoff: Our example
The higher the degree of a polynomial regression, the more complex the model is (lower bias, higher variance).
- At lower degrees, the predictions are too rigid to capture the curved pattern in the data (bias).
- At higher degrees, the predictions fluctuate wildly because of the model's sensitivity (variance).
The optimal model has sufficient complexity to describe the data without overfitting.
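The three sources of error can be estimated by simulation: the expected squared prediction error splits into squared bias, variance and irreducible error. The sketch below is an illustration under stated assumptions, not the lecture's method. It assumes a hypothetical cosine true function and Gaussian noise with standard deviation 0.1 (so the irreducible error is 0.01), refits a polynomial on many freshly drawn training sets, and measures how far the average prediction is from the truth (squared bias) and how much individual predictions scatter around that average (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_function(x):
    # Hypothetical underlying function, chosen only for illustration.
    return np.cos(1.5 * np.pi * x)


def bias_variance(degree, n_datasets=200, n_samples=30, noise_sd=0.1, seed=0):
    """Estimate squared bias and variance of a polynomial fit by simulation."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 50)
    predictions = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        # Each iteration draws a fresh training set from the same process.
        x = rng.uniform(0.0, 1.0, size=n_samples)
        y = true_function(x) + rng.normal(0.0, noise_sd, size=n_samples)
        predictions[i] = Polynomial.fit(x, y, degree)(x_test)

    # Squared bias: gap between the average prediction and the truth.
    mean_prediction = predictions.mean(axis=0)
    bias_squared = float(np.mean((mean_prediction - true_function(x_test)) ** 2))
    # Variance: spread of the predictions across training sets.
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_squared, variance


results = {degree: bias_variance(degree) for degree in (1, 4, 14)}
for degree, (bias_squared, variance) in results.items():
    print(f"degree={degree:2d}  bias^2={bias_squared:.4g}  variance={variance:.4g}")
```

With these assumptions the degree 1 model should show the largest squared bias and the degree 14 model the largest variance. The degree 14 estimates can be very noisy, because a high-degree fit may swing wildly near the edges of the sampled range.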
[Figure: three panels plotting Y against X. Polynomial degree = 1: high bias, low variance. Polynomial degree = 4: just right. Polynomial degree = 14: low bias, high variance.]

Linear model regularisation (or shrinkage)
When we "shrink" the coefficients, we reduce variance.
The method of shrinkage (or regularisation) adds an adjustable regularisation strength parameter directly into the cost function.
This λ (lambda) adds a penalty proportional to the size of the estimated model parameters.
When λ is large, stronger parameters are penalised. Thus, a more complex model will be penalised.
Adjusted cost function: $M(w) + \lambda R(w)$
- M(w): model error
- R(w): function of estimated parameter(s)
- λ: regularisation strength parameter

The regularisation strength parameter λ allows us to manage the complexity tradeoff.
- More regularisation gives a simpler model, with more bias.
- Less regularisation makes the model more complex and increases variance.
If the model overfits (variance is too high), regularisation can improve generalisation error and reduce variance.
We will see two approaches to regularisation:
- Ridge regression
- LASSO

Ridge regression
In ridge regression, the penalty λ is applied proportionally to squared coefficient values.
This penalty imposes bias on the model and reduces variance.

Textbook excerpt shown on the slide: Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize

$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad (6.5)$$

where λ ≥ 0 is a tuning parameter, to be determined separately. Equation 6.5 trades off two different criteria. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, $\lambda \sum_j \beta_j^2$, called a shrinkage penalty, is small when β1, ..., βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero. The tuning parameter λ serves to control ...

Slide annotations:
- The second term is the shrinkage penalty.
- λ ≥ 0 is the tuning parameter.
- The best value for λ can be selected via cross validation.
- It is best practice to scale features.

Ridge regression: example on polynomial regression
Polynomial regression of order k = 9.
[Figure: three panels plotting Y against X, one per value of λ.]
- λ = 0.0001: λ ≈ 0, just fitting a polynomial model without regularisation.
- λ = 10: λ just right, it finds the right balance between bias and variance.
- λ = 1000000000000: λ too high, it is adding too much penalty for high coefficients.

LASSO regression
In LASSO (Least Absolute Shrinkage and Selection Operator), the penalty λ is applied proportionally to absolute coefficient values.

Textbook excerpt shown on the slide: In ridge regression, increasing the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables. The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity

$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (6.7)$$

Comparing (6.7) to (6.5), we see that the lasso and ridge regression have similar formulations. The only difference is that the $\beta_j^2$ term in the ridge regression penalty (6.5) has been replaced by $|\beta_j|$ in the lasso penalty (6.7). In statistical parlance, the lasso uses an $\ell_1$ (pronounced "ell 1") penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector β is given by $\lVert \beta \rVert_1 = \sum |\beta_j|$. As with ridge regression, the lasso shrinks the coefficient estimates ...
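Returning to the ridge example on polynomial regression of order 9, the effect of the three values of λ can be sketched with the closed-form ridge solution. This is an illustrative sketch under assumptions that are not in the lecture: the data are 30 noisy samples of a hypothetical cosine function, the polynomial features are standardised (following the advice to scale features), and the intercept is left unpenalised as in equation (6.5).

```python
import numpy as np


def polynomial_features(x, degree):
    """Columns x, x^2, ..., x^degree, each standardised to mean 0 and sd 1."""
    features = np.column_stack([x ** k for k in range(1, degree + 1)])
    return (features - features.mean(axis=0)) / features.std(axis=0)


def ridge_fit(features, y, lam):
    """Minimise RSS + lam * sum(beta_j ** 2); the intercept is not penalised."""
    beta_0 = y.mean()  # valid because the feature columns are centred
    p = features.shape[1]
    gram = features.T @ features + lam * np.eye(p)
    beta = np.linalg.solve(gram, features.T @ (y - beta_0))
    return beta_0, beta


# Assumed data: noisy samples of a hypothetical true function.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
y = np.cos(1.5 * np.pi * x) + rng.normal(0.0, 0.1, size=x.size)

features = polynomial_features(x, degree=9)
results = {}
for lam in (1e-4, 10.0, 1e12):  # the three values shown on the slide
    beta_0, beta = ridge_fit(features, y, lam)
    rss = float(np.sum((y - beta_0 - features @ beta) ** 2))
    results[lam] = (float(np.linalg.norm(beta)), rss)
    print(f"lambda={lam:g}  ||beta||_2={results[lam][0]:.3g}  RSS={rss:.3f}")
```

As λ grows, the coefficient norm shrinks and the training RSS rises. At λ = 1e12 the coefficients are essentially zero, so the model predicts little more than the mean of y, which is the "too much penalty" case.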
In the LASSO cost function, the term $\lambda \sum_{j=1}^{p} |\beta_j|$ is the shrinkage penalty, and λ ≥ 0 is the tuning parameter.

L1 and L2 regularisation
LASSO and Ridge regression are also known as L1 regularisation and L2 regularisation, respectively.
The names L1 and L2 regularisation come from the L1 and L2 norm of a vector w, respectively. Here $\lVert w \rVert$ denotes the vector's norm or magnitude.
The L1-norm (also known as the 1-norm) is calculated as the sum of the absolute vector values, where the absolute value of a scalar uses the notation $|w_N|$:

$$\lVert w \rVert_1 = |w_1| + |w_2| + \dots + |w_N|$$

The L2-norm (also known as the 2-norm) is calculated as the square root of the sum of squared vector values:

$$\lVert w \rVert_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_N^2}$$

LASSO and Ridge regression
Coefficients go to zero in different ways.
[Figure: standardized coefficients for Income, Limit, Rating and Student plotted against λ, for ridge regression and for the lasso, with the tuning parameter and the shrinkage penalty marked on the two cost functions (6.5) and (6.7).]
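The different ways in which the coefficients go to zero can be illustrated with a small experiment. The sketch below rests on assumptions that are not in the lecture: the data are synthetic, with only two of five features influencing y, and the lasso is fitted by coordinate descent with soft-thresholding, which is one standard way to minimise the lasso cost function (the lecture does not say which solver it has in mind). The helper names `l1_norm`, `l2_norm`, `soft_threshold`, `lasso_fit` and `ridge_fit` are hypothetical.

```python
import numpy as np


def l1_norm(w):
    """Sum of the absolute vector values."""
    return float(np.sum(np.abs(w)))


def l2_norm(w):
    """Square root of the sum of squared vector values."""
    return float(np.sqrt(np.sum(w ** 2)))


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_fit(features, y, lam, n_sweeps=200):
    """Coordinate descent for RSS + lam * sum(|beta_j|) on centred data."""
    beta = np.zeros(features.shape[1])
    for _ in range(n_sweeps):
        for j in range(beta.size):
            column = features[:, j]
            # Residual with feature j's current contribution added back.
            partial_residual = y - features @ beta + column * beta[j]
            rho = column @ partial_residual
            beta[j] = soft_threshold(rho, lam / 2.0) / (column @ column)
    return beta


def ridge_fit(features, y, lam):
    """Closed-form minimiser of RSS + lam * sum(beta_j ** 2) on centred data."""
    p = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(p), features.T @ y)


w = np.array([3.0, -4.0])
print("L1 norm:", l1_norm(w), " L2 norm:", l2_norm(w))  # 7.0 and 5.0

# Assumed synthetic data: only features 0 and 3 influence y.
rng = np.random.default_rng(1)
raw = rng.normal(size=(100, 5))
y = raw @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(0.0, 0.5, size=100)
features = (raw - raw.mean(axis=0)) / raw.std(axis=0)
y_centred = y - y.mean()

lam = 100.0
lasso_beta = lasso_fit(features, y_centred, lam)
ridge_beta = ridge_fit(features, y_centred, lam)
print("lasso:", np.round(lasso_beta, 3))
print("ridge:", np.round(ridge_beta, 3))
```

In this setup the lasso should set the coefficients of the three irrelevant features to exactly zero, while ridge only shrinks them to small non-zero values. Note that scikit-learn's `Lasso` scales the RSS term by 1/(2n), so its `alpha` is not numerically identical to the λ used in this sketch.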
Regularisation and feature selection
Feature selection means figuring out which of our features are important to include in the model.
Regularisation performs feature selection by shrinking the contribution of features.
For L1-regularisation, this is accomplished by driving some coefficients to zero.
[Figure: standardized lasso coefficients for Income, Limit, Rating and Student plotted against λ.]

Feature selection
- Reducing the number of features can prevent overfitting.
- For some models, fewer features can improve fitting time and/or results.
- Identifying the most important features can improve model interpretability.
Feature selection can also be performed by removing features: remove one feature at a time and measure the predictive results using cross validation. If eliminating the feature improves the cross validation results, or doesn't increase the error much, that feature can be removed.

Regularisation: Python syntax
Import the linear_model module:
    from sklearn import linear_model
Use ridge regression (alpha sets the regularisation strength, the role played by λ above):
    ridge_regression = linear_model.Ridge(alpha=1.0)
    ridge_regression.fit(X, y)
Use Lasso regression:
    lasso_regression = linear_model.Lasso(alpha=1.0)
    lasso_regression.fit(X, y)

Lessons learned
We presented the relationship between model complexity and error, described the three sources of model error and discussed the bias-variance tradeoff.
We presented the LASSO and Ridge regression approaches and discussed the relationship between regularisation and feature selection.
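As a supplement to the feature selection slide, the remove-one-feature-at-a-time procedure can be sketched as follows. This is a minimal illustration under assumptions that are not in the lecture: the model is ordinary least squares fitted with NumPy, the data are synthetic (only features 0 and 3 influence y), and "doesn't increase the error much" is read as a relative increase of at most 5% in the cross validation error. The function names and the `tolerance` parameter are hypothetical.

```python
import numpy as np


def cv_error(features, y, n_folds=5):
    """k-fold cross validation MSE of least squares with an intercept."""
    n = y.size
    design = np.column_stack([np.ones(n), features])
    errors = []
    # The rows are already in random order, so contiguous folds are fine here.
    for held_out in np.array_split(np.arange(n), n_folds):
        kept = np.setdiff1d(np.arange(n), held_out)
        coef, *_ = np.linalg.lstsq(design[kept], y[kept], rcond=None)
        errors.append(np.mean((design[held_out] @ coef - y[held_out]) ** 2))
    return float(np.mean(errors))


def backward_elimination(features, y, tolerance=0.05):
    """Drop features one at a time while the CV error stays within tolerance."""
    selected = list(range(features.shape[1]))
    current = cv_error(features[:, selected], y)
    while len(selected) > 1:
        # Try removing each remaining feature and keep the best candidate.
        trials = [
            (cv_error(features[:, [k for k in selected if k != j]], y), j)
            for j in selected
        ]
        best_error, best_feature = min(trials)
        if best_error > current * (1.0 + tolerance):
            break  # every removal hurts too much, so stop
        selected.remove(best_feature)
        current = best_error
    return selected, current


# Assumed synthetic data: only features 0 and 3 influence y.
rng = np.random.default_rng(2)
features = rng.normal(size=(200, 6))
y = 3.0 * features[:, 0] - 2.0 * features[:, 3] + rng.normal(0.0, 0.5, size=200)

selected, error = backward_elimination(features, y)
print("kept features:", selected, " CV error:", round(error, 4))
```

The two informative features should always survive, because removing either one raises the cross validation error sharply. How many of the uninformative features are dropped depends on the tolerance and on the random data.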