Machine Learning Linear Regression PDF
Document Details
Hashemite University
Summary
These notes cover linear regression in machine learning. They include worked numerical examples, small data sets, and the key formulas for simple and multiple linear regression.
Full Transcript
Machine Learning: Linear Regression

Linear regression
Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, ..., Xp is linear. True regression functions are never linear!

[Figure: a smooth, non-linear regression function f(X) plotted against X.]

Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

Linear regression for the advertising data
Consider the advertising data shown on the next slide. Questions we might ask:
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media contribute to sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media?

Advertising data
[Figure: scatterplots of Sales against the TV, Radio, and Newspaper advertising budgets.]

Simple linear regression using a single predictor X
We assume a model
Y = β0 + β1X + ϵ,
where β0 and β1 are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and ϵ is the error term. Given some estimates β̂0 and β̂1 for the model coefficients, we predict future sales using
ŷ = β̂0 + β̂1x,
where ŷ indicates a prediction of Y on the basis of X = x. The hat symbol denotes an estimated value.

Estimation of the parameters by least squares
The least squares approach chooses β̂0 and β̂1 to minimize the residual sum of squares (RSS). The minimizing values can be shown to be
β̂1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²,   β̂0 = ȳ − β̂1x̄,
where x̄ = (1/n) Σᵢ xᵢ and ȳ = (1/n) Σᵢ yᵢ are the sample means.

Linear Regression (Example)
Given data on one variable X (years of experience), the goal is to predict Y (salary, in $1,000). Questions: when X = 10, what is Y? When X = 25, what is Y?

X (years)    Y (salary, $1,000)
3            30
8            57
9            64
13           72
3            36
6            43
11           59
21           90
1            20

[Figure: scatterplot of Salary against Years, with the fitted line Y = 3.5·X + 23.2.]

For the example data, β̂0 = 23.2 and β̂1 = 3.5, so ŷ = 23.2 + 3.5x. Thus, when x = 10 years, the predicted salary is 23.2 + 3.5 × 10 = 58.2, i.e. about $58,200 per year.

Assessing the Overall Accuracy of the Model
The quality of a linear regression fit is typically assessed using two related quantities:
- The residual standard error (RSE) is an estimate of the standard deviation of ϵ. Roughly speaking, it is the average amount that the response will deviate from the true regression line.
- The R² statistic provides an alternative, relative measure of fit: it is the proportion of variability in Y that can be explained using X. An R² statistic close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance σ² is high, or both.

We compute the residual standard error as
RSE = √(RSS / (n − 2)),   where the residual sum of squares is RSS = Σᵢ (yᵢ − ŷᵢ)².
R-squared, the fraction of variance explained, is
R² = (TSS − RSS) / TSS = 1 − RSS / TSS,   where TSS = Σᵢ (yᵢ − ȳ)² is the total sum of squares.

In the case of the advertising data, the linear regression output shows that the RSE is 3.26. In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients β0 and β1 were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average. Whether or not 3,260 units is an acceptable prediction error depends on the problem context. In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, so the percentage error is 3,260/14,000 ≈ 23%.
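To make the least-squares formulas and the accuracy measures above concrete, here is a minimal NumPy sketch (not part of the original slides) that fits the years-of-experience / salary example and computes the RSE and R². The variable names are illustrative; the coefficients come out close to the slide's rounded values of 3.5 and 23.2.

```python
import numpy as np

x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1], dtype=float)       # years of experience
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20], dtype=float)  # salary in $1,000

x_bar, y_bar = x.mean(), y.mean()

# Least-squares estimates from the formulas above
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x                # fitted values

# Residual sum of squares, residual standard error, and R^2
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y_bar) ** 2)
n = len(x)
rse = np.sqrt(rss / (n - 2))
r2 = 1 - rss / tss

print(f"beta0_hat = {beta0_hat:.2f}, beta1_hat = {beta1_hat:.2f}")  # roughly 23.5 and 3.5
print(f"prediction at x = 10: {beta0_hat + beta1_hat * 10:.1f} (thousand dollars)")
print(f"RSE = {rse:.2f}, R^2 = {r2:.3f}")
```

Note that the slide's intercept of 23.2 corresponds to rounding the slope to 3.5 before computing the intercept; working with the unrounded slope gives an intercept nearer 23.5 and essentially the same prediction of about 58 at x = 10.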
Multiple Linear Regression
Here our model is
Y = β0 + β1X1 + β2X2 + ··· + βpXp + ϵ.
We interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed. In the advertising example, the model becomes
sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ϵ.

Multiple Linear Regression with two predictors X1, X2
For example, X1 = years of experience, X2 = age, and Y = salary. The equation is
Y = β0 + β1x1 + β2x2.
The coefficient estimates are more complicated to compute by hand, so we will not worry about the actual calculation here; in practice we rely on software packages such as Excel.

Interpreting regression coefficients
The ideal scenario is when the predictors are uncorrelated — a balanced design:
- Each coefficient can be estimated and tested separately.
- Interpretations such as "a unit change in Xj is associated with a βj change in Y, while all the other variables stay fixed" are possible.
Correlations amongst predictors cause problems:
- The variance of all coefficients tends to increase, sometimes dramatically.
- Interpretations become hazardous — when Xj changes, everything else changes.
Claims of causality should be avoided for observational data.

The woes of (interpreting) regression coefficients
"Data Analysis and Regression", Mosteller and Tukey (1977): a regression coefficient βj estimates the expected change in Y per unit change in Xj, with all other predictors held fixed. But predictors usually change together! Example: Y = total amount of change in your pocket; X1 = number of coins; X2 = number of pennies, nickels and dimes. By itself, the regression coefficient of Y on X2 will be > 0. But how about with X1 in the model?

Estimation and Prediction for Multiple Regression
Given estimates β̂0, β̂1, ..., β̂p, we can make predictions using the formula
ŷ = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂pxp.
We estimate β0, β1, ..., βp as the values that minimize the sum of squared residuals
RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β̂0 − β̂1xᵢ1 − ··· − β̂pxᵢp)².
This is done using standard statistical software. The values β̂0, β̂1, ..., β̂p that minimize the RSS are the multiple least squares regression coefficient estimates.

Some important questions
1. Is at least one of the predictors X1, X2, ..., Xp useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Is at least one predictor useful?
Recall that in the simple linear regression setting, in order to determine whether there is a relationship between the response and the predictor, we can simply check whether β1 = 0. In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e. whether β1 = β2 = ··· = βp = 0. For this first question we can use the F-statistic,
F = ((TSS − RSS) / p) / (RSS / (n − p − 1)).
When there is no relationship between the response and the predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if at least one predictor is related to the response, we expect F to be greater than 1.

For the advertising data, the multiple regression output gives:

Quantity                   Value
Residual Standard Error    1.69
R²                         0.897
F-statistic                570

In this example the F-statistic is 570. Since this is far larger than 1, the large F-statistic suggests that at least one of the advertising media must be related to sales.
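The slides describe the multiple least-squares fit and the F-statistic without showing a computation. As a hedged illustration, here is a minimal NumPy sketch on synthetic data (generated here purely for illustration — it is not the advertising data set) that minimizes the RSS via a design matrix and evaluates the overall F-statistic defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                    # observations and predictors
X = rng.normal(size=(n, p))                      # synthetic predictors
beta_true = np.array([5.0, 3.0, 0.0])            # third predictor has no true effect
y = 10.0 + X @ beta_true + rng.normal(scale=2.0, size=n)

# Design matrix with a leading column of ones for the intercept
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)    # RSS-minimizing coefficients

y_hat = X1 @ beta_hat
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for H0: beta_1 = ... = beta_p = 0
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))

print("coefficient estimates:", np.round(beta_hat, 2))
print("F-statistic:", round(f_stat, 1))
```

Because two of the synthetic predictors genuinely affect the response, the F-statistic here comes out far above 1, mirroring the reasoning used for the advertising data.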
Deciding on the important variables
The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size. However, we often can't examine all possible models, since there are 2^p of them; for example, when p = 40 there are over a trillion models! Instead we need an automated approach that searches through a subset of them. We discuss two commonly used approaches next.

Forward selection
- Begin with the null model — a model that contains an intercept but no predictors.
- Fit p simple linear regressions and add to the null model the variable that results in the lowest RSS.
- Add to that model the variable that results in the lowest RSS amongst all two-variable models.
- Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold.

Backward selection
- Start with all variables in the model.
- Remove the variable with the largest p-value — that is, the variable that is the least statistically significant.
- The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.
- Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a significant p-value defined by some significance threshold.

Other Considerations in the Regression Model: Qualitative Predictors
Some predictors are not quantitative but qualitative, taking a discrete set of values. These are also called categorical predictors or factor variables. See for example the scatterplot matrix of the credit card data on the next slide. In addition to the 7 quantitative variables shown, there are four qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA), or Asian).

Credit Card Data
[Figure: scatterplot matrix of the quantitative credit card variables — Balance, Age, Cards, Education, Income, Limit, and Rating.]

Qualitative Predictors — continued
Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new dummy variable
xᵢ = 1 if the ith person is female, 0 if the ith person is male.
The resulting model is
yᵢ = β0 + β1xᵢ + ϵᵢ,
which gives β0 + β1 + ϵᵢ for females and β0 + ϵᵢ for males.

Qualitative predictors with more than two levels
With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first could be
xᵢ1 = 1 if the ith person is Asian, 0 otherwise,
and the second could be
xᵢ2 = 1 if the ith person is Caucasian, 0 otherwise.
Then both of these variables can be used in the regression equation, in order to obtain the model
yᵢ = β0 + β1xᵢ1 + β2xᵢ2 + ϵᵢ,
which gives β0 + β1 + ϵᵢ for Asians, β0 + β2 + ϵᵢ for Caucasians, and β0 + ϵᵢ for African Americans. There will always be one fewer dummy variable than the number of levels. The level with no dummy variable — African American in this example — is known as the baseline. (A short code sketch of this encoding appears after the references.)

References
Slides created by Trevor Hastie and Robert Tibshirani based on their textbook "An Introduction to Statistical Learning", and modified by Dr. Esra'a Alshdaifat.
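As a small appendix-style sketch (not part of the original slides), the following NumPy snippet illustrates the dummy-variable encoding described above for a three-level qualitative predictor. The ethnicity labels and balance values below are made up purely for illustration.

```python
import numpy as np

ethnicity = ["Asian", "Caucasian", "African American",
             "Asian", "African American", "Caucasian"]
balance   = np.array([512.0, 580.0, 531.0, 505.0, 522.0, 570.0])  # made-up response values

# One fewer dummy than the number of levels; "African American" is the baseline here.
x_asian     = np.array([1.0 if e == "Asian" else 0.0 for e in ethnicity])
x_caucasian = np.array([1.0 if e == "Caucasian" else 0.0 for e in ethnicity])

# Fit y = beta0 + beta1 * x_asian + beta2 * x_caucasian + error by least squares.
X = np.column_stack([np.ones(len(balance)), x_asian, x_caucasian])
beta_hat, *_ = np.linalg.lstsq(X, balance, rcond=None)

# beta0 is the baseline (African American) group mean; beta1 and beta2 are the
# differences of the Asian and Caucasian group means from that baseline.
print("beta0, beta1, beta2 =", np.round(beta_hat, 2))
```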