Summary

This presentation covers linear regression, a fundamental statistical method for modeling the relationship between a dependent variable and one or more independent variables. It delves into the underlying principles of the method, model interpretation, and fitting techniques, as well as the identification of potential problems in linear regression modeling, such as issues with the error terms or collinearity. It also covers the concept of interaction effects.

Full Transcript


Linear Regression
V. Sridhar

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani. (DAV, 11/17/2024)

Taxonomy of Data Models

[Figure slide: a tree of data models. Models split into supervised and unsupervised learning; supervised learning splits into parametric and non-parametric methods, with linear regression, binary regression and non-linear regression under the parametric branch; unsupervised learning includes clustering.]

What is Statistical Learning?

Suppose we observe Yi and Xi = (Xi1, ..., Xip) for i = 1, ..., n. We believe that there is a relationship between Y and at least one of the X's. We can model the relationship as

    Yi = f(Xi) + εi

where f is an unknown function and ε is a random error with mean zero.

Problem

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.

It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

Trends and Errors

[Figure slide; no recoverable text.]

Parametric Methods

Parametric methods reduce the problem of estimating f down to one of estimating a set of parameters. They involve a two-step, model-based approach.

STEP 1: Make some assumption about the functional form of f, i.e. come up with a model. The most common example is a linear model, i.e.
    f(Xi) = β0 + β1 Xi1 + β2 Xi2 + ... + βp Xip

Parametric Methods (cont.)

STEP 2: Use the training data to fit the model, i.e. estimate f, or equivalently the unknown parameters β0, β1, β2, ..., βp. The most common approach for estimating the parameters in a linear model is ordinary least squares (OLS). However, other functional forms are possible.

Outline

- The Linear Regression Model
- Least Squares Fit
- Measures of Fit
- Inference in Regression
- Other Considerations in the Regression Model: Qualitative Predictors, Interaction Terms
- Potential Fit Problems

The Linear Regression Model

    Yi = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε

The parameters in the linear regression model are very easy to interpret:
- β0 is the intercept, i.e. the average value of Y when all the X's are zero.
- βj is the slope for the jth variable Xj: the average increase in Y when Xj is increased by one unit and all other X's are held constant.

Least Squares Fit

We estimate the parameters using least squares, i.e. we minimize the mean squared error (MSE):

    MSE = (1/n) Σ (Yi − Ŷi)²
        = (1/n) Σ (Yi − β̂0 − β̂1 X1 − ... − β̂p Xp)²

where Yi is the actual value and Ŷi is the predicted value.

Relationship between the Population and Least Squares Lines

    Population line:     Yi = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε
    Least squares line:  Ŷi = β̂0 + β̂1 X1 + β̂2 X2 + ... + β̂p Xp

We would like to know β0 through βp, i.e. the population line. Instead we know β̂0 through β̂p, i.e. the least squares line. Hence we use β̂0 through β̂p as guesses for β0 through βp, and Ŷi as a guess for Yi. The guesses will not be perfect, just as the sample mean X̄ is not a perfect guess for the population mean μ.

Gauss-Markov Theorem

On the basis of certain assumptions, the OLS method gives Best Linear Unbiased Estimators (BLUE):
(1) The estimators are linear functions of the dependent variable Y.
(2) The estimators are unbiased: averaged over repeated applications of the method, they equal their true values.
(3) In the class of linear unbiased estimators, OLS estimators have minimum variance; i.e., they are efficient, or the "best" estimators.

Measures of Fit: R²

Some of the variation in Y can be explained by variation in the X's and some cannot. R², the coefficient of determination, tells you the fraction of the variance in Y that can be explained by the X's (compare the ANOVA statistic F = MSG/MSE):

    R² = ESS/TSS = (TSS − RSS)/TSS = 1 − RSS/TSS

where
- Total Sum of Squares, TSS = Σ (Yi − Ȳ)², is the total variance inherent in the response Y;
- Residual Sum of Squares, RSS = Σ ei², is the variance not explained by the model;
- Explained Sum of Squares, ESS = Σ (Ŷi − Ȳ)², is the variance explained by the model.

R² is always between 0 and 1: zero means no variance has been explained, one means it has all been explained (a perfect fit to the data).

Inference in Regression

[Figure slide: a scatter plot with the estimated (least squares) line from the sample and the true (population) line, which is unobserved.]

The regression line from the sample is not the regression line from the population. What we want to do:
- Assess how well the line describes the plot.
- Guess the slope of the population line.
- Guess what value Y would take for a given X value.

Some Relevant Questions

1. Is βj = 0 or not? We can use a hypothesis test to answer this question. If we can't be sure that βj ≠ 0, then there is no point in using Xj as one of our predictors.
2. Can we be sure that at least one of our X variables is a useful predictor, i.e. can we rule out that β1 = β2 = ... = βp = 0?
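The least-squares fit, the TSS/RSS/ESS decomposition, and the coefficient t-statistic discussed above can be sketched in a few lines of numpy. The data below are synthetic (an invented intercept of 7.0 and slope of 0.05, loosely echoing the sales-on-TV example); only the OLS mechanics follow the slides.

```python
import numpy as np

# Synthetic data: sales driven by a TV budget plus noise.
# The true coefficients (7.0, 0.05) are made up for illustration.
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, n)

# Design matrix with an intercept column; OLS minimizes the RSS.
X = np.column_stack([np.ones(n), tv])
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Decompose the variation in y: R^2 = 1 - RSS/TSS.
y_hat = X @ beta_hat
rss = np.sum((sales - y_hat) ** 2)
tss = np.sum((sales - sales.mean()) ** 2)
r2 = 1 - rss / tss

# t-statistic for each coefficient: beta_hat_j / SE(beta_hat_j),
# with the standard errors taken from sigma^2 (X'X)^{-1}.
k = X.shape[1]
sigma2 = rss / (n - k)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta_hat / se

print(beta_hat)        # close to the true (7.0, 0.05)
print(r2, t_stats[1])  # a large t suggests the slope is nonzero
```

With a slope this far from zero relative to its standard error, the t-statistic is large and the corresponding p-value is essentially zero, mirroring the TV coefficient on the slide.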
Is βj = 0, i.e. is Xj an important variable?

We use a hypothesis test to answer this question:

    H0: βj = 0   vs   H1: βj ≠ 0

Calculate

    t = β̂j / SE(β̂j)

If t is large (equivalently, if the p-value is small) we can be confident that βj ≠ 0 and that there is a relationship.

Regression coefficients (simple regression of sales on TV):

    Term      Coefficient  Std Err  t-value  p-value
    Constant  7.0326       0.4578   15.3603  0.0000
    TV        0.0475       0.0027   17.6676  0.0000

Here β̂1 is 17.67 standard errors from 0, so the p-value is essentially zero.

Hypothesis Testing of R²

Testing H0: R² = 0 against H1: R² ≠ 0 is equivalent to testing the hypothesis that all the slope coefficients are 0. With n the sample size (number of observations) and k the number of estimated coefficients (including the intercept), calculate

    F = [ESS/(k − 1)] / [RSS/(n − k)] = [R²/(k − 1)] / [(1 − R²)/(n − k)]

and use the F table to obtain the critical F value with k − 1 degrees of freedom in the numerator and n − k degrees of freedom in the denominator for a given level of significance. If the calculated F is greater than the critical F value, reject H0.

ANOVA Table:

    Source       df   SS         MS         F         p-value
    Explained    2    4860.2347  2430.1174  859.6177  0.0000
    Unexplained  197  556.9140   2.8270

When to Reject the Null Hypothesis?

A large F-statistic suggests that at least one of the variables must be related to the dependent variable.
- How large should F be to reject the null hypothesis? It depends on n (the sample size) and p (the number of independent variables): a modest F can be convincing when n is large and p is small, while a much larger F is needed when n is small relative to p.
- Combine F, R², and the p-values of the individual variables to judge the model fit.

Model Fit and Significance of Independent Variables

- High R², many independent variables significant (small p-values for the coefficients): good model. ✓
- Low R², many independent variables significant: questionable. ?
- High R², few independent variables significant: questionable. ?
- Low R², few independent variables significant: poor model. ✗

Multiple Regression vs. Multiple Single-Variable Linear Regressions

What about testing the model with the other variables included?

Testing Individual Variables

Is there a (statistically detectable) linear relationship between newspaper advertising and sales after all the other variables have been accounted for?

Multiple regression coefficients:

    Term       Coefficient  Std Err  t-value  p-value
    Constant   2.9389       0.3119    9.4223  0.0000
    TV         0.0458       0.0014   32.8086  0.0000
    Radio      0.1885       0.0086   21.8935  0.0000
    Newspaper  -0.0010      0.0059   -0.1767  0.8599

No: the p-value for Newspaper is large.

Simple regression of sales on newspaper alone:

    Term       Coefficient  Std Err  t-value  p-value
    Constant   12.3514      0.6214   19.8761  0.0000
    Newspaper  0.0547       0.0166    3.2996  0.0011

The p-value is small in the simple regression. Almost all the explaining that newspaper advertising could do in simple regression has already been done by TV and radio in the multiple regression!

Difference between Individual and Multivariate Regression

Newspaper and radio ad spends are highly correlated:
- Whenever ads run on radio, they are also placed in newspapers.
- Sales come directly through radio ads only; hence newspaper ads show a non-significant coefficient in the multivariate regression.

What should the firm do?

Adjusted R²

R² measures the proportion of the variation in the regressand explained by the regressors; as the number of regressors increases, R² increases. Adjusted R² takes degrees of freedom into account: as k increases, adjusted R² is penalized, balancing the number of independent variables included in the model against the fit of the sparser model.

    Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k),   Adjusted R² ≤ R²

Deciding on Important Variables

Forward selection:
- Begin with the null model.
- Add the variable whose inclusion results in the lowest RSS; repeating this step yields a 2-variable model, and so on.
- Keep iterating until a stopping rule is met. What could that rule be?
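The greedy forward-selection loop just described can be sketched with numpy. Everything here is illustrative and not from the slides: the data are synthetic, and a fixed number of steps stands in for a real stopping rule (such as adjusted R² no longer improving).

```python
import numpy as np

def rss_of(X, y):
    """RSS of an OLS fit of y on X (X already includes the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def forward_select(features, y, n_steps):
    """Start from the null (intercept-only) model and, at each step,
    add the candidate column whose inclusion gives the lowest RSS."""
    n = len(y)
    chosen = []
    remaining = list(features)
    for _ in range(n_steps):
        X_cols = [np.ones(n)] + [features[c] for c in chosen]
        best = min(remaining,
                   key=lambda c: rss_of(np.column_stack(X_cols + [features[c]]), y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(1)
n = 200
feats = {"tv": rng.uniform(0, 300, n),
         "radio": rng.uniform(0, 50, n),
         "newspaper": rng.uniform(0, 100, n)}
# Sales depend on tv and radio only; newspaper is pure noise here.
y = 3.0 + 0.045 * feats["tv"] + 0.19 * feats["radio"] + rng.normal(0, 1.0, n)

print(forward_select(feats, y, 2))  # ['tv', 'radio'] with this seed
```

Because newspaper carries no signal of its own, the procedure picks tv and radio first, which is the same pattern the Advertising tables above display.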
Backward selection:
- Start with all variables in the model.
- Remove the one with the largest p-value.
- Iterate, and stop according to a rule.

Mixed selection combines the two approaches.

Linear Regression with Categorical Variables

Qualitative Predictors

How do you put "men" and "women" (category variables) into a regression equation? Code them as indicator variables (dummy variables). For example, we can code males = 0 and females = 1. Then β0 can be interpreted as the average credit card balance among males, β0 + β1 as the average credit card balance among females, and β1 as the average difference in credit card balance between females and males.

Rules for Dummy Variables

First, if an intercept is included in the model and a qualitative variable has m categories, then introduce only (m − 1) dummy variables.

Second, the category that gets the value 0 is called the reference, benchmark or comparison category; all comparisons are made in relation to the reference category.

If the sample size is relatively small, do not introduce too many dummy variables: each dummy coefficient costs one degree of freedom.

Mix of Quantitative and Qualitative Variables

Interpretation

Suppose we want to include income and gender, with two genders (male and female). Let

    Genderi = 0 if male, 1 if female

Then the regression equation is

    Yi ≈ β0 + β1 Incomei + β2 Genderi
       = β0 + β1 Incomei        if male
       = β0 + β1 Incomei + β2   if female

β2 is the average extra balance each month that females have for a given income level; males are the "baseline".

Regression coefficients:

    Term           Coefficient  Std Err  t-value  p-value
    Constant       233.7663     39.5322   5.9133  0.0000
    Income         0.0061       0.0006   10.4372  0.0000
    Gender_Female  24.3108      40.8470   0.5952  0.5521

The slopes of the two lines are the same; only the intercepts differ.

Another Example

Consider dummies for gender, race and union membership: D2i = 1 if female, 0 if male; D3i = 1 if nonwhite, 0 if white; and D4i = 1 if union member, 0 if non-union member. What is the reference group here? (It is white male non-union members, the category coded 0 on every dummy.)

Interpretation

[Figure slide; no recoverable text.]

Other Coding Schemes

There are different ways to code categorical variables. With two genders (male and female), let

    Genderi = −1 if male, +1 if female

Then the regression equation is

    Yi ≈ β0 + β1 Incomei + β2 Genderi
       = β0 + β1 Incomei − β2   if male
       = β0 + β1 Incomei + β2   if female

β2 is the average amount that females are above the average, and equally the average amount that males are below the average, for any given income level.

Interaction Effect

Interaction occurs when the effect on Y of increasing X1 depends on another variable X2. Example: maybe the effect on salary (Y) of increasing position (X1) depends on gender (X2); for example, male salaries may go up faster (or slower) than female salaries as they get promoted.

Advertising example: TV and radio advertising both increase sales. Perhaps spending money on both of them may increase sales more than spending the same amount on one alone?
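The idea can be sketched numerically: fit a model that includes the product of the two budgets as an extra regressor and watch the marginal effect of TV change with the radio budget. The data and the true coefficients (0.02, 0.03, 0.001) below are invented for illustration; only the form of the model follows the slides.

```python
import numpy as np

# Synthetic sales with a TV x Radio interaction baked in.
rng = np.random.default_rng(2)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
sales = (6.0 + 0.02 * tv + 0.03 * radio
         + 0.001 * tv * radio + rng.normal(0, 0.5, n))

# Fit the interaction model: the product tv*radio is simply an
# extra column alongside the two main effects.
X = np.column_stack([np.ones(n), tv, radio, tv * radio])
b0, b1, b2, b3 = np.linalg.lstsq(X, sales, rcond=None)[0]

# Marginal effect of one extra unit of TV is b1 + b3 * Radio,
# so it grows with the radio budget -- the interaction at work.
effect_at_radio_0 = b1 + b3 * 0
effect_at_radio_50 = b1 + b3 * 50
print(effect_at_radio_0, effect_at_radio_50)
```

The recovered b3 is close to the planted 0.001, and the estimated per-unit TV effect is visibly larger when the radio budget is high, which is exactly the rearrangement shown on the next slide.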
Interaction in Advertising

    Sales = β0 + β1·TV + β2·Radio + β3·TV·Radio

The product TV·Radio is the interaction term. Rearranging,

    Sales = β0 + (β1 + β3·Radio)·TV + β2·Radio

so spending $1 extra on TV increases average sales by 0.0191 + 0.0011·Radio. Equivalently,

    Sales = β0 + β1·TV + (β2 + β3·TV)·Radio

so spending $1 extra on Radio increases average sales by 0.0289 + 0.0011·TV.

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

Parameter Estimates:

    Term       Estimate   Std Error  t Ratio  Prob>|t|
    Intercept  6.7502202  0.247871   27.23
    [the remaining rows of the table are cut off in the transcript]
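Finally, the two gender codings described earlier (0/1 with a reference category, and −1/+1 around the overall average) are just different parameterizations of the same model: the design matrices span the same column space, so the fitted values agree and only the coefficient interpretations change. A quick numerical check on synthetic data (all coefficients invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
income = rng.uniform(20, 100, n)
female01 = rng.integers(0, 2, n)  # 0 = male, 1 = female
balance = 230.0 + 6.0 * income + 24.0 * female01 + rng.normal(0, 10, n)

def fitted(gender_col):
    """OLS fitted values for balance on income and a gender column."""
    X = np.column_stack([np.ones(n), income, gender_col])
    beta, *_ = np.linalg.lstsq(X, balance, rcond=None)
    return X @ beta

# 0/1 coding (male baseline) vs -1/+1 coding (deviation coding):
# since (2g - 1) is a linear combination of g and the intercept,
# both designs span the same space and give identical predictions.
yhat_01 = fitted(female01)
yhat_pm = fitted(2 * female01 - 1)  # maps 0 -> -1, 1 -> +1
print(np.allclose(yhat_01, yhat_pm))  # True
```

Only the meaning of β2 changes: under 0/1 coding it is the female-minus-male gap, while under −1/+1 coding it is half that gap, measured as a deviation from the overall average.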
