Simple Regression Model
Ani Katchova
Summary
This presentation provides an overview of the simple regression model. It covers key concepts such as the regression model, the estimated equation, and residuals. It then presents the population regression function, the derivation of the OLS estimates, variations and R-squared, log transformations (log-log, log-linear, and linear-log forms), the Gauss-Markov assumptions, and the unbiasedness and variance of the OLS estimators.
Full Transcript
Simple Regression Model
Ani Katchova
© 2020 by Ani Katchova. All rights reserved.

Outline
• Simple regression terminology
• Examples and interpretation of coefficients
• Population regression function
• Derivation of OLS estimates
• Examples of simple regression – interpretation of results
• Variations, R-squared
• Log transformations – log-log, log-linear, and linear-log forms
• Gauss-Markov assumptions
• Unbiasedness of OLS estimators
• Variance of OLS estimators

Terminology
Regression model: $y = \beta_0 + \beta_1 x + u$
y is the dependent variable, x is the independent variable (one independent variable for a simple regression), u is the error, and $\beta_0$ and $\beta_1$ are parameters.
Estimated equation: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
$\hat{y}$ is the predicted value; $\hat{\beta}_0$ and $\hat{\beta}_1$ are coefficients.
Residual: $\hat{u} = y - \hat{y}$, the actual value minus the predicted value of the dependent variable.

Population                Sample
Parameter $\beta$         Coefficient $\hat{\beta}$
Error $u$                 Residual $\hat{u}$

Simple regression model example
Simple regression: hourly wage depends on years of experience.

Hourly wage (y, $)   Years of experience (x)   Predicted value $\hat{y} = 20 + 0.5x$   Residual $\hat{u} = y - \hat{y}$
20                   1                         20 + 0.5(1) = 20.5                      20 - 20.5 = -0.5
21                   2                         20 + 0.5(2) = 21                        21 - 21 = 0
21                   1                         20 + 0.5(1) = 20.5                      21 - 20.5 = 0.5
22                   3                         20 + 0.5(3) = 21.5                      22 - 21.5 = 0.5

[Figure: simple regression, actual and predicted values. The plot of hourly wage against years of experience shows the regression line (slope), the actual values, the predicted values, and the residuals.]
[Figure: simple regression, actual values, predicted values, and residuals. The regression line fits as well as possible through the data points.]

Interpretation of coefficients
$\hat{\beta}_1 = \frac{\Delta y}{\Delta x} = \frac{\text{change in } y}{\text{change in } x}$
• The coefficient $\hat{\beta}_1$ measures by how much the dependent variable changes when the independent variable changes by one unit.
• $\hat{\beta}_1$ is also called the slope in the simple linear regression.
• A derivative of a function is another function showing the slope.
• The formula above is correct if $\Delta u = 0$ when x changes, which means all other factors are held fixed.
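To make the arithmetic in the table above concrete, here is a minimal Python/NumPy sketch (an added illustration, not part of the original slides, which use Stata) that reproduces the predicted values and residuals from the estimated equation $\hat{y} = 20 + 0.5x$:

```python
# Added illustration: predicted values and residuals for the hourly-wage example,
# using the estimated equation yhat = 20 + 0.5*x from the slide.
import numpy as np

wage = np.array([20.0, 21.0, 21.0, 22.0])    # actual hourly wage (y)
experience = np.array([1.0, 2.0, 1.0, 3.0])  # years of experience (x)

b0_hat, b1_hat = 20.0, 0.5                   # coefficients from the example
wage_hat = b0_hat + b1_hat * experience      # predicted values
residuals = wage - wage_hat                  # residual = actual minus predicted

print(wage_hat)    # [20.5 21.  20.5 21.5]
print(residuals)   # [-0.5  0.   0.5  0.5]
```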
Population regression function
$E(y|x) = E(\beta_0 + \beta_1 x + u \mid x) = \beta_0 + \beta_1 x + E(u|x) = \beta_0 + \beta_1 x$ if $E(u|x) = 0$ (this assumption is called the zero conditional mean).
For the population, the average value of the dependent variable can be expressed as a linear function of the independent variable.

• The population regression function shows the relationship between y and x for the population.
[Figure: population regression function.]
• For individuals with a particular x, the average value of y is $E(y|x) = \beta_0 + \beta_1 x$.
• Note that $x_1$, $x_2$, $x_3$ here refer to values of $x_i$ and not to different variables.

Derivation of the OLS estimates
• For a regression model: $y = \beta_0 + \beta_1 x + u$
• We need to estimate the regression equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and find the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ by looking at the residuals:
$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$
• Obtain a random sample of data with n observations $(x_i, y_i)$, where $i = 1, \dots, n$.
• Minimize the sum of squared residuals:
$\min \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
We obtain the OLS coefficients:
$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
OLS is Ordinary Least Squares, based on minimizing the sum of squared residuals.

OLS properties
• $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$: the sample averages of the dependent and independent variables are on the regression line.
• $\sum_{i=1}^{n} \hat{u}_i = 0$: the residuals sum to zero (note that we minimized the sum of squared residuals).
• $\sum_{i=1}^{n} x_i \hat{u}_i = 0$: the covariance between the independent variable and the residuals is zero.
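The OLS formulas and the three properties above can be checked numerically. The sketch below is an added Python/NumPy illustration on simulated data (the chosen parameters $\beta_0 = 2$ and $\beta_1 = 0.5$ are assumptions for the example, not values from the slides):

```python
# Added illustration: compute the OLS coefficients with the formulas above
# and check the three OLS properties on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10, 3, n)           # independent variable (simulated)
u = rng.normal(0, 2, n)            # error term
y = 2.0 + 0.5 * x + u              # assumed population model: beta0 = 2, beta1 = 0.5

b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()
u_hat = y - (b0_hat + b1_hat * x)  # residuals

print(b0_hat, b1_hat)                           # close to the assumed 2.0 and 0.5
print(u_hat.sum())                              # ~0: residuals sum to zero
print((x * u_hat).sum())                        # ~0: residuals uncorrelated with x
print(y.mean() - (b0_hat + b1_hat * x.mean()))  # ~0: sample means lie on the line
```

The residual sums are zero up to floating-point error because $\hat{\beta}_0$ and $\hat{\beta}_1$ come directly from the minimization conditions.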
Simple regression example: CEO's salary
Simple regression model explaining how return on equity (roe) affects CEO's salary.
Regression model: $salary = \beta_0 + \beta_1 \, roe + u$

Estimated equation and interpretation
• Estimated equation: $\widehat{salary} = 963.19 + 18.50 \, roe$
• Interpretation: a one-unit increase in return on equity (one percentage point) is predicted to increase CEO salary by about 18.5 thousand dollars.

Stata output for simple regression (. regress salary roe)
Number of obs = 209, F(1, 207) = 2.77, Prob > F = 0.0978, R-squared = 0.0132, Adj R-squared = 0.0084, Root MSE = 1366.6
Model SS = 5166419.04 (df = 1), Residual SS = 386566563 (df = 207), Total SS = 391732982 (df = 208)

salary     Coef.      Std. Err.   t      P>|t|   [95% Conf. Interval]
roe        18.50119   11.12325    1.66   0.098   -3.428196   40.43057
_cons      963.1913   213.2403    4.52   0.000   542.7902    1383.592

Simple regression results in a table
                 (1)
VARIABLES        salary
roe              18.50*
                 (11.12)
Constant         963.2***
                 (213.2)
Observations     209
R-squared        0.013

Regression line for the sample vs. population regression function for the population
[Figure: estimated regression: 1990 salary (thousands $) plotted against return on equity (88-90 avg), with fitted values.]
[Figure: actual and predicted values.]
[Figure: actual values, predicted values, and residuals.]

Actual values, predicted values, and residuals
roe     salary   $\hat{y}$   $\hat{u}$
14.1    1095     1224        -129
10.9    1001     1165        -164
23.5    1122     1398        -276
5.9     578      1072        -494
13.8    1368     1219        149
20      1145     1333        -188
16.4    1078     1267        -189
16.3    1094     1265        -171
10.5    1237     1157        80
26.3    833      1450        -617
The mean salary is 1,281 ($1,281,000). The mean predicted salary is also 1,281. The mean of the residuals is zero.

Simple regression example: wage
Simple regression model explaining how education affects wages for workers.
Regression model: $wage = \beta_0 + \beta_1 \, educ + u$

Estimated equation and interpretation
• Estimated equation: $\widehat{wage} = -0.90 + 0.54 \, educ$
• Interpretation: each additional year of education is predicted to increase the hourly wage by about $0.54.

Stata output for simple regression (. reg wage educ)
Number of obs = 526, F(1, 524) = 103.36, Prob > F = 0.0000, R-squared = 0.1648, Adj R-squared = 0.1632, Root MSE = 3.3784
Model SS = 1179.73205 (df = 1), Residual SS = 5980.68226 (df = 524), Total SS = 7160.41431 (df = 525)

wage       Coef.       Std. Err.   t       P>|t|   [95% Conf. Interval]
educ       .5413593    .053248     10.17   0.000   .4367534    .6459651
_cons      -.9048517   .6849678    -1.32   0.187   -2.250472   .4407687

Variations
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$: total sum of squares (total variation in y)
$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$: explained sum of squares (variation explained by the regression)
$SSR = \sum_{i=1}^{n} \hat{u}_i^2$: residual sum of squares (unexplained variation)
$SST = SSE + SSR$
[Figure: variations.]

Goodness of fit measure R-squared
• $R^2 = SSE/SST = 1 - SSR/SST$
• R-squared is the explained sum of squares divided by the total sum of squares.
• R-squared is a goodness-of-fit measure. It measures the proportion of the total variation that is explained by the regression.
• An R-squared of 0.7 is interpreted as 70% of the variation being explained by the regression, with the rest due to error.
• An R-squared greater than 0.25 is considered a good fit.

R-squared calculated
From the wage regression output above:
R-squared = SS Model / SS Total = 1179.73 / 7160.41 = 0.1648
About 16% of the variation in wage is explained by the regression and the rest is due to error. This is not a very good fit.
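As an added illustration of the decomposition, the following Python/NumPy sketch (simulated data, not the wage data used in the slides) verifies that SST = SSE + SSR and that the two R-squared formulas agree:

```python
# Added illustration: the decomposition SST = SSE + SSR and the two
# equivalent R-squared formulas, on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 526
x = rng.normal(12, 3, n)                      # e.g., years of education (simulated)
y = -0.9 + 0.54 * x + rng.normal(0, 3.4, n)   # simulated wage, roughly like the example

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum(u_hat ** 2)               # residual sum of squares

print(np.isclose(sst, sse + ssr))      # True: SST = SSE + SSR
print(sse / sst, 1 - ssr / sst)        # the two R-squared formulas agree
```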
Log transformation (logged variables)
• Sometimes variables (y or x) are expressed as logs, log(y) or log(x).
• With logs, the interpretation is in percentages/elasticities.
• Variables such as age and education that are measured in units such as years should not be logged.
• Variables measured in percentage points (e.g., interest rates) should not be logged.
• Logs cannot be used if variables have zero or negative values.
• Taking logs often reduces problems with large values or outliers.
• Taking logs helps with homoskedasticity and normality.

Log-log form
• Linear regression model: $y = \beta_0 + \beta_1 x + u$
• Log-log form: $\log(y) = \beta_0 + \beta_1 \log(x) + u$
$\hat{\beta}_1 = \frac{\Delta \log(y)}{\Delta \log(x)} = \frac{\Delta y / y}{\Delta x / x} = \frac{\text{percentage change in } y}{\text{percentage change in } x}$
• The dependent variable changes by $\hat{\beta}_1$ percent when the independent variable changes by one percent.

Log-linear form (also called semi-log)
• Linear regression model: $y = \beta_0 + \beta_1 x + u$
• Log-linear form: $\log(y) = \beta_0 + \beta_1 x + u$
$\hat{\beta}_1 = \frac{\Delta \log(y)}{\Delta x} = \frac{\Delta y / y}{\Delta x} = \frac{\text{percentage change in } y}{\text{change in } x}$
• The dependent variable changes by $\hat{\beta}_1 \times 100$ percent when the independent variable changes by one unit.

Linear-log form
• Linear regression model: $y = \beta_0 + \beta_1 x + u$
• Linear-log form: $y = \beta_0 + \beta_1 \log(x) + u$
• Instead of the independent variable, use the log of the independent variable.
$\hat{\beta}_1 = \frac{\Delta y}{\Delta \log(x)} = \frac{\Delta y}{\Delta x / x} = \frac{\text{change in } y}{\text{percentage change in } x}$
• The dependent variable changes by $\hat{\beta}_1 / 100$ units when the independent variable changes by one percent.

Example of data with logs
wage    lwage   educ
3.10    1.13    11
3.24    1.18    12
3.00    1.10    11
6.00    1.79    8
5.30    1.67    12
8.75    2.17    16
11.25   2.42    18
5.00    1.61    12
3.60    1.28    12
18.18   2.90    17

[Figure: linear vs. log-linear form. Linear form: wage on education. Log-linear form: log wage on education. Fitted values shown.]

Linear vs log-linear form
                 (1)        (2)
VARIABLES        wage       lwage
educ             0.541***   0.0827***
                 (0.0532)   (0.00757)
Constant         -0.905     0.584***
                 (0.685)    (0.0973)
Observations     526        526
R-squared        0.165      0.186
Linear form: wage increases by $0.54 for each additional year of education.
Log-linear form: wage increases by about 8.3% (= 0.0827*100) for each additional year of education.

Example of data with logs
Salary (thousand dollars)   lsalary   Sales (million dollars)   lsales
1095                        7.0       27595                     10.2
1001                        6.9       9958                      9.2
1122                        7.0       6126                      8.7
578                         6.4       16246                     9.7
1368                        7.2       21783                     10.0
1145                        7.0       6021                      8.7
1078                        7.0       2267                      7.7
1094                        7.0       2967                      8.0
1237                        7.1       4570                      8.4
833                         6.7       2830                      7.9
Note that one unit is a thousand dollars for salary and a million dollars for sales.

[Figure: linear vs. log-log form. Linear form: salary on sales. Log-log form: log salary on log sales. Fitted values shown.]
[Figure: log-linear vs. linear-log form. Log-linear form: log salary on sales. Linear-log form: salary on log sales. Fitted values shown.]

Interpretation of coefficients
                 Linear       Log-log     Log-linear     Linear-log
VARIABLES        salary       lsalary     lsalary        salary
sales            0.0155*                  1.50e-05***
                 (0.00891)                (3.55e-06)
lsales                        0.257***                   262.9***
                              (0.0345)                   (92.36)
Constant         1,174***     4.822***    6.847***       -898.9
                 (112.8)      (0.288)     (0.0450)       (771.5)

Linear form: salary increases by 0.0155 thousand dollars (about $15.50) for each additional one million dollars in sales.
Log-log form: salary increases by 0.257% for every 1% increase in sales.
Log-linear form: salary increases by 0.0015% (= 0.000015*100) for each additional one million dollar increase in sales.
Linear-log form: salary increases by 2.629 (= 262.9/100) thousand dollars for each additional 1% increase in sales.
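The sketch below is an added Python/NumPy illustration of the linear and log-linear interpretations on simulated wage data (the data-generating numbers are assumptions chosen only to roughly mimic the example, not the slides' actual dataset):

```python
# Added illustration: fit the linear and log-linear forms on simulated wage data
# and read off the two interpretations of the slope.
import numpy as np

rng = np.random.default_rng(2)
educ = rng.integers(8, 19, 526).astype(float)           # years of education (simulated)
lwage = 0.58 + 0.083 * educ + rng.normal(0, 0.4, 526)   # log wage, roughly like the example
wage = np.exp(lwage)

def ols_slope(x, y):
    """Simple-regression slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

b1_linear = ols_slope(educ, wage)           # wage units per extra year of education
b1_loglin = ols_slope(educ, np.log(wage))   # multiply by 100 for percent per extra year

print(f"linear form: wage changes by ${b1_linear:.2f} per year of education")
print(f"log-linear form: wage changes by about {100 * b1_loglin:.1f}% per year of education")
```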
Review questions
1. Define the regression model, the estimated equation, and residuals.
2. What method is used to obtain the coefficients?
3. What are the OLS properties?
4. How is R-squared defined and what does it measure?
5. When taking logs of the variables, how does the interpretation of the coefficients change?

Gauss-Markov assumptions
• The Gauss-Markov assumptions are standard assumptions for the linear regression model.
1. Linearity in parameters
2. Random sampling
3. No perfect collinearity (there is sample variation in the independent variable)
4. Exogeneity or zero conditional mean: the regressors are not correlated with the error term
5. Homoscedasticity: the variance of the error term is constant

Assumption 1: linearity in parameters
$y = \beta_0 + \beta_1 x + u$
• The relationship between y and x is linear in the population.
• Note that the regression model can have logged variables (e.g., log sales), squared variables (e.g., education²), or interactions of variables (e.g., education*experience), but the $\beta$ parameters enter linearly.

Assumption 2: random sampling
$(x_i, y_i)$, where $i = 1, \dots, n$
• The data are a random sample drawn from the population.
• Each observation follows the population equation $y = \beta_0 + \beta_1 x + u$.
• Example: data on workers (y = wage, x = education).
• The population is all workers in the U.S. (150 million).
• The sample is the workers selected for the study (1,000).
• Drawing randomly from the population means each worker has an equal probability of being selected.
• For example, if young workers are oversampled, the sample is not random/representative.

Assumption 3: no perfect collinearity
$SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$: the independent variable must vary in the sample.
Note that $SST_x$ is in the denominator of $\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$, so it cannot be zero.

Assumption 4: zero conditional mean (exogeneity)
$E(u \mid x) = 0$
• The expected value of the error term u given the independent variable x is zero.
• The expected value of the error must not differ based on the values of the independent variable.
• The errors must average out to zero for each x.

Example of zero conditional mean
Regression model: $wage = \beta_0 + \beta_1 \, educ + u$
[Figure: exogeneity vs. endogeneity, residuals plotted against education. Exogeneity (zero conditional mean): E(u|x) = 0, the error term is the same given education. Endogeneity (conditional mean is not zero): E(u|x) > 0, ability/error is higher when education is higher.]

Unbiasedness of the OLS estimators
• Gauss-Markov assumptions 1-4 (linearity, random sampling, no perfect collinearity, and zero conditional mean) lead to the unbiasedness of the OLS estimators:
$E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$
• The expected values of the sample coefficients $\hat{\beta}$ are the population parameters $\beta$.
• If we estimate the regression model with many random samples, the average of these coefficients will be the population parameter.
• For a given sample, the coefficients may be very different from the population parameters.
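A simulation makes the unbiasedness statement concrete. The following Python/NumPy sketch is an added illustration (with assumed parameters $\beta_0 = 2$ and $\beta_1 = 0.5$) that draws many random samples satisfying assumptions 1-4 and averages the estimated slopes:

```python
# Added illustration: unbiasedness of the OLS slope. Under assumptions 1-4, the
# average of the estimated slopes across many random samples is close to beta1.
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1 = 2.0, 0.5        # assumed population parameters
n, n_samples = 100, 5000

b1_estimates = np.empty(n_samples)
for s in range(n_samples):
    x = rng.normal(10, 3, n)   # a new random sample each time
    u = rng.normal(0, 2, n)    # error with E(u|x) = 0 (exogeneity)
    y = beta0 + beta1 * x + u
    b1_estimates[s] = (np.sum((x - x.mean()) * (y - y.mean()))
                       / np.sum((x - x.mean()) ** 2))

print(b1_estimates.mean())   # close to 0.5: the estimator is unbiased
print(b1_estimates.std())    # any single sample's estimate can still be off by this much
```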
Assumption 5: homoscedasticity
• Homoscedasticity: $Var(u \mid x) = \sigma^2$
• The variance of the error term u must not differ with the independent variable x.
• Heteroscedasticity, $Var(u \mid x) \neq \sigma^2$, is when the variance of the error term u is not constant for each x.

Homoscedasticity vs heteroscedasticity
[Figure: homoscedasticity ($Var(u|x) = \sigma^2$) vs. heteroscedasticity ($Var(u|x) \neq \sigma^2$); residuals plotted against education, with constant spread under homoscedasticity and spread that changes with education under heteroscedasticity.]

Unbiasedness of the error variance
We can estimate the variance of the error term as:
$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2$
• The degrees of freedom (n - k - 1) are corrected for the number of independent variables, k = 1.
• Gauss-Markov assumptions 1-5 (linearity, random sampling, no perfect collinearity, zero conditional mean, and homoscedasticity) lead to the unbiasedness of the error variance: $E(\hat{\sigma}^2) = \sigma^2$.

Variances of the OLS estimators
• The estimated regression coefficients are random because the sample is random. The coefficients will vary if a different sample is chosen.
• What is the sampling variability of these OLS coefficients? How far are the coefficients from the population parameters?

$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2}{SST_x}$
$Var(\hat{\beta}_0) = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{SST_x}$
The variances are higher if the variance of the error term is higher and if the variance of the independent variable is lower. Estimators with lower variance are desirable, which means a low variance in the error term and a high variance in the independent variable are desirable.

Standard errors of the regression coefficients
$se(\hat{\beta}_1) = \sqrt{\widehat{Var}(\hat{\beta}_1)} = \sqrt{\frac{\hat{\sigma}^2}{SST_x}}$
$se(\hat{\beta}_0) = \sqrt{\widehat{Var}(\hat{\beta}_0)} = \sqrt{\frac{\hat{\sigma}^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{SST_x}}$
• The standard errors are the square roots of the variances.
• The unknown population variance of the error term $\sigma^2$ is replaced with the sample variance of the residuals $\hat{\sigma}^2$.
• The standard errors measure how precisely the regression coefficients are estimated.

Review questions
1. List and explain the 5 Gauss-Markov assumptions.
2. Which assumptions are needed for the unbiasedness of the coefficients?
3. Which assumptions are needed to calculate the variance of the OLS coefficients?
4. Is it possible to have zero conditional mean but heteroscedasticity?
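As a closing illustration (added here, with simulated data rather than the slides' Stata examples), the following Python/NumPy sketch applies the formulas above to compute $\hat{\sigma}^2$ and the standard errors of the coefficients:

```python
# Added illustration: estimate the error variance and the standard errors of the
# OLS coefficients using the formulas above, on simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 526
x = rng.normal(12, 3, n)
y = -0.9 + 0.54 * x + rng.normal(0, 3.4, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)

sigma2_hat = np.sum(u_hat ** 2) / (n - 2)               # unbiased error variance (n - k - 1, k = 1)
sst_x = np.sum((x - x.mean()) ** 2)
se_b1 = np.sqrt(sigma2_hat / sst_x)                     # standard error of the slope
se_b0 = np.sqrt(sigma2_hat * np.mean(x ** 2) / sst_x)   # standard error of the intercept

print(b1, se_b1)   # slope estimate and its standard error
print(b0, se_b0)   # intercept estimate and its standard error
```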