Simple Regression Model PDF
Summary
This document explains the simple regression model, including residuals and related concepts, with worked examples using real-world data on CEO salaries and workers' wages. It introduces fundamental statistical modeling techniques with detailed formulas, estimation output, and interpretation; the examples illustrate the practical application of the model.
Full Transcript
Simple Regression Model

Terminology

Regression model: $y = \beta_0 + \beta_1 x + u$

Here y is the dependent variable, x is the independent variable (one independent variable for a simple regression), u is the error, and $\beta_0$ and $\beta_1$ are parameters.

Estimated equation: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Here $\hat{y}$ is the predicted value, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are coefficients.

| Population | Sample |
| --- | --- |
| Parameter $\beta$ | Coefficient $\hat{\beta}$ |
| Error $u$ | Residual $\hat{u}$ |

Residual: $\hat{u} = y - \hat{y}$, the actual value minus the predicted value of the dependent variable.

Simple regression model example

Simple regression: hourly wage (in $) depends on years of experience, with estimated equation $\hat{y} = 20 + 0.5x$.

| Hourly wage $y$ | Years of experience $x$ | Predicted value $\hat{y} = 20 + 0.5x$ | Residual $\hat{u} = y - \hat{y}$ |
| --- | --- | --- | --- |
| 20 | 1 | 20 + 0.5(1) = 20.5 | 20 - 20.5 = -0.5 |
| 21 | 2 | 20 + 0.5(2) = 21 | 21 - 21 = 0 |
| 21 | 1 | 20 + 0.5(1) = 20.5 | 21 - 20.5 = 0.5 |
| 22 | 3 | 20 + 0.5(3) = 21.5 | 22 - 21.5 = 0.5 |

[Figure: actual and predicted hourly wage against years of experience, showing the regression line, slope, predicted values, actual values, and residuals.]

Simple regression: actual values, predicted values, and residuals

The regression line fits as well as possible through the data points.

Interpretation of coefficients

$$\hat{\beta}_1 = \frac{\Delta \hat{y}}{\Delta x} = \frac{\text{change in } \hat{y}}{\text{change in } x}$$

The coefficient $\hat{\beta}_1$ measures by how much the dependent variable changes when the independent variable changes by one unit. $\hat{\beta}_1$ is also called the slope in the simple linear regression. (A derivative of a function is another function showing its slope.) The formula above is correct if $\Delta u / \Delta x = 0$, which means all other factors are held fixed.

Population regression function

$$E(y \mid x) = E(\beta_0 + \beta_1 x + u \mid x) = \beta_0 + \beta_1 x + E(u \mid x) = \beta_0 + \beta_1 x \quad \text{if } E(u \mid x) = 0$$

This assumption is called the zero conditional mean. For the population, the average value of the dependent variable can be expressed as a linear function of the independent variable. The population regression function shows the relationship between y and x for the population: for individuals with a particular x, the average value of y is $E(y \mid x) = \beta_0 + \beta_1 x$. Note that $x_1, x_2, x_3$ here refer to values of $x_i$ and not to different variables.

Derivation of the OLS estimates

For the regression model $y = \beta_0 + \beta_1 x + u$ we need to estimate the regression equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and find the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ by looking at the residuals

$$\hat{u} = y - \hat{y} = y - \hat{\beta}_0 - \hat{\beta}_1 x$$

Obtain a random sample of data with n observations $(x_i, y_i)$, where $i = 1, \dots, n$ indexes the observations. The goal is to obtain as good a fit as possible of the estimated regression equation, so we minimize the sum of squared residuals:

$$\min \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2$$

We obtain the OLS coefficients

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

OLS stands for Ordinary Least Squares, because the estimates are based on minimizing the squared residuals.

OLS properties

- $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$: the sample averages of the dependent and independent variables lie on the regression line.
- $\sum_{i=1}^{n} \hat{u}_i = 0$: the residuals sum to zero (note that we minimized the sum of squared residuals).
- $\sum_{i=1}^{n} x_i \hat{u}_i = 0$: the covariance between the independent variable and the residuals is zero.
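The slides estimate these regressions in Stata; purely as an illustration of the formulas above, here is a short Python/NumPy sketch (not part of the original deck, using made-up data chosen to echo the wage/experience example) that computes the OLS coefficients from the covariance/variance expressions and checks the listed properties numerically:

```python
import numpy as np

# Made-up sample: y depends on x with intercept 20 and slope 0.5, plus noise
# (values chosen to echo the hourly wage / years of experience example).
rng = np.random.default_rng(1)
x = rng.uniform(0, 4, size=50)
y = 20 + 0.5 * x + rng.normal(0, 0.5, size=50)

# OLS formulas from the slide:
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x   # predicted values
u_hat = y - y_hat     # residuals

print(b0, b1)                        # close to 20 and 0.5
print(round(u_hat.sum(), 10))        # 0: residuals sum to zero
print(round((x * u_hat).sum(), 10))  # 0: zero covariance between x and residuals
```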
Simple regression example: CEO's salary

A simple regression model explaining how return on equity (roe) affects the CEO's salary.

Regression model: $salary = \beta_0 + \beta_1 \, roe + u$

Estimated equation for the predicted value of salary: $\widehat{salary} = \hat{\beta}_0 + \hat{\beta}_1 \, roe$

Residuals: $\hat{u} = salary - \widehat{salary}$

We estimate the regression model to find the coefficients. $\hat{\beta}_1$ measures the change in the CEO's salary associated with a one unit increase in roe, holding other factors fixed.

Estimated equation and interpretation

$$\widehat{salary} = 963.191 + 18.501 \, roe$$

Salary is measured in thousands of dollars; ROE (return on equity) is measured in %. Interpretation of $\hat{\beta}_1$: the CEO's salary increases by $18,501 for each 1% increase in ROE. Interpretation of $\hat{\beta}_0$: if the ROE is zero, the predicted CEO salary is $963,191.

Stata output for simple regression

    . regress salary roe

          Source |       SS       df       MS         Number of obs =     209
    -------------+------------------------------      F(1, 207)     =    2.77
           Model |  5166419.04     1  5166419.04      Prob > F      =  0.0978
        Residual |   386566563   207  1867471.32      R-squared     =  0.0132
    -------------+------------------------------      Adj R-squared =  0.0084
           Total |   391732982   208  1883331.64      Root MSE      =  1366.6

          salary |    Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
    -------------+---------------------------------------------------------
             roe |  18.50119   11.12325   1.66   0.098   -3.428196   40.43057
           _cons |  963.1913   213.2403   4.52   0.000    542.7902   1383.592

Simple regression results in a table

| VARIABLES | (1) salary |
| --- | --- |
| roe | 18.50* (11.12) |
| Constant | 963.2*** (213.2) |
| Observations | 209 |
| R-squared | 0.013 |

[Figure: regression line for the sample vs. the population regression function for the population.]

[Figure: estimated regression line with actual values, plotting 1990 salary (thousands $) and fitted values against return on equity (88-90 avg).]

[Figure: actual values, predicted values, and residuals, plotting salary against return on equity (88-90 avg).]

Actual values, predicted values, and residuals

| roe | salary | Predicted value $\widehat{salary} = 963.191 + 18.501\,roe$ | Residual $\hat{u} = salary - \widehat{salary}$ |
| --- | --- | --- | --- |
| 14.1 | 1095 | 1224 | -129 |
| 10.9 | 1001 | 1165 | -164 |
| 23.5 | 1122 | 1398 | -276 |
| 5.9 | 578 | 1072 | -494 |
| 13.8 | 1368 | 1219 | 149 |
| 20 | 1145 | 1333 | -188 |
| 16.4 | 1078 | 1267 | -189 |
| 16.3 | 1094 | 1265 | -171 |
| 10.5 | 1237 | 1157 | 80 |
| 26.3 | 833 | 1450 | -617 |

The mean salary is 1,281 ($1,281,000). The mean predicted salary is also 1,281. The mean of the residuals is zero.
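The predicted values and residuals in this table can be reproduced directly from the estimated equation. Here is a small Python sketch (an illustrative addition, using the ten (roe, salary) pairs listed above):

```python
import numpy as np

# Ten observations from the slide: return on equity (%) and salary (thousand $)
roe    = np.array([14.1, 10.9, 23.5, 5.9, 13.8, 20.0, 16.4, 16.3, 10.5, 26.3])
salary = np.array([1095, 1001, 1122, 578, 1368, 1145, 1078, 1094, 1237, 833])

salary_hat = 963.191 + 18.501 * roe   # predicted salary from the estimated equation
u_hat = salary - salary_hat           # residual = actual - predicted

for r, s, p, u in zip(roe, salary, salary_hat, u_hat):
    print(f"roe={r:5.1f}  salary={s:5.0f}  predicted={p:7.1f}  residual={u:7.1f}")
```

The first row gives 963.191 + 18.501(14.1) = 1224.1 and a residual of about -129, matching the table.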
Simple regression example: wage

A simple regression model explaining how education affects wages for workers.

Regression model: $wage = \beta_0 + \beta_1 \, educ + u$

Estimated equation for the predicted value of wage: $\widehat{wage} = \hat{\beta}_0 + \hat{\beta}_1 \, educ$

Residuals: $\hat{u} = wage - \widehat{wage}$

We estimate the regression model to find the coefficients. $\hat{\beta}_1$ measures the change in wage associated with one more year of education, holding other factors fixed.

Estimated equation and interpretation

$$\widehat{wage} = -0.90 + 0.54 \, educ$$

Wage is measured in $/hour; education is measured in years. Interpretation of $\hat{\beta}_1$: the hourly wage increases by $0.54 for each additional year of education. Interpretation of $\hat{\beta}_0$: if education is zero, the predicted wage is -$0.90 (but no one in the sample has zero education).

Stata output for simple regression

    . reg wage educ

          Source |       SS       df       MS         Number of obs =     526
    -------------+------------------------------      F(1, 524)     =  103.36
           Model |  1179.73205     1  1179.73205      Prob > F      =  0.0000
        Residual |  5980.68226   524  11.4135158      R-squared     =  0.1648
    -------------+------------------------------      Adj R-squared =  0.1632
           Total |  7160.41431   525  13.6388844      Root MSE      =  3.3784

            wage |    Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
    -------------+---------------------------------------------------------
            educ |  .5413593    .053248  10.17   0.000    .4367534   .6459651
           _cons | -.9048517   .6849678  -1.32   0.187   -2.250472   .4407687

Variations

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \hat{u}_i^2$$

$$SST = SSE + SSR$$

SST is the total sum of squares and measures the total variation in the dependent variable. SSE is the explained sum of squares and measures the variation explained by the regression. SSR is the residual sum of squares and measures the variation not explained by the regression. Note: some texts call SSE the error sum of squares and SSR the regression sum of squares, where the R and E are confusingly reversed.

[Figure: decomposition of the variation in y into its explained and residual parts.]

Goodness of fit measure R-squared

$$R^2 = SSE/SST = 1 - SSR/SST$$

R-squared is the explained sum of squares divided by the total sum of squares. It is a goodness of fit measure: it gives the proportion of the total variation that is explained by the regression. An R-squared of 0.7 is interpreted as 70% of the variation being explained by the regression, with the rest due to error. An R-squared greater than 0.25 is considered a good fit.

R-squared calculated

From the Stata output above (reg wage educ): R-squared = SS Model / SS Total = 1179.73 / 7160.41 = 0.1648. About 16% of the variation in wage is explained by the regression and the rest is due to error. This is not a very good fit.
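To make the decomposition concrete, here is a Python sketch (illustrative, with made-up data rather than the wage dataset) that computes SST, SSE, and SSR for an OLS fit and confirms SST = SSE + SSR and the two equivalent R-squared formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 18, size=200)                  # e.g. years of education
y = 1.0 + 0.5 * x + rng.normal(0, 3, size=200)    # wage-like outcome with noise

# OLS fit using the slope/intercept formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation in y
sse = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression
ssr = np.sum((y - y_hat) ** 2)          # unexplained (residual) variation

print(np.isclose(sst, sse + ssr))   # True: SST = SSE + SSR
print(sse / sst, 1 - ssr / sst)     # both equal R-squared
```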
Log transformation (logged variables)

- Sometimes variables (y or x) are expressed as logs, log(y) or log(x).
- With logs, the interpretation is in percentages/elasticities.
- Variables such as age and education that are measured in units such as years should not be logged.
- Variables measured in percentage points (e.g. interest rates) should not be logged.
- Logs cannot be used if variables have zero or negative values.
- Taking logs often reduces problems with large values or outliers.
- Taking logs helps with homoskedasticity and normality.

Log-log form

Linear regression model: $y = \beta_0 + \beta_1 x + u$. Log-log form: $\log(y) = \beta_0 + \beta_1 \log(x) + u$. Instead of the dependent variable, use the log of the dependent variable; instead of the independent variable, use the log of the independent variable.

$$\hat{\beta}_1 = \frac{\Delta \log(y)}{\Delta \log(x)} = \frac{\Delta y / y}{\Delta x / x} = \frac{\text{percentage change in } y}{\text{percentage change in } x}$$

The dependent variable changes by $\hat{\beta}_1$ percent when the independent variable changes by one percent.

Log-linear form (also called semi-log)

Linear regression model: $y = \beta_0 + \beta_1 x + u$. Log-linear form: $\log(y) = \beta_0 + \beta_1 x + u$. Instead of the dependent variable, use the log of the dependent variable.

$$\hat{\beta}_1 = \frac{\Delta \log(y)}{\Delta x} = \frac{\Delta y}{y} \cdot \frac{1}{\Delta x} = \frac{\text{proportional change in } y}{\text{change in } x}$$

The dependent variable changes by $\hat{\beta}_1 \times 100$ percent when the independent variable changes by one unit.

Linear-log form

Linear regression model: $y = \beta_0 + \beta_1 x + u$. Linear-log form: $y = \beta_0 + \beta_1 \log(x) + u$. Instead of the independent variable, use the log of the independent variable.

$$\hat{\beta}_1 = \frac{\Delta y}{\Delta \log(x)} = \Delta y \cdot \frac{x}{\Delta x} = \frac{\text{change in } y}{\text{proportional change in } x}$$

The dependent variable changes by $\hat{\beta}_1 / 100$ units when the independent variable changes by one percent.

Example of data with logs

| wage | lwage | educ |
| --- | --- | --- |
| 3.10 | 1.13 | 11 |
| 3.24 | 1.18 | 12 |
| 3.00 | 1.10 | 11 |
| 6.00 | 1.79 | 8 |
| 5.30 | 1.67 | 12 |
| 8.75 | 2.17 | 16 |
| 11.25 | 2.42 | 18 |
| 5.00 | 1.61 | 12 |
| 3.60 | 1.28 | 12 |
| 18.18 | 2.90 | 17 |

[Figure: linear form (wage on educ) vs. log-linear form (lwage on educ), each with fitted values.]

Linear vs log-linear form

| VARIABLES | (1) wage | (2) lwage |
| --- | --- | --- |
| educ | 0.541*** (0.0532) | 0.0827*** (0.00757) |
| Constant | -0.905 (0.685) | 0.584*** (0.0973) |
| Observations | 526 | 526 |
| R-squared | 0.165 | 0.186 |

Linear form: wage increases by $0.54 for each additional year of education. Log-linear form: wage increases by about 8.3% (0.0827 × 100) for each additional year of education.

Example of data with logs

| Salary (thousand dollars) | lsalary | Sales (million dollars) | lsales |
| --- | --- | --- | --- |
| 1095 | 7.0 | 27595 | 10.2 |
| 1001 | 6.9 | 9958 | 9.2 |
| 1122 | 7.0 | 6126 | 8.7 |
| 578 | 6.4 | 16246 | 9.7 |
| 1368 | 7.2 | 21783 | 10.0 |
| 1145 | 7.0 | 6021 | 8.7 |
| 1078 | 7.0 | 2267 | 7.7 |
| 1094 | 7.0 | 2967 | 8.0 |
| 1237 | 7.1 | 4570 | 8.4 |
| 833 | 6.7 | 2830 | 7.9 |

Note that one unit is a thousand dollars for salary and a million dollars for sales.

[Figure: linear form (salary on sales) vs. log-log form (natural log of salary on natural log of sales), each with fitted values.]

[Figure: log-linear form (natural log of salary on sales) vs. linear-log form (salary on natural log of sales), each with fitted values.]

Interpretation of coefficients

| VARIABLES | Linear: salary | Log-log: lsalary | Log-linear: lsalary | Linear-log: salary |
| --- | --- | --- | --- | --- |
| sales | 0.0155* (0.00891) | | 1.50e-05*** (3.55e-06) | |
| lsales | | 0.257*** (0.0345) | | 262.9*** (92.36) |
| Constant | 1,174*** (112.8) | 4.822*** (0.288) | 6.847*** (0.0450) | -898.9 (771.5) |

Linear form: salary increases by 0.0155 thousand dollars (about $15.50) for each additional one million dollars in sales. Log-log form: salary increases by about 0.26% for every 1% increase in sales. Log-linear form: salary increases by 0.0015% (= 0.000015 × 100) for each additional one million dollar increase in sales. Linear-log form: salary increases by 2.629 (= 262.9/100) thousand dollars for each additional 1% increase in sales.
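As a numeric check on the log-linear reading, here is a Python sketch (an illustrative addition, plugging in the estimated log-linear wage equation from above). The comparison at the end reflects the standard caveat that $\hat{\beta}_1 \times 100$ is an approximation for small changes:

```python
import numpy as np

# Estimated log-linear wage equation from the slides:
# log(wage) = 0.584 + 0.0827 * educ
b0, b1 = 0.584, 0.0827

# Approximate interpretation: one more year of education raises wage
# by b1 * 100 percent.
print(b1 * 100)  # 8.27 -> "about 8.3% per year of education"

# Exact percentage change implied by a one-unit increase in educ:
dlog = (b0 + b1 * 13) - (b0 + b1 * 12)   # log difference = b1
print((np.exp(dlog) - 1) * 100)          # ~8.62%, close to the approximation
```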
Gauss-Markov assumptions

The Gauss-Markov assumptions are the standard assumptions for the linear regression model:

1. Linearity in parameters
2. Random sampling
3. No perfect collinearity (for the simple regression: positive sample variance in the independent variable)
4. Exogeneity or zero conditional mean: regressors are not correlated with the error term
5. Homoscedasticity: the variance of the error term is constant

Assumption 1: linearity in parameters

$$y = \beta_0 + \beta_1 x + u$$

The relationship between y and x is linear in the population. Note that the regression model can contain logged variables (e.g. log(sales)), squared variables (e.g. education²), or interactions of variables (e.g. education × experience), as long as the $\beta$ parameters enter linearly.

Assumption 2: random sampling

$(x_i, y_i)$, where $i = 1, \dots, n$

The data are a random sample drawn from the population, and each observation follows the population equation $y = \beta_0 + \beta_1 x + u$. Example: data on workers (y = wage, x = education). The population is all workers in the U.S. (150 million); the sample is the workers selected for the study (1,000). Drawing randomly from the population means each worker has an equal probability of being selected. If, for example, young workers are oversampled, the sample is not random/representative.

Assumption 3: no perfect collinearity

$$SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$$

In the simple regression model with one independent variable, there needs to be sample variation in the independent variable (the variance of x must be positive). If there is no variation, the independent variable is a constant and a separate coefficient cannot be estimated, because it is perfectly collinear with the constant in the model. Note that $SST_x$ is the denominator of $\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$.

Assumption 4: zero conditional mean (exogeneity)

$$E(u_i \mid x_i) = 0$$

The expected value of the error term u given the independent variable x is zero; it must not differ based on the values of the independent variable. The errors must average to zero for each value of x.

Example of a violation: in the regression model $wage = \beta_0 + \beta_1 \, educ + u$, when ability (which is unobserved and therefore part of the error) is higher, education also tends to be higher. This violates the zero conditional mean assumption.

[Figure: exogeneity vs. endogeneity. Left panel, residuals against educ with E(u|x) = 0: the error term behaves the same at every level of education. Right panel, modified residuals against educ with E(u|x) > 0: the ability/error is higher when education is higher.]

Unbiasedness of the OLS estimators

Gauss-Markov assumptions 1-4 (linearity, random sampling, no perfect collinearity, and zero conditional mean) lead to the unbiasedness of the OLS estimators:

$$E(\hat{\beta}_0) = \beta_0 \quad \text{and} \quad E(\hat{\beta}_1) = \beta_1$$

The expected values of the sample coefficients $\hat{\beta}$ are the population parameters $\beta$: if we estimated the regression model on many random samples, the average of the estimated coefficients would equal the population parameters. For a given sample, however, the coefficients may be very different from the population parameters.
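A quick way to see this repeated-sampling claim is simulation. The following Python sketch (an illustration with made-up parameters, not from the slides) draws many random samples from a population satisfying assumptions 1-4 and averages the OLS slope estimates:

```python
import numpy as np

# Monte Carlo sketch of unbiasedness: the average of the OLS slope across
# many random samples is close to the population slope.
rng = np.random.default_rng(42)
beta0, beta1 = 1.0, 0.5          # made-up population parameters
n, n_samples = 100, 10_000

slopes = np.empty(n_samples)
for s in range(n_samples):
    x = rng.uniform(0, 10, size=n)
    u = rng.normal(0, 2, size=n)  # E(u|x) = 0 holds by construction
    y = beta0 + beta1 * x + u
    slopes[s] = (np.sum((x - x.mean()) * (y - y.mean()))
                 / np.sum((x - x.mean()) ** 2))

print(slopes.mean())  # ~0.5: E(beta1_hat) = beta1
print(slopes.std())   # sampling variability of the estimator
```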
Assumption 5: homoscedasticity

$$\mathrm{var}(u_i \mid x_i) = \sigma^2$$

The variance of the error term u must not differ with the independent variable x. Heteroscedasticity, $\mathrm{var}(u_i \mid x_i) \neq \sigma^2$, is when the variance of the error term is not constant across x.

[Figure: data around the regression line under homoscedasticity ($\mathrm{var}(u \mid x) = \sigma^2$) vs. heteroscedasticity ($\mathrm{var}(u \mid x) \neq \sigma^2$).]

[Figure: residual plots against educ, showing constant spread under homoscedasticity and spread that changes with x under heteroscedasticity.]

Unbiasedness of the error variance

We can estimate the variance of the error term as

$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2$$

The degrees of freedom (n - k - 1) are corrected for the number of independent variables, here k = 1. Gauss-Markov assumptions 1-5 (linearity, random sampling, no perfect collinearity, zero conditional mean, and homoscedasticity) lead to the unbiasedness of the error variance estimator:

$$E(\hat{\sigma}^2) = \sigma^2$$

Variances of the OLS estimators

The estimated regression coefficients are random, because the sample is random: the coefficients will vary if a different sample is chosen. What is the sampling variability of these OLS coefficients? How far are the coefficients from the population parameters?

$$\mathrm{var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2}{SST_x}$$

$$\mathrm{var}(\hat{\beta}_0) = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{SST_x}$$

The variances are higher if the variance of the error term is higher and if the variation in the independent variable is lower. Estimators with lower variance are desirable; this means a low variance of the error term and a high variance of the independent variable are desirable.

Standard errors of the regression coefficients

$$se(\hat{\beta}_1) = \sqrt{\widehat{\mathrm{var}}(\hat{\beta}_1)} = \sqrt{\frac{\hat{\sigma}^2}{SST_x}} \qquad se(\hat{\beta}_0) = \sqrt{\widehat{\mathrm{var}}(\hat{\beta}_0)} = \sqrt{\frac{\hat{\sigma}^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{SST_x}}$$

The standard errors are the square roots of the variances, with the unknown population variance of the error term $\sigma^2$ replaced by the sample variance of the residuals $\hat{\sigma}^2$. The standard errors measure how precisely the regression coefficients are estimated.
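To tie the pieces together, here is a final Python sketch (illustrative, with made-up data) that computes $\hat{\sigma}^2$, $SST_x$, and the standard errors from the formulas above:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 2, size=n)   # made-up population: beta1 = 0.5

# OLS fit and residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)

sigma2_hat = np.sum(u_hat ** 2) / (n - 2)   # error variance, df = n - k - 1 with k = 1
sst_x = np.sum((x - x.mean()) ** 2)         # total variation in x

se_b1 = np.sqrt(sigma2_hat / sst_x)                    # se of the slope
se_b0 = np.sqrt(sigma2_hat * np.mean(x ** 2) / sst_x)  # se of the intercept

print(se_b1, se_b0)
```

As the formulas suggest, rerunning this with noisier errors (or with x drawn from a narrower range) produces larger standard errors.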