Ordinary Least Squares and Omitted Variables PDF - Lecture Notes
Magnus Carlsson
Summary
This document presents a lecture by Magnus Carlsson on ordinary least squares (OLS) and omitted variables. It covers OLS assumptions, the omitted variables formula, provides examples, and discusses potential issues like the "bad control" problem. The material is suitable for undergraduate economics students.
Full Transcript
Ordinary least squares and omitted variables
Magnus Carlsson

Ordinary least squares (OLS)
We have seen that the experiment should be viewed as the gold standard for establishing causality. Under what conditions will OLS provide unbiased and consistent estimates of a causal effect? As we will see, the zero conditional mean assumption is the most important condition that must be fulfilled.

OLS roadmap
- First, review the OLS assumptions, with a focus on the zero conditional mean assumption
- Derive the important "omitted variables" formula, which formalizes the bias from omitted variables
- Regression and randomization
- The "bad control" problem: how controlling for the "wrong" type of variables can introduce bias

Literature
Angrist and Pischke, chapter 3 (focus on 3.1-3.3); Wooldridge, chapters 3.3 and 9.2

OLS example
Consider:
- an outcome variable Y, e.g. labor earnings;
- a variable X that we consider a possible determinant of Y and in which we are interested (e.g. years of education);
- a variable u that captures all the other determinants of Y that we do not observe.
The general notation for the model that relates Y, X and u is Y = f(X, u).

OLS example
We are interested in the relationship between X and Y in the population, which we can study from two perspectives:
i. To what extent does knowing X allow us to "predict something" about Y?
ii. Does a change ΔX "cause" a change ΔY, given a proper definition of causality?
In this course, we will focus on (ii) and thus ask: under what assumptions will OLS give us an estimate of the causal effect in the population of interest? Next, we review the OLS assumptions.

The OLS assumptions in the bivariate case
Assumption OLS.1 (Random Sampling): See previous courses.
Assumption OLS.2 (Linearity in Parameters): See previous courses.

The OLS assumptions in the bivariate case
Assumption OLS.3 (Zero Conditional Mean): In order to be able to say something about β0 and β1, we need to restrict the dependence between x and u.
If x and u could move freely together, we could never be sure whether it is x or u that changed y. The assumption requires that the conditional distribution of u given x has a zero mean: E(u | x) = 0.

The OLS assumptions in the bivariate case (for unbiasedness and consistency)
Note that the zero conditional mean assumption is a very strong assumption! Only in the case of experiments can we be sure it is fulfilled. Whether or not we can derive causality in a regression framework thus depends very much on whether the zero conditional mean assumption is fulfilled.

The OLS assumptions in the bivariate case
What is the "deep" meaning of OLS.3? Suppose that y is earnings, x is years of education and u denotes unobservable innate ability, such that:

y = β0 + β1 x + u    (1)

The assumption that E(u | x) = 0 then means that the expected value of u does not depend on the value of x. In our example, for any given level of education, the expected value of ability is the same. Realistic?

The OLS assumptions in the bivariate case
Assumption OLS.4 (Sampling Variation in the Explanatory Variable): See previous courses.

The omitted variables formula
The zero conditional mean assumption is crucial for deriving the OLS estimator. With non-experimental data, however, the assumption is unlikely to hold. This means that our estimates may be biased and inconsistent. It would therefore be useful to have a formula that shows explicitly what the bias looks like. We will therefore next derive the important and very useful omitted variables formula.

The omitted variables formula: example
Suppose that the "correct model" of the determinants of earnings, y_i, can be written as:

y_i = α + ρ S_i + γ A_i + ε_i,

where S_i is schooling, A_i is ability, and ε_i is a random term. In practice, we usually do not observe ability, but we strongly suspect ability to be correlated with schooling.
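To make the zero conditional mean assumption concrete, here is a minimal simulation sketch (all parameter values are hypothetical, chosen only for illustration). When u is drawn independently of x, the bivariate OLS slope Cov(x, y)/Var(x) recovers β1; when E(u | x) varies with x, the same estimator picks up the extra dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1 = 1.0, 0.5  # hypothetical true parameters

# Case 1: E(u | x) = 0 holds -- u is drawn independently of x
x = rng.normal(10.0, 2.0, n)
u = rng.normal(0.0, 1.0, n)
y = beta0 + beta1 * x + u

# Bivariate OLS slope: Cov(x, y) / Var(x)
slope_ok = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Case 2: E(u | x) = 0 fails -- the mean of u shifts with x
u_bad = 0.3 * (x - x.mean()) + rng.normal(0.0, 1.0, n)
y_bad = beta0 + beta1 * x + u_bad
slope_bad = np.cov(x, y_bad)[0, 1] / np.var(x, ddof=1)

print(round(slope_ok, 2))   # close to beta1 = 0.5
print(round(slope_bad, 2))  # close to 0.8 = beta1 + 0.3
```

With this large a sample the first slope is essentially β1, while the second absorbs the 0.3 dependence of u on x, previewing the omitted variables formula derived below.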
Suppose that the researcher now mistakenly specifies the incorrect model:

y_i = α + ρ S_i + u_i.

Now we can use the bivariate regression formula to derive the bias of ρ in the incorrectly specified model.

The omitted variables formula

ρ_OLS = Cov(S_i, y_i) / Var(S_i)

Substituting the formula for y_i from the correctly specified model:

ρ_OLS = Cov(S_i, α + ρ S_i + γ A_i + ε_i) / Var(S_i)
      = [Cov(S_i, α) + ρ Cov(S_i, S_i) + γ Cov(S_i, A_i) + Cov(S_i, ε_i)] / Var(S_i)

The omitted variables formula
Since Cov(S_i, α) = 0 and Cov(S_i, ε_i) = 0, we now have:

ρ_OLS = [ρ Cov(S_i, S_i) + γ Cov(S_i, A_i)] / Var(S_i)    (16)
      = [ρ Var(S_i) + γ Cov(S_i, A_i)] / Var(S_i)    (17)
      = ρ + γ Cov(S_i, A_i) / Var(S_i) = ρ + γ δ_AS    (18)

where δ_AS is the regression coefficient on S_i in a regression of A_i on S_i.

The omitted variables formula
The omitted variables formula is useful because we can use it to reason about the expected bias of our estimates. Note that there are two conditions under which ρ_OLS will not be biased:
1. γ = 0. Obviously, if γ = 0 the model was not mis-specified in the first place, since ability has no effect on earnings, and (18) reduces to ρ.
2. δ_AS = 0. If S_i and A_i are unrelated, there is no bias from excluding A_i from the equation.

The omitted variables formula: example
Let's take a concrete example. Consider a regression of log wages on education, where we do not control for ability. The results are:

log(wage)^ = 5.0455 + 0.0667 · education    (19)

What is the likely bias?

The omitted variables formula: example
First, let's assume we did have some measure of ability. We then estimate the model:

log(wage)^ = 4.7050 + 0.0443 · education + 0.0063 · iq    (20)

The coefficient on education is now much smaller than when we did not control for ability. The difference between the coefficients is 0.0667 - 0.0443 = 0.0224.

The omitted variables formula: example
We can now use the omitted variables formula to verify that the bias is indeed 0.0224.
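As a numerical sanity check on (18), the following sketch simulates the "correct" model with hypothetical values of α, ρ and γ (not the lecture's wage data), then compares the short-regression slope with ρ + γ δ_AS computed directly, where δ_AS comes from regressing A on S.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
alpha, rho, gamma = 1.0, 0.07, 0.01  # hypothetical parameters of the correct model

S = rng.normal(12.0, 2.0, n)                   # schooling
A = 50.0 + 3.0 * S + rng.normal(0.0, 10.0, n)  # ability, correlated with schooling
eps = rng.normal(0.0, 0.5, n)
y = alpha + rho * S + gamma * A + eps

# Slope of the short (mis-specified) regression of y on S
rho_ols = np.cov(S, y)[0, 1] / np.var(S, ddof=1)

# Omitted variables formula: rho + gamma * delta_AS,
# where delta_AS is the slope from regressing A on S
delta_AS = np.cov(S, A)[0, 1] / np.var(S, ddof=1)
predicted = rho + gamma * delta_AS

print(round(rho_ols, 4), round(predicted, 4))  # the two numbers agree
```

Because Cov(S, A) > 0 and γ > 0 in this sketch, ρ_OLS exceeds the true ρ: the short regression attributes part of ability's effect to schooling, exactly as (18) says.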
From the formula, we know that the difference between the coefficients should be equal to the product of:
1. the coefficient on the omitted variable (0.0063) in the full regression of earnings on schooling and the omitted variable, and
2. the coefficient on the included x-variable (schooling) in a regression of the omitted variable on the included x-variable.

The omitted variables formula: example
Let's say that the second type of regression gives:

iq^ = 53.6872 + 3.5388 · education,    (21)

with a slope coefficient of 3.5388, so that the product 3.5388 × 0.0063 ≈ 0.0223, which is indeed the difference of 0.0224 up to rounding.

The omitted variables formula: example 2
Take another example, where we estimate the relation between participation in a job search program for unemployed persons, T_i, and employment, E_i:

E_i = α + ρ T_i + γ A_i + ε_i.    (22)

Again, we do not observe ability and instead estimate:

E_i = α + ρ T_i + u_i.    (23)

What is the bias?

The omitted variables formula: example 2
In this case, it may very well be that Cov(T_i, A_i) < 0. This may happen if lower-ability workers are more inclined to join the program, whereas high-ability workers do not feel the need to join. Since γ > 0, the "bias term" in the omitted variables formula, γ Cov(T_i, A_i) / Var(T_i), becomes negative: OLS then understates the true effect of the program.
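Example 2 can be sketched in the same way, again with made-up numbers: ability raises employment (γ > 0), but lower-ability workers are more likely to select into the program, so Cov(T, A) < 0 and the short regression understates the true program effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
rho_true, gamma = 0.10, 0.02  # hypothetical program effect and ability effect

A = rng.normal(0.0, 1.0, n)  # unobserved ability
# Negative selection: higher ability -> less likely to join the program
T = (rng.normal(0.0, 1.0, n) - 0.8 * A > 0).astype(float)
eps = rng.normal(0.0, 0.1, n)
E = 0.5 + rho_true * T + gamma * A + eps  # employment outcome

rho_ols = np.cov(T, E)[0, 1] / np.var(T, ddof=1)
bias = gamma * np.cov(T, A)[0, 1] / np.var(T, ddof=1)

print(round(rho_ols, 3))          # below rho_true = 0.10
print(round(rho_true + bias, 3))  # matches the short-regression estimate
```

The estimate falls short of the true effect by γ Cov(T, A)/Var(T); a naive evaluation of the program would thus make it look less effective than it really is.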