Lecture 3: Ordinary Least Squares (OLS)

Questions and Answers

Consider a scenario where an econometrician posits a linear regression model to ascertain the causal impact of variable X on outcome Y. Given the foundational principles of Ordinary Least Squares (OLS) estimation, under what precise condition is the OLS estimator for the coefficient of X guaranteed to be both unbiased and consistent in elucidating this causal effect?

  • Provided that the sample size is sufficiently large, irrespective of the correlation between X and unobserved determinants of Y.
  • If and only if X is randomly assigned across the population, ensuring exogeneity and orthogonality with all potential confounders.
  • Assuming the functional form relating X and Y is correctly specified and there is no multicollinearity among regressors.
  • Solely when the conditional expectation of the error term, given X, is rigorously zero, irrespective of other model specifications. (correct)

In the context of econometric modeling aiming for causal inference, imagine a researcher estimates a bivariate regression of earnings (Y) on years of education (X). Invoking the 'omitted variable bias' formula, under what specific circumstance would the estimated coefficient on education from this regression spuriously overestimate the true causal effect of education on earnings?

  • If the true relationship between education and earnings is non-linear, and the linear regression model fails to capture this complexity.
  • In the presence of heteroskedasticity in the error term, causing inefficient but not necessarily biased estimation of the coefficient.
  • If individuals with higher innate ability are systematically more likely to attain higher levels of education, and innate ability positively influences earnings, but is unobserved. (correct)
  • When there is substantial measurement error in the years of education variable, attenuating the estimated coefficient towards zero.

Consider a research design where randomization is employed to assign individuals to either a treatment or control group before assessing the impact of treatment (T) on an outcome (Y). Within this experimental framework, what is the most salient econometric advantage conferred by randomization concerning the application of Ordinary Least Squares (OLS) regression for causal inference?

  • Randomization inherently eliminates heteroskedasticity, thereby ensuring the efficiency of OLS estimators.
  • Randomization fundamentally ensures the satisfaction of the Zero Conditional Mean Assumption, rendering the OLS estimator for the treatment effect unbiased. (correct)
  • Randomization guarantees linearity in the relationship between T and Y, justifying the use of a linear regression model.
  • Randomization primarily addresses the issue of multicollinearity, making the identification of the causal effect of T more precise.

Imagine a researcher, aiming to estimate the causal effect of class size (X) on student test scores (Y), initially omits 'student ability' from their OLS regression. Subsequently, concerned about potential omitted variable bias, they decide to include 'parental income' (Z) as a control variable. However, unbeknownst to them, parental income is largely determined after and because of their child's inherent ability, and it also independently affects test scores. What econometric problem is most acutely introduced or exacerbated by controlling for 'parental income' in this scenario?

  • The 'bad control' problem, potentially inducing or amplifying bias in the estimation of the causal effect of class size. (D)

In the context of instrumental variables (IV) regression, consider a scenario where 'distance to college' is proposed as an instrument for 'years of education' in an earnings regression. For 'distance to college' to be considered a valid instrument, which of the following conditions must be rigorously satisfied? (Assume relevance is already established)

  • Distance to college must be uncorrelated with the error term in the earnings equation, which embodies all unobserved determinants of earnings. (C)

Suppose a researcher aims to estimate the causal effect of 'smoking during pregnancy' (X) on 'birth weight' (Y) using observational data. Recognizing the potential for confounding, they consider using 'cigarette taxes' (Z) as an instrument for 'smoking during pregnancy'. Assess the validity of 'cigarette taxes' as an instrument in this context, specifically focusing on the exogeneity assumption. Which of the following poses the most significant threat to the exogeneity of 'cigarette taxes' as an instrument?

  • The possibility that cigarette taxes might also affect other maternal behaviors during pregnancy (e.g., diet, healthcare seeking) that independently influence birth weight. (C)

In the framework of regression discontinuity design (RDD), researchers exploit a sharp discontinuity in treatment assignment based on a threshold of a 'forcing variable'. Consider an RDD study examining the effect of a scholarship (treatment) awarded to students scoring above a certain threshold on a standardized test (forcing variable) on college enrollment (outcome). What is the most critical assumption for the validity of causal inference in this RDD setting?

  • Potential outcomes must be smooth functions of the forcing variable around the threshold, except for the discontinuous effect of the treatment. (C)

When employing difference-in-differences (DID) methodology to estimate the causal effect of a policy intervention, the 'parallel trends' assumption is paramount. Assume a policy is implemented in one region (treatment group) but not in another comparable region (control group). Which of the following scenarios would most severely undermine the validity of the parallel trends assumption in a DID analysis?

  • If there are region-specific shocks occurring concurrently with the policy intervention that differentially affect the outcome variable in the treatment region. (D)

Consider a scenario where a researcher is investigating the causal effect of 'job training' (X) on 'employment status' (Y). They suspect that individuals who are more motivated are both more likely to participate in job training and more likely to be employed, regardless of training. If 'motivation' is unobserved and omitted from the regression, what type of bias is most likely to arise in the OLS estimate of the effect of job training on employment?

  • Upward bias, leading to an overestimation of the true effect of job training. (C)

In the context of assessing the validity of causal claims derived from observational studies using OLS regression, which of the following strategies provides the most robust approach to mitigating concerns about omitted variable bias and bolstering the credibility of causal interpretations?

  • Employing research designs that approximate experimental conditions, such as instrumental variables, regression discontinuity, or difference-in-differences, to exploit exogenous variation. (C)

In an incorrectly specified model $y_i = \alpha + \rho S_i + u_i$ where the true model is $y_i = \alpha + \rho S_i + \gamma A_i + \epsilon_i$, under what precise condition, assuming $\gamma \neq 0$, will the OLS estimator $\rho_{OLS}$ from the incorrect model converge in probability to the true $\rho$?

  • When $S_i$ and $A_i$ are statistically independent in the population, thereby ensuring $Cov(S_i, A_i) = 0$ and eliminating omitted variable bias. (B)

In the context of the omitted variable bias formula, $\rho_{OLS} = \rho + \gamma \delta_{AS}$, where $\delta_{AS}$ represents the regression coefficient of $S_i$ in a regression of $A_i$ on $S_i$, what fundamentally critical assumption must hold true for this formulation to be valid?

  • That the functional form relating $A_i$ and $S_i$ is correctly specified as linear, and that both $A_i$ and $S_i$ are measured without error. (D)

Consider a scenario where you are estimating a wage regression, and you suspect omitted variable bias due to unobserved ability. You initially estimate $log(wage) = 5.0455 + 0.0667 * education$. After including a proxy for ability ($IQ$), the regression becomes $log(wage) = 4.7050 + 0.0443 * education + 0.0063 * IQ$. What is the most accurate interpretation of the change in the coefficient on education?

  • The initial coefficient on education suffered from upward bias due to the omission of ability; the new coefficient represents the effect of education independent of ability, all other factors constant. (C)

In the standard omitted variable bias framework, let $\hat{\beta}$ represent the coefficient of interest in a short regression, $\beta$ represent the true coefficient in the long regression, and $\delta$ represent the regression coefficient from regressing the omitted variable on the included variable. What additional information must we know with certainty to compute the exact magnitude of the bias?

  • We need to know the coefficient on the omitted variable $(\gamma)$ in the long regression to quantify the bias as $\gamma\delta$. (C)

Suppose a researcher estimates a wage regression omitting a crucial variable: 'motivation'. They find the coefficient on education is 0.08. When they include 'motivation' (measured imperfectly), the education coefficient drops to 0.06, and the 'motivation' coefficient is 0.30. If regressing 'motivation' on 'education' yields a coefficient of 0.05, what is the estimated bias in the original education coefficient due to omitting 'motivation'?

  • 0.015, reflecting the product of the 'motivation' coefficient (0.30) and the regression coefficient of 'motivation' on 'education' (0.05). (A)

Assume that you estimate a simple regression model and suspect that an important variable, $Z$, has been omitted. Under what conditions would the inclusion of $Z$ not alter the coefficient of interest, $\beta$, on the included variable, $X$?

  • When $Z$ lacks any causal impact on $Y$, and $Z$ is statistically independent of $X$, implying $Cov(X, Z) = 0$. (A)

In a regression framework where the true model is $Y = X\beta + Z\gamma + \epsilon$, but you estimate $Y = X\beta + u$, derive the precise mathematical condition under which the OLS estimator of $\beta$ in the short regression is unbiased, despite the omission of $Z$.

  • $Cov(X, Z) = 0$ or $\gamma = 0$: either $X$ and $Z$ are uncorrelated or $Z$ has no effect on $Y$, so the bias term $\gamma\delta$ vanishes. (B)

Consider a study estimating the impact of parental education ($P$) on child's education ($C$). However, 'genetic endowment' ($G$) is unobserved and correlated with both $P$ and $C$. If you could only proxy either $P$ or $C$, which one would be optimal to proxy to address OVB and why?

  • Proxy for $P$ because it directly addresses the omitted variable bias by controlling for the genetic factors influencing parental decisions and improving causal estimates. (D)

In the context of a wage regression, suppose you want to assess the potential bias introduced by omitting 'grit' (i.e., perseverance and passion for long-term goals). You know that individuals with higher education ($E$) tend to exhibit greater grit ($G$), but the relationship is complex and not perfectly linear. Furthermore, 'grit' is challenging to quantify reliably. What is your strategy to best assess the potential bias?

  • Estimate the change in the $E$ coefficient when $G$ is included (even with measurement error). Then use the implied bias and your economic intuition to determine if this seems realistic. (D)

Let's say a researcher estimates the return to education but cannot observe innate ability. They apply the omitted variable bias formula and confidently conclude that the return to education is significantly biased upwards. An astute critic points out that the true relationship between ability, education, and wages might be more complex. What nuance would most effectively challenge the researcher's conclusion?

  • If the relationships among ability, education, and wages are non-linear, the linear OVB formula no longer holds, which may lead to over- or underestimates of the bias. (C)

In the context of Ordinary Least Squares (OLS) regression, under what specific condition is the derivation of causality most critically dependent?

  • The adherence to the zero conditional mean assumption, implying that the expected value of the error term, $E(u|x)$, is zero for all values of the independent variable, $x$. (C)

Assume a bivariate regression model where earnings ($y$) are regressed on years of education ($x$), with unobserved innate ability represented by $u$: $y = \beta_0 + \beta_1x + u$. If $E(u|x) \neq 0$, what is the most likely consequence for the OLS estimator of $\beta_1$?

  • The OLS estimator will be biased and inconsistent, leading to incorrect inferences about the true effect of education on earnings, even with large samples. (A)

In the context of the 'omitted variable formula,' which condition must be met for the OLS estimator of a coefficient to be unbiased when a relevant variable is excluded from the regression?

  • The omitted variable must be uncorrelated with the included explanatory variables to ensure that its exclusion does not bias the coefficients of the included variables. (B)

Consider a scenario where you aim to estimate the impact of schooling ($S_i$) on earnings ($y_i$), but you suspect that ability ($A_i$) is an omitted variable correlated with both schooling and earnings. Given the 'true model' $y_i = \alpha + \rho S_i + \gamma A_i + \epsilon_i$, what is the most likely consequence of omitting $A_i$ from your regression?

  • The estimated coefficient on schooling ($\rho$) will be biased, potentially over- or underestimating the true effect, due to the correlation between schooling and the omitted variable (ability). (D)

In the framework of OLS assumptions, particularly concerning the zero conditional mean, what represents the most critical threat to the validity of causal inferences drawn from regression analysis using non-experimental data?

  • The potential endogeneity of explanatory variables, leading to correlation between the explanatory variables and the error term, thus violating the zero conditional mean assumption. (B)

Suppose you estimate a regression model and suspect that the zero conditional mean assumption is violated. Which of the following strategies would be the MOST appropriate for addressing this issue, assuming you have access to rich data and advanced econometric techniques?

  • Employ instrumental variables (IV) regression, seeking valid instruments that are correlated with the endogenous explanatory variable but uncorrelated with the error term. (A)

Given the 'omitted variable formula', under what specific circumstance will the bias resulting from omitting a relevant variable, correlated with the included variables, be precisely zero?

  • When the included explanatory variables are perfectly orthogonal to the omitted variable, ensuring that the exclusion of the variable does not affect the coefficients of the included variables. (D)

In a scenario where an OLS regression is performed to estimate the effect of education on wages, and it is suspected that unobserved ability is correlated with both education and wages, what econometric technique could MOST effectively address the endogeneity issue, assuming suitable data is available?

  • Two-stage least squares (2SLS) regression, using instrumental variables that are correlated with education but uncorrelated with unobserved ability to estimate the causal effect of education on wages. (A)

Suppose a researcher estimates a model of earnings ($y_i$) as a function of schooling ($S_i$) and ability ($A_i$), represented as $y_i = \alpha + \rho S_i + \gamma A_i + \epsilon_i$. However, ability is unobserved and omitted from the regression. Assuming $S_i$ and $A_i$ are positively correlated, and $\gamma > 0$, what is the expected direction of the bias in the OLS estimate of $\rho$ if ability is omitted?

  • The OLS estimate of $\rho$ will be biased upward, overestimating the true effect of schooling on earnings. (D)

Consider a scenario where the true data generating process includes an interaction term between two explanatory variables, but this interaction is omitted from the estimated OLS model. How does the omission of this interaction term MOST directly impact the interpretation of the coefficients on the included main effects?

  • The coefficients on the included main effects will represent conditional average effects, averaging over all possible values of the interacting variable. (B)

Suppose a researcher aims to estimate the impact of a job search program ($T_i$) on employment status ($E_i$). Due to data limitations, the researcher cannot directly observe individual ability ($A_i$). Under what specific circumstance will the estimated coefficient on $T_i$ in the short regression ($E_i = \alpha + \rho T_i + u_i$) be biased downwards, assuming $A_i$ positively influences employment?

  • When individuals with lower unobserved ability are more inclined to participate in the job search program, leading to a negative covariance between $T_i$ and $A_i$. (D)

Consider a researcher estimating the effect of education on earnings, initially finding a coefficient of 0.0667. After including a proxy for ability, the coefficient drops to 0.0443. Given that regressing the ability proxy on education yields a coefficient of 3.5388 and the coefficient on the ability proxy in the earnings equation is 0.0063, what precisely does the product of 3.5388 and 0.0063 represent in this context?

  • It constitutes the estimated bias in the original coefficient on education due to the omitted variable (ability), under the assumptions of the OVB formula. (A)

Suppose a researcher estimates the following two regressions: 1) $E_i = \alpha + \rho T_i + u_i$, and 2) $E_i = \alpha + \rho T_i + \gamma A_i + \epsilon_i$, where $E_i$ is employment status, $T_i$ is participation in a job search program, $A_i$ is ability, and $u_i$ is the error term. If $Cov(T_i, A_i) < 0$ and $\gamma > 0$, what sign will the bias term in the omitted variable formula have, and what does this imply about the estimated effect of the job search program in the first regression?

  • The bias term will be negative, leading to a potential underestimation or even a negative estimated effect of the job search program. (B)

In the context of estimating the impact of education on earnings, a researcher initially omits 'innate ability' from their regression model. Critically, the researcher posits that the covariance between education and innate ability could realistically be negative. Under what specific condition would this negative covariance lead to the OLS estimate of the return to education being underestimated?

  • When the true causal effect of innate ability on earnings is positive, reinforcing the downward bias introduced by the negative covariance between education and innate ability. (D)

Consider the scenario where a researcher is using the Omitted Variable Bias (OVB) formula. The researcher estimates that $\hat{\beta}_{short} = \beta + \delta \cdot \gamma$, where $\beta$ is the true coefficient, $\hat{\beta}_{short}$ is the biased coefficient from the short regression, $\delta$ is the coefficient from regressing omitted variable $Z$ on included variable $X$, and $\gamma$ is the coefficient from regressing $Y$ on $Z$ after including $X$. If the researcher finds that $\hat{\beta}_{short}$ is statistically insignificant despite strong theoretical reasons to believe $X$ affects $Y$, what can explain this?

  • The influence of the omitted variable ($Z$) counteracts the true effect of $X$ on $Y$. (B)

A researcher suspects omitted variable bias in their wage regression due to the omission of 'social capital'. To assess the potential bias, they propose a novel approach: instead of directly measuring 'social capital,' they plan to instrument it using 'participation in extracurricular activities during high school'. Which of the following conditions MOST critically undermines the validity of this instrumental variable approach?

  • The possibility that participation in extracurricular activities during high school directly affects wages independently of its effect on adult 'social capital'. (B)

Consider a scenario where a researcher hypothesizes that access to high-speed internet ($I$) positively affects student test scores ($S$). However, they suspect that families with higher socioeconomic status (SES) are more likely to have access to high-speed internet and to invest in other resources that enhance their children's academic performance. Suppose the researcher estimates the following equations, where $u_i$ and $\epsilon_i$ are error terms: $S_i = \alpha + \beta I_i + u_i$ and $S_i = \alpha + \beta I_i + \gamma SES_i + \epsilon_i$. What would imply there is omitted variable bias?

  • If the coefficient $\beta$ is larger in magnitude in the first equation than in the second equation. (A)

In a study examining the impact of a new teaching method ($X$) on student performance ($Y$), researchers suspect that student motivation ($Z$) is an omitted variable. The estimated model is $Y = \alpha + \beta X + \epsilon$, where $\epsilon$ is the error term. For the OLS estimator of $\beta$ to be unbiased despite the omission of $Z$, which of the following conditions must hold?

  • Motivation ($Z$) must have no impact on student performance ($Y$), or it must be completely uncorrelated with the new teaching method ($X$). (D)

Consider a researcher using the omitted variable bias formula to assess the impact of an unobserved variable on the estimated coefficient of interest in a regression model. Under what specific condition would the application of the omitted variable bias formula lead to a misleading or inaccurate assessment of the true bias?

  • When the functional form of the relationship between the omitted variable and the included variable is highly nonlinear. (C)

A researcher estimates a wage regression, but omits 'cognitive ability' from the model. They acknowledge that more educated individuals tend to have higher cognitive ability. If the true data generating process involves complex interactions between education, cognitive ability, and unobserved 'grit', what is the most salient reason why applying the basic omitted variable bias (OVB) formula might provide an incomplete or misleading assessment of the true bias?

  • Complex interactions imply that the relationships captured by the OVB formula (linear effects of omitted variable and its correlation with the included variable) may not fully represent the true dependencies, leading to an inaccurate assessment. (D)

Flashcards

Gold standard for causality

Experiments are the gold standard for establishing causality.

Zero Conditional Mean Assumption

The most important condition for OLS to provide unbiased and consistent estimates of a causal effect.

Omitted Variables Formula

Formalizes the bias resulting from variables that are not included in the model.

Bad Control Problem

Controlling for the wrong variables can introduce bias into the analysis.

Variable X

A variable we consider as a possible determinant of the outcome variable.

Variable u

Describes determinants of the outcome variable that we do not observe.

Predicting Y from X

Knowing X allows us to predict something about Y.

Causal Effect

Concerned with whether a change in X causes a change in Y.

E(u|x) = 0

The conditional distribution of the error term given X has a zero mean.

Experiments

Can ensure that the zero conditional mean assumption is fulfilled.

Equation Form of Earnings Model

Earnings = Intercept + (Coefficient * Education) + Unobservable factors (like ability)

Meaning of E(u|x) = 0

The expected value of the error term (u) does not change based on the value of the independent variable (x). In the earnings example, average innate ability is the same regardless of education level.

What does 'A' stand for in the earnings equation?

Ability.

What does "S" stand for in the earnings equation?

Schooling.

Omitted Variable Bias

When a relevant variable is omitted from a regression, and that variable is correlated with both the included independent variable and the dependent variable, the OLS estimates may be biased and inconsistent.

Correct Model Definition

The 'correct model' of the determinants of earnings including both schooling and ability.

Importance of Zero Conditional Mean Assumption

When the assumption holds, the OLS estimator can be meaningfully interpreted as an estimate of a causal effect.

Challenge with Non-Experimental Data

With observational data, it is common for the zero conditional mean assumption not to hold, resulting in biased estimates.

Bias term (Omitted Variables)

The coefficient from a regression of the error term on the included variable.

Job Search Program Example

Estimating the impact of a job search program (T) on employment (E) without considering ability (A).

Negative Covariance (Job Search)

When lower ability workers are more likely to join a job search program.

Negative Correlation

When higher ability workers do not need to join a job search program.

Gamma (γ)

The effect of ability on employment.

Negative Bias

When the covariance between program participation and ability is negative and ability raises employment (γ > 0), the omitted variable formula implies a negative bias in the estimated program effect.

Omitted Variables Formula Product

The bias equals the product of two coefficients: the coefficient on the omitted variable in the earnings regression that includes it, and the coefficient on the included x-variable in a regression of the omitted variable on the included x-variable.

IQ Equation

IQ = 53.6872 + 3.5388 * education, with slope coefficient of 3.5388.

$\delta_{AS}$

The coefficient from regressing the omitted variable (A) on the included variable (S).

No Effect of Omitted Variable

When the coefficient on the excluded variable ($\gamma$) is zero.

Unrelated Variables Condition

When the included and excluded variables are unrelated ($\delta_{AS} = 0$).

Wage Regression Example (1)

log(wage) = 5.0455 + 0.0667 * education. Demonstrates the effect of education on wages without accounting for ability.

Wage Regression Example (2)

log(wage) = 4.7050 + 0.0443 * education + 0.0063 * iq. Illustrates the impact of ability (IQ) on wages, and its effect on the education coefficient.

Omitted Variables

Variables not accounted for in a model that influence the outcome.

γ (Gamma)

The part of the OVB formula that represents the effect of the omitted variable on the outcome variable.

Specification Error

The error that occurs when a relevant variable is left out of the regression model.

Study Notes

Ordinary Least Squares (OLS)

  • Experimentation is the gold standard for establishing causality.
  • OLS provides unbiased and consistent estimates of a causal effect under specific circumstances.
  • The zero conditional mean assumption is the most important of the conditions that must be fulfilled.

OLS Roadmap

  • Assumptions for OLS will be repeated, focusing on the zero conditional mean assumption
  • The formula for "omitted variables" will be derived to formalize the bias
  • Regression and randomization will be discussed
  • The "bad control" problem is presented: how including the "wrong" type of variable in a regression can introduce bias

OLS Example

  • An outcome variable Y, such as labor earnings
  • A variable X, which is viewed as a possible determinant of Y, such as years of education
  • A variable u, capturing all the other determinants of Y that are not observed
  • The model that relates Y, X, and u is expressed as Y = f(X, u)
  • One can study relationships between X and Y in the population from two perspectives:
    • The extent to which knowing X allows one to "predict something" about Y
    • Whether ΔX "causes" ΔY given a proper definition of causality
  • Under certain assumptions OLS gives an estimate of the causal effect in the population of interest

OLS Assumptions in the Bivariate Case

  • Assumption OLS.1 Random Sampling
  • Assumption OLS.2 Linearity in Parameters
  • Assumption OLS.3 Zero Conditional Mean, needed to restrict the dependence between x and u
  • The conditional distribution of u given x has a zero mean: $E[u|x] = 0$
  • Note that the zero conditional mean assumption is a very strong assumption
  • Experiments are the only case where one can be sure it is fulfilled
  • The ability to derive causality in a regression framework depends on whether the zero conditional mean assumption is fulfilled
  • Deeper meaning: let y = earnings, x = years of education, and u = unobservable innate ability, so that $y = \beta_0 + \beta_1 x + u$
  • The assumption $E[u|x] = 0$ then means that the expected value of u does not depend on the value of x.
  • For any given level of education, the expected value of ability is the same
  • Assumption OLS.4 Sampling Variation in the Explanatory Variable
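
The role of the zero conditional mean assumption can be illustrated with a small simulation. The sketch below is only an illustration under made-up assumptions: the function names, the true slope of 0.05, and the way education depends on ability are all hypothetical and not taken from the lecture. It compares the OLS slope when x is assigned independently of u (as in an experiment) with the slope when x depends on u.

```python
import random

def ols_slope(x, y):
    """Bivariate OLS slope: sample Cov(x, y) / sample Var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x

def simulate_slope(randomized, n=20000, beta0=1.0, beta1=0.05, seed=0):
    """Simulate y = beta0 + beta1 * x + u and return the OLS slope (hypothetical DGP)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unobserved ability, the error term
        if randomized:
            # x is drawn independently of u, so E[u|x] = 0 holds
            x = rng.gauss(12.0, 2.0)
        else:
            # x depends on u, so E[u|x] changes with x and the assumption fails
            x = 12.0 + u + rng.gauss(0.0, 2.0)
        xs.append(x)
        ys.append(beta0 + beta1 * x + u)
    return ols_slope(xs, ys)

slope_randomized = simulate_slope(randomized=True)
slope_confounded = simulate_slope(randomized=False)
print("true beta1 = 0.05")
print(f"x independent of u: {slope_randomized:.3f}")
print(f"x correlated with u: {slope_confounded:.3f}")
```

Under these assumed parameters, the population slope in the second case is 0.05 + Cov(x, u)/Var(x) = 0.05 + 1/5 = 0.25, so the estimate stays far from 0.05 no matter how large the sample is.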

Omitted Variable Formula

  • The zero conditional mean assumption is crucial for deriving the OLS estimator.
  • The zero conditional mean assumption is unlikely to hold with non-experimental data.
  • Estimates may be biased and inconsistent.
  • A formula can be derived that explicitly shows what the bias looks like
  • Next, the omitted variables formula will be derived

Omitted Variable Formula: Example

  • The "correct model" of the determinants of earnings, $y_i$, can be written as: $y_i = \alpha + \rho S_i + \gamma A_i + \epsilon_i$
  • $S_i$ is schooling, $A_i$ is ability, and $\epsilon_i$ is a random term.
  • Ability is typically unobserved despite suspected correlation with schooling
  • The researcher then mistakenly specifies the incorrect model: $y_i = \alpha + \rho S_i + u_i$
  • The bivariate regression formula can be used to derive the bias of $\rho$ in the incorrectly specified model

The Omitted Variable Formula

  • ρ_OLS = Cov(Sᵢ, Yᵢ) / Var(Sᵢ)
  • The formula for Yᵢ from the correctly specified model is substituted
  • ρ_OLS = ρ + γδ_AS
  • δ_AS = Cov(Aᵢ, Sᵢ) / Var(Sᵢ) is the regression coefficient on Sᵢ in a regression of Aᵢ on Sᵢ.
  • The omitted variables formula is useful when reasoning about the expected bias of estimates
  • Two conditions under which ρ_OLS will not be biased
  • γ = 0. If γ = 0, the model was not mis-specified in the first place, because ability has no effect on earnings, and the omitted variables formula reduces to ρ
  • δ_AS = 0. If Sᵢ and Aᵢ are unrelated, no bias will result from excluding Aᵢ from the equation
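
Written out, the substitution step looks as follows. This is a sketch in LaTeX notation, with ρ the schooling coefficient, γ the ability coefficient, and ε_i the random term of the correct model. It assumes that the random term is uncorrelated with schooling, Cov(S_i, ε_i) = 0, which the notes do not state explicitly.

```latex
\begin{aligned}
\rho_{OLS}
  &= \frac{\operatorname{Cov}(S_i, Y_i)}{\operatorname{Var}(S_i)}
   = \frac{\operatorname{Cov}(S_i,\ \alpha + \rho S_i + \gamma A_i + \varepsilon_i)}{\operatorname{Var}(S_i)} \\
  &= \rho\,\frac{\operatorname{Var}(S_i)}{\operatorname{Var}(S_i)}
   + \gamma\,\frac{\operatorname{Cov}(S_i, A_i)}{\operatorname{Var}(S_i)}
   + \frac{\operatorname{Cov}(S_i, \varepsilon_i)}{\operatorname{Var}(S_i)} \\
  &= \rho + \gamma\,\delta_{AS},
  \qquad \delta_{AS} = \frac{\operatorname{Cov}(A_i, S_i)}{\operatorname{Var}(S_i)} .
\end{aligned}
```

The bias term γδ_AS vanishes exactly in the two cases listed above: when ability has no effect on earnings, or when ability and schooling are uncorrelated.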

Omitted Variable Formula: Example

  • A wage regression of log wages on education where ability is not controlled for
  • log(wage) = 5.0455 + 0.0667 * education
  • The model that includes a measure of ability (IQ): log(wage) = 4.7050 + 0.0443 * education + 0.0063 * iq
  • The coefficient on education is now much smaller than when not controlling for ability.
  • The difference in the coefficients is 0.0667 - 0.0443 = 0.0224.
  • The omitted variables formula can be used to check that this difference of 0.0224 is the bias we should expect.
  • From the formula, we know that the difference in coefficients should be equal to the product of:
    • The coefficient on the omitted variable (0.0063) in the regression of log wages on education and the omitted variable
    • The coefficient on the included x-variable (schooling) in a regression of the omitted variable on the included x-variable.

Omitted Variable Formula: Example

  • iq = 53.6872 + 3.5388 * education
  • The slope coefficient is 3.5388, so the product 3.5388 * 0.0063 is approximately 0.0223, which matches the difference of 0.0224 up to rounding of the reported coefficients.
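
The arithmetic of this example can be checked in a few lines. The sketch below only uses the four coefficients reported in the notes; the variable names are illustrative.

```python
# Coefficients as reported in the example regressions
b_short = 0.0667     # education coefficient, ability omitted
b_long = 0.0443      # education coefficient, controlling for iq
gamma_hat = 0.0063   # iq coefficient in the regression with both variables
delta_hat = 3.5388   # education coefficient in the regression of iq on education

difference = b_short - b_long          # change in the education coefficient
implied_bias = gamma_hat * delta_hat   # bias predicted by the omitted variables formula

print(f"difference in coefficients: {difference:.4f}")
print(f"gamma * delta:              {implied_bias:.4f}")
```

The two numbers agree to about three decimal places; the small gap comes from the coefficients being reported with four decimals.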

Omitted Variable Formula: Example 2

  • The relation between participating in a job search program for unemployed persons, Tᵢ, and employment, Eᵢ
  • Eᵢ = α + ρTᵢ + γAᵢ + εᵢ
  • Again, ability is not observed, so the model that is actually estimated is: Eᵢ = α + ρTᵢ + uᵢ
  • The estimate of ρ is then biased whenever ability affects employment and is correlated with participation

Omitted Variable Formula: Example 2

  • Cov(Tᵢ, Aᵢ) < 0 can happen if lower-ability workers are more inclined to join the program, whereas high-ability workers do not feel the need to join
  • Since γ > 0, this makes the "bias term" in the omitted variables formula negative
  • γ Cov(Tᵢ, Aᵢ) / Var(Tᵢ) < 0
  • The estimate of the effect of the job search program is therefore biased downward
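
A small simulation can make the sign of the bias concrete. The sketch below is illustrative only: the parameter values, the selection rule for joining the program, and the use of a continuous outcome instead of a binary employment indicator are all assumptions, not part of the lecture.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance (divisor n); the divisor cancels in the ratios below."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
n = 20000
alpha, rho, gamma = 0.3, 0.10, 0.20   # hypothetical true parameters, gamma > 0

ability = [rng.gauss(0, 1) for _ in range(n)]
# Assumed selection rule: lower-ability workers join more often, so Cov(T, A) < 0
treated = [1.0 if a + rng.gauss(0, 1) < 0 else 0.0 for a in ability]
noise = [rng.gauss(0, 0.5) for _ in range(n)]
outcome = [alpha + rho * t + gamma * a + e
           for t, a, e in zip(treated, ability, noise)]

var_t = cov(treated, treated)
rho_short = cov(treated, outcome) / var_t   # regression that omits ability
delta = cov(treated, ability) / var_t       # slope from regressing ability on T
bias_term = gamma * delta                   # the "bias term" of the formula

print(f"true rho:           {rho:.3f}")
print(f"estimate, A omitted: {rho_short:.3f}")
print(f"rho + gamma * delta: {rho + bias_term:.3f}")
```

With these assumed values the short regression understates the program effect, and the gap is close to γ times the slope of ability on participation.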

Overcoming Omitted Variable Bias

  • Run an experiment where Sᵢ is randomized across individuals. This solves the omitted variables problem, and the zero conditional mean assumption will be automatically fulfilled (E[u | X] = 0).
  • Run a multiple regression where the omitted variable Aᵢ is included. This requires the "conditional independence assumption (CIA)" to hold.
  • Use matching
  • Seek "natural" or "quasi"- experiments, using techniques such as instrumental variables, regression discontinuity, and difference-in-differences

Regression and Randomization

  • Given the regression model Yᵢ = β₀ + β₁Dᵢ + uᵢ
  • Dᵢ is a treatment dummy
  • Randomization ensures that Dᵢ is distributed independently of the unobserved factors in uᵢ
  • The intuition should be clear: because treatment status is assigned at random, it cannot be systematically related to the unobserved factors
  • This also means that the zero conditional mean assumption is automatically fulfilled!

Regression and Randomization

  • Because Dᵢ and uᵢ are independently distributed we have: E[Yᵢ | Dᵢ] = β₀ + β₁Dᵢ
  • (This uses the zero conditional mean assumption, E[uᵢ | Dᵢ] = 0)
  • The OLS estimator for β₁ will measure the causal effect (or average treatment effect) and is equivalent to the difference-in-means estimator.
  • Under the regression framework, the average treatment effect equals: β₁ = E[Yᵢ | Dᵢ = 1] - E[Yᵢ | Dᵢ = 0]

Regression and Randomization

  • E[Yᵢ | Dᵢ = 1] is the expected value of Y for the treatment group
  • E[Yᵢ | Dᵢ = 0] is the expected value of Y for the control group
  • Consider the regression at the two possible values of D, D = 1 and D = 0
    • When Dᵢ = 0, Yᵢ = β₀ + uᵢ
    • Since E[uᵢ | Dᵢ] = 0, the conditional expectation is E[Yᵢ | Dᵢ = 0] = β₀, which is the population mean value of Yᵢ for the control group.

Randomization Solves the Selection Problem

  • When Dᵢ = 1, Yᵢ = β₀ + β₁ + uᵢ
  • E[Yᵢ | Dᵢ = 1] = β₀ + β₁, which is the population mean value for the treatment group, i.e. the mean value in the control group plus the treatment effect

Regression and Randomization

  • Summing up:
    • β₀ + β₁ is the population mean of Yᵢ when Dᵢ = 1
    • β₀ is the population mean of Yᵢ when Dᵢ = 0
    • E[Yᵢ | Dᵢ = 1] - E[Yᵢ | Dᵢ = 0] = (β₀ + β₁) - β₀ = β₁ is the difference between those means, i.e. the Average Treatment Effect (ATE)
  • Therefore we refer to the OLS estimator of β₁ in the regression above as the difference estimator
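
The equivalence between the OLS slope on a treatment dummy and the difference in group means holds exactly in any sample, which can be verified numerically. The sketch below uses simulated data with hypothetical parameter values and plain Python; nothing in it comes from the lecture's own data.

```python
import random

def mean(values):
    return sum(values) / len(values)

rng = random.Random(1)
n = 1000
beta0, beta1 = 2.0, 0.5   # hypothetical control-group mean and treatment effect

d = [float(rng.random() < 0.5) for _ in range(n)]       # randomized treatment dummy
y = [beta0 + beta1 * di + rng.gauss(0, 1) for di in d]  # outcome

# Bivariate OLS: slope = Cov(D, Y) / Var(D), intercept = mean(Y) - slope * mean(D)
d_bar, y_bar = mean(d), mean(y)
slope = (sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
         / sum((di - d_bar) ** 2 for di in d))
intercept = y_bar - slope * d_bar

treated_mean = mean([yi for di, yi in zip(d, y) if di == 1.0])
control_mean = mean([yi for di, yi in zip(d, y) if di == 0.0])

print(f"OLS slope:           {slope:.4f}")
print(f"difference in means: {treated_mean - control_mean:.4f}")
print(f"OLS intercept:       {intercept:.4f}")
print(f"control-group mean:  {control_mean:.4f}")
```

The slope matches the difference in means and the intercept matches the control-group mean, up to floating-point error.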

Improvements of the Difference-In-Means Estimator

  • When using data from a randomized experiment we do not have to control for other factors, because the zero conditional mean assumption is met.
  • With randomization, there should be no systematic relationship between the treatment indicator and these controls, so leaving them out should not matter for the treatment effect
  • There are at least three reasons for including additional control variables
  • Data are often informative on other individual characteristics that affect the outcome, denoted by

Improvements of the Difference-In-Means Estimator cont.

  • Reasons for including additional regressors in the regression equation are:
    • Efficiency
    • If treatment is randomly assigned, the OLS estimator of the treatment effect in a multiple regression model is more efficient (has smaller variance) than the OLS estimator without controls.
      • This is because including the additional determinants of y reduces the residual variance.
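
The efficiency argument can be illustrated with a small Monte Carlo experiment. This is a sketch under assumed values: one pre-treatment covariate x with a strong effect on the outcome, a randomized dummy d, and 300 simulated samples of 200 observations each. The regression with the control is computed by partialling x out of both y and d (the Frisch-Waugh-Lovell result), which gives the same coefficient on d as the multiple regression.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Bivariate OLS slope of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def residuals(y, x):
    """Residuals from a bivariate regression of y on x (with intercept)."""
    b = slope(x, y)
    mx, my = mean(x), mean(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(2)
n, reps = 200, 300
beta1, gamma = 1.0, 2.0   # hypothetical treatment effect and covariate effect

without_controls, with_controls = [], []
for _ in range(reps):
    x = [rng.gauss(0, 1) for _ in range(n)]             # pre-treatment covariate
    d = [float(rng.random() < 0.5) for _ in range(n)]   # randomized treatment
    y = [beta1 * di + gamma * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]
    without_controls.append(slope(d, y))
    # Partial x out of d and y, then regress residual on residual
    with_controls.append(slope(residuals(d, x), residuals(y, x)))

print(f"variance of estimate without control: {variance(without_controls):.4f}")
print(f"variance of estimate with control:    {variance(with_controls):.4f}")
```

Both estimators are centered on the true effect because d is randomized; the one with the control varies less across samples because x absorbs part of the residual variance.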

Checking for Random Assignment

  • Check for random assignment
  • Two ways to do it:
    • If random assignment is violated, the OLS estimator for ATE without control variables differs substantially from the OLS estimator for ATE with controls.
      • Check what happens to the OLS estimate of the ATE when one adds control variables.
      • Intuition: suppose randomization was violated and more motivated workers got treated more often
      • Controlling for "motivation" should then change the estimated treatment effect
      • If instead there was no violation, controlling for motivation should not matter

Checking for Random Assignment

  • Estimate the following regression: Dᵢ = β₀ + X₁ᵢβ₁ + X₂ᵢβ₂ + ··· + Xₖᵢβₖ + uᵢ

where X₁, X₂, ..., Xₖ are control variables, and test whether the coefficients on the Xs are jointly zero with an F-test.

  • Intuition: if it is completely random whether one gets treated, there should be no variables that systematically predict treatment
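
A minimal sketch of this balance check is given below, using simulated data and plain Python. The two controls, the sample size, and the rule that generates the "violated" assignment are assumptions made for illustration. The F statistic for the null that all slope coefficients are zero is computed from the regression R² as F = (R²/k) / ((1 - R²)/(n - k - 1)).

```python
import random

def ols_r_squared(y, columns):
    """R-squared from regressing y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda i: abs(aug[i][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for i in range(p):
            if i != c:
                factor = aug[i][c] / aug[c][c]
                aug[i] = [v - factor * w for v, w in zip(aug[i], aug[c])]
    beta = [aug[a][p] / aug[a][a] for a in range(p)]
    y_bar = sum(y) / n
    ss_res = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
                 for r, yi in zip(rows, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def f_statistic(y, columns):
    """F statistic for H0: all slope coefficients are zero."""
    n, k = len(y), len(columns)
    r2 = ols_r_squared(y, columns)
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

rng = random.Random(3)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]   # hypothetical control, e.g. motivation
x2 = [rng.gauss(0, 1) for _ in range(n)]   # hypothetical second control

d_random = [float(rng.random() < 0.5) for _ in range(n)]
# Assumed violation: units with higher x1 are more likely to be treated
d_violated = [float(a + rng.gauss(0, 1) > 0) for a in x1]

f_random = f_statistic(d_random, [x1, x2])
f_violated = f_statistic(d_violated, [x1, x2])
print(f"F under random assignment:   {f_random:.2f}")
print(f"F under violated assignment: {f_violated:.2f}")
```

With two restrictions and a large sample, the 5% critical value of the F distribution is about 3.0, so a statistic far above that level indicates that the controls predict treatment. In applied work a regression package would report this statistic and its p-value directly.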

The "Bad Control" Problem

  • Despite the omitted variables problem, more controls are not always better!
  • "Bad" controls are variables that are themselves outcome variables of your "treatment" variable in the hypothetical experiment
  • "Good" controls are variables that we can think of as having been fixed at the time the treatment variable was determined.
  • The essence of the bad control problem is a version of selection bias, albeit somewhat more subtle than the selection bias discussed so far

The “Bad Control” Problem

  • Example: suppose we are interested in the effects of a college degree on earnings and that people can work in one of two occupations, white collar and blue collar.
  • If there is no data on occupation, should occupation then be seen as an omitted variable in a regression of earnings on schooling?
  • Occupation is related to both education and earnings
  • Controlling for occupation may, in this example, introduce selection bias, even when education was randomly assigned.

The “Bad Control” Problem Formally

  • wᵢ is a dummy variable equal to one for white-collar workers and zero for all other workers

  • yᵢ denotes earnings

  • Both wᵢ and yᵢ are partly determined by holding a college degree, cᵢ

  • {w₁ᵢ, w₀ᵢ}: potential white-collar status with and without a college degree

  • {y₁ᵢ, y₀ᵢ}: potential earnings with and without a college degree

The “Bad Control” Problem

  • There are four possible values of the potential outcomes for white-collar status
    • w₀ᵢ = 0: potential white-collar status as not treated equals zero
    • w₀ᵢ = 1: potential white-collar status as not treated equals one
    • w₁ᵢ = 0: potential white-collar status as treated equals zero
    • w₁ᵢ = 1: potential white-collar status as treated equals one
  • Cases where an individual does not have a college degree but still works in the white-collar sector can occur, and vice versa.

The “Bad Control” Problem Formally

  • cᵢ is randomly assigned and independent of all potential outcomes, both in white-collar status and earnings.
  • wᵢ is a bad control: comparisons of earnings conditional on wᵢ do not have a causal interpretation.
  • Consider comparing college graduates and others conditional on working in a white-collar job
  • This is the difference in mean earnings with cᵢ switched on and off, conditional on wᵢ = 1

The “Bad Control” Problem Formally

  • This difference can be expressed in terms of potential earnings as treated and as untreated, conditional on being a white-collar worker: E[y₁ᵢ | w₁ᵢ = 1, cᵢ = 1] - E[y₀ᵢ | w₀ᵢ = 1, cᵢ = 0]
  • Since cᵢ is randomized, it is independent of potential outcomes
  • The difference can therefore be re-written as: E[y₁ᵢ | w₁ᵢ = 1] - E[y₀ᵢ | w₀ᵢ = 1]

The “Bad Control” Problem Formally

  • To analyze the nature of the bad-control problem, add and subtract E[y₀ᵢ | w₁ᵢ = 1] and then re-arrange
  • E[y₁ᵢ | w₁ᵢ = 1] - E[y₀ᵢ | w₀ᵢ = 1] = E[y₁ᵢ - y₀ᵢ | w₁ᵢ = 1] + {E[y₀ᵢ | w₁ᵢ = 1] - E[y₀ᵢ | w₀ᵢ = 1]}
  • The first term on the right is the causal effect; the term in braces is the selection bias

The “Bad Control” Problem Formally

  • E[y₀ᵢ | w₁ᵢ = 1] is the potential earnings as untreated (no college) of those who would become white-collar workers if treated (w₁ᵢ = 1); it can be written without conditioning on cᵢ since treatment status is independent of potential outcomes
  • E[y₀ᵢ | w₀ᵢ = 1] refers to those who become white-collar workers without going to college; this may be a group of unusually smart people, whose ability is so high that they do not need a degree to become white-collar workers

The “Bad Control” Problem Formally

  • The selection bias term compares two types of persons:

  • One who would become a white-collar worker if treated with a college degree.

  • One who would become a white-collar worker irrespective of a college degree

  • Selection issues arise because a person of the second type can become a white-collar worker even without the college degree; such a person probably has better earnings potential.

  • The first term is the causal effect of college on those with w₁ᵢ = 1, i.e. those who hold a white-collar job when they have a college degree. The selection-bias term reflects that those without a degree who still work in the white-collar sector (w₀ᵢ = 1) are likely to have better potential earnings, with or without a degree

The “Bad Control” Problem Formally

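  • Conditioning on the bad control destroys the benefit of random assignment: conditional on white-collar status, holding a college degree is no longer independent of potential earnings.
  • Sign of the bias: E[y₀ᵢ | w₀ᵢ = 1] is likely to exceed E[y₀ᵢ | w₁ᵢ = 1], so the selection-bias term is likely negative
  • Thus, the comparison conditional on white-collar status is likely to be biased downward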
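
The bad-control argument can be illustrated with a simulation. The sketch below is a stylized example under assumed values: college is randomly assigned, ability drives both occupation and earnings, a degree lowers the ability threshold for a white-collar job, and the causal effect of college on earnings is a constant tau. None of these numbers come from the lecture.

```python
import random

def mean(values):
    return sum(values) / len(values)

rng = random.Random(4)
n = 20000
tau = 1.0   # hypothetical causal effect of college on earnings

rows = []
for _ in range(n):
    ability = rng.gauss(0, 1)
    c = rng.random() < 0.5             # college degree, randomly assigned
    w0 = ability > 1.0                 # white collar without a degree: high ability only
    w1 = ability > -0.5                # white collar with a degree: lower threshold
    y0 = ability + rng.gauss(0, 0.5)   # potential earnings without a degree
    y1 = y0 + tau                      # potential earnings with a degree
    w = w1 if c else w0                # observed occupation
    y = y1 if c else y0                # observed earnings
    rows.append((c, w, y))

def diff_in_means(sample):
    """Mean earnings of graduates minus mean earnings of non-graduates."""
    treated = [y for c, _, y in sample if c]
    control = [y for c, _, y in sample if not c]
    return mean(treated) - mean(control)

unconditional = diff_in_means(rows)
white_collar_only = diff_in_means([r for r in rows if r[1]])

print(f"true effect:                 {tau:.2f}")
print(f"all workers:                 {unconditional:.2f}")
print(f"white-collar workers only:   {white_collar_only:.2f}")
```

The comparison across all workers recovers the effect because college is randomized. Within the white-collar group, non-graduates are a highly selected, high-ability group, so the conditional comparison understates the effect.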

The “Bad Control” Problem Formally

  • Do not control for variables that are outcomes of the treatment variable
  • Including them hides the part of the total causal effect that works through them, such as the effect through white-collar status
  • Variables measured before the treatment was determined are typically good controls

Summary

  • The zero conditional mean assumption is crucial for a causal interpretation and is automatically fulfilled in a randomized experiment
  • The omitted variables formula can be used to reason about expected bias
  • Adding more variables in an OLS regression is not always a good idea
  • Next consider quasi-experimental estimators and natural experiments
