
3) Making Regression Make Sense.pdf





Causal Analysis: Making Regression Make Sense
Michael Gerfin, University of Bern, Spring 2024

Contents
1. Regression Fundamentals
2. Regression and Causality
3. Bad Controls
4. Binary Outcomes

1. Regression Fundamentals

Regression fundamentals
Setting aside the causality problem for the moment, let's have a closer look at the mechanical properties of regression. These are universal features of the population regression and its sample analogue that have nothing to do with a researcher's interpretation of the output. We review how and why regression coefficients change as covariates are added to or removed from the model.

Regression Anatomy
Bivariate case:

    β = Cov(Y_i, X_i) / Var(X_i)

Multivariate case:

    β_k = Cov(Y_i, X̃_ki) / Var(X̃_ki)

where X̃_ki is the residual from a regression of X_ki on all other covariates.
Anatomy of multivariate regression: each coefficient is the linear effect of the corresponding regressor, after "partialling out" the other variables in the model.

Omitted Variable Bias
Suppose the population model is

    Y_i = β_0 + τ D_i + γ X_i + ε_i

but we cannot observe X. Then the coefficient of D in the regression of Y on D is

    Cov(Y_i, D_i) / Var(D_i) = τ + γ δ_XD

where δ_XD is the coefficient of D in the regression of X on D.
Short equals long plus the effect of the omitted times the regression of the omitted on the included.

2. Regression and Causality

Regression and causality
Think of schooling as a binary decision: go to college (D_i = 1) or not (D_i = 0). The causal effect of schooling on the wage Y_i is stated in the potential outcomes framework:

    Potential outcome = Y_1i if D_i = 1,  Y_0i if D_i = 0

The observed outcome is

    Y_i = Y_0i + (Y_1i − Y_0i) D_i

The observed difference in Y can be decomposed into

    E[Y_i | D_i = 1] − E[Y_i | D_i = 0]
      = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 1]      (average treatment effect on the treated)
      + E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]      (selection bias)

In general regression is not causal. In order to give regression a causal interpretation when the data are non-experimental, we need a further assumption.

Conditional Independence Assumption (CIA)
The conditional independence assumption stated formally:

    {Y_1i, Y_0i} ⊥ D_i | X_i

In words: potential outcomes are independent of D conditional on additional control variables X. In other words, within each cell defined by X, treatment D is as good as randomly assigned. Then

    E[Y_1i | X_i, D_i = 1] = E[Y_1i | X_i, D_i = 0]
    E[Y_0i | X_i, D_i = 1] = E[Y_0i | X_i, D_i = 0]

Identification of ATT and ATE under CIA
The data only reveal Y_i. But given the CIA, conditional-on-X comparisons have a causal interpretation:

    E[Y_i | X_i, D_i = 1] − E[Y_i | X_i, D_i = 0]
      = E[Y_1i | X_i, D_i = 1] − E[Y_0i | X_i, D_i = 0]
      = E[Y_1i | X_i, D_i = 1] − E[Y_0i | X_i, D_i = 1]
      = E[Y_1i | X_i] − E[Y_0i | X_i]

For each value of X there is an identified treatment effect. Denote this X-specific effect by τ_X. To repeat: τ_X = E[Y_i | X_i, D_i = 1] − E[Y_i | X_i, D_i = 0], so it can be estimated using the observed data on Y, D, and X.
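For a discrete covariate, this amounts to comparing treated and control means cell by cell. A minimal Stata sketch of that comparison (not part of the course code; the variable names xcell, d, and y are hypothetical, and xcell is assumed to take a small number of discrete values):

    * tau_X: treated-control difference in mean y within each cell of xcell
    levelsof xcell, local(cells)
    foreach c of local cells {
        quietly summarize y if d == 1 & xcell == `c'
        local m1 = r(mean)
        quietly summarize y if d == 0 & xcell == `c'
        local m0 = r(mean)
        display "tau(X = `c') = " %8.4f (`m1' - `m0')
    }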
Identification of ATT and ATE under CIA (continued)
How can we find the unconditional-on-X treatment effects?
ATE: take the expectation of τ_X over the distribution of X:

    ATE = E_X[τ_X]

ATT: take the expectation of τ_X over the distribution of X in the treated subpopulation:

    ATT = E_{X|D=1}[τ_X]

Typically, with observational data ATE ≠ ATT. We come back to this in the next chapter; for now let's focus on a constant-effect model in the regression context.

Constant Effect Model
Consider the constant-effect specification

    Y_i = Y_0i + τ D_i

Then

    E[Y_i | D_i, X_i] = E[Y_0i | D_i, X_i] + τ D_i = E[Y_0i | X_i] + τ D_i

(the second equality follows from the CIA). Specify Y_0i as a linear function of X_i:

    Y_0i = α + γ X_i + v_i

Plugging this into the first equation gives

    Y_i = α + τ D_i + γ X_i + v_i

Then the regression Y_i = α + τ D_i + γ X_i + v_i identifies τ = ATE if D_i ⊥ v_i | X_i. In words: identification requires that X is the only reason that D and Y_0 are correlated, so that conditional on X, D and v are independent. Identification does not require that X is independent of v. Hence γ has no causal interpretation.

CIA in a DAG
Here, the error term of the regression model, v, includes both the unobservable X_2 and the independent error term for Y. Without conditioning on X_1 there are two back-door paths: D ← X_1 → Y and D ← X_1 ← X_2 → Y. Conditioning on X_1 closes both, so the causal effect of D on Y is identified. However, there is an arrow between X_1 and X_2, so X_1 and X_2 (and hence v) are clearly not independent.

Demonstration of CIA
I illustrate this with a simple simulation. Data are generated with the following code:

    set obs 5000
    set seed 123456
    g v = rnormal(2,2)
    g x = 0.5*v + rnormal()
    g d = 0.5*x + rnormal() > 0.5
    g y = d + x + v

Descriptive statistics:

    . sum x v if d==0

        Variable |    Obs        Mean   Std. Dev.        Min        Max
               x |  2,446    .3309931   1.241768   -4.258484   4.147845
               v |  2,446    1.371212   1.893125   -4.730494   7.802777

    . sum x v if d==1

        Variable |    Obs        Mean   Std. Dev.        Min        Max
               x |  2,554    1.621334   1.276425    -2.83092   6.095407
               v |  2,554    2.693759   1.911395   -3.493239   9.745736

Obviously E[v | d = 0] ≠ E[v | d = 1], so d and v are not independent unconditionally.

Regression analysis without controlling for x:

    . reg y d      (Number of obs = 5,000; F(1, 4998) = 1943.96; R-squared = 0.2800; Root MSE = 2.8964)

               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               d |   3.612888    .0819428   44.09   0.000     3.452244    3.773532
           _cons |   1.702205    .0585648   29.07   0.000     1.587392    1.817018

The coefficient of d is severely biased (true value = 1) → omitted variable bias. OVB is the effect of the omitted times the regression of the omitted on the included: the effect of the omitted variable (x) is 2.003 (see the next regression), and the regression of x on d yields a coefficient of 1.29.

Regression analysis controlling for x:

    . reg y d x    (Number of obs = 5,000; F(2, 4997) = 11895.59; R-squared = 0.8264; Root MSE = 1.4223)

               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               d |   1.028041    .0452098   22.74   0.000     .9394103    1.116672
               x |   2.003227    .0159724  125.42   0.000     1.971914     2.03454
           _cons |   1.039151    .0292407   35.54   0.000     .9818261    1.096475

Conditional on x the causal effect is identified. And γ has no meaningful interpretation: γ is the true effect (= 1) plus a bias, which is again an omitted variable bias.
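Plugging these estimates into the OVB formula from the Regression Fundamentals section confirms the decomposition (short equals long plus the effect of the omitted times the regression of the omitted on the included):

    3.613 ≈ 1.028 + 2.003 × 1.29

which matches the coefficient of d in the short regression of y on d.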
Regression analysis illustrating OVB (regression of v on d and x):

    . reg v d x    (Number of obs = 5,000; F(2, 4997) = 2512.68; R-squared = 0.5014; Root MSE = 1.4223)

               v |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               d |   .0280414    .0452098    0.62   0.535    -.0605897    .1166725
               x |   1.003227    .0159724   62.81   0.000     .9719144     1.03454
           _cons |   1.039151    .0292407   35.54   0.000     .9818261    1.096475

The coefficient of d in the regression of v on x and d is almost 0: conditional on x, d and v are independent.

Summarizing comments
The CIA, E[v_i | X_i, D_i] = E[v_i | X_i], is a weaker and more focused assumption than the traditional assumption that all regressors are independent of v (E[v_i | X_i, D_i] = 0). The focus is on identifying one causal effect, not on obtaining unbiased estimates for all right-hand-side variables. There is a clear distinction between cause and controls on the right-hand side of the regression:
- only one variable is seen as having a causal effect
- all others are controls, included only in service of this focused causal agenda
The regression coefficients multiplying the controls have no causal interpretation.

Example for the effect of adding controls
Which specification is the most convincing? (The slide's table of specifications is not reproduced in this transcript; a sketch of such a comparison follows.)
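As an illustration of what such a specification comparison typically looks like, here is a minimal Stata sketch with hypothetical variable names (lwage, educ, exper, expersq, south, urban); the point is to watch how the schooling coefficient moves as controls are added:

    * Compare the schooling coefficient across nested specifications
    reg lwage educ
    estimates store m1
    reg lwage educ exper expersq
    estimates store m2
    reg lwage educ exper expersq south urban
    estimates store m3
    estimates table m1 m2 m3, b(%9.4f) se keep(educ)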
3. Bad Controls

Bad Controls
Bad controls are variables that introduce bias when controlled for, while leaving them out is fine. Bad controls are often variables that are themselves outcomes of the treatment. Good controls are variables that we can think of as having been fixed before the treatment assignment.
Example of a bad control in a college (yes/no) and occupation (blue-/white-collar) setting: C is a dummy variable equal to one for college, W is a dummy variable equal to one for white collar, and Y is earnings. Bad control means that a comparison of earnings conditional on W may not have a causal interpretation, even if C is randomized.

Bad Control in a DAG
The bad control problem is a case where a DAG is much more intuitive than a formal derivation. Assume for simplicity that college, C, is randomized. White-collar status W is a function of the college degree and of ability A (unobserved). Earnings Y are a function of C, W, and A. This model is shown in the following DAG.

There are two causal paths: C → Y (direct) and C → W → Y (indirect). The back-door path C → W ← A → Y is closed because W is a collider on this path. Conditioning on W opens a path between C and A (a spurious association, shown by the dashed bi-directed arrow). In the white-collar group, those without college have on average higher ability. In potential outcomes notation: Y_0 ⊥ C, but Y_0 is not independent of C conditional on W. In this case it is only possible to identify the total (direct plus indirect) causal effect.

Bad controls formally
{Y_1, Y_0} denotes potential earnings, {W_1, W_0} denotes potential white-collar status (again, for simplicity, C is randomized). Consider the difference in mean earnings between college graduates and others, conditional on working in a white-collar job:

    E[Y_i | W_i = 1, C_i = 1] − E[Y_i | W_i = 1, C_i = 0]
      = E[Y_1i | W_1i = 1, C_i = 1] − E[Y_0i | W_0i = 1, C_i = 0]

By the joint independence of C and {Y_1, Y_0, W_1, W_0} we have

    E[Y_1i | W_1i = 1, C_i = 1] − E[Y_0i | W_0i = 1, C_i = 0] = E[Y_1i | W_1i = 1] − E[Y_0i | W_0i = 1]

This expression reflects the apples-to-oranges nature of the bad control problem:

    E[Y_1i | W_1i = 1] − E[Y_0i | W_0i = 1]
      = E[Y_1i − Y_0i | W_1i = 1]                              (causal effect)
      + (E[Y_0i | W_1i = 1] − E[Y_0i | W_0i = 1])              (selection bias)

The causal effect is the effect of college on those with W_1 = 1. The selection-bias term reflects the fact that college changes the composition of the pool of white-collar workers.

Demonstration of the bad control problem
I illustrate this with a simple simulation. Data are generated with the following code (the white-collar dummy w is a function of college c and ability a):

    set obs 5000
    set seed 123456
    g c = rnormal() > 0.6
    g a = rnormal()
    g w = (0.2*c + 0.2*a) > 0
    g y = 1 + c + w + a + rnormal(0,3)

The true direct effect of C on Y is 1. The true indirect effect is 1 times the partial correlation between W and C, which is ≈ 0.35 in this example. The true total effect is therefore ≈ 1.35.

Regression analysis:

    . reg y c      (Number of obs = 5,000; F(1, 4998) = 190.85; R-squared = 0.0368; Root MSE = 3.3084)

               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               c |   1.440496    .1042727   13.81   0.000     1.236075    1.644916
           _cons |   1.512001    .0551168   27.43   0.000     1.403948    1.620054

    . reg y c w    (Number of obs = 5,000; F(2, 4997) = 483.18; R-squared = 0.1621; Root MSE = 3.0861)

               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               c |   .5467398    .1026155    5.33   0.000     .3455684    .7479112
               w |   2.553759    .0934352   27.33   0.000     2.370585    2.736933
           _cons |   .2687884    .0686459    3.92   0.000     .1342123    .4033645

Conditioning on w gives a completely wrong result.
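Ability a is observable inside the simulation, so the DAG logic can be checked directly: controlling for a alongside w closes the spurious path that conditioning on the collider w opens, and the direct effect of c is recovered. A quick sketch, assuming the simulated dataset above is in memory:

    * With ability observed, adding it alongside w recovers the direct
    * effect of c (approximately 1 in this design), because a blocks the
    * path opened by conditioning on the collider w
    reg y c w a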
Demonstration of the bad control problem (continued)
Effect of conditioning on w on the distribution of a.

Unconditional regression of c on a:

    . reg c a            (Number of obs = 5,000; F(1, 4998) = 0.09; Prob > F = 0.7598; R-squared = 0.0000; Root MSE = .44879)

               c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               a |  -.0019419    .0063524   -0.31   0.760    -.0143953    .0105115
           _cons |   .2793354    .0063504   43.99   0.000     .2668859    .2917849

Regression of c on a conditional on w == 1 (the classic bad control case):

    . reg c a if w==1    (Number of obs = 2,923; F(1, 2921) = 461.03; R-squared = 0.1363; Root MSE = .45543)

               c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               a |  -.2462548    .0114688  -21.47   0.000    -.2687426   -.2237671
           _cons |   .5409842    .0106824   50.64   0.000     .5200384    .5619301

Regression of c on a conditional on w == 0:

    . reg c a if w==0    (Number of obs = 2,077; F(1, 2075) = 253.53; R-squared = 0.1089; Root MSE = .29524)

               c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
               a |  -.1632068    .0102499  -15.92   0.000    -.1833081   -.1431056
           _cons |  -.0348647    .0111572   -3.12   0.002    -.0567452   -.0129843

Unconditionally, c and a are uncorrelated; conditional on job position (within the white-collar group, and likewise within the blue-collar group) they are negatively correlated.

4. Binary Outcomes

Binary Outcomes
Assume that {Y_1i, Y_0i} ⊥ D_i holds. Then the expression

    E[Y_i | D_i = 1] − E[Y_i | D_i = 0] = E[Y_1i − Y_0i]

is valid, even if Y_i is binary or non-negative. If Y_i is binary we have

    E[Y_1i − Y_0i] = E[Y_1i] − E[Y_0i] = Pr[Y_1i = 1] − Pr[Y_0i = 1]

The linear probability model Y_i = α + τ D_i + u_i can be used to estimate the treatment effect.

Probit Model
We can also analyse the problem using a Probit model. Assume a latent variable Y*_i that satisfies

    Y*_i = β*_0 + β*_1 D_i + ν_i

where ν_i is distributed N(0,1). In a labor supply context, Y*_i could be the difference between the offered wage and the reservation wage. The observed binary outcome Y_i is assumed to be generated by the rule Y_i = 1[Y*_i > 0], where 1[·] is the indicator function. Then

    P(Y_i = 1 | D_i) = P(Y*_i > 0 | D_i) = P(ν_i < β*_0 + β*_1 D_i)

Because ν_i is standard normal we can write this as

    P(ν_i < β*_0 + β*_1 D_i) = Φ(β*_0 + β*_1 D_i)

where Φ is the standard normal CDF. Hence the CEF for Y_i can be written as

    E[Y_i | D_i] = Φ(β*_0 + β*_1 D_i)

The treatment effect is then

    E[Y_i | D_i = 1] − E[Y_i | D_i = 0] = Φ(β*_0 + β*_1) − Φ(β*_0)

Put differently,

    E[Y_i | D_i] = Φ(β*_0) + {Φ(β*_0 + β*_1) − Φ(β*_0)} D_i

This is a linear function of D_i, so the slope coefficient of a linear regression of Y_i on D_i is just the difference in probit fitted values. The probit coefficients β*_0 and β*_1 do not give the size of the effect unless we feed them back into the normal CDF.

RAND Health Insurance Experiment (HIE)
Individuals were randomly assigned to different health insurance contracts: free care vs. cost sharing (deductibles and co-payments). An important outcome in the HIE is the incidence of health care. Let us focus on one treatment: full insurance vs. insurance with patient cost sharing. Treatment is randomly assigned and denoted by D_i = 1. Then

    E[Y_i | D_i = 1] − E[Y_i | D_i = 0] = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 1] = E[Y_1i − Y_0i]

because D is independent of the potential outcomes.

Example: Probit
Use the dataset randdata.dta, which is a subsample of the original RAND data (available on Ilias). Let the outcome be the dummy pos_exp, which is equal to one if the person had positive medical expenditures (Stata: g pos_exp=meddol>0). Treatment is the dummy full, which is equal to one if the person has full insurance (Stata: g full=coins==0). The following slides show the linear regression and the Probit regression, including the transformation of the Probit coefficients into the causal effect.
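One way to carry out that transformation by hand, after fitting the probit shown below, is to feed the estimated index back into the normal CDF; nlcom also delivers a delta-method standard error. A sketch, equivalent in spirit to the margins command used below:

    * Implied effect of full insurance: difference in fitted probabilities
    * (run after: probit pos_exp i.full)
    nlcom (effect: normal(_b[_cons] + _b[1.full]) - normal(_b[_cons]))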
Example: Linear Regression

    . g pos_exp = meddol > 0
    . g full = coins == 0
    . reg pos_exp i.full      (Number of obs = 20,190; F(1, 20188) = 153.75; R-squared = 0.0076; Root MSE = .41307)

         pos_exp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          1.full |   .0723838    .0058375   12.40   0.000     .0609418    .0838258
           _cons |   .7400196    .0043082  171.77   0.000     .7315751     .748464

Example: Probit estimation

    . probit pos_exp i.full   (Number of obs = 20,190; LR chi2(1) = 152.08; Prob > chi2 = 0.0000; Log likelihood = -10576.387; Pseudo R2 = 0.0071)

         pos_exp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          1.full |   .2433819    .0197509   12.32   0.000     .2046707     .282093
           _cons |   .6434058    .0141041   45.62   0.000     .6157622    .6710493

Example: Probit marginal effects
The marginal effect is obtained with the command margins. The option dydx(*) calls for the derivative of y with respect to x (the * is the wildcard for all elements of x). If x is a dummy (indicated to Stata by specifying i.full), the difference in probabilities when going from full=0 to full=1 is computed.

    . margins, dydx(*)        (Conditional marginal effects; Number of obs = 20,190; Model VCE: OIM; Expression: Pr(pos_exp), predict(); dy/dx w.r.t.: 1.full; delta-method standard errors)

                 |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
          1.full |   .0723838     .005898   12.27   0.000     .0608239    .0839437

    Note: dy/dx for factor levels is the discrete change from the base level.

Bottom line: it makes no difference whether you use a linear regression or a more complicated binary response model!
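Plugging the probit estimates into Φ(β*_0 + β*_1) − Φ(β*_0) reproduces this result:

    Φ(0.6434 + 0.2434) − Φ(0.6434) = Φ(0.8868) − Φ(0.6434) ≈ 0.8124 − 0.7400 ≈ 0.0724

which is the same as the linear-regression coefficient on full and the marginal effect reported by margins. With a single binary regressor the model is saturated, so the fitted probabilities are simply the two sample means and the match is exact.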
