Causal Analysis Subclassification and Matching PDF

Summary

This document presents causal analysis methods for selection on observables, including subclassification, matching, and propensity-score techniques such as inverse probability weighting. The material is from Michael Gerfin's Spring 2024 course at the University of Bern and closes with an empirical example based on the NSW data studied by LaLonde (1986).

Full Transcript


Causal Analysis: Subclassification and Matching
Michael Gerfin, University of Bern, Spring 2024

Contents
1. Introduction
2. Subclassification and Regression
3. Matching
4. Methods based on the propensity score
5. Empirical example

Introduction

Subclassification and matching are strategies to control for selection bias. They are motivated by the conditional independence assumption (CIA): selection on observables, but not on unobservables. Subclassification: do the causal analysis in subgroups and aggregate. Matching in a nutshell: for each treatment unit, find control units that are (almost) identical in all relevant, observable characteristics ("statistical twins").

Purposes of Matching

1. Data cleaning before regression analysis (sometimes called pruning or pre-processing).
2. Use matching to generate data that can be analyzed as if it were the result of a randomized experiment → no regression necessary, a mean comparison reveals the ATT. But be careful: as opposed to a real randomized trial, here in general ATT ≠ ATE. Why? Because in the original non-experimental sample the covariates are not balanced, i.e. treatment and control group differ in their characteristics.

Example for pruning and common support problem

(Two figure slides.) The left-hand graph shows a scatterplot differentiated by treatment status: there are many control units for which there are no comparable treatment units. The right-hand graph shows E[Y | D, X] differentiated by D; the vertical distance is the treatment effect. The predictions are obtained from the coefficients of the regression Y = α + τD + βX + U. In the second version of the figure, control units without similar treatment units are shaded, i.e. pruned, and the same regression is re-estimated on the pruned data: the estimated treatment effect drops to 0.

Recall CIA

CIA: {Y_0i, Y_1i} ⊥⊥ D_i | X_i. Conditional on X, potential outcomes are independent of treatment status D. In other words, within each cell defined by the values of X, treatment is as good as randomly assigned, i.e. there is no selection effect. So the cell-specific causal effect can be estimated by OLS within each cell, and the average treatment effect is a weighted average of these cell-specific causal effects.

Running Example

To give some intuition we will look at an example with only two covariates, X1 and X2, both dummies. The observed outcome is Y_i = D_i·Y_1i + (1 − D_i)·Y_0i, where D_i is the treatment indicator. The data set is called sim_cia_2; it has 4,000 observations and is available on Ilias. The data were simulated with underlying causal effects

ATE = E[Y_1i − Y_0i] = 3
ATT = E[Y_1i − Y_0i | D_i = 1] = 4

(What does it mean that these are different?)
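The potential-outcomes notation of the running example can be made concrete with a short simulation. The following Stata sketch is purely illustrative and is not the code behind sim_cia_2; all coefficients are hypothetical and chosen only to produce selection on observables and a heterogeneous effect, so the resulting ATE and ATT will differ from the course values of 3 and 4.

* Illustrative simulation (not the actual sim_cia_2 data-generating process)
clear
set obs 4000
set seed 12345
gen byte x1 = runiform() < .5
gen byte x2 = runiform() < .4
* treatment probability depends on x1, x2 only: selection on observables
gen byte d = runiform() < invlogit(-1 + x1 + 2*x2)
* potential outcomes with a cell-specific treatment effect
gen double y0 = 1 + 3*x1 + 8*x2 + rnormal()
gen double y1 = y0 + 1 + 5*x2
gen double y  = d*y1 + (1-d)*y0      // observed outcome
* "true" effects, computable only because y0 and y1 are both observed here
gen double te = y1 - y0
summarize te                         // sample ATE
summarize te if d==1                 // sample ATT

Because treatment take-up rises with x2 and the effect is larger when x2 = 1, the ATT exceeds the ATE in this design, which is the same pattern as in the course data.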
Subclassification and Regression

Identification under CIA

Given the CIA, the selection bias disappears after conditioning on X, so the treatment effect on the treated can be obtained by iterating expectations over X:

τ_ATT ≡ E[Y_1i | D_i = 1] − E[Y_0i | D_i = 1] = E{ E[Y_1i | X_i, D_i = 1] − E[Y_0i | X_i, D_i = 1] | D_i = 1 }

E[Y_0i | X_i, D_i = 1] is counterfactual, but by the CIA

E[Y_0i | X_i, D_i = 1] = E[Y_0i | X_i, D_i = 0] = E[Y_i | X_i, D_i = 0]

Therefore

τ_ATT = E{ E[Y_i | X_i, D_i = 1] − E[Y_i | X_i, D_i = 0] | D_i = 1 } = E[τ_x | D_i = 1]

where τ_x ≡ E[Y_i | X_i = x, D_i = 1] − E[Y_i | X_i = x, D_i = 0]. In words, the τ_x are the coefficients from regressing Y on D separately for each possible value of X.

Average Treatment Effect on the Treated (ATT)

In the discrete-covariate setting, the ATT can be written as

τ_ATT = E[Y_1i − Y_0i | D_i = 1] = Σ_x τ_x · P(X_i = x | D_i = 1)

where P(X_i = x | D_i = 1) is the probability function of X_i given D_i = 1. The calculation of τ_ATT is thus a weighted average of X-specific differences in Y, using the empirical distribution of X among the treated. This is called the subclassification estimator (or exact matching estimator).

ATT in running example

Run the following regressions: reg y d if x1==0 & x2==0, and so on for the other three cells. The estimates of the cell-specific effects are

τ_0 = 1.1955447, τ_1 = 0.96121034, τ_2 = 6.4193684, τ_3 = 5.9344436

An easy way to obtain the weights is to tabulate the variable that identifies the four groups. This variable is called x3 in the data and is defined as

x3 = 0 if x1 = 0, x2 = 0
x3 = 1 if x1 = 1, x2 = 0
x3 = 2 if x1 = 0, x2 = 1
x3 = 3 if x1 = 1, x2 = 1

Now simply tab x3 if d. The resulting weights are

P(x3 = 0 | D_i = 1) = .215, P(x3 = 1 | D_i = 1) = .197, P(x3 = 2 | D_i = 1) = .291, P(x3 = 3 | D_i = 1) = .297

The resulting subclassification estimate of the ATT is 4.077.

Unconditional average treatment effect (ATE)

Write

τ_ATE = E[Y_1i − Y_0i] = E{ E[Y_1i | X_i, D_i = 1] − E[Y_0i | X_i, D_i = 0] } = Σ_x τ_x · P(X_i = x)

which is the expectation of τ_x using the marginal distribution of X_i (unconditional on D_i = 1). Note: τ_ATT is the average effect for the treated, τ_ATE is the average treatment effect for the entire population.

ATE in running example

To obtain the ATE weights you need to tab x3. The resulting weights are

P(x3 = 0) = .349, P(x3 = 1) = .263, P(x3 = 2) = .201, P(x3 = 3) = .187

The resulting subclassification estimate of the ATE is 3.070.
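The subclassification estimator can be computed by hand with a short loop. This is a minimal sketch, assuming the running-example variables y, d and the cell indicator x3 are in memory; it estimates each cell-specific effect by OLS and then averages with the treated-cell shares (ATT) and the full-sample shares (ATE).

* Minimal sketch of the subclassification estimator (assumes y, d, x3 exist)
quietly count if d==1
scalar n1 = r(N)
quietly count
scalar n = r(N)
scalar att = 0
scalar ate = 0
forvalues g = 0/3 {
    quietly reg y d if x3==`g'       // cell-specific effect tau_g
    scalar tau = _b[d]
    quietly count if x3==`g' & d==1
    scalar att = att + tau*r(N)/n1   // weight P(x3 = g | D = 1)
    quietly count if x3==`g'
    scalar ate = ate + tau*r(N)/n    // weight P(x3 = g)
}
display "Subclassification ATT = " att "   ATE = " ate

With the course data this loop should reproduce the estimates reported above (ATT ≈ 4.077, ATE ≈ 3.070).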
How About Just Running A Regression?

Y_i = α + τ·D_i + γ_1·X_1i + γ_2·X_2i + γ_3·X_1i·X_2i + ε_i

. reg y d i.x1##i.x2

      Source       SS        df      MS         Number of obs =    4000
       Model   99650.8798     4   24912.72      F(4, 3995)    = 4695.06
    Residual   21198.0951  3995   5.30615648    Prob > F      =  0.0000
       Total   120848.975  3999   30.2197987    R-squared     =  0.8246
                                                Adj R-squared =  0.8244
                                                Root MSE      =  2.3035

           y      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
           d   2.874028    .079915   35.96   0.000     2.71735    3.030706
        1.x1   2.801989   .0941887   29.75   0.000    2.617327    2.986652
        1.x2   7.842797   .1071487   73.20   0.000    7.632726    8.052868
   1.x1#1.x2   .3473209    .150128    2.31   0.021    .0529862    .6416556
       _cons    .453353   .0662999    6.84   0.000    .3233681    .5833379

Regression

Least squares regression coefficients do not identify the ATE or the ATT (unless the effect is constant). It can be shown that the OLS coefficient is also a weighted average of the cell-specific effects,

τ_R = Σ_x w(x)·τ_x  with  Σ_x w(x) = 1 and w(x) > 0,

but these weights have no meaningful intuition: they are chosen such that OLS minimizes the sum of squared residuals. (Recall that τ_x ≡ E[Y_i | X_i = x, D_i = 1] − E[Y_i | X_i = x, D_i = 0].)

ATT and ATE with regression

How to identify both ATE and ATT with regression:

1. Regress Y on X separately in the treatment and control group.
2. Predict Ŷ_1 and Ŷ_0 for each observation.

With these predictions compute

ATE = (1/n) Σ_i (Ŷ_1i − Ŷ_0i)
ATT = (1/n_1) Σ_{i: D_i = 1} (Ŷ_1i − Ŷ_0i)   (n_1 is the number of treated observations)

The ATE is the average of (Ŷ_1i − Ŷ_0i) in the full sample, the ATT is the average in the treated subsample. This is implemented in Stata as part of the teffects command: teffects ra (y x) (d), where ra stands for regression adjustment. (A by-hand version of this two-step procedure is sketched after the teffects output below.)

Regression with teffects

. teffects ra (y i.x1##i.x2) (d), ate
Treatment-effects estimation            Number of obs = 4,000
Estimator: regression adjustment        Outcome model: linear     Treatment model: none

                              Coef.   Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATE     d (1 vs 0)       3.070077            .080891   37.95   0.000    2.911534   3.228621
  POmean  d = 0            4.227337           .0675975   62.54   0.000    4.094849   4.359826

. teffects ra (y i.x1##i.x2) (d), atet
Treatment-effects estimation            Number of obs = 4,000
Estimator: regression adjustment        Outcome model: linear     Treatment model: none

                              Coef.   Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATET    d (1 vs 0)       4.076991           .0961247   42.41   0.000     3.88859   4.265392
  POmean  d = 0            5.347353           .0911311   58.68   0.000    5.168739   5.525966
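The regression-adjustment steps listed above can also be carried out manually. A minimal sketch, assuming y, d, x1, x2 are in memory; the point estimates should coincide with the teffects ra output, although teffects additionally provides standard errors that account for the estimation of the outcome model.

* Minimal sketch of regression adjustment by hand (assumes y, d, x1, x2 exist)
quietly reg y i.x1##i.x2 if d==1
predict double yhat1, xb             // predicted outcome under treatment, all obs
quietly reg y i.x1##i.x2 if d==0
predict double yhat0, xb             // predicted outcome under control, all obs
gen double diff = yhat1 - yhat0
summarize diff                        // mean = ATE estimate
summarize diff if d==1                // mean = ATT estimate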
Matching

General Matching Idea

1. For each treatment unit find control units with the same configuration of X.
2. Estimate Ŷ_0i as the weighted average of Y over all matched control units.
3. Repeat for all treatment units.
4. Estimate the ATT as the difference between the mean of Y_1i (the observed Y of the treated) and the mean of the estimated Ŷ_0i.

Example with ideal data

(Three figure slides illustrate matching with ideal data; after matching, the distribution of age is perfectly matched between treatment and control group.)

Matching estimator

Typical form of the estimator for the ATT:

(1/n_1) Σ_{i: D_i = 1} [ Y_i − Σ_{j: D_j = 0} w(i,j)·Y_j(i) ]

n_1 is the number of treated units, Y_j(i) is the outcome of the matched non-treated unit(s), and the w(i,j) are the weights for the computation of the counterfactual for treatment unit i, with Σ_j w(i,j) = 1. Y_j(i) can come from perfect matches as above (exact matching) or from the closest match(es), such that X_j(i) is closest to X_i. Different matching algorithms use different definitions of w(i,j).

Exact Matching

Use only matches with identical values of X. Then the weights are

w(i,j) = 1/k_i  if X_i = X_j,   w(i,j) = 0  if X_i ≠ X_j

with k_i the number of control observations for which X_j = X_i. In other words, Ŷ_j(i) is the arithmetic mean of Y over all matched control units. Problem: if X contains many variables, there is a large probability that no exact matches can be found for many observations (curse of dimensionality).

Matching Routines

There are several matching routines available in Stata (and in R as well). Prior to Stata v.13 the most popular routine was psmatch2. Since Stata v.13, matching is part of the teffects command. Further user-written matching routines: kmatch (ssc install kmatch) and radiusmatch (ssc install radiusmatch). The next slides illustrate the use of teffects with the running example from the previous section (where exact matching is possible).

teffects

. teffects nnmatch (y) (d), ate ematch(x1 x2)
Treatment-effects estimation            Number of obs = 4,000
Estimator: nearest-neighbor matching    Matches: requested = 1, min = 161, max = 970
Outcome model: matching                 Distance metric: Mahalanobis

                              Coef.   AI Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATE     d (1 vs 0)       3.070077             .0809538   37.92   0.000    2.911411   3.228744

. teffects nnmatch (y) (d), atet ematch(x1 x2)
(same header as above)

                              Coef.   AI Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATET    d (1 vs 0)       4.076991              .096191   42.38   0.000     3.88846   4.265521

Multivariate Distance Matching (MDM)

If perfect matches are not available, we need to define a distance metric that measures the proximity between observations in the multivariate space of X. A common approach is to use

MD(X_i, X_j) = sqrt{ (X_i − X_j)' Σ^(−1) (X_i − X_j) }

as the distance metric, where Σ is an appropriate scaling matrix. Mahalanobis matching: Σ is the covariance matrix of X. Euclidean matching: Σ is the identity matrix.

Matching Algorithms

Various matching algorithms exist to find potential matches based on MD and to determine the matching weights w(i,j). We focus on nearest-neighbor matching: for each observation i in the treatment group, find the M closest observations in the control group. A single control can be used multiple times as a match; in case of ties (multiple controls with the same MD), use all ties. M is set by the researcher.

ATT_hat = (1/n_1) Σ_{i: D_i = 1} [ Y_i − (1/M) Σ_{m=1..M} Y_j_m(i) ]

Other algorithms include caliper matching, radius matching, kernel matching, and coarsened exact matching. Since matching is no longer exact in these cases, it may make sense to apply regression adjustment to the matched data.

Matching Bias

If exact matching is not feasible (the realistic case), we need to define a norm ∥·∥ to measure matching discrepancies ∥X_i − X_j(i)∥; here X_j(i) denotes the covariate vector X_j of the control observation(s) j matched to treatment unit i. Matching discrepancies ∥X_i − X_j(i)∥ tend to increase with k, the dimension of X. They converge to zero, but very slowly if k is large. Intuitively, these discrepancies generate a bias because the estimate of E[Y_0i | X_i, D_i = 1] is based on X_j(i), not on X_i. It is difficult to find good matches in large dimensions → we need many observations if k is large.

Bias correction

Each treated observation contributes E[Y_0 | X_i] − E[Y_0 | X_j(i)] to the bias. Bias-corrected matching addresses this problem by trying to estimate the bias. Basic idea: run a regression of Y on X in the control group to obtain μ̂_0(X), the estimated expected value of E[Y_0 | X]. The estimate of the bias is then B_i = μ̂_0(X_i) − μ̂_0(X_j(i)), and the ATT is estimated (for M = 1) by

(1/n_1) Σ_{i: D_i = 1} [ Y_i − Y_j(i) − B_i ]
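In Stata this bias correction is available through the biasadj() option of teffects nnmatch, which reappears in the empirical example at the end of this document. A minimal sketch with the running-example variable names; with only two exactly matchable dummies the adjustment is hardly needed, it matters when covariates are continuous and matches are inexact.

* Minimal sketch: nearest-neighbor matching (M = 4 requested via nn()) with
* regression-based bias adjustment on the matching covariates
teffects nnmatch (y x1 x2) (d), atet nn(4) biasadj(x1 x2)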
Matching bias: implications for practice

The bias arises because large matching discrepancies make the difference between E[Y_0 | X_i] and E[Y_0 | X_j(i)] large. To minimize matching discrepancies:

1. Use a small number of matches, M. Large values of M produce large discrepancies because each subsequent match is worse than the one before.
2. Use matching with replacement. Matching without replacement can throw away the best matches for other treatment units, thereby increasing discrepancies.
3. Try to match covariates with a large effect on E[Y_0] particularly well.

Methods based on the propensity score

Propensity score

Matching on all covariates is often not feasible because X has too many elements (curse of dimensionality). Can the dimension of the matching variables be reduced? Use the probability of treatment conditional on X, P(D = 1 | X), instead of X as the single matching variable. The control observation that is closest in terms of the treatment probability is then used as the match for each treatment observation.

Propensity score theorem

Define p(X_i) = E[D_i | X_i].

Theorem: Suppose {Y_1i, Y_0i} ⊥⊥ D_i | X_i (CIA). Then {Y_1i, Y_0i} ⊥⊥ D_i | p(X_i).

Implications: the propensity score theorem says you need only control for covariates that affect the probability of treatment. Moreover, the only covariate you really need to control for is the probability of treatment itself. In practice there are two steps: (1) estimate p(X_i), e.g. by logit or probit; (2) estimate the ATE or ATT by matching on the fitted values from the first step, or by a weighting scheme (see below).

Propensity score theorem in a DAG

(Figure.) The back-door path D ← p(X) ← X → Y is closed by conditioning on p(X).

Common support

Common support: only use observations for which a comparable unit exists in the other group. The propensity score allows for an easy check of common support. Minimum requirement: 0 < p(X_i) < 1. Further suggestions in the literature:

- discard all treated units for which p(X_i) < max{ min_c p(X_i), min_t p(X_i) } or p(X_i) > min{ max_c p(X_i), max_t p(X_i) }, where c and t index the control and treatment samples
- discard all treated units for which p(X_i) > .9 or p(X_i) < .1 (Crump, Hotz, Imbens and Mitnik, 2006)
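A minimal Stata sketch of the two-step logic just described: estimate the propensity score with a logit and then inspect and impose the .1/.9 rule of thumb. The variable names (d, x1, x2) are taken from the running example; with real data the covariate list would of course be richer.

* Minimal sketch: estimate the propensity score and check common support
logit d i.x1##i.x2                   // step 1: treatment model
predict double pscore, pr            // fitted propensity score p(X)
summarize pscore if d==1
summarize pscore if d==0             // compare the supports of the two groups
gen byte insupport = pscore > .1 & pscore < .9
tab insupport d                      // how many units fall outside [.1, .9]?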
Weighting and weighted regression

Matching was very popular until recently. However, a growing number of researchers feel that matching is somewhat cumbersome in practice and subject to many researcher degrees of freedom (which matching routine, how many neighbors, how to impose common support, ...). For these reasons, approaches like inverse probability weighting (IPW) and regression combined with IPW have become more popular. The following slides provide a brief introduction to these approaches.

Inverse Probability Weighting (IPW)

The CIA implies

E[Y_1i] = E[ Y_i·D_i / p(X_i) ]  and  E[Y_0i] = E[ Y_i·(1 − D_i) / (1 − p(X_i)) ]

Given an estimate of p(X_i) we can construct an estimate of the average treatment effect from the sample analogue of

E[Y_1i − Y_0i] = E[ Y_i·D_i / p(X_i) − Y_i·(1 − D_i) / (1 − p(X_i)) ] = E[ (D_i − p(X_i))·Y_i / ( p(X_i)·(1 − p(X_i)) ) ]

The first expression is easier to interpret: intuitively, observations with a large p(X) are over-represented in the treatment group and are thus weighted down when treated, and weighted up when untreated.

Similarly, the ATT can be estimated from

E[Y_1i − Y_0i | D_i = 1] = E[ (D_i − p(X_i))·Y_i / ( (1 − p(X_i))·P(D_i = 1) ) ]

Combining IPW with regression essentially means that the regression adjustment is done not using the controls, but using the inverse propensity scores as weights in a weighted regression.

IPW with Stata

. teffects ipw (y) (d i.x1##i.x2), ate
Iteration 0: EE criterion = 1.030e-16
Iteration 1: EE criterion = 1.570e-30
Treatment-effects estimation            Number of obs = 4,000
Estimator: inverse-probability weights  Outcome model: weighted mean   Treatment model: logit

                              Coef.   Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATE     d (1 vs 0)       3.070077            .080891   37.95   0.000    2.911534   3.228621
  POmean  d = 0            4.227337           .0675975   62.54   0.000    4.094849   4.359826

True ATE: 3

. teffects ipw (y) (d i.x1##i.x2), atet
Iteration 0: EE criterion = 1.030e-16
Iteration 1: EE criterion = 3.569e-30
Treatment-effects estimation            Number of obs = 4,000
Estimator: inverse-probability weights  Outcome model: weighted mean   Treatment model: logit

                              Coef.   Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATET    d (1 vs 0)       4.076991           .0961247   42.41   0.000     3.88859   4.265392
  POmean  d = 0            5.347353           .0911311   58.68   0.000    5.168739   5.525966

True ATT: 4
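The IPW formula for the ATE can also be evaluated directly as a plug-in estimator. This is a minimal sketch, assuming y, d, x1, x2 are in memory; it re-estimates the propensity score if the pscore variable from the earlier sketch does not exist. The point estimate should be close to the teffects ipw result, but teffects additionally delivers standard errors that account for the first-step estimation of p(X).

* Minimal sketch of the plug-in IPW estimator for the ATE
capture confirm variable pscore
if _rc {
    quietly logit d i.x1##i.x2
    predict double pscore, pr
}
gen double ipw_term = (d - pscore)*y / (pscore*(1 - pscore))
summarize ipw_term                   // mean = IPW estimate of the ATE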
Empirical example

Example from MHE

Analysis of the NSW (National Supported Work) data (available on Ilias). Originally a randomized experiment in which training was randomly assigned to unemployed individuals; the outcome is labor earnings after training (in the year 1978). LaLonde (1986) compared the results from the NSW randomized study to econometric results using non-experimental control groups drawn from the CPS and the PSID. Main finding: plausible non-experimental methods generated a wide range of results, many of which were far from the experimental baseline. Can we obtain better estimates using the methods discussed in this chapter?

The example looks at two CPS comparison groups: a largely unselected sample (CPS-1) and a narrower comparison group selected from the recently unemployed (CPS-3). The NSW treatment group and the randomly selected NSW control group are very similar. The CPS-1 sample is very different from the experimental groups. The CPS-3 sample matches the NSW treatment group more closely but still shows differences, particularly in terms of pre-program earnings. The final two columns are obtained by imposing common support on the samples (.1 < p̂(X) < .9), which improves balance significantly (still not perfect).

Descriptive Statistics

(Table shown in the slides, not reproduced here.)

Results

Table 3.3.3 in MHE reports OLS regression estimates of the NSW treatment effect; the rows of the table show results with alternative sets of controls. Focus on the rows marked in red: there is a huge bias in the raw difference; using the full set of controls, the estimates get closer to the experimental benchmark, especially for CPS-3; imposing common support improves the estimate a lot for CPS-1 (not for CPS-3, though). However, the construction of CPS-3 was somewhat ad hoc, whereas imposing common support is data driven.

Results in MHE

(Table shown in the slides, not reproduced here.)

Matching approaches, CPS-1

Matching should not be affected by the highly unbalanced treatment and control groups as much as the other approaches, because matching explicitly addresses this problem. The following slides show results for matching on the propensity score and for nearest-neighbor matching (with M = 1). Both come close to the experimental benchmark, indicating that matching works even in this extremely unbalanced case. I also show how matching significantly improves covariate balance. The third set of results illustrates the matching bias when M = 10, and how the bias correction described above succeeds in removing the bias.

Matching on propensity score, CPS-1

. teffects psmatch (re78) (treat age age2 ed black hisp married nodeg re74 re75), atet
Treatment-effects estimation            Number of obs = 16,177
Estimator: propensity-score matching    Matches: requested = 1, min = 1, max = 9
Outcome model: matching                 Treatment model: logit

                                 Coef.   AI Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATET    treat (1 vs 0)      2031.696              879.0842    2.31   0.021    308.7224   3754.669

. tebalance summarize
note: refitting the model using the generate() option

Covariate balance summary       Raw        Matched
  Number of obs              16,177            370
  Treated obs                   185            185
  Control obs                15,992            185

             Standardized differences        Variance ratio
                    Raw       Matched        Raw      Matched
  age         -.7961833     -.1728696   .4196365     .8593841
  age2        -.8031274     -.1738768   .3020035     .9286768
  ed          -.6785021      .0439621   .4905163      .502087
  black        2.427747             0   1.950622            1
  hisp        -.0506973      .1003586   .8410938     1.536116
  married     -1.232648      .0420061   .7516725     1.072304
  nodeg        .9038111      .0930614   .9975255     .9276161
  re74         -1.56899     -.0541769   .2607425     1.467212
  re75        -1.746428      .0827304   .1205903     1.894263
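Beyond tebalance summarize, Stata offers further postestimation diagnostics after teffects estimators that fit a treatment model. A minimal sketch (re74 is one of the pre-program earnings variables used above):

* Minimal sketch of diagnostics after teffects psmatch
teffects overlap                     // overlap plot of the estimated propensity scores
tebalance density re74               // density of re74 in raw vs. matched samples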
Matching on nearest neighbor (M = 1), CPS-1

. teffects nnmatch (re78 age age2 ed black hisp married nodeg re74 re75) (treat), atet
Treatment-effects estimation            Number of obs = 16,177
Estimator: nearest-neighbor matching    Matches: requested = 1, min = 1, max = 9
Outcome model: matching                 Distance metric: Mahalanobis

                                 Coef.   AI Robust Std. Err.      z    P>|z|   [95% Conf. Interval]
  ATET    treat (1 vs 0)      1541.959              791.0628    1.95   0.051   -8.495899   3092.413

. tebalance summarize
note: refitting the model using the generate() option

Covariate balance summary       Raw        Matched
  Number of obs              16,177            370
  Treated obs                   185            185
  Control obs                15,992            185

             Standardized differences        Variance ratio
                    Raw       Matched        Raw      Matched
  age         -.7961833     -.0422592   .4196365     .9287735
  age2        -.8031274     -.0452085   .3020035      .921322
  ed          -.6785021     -.0108371   .4905163     1.031697
  black        2.427747             0   1.950622            1
  hisp        -.0506973             0   .8410938            1
  married     -1.232648             0   .7516725            1
  nodeg        .9038111             0   .9975255            1
  re74         -1.56899     -.0695601   .2607425     1.216494
  re75        -1.746428     -.0911758   .1205903     .8348652

Matching on nearest neighbor (M = 10), CPS-1

. teffects nnmatch (re78 age age2 ed black hisp married nodeg re74 re75) (treat), atet nn(10)
Treatment-effects estimation            Number of obs = 16,177
Estimator: nearest-neighbor matching    Matches: requested = 10, min = 10, max = 13
Outcome model: matching                 Distance metric: Mahalanobis

                                 Coefficient   AI robust std. err.      z    P>|z|   [95% conf. interval]
  ATET    treat (1 vs 0)            737.4727              677.4011    1.09   0.276    -590.209   2065.154

. teffects nnmatch (re78 age age2 ed black hisp married nodeg re74 re75) (treat), atet biasadj(re74 re75 age age2) nn(10)
Treatment-effects estimation            Number of obs = 16,177
Estimator: nearest-neighbor matching    Matches: requested = 10, min = 10, max = 13
Outcome model: matching                 Distance metric: Mahalanobis

                                 Coefficient   AI robust std. err.      z    P>|z|   [95% conf. interval]
  ATET    treat (1 vs 0)            1628.964              684.1711    2.38   0.017    288.0131   2969.915
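To line the CPS-1 estimates up against each other (and against the experimental benchmark), the fitted models can be stored and tabulated. A minimal sketch using Stata's estimates store / estimates table; the covariate list is the one used throughout this example.

* Minimal sketch: collect the CPS-1 ATET estimates side by side
quietly teffects psmatch (re78) (treat age age2 ed black hisp married nodeg re74 re75), atet
estimates store psm
quietly teffects nnmatch (re78 age age2 ed black hisp married nodeg re74 re75) (treat), atet
estimates store nn1
quietly teffects nnmatch (re78 age age2 ed black hisp married nodeg re74 re75) (treat), atet nn(10) biasadj(re74 re75 age age2)
estimates store nn10bc
estimates table psm nn1 nn10bc, se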
