Counterfactuals and Their Applications
Summary
This document explores counterfactuals and their applications in causal inference and statistics, including examples like recruitment to a program and additive interventions. It features multiple study questions.
Full Transcript
…multiplicative (i.e., nonlinear) interaction term is added to the output equation. For example, if the arrow X → H were reversed in Figure 4.1, and the equation for Y read

Y = bX + cH + δXH + U_Y    (4.19)

τ would differ from ETT. We leave it to the reader as an exercise to show that the difference τ − ETT then equals δa∕(1 + a²) (see Study question 4.3.2(c)).

Study questions

Study question 4.3.2

(a) Describe how the parameters a, b, c in Figure 4.1 can be estimated from nonexperimental data.
(b) In the model of Figure 4.3, find the effect of education on those students whose salary is Y = 1. [Hint: Use Theorem 4.3.2 to compute E[Y1 − Y0 | Y = 1].]
(c) Estimate τ and the ETT = E[Y1 − Y0 | X = 1] for the model described in Eq. (4.19). [Hint: Use the basic definition of counterfactuals, Eq. (4.5), and the equality E[Z | X = x′] = R_ZX x′.]

4.4 Practical Uses of Counterfactuals

Now that we know how to compute counterfactuals, it will be instructive—and motivating—to see counterfactuals put to real use. In this section, we examine examples of problems that seem baffling at first, but that can be solved using the techniques we just laid out. Hopefully, the reader will leave this chapter with both a better understanding of how counterfactuals are used and a deeper appreciation of why we would want to use them.

4.4.1 Recruitment to a Program

Example 4.4.1 A government is funding a job training program aimed at getting jobless people back into the workforce. A pilot randomized experiment shows that the program is effective; a higher percentage of people were hired among those who finished the program than among those who did not go through the program. As a result, the program is approved, and a recruitment effort is launched to encourage enrollment among the unemployed, by offering the job training program to any unemployed person who elects to enroll. Lo and behold, enrollment is successful, and the hiring rate among the program's graduates turns out even higher than in the randomized pilot study. The program developers are happy with the results and decide to request additional funding.

Oddly, critics claim that the program is a waste of taxpayers' money and should be terminated. Their reasoning is as follows: While the program was somewhat successful in the experimental study, where people were chosen at random, there is no proof that the program accomplishes its mission among those who choose to enroll of their own volition. Those who self-enroll, the critics say, are more intelligent, more resourceful, and more socially connected than the eligible who did not enroll, and are more likely to have found a job regardless of the training.

The critics claim that what we need to estimate is the differential benefit of the program on those enrolled: the extent to which the hiring rate has increased among the enrolled, compared to what it would have been had they not been trained. Using our subscript notation for counterfactuals, and letting X = 1 represent training and Y = 1 represent hiring, the quantity that needs to be evaluated is the effect of training on the trained (ETT, better known as "effect of treatment on the treated," Eq. (4.18)):

ETT = E[Y1 − Y0 | X = 1]    (4.20)

Here the difference Y1 − Y0 represents the causal effect of training (X) on hiring (Y) for a randomly chosen individual, and the condition X = 1 limits the choice to those actually choosing the training program on their own initiative.
As in our freeway example of Section 4.1, we are witnessing a clash between the antecedent (X = 0) of the counterfactual Y0 (hiring had training not taken place) and the event it is conditioned on, X = 1. However, whereas the counterfactual analysis in the freeway example had no tangible consequences save for a personal regret statement—"I should have taken the freeway"—here the consequences have serious economic implications, such as terminating a training program, or possibly restructuring the recruitment strategy to attract people who would benefit more from the program offered.

The expression for ETT does not appear to be estimable from either observational or experimental data. The reason rests, again, in the clash between the subscript of Y0 and the event X = 1 on which it is conditioned. Indeed, E[Y0 | X = 1] stands for the expectation that a trained person (X = 1) would find a job had he/she not been trained. This counterfactual expectation seems to defy empirical measurement because we can never rerun history and deny training to those who received it. However, we see in the subsequent sections of this chapter that, despite this clash of worlds, the expectation E[Y0 | X = 1] can be reduced to estimable expressions in many, though not all, situations.

One such situation occurs when a set Z of covariates satisfies the backdoor criterion with regard to the treatment and outcome variables. In such a case, ETT probabilities are given by a modified adjustment formula:

P(Yx = y | X = x′) = Σ_z P(Y = y | X = x, Z = z) P(Z = z | X = x′)    (4.21)

This follows directly from Theorem 4.3.1, since conditioning on Z = z gives

P(Yx = y | x′) = Σ_z P(Yx = y | z, x′) P(z | x′)

but Theorem 4.3.1 permits us to replace x′ with x, which by virtue of (4.6) permits us to remove the subscript x from Yx, yielding (4.21). Comparing (4.21) to the standard adjustment formula of Eq. (3.5),

P(Y = y | do(X = x)) = Σ_z P(Y = y | X = x, Z = z) P(Z = z)

we see that both formulas call for conditioning on Z = z and averaging over z, except that (4.21) calls for a different weighted average, with P(Z = z | X = x′) replacing P(Z = z).

Using Eq. (4.21), we readily get an estimable, noncounterfactual expression for ETT:

ETT = E[Y1 − Y0 | X = 1]
    = E[Y1 | X = 1] − E[Y0 | X = 1]
    = E[Y | X = 1] − Σ_z E[Y | X = 0, Z = z] P(Z = z | X = 1)

where the first term in the final expression is obtained using the consistency rule of Eq. (4.6). In other words, E[Y1 | X = 1] = E[Y | X = 1] because, conditional on X = 1, the value that Y would get had X been 1 is simply the observed value of Y.

Another situation permitting the identification of ETT occurs for binary X whenever both experimental and nonexperimental data are available, in the form of P(Y = y | do(X = x)) and P(X = x, Y = y), respectively. Still another occurs when an intermediate variable is available between X and Y satisfying the front-door criterion (Figure 3.10(b)). What is common to these situations is that an inspection of the causal graph can tell us whether ETT is estimable and, if so, how.

Study questions

Study question 4.4.1

(a) Prove that, if X is binary, the effect of treatment on the treated can be estimated from both observational and experimental data. [Hint: Decompose E[Yx] into E[Yx] = E[Yx | x′] P(x′) + E[Yx | x] P(x).]
(b) Apply the result of Question (a) to Simpson's story with the nonexperimental data of Table 1.1, and estimate the effect of treatment on those who used the drug by choice. [Hint: Estimate E[Yx] assuming that gender is the only confounder.]
(c) Repeat Question (b) using Theorem 4.3.1 and the fact that Z in Figure 3.3 satisfies the backdoor criterion. Show that the answers to (b) and (c) coincide.
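To make the modified adjustment formula of Eq. (4.21) concrete, here is a minimal Python sketch of the resulting ETT estimator for a binary treatment X and a single binary covariate Z that is assumed to satisfy the backdoor criterion. The conditional probabilities in it are hypothetical illustration values, not figures from any table in this book; only the formula being evaluated comes from the text above.

```python
# A minimal numerical sketch of the ETT adjustment formula, Eq. (4.21),
# for a binary treatment X and a single binary covariate Z assumed to
# satisfy the backdoor criterion.  All probabilities below are hypothetical
# illustration values.

# P(Z = z | X = x): covariate distribution among treated (X=1) and untreated (X=0)
p_z_given_x = {1: {0: 0.3, 1: 0.7},   # P(Z = z | X = 1)
               0: {0: 0.6, 1: 0.4}}   # P(Z = z | X = 0)

# E[Y | X = x, Z = z]: observed conditional expectations (hypothetical)
e_y_given_xz = {(1, 0): 0.50, (1, 1): 0.80,
                (0, 0): 0.30, (0, 1): 0.60}

# E[Y1 | X = 1] follows from the consistency rule: average over Z among the treated
e_y1_given_x1 = sum(e_y_given_xz[(1, z)] * p_z_given_x[1][z] for z in (0, 1))

# E[Y0 | X = 1] via Eq. (4.21): condition on X = 0 but weight by P(Z = z | X = 1)
e_y0_given_x1 = sum(e_y_given_xz[(0, z)] * p_z_given_x[1][z] for z in (0, 1))

ett = e_y1_given_x1 - e_y0_given_x1
print(f"E[Y1 | X=1] = {e_y1_given_x1:.3f}")
print(f"E[Y0 | X=1] = {e_y0_given_x1:.3f}  (modified adjustment formula)")
print(f"ETT         = {ett:.3f}")
```

Replacing the weights P(Z = z | X = 1) in the second sum with the marginal P(Z = z) would turn the same loop into the standard adjustment formula of Eq. (3.5), which is exactly the contrast drawn in the comparison above.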
4.4.2 Additive Interventions

Example 4.4.2 In many experiments, the external manipulation consists of adding (or subtracting) some amount from a variable X without disabling preexisting causes of X, as required by the do(x) operator. For example, we might give 5 mg/l of insulin to a group of patients with varying levels of insulin already in their systems. Here, the preexisting causes of the manipulated variable continue to exert their influences, and a new quantity is added, allowing for differences among units to continue. Can the effect of such interventions be predicted from observational studies, or from experimental studies in which X was set uniformly to some predetermined value x?

If we write our question using counterfactual variables, the answer becomes obvious. Suppose we were to add a quantity q to a treatment variable X that is currently at level X = x′. The resulting outcome would be Y_{x′+q}, and the average value of this outcome over all units currently at level X = x′ would be E[Yx | x′], with x = x′ + q. Here, we meet again the ETT expression E[Yx | x′], to which we can apply the results described in the previous example. In particular, we can conclude immediately that, whenever a set Z in our model satisfies the backdoor criterion, the effect of an additive intervention is estimable using the ETT adjustment formula of Eq. (4.21). Substituting x = x′ + q in (4.21) and taking expectations gives the effect of this intervention, which we call add(q):

E[Y | add(q)] − E[Y] = Σ_{x′} E[Y_{x′+q} | X = x′] P(X = x′) − E[Y]
                     = Σ_{x′} Σ_z E[Y | X = x′ + q, Z = z] P(Z = z | X = x′) P(X = x′) − E[Y]    (4.22)

In our example, Z may include variables such as age, weight, or genetic disposition; we require only that each of those variables be measured and that they satisfy the backdoor condition. Similarly, estimability is assured for all other cases in which ETT is identifiable.
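Along the same lines, the add(q) estimator of Eq. (4.22) can be sketched in a few lines. The treatment distribution, covariate distribution, and regression function below are all hypothetical, and the sketch assumes that E[Y | X = x, Z = z] is available at the shifted levels x′ + q as well (here it is a made-up fitted regression); only the weighting scheme comes from the text.

```python
# A minimal sketch of the add(q) estimator in Eq. (4.22), assuming a single
# binary backdoor covariate Z and a discrete treatment X.  The distributions
# and the regression function e_y below are hypothetical illustration values.

q = 1.0                                   # increment added to everyone's X

p_x = {0: 0.2, 1: 0.5, 2: 0.3}            # P(X = x') in the target population
p_z_given_x = {0: {0: 0.7, 1: 0.3},       # P(Z = z | X = x')
               1: {0: 0.5, 1: 0.5},
               2: {0: 0.2, 1: 0.8}}

def e_y(x, z):
    """Hypothetical observed regression E[Y | X = x, Z = z]."""
    return 1.0 + 0.8 * x + 0.5 * z

# E[Y]: average outcome under no intervention
e_y_baseline = sum(e_y(x, z) * p_z_given_x[x][z] * p_x[x]
                   for x in p_x for z in (0, 1))

# E[Y | add(q)]: every unit at level x' is evaluated at x' + q, but the
# covariate weights P(Z = z | X = x') stay those of the unit's original level
e_y_add_q = sum(e_y(x + q, z) * p_z_given_x[x][z] * p_x[x]
                for x in p_x for z in (0, 1))

print(f"E[Y | add(q)] - E[Y] = {e_y_add_q - e_y_baseline:.3f}")
```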
This example demonstrates the use of counterfactuals to estimate the effect of practical interventions, which cannot always be described as do-expressions, but may nevertheless be estimated under certain circumstances. A question naturally arises: Why do we need to resort to counterfactuals to predict the effect of a rather common intervention, one that could be estimated by a straightforward clinical trial at the population level? We simply split a randomly chosen group of subjects into two parts, subject half of them to an add(q) type of intervention, and compare the expected value of Y in this group to that obtained in the add(0) group. What is it about additive interventions that forces us to seek the advice of a convoluted oracle, in the form of counterfactuals and ETT, when the answer can be obtained by a simple randomized trial?

The answer is that we need to resort to counterfactuals only because our target quantity, E[Y | add(q)], could not be reduced to do-expressions, and it is through do-expressions that scientists report their experimental findings. This does not mean that the desired quantity E[Y | add(q)] cannot be obtained from a specially designed experiment; it means only that, save for conducting such a special experiment, the desired quantity cannot be inferred from scientific knowledge or from a standard experiment in which X is set to X = x uniformly over the population. The reason we seek to base policies on such ideal standard experiments is that they capture scientific knowledge. Scientists are interested in quantifying the effect of increasing insulin concentration in the blood from a given level X = x to another level X = x + q, and this increase is captured by the do-expression E[Y | do(X = x + q)] − E[Y | do(X = x)]. We label it "scientific" because it is biologically meaningful, namely, its implications are invariant across populations (indeed, laboratory blood tests report patients' concentration levels, X = x, which are tracked over time). In contrast, the policy question in the case of additive interventions does not have this invariance feature; it asks for the average effect of adding an increment q to everyone, regardless of the current x level of each individual in this particular population. It is not immediately transportable, because it is highly sensitive to the probability P(X = x) in the population under study. This creates a mismatch between what science tells us and what policy makers ask us to estimate. It is no wonder, therefore, that we need to resort to a unit-level analysis (i.e., counterfactuals) in order to translate from one language into another.

The reader may also wonder why E[Y | add(q)] is not equal to the average causal effect

Σ_x [E[Y | do(X = x + q)] − E[Y | do(X = x)]] P(X = x)

After all, if we know that adding q to an individual at level X = x would increase its expected Y by E[Y | do(X = x + q)] − E[Y | do(X = x)], then averaging this increase over X should give us the answer to the policy question E[Y | add(q)]. Unfortunately, this average does not capture the policy question. This average represents an experiment in which subjects are chosen at random from the population, a fraction P(X = x) are given an additional dose q, and the rest are left alone. But things are different in the policy question at hand, since P(X = x) represents the proportion of subjects who entered level X = x by free choice, and we cannot rule out the possibility that subjects who attain X = x by free choice would react to add(q) differently from subjects who "receive" X = x by experimental decree. For example, it is quite possible that subjects who are highly sensitive to add(q) would attempt to lower their X level, given the choice. We translate into counterfactual analysis and write the inequality:

E[Y | add(q)] = Σ_x E[Y_{x+q} | x] P(X = x) ≠ Σ_x E[Y_{x+q}] P(X = x)

Equality holds only when Yx is independent of X, a condition that amounts to nonconfounding (see Theorem 4.3.1). Absent this condition, the estimation of E[Y | add(q)] can be accomplished either by a q-specific intervention or through stronger assumptions that enable the translation of ETT to do-expressions, as in Eq. (4.21).

Study question 4.4.2

Joe has never smoked before but, as a result of peer pressure and other personal factors, he decided to start smoking. He buys a pack of cigarettes, comes home, and asks himself: "I am about to start smoking, should I?"

(a) Formulate Joe's question mathematically, in terms of ETT, assuming that the outcome of interest is lung cancer.
(b) What type of data would enable Joe to estimate his chances of getting cancer given that he goes ahead with the decision to smoke, versus refraining from smoking?
(c) Use the data in Table 3.1 to estimate the chances associated with the decision in (b).

4.4.3 Personal Decision Making

Example 4.4.3 Ms Jones, a cancer patient, is facing a tough decision between two possible treatments: (i) lumpectomy alone or (ii) lumpectomy plus irradiation. In consultation with her oncologist, she decides on (ii). Ten years later, Ms Jones is alive, and the tumor has not recurred. She speculates: Do I owe my life to irradiation? Mrs Smith, on the other hand, had a lumpectomy alone, and her tumor recurred after a year. And she is regretting: I should have gone through irradiation. Can these speculations ever be substantiated from statistical data? Moreover, what good would it do to confirm Ms Jones's triumph or Mrs Smith's regret?

The overall effectiveness of irradiation can, of course, be determined by randomized experiments. Indeed, on October 17, 2002, the New England Journal of Medicine published a paper by Fisher et al. describing a 20-year follow-up of a randomized trial comparing lumpectomy alone and lumpectomy plus irradiation. The addition of irradiation to lumpectomy was shown to cause substantially fewer recurrences of breast cancer (14% vs 39%). These, however, were population results. Can we infer from them the specific cases of Ms Jones and Mrs Smith? And what would we gain if we do, aside from supporting Ms Jones's satisfaction with her decision or intensifying Mrs Smith's sense of failure?

To answer the first question, we must first cast the concerns of Ms Jones and Mrs Smith in mathematical form, using counterfactuals. If we designate remission by Y = 1 and the decision to undergo irradiation by X = 1, then the probability that determines whether Ms Jones is justified in attributing her remission to the irradiation (X = 1) is

PN = P(Y0 = 0 | X = 1, Y = 1)    (4.23)

It reads: the probability that remission would not have occurred (Y = 0) had Ms Jones not gone through irradiation, given that she did in fact go through irradiation (X = 1), and remission did occur (Y = 1). The label PN stands for "probability of necessity"; it measures the degree to which Ms Jones's decision was necessary for her positive outcome.

Similarly, the probability that Mrs Smith's regret is justified is given by

PS = P(Y1 = 1 | X = 0, Y = 0)    (4.24)

It reads: the probability that remission would have occurred had Mrs Smith gone through irradiation (Y1 = 1), given that she did not in fact go through irradiation (X = 0), and remission did not occur (Y = 0). PS stands for the "probability of sufficiency," measuring the degree to which the action X = 1, which was not taken, would have been sufficient for remission.

We see that these expressions have almost the same form (save for interchanging ones with zeros) and, moreover, both are similar to Eq. (4.1), save for the fact that Y in the freeway example was a continuous variable, so its expected value was the quantity of interest. These two probabilities (sometimes referred to as "probabilities of causation") play a major role in all questions of "attribution," ranging from legal liability to personal decision making. They are not, in general, estimable from either observational or experimental data, but as we shall see below, they are estimable under certain conditions, when both observational and experimental data are available.
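Because PN and PS are defined over joint potential outcomes, they can always be computed inside a fully specified structural model, even though they are not, in general, estimable from data alone. The following Monte Carlo sketch uses a toy model, whose equations are entirely invented for illustration, to evaluate Eqs. (4.23) and (4.24) by brute force; Section 4.5.1 shows what can be said when only data, and not the model, are available.

```python
# A small Monte Carlo sketch of the definitions in Eqs. (4.23) and (4.24).
# The structural model below is hypothetical; because it is fully specified,
# both potential outcomes Y0 and Y1 are known for every unit, so PN and PS
# can be read off directly.  With real data this is impossible, which is why
# the bounds of Section 4.5.1 are needed.
import random

random.seed(0)

def simulate_unit():
    """One unit of a toy model: U drives both treatment choice and response."""
    u = random.random()                               # unobserved background factor
    x = 1 if random.random() < 0.3 + 0.4 * u else 0   # treatment chosen by the unit
    y0 = 1 if u > 0.7 else 0                          # outcome had treatment been withheld
    y1 = 1 if u > 0.4 else 0                          # outcome had treatment been taken
    y = y1 if x == 1 else y0                          # factual outcome (consistency rule)
    return x, y, y0, y1

units = [simulate_unit() for _ in range(200_000)]

# PN = P(Y0 = 0 | X = 1, Y = 1): among treated units with a good outcome,
# how often would the outcome have been bad without treatment?
treated_success = [(y0, y1) for x, y, y0, y1 in units if x == 1 and y == 1]
pn = sum(1 for y0, _ in treated_success if y0 == 0) / len(treated_success)

# PS = P(Y1 = 1 | X = 0, Y = 0): among untreated units with a bad outcome,
# how often would treatment have produced a good outcome?
untreated_failure = [(y0, y1) for x, y, y0, y1 in units if x == 0 and y == 0]
ps = sum(1 for _, y1 in untreated_failure if y1 == 1) / len(untreated_failure)

print(f"PN = {pn:.2f}, PS = {ps:.2f}")
```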
But before commencing a quantitative analysis, let us address our second question: What is gained by assessing these retrospective counterfactual parameters? One answer is that notions such as regret and success, being right or being wrong, have more than just emotional value; they play important roles in cognitive development and adaptive learning. Confirmation of Ms Jones's triumph reinforces her confidence in her decision-making strategy, which may include her sources of medical information, her attitude toward risks, and her sense of priority, as well as the strategies she has been using to put all these considerations together. The same applies to regret; it drives us to identify sources of weakness in our strategies and to think of some kind of change that would improve them. It is through counterfactual reinforcement that we learn to improve our own decision-making processes and achieve higher performance. As Kathryn Schulz says in her delightful book Being Wrong, "However disorienting, difficult, or humbling our mistakes might be, it is ultimately wrongness, not rightness, that can teach us who we are."

Estimating the probabilities of being right or wrong also has tangible and profound impact on critical decision making. Imagine a third lady, Ms Daily, facing the same decision as Ms Jones did, and telling herself: If my tumor is the type that would not recur under lumpectomy alone, why should I go through the hardships of irradiation? Similarly, if my tumor is the type that would recur regardless of whether I go through irradiation or not, I would rather not go through it. The only reason for me to go through this is if the tumor is the type that would remit under treatment and recur under no treatment. Formally, Ms Daily's dilemma is to quantify the probability that irradiation is both necessary and sufficient for eliminating her tumor, or

PNS = P(Y1 = 1, Y0 = 0)    (4.25)

where Y1 and Y0 stand for remission under treatment (Y1) and nontreatment (Y0), respectively. Knowing this probability would help Ms Daily's assessment of how likely she is to belong to the group of individuals for whom Y1 = 1 and Y0 = 0. This probability cannot, of course, be assessed from experimental studies, because we can never tell from experimental data whether an outcome would have been different had the person been assigned to a different treatment. However, casting Ms Daily's question in mathematical form enables us to investigate algebraically what assumptions are needed for estimating PNS and from what type of data. In the next section (Section 4.5.1, Eq. (4.42)), we see that indeed, PNS can be estimated if we assume monotonicity, namely, that irradiation cannot cause the recurrence of a tumor that was about to remit. Moreover, under monotonicity, experimental data are sufficient to conclude

PNS = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0))    (4.26)

For example, if we rely on the experimental data of Fisher et al. (2002), this formula permits us to conclude that Ms Daily's PNS is

PNS = 0.86 − 0.61 = 0.25

This gives her a 25% chance that her tumor is the type that responds to treatment—specifically, that it will remit under lumpectomy plus irradiation but will recur under lumpectomy alone. Such quantification of individual risks is extremely important in personal decision making, and estimates of such risks from population data can only be inferred through counterfactual analysis and appropriate assumptions.
4.4.4 Discrimination in Hiring

Example 4.4.4 Mary files a lawsuit against the New York-based XYZ International, alleging discriminatory hiring practices. According to her, she has applied for a job with XYZ International, and she has all the credentials for the job, yet she was not hired, allegedly because she mentioned, during the course of her interview, that she is gay. Moreover, she claims, the hiring record of XYZ International shows consistent preferences for straight employees. Does she have a case? Can hiring records prove whether XYZ International was discriminating when declining her job application?

At the time of writing, U.S. law doesn't specifically prohibit employment discrimination on the basis of sexual orientation, but New York law does. And New York defines discrimination in much the same way as federal law. U.S. courts have issued clear directives as to what constitutes employment discrimination. According to law makers, "The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin, etc.) and everything else had been the same." (In Carson vs Bethlehem Steel Corp., 70 FEP Cases 921, 7th Cir. (1996).)

The first thing to note in this directive is that it is not a population-based criterion, but one that appeals to the individual case of the plaintiff. The second thing to note is that it is formulated in counterfactual terminology, using idioms such as "would have taken," "had the employee been," and "had been the same." What do they mean? Can one ever prove how an employer would have acted had Mary been straight? Certainly, this is not a variable that we can intervene upon in an experimental setting. Can data from an observational study prove that an employer was discriminating?

It turns out that Mary's case, though superficially different from Example 4.4.3, has a lot in common with the problem Mrs Smith faced over her unsuccessful cancer treatment. The probability that Mary's nonhiring is due to her sexual orientation can, similarly to Mrs Smith's cancer treatment, be expressed using the probability of sufficiency:

PS = P(Y1 = 1 | X = 0, Y = 0)

In this case, Y stands for Mary's hiring, and X stands for the interviewer's perception of Mary's sexual orientation. The expression reads: "the probability that Mary would have been hired had the interviewer perceived her as straight, given that the interviewer perceived her as gay, and she was not hired." (Note that the variable in question is the interviewer's perception of Mary's sexual orientation, not the orientation itself, because an intervention on perception is quite simple in this case—we need only to imagine that Mary never mentioned that she is gay; hypothesizing a change in Mary's actual orientation, although formally acceptable, brings with it an aura of awkwardness.)

We show in Section 4.5.2 that, although discrimination cannot be proved in individual cases, the probability that such discrimination took place can be determined, and this probability may sometimes reach a level approaching certitude. The next example examines how the problem of discrimination—in this case on gender, not sexual orientation—may appear to a policy maker, rather than a juror.
4.4.5 Mediation and Path-disabling Interventions

Example 4.4.5 A policy maker wishes to assess the extent to which gender disparity in hiring can be reduced by making hiring decisions gender-blind, rather than eliminating gender inequality in education or job training. The former concerns the "direct effect" of gender on hiring, whereas the latter concerns the "indirect effect," or the effect mediated via job qualification.

In this example, fighting employers' prejudices and launching educational reforms are two contending policy options that involve costly investments and different implementation strategies. Knowing in advance which of the two, if successful, would have a greater impact on reducing hiring disparity is essential for planning, and depends critically on mediation analysis for resolution. For example, knowing that current hiring disparities are due primarily to employers' prejudices would render educational reforms superfluous, a fact that may save substantial resources. Note, however, that the policy decisions in this example concern the enabling and disabling of processes rather than lowering or raising values of specific variables. The educational reform program calls for disabling current educational practices and replacing them with a new program in which women obtain the same educational opportunities as men. The hiring-based proposal calls for disabling the current hiring process and replacing it with one in which gender plays no role in hiring decisions.

Because we are dealing with disabling processes rather than changing levels of variables, there is no way we can express the effect of such interventions using a do-operator, as we did in the mediation analysis of Section 3.7. We can express it, however, in a counterfactual language, using the desired end result as an antecedent. For example, if we wish to assess the hiring disparity after successfully implementing gender-blind hiring procedures, we impose the condition that all female applicants be treated like males as an antecedent and proceed to estimate the hiring rate under such a counterfactual condition.

The analysis proceeds as follows: the hiring status (Y) of a female applicant with qualification Q = q, given that the employer treats her as though she is a male, is captured by the counterfactual Y_{X=1,Q=q}, where X = 1 refers to being a male. But since the value q would vary among applicants, we need to average this quantity according to the distribution of female qualification, giving Σ_q E[Y_{X=1,Q=q}] P(Q = q | X = 0). Male applicants would have a similar chance at hiring except that the average is governed by the distribution of male qualification, giving

Σ_q E[Y_{X=1,Q=q}] P(Q = q | X = 1)

If we subtract the two quantities, we get

Σ_q E[Y_{X=1,Q=q}] [P(Q = q | X = 0) − P(Q = q | X = 1)]

which is the indirect effect of gender on hiring, mediated by qualification. We call this effect the natural indirect effect (NIE), because we allow the qualification Q to vary naturally from applicant to applicant, as opposed to the controlled direct effect in Chapter 3, where we held the mediator at a constant level for the entire population. Here we merely disable the capacity of Y to respond to X but leave its response to Q unaltered. The next question to ask is whether such a counterfactual expression can be identified from data.
It can be shown (Pearl 2001) that, in the absence of confounding, the NIE can be estimated by conditional probabilities, giving

NIE = Σ_q E[Y | X = 1, Q = q] [P(Q = q | X = 0) − P(Q = q | X = 1)]

This expression is known as the mediation formula. It measures the extent to which the effect of X on Y is explained by its effect on the mediator Q. Counterfactual analysis permits us to define and assess NIE by "freezing" the direct effect of X on Y, and allowing the mediator (Q) of each unit to react to X in a natural way, as if no freezing took place. The mathematical tools necessary for estimating the various nuances of mediation are summarized in Section 4.5.

4.5 Mathematical Tool Kits for Attribution and Mediation

As we examined the practical applications of counterfactual analysis in Section 4.4, we noted several recurring patterns that shared mathematical expressions as well as methods of solution. The first was the effect of treatment on the treated, ETT, whose syntactic signature was the counterfactual expression E[Yx | X = x′], with x and x′ two distinct values of X. We showed that problems as varied as recruitment to a program (Section 4.4.1) and additive interventions (Example 4.4.2) rely on the estimation of this expression, and we have listed conditions under which estimation is feasible, as well as the resulting estimand (Eqs. (4.21) and (4.8)).

Another recurring pattern appeared in problems of attribution, such as personal decision problems (Example 4.4.3) and possible cases of discrimination (Example 4.4.4). Here, the pattern was the expression for the probability of necessity:

PN = P(Y0 = 0 | X = 1, Y = 1)

The probability of necessity also pops up in problems of legal liability, where it reads: "The probability that the damage would not have occurred had the action not been taken (Y0 = 0), given that, in fact, the damage did occur (Y = 1) and the action was taken (X = 1)." Section 4.5.1 summarizes mathematical results that will enable readers to estimate (or bound) PN using a combination of observational and experimental data.

Finally, in questions of mediation (Example 4.4.5) the key counterfactual expression was E[Y_{x,M_{x′}}], which reads, "The expected outcome (Y) had the treatment been X = x and, simultaneously, had the mediator M attained the value (M_{x′}) it would have attained had X been x′." Section 4.5.2 will list the conditions under which this "nested" counterfactual expression can be estimated, as well as the resulting estimands and their interpretations.

4.5.1 A Tool Kit for Attribution and Probabilities of Causation

Assuming binary events, with X = x and Y = y representing treatment and outcome, respectively, and X = x′, Y = y′ their negations, our target quantity is defined by the English sentence: "Find the probability that if X had been x′, Y would be y′, given that, in reality, X is x and Y is y." Mathematically, this reads

PN(x, y) = P(Yx′ = y′ | X = x, Y = y)    (4.27)

This counterfactual quantity, named "probability of necessity" (PN), captures the legal criterion of "but for," according to which judgment in favor of a plaintiff should be made if and only if it is "more probable than not" that the damage would not have occurred but for the defendant's action (Robertson 1997).
Having written a formal expression for PN, Eq. (4.27), we can move on to the identification phase and ask what assumptions permit us to identify PN from empirical studies, be they observational, experimental, or a combination thereof. Mathematical analysis of this problem (described in (Pearl 2000, Chapter 9)) yields the following results:

Theorem 4.5.1 If Y is monotonic relative to X, that is, Y1(u) ≥ Y0(u) for all u, then PN is identifiable whenever the causal effect P(y|do(x)) is identifiable, and

PN = [P(y) − P(y|do(x′))] ∕ P(x, y)    (4.28)

or, substituting P(y) = P(y|x)P(x) + P(y|x′)(1 − P(x)), we obtain

PN = [P(y|x) − P(y|x′)] ∕ P(y|x) + [P(y|x′) − P(y|do(x′))] ∕ P(x, y)    (4.29)

The first term on the r.h.s. of (4.29) is called the excess risk ratio (ERR) and is often used in court cases in the absence of experimental data (Greenland 1999). It is also known as the Attributable Risk Fraction among the exposed (Jewell 2004, Chapter 4.7). The second term (the confounding factor (CF)) represents a correction needed to account for confounding bias, that is, P(y|do(x′)) ≠ P(y|x′). Put in words, confounding occurs when the proportion of the population for whom Y = y when X is set to x′ for everyone is not the same as the proportion for whom Y = y among those acquiring X = x′ by choice.

For instance, suppose there is a case brought against a car manufacturer, claiming that its car's faulty design led to a man's death in a car crash. The ERR tells us how much more likely people are to die in crashes when driving one of the manufacturer's cars. If it turns out that people who buy the manufacturer's cars are more likely to drive fast (leading to deadlier crashes) than the general population, the second term will correct for that bias.

Equation (4.29) thus provides an estimable measure of necessary causation, which can be used for monotonic Yx(u) whenever the causal effect P(y|do(x)) can be estimated, be it from randomized trials or from graph-assisted observational studies (e.g., through the backdoor criterion). More significantly, it has also been shown (Tian and Pearl 2000) that the expression in (4.28) provides a lower bound for PN in the general nonmonotonic case. In particular, the upper and lower bounds on PN are given by

max{0, [P(y) − P(y|do(x′))] ∕ P(x, y)} ≤ PN ≤ min{1, [P(y′|do(x′)) − P(x′, y′)] ∕ P(x, y)}    (4.30)

In drug-related litigation, it is not uncommon to obtain data from both experimental and observational studies. The former is usually available from the manufacturer or the agency that approved the drug for distribution (e.g., FDA), whereas the latter is often available from surveys of the population. A few algebraic steps allow us to express the lower bound (LB) and upper bound (UB) as

LB = ERR + CF
UB = ERR + q + CF    (4.31)

where ERR, CF, and q are defined as follows:

CF ≜ [P(y|x′) − P(y_{x′})] ∕ P(x, y)    (4.32)
ERR ≜ 1 − 1∕RR = 1 − P(y|x′) ∕ P(y|x)    (4.33)
q ≜ P(y′|x) ∕ P(y|x)    (4.34)

Here, CF represents the normalized degree of confounding among the unexposed (X = x′), ERR is the "excess risk ratio," and q is the ratio of negative to positive outcomes among the exposed. Figure 4.5(a) and (b) depicts these bounds as a function of ERR, and reveals three useful features. First, regardless of confounding, the width of the interval, UB − LB, remains constant and depends on only one observable parameter, P(y′|x)∕P(y|x). Second, the CF may raise the lower bound to meet the criterion of "more probable than not," PN > 1∕2, when the ERR alone would not suffice.
Lastly, the amount of "rise" to both bounds is given by CF, which is the only estimate needed from the experimental data; the causal effect P(y_x) − P(y_{x′}) is not needed. Theorem 4.5.1 further assures us that, if monotonicity can be assumed, the upper and lower bounds coincide, and the gap collapses entirely, as shown in Figure 4.5(b). This collapse does not reflect q = 0, but a shift from the bounds of (4.30) to the identified conditions of (4.28).

[Figure 4.5 (a) Showing how probabilities of necessity (PN) are bounded, as a function of the excess risk ratio (ERR) and the confounding factor (CF) (Eq. (4.31)); (b) showing how PN is identified when monotonicity is assumed (Theorem 4.5.1)]

If it is the case that the experimental and survey data have been drawn at random from the same population, then the experimental data can be used to estimate the counterfactuals of interest, for example, P(Yx = y), for the observational as well as the experimental sampled populations.

Example 4.5.1 (Attribution in Legal Setting) A lawsuit is filed against the manufacturer of drug x, charging that the drug is likely to have caused the death of Mr A, who took it to relieve back pains. The manufacturer claims that experimental data on patients with back pains show conclusively that drug x has only minor effects on death rates. However, the plaintiff argues that the experimental study is of little relevance to this case because it represents average effects on patients in the study, not on patients like Mr A who did not participate in the study. In particular, argues the plaintiff, Mr A is unique in that he used the drug of his own volition, unlike subjects in the experimental study, who took the drug to comply with experimental protocols. To support this argument, the plaintiff furnishes nonexperimental data on patients who, like Mr A, chose drug x to relieve back pains but were not part of any experiment, and who experienced lower death rates than those who didn't take the drug. The court must now decide, based on both the experimental and nonexperimental studies, whether it is "more probable than not" that drug x was in fact the cause of Mr A's death.

To illustrate the usefulness of the bounds in Eq. (4.30), consider (hypothetical) data associated with the two studies shown in Table 4.5. (In the analyses below, we ignore sampling variability.)

Table 4.5 Experimental and nonexperimental data used to illustrate the estimation of PN, the probability that drug x was responsible for a person's death (y)

                  Experimental            Nonexperimental
                  do(x)      do(x′)       x          x′
Deaths (y)          16         14           2         28
Survivals (y′)     984        986         998        972

The experimental data provide the estimates

P(y|do(x)) = 16∕1000 = 0.016    (4.35)
P(y|do(x′)) = 14∕1000 = 0.014    (4.36)

whereas the nonexperimental data provide the estimates

P(y) = 30∕2000 = 0.015    (4.37)
P(x, y) = 2∕2000 = 0.001    (4.38)
P(y|x) = 2∕1000 = 0.002    (4.39)
P(y|x′) = 28∕1000 = 0.028    (4.40)
Assuming that drug x can only cause (but never prevent) death, monotonicity holds, and Theorem 4.5.1 (Eq. (4.29)) yields

PN = [P(y|x) − P(y|x′)] ∕ P(y|x) + [P(y|x′) − P(y|do(x′))] ∕ P(x, y)
   = (0.002 − 0.028) ∕ 0.002 + (0.028 − 0.014) ∕ 0.001 = −13 + 14 = 1    (4.41)

We see that while the observational ERR is negative (−13), giving the impression that the drug is actually preventing deaths, the bias-correction term (+14) rectifies this impression and sets the probability of necessity (PN) to unity. Moreover, since the lower bound of Eq. (4.30) becomes 1, we conclude that PN = 1.00 even without assuming monotonicity. Thus, the plaintiff was correct; barring sampling errors, the data provide us with 100% assurance that drug x was in fact responsible for the death of Mr A.

To complete this tool kit for attribution, we note that the other two probabilities that came up in the discussion on personal decision-making (Example 4.4.3), PS and PNS, can be bounded by similar expressions; see (Pearl 2000, Chapter 9) and (Tian and Pearl 2000). In particular, when Yx(u) is monotonic, we have

PNS = P(Yx = 1, Yx′ = 0) = P(Yx = 1) − P(Yx′ = 1)    (4.42)

as asserted in Example 4.4.3, Eq. (4.26).

Study questions

Study question 4.5.1

Consider the dilemma faced by Ms Jones, as described in Example 4.4.3. Assume that, in addition to the experimental results of Fisher et al. (2002), she also gains access to an observational study, according to which the probability of recurrent tumor in all patients (regardless of irradiation) is 30%, whereas among the recurrent cases, 70% did not choose therapy. Use the bounds provided in Eq. (4.30) to update her estimate that her decision was necessary for remission.
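The calculation in Example 4.5.1 is mechanical enough to script. The Python sketch below reproduces the numbers above from the Table 4.5 counts, namely the ERR and CF terms of Eq. (4.29) and the nonmonotonic bounds of Eq. (4.30). Sampling variability is ignored, as in the text; this is an illustration of the formulas in this section, not a general-purpose estimator.

```python
# Counts from Table 4.5 (experimental and nonexperimental studies).
exp_deaths = {"x": 16, "x'": 14}       # deaths under do(x) and do(x')
exp_total  = {"x": 1000, "x'": 1000}
obs_deaths = {"x": 2, "x'": 28}        # deaths among drug takers / non-takers
obs_total  = {"x": 1000, "x'": 1000}

# Experimental quantities
p_y_do_x  = exp_deaths["x"]  / exp_total["x"]    # P(y | do(x))  = 0.016
p_y_do_xp = exp_deaths["x'"] / exp_total["x'"]   # P(y | do(x')) = 0.014

# Observational quantities
n = obs_total["x"] + obs_total["x'"]
p_y    = (obs_deaths["x"] + obs_deaths["x'"]) / n       # P(y)      = 0.015
p_xy   = obs_deaths["x"] / n                            # P(x, y)   = 0.001
p_y_x  = obs_deaths["x"] / obs_total["x"]               # P(y | x)  = 0.002
p_y_xp = obs_deaths["x'"] / obs_total["x'"]             # P(y | x') = 0.028
p_xpyp = (obs_total["x'"] - obs_deaths["x'"]) / n       # P(x', y') = 0.486

# PN under monotonicity, Eq. (4.29): excess risk ratio plus confounding correction
err = 1 - p_y_xp / p_y_x                  # = -13
cf  = (p_y_xp - p_y_do_xp) / p_xy         # = +14
pn_monotonic = err + cf
print(f"ERR = {err:.0f}, CF = {cf:.0f}, PN (monotonic) = {pn_monotonic:.2f}")

# Bounds without monotonicity, Eq. (4.30)
lower = max(0.0, (p_y - p_y_do_xp) / p_xy)
upper = min(1.0, ((1 - p_y_do_xp) - p_xpyp) / p_xy)
print(f"{lower:.2f} <= PN <= {upper:.2f}")
```

The same routine, fed with the figures described in Study question 4.5.1, can be used to carry out that exercise.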
4.5.2 A Tool Kit for Mediation

The canonical model for a typical mediation problem takes the form:

t = f_T(u_T)
m = f_M(t, u_M)
y = f_Y(t, m, u_Y)    (4.43)

where T (treatment), M (mediator), and Y (outcome) are discrete or continuous random variables, f_T, f_M, and f_Y are arbitrary functions, and U_T, U_M, U_Y represent, respectively, omitted factors that influence T, M, and Y. The triplet U = (U_T, U_M, U_Y) is a random vector that accounts for all variations among individuals. In Figure 4.6(a), the omitted factors are assumed to be arbitrarily distributed but mutually independent. In Figure 4.6(b), the dashed arcs connecting U_T and U_M (as well as U_M and U_Y) encode the understanding that the factors in question may be dependent.

[Figure 4.6 (a) The basic nonparametric mediation model, with no confounding. (b) A confounded mediation model in which dependence exists between U_M and (U_T, U_Y)]

Counterfactual definition of direct and indirect effects

Using the structural model of Eq. (4.43) and the counterfactual notation defined in Section 4.2.1, four types of effects can be defined for the transition from T = 0 to T = 1. Generalizations to arbitrary reference points, say from T = t to T = t′, are straightforward¹:

(a) Total effect –

TE = E[Y1 − Y0] = E[Y | do(T = 1)] − E[Y | do(T = 0)]    (4.44)

TE measures the expected increase in Y as the treatment changes from T = 0 to T = 1, while the mediator is allowed to track the change in T naturally, as dictated by the function f_M.

(b) Controlled direct effect –

CDE(m) = E[Y_{1,m} − Y_{0,m}] = E[Y | do(T = 1, M = m)] − E[Y | do(T = 0, M = m)]    (4.45)

CDE measures the expected increase in Y as the treatment changes from T = 0 to T = 1, while the mediator is set to a specified level M = m uniformly over the entire population.

(c) Natural direct effect –

NDE = E[Y_{1,M_0} − Y_{0,M_0}]    (4.46)

NDE measures the expected increase in Y as the treatment changes from T = 0 to T = 1, while the mediator is set to whatever value it would have attained (for each individual) prior to the change, that is, under T = 0.

(d) Natural indirect effect –

NIE = E[Y_{0,M_1} − Y_{0,M_0}]    (4.47)

NIE measures the expected increase in Y when the treatment is held constant, at T = 0, and M changes to whatever value it would have attained (for each individual) under T = 1. It captures, therefore, the portion of the effect that can be explained by mediation alone, while disabling the capacity of Y to respond to X.

¹ These definitions apply at the population level; the unit-level effects are given by the expressions under the expectation. All expectations are taken over the factors U_M and U_Y.

We note that, in general, the total effect can be decomposed as

TE = NDE − NIE_r    (4.48)

where NIE_r stands for the NIE under the reverse transition, from T = 1 to T = 0. This implies that NIE is identifiable whenever NDE and TE are identifiable. In linear systems, where reversal of transitions amounts to negating the signs of their effects, we have the standard additive formula, TE = NDE + NIE.

We further note that TE and CDE(m) are do-expressions and can, therefore, be estimated from experimental data or in observational studies using the backdoor or front-door adjustments. Not so for the NDE and NIE; a new set of assumptions is needed for their identification.

Conditions for identifying natural effects

The following set of conditions, marked A-1 to A-4, are sufficient for identifying both direct and indirect natural effects. We can identify the NDE and NIE provided that there exists a set W of measured covariates such that

A-1 No member of W is a descendant of T.
A-2 W blocks all backdoor paths from M to Y (after removing T → M and T → Y).
A-3 The W-specific effect of T on M is identifiable (possibly using experiments or adjustments).
A-4 The W-specific joint effect of {T, M} on Y is identifiable (possibly using experiments or adjustments).

Theorem 4.5.2 (Identification of the NDE) When conditions A-1 and A-2 hold, the natural direct effect is experimentally identifiable and is given by

NDE = Σ_m Σ_w [E[Y | do(T = 1, M = m), W = w] − E[Y | do(T = 0, M = m), W = w]] × P(M = m | do(T = 0), W = w) P(W = w)    (4.49)

The identifiability of the do-expressions in Eq. (4.49) is guaranteed by conditions A-3 and A-4 and can be determined using the backdoor or front-door criteria.

Corollary 4.5.1 If conditions A-1 and A-2 are satisfied by a set W that also deconfounds the relationships in A-3 and A-4, then the do-expressions in Eq. (4.49) are reducible to conditional expectations, and the natural direct effect becomes

NDE = Σ_m Σ_w [E[Y | T = 1, M = m, W = w] − E[Y | T = 0, M = m, W = w]] × P(M = m | T = 0, W = w) P(W = w)    (4.50)

In the nonconfounding case (Figure 4.6(a)), NDE reduces to

NDE = Σ_m [E[Y | T = 1, M = m] − E[Y | T = 0, M = m]] P(M = m | T = 0)    (4.51)

Similarly, using (4.48) and TE = E[Y | T = 1] − E[Y | T = 0], NIE becomes

NIE = Σ_m E[Y | T = 0, M = m] [P(M = m | T = 1) − P(M = m | T = 0)]    (4.52)

The last two expressions are known as the mediation formulas. We see that while NDE is a weighted average of CDE, no such interpretation can be given to NIE.
The counterfactual definitions of NDE and NIE (Eqs. (4.46) and (4.47)) permit us to give these effects meaningful interpretations in terms of "response fractions." The ratio NDE∕TE measures the fraction of the response that is transmitted directly, with M "frozen." NIE∕TE measures the fraction of the response that may be transmitted through M, with Y blinded to X. Consequently, the difference (TE − NDE)∕TE measures the fraction of the response that is necessarily due to M.

Numerical example: Mediation with binary variables

To anchor these mediation formulas in a concrete example, we return to the encouragement-design example of Section 4.2.3 and assume that T = 1 stands for participation in an enhanced training program, Y = 1 for passing the exam, and M = 1 for a student spending more than 3 hours per week on homework. Assume further that the data described in Tables 4.6 and 4.7 were obtained in a randomized trial with no mediator-to-outcome confounding (Figure 4.6(a)).

Table 4.6 The expected success (Y) for treated (T = 1) and untreated (T = 0) students, as a function of their homework (M)

Treatment T    Homework M    Success rate E(Y | T = t, M = m)
1              1             0.80
1              0             0.40
0              1             0.30
0              0             0.20

Table 4.7 The expected homework (M) done by treated (T = 1) and untreated (T = 0) students

Treatment T    Homework E(M | T = t)
0              0.40
1              0.75

The data shows that training tends to increase both the time spent on homework and the rate of success on the exam. Moreover, training and time spent on homework together are more likely to produce success than each factor alone.

Our research question asks for the extent to which students' homework contributes to their increased success rates regardless of the training program. The policy implications of such questions lie in evaluating policy options that either curtail or enhance homework efforts, for example, by counting homework effort in the final grade or by providing students with adequate work environments at home. An extreme explanation of the data, with significant impact on educational policy, might argue that the program does not contribute substantively to students' success, save for encouraging students to spend more time on homework, an encouragement that could be obtained through less expensive means. Opposing this theory, we may have teachers who argue that the program's success is substantive, achieved mainly due to the unique features of the curriculum covered, whereas the increase in homework efforts cannot alone account for the success observed.

Substituting the data into Eqs. (4.51) and (4.52) gives

NDE = (0.40 − 0.20)(1 − 0.40) + (0.80 − 0.30)(0.40) = 0.32
NIE = (0.75 − 0.40)(0.30 − 0.20) = 0.035
TE = 0.80 × 0.75 + 0.40 × 0.25 − (0.30 × 0.40 + 0.20 × 0.60) = 0.46
NIE∕TE = 0.07, NDE∕TE = 0.696, 1 − NDE∕TE = 0.304

We conclude that the program as a whole has increased the success rate by 46% and that a significant portion, 30.4%, of this increase is due to the capacity of the program to stimulate improved homework effort. At the same time, only 7% of the increase can be explained by stimulated homework alone without the benefit of the program itself.
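The arithmetic above is easy to reproduce programmatically. The following sketch evaluates the mediation formulas (4.51) and (4.52), together with TE, directly from the entries of Tables 4.6 and 4.7, assuming, as the text does, that there is no mediator-to-outcome confounding (Figure 4.6(a)).

```python
# Entries of Table 4.6: E[Y | T = t, M = m]
e_y = {(1, 1): 0.80, (1, 0): 0.40,
       (0, 1): 0.30, (0, 0): 0.20}

# Entries of Table 4.7: E[M | T = t], i.e., P(M = 1 | T = t) for binary M
p_m1 = {0: 0.40, 1: 0.75}

def p_m(m, t):
    """P(M = m | T = t) for binary M."""
    return p_m1[t] if m == 1 else 1 - p_m1[t]

# Natural direct effect, Eq. (4.51): hold the mediator at its T = 0 distribution
nde = sum((e_y[(1, m)] - e_y[(0, m)]) * p_m(m, 0) for m in (0, 1))

# Natural indirect effect, Eq. (4.52): shift only the mediator distribution
nie = sum(e_y[(0, m)] * (p_m(m, 1) - p_m(m, 0)) for m in (0, 1))

# Total effect: E[Y | T = 1] - E[Y | T = 0], averaging Y over M within each arm
te = (sum(e_y[(1, m)] * p_m(m, 1) for m in (0, 1))
      - sum(e_y[(0, m)] * p_m(m, 0) for m in (0, 1)))

print(f"NDE = {nde:.3f}, NIE = {nie:.3f}, TE = {te:.2f}")
print(f"NIE/TE = {nie/te:.3f}, NDE/TE = {nde/te:.3f}, 1 - NDE/TE = {1 - nde/te:.3f}")
```

The printed ratios differ from the rounded figures in the text only by rounding (for example, 0.076 vs 0.07 for NIE∕TE).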
Study questions

Study question 4.5.2

Consider the structural model:

y = β1 m + β2 t + u_y    (4.53)
m = γ1 t + u_m    (4.54)

(a) Use the basic definition of the natural effects (Eqs. (4.46) and (4.47)) to determine TE, NDE, and NIE.
(b) Repeat (a) assuming that u_y is correlated with u_m.

Study question 4.5.3

Consider the structural model:

y = β1 m + β2 t + β3 tm + β4 w + u_y    (4.55)
m = γ1 t + γ2 w + u_m    (4.56)
w = αt + u_w    (4.57)

with β3 tm representing an interaction term.

(a) Use the basic definition of the natural effects (Eqs. (4.46) and (4.47)) (treating M as the mediator) to determine the portion of the effect for which mediation is necessary (TE − NDE) and the portion for which mediation is sufficient (NIE). [Hint: Show that

NDE = β2 + αβ4    (4.58)
NIE = β1(γ1 + αγ2)    (4.59)
TE = β2 + (γ1 + αγ2)(β3 + β1) + αβ4    (4.60)
TE − NDE = (β1 + β3)(γ1 + αγ2)    (4.61)]

(b) Repeat, using W as the mediator.

Study question 4.5.4

Apply the mediation formulas provided in this section to the discrimination case discussed in Section 4.4.4, and determine the extent to which ABC International practiced discrimination in their hiring criteria. Use the data in Tables 4.6 and 4.7, with T = 1 standing for male applicants, M = 1 standing for highly qualified applicants, and Y = 1 standing for hiring. (Find the proportion of the hiring disparity that is due to gender, and the proportion that could be explained by disparity in qualification alone.)

Ending Remarks

The analysis of mediation is perhaps the best arena to illustrate the effectiveness of the counterfactual-graphical symbiosis that we have been pursuing in this book. If we examine the identifying conditions A-1 to A-4, we find four assertions about the model that are not too easily comprehended. To judge their plausibility in any given scenario, without the graph before us, is unquestionably a formidable, superhuman task. Yet the symbiotic analysis frees investigators from the need to understand, articulate, examine, and judge the plausibility of the assumptions needed for identification. Instead, the method can confirm or disconfirm these assumptions algorithmically from a more reliable set of assumptions, those encoded in the structural model itself. Once constructed, the causal diagram allows simple path-tracing routines to replace much of the human judgment deemed necessary in mediation analysis; the judgment invoked in the construction of the diagrams is sufficient, and that construction requires only judgment about causal relationships among realizable variables and their disturbances.

Bibliographical Notes for Chapter 4

The definition of counterfactuals as derivatives of structural equations, Eq. (4.5), was introduced by Balke and Pearl (1994a,b), who applied it to the estimation of probabilities of causation in legal settings. The philosopher David Lewis defined counterfactuals in terms of similarity among possible worlds (Lewis 1973). In statistics, the notation Yx(u) was devised by Neyman (1923) to denote the potential response of unit u in a controlled randomized trial, under treatment X = x. It remained relatively unnoticed until Rubin (1974) treated Yx as a random variable and connected it to the observed variable via the consistency rule of Eq. (4.6), which is a theorem in both Lewis's logic and in structural models. The relationships among these three formalisms of counterfactuals are discussed at length in Pearl (2000, Chapter 7), where they are shown to be logically equivalent; a problem solved in one framework would yield the same solution in another.
Rubin's framework, known as "potential outcomes," differs from the structural account only in the language in which problems are defined, hence, in the mathematical tools available for their solution. In the potential outcome framework, problems are defined algebraically as assumptions about counterfactual independencies, also known as "ignorability assumptions." These types of assumptions, exemplified in Eq. (4.15), may become too complicated to interpret or verify by unaided judgment. In the structural framework, on the other hand, problems are defined in the form of causal graphs, from which dependencies of counterfactuals (e.g., Eq. (4.15)) can be derived mechanically. The reason some statisticians prefer the algebraic approach is, primarily, because graphs are relatively new to statistics. Recent books in social science (e.g., Morgan and Winship 2014) and in health science (e.g., VanderWeele 2015) are taking the hybrid, graph-counterfactual approach pursued in our book.

The section on linear counterfactuals is based on Pearl (2009, pp. 389–391). Recent advances are provided in Cai and Kuroki (2006) and Chen and Pearl (2014). Our discussion of ETT (Effect of Treatment on the Treated), as well as additive interventions, is based on Shpitser and Pearl (2009), which provides a full characterization of models in which ETT is identifiable. Legal questions of attribution, as well as probabilities of causation, are discussed at length in Greenland (1999), who pioneered the counterfactual approach to such questions. Our treatment of PN, PS, and PNS is based on Tian and Pearl (2000) and Pearl (2000, Chapter 9). Recent results, including the tool kit of Section 4.5.1, are given in Pearl (2015a).

Mediation analysis (Sections 4.4.5 and 4.5.2), as we remarked in Chapter 3, has a long tradition in the social sciences (Duncan 1975; Kenny 1979), but has gone through a dramatic revolution through the introduction of counterfactual analysis. A historical account of the conceptual transition from the statistical approach of Baron and Kenny (1986) to the modern, counterfactual-based approach of natural direct and indirect effects (Pearl 2001; Robins and Greenland 1992) is given in Sections 1 and 2 of Pearl (2014a). The recent text of VanderWeele (2015) enhances this development with new results and new applications. Additional advances in mediation, including sensitivity analysis, bounds, multiple mediators, and stronger identifying assumptions, are discussed in Imai et al. (2010) and Muthén and Asparouhov (2015). The mediation tool kit of Section 4.5.2 is based on Pearl (2014a). Shpitser (2013) has derived a general criterion for identifying indirect effects in graphs.

References

Balke A and Pearl J 1994a Counterfactual probabilities: Computational methods, bounds, and applications. In Uncertainty in Artificial Intelligence 10 (ed. de Mantaras RL and Poole D), Morgan Kaufmann Publishers, San Mateo, CA, pp. 46–54.
Balke A and Pearl J 1994b Probabilistic evaluation of counterfactual queries. Proceedings of the Twelfth National Conference on Artificial Intelligence, vol. I, MIT Press, Menlo Park, CA, pp. 230–237.
Bareinboim E and Pearl J 2012 Causal inference by surrogate experiments (or, z-identifiability). Proceedings of the Twenty-eighth Conference on Uncertainty in Artificial Intelligence (ed. de Freitas N and Murphy K), AUAI Press, Corvallis, OR, pp. 113–120.
Bareinboim E and Pearl J 2013 A general algorithm for deciding transportability of experimental results. Journal of Causal Inference 1 (1), 107–134.
Bareinboim E and Pearl J 2016 Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113 (17), 7345–7352.
Bareinboim E, Tian J and Pearl J 2014 Recovering from selection bias in causal and statistical inference. Proceedings of the Twenty-eighth AAAI Conference on Artificial Intelligence (ed. Brodley CE and Stone P), AAAI Press, Palo Alto, CA, pp. 2410–2416.
Baron R and Kenny D 1986 The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51 (6), 1173–1182.
Berkson J 1946 Limitations of the application of fourfold table analysis to hospital data. Biometrics Bulletin 2, 47–53.
Bollen K 1989 Structural Equations with Latent Variables. John Wiley & Sons, Inc., New York.
Bollen K and Pearl J 2013 Eight myths about causality and structural equation models. In Handbook of Causal Analysis for Social Research (ed. Morgan S), Springer-Verlag, Dordrecht, Netherlands, pp. 245–274.
Bowden R and Turkington D 1984 Instrumental Variables. Cambridge University Press, Cambridge, England.
Brito C and Pearl J 2002 Generalized instrumental variables. Uncertainty in Artificial Intelligence, Proceedings of the Eighteenth Conference (ed. Darwiche A and Friedman N), Morgan Kaufmann, San Francisco, CA, pp. 85–93.
Cai Z and Kuroki M 2006 Variance estimators for three 'probabilities of causation'. Risk Analysis 25 (6), 1611–1620.
Chen B and Pearl J 2014 Graphical tools for linear structural equation modeling. Technical Report R-432, Department of Computer Science, University of California, Los Angeles, CA. Submitted, Psychometrika, http://ftp.cs.ucla.edu/pub/stat_ser/r432.pdf.
Cole S and Hernán M 2002 Fallibility in estimating direct effects. International Journal of Epidemiology 31 (1), 163–165.
Conrady S and Jouffe L 2015 Bayesian Networks and BayesiaLab: A Practical Introduction for Researchers, 1st edn. Bayesia USA.
Cox D 1958 The Planning of Experiments. John Wiley and Sons, New York.
Darwiche A 2009 Modeling and Reasoning with Bayesian Networks. Cambridge University Press, New York.
Duncan O 1975 Introduction to Structural Equation Models. Academic Press, New York.
Elwert F 2013 Graphical causal models. In Handbook of Causal Analysis for Social Research (ed. Morgan S), Springer-Verlag, Dordrecht, Netherlands, pp. 245–274.
Fenton N and Neil M 2013 Risk Assessment and Decision Analysis with Bayesian Networks. CRC Press, Boca Raton, FL.
Fisher R 1922 On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222, 311.
Fisher B, Anderson S, Bryant J, Margolese RG, Deutsch M, Fisher ER, Jeong JH and Wolmark N 2002 Twenty-year follow-up of a randomized trial comparing total mastectomy, lumpectomy, and lumpectomy plus irradiation for the treatment of invasive breast cancer. New England Journal of Medicine 347 (16), 1233–1241.
Glymour MM 2006 Using causal diagrams to understand common problems in social epidemiology. Methods in Social Epidemiology, John Wiley & Sons, Inc., San Francisco, CA, pp. 393–428.
Greenland S 1999 Relation of probability of causation, relative risk, and doubling dose: A methodologic error that has become a social problem. American Journal of Public Health 89 (8), 1166–1169.
Greenland S 2000 An introduction to instrumental variables for epidemiologists. International Journal of Epidemiology 29 (4), 722–729.
Grinstead CM and Snell JL 1998 Introduction to Probability, 2nd revised edn. American Mathematical Society, United States.
Haavelmo T 1943 The statistical implications of a system of simultaneous equations. Econometrica 11, 1–12. Reprinted in DF Hendry and MS Morgan (eds.) 1995 The Foundations of Econometric Analysis, Cambridge University Press, pp. 477–490.
Hayduk L, Cummings G, Stratkotter R, Nimmo M, Grygoryev K, Dosman D, Gilespie M, Pazderka-Robinson H and Boadu K 2003 Pearl’s d-separation: One more step into causal thinking. Structural Equation Modeling 10 (2), 289–311.
Heise D 1975 Causal Analysis. John Wiley and Sons, New York.
Hernán M and Robins J 2006 Estimating causal effects from epidemiological data. Journal of Epidemiology and Community Health 60 (7), 578–586. DOI: 10.1136/jech.2004.029496.
Hernández-Díaz S, Schisterman E and Hernán M 2006 The birth weight “paradox” uncovered? American Journal of Epidemiology 164 (11), 1115–1120.
Holland P 1986 Statistics and causal inference. Journal of the American Statistical Association 81 (396), 945–960.
Howard R and Matheson J 1981 Influence diagrams. In Principles and Applications of Decision Analysis (ed. Howard R and Matheson J), Strategic Decisions Group, Menlo Park, CA, pp. 721–762.
Imai K, Keele L and Yamamoto T 2010 Identification, inference, and sensitivity analysis for causal mediation effects. Statistical Science 25 (1), 51–71.
Jewell NP 2004 Statistics for Epidemiology. Chapman & Hall/CRC, Boca Raton, FL.
Kenny D 1979 Correlation and Causality. John Wiley & Sons, Inc., New York.
Kiiveri H, Speed T and Carlin J 1984 Recursive causal models. Journal of Australian Math Society 36, 30–52.
Kim J and Pearl J 1983 A computational model for combined causal and diagnostic reasoning in inference systems. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (IJCAI-83), pp. 190–193, Karlsruhe, Germany.
Kline RB 2016 Principles and Practice of Structural Equation Modeling, 4th (revised and expanded) edn. Guilford Publications, Inc., New York.
Koller D and Friedman N 2009 Probabilistic Graphical Models: Principles and Techniques. MIT Press, United States.
Kyono T 2010 Commentator: A front-end user-interface module for graphical and structural equation modeling. Master’s thesis, Department of Computer Science, University of California, Los Angeles, CA.
Lauritzen S 1996 Graphical Models. Clarendon Press, Oxford. Reprinted 2004 with corrections.
Lewis D 1973 Causation. Journal of Philosophy 70, 556–567.
Lindley DV 2014 Understanding Uncertainty, revised edn. John Wiley & Sons, Inc., Hoboken, NJ.
Lord FM 1967 A paradox in the interpretation of group comparisons. Psychological Bulletin 68, 304–305.
Mohan K, Pearl J and Tian J 2013 Graphical models for inference with missing data. In Advances in Neural Information Processing Systems 26 (ed. Burges C, Bottou L, Welling M, Ghahramani Z and Weinberger K), Neural Information Processing Systems Foundation, Inc., pp. 1277–1285.
Moore D, McCabe G and Craig B 2014 Introduction to the Practice of Statistics. W.H. Freeman & Co., New York.
Morgan SL and Winship C 2014 Counterfactuals and Causal Inference: Methods and Principles for Social Research, Analytical Methods for Social Research, 2nd edn. Cambridge University Press, New York.
Muthén B and Asparouhov T 2015 Causal effects in mediation modeling: An introduction with applications to latent variables. Structural Equation Modeling: A Multidisciplinary Journal 22 (1), 12–23.
Neyman J 1923 On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5 (4), 465–480.
Pearl J 1985 Bayesian networks: A model of self-activated memory for evidential reasoning. Proceedings, Cognitive Science Society, pp. 329–334, Irvine, CA.
Pearl J 1986 Fusion, propagation, and structuring in belief networks. Artificial Intelligence 29, 241–288.
Pearl J 1988 Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.
Pearl J 1993 Comment: Graphical models, causality, and intervention. Statistical Science 8 (3), 266–269.
Pearl J 1995 Causal diagrams for empirical research. Biometrika 82 (4), 669–710.
Pearl J 1998 Graphs, causality, and structural equation models. Sociological Methods and Research 27 (2), 226–284.
Pearl J 2000 Causality: Models, Reasoning, and Inference. Cambridge University Press, New York.
Pearl J 2001 Direct and indirect effects. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, CA, pp. 411–420.
Pearl J 2009 Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge University Press, New York.
Pearl J 2014a Interpretation and identification of causal mediation. Psychological Methods 19, 459–481.
Pearl J 2014b Understanding Simpson’s paradox. The American Statistician 68 (1), 8–13.
Pearl J 2015a Causes of effects and effects of causes. Sociological Methods and Research 44, 149–164.
Pearl J 2015b Detecting latent heterogeneity. Sociological Methods and Research, DOI: 10.1177/0049124115600597, online: 1–20.
Pearl J 2015c Trygve Haavelmo and the emergence of causal calculus. Econometric Theory, Special issue on Haavelmo Centennial 31 (1), 152–179.
Pearl J 2016 Lord’s paradox revisited—(oh Lord! Kumbaya!). Journal of Causal Inference 4 (2). DOI: 10.1515/jci-2016-0021.
Pearl J and Bareinboim E 2014 External validity: From do-calculus to transportability across populations. Statistical Science 29, 579–595.
Pearl J and Mackenzie D 2018 The Book of Why: The New Science of Cause and Effect. Basic Books, New York.
Pearl J and Paz A 1987 GRAPHOIDS: A graph-based logic for reasoning about relevance relations. In Advances in Artificial Intelligence-II (ed. Duboulay B, Hogg D and Steels L), North-Holland Publishing Co., pp. 357–363.
Pearl J and Robins J 1995 Probabilistic evaluation of sequential plans from causal models with hidden variables. In Uncertainty in Artificial Intelligence 11 (ed. Besnard P and Hanks S), Morgan Kaufmann, San Francisco, CA, pp. 444–453.
Pearl J and Verma T 1991 A theory of inferred causation. Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference (ed. Allen J, Fikes R and Sandewall E), Morgan Kaufmann, San Mateo, CA, pp. 441–452.
Pigou A 1911 Alcoholism and Heredity. Westminster Gazette, February 2.
Rebane G and Pearl J 1987 The recovery of causal poly-trees from statistical data. Proceedings of the Third Workshop on Uncertainty in AI, pp. 222–228, Seattle, WA.
Reichenbach H 1956 The Direction of Time. University of California Press, Berkeley, CA.
Robertson D 1997 The common sense of cause in fact. Texas Law Review 75 (7), 1765–1800.
Robins J 1986 A new approach to causal inference in mortality studies with a sustained exposure period—applications to control of the healthy workers survivor effect. Mathematical Modeling 7, 1393–1512.
Robins J and Greenland S 1992 Identifiability and exchangeability for direct and indirect effects. Epidemiology 3 (2), 143–155.
Rubin D 1974 Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 688–701.
Selvin S 2004 Biostatistics: How it Works. Pearson, New Jersey.
Senn S 2006 Change from baseline and analysis of covariance revisited. Statistics in Medicine 25, 4334–4344.
Shpitser I 2013 Counterfactual graphical models for longitudinal mediation analysis with unobserved confounding. Cognitive Science 37 (6), 1011–1035.
Shpitser I and Pearl J 2007 What counterfactuals can be tested. Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, AUAI Press, Vancouver, BC, Canada, pp. 352–359. Also, Journal of Machine Learning Research 9, 1941–1979, 2008.
Shpitser I and Pearl J 2008 Complete identification methods for the causal hierarchy. Journal of Machine Learning Research 9, 1941–1979.
Shpitser I and Pearl J 2009 Effects of treatment on the treated: Identification and generalization. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Montreal, Quebec, pp. 514–521.
Simon H 1953 Causal ordering and identifiability. In Studies in Econometric Method (ed. Hood WC and Koopmans T), John Wiley & Sons, Inc., New York, pp. 49–74.
Simpson E 1951 The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B 13, 238–241.
Spirtes P and Glymour C 1991 An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review 9 (1), 62–72.
Spirtes P, Glymour C and Scheines R 1993 Causation, Prediction, and Search. Springer-Verlag, New York.
Stigler SM 1999 Statistics on the Table: The History of Statistical Concepts and Methods. Harvard University Press, Cambridge, MA.
Strotz R and Wold H 1960 Recursive versus nonrecursive systems: An attempt at synthesis. Econometrica 28, 417–427.
Textor J, Hardt J and Knüppel S 2011 DAGitty: A graphical tool for analyzing causal diagrams. Epidemiology 22 (5), 745.
Tian J, Paz A and Pearl J 1998 Finding minimal d-separators. Technical Report R-254, Department of Computer Science, University of California, Los Angeles, CA. http://ftp.cs.ucla.edu/pub/stat_ser/r254.pdf.
Tian J and Pearl J 2000 Probabilities of causation: bounds and identification. Annals of Mathematics and Artificial Intelligence 28, 287–313.
Tian J and Pearl J 2002 A general identification condition for causal effects. Proceedings of the Eighteenth National Conference on Artificial Intelligence, AAAI Press/The MIT Press, Menlo Park, CA, pp. 567–573.
VanderWeele T 2015 Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press, New York.
Verma T and Pearl J 1988 Causal networks: Semantics and expressiveness. Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pp. 352–359, Mountain View, CA. Also in R. Shachter, T.S. Levitt, and L.N. Kanal (eds.), Uncertainty in AI 4, Elsevier Science Publishers, 69–76, 1990.
Verma T and Pearl J 1990 Equivalence and synthesis of causal models. Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, pp. 220–227, Cambridge, MA.
Virgil 29 BC Georgics. Verse 490, Book 2.
Wainer H 1991 Adjusting for differential base rates: Lord’s paradox again. Psychological Bulletin 109, 147–151.
Wooldridge J 2013 Introductory Econometrics: A Modern Approach, 5th international edn. South-Western, Mason, OH.