Causal Effect Estimation with Context and Confounders
Document Details
2024
Arthur Gretton
Summary
This presentation discusses causal effect estimation techniques, including average treatment effects (ATE), conditional average treatment effects (CATE), and average treatment on the treated (ATT). It covers both observed and hidden covariates, using kernel methods and neural networks. The methods are illustrated with examples from various fields, including epidemiology and economics.
Full Transcript
Causal Effect Estimation with Context and Confounders
Arthur Gretton, Gatsby Computational Neuroscience Unit, DeepMind. MLSS Okinawa, 2024.

Observation vs intervention
Conditioning from observation, with context X (observed or hidden) influencing both treatment A and outcome Y:
$$E[Y \mid A = a] = \sum_x E[Y \mid a, x]\, p(x \mid a).$$
From our observations of historical hospital data: $P(Y = \text{cured} \mid A = \text{pills}) = 0.80$, $P(Y = \text{cured} \mid A = \text{surgery}) = 0.72$.

Average causal effect (intervention), do(a), written on the single-world intervention graph (SWIG):
$$E\big[Y^{(a)}\big] = \sum_x E[Y \mid a, x]\, p(x).$$
From our intervention (making all patients take a treatment): $P(Y^{(\text{pills})} = \text{cured}) = 0.64$, $P(Y^{(\text{surgery})} = \text{cured}) = 0.75$.
Richardson, Robins (2013). Single World Intervention Graphs (SWIGs): A Unification of the Counterfactual and Graphical Approaches to Causality.

Questions we will solve
[Causal graph: covariates X influence treatment A and the potential outcome $Y^{(a)}$; the intervention fixes A = a.]

Outline
First lecture: causal effect estimation with observed covariates: average treatment effect (ATE), conditional average treatment effect (CATE), average treatment on the treated (ATT), mediation effects.
Second lecture: causal effect estimation with hidden covariates: instrumental variables, proxy variables.
What's new? What is it good for? Treatment A, covariates X, etc. can be multivariate and complicated... by using kernel or adaptive neural net feature representations.

Regression assumption: linear functions of features
All learned functions will take the form $\gamma(x) = \theta^\top \varphi(x)$, equivalently $\gamma(x) = \langle \theta, \varphi(x) \rangle_{\mathcal{H}}$.
Option 1: finite dictionaries of learned neural net features $\varphi(x)$ (linear final layer). Xu, G. A Neural Mean Embedding Approach for Back-door and Front-door Adjustment (ICLR 2023). Xu, Chen, Srinivasan, de Freitas, Doucet, G. Learning Deep Features in Instrumental Variable Regression (ICLR 2021).
Option 2: infinite dictionaries of fixed kernel features, $\langle \varphi(x_i), \varphi(x) \rangle_{\mathcal{H}} = k(x_i, x)$: the kernel is the feature dot product. Singh, Xu, G. Kernel Methods for Causal Functions: Dose, Heterogeneous, and Incremental Response Curves (Biometrika 2023). Singh, Sahani, G. Kernel Instrumental Variable Regression (NeurIPS 2019).

Kernel ridge regression: reminder
Approximate $\gamma_0(x) := E[Y \mid X = x]$ using ridge regression:
$$\hat\theta = \operatorname*{argmin}_{\theta \in \mathcal{H}} \sum_{i=1}^n \big(y_i - \langle \theta, \varphi(x_i) \rangle_{\mathcal{H}}\big)^2 + n\lambda \|\theta\|^2_{\mathcal{H}}.$$
Representer theorem: $\theta = \sum_{i=1}^n \alpha_i \varphi(x_i)$, with $\langle \varphi(x_i), \varphi(x_j) \rangle_{\mathcal{H}} = k(x_i, x_j)$.
The solution is
$$\hat\alpha = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \|y - K\alpha\|^2 + n\lambda\, \alpha^\top K \alpha = (K + n\lambda I_n)^{-1} y.$$
Prediction at a new x: define $(k_{Xx})_i = k(x_i, x)$; then
$$\hat\gamma(x) = \langle \hat\theta, \varphi(x) \rangle_{\mathcal{H}} = \Big\langle \sum_{i=1}^n \hat\alpha_i \varphi(x_i), \varphi(x) \Big\rangle_{\mathcal{H}} = \hat\alpha^\top k_{Xx}.$$

Model fitting: ridge regression
Approximate $\gamma_0(x) := E[Y \mid X = x]$ from features $\varphi(x_i)$ and outcomes $y_i$:
$$\hat\theta = \operatorname*{argmin}_{\theta \in \mathcal{H}} \sum_{i=1}^n \big(y_i - \langle \theta, \varphi(x_i) \rangle_{\mathcal{H}}\big)^2 + n\lambda \|\theta\|^2_{\mathcal{H}}.$$
Kernel solution at x, as a weighted sum of the observed outcomes:
$$\hat\gamma(x) = \sum_{i=1}^n y_i\, \beta_i(x), \qquad \beta(x) = (K_{XX} + n\lambda I)^{-1} k_{Xx},$$
where $(K_{XX})_{ij} = k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle_{\mathcal{H}}$ and $(k_{Xx})_i = k(x_i, x)$.
[Figure: kernel ridge regression fit to noisy one-dimensional data.]
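To make the ridge-regression slides concrete, here is a minimal NumPy sketch of kernel ridge regression with a Gaussian kernel. The helper names, the Gaussian kernel choice, the regularisation values, and the toy data are assumptions of this illustration; they are not taken from the slides or from the linked repository.

```python
import numpy as np

def gaussian_kernel(X1, X2, lengthscale=1.0):
    """Gram matrix with entries k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * lengthscale**2))

def fit_krr(X, y, lam=1e-2, lengthscale=1.0):
    """Representer-theorem coefficients: alpha = (K_XX + n*lam*I)^(-1) y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, lengthscale)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_krr(X_train, alpha, X_new, lengthscale=1.0):
    """Prediction gamma-hat(x) = alpha^T k_Xx, evaluated at each row of X_new."""
    k_Xx = gaussian_kernel(X_train, X_new, lengthscale)   # n x m
    return k_Xx.T @ alpha

# Toy usage: recover a smooth function from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
alpha = fit_krr(X, y, lam=1e-3)
x_test = np.linspace(-5, 5, 5)[:, None]
print(predict_krr(X, alpha, x_test))   # compare with np.sin(x_test)
```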
Observed covariates: (conditional) ATE
Kernel features (Biometrika 2023); neural net features (ICLR 2023). Code for NN and kernel causal estimation with observed covariates: https://github.com/liyuan9988/DeepFrontBackDoor/

Average treatment effect
Potential outcome (intervention):
$$E\big[Y^{(a)}\big] = \int E[Y \mid a, x]\, dp(x)$$
(the average structural function; in epidemiology, for continuous a, the dose-response curve). Assume: (1) the Stable Unit Treatment Value Assumption (aka "no interference"); (2) conditional exchangeability, $Y^{(a)} \perp\!\!\!\perp A \mid X$; (3) overlap.
Example: US Job Corps, training for disadvantaged youths. A: treatment (training hours); Y: outcome (percentage employment); X: covariates (age, education, marital status, ...).

Multiple inputs via products of kernels
We may predict the expected outcome from two inputs, $\gamma_0(a, x) := E[Y \mid a, x]$. Assume we have covariate features $\varphi(x)$ with kernel $k(x, x')$, and treatment features $\varphi(a)$ with kernel $k(a, a')$ (the argument of the kernel/feature map indicates the feature space).
We use the outer product of features, which corresponds to the product of kernels:
$$\psi(x, a) = \varphi(a) \otimes \varphi(x), \qquad K([a, x], [a', x']) = k(a, a')\, k(x, x').$$
Ridge regression solution:
$$\hat\gamma(a, x) = \sum_{i=1}^n y_i\, \beta_i(a, x), \qquad \beta(a, x) = [K_{AA} \odot K_{XX} + n\lambda I]^{-1} (K_{Aa} \odot K_{Xx}),$$
where $\odot$ is the elementwise (Hadamard) product.

ATE (dose-response curve)
Well-specified setting: $E[Y \mid a, x] =: \gamma_0(a, x) = \langle \theta_0, \varphi(a) \otimes \varphi(x) \rangle$.
ATE as a feature-space dot product:
$$\mathrm{ATE}(a) = E[\gamma_0(a, X)] = E[\langle \theta_0, \varphi(a) \otimes \varphi(X) \rangle] = \langle \theta_0, \varphi(a) \otimes \mu_X \rangle, \qquad \mu_X := E[\varphi(X)] = [\,\ldots\, E[\varphi_i(X)]\, \ldots\,],$$
the feature map (mean embedding) of the probability P(X).

ATE: example
US Job Corps, training for disadvantaged youths. X: covariate/context (age, education, marital status, ...); A: treatment (training hours); Y: outcome (percent employment).
Empirical ATE:
$$\widehat{\mathrm{ATE}}(a) = \big\langle \hat\theta_0, \hat E[\varphi(X)] \otimes \varphi(a) \big\rangle = \frac{1}{n} \sum_{i=1}^n Y^\top (K_{AA} \odot K_{XX} + n\lambda I)^{-1} (K_{Aa} \odot K_{Xx_i}).$$
Schochet, Burghardt, and McConnell (2008). Does Job Corps work? Impact findings from the national Job Corps study. Singh, Xu, G (2022a).
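A minimal sketch of the empirical dose-response estimator $\widehat{\mathrm{ATE}}(a)$ above, assuming Gaussian kernels on treatment and covariates. Averaging the columns of K_XX implements the $(1/n)\sum_i K_{Xx_i}$ term, i.e. the empirical mean embedding of P(X). The kernels, regularisation value, and toy data-generating process are assumptions of this sketch, not the configuration used in the Job Corps study.

```python
import numpy as np

def gaussian_kernel(X1, X2, lengthscale=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * lengthscale**2))

def empirical_ate(A, X, Y, a_grid, lam=1e-2):
    """ATE-hat(a) = (1/n) sum_i Y^T (K_AA * K_XX + n*lam*I)^(-1) (K_Aa * K_Xxi),
    where * denotes the elementwise (Hadamard) product."""
    n = len(Y)
    K_AA = gaussian_kernel(A, A)
    K_XX = gaussian_kernel(X, X)
    K_Aa = gaussian_kernel(A, a_grid)                # n x m, column j is k(a_i, a_j)
    mean_k_X = K_XX.mean(axis=1, keepdims=True)      # (1/n) sum_i k(x_., x_i), n x 1
    weights = np.linalg.solve(K_AA * K_XX + n * lam * np.eye(n), K_Aa * mean_k_X)
    return Y @ weights                               # ATE-hat on the grid, length m

# Toy usage: confounded data where the true dose-response is E[Y^(a)] = a.
rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 1))
A = X + 0.5 * rng.standard_normal((n, 1))
Y = (A + X)[:, 0] + 0.1 * rng.standard_normal(n)
a_grid = np.linspace(-2, 2, 5)[:, None]
print(empirical_ate(A, X, Y, a_grid))
```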
ATE: results
[Figure: estimated percent employment as a function of class-hours (0-2000), RKHS estimate vs DML2.]
The first 12.5 weeks of classes confer an employment gain from 35% to 47%. [RKHS] is our $\widehat{\mathrm{ATE}}(a)$. [DML2] Colangelo, Lee (2020). Double debiased machine learning nonparametric inference with continuous treatments. Singh, Xu, G (2022a).

Conditional average treatment effect
Well-specified setting: $E[Y \mid a, x, v] =: \gamma_0(a, x, v) = \langle \theta_0, \varphi(a) \otimes \varphi(x) \otimes \varphi(v) \rangle$.
Conditional ATE:
$$\mathrm{CATE}(a, v) = E\big[Y^{(a)} \mid V = v\big] = E\big[\langle \theta_0, \varphi(a) \otimes \varphi(X) \otimes \varphi(V) \rangle \mid V = v\big] = \ldots\,?$$
How to take the conditional expectation? Density estimation for $p(X \mid V = v)$? Sampling from $p(X \mid V = v)$? Instead, write
$$\mathrm{CATE}(a, v) = \big\langle \theta_0, \varphi(a) \otimes E[\varphi(X) \mid V = v] \otimes \varphi(v) \big\rangle,$$
and learn the conditional mean embedding $\mu_{X \mid V = v} := E_X[\varphi(X) \mid V = v]$.

Regressing from feature space to feature space
Our goal: an operator $F_0 : \mathcal{H}_V \to \mathcal{H}_X$ such that $F_0\, \varphi(v) = \mu_{X \mid V = v}$.
Assume $F_0 \in \overline{\mathrm{span}}\{\varphi(x) \otimes \varphi(v)\}$, i.e. $F_0 \in \mathrm{HS}(\mathcal{H}_V, \mathcal{H}_X)$. Implied smoothness assumption: $E[h(X) \mid V = v] \in \mathcal{H}_V$ for all $h \in \mathcal{H}_X$.
Kernel ridge regression from $\varphi(v)$ to the infinite features $\varphi(x)$:
$$\hat F = \operatorname*{argmin}_{F \in \mathrm{HS}} \sum_{\ell=1}^n \|\varphi(x_\ell) - F \varphi(v_\ell)\|^2_{\mathcal{H}_X} + n\lambda_1 \|F\|^2_{\mathrm{HS}}.$$
Ridge regression solution:
$$\hat\mu_{X \mid V = v} := \hat E[\varphi(X) \mid V = v] = \hat F \varphi(v) = \sum_{\ell=1}^n \varphi(x_\ell)\, \beta_\ell(v), \qquad \beta(v) = [K_{VV} + n\lambda_1 I]^{-1} k_{Vv}.$$
Song, Huang, Smola, Fukumizu (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. Grunewalder, Lever, Baldassarre, Patterson, G, Pontil (2012). Conditional mean embeddings as regressors. Grunewalder, G, Shawe-Taylor (2013). Smooth operators. Li, Meunier, Mollenhauer, G (2022). Optimal Rates for Regularized Conditional Mean Embedding Learning.
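A minimal sketch of the conditional mean embedding regression above. The embedding $\hat\mu_{X \mid V = v}$ is never formed explicitly: everything downstream only needs the weights $\beta(v) = (K_{VV} + n\lambda_1 I)^{-1} k_{Vv}$, since $\hat E[h(X) \mid V = v] = \sum_\ell \beta_\ell(v)\, h(x_\ell)$ for h in $\mathcal{H}_X$. The Gaussian kernel, regularisation value, and toy data are assumptions of this sketch.

```python
import numpy as np

def gaussian_kernel(X1, X2, lengthscale=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * lengthscale**2))

def cme_weights(V, v_query, lam1=1e-2):
    """beta(v) = (K_VV + n*lam1*I)^(-1) k_Vv, so that
    mu-hat_{X|V=v} = sum_l beta_l(v) phi(x_l)."""
    n = V.shape[0]
    K_VV = gaussian_kernel(V, V)
    k_Vv = gaussian_kernel(V, v_query)                        # n x m
    return np.linalg.solve(K_VV + n * lam1 * np.eye(n), k_Vv)

def conditional_expectation(h_of_x, beta):
    """E-hat[h(X) | V = v] = sum_l beta_l(v) h(x_l), one estimate per query point."""
    return h_of_x @ beta

# Toy usage: X = V^2 + noise, so E[X | V = v] should be close to v^2.
rng = np.random.default_rng(0)
V = rng.uniform(-2, 2, size=(500, 1))
X = V**2 + 0.1 * rng.standard_normal((500, 1))
beta = cme_weights(V, np.array([[-1.5], [0.0], [1.5]]), lam1=1e-3)
print(conditional_expectation(X[:, 0], beta))   # true values: [2.25, 0.0, 2.25]
```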
Consistency of conditional mean embedding
Write $E_0$ for the true conditional mean embedding operator ($F_0$ above). Assume the problem is well specified [B, Assumption 6]:
$$E_0 = G_1 T_1^{\frac{c_1 - 1}{2}}, \qquad c_1 \in (1, 2], \qquad \|G_1\|^2_{\mathrm{HS}} < \infty,$$
where $T_1$ is the covariance operator of the features $\varphi(v)$, with eigenspectrum decaying as $\lambda_{1,j} \lesssim j^{-b_1}$, $b_1 \ge 1$. Larger $c_1$ means a smoother $E_0$, and hence an easier problem.
Consistency [A, Theorem 2, Theorem 3]:
$$\big\| \hat E - E_0 \big\|_{\mathrm{HS}} = O_P\!\left( n^{-\frac{1}{2}\, \frac{c_1 - 1}{c_1 + 1/b_1}} \right),$$
and the best achievable rate is $O_P(n^{-1/4})$ (minimax).
[A] Li, Meunier, Mollenhauer, G (2022). Optimal Rates for Regularized Conditional Mean Embedding Learning. [B] Singh, Xu, G (2022a). Earlier consistency proofs for finite-dimensional $\varphi(x)$: Grunewalder, Lever, Baldassarre, Patterson, G, Pontil (2012); Caponnetto, De Vito (2007).

Consistency of CATE
Empirical CATE:
$$\hat\gamma_{\mathrm{CATE}}(a, v) = Y^\top (K_{AA} \odot K_{XX} \odot K_{VV} + n\lambda I)^{-1} \big( K_{Aa} \odot K_{XX}(K_{VV} + n\lambda_1 I)^{-1} k_{Vv} \odot k_{Vv} \big),$$
where the factor $K_{XX}(K_{VV} + n\lambda_1 I)^{-1} k_{Vv}$ comes from $\hat\mu_{X \mid V = v}$.
Consistency [A, Theorem 2]:
$$\| \hat\gamma_{\mathrm{CATE}} - \gamma_{0,\mathrm{CATE}} \|_\infty = O_P\!\left( n^{-\frac{1}{2}\, \frac{c - 1}{c + 1/b}} + n^{-\frac{1}{2}\, \frac{c_1 - 1}{c_1 + 1/b_1}} \right).$$
This follows from consistency of $\hat E$ and $\hat\gamma$, under the assumptions $E_0 = G_1 T_1^{\frac{c_1 - 1}{2}}$ with $\|G_1\|^2_{\mathrm{HS}} < \infty$, and the analogous source condition $\gamma_0 \in \mathcal{H}^c$ for the outcome regression. [A] Singh, Xu, G (2022a).

Conditional ATE: example
US Job Corps, training for disadvantaged youths. X: confounder/context (education, marital status, ...); A: treatment (training hours); Y: outcome (percent employed); V: age. Singh, Xu, G (2022a).

Conditional ATE: results
[Figure: contour plot of average percentage employment $Y^{(a)}$ as a function of class-hours a and age v.]
Given around 12-14 weeks of classes: for 16-year-olds, employment increases from 28% to at most 36%; for 22-year-olds, percent employment increases from 40% to 56%. Singh, Xu, G (2022a).
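A minimal sketch of the empirical CATE formula above, combining the product-kernel outcome regression with the conditional-mean-embedding weights for $\hat\mu_{X \mid V = v}$. The kernels, the regularisers lam and lam1, and the toy model are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, lengthscale=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * lengthscale**2))

def empirical_cate(A, X, V, Y, a_query, v_query, lam=1e-2, lam1=1e-2):
    """CATE-hat(a, v) = Y^T (K_AA * K_XX * K_VV + n*lam*I)^(-1)
                        (k_Aa * K_XX (K_VV + n*lam1*I)^(-1) k_Vv * k_Vv),
    where * denotes the elementwise (Hadamard) product."""
    n = len(Y)
    K_AA, K_XX, K_VV = (gaussian_kernel(Z, Z) for Z in (A, X, V))
    k_Aa = gaussian_kernel(A, a_query)                            # n x 1
    k_Vv = gaussian_kernel(V, v_query)                            # n x 1
    beta_v = np.linalg.solve(K_VV + n * lam1 * np.eye(n), k_Vv)   # CME weights for mu-hat_{X|V=v}
    rhs = k_Aa * (K_XX @ beta_v) * k_Vv
    alpha = np.linalg.solve(K_AA * K_XX * K_VV + n * lam * np.eye(n), rhs)
    return float(Y @ alpha[:, 0])

# Toy usage: Y = A * (1 + V) + noise, so the true CATE(a, v) is a * (1 + v).
rng = np.random.default_rng(0)
n = 300
V = rng.uniform(-1, 1, size=(n, 1))
X = V + 0.5 * rng.standard_normal((n, 1))
A = X + 0.5 * rng.standard_normal((n, 1))
Y = (A * (1 + V))[:, 0] + 0.1 * rng.standard_normal(n)
print(empirical_cate(A, X, V, Y, np.array([[1.0]]), np.array([[0.5]])))  # true CATE(1.0, 0.5) = 1.5
```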
Counterfactual: average treatment on the treated
Conditional mean: $E[Y \mid a, x] = \gamma_0(a, x) = \langle \theta_0, \varphi(a) \otimes \varphi(x) \rangle$.
Average treatment on the treated:
$$\theta_{\mathrm{ATT}}(a, a') = E\big[Y^{(a')} \mid A = a\big] = \big\langle \theta_0, \varphi(a') \otimes E_P[\varphi(X) \mid A = a] \big\rangle = \big\langle \theta_0, \varphi(a') \otimes \mu_{X \mid A = a} \big\rangle.$$
Empirical ATT:
$$\hat\theta_{\mathrm{ATT}}(a, a') = Y^\top (K_{AA} \odot K_{XX} + n\lambda I)^{-1} \big( K_{Aa'} \odot K_{XX}(K_{AA} + n\lambda_1 I)^{-1} k_{Aa} \big),$$
where the factor $K_{XX}(K_{AA} + n\lambda_1 I)^{-1} k_{Aa}$ comes from $\hat\mu_{X \mid A = a}$.
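A minimal sketch of the empirical ATT formula above; it mirrors the CATE sketch, with the conditional mean embedding now taken with respect to the treatment actually received, $\hat\mu_{X \mid A = a}$. Kernels, regularisers, and the toy model are assumptions of the illustration.

```python
import numpy as np

def gaussian_kernel(X1, X2, lengthscale=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * lengthscale**2))

def empirical_att(A, X, Y, a, a_prime, lam=1e-2, lam1=1e-2):
    """ATT-hat(a, a') = Y^T (K_AA * K_XX + n*lam*I)^(-1)
                        (k_Aa' * K_XX (K_AA + n*lam1*I)^(-1) k_Aa),
    where * denotes the elementwise (Hadamard) product."""
    n = len(Y)
    K_AA, K_XX = gaussian_kernel(A, A), gaussian_kernel(X, X)
    k_Aa = gaussian_kernel(A, a)               # treatment actually received, a
    k_Aap = gaussian_kernel(A, a_prime)        # counterfactual treatment, a'
    beta_a = np.linalg.solve(K_AA + n * lam1 * np.eye(n), k_Aa)   # weights for mu-hat_{X|A=a}
    rhs = k_Aap * (K_XX @ beta_a)
    alpha = np.linalg.solve(K_AA * K_XX + n * lam * np.eye(n), rhs)
    return float(Y @ alpha[:, 0])

# Toy usage: Y = A + X + noise, so the true ATT is a' + E[X | A = a].
rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 1))
A = X + 0.5 * rng.standard_normal((n, 1))
Y = (A + X)[:, 0] + 0.1 * rng.standard_normal(n)
print(empirical_att(A, X, Y, a=np.array([[0.0]]), a_prime=np.array([[1.0]])))
```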
Mediation analysis
Direct path from treatment A to effect Y; indirect path A → M → Y; X: context. Is the effect on Y mainly due to A? To M?

Mediation analysis: example
US Job Corps, training for disadvantaged youths. X: confounder/context (age, education, marital status, ...); A: treatment (training hours); Y: outcome (arrests); M: mediator (employment).
$$\gamma_0(a, m, x) = E[Y \mid A = a, M = m, X = x].$$
A quantity of interest, the mediated effect:
$$E\big[Y^{\{a', M^{(a)}\}}\big] = \int \gamma_0(a', M, X)\, dP(M \mid A = a, X)\, dP(X) = \big\langle \theta_0, \varphi(a') \otimes E_P\big[\mu_{M \mid A = a, X} \otimes \varphi(X)\big] \big\rangle:$$
the effect of intervention a', with the mediator $M^{(a)}$ as if the intervention were a. Singh, Xu, G (2022b). Kernel Methods for Multistage Causal Inference: Mediation Analysis and Dynamic Treatment Effects.

Mediation analysis: results
Total effect: $\theta_0^{\mathrm{TE}}(a, a') := E\big[Y^{\{a', M^{(a')}\}} - Y^{\{a, M^{(a)}\}}\big]$. Direct effect: $\theta_0^{\mathrm{DE}}(a, a') := E\big[Y^{\{a', M^{(a)}\}} - Y^{\{a, M^{(a)}\}}\big]$.
[Figure: contour plots of total and direct effects on arrests, as functions of class-hours a and a'.]
a' = 1600 hours vs a = 480 hours means a 0.1 reduction in arrests; the indirect effect mediated via employment is effectively zero. Singh, Xu, G (2022b).

... dynamic treatment effect ...
Dynamic treatment effect: a sequence $A_1, A_2$ of treatments, with contexts $X_1, X_2$ and outcome Y. Potential outcomes $Y^{(a_1)}, Y^{(a_2)}, Y^{(a_1, a_2)}, \ldots$; counterfactuals $E\big[Y^{(a_1', a_2')} \mid A_1 = a_1, A_2 = a_2\big]$, ... (c.f. the Robins G-formula). Singh, Xu, G (2022b). Kernel Methods for Multistage Causal Inference: Mediation Analysis and Dynamic Treatment Effects.

Conclusions
Kernel and neural net solutions for ATE, CATE, and dynamic treatment effects, with treatment A, covariates X, V, and proxies (W, Z) allowed to be multivariate and "complicated". Convergence guarantees for kernels and NNs. Next lecture: unobserved covariates/confounders (IV and proxy methods). Code is available for all methods.

Research support
Work supported by: the Gatsby Charitable Foundation and Google DeepMind.

Questions?