Proxy Methods for Domain Adaptation

Document Details


2024

AISTATS

Katherine Tsai, Stephen R. Pfohl, Olawale Salaudeen, Nicole Chiou, Matt J. Kusner, Alexander D’Amour, Sanmi Koyejo, Arthur Gretton

Tags

domain adaptation, causal inference, machine learning, proxy variables

Summary

This AISTATS 2024 paper explores domain adaptation under latent shift. The authors propose proxy methods to adapt to distribution shifts without needing to explicitly recover latent variables, demonstrating superior performance compared to existing methods. They consider concept bottleneck and multi-domain scenarios.

Full Transcript


Proxy Methods for Domain Adaptation

Katherine Tsai (University of Illinois Urbana-Champaign), Stephen R. Pfohl (Google Research), Olawale Salaudeen (University of Illinois Urbana-Champaign), Nicole Chiou (Stanford University), Matt J. Kusner (University College London), Alexander D'Amour (Google DeepMind), Sanmi Koyejo (Google DeepMind, Stanford University), Arthur Gretton (Google DeepMind, Gatsby Computational Neuroscience Unit)

Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain. PMLR: Volume 238. Copyright 2024 by the author(s).

Abstract

We study the problem of domain adaptation under distribution shift, where the shift is due to a change in the distribution of an unobserved, latent variable that confounds both the covariates and the labels. In this setting, neither the covariate shift nor the label shift assumptions apply. Our approach to adaptation employs proximal causal learning, a technique for estimating causal effects in settings where proxies of unobserved confounders are available. We demonstrate that proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables. We consider two settings: (i) Concept Bottleneck, where an additional "concept" variable is observed that mediates the relationship between the covariates and labels; and (ii) Multi-Domain, where training data from multiple source domains is available, and each source domain exhibits a different distribution over the latent confounder. We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings. In our experiments, we show that our approach outperforms other methods, notably those which explicitly recover the latent confounder.

1 Introduction

The goal of domain adaptation is to transfer an accurate model from a labeled source domain to an unlabeled target domain, which has a different but related distribution (Pan et al., 2010; Koh et al., 2021; Malinin et al., 2021). It is motivated by the fact that labeling data is often labor intensive, and sometimes requires domain expertise. For example, the distribution of patients diagnosed with a condition at hospital A and hospital B may differ due to patients' socioeconomic status, demographics, and other factors. However, labeled data might only be available at hospital A and not at hospital B (e.g., due to less funding). As a result, an accurate model for patients from hospital A may perform poorly for patients from hospital B.

In order to provide guarantees on the accuracy of a transferred model, one of two classical assumptions has been made: label shift or covariate shift. Label shift (Buck et al., 1966; Lipton et al., 2018) assumes that the distribution of a label $P(Y)$ shifts between source and target domains, but the conditional distribution $P(X \mid Y)$ does not. Conversely, covariate shift (Shimodaira, 2000) assumes that the covariate distribution $P(X)$ shifts between domains, but the distribution $P(Y \mid X)$ stays the same. Each assumption provides theoretical guarantees on the generalization of a transferred classifier. In fact, without any assumptions, the source and target domains could differ arbitrarily, making guarantees impossible. However, these assumptions are often too restrictive to apply in real-world settings (Zhang et al., 2015; Schrouff et al., 2022). For instance, if covariates $X$ and labels $Y$ are confounded by a third variable $U$, it is possible for neither $P(X \mid Y)$ nor $P(Y \mid X)$ to be equal across domains.
For example, demographic information $U$ could confound the relationship between a diagnosis $Y$ and a radiological image $X$. In this example, if two hospitals have different distributions over demographics, both label shift and covariate shift adaptation methods will fail to transfer a classifier across hospitals.

To address this, recent work has introduced a latent shift assumption: the distribution of $U$, an unobserved latent confounder of $X$ and $Y$, shifts between the source and target domain (Alabdulmohsin et al., 2023). In this setting, all distributions of $X$ and $Y$ (without conditioning on $U$) may differ across the domains, violating the label and covariate shift assumptions.

Contributions. We propose techniques for domain adaptation under the latent shift assumption that are guaranteed to identify the optimal predictor $E[Y \mid x]$ in the target domain. We make use of proxy methods (Miao et al., 2018), which are a recently developed framework for causal effect estimation in the presence of a hidden confounder $U$, given indirect proxy information on $U$. Compared to prior work (Alabdulmohsin et al., 2023), our techniques do not require: identifying the distribution of the latent variable $U$, that $U$ be discrete, or further linear independence assumptions. We consider two settings: (1) Concept Bottleneck: we observe, in both domains, a proxy $W$ of the unobserved confounder $U$ and a concept $C$ that mediates the direct relationship between $X$ and $Y$ (Alabdulmohsin et al., 2023); or (2) Multi-Domain: we do not observe $C$ in either domain, but have access to observations from multiple source domains. For both settings, we provide guarantees for identifying $E[Y \mid x]$ without observing $Y$ in the target domain. When $E[Y \mid x]$ is identifiable, we develop practical two-stage kernel estimators to perform adaptation. The code is available at https://github.com/koyejo-lab/ProxyDA.
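To make the motivating failure concrete before the formal setup, here is a minimal simulation (our illustration, not from the paper; all distributions are made up). It fixes the mechanisms $P(X \mid U)$ and $P(Y \mid X, U)$ and shifts only $P(U)$, then checks the two classical invariances empirically: both $P(Y \mid X)$ and $P(X \mid Y)$ move with $P(U)$, so neither a covariate-shift nor a label-shift correction applies.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(p_u, n=200_000):
    """Sample from a latent-shift SCM: U -> X, (U, X) -> Y.
    Only P(U) varies across domains; P(X|U) and P(Y|X,U) are fixed."""
    u = rng.binomial(1, p_u, n)
    x = rng.binomial(1, np.where(u == 1, 0.8, 0.3))   # P(X=1|U)
    y = rng.binomial(1, 0.2 + 0.5 * u + 0.2 * x)      # P(Y=1|X,U)
    return x, y

for name, p_u in [("source", 0.1), ("target", 0.7)]:
    x, y = sample(p_u)
    p_y_given_x1 = y[x == 1].mean()   # estimates P(Y=1 | X=1)
    p_x_given_y1 = x[y == 1].mean()   # estimates P(X=1 | Y=1)
    print(f"{name}: P(Y=1|X=1)={p_y_given_x1:.3f}  P(X=1|Y=1)={p_x_given_y1:.3f}")

# Both conditionals move with P(U), so neither the covariate-shift
# assumption (P(Y|X) invariant) nor the label-shift assumption
# (P(X|Y) invariant) holds under a latent shift.
```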
2 Related Work

The development of techniques for learning robust models and adapting to distribution shift has a long history in machine learning, but recently has received increased attention (Shen et al., 2021; Zhou et al., 2022; Wang et al., 2022).

Causality for domain adaptation. Our work is inspired by techniques that formulate the covariate/label shift settings as assumptions on the causal structure for domain adaptation and distributional robustness (e.g., Schölkopf et al. (2012); Peters et al. (2015); Zhang et al. (2015); Subbaswamy et al. (2019); Rothenhäusler et al. (2021); Veitch et al. (2021); Magliacane et al. (2018); Arjovsky et al. (2019); Ganin et al. (2016); Ben-David et al. (2010); Oberst et al. (2021)).

Proximal causal inference. Our identification technique is inspired by approaches used to identify causal effects with unobserved confounding with observed proxies (Kuroki and Pearl, 2014; Miao et al., 2018; Deaner, 2018; Tchetgen et al., 2020; Mastouri et al., 2021; Cui et al., 2023; Xu and Gretton, 2023). These approaches design 'bridge functions' to connect quantities involving a proxy $W$ with those of the label $Y$. The beauty of this approach is that these bridge functions are implicitly a marginalization over $U$. This allows these approaches to identify causal quantities without identifying distributions involving $U$.

Latent shift. Our work is most closely related to Alabdulmohsin et al. (2023), who introduced the setting of latent shift with proxies $W$ and concepts $C$. They showed that the optimal predictor $E[Y \mid x]$ is identifiable in the target domain if $W$ and $C$ are observed in the source domain and $X$ is observed in the target domain. To do so, they required (a) identification of distributions involving $U$, (b) that $U$ is a discrete variable, (c) knowledge of the dimensionality of $U$, and (d) additional linear independence assumptions. In contrast, our work derives identification results for arbitrary $U$, and does not require any of (a)-(d). However, there is no free lunch: to achieve this, we require that proxies $W$ are observed in the target, and either that (i) concepts $C$ are also observed in the target, or (ii) we observe multiple source domains. For (ii) we do not require $C$ in either the source or the target, but for full identification we require that $U$ is discrete.

3 Problem Framework

Let $P(\cdot)$ and $Q(\cdot)$ denote the probability distribution functions of the source domain and target domain, respectively. Let $p$ and $q$ indicate source and target quantities. Our goal is to study identification and estimation of the optimal target predictor $E_q[Y \mid x]$ when $Y$ is not observed in the target domain.

Concept Bottleneck. The first setting we study is described by the graph in Figure 1c. We have two additional variables: (i) proxies $W$, which provide auxiliary information about $U$, or can be seen as a noisy version of it (Kuroki and Pearl, 2014), and (ii) concepts $C$, which mediate or 'bottleneck' the relationship between the covariates $X$ and labels $Y$ (Goyal et al., 2019; Koh et al., 2020). For example, Koh et al. (2020) describe a setting where the concepts $C$ are high-level clinical and morphological features of a knee X-ray $X$, which mediate the relationship with osteoporosis severity $Y$. In this example, $U$ could describe demographic variations that alter symptoms $X$, $C$ and outcome $Y$, and the proxies $W$ could include patient background and clinical history (e.g., prior diagnoses, medications, procedures, etc.). For the source domain we assume we observe $(X, C, W, Y) \sim P$ and for the target domain we observe $(X, C, W) \sim Q$.

[Figure 1: Causal diagrams. (a) Covariate shift; (b) Label shift; (c) Concept Bottleneck shift; (d) Multi-Domain shift. Shaded circles denote unobserved variables and solid circles denote observed variables. $X$ is the covariate, $Y$ is the response, $C$ is the concept, $W$ is the proxy, $Z$ is the domain-related variable, and $U$ is the latent variable.]

We formalize the notion of latent shift, as introduced in Alabdulmohsin et al. (2023).

Assumption 1 (Concept Bottleneck, Alabdulmohsin et al. (2023)). The shift between $P$ and $Q$ is located in unobserved $U$, i.e., there is a latent shift $P(U) \neq Q(U)$, but $P(V \mid U) = Q(V \mid U)$, where $V \subseteq \{W, X, C, Y\}$.

This assumption states that every variable conditioned on $U$ is invariant across domains.
However, as $P(U) \neq Q(U)$, none of the marginal distributions are: $P(V) \neq Q(V)$ for $V \subseteq \{W, X, C, Y\}$. This assumption is a generalization of covariate shift $P(Y \mid X, U) = Q(Y \mid X, U)$ (Shimodaira, 2000) and label shift $P(X \mid Y, U) = Q(X \mid Y, U)$ (Buck et al., 1966), with associated graphs in Figure 1a-1b.

Assumption 2 (Structural assumption). Graphs in Figure 1 are faithful and Markov (Spirtes et al., 2000).

Under Assumption 2, we have the following conditional independence properties for the graph in Figure 1c:

$$Y \perp\!\!\!\perp X \mid \{U, C\}, \qquad W \perp\!\!\!\perp \{X, C\} \mid U.$$

With this conditional independence structure, $\{U, C\}$ blocks the information from $X$ to $Y$, and $U$ blocks the information flow from $W$ to $\{X, C\}$. We will see in Section 4 that these assumptions allow us to obtain $Q(Y \mid x)$ from $Q(W, C \mid x)$ in the target domain, where the latter is a function of observed quantities.

Multi-domain. In the second setting, suppose we do not observe the concepts $C$ in any domain, but instead observe data from multiple source domains, according to the graph in Figure 1d. For instance, we may want to learn a classifier for a target hospital that has only unlabelled data, using data from several source hospitals with labelled data. Here, let $Z$ be a random variable in $\mathcal{Z}$ denoting a prior over the source domains, and let $P(U \mid Z)$ be the distribution of $U$ given $Z$. We make $k_Z$ draws from $Z$, indexed by $r \in \{1, \dots, k_Z\}$, and write $\{z_1, \dots, z_{k_Z}\} =: \hat{\mathcal{Z}} \subseteq \mathcal{Z}$. For each source domain $z_r$, we observe $(X, W, Y) \sim P(X, W, Y \mid z_r) =: P_r(X, W, Y)$. For the target, we denote it with index $k_Z + 1$ and only observe $(X, W) \sim P(X, W \mid z_{k_Z+1}) =: Q(X, W)$. In general let $P_r(V) := P(V \mid z_r)$ and $Q(V) := P(V \mid z_{k_Z+1})$ for any $V \subseteq \{W, X, Y, U\}$. For this setting we replace Assumption 1 with the following shift assumption.

Assumption 3 (Multi-Domain). For each $z, z' \in \hat{\mathcal{Z}}$ such that $z \neq z'$, we have $P(U \mid z) \neq P(U \mid z') \neq Q(U)$.

Note that Assumption 2 implies the following conditional independence property in Figure 1d:

$$\{Y, X, W\} \perp\!\!\!\perp Z \mid U.$$

Note that under Assumption 3, we allow all joint distributions to be different: $P(W, X, U, Y \mid z) \neq P(W, X, U, Y \mid z') \neq Q(W, X, U, Y)$ for $z \neq z' \in \hat{\mathcal{Z}}$.

4 Identification under Latent Shifts

Our identification techniques are inspired by proximal causal inference (Tchetgen et al., 2020). The key idea is to design so-called "bridge" functions to identify distributions confounded by unobserved variables. We first show that with additional proxies and concepts, $E_q[Y \mid x]$ is identifiable under any latent shift.

4.1 Identification with Concepts

To prove identifiability, we need certain assumptions to hold for the shift. The first is a regularity assumption, also known as a completeness condition, and is commonly used to identify causal estimands (D'Haultfoeuille, 2011; Miao et al., 2018).

Assumption 4 (Informative variables). Let $g$ be any mean squared integrable function. Both the source domain and the target domain, $(f, F) \in \{(p, P), (q, Q)\}$, satisfy $E_f[g(U) \mid x, c] = 0$ for all $x \in \mathcal{X}$, $c \in \mathcal{C}$ if and only if $g(U) = 0$ almost surely with respect to $F(U)$.

At a high level, completeness states that $X$ must have sufficient variability related to the change of $U$. This is a common assumption made in proximal causal inference (cf. Condition (ii) in Miao et al. (2018) and Assumption 3 in Mastouri et al. (2021)). For more details on the justification of the completeness assumption, see the supplementary material of Miao et al. (2022).
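In the discrete case, this completeness condition has a simple linear-algebra reading; the sketch below (ours, with toy probability tables) checks it: $E[g(U) \mid x, c] = \sum_u g(u) P(u \mid x, c)$ vanishes for every $x$ only when $g = 0$, exactly if the matrix of conditionals $\mathbf{P}(U \mid X, c)$ has full row rank $k_U$.

```python
import numpy as np

rng = np.random.default_rng(7)
k_u, k_x = 3, 6   # needs enough distinct x values: k_x >= k_u

# Columns are P(U | x, c) for each value of x, at a fixed concept value c.
P_U_X = rng.dirichlet(np.ones(k_u), size=k_x).T        # k_u x k_x

# E[g(U) | x, c] = (P_U_X.T @ g)[x]. This vanishes for all x only at g = 0
# precisely when P_U_X has a trivial left null space, i.e. full row rank k_u.
print(np.linalg.matrix_rank(P_U_X) == k_u)             # completeness holds
```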
Second, we need a guarantee on the support of $u \in \mathcal{U}$. Intuitively, if a $u \in \mathcal{U}$ has non-zero probability in the target domain, it should have non-zero probability in the source domain as well. Otherwise, it is impossible to adjust to certain shifts (as we never see these regimes in the source domain). This is similar to the positivity assumption commonly made in the causality literature (Hernán and Robins, 2006).

Assumption 5 (Positivity). For any $u \in \mathcal{U}$, if $Q(u) > 0$ then $P(u) > 0$.

If data are generated according to Figure 1c, and the regularity conditions 8-10 hold (see Appendix A.2), Miao et al. (2018) first showed the existence of solutions $h_0^p(w, c)$, $h_0^q(w, c)$ of the following equations:

$$E_p[Y \mid c, x] = \int_{\mathcal{W}} h_0^p(w, c)\, dP(w \mid c, x), \qquad E_q[Y \mid c, x] = \int_{\mathcal{W}} h_0^q(w, c)\, dQ(w \mid c, x). \tag{4.1}$$

The terms $h_0^p(w, c)$, $h_0^q(w, c)$ are called 'bridge' functions as they connect the proxy $W$ to the label $Y$. If we are able to identify $h_0^q(w, c)$ then we can identify $E_q[Y \mid x]$, by using eq. (4.1) to obtain $E_q[Y \mid C, x]$ and marginalizing over $Q(C \mid x)$.

We show that it is possible to connect identification of $h_0^q(w, c)$ with that of $h_0^p(w, c)$, leading directly to identification of $E_q[Y \mid x]$.

Theorem 4.1. Assume that $h_0^p$ and $h_0^q$ exist (i.e., regularity Assumptions 8-10 hold). Then given Assumptions 1, 2, 4, 5 we have that, for any $c \in \mathcal{C}$,

$$\int_{\mathcal{W}} h_0^p(w, c)\, dP(w \mid u) = \int_{\mathcal{W}} h_0^q(w, c)\, dQ(w \mid u),$$

almost surely with respect to $Q(U)$. This implies that

$$E_q[Y \mid x] = \int_{\mathcal{W} \times \mathcal{C}} h_0^p(w, c)\, dQ(w, c \mid x).$$

The proof is given in Appendix B.1. Hence, given $h_0^p$ and $(W, X, C)$ from the target $Q$, we are able to adapt to arbitrary distribution shifts in unobserved $U$. The advantage of this approach is that it will not require estimating any distributions involving $U$. We demonstrate this in Section 5.

While concepts can ensure identifiability, they may not be available in practice. In this case, a natural question is whether the optimal target predictor $E_q[Y \mid x]$ is still identifiable. In the next section we show that if we instead have access to data from multiple source domains, $E_q[Y \mid x]$ may again be identifiable.

4.2 The Blessings of Multiple Domains

We now turn to the multi-domain setting. The graphical structure in Figure 1d is similar to the structure in Figure 1c with $C$ replaced by $X$, $X$ replaced by $Z$, and the arrow between $U$ and $Z$ flipped. Although the bridge function proposed by Miao et al. (2018) assumes an edge from $U$ to $Z$, changing the direction from $Z$ to $U$ does not change the conditional independence structure (Pearl, 2009). The main difference is we will only be able to guarantee full identification when $U$ is discrete. We start by demonstrating this, and then give an example of the inherent difficulty of identification when $U$ is continuous.

To begin, for simplicity, assume $U$ and $W$ are discrete (with dimensionalities $k_U$ and $k_W$). We have finitely many samples from $Z$, denoted as $z_1, \dots, z_{k_Z}$, corresponding to our training domains. We seek a bridge function (in this case, a matrix $M_0(w_i, x)$) satisfying

$$E_r[Y \mid x] = \sum_{i=1}^{k_W} M_0(w_i, x)\, P_r(w_i \mid x), \tag{4.2}$$

for all $r = 1, \dots, k_Z$, where $E_r[Y \mid x]$ is the conditional expectation obtained in domain $r$, and $P_r(W \mid x) = P(W \mid x, z_r)$.

In order to identify $M_0(w_i, x)$ and $E_q[Y \mid x]$, we need enough source domains to capture the variability of $U$. The following result describes how many we need.

Proposition 4.2. Suppose that we have $k_Z$ source domains and $W$, $U$ have $k_W$ and $k_U$ categories respectively. Then, if $k_W, k_Z \geq k_U$ and subject to appropriate rank conditions (see proof in Appendix B.2), the bridge function is identifiable and does not depend on the specific $z$.

This result generalizes the identification analysis developed in Miao et al. (2018). If the number of observed source domains $k_Z$ is greater than the dimension of the latent $U$, then subject to appropriate identifiability requirements (detailed in Appendix B.2), we can recover the bridge $M_0(w_i, x)$.
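To see the mechanics of Proposition 4.2 numerically, here is a toy check (ours; the distributions are arbitrary): per-domain moments are generated from shared mechanisms $P(W \mid U)$ and $E[Y \mid U, x]$ with domain-specific $P_r(U \mid x)$, the linear system (4.2) is solved for the bridge vector $M_0(\cdot, x)$ by least squares, and the same bridge is confirmed to predict $E_q[Y \mid x]$ in an unseen target domain.

```python
import numpy as np

rng = np.random.default_rng(1)
k_u, k_w, k_z = 3, 4, 5          # needs k_w, k_z >= k_u (Proposition 4.2)

# Shared, domain-invariant mechanisms at a fixed x (toy values):
p_w_given_u = rng.dirichlet(np.ones(k_w), size=k_u).T   # P(W|U), (k_w, k_u)
e_y_given_u = rng.uniform(0, 1, k_u)                    # E[Y | u, x]

# Domain-specific latent distributions P_r(U | x), plus an unseen target:
p_u_sources = rng.dirichlet(np.ones(k_u), size=k_z).T   # (k_u, k_z)
p_u_target = rng.dirichlet(np.ones(k_u))                # (k_u,)

# Observable per-domain moments implied by the graph:
p_w_sources = p_w_given_u @ p_u_sources                 # P_r(W | x), (k_w, k_z)
e_y_sources = e_y_given_u @ p_u_sources                 # E_r[Y | x], (k_z,)

# Solve E_r[Y|x] = sum_i M0(w_i, x) P_r(w_i | x) for the bridge vector M0(., x):
m0, *_ = np.linalg.lstsq(p_w_sources.T, e_y_sources, rcond=None)

# The same bridge transfers to the unseen target domain:
e_y_target = m0 @ (p_w_given_u @ p_u_target)
print(np.allclose(e_y_target, e_y_given_u @ p_u_target))   # True: bridge transfers
```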
Now, consider the case where $U$ is discrete but all observed variables $W, X, Y$ are continuous. In this case we have the following system:

$$E_r[Y \mid x] = \int_{\mathcal{W}} m_0(w, x)\, dP_r(w \mid x), \tag{4.3}$$

for $r = 1, \dots, k_Z$. The proof of existence of $m_0$ is a modification of Proposition A.2, as shown in Proposition A.3. In order to identify the target $E_q[Y \mid x]$, we need the following assumption.

Assumption 6. Let $g$ be a square integrable function on $\mathcal{U}$. For each $x \in \mathcal{X}$ and for all $z \in \hat{\mathcal{Z}}$, $E[g(U) \mid x, z] = 0$ if and only if $g(U) = 0$, $P(U)$ almost surely.

Given this assumption we can prove identifiability.

Proposition 4.3. Suppose that Assumptions 1-3 and 6 hold; that $m_0$ exists; that $(W, X, Y)$ are observed for the sources $z \in \hat{\mathcal{Z}}$; and that $(W, X)$ is observed from the target domain. Then $E_q[Y \mid x]$ is identifiable, and for any $x \in \mathcal{X}$, we can write

$$E_q[Y \mid x] = \int_{\mathcal{W}} m_0(w, x)\, dQ(w \mid x). \tag{4.4}$$

The proof is given in Appendix B.3. Crucially, this result is valid only when Assumption 6 holds, and it remains unclear when it is expected to hold. Proposition 4.2 suggests that Assumption 6 is not vacuous when $U$ is finite dimensional.

Now let us consider the case where $U$ is continuous. In this case, unfortunately, Assumption 6 is unlikely to hold, preventing identification of $E_q[Y \mid x]$. This is illustrated in the following example.

Example 4.4. Recall the decomposition of both sides of (4.3). Under Assumption 2 and given the existence of $m_0$ (Proposition A.2),

$$E_p[Y \mid x, z] = \int_{\mathcal{W}} m_0(w, x)\, dP(w \mid x, z) = \int_{\mathcal{U}} \int_{\mathcal{W}} m_0(w, x)\, dP(w \mid u)\, dP(u \mid x, z); \tag{4.5}$$

$$E_p[Y \mid x, z] = \int_{\mathcal{U}} E_p[Y \mid x, u]\, dP(u \mid x, z). \tag{4.6}$$

For every $x$, Eqs. (4.5) and (4.6) represent projections onto $P(u \mid x, z_r)$, $r \in 1, \dots, k_z$. Consider $\mathcal{U} := [-\pi, \pi]$ with periodic boundary conditions, and for a given $x$ define $P(u \mid x, z_r) = (2\pi)^{-1}(1 + \cos(ru))$, $\forall r \in \mathbb{N}^+$ (note that cosines form an orthonormal basis). We now construct an example where (4.5) holds for some $z$ but not for others. Define the difference

$$E_p[Y \mid x, u] - \int_{\mathcal{W}} m_0(w, x)\, dP(w \mid u) = \cos((k_z + 1)u) =: g(u). \tag{4.7}$$

In this case, $g(u) \neq 0$, and in particular, (4.5) holds for all $r \leq k_z$, but not for $P(u \mid x, z_{k_z+1})$.

This example illustrates a larger point: for continuous $U$, no finite set of projections will suffice to completely characterize the square integrable functions on $\mathcal{U}$. That said, as more projections are employed, and subject to appropriate assumptions on the smoothness of (4.7), the error will reduce as more domains are observed. The characterization of this convergence will be the topic of future work. In experiments, we show that the adaptation can still be effective even when the latent variable $U \mid z_r$ is continuous valued and follows different Beta distributions for each distinct $r$, given just two training source domains.

5 Kernel Bridge Function Estimation

We introduce kernel methods to estimate the bridge functions and subsequently leverage the estimates to adapt to distribution shifts. Section 4 shows that bridge functions for both settings can be adapted to the target domain, so we drop the domain-specific indices and use $h_0$ and $m_0$ to denote the bridge functions. We begin by introducing the notation.

Notation. Let $\otimes$ denote the tensor product, $\ast$ the columnwise Khatri-Rao product, and $\odot$ the Hadamard product. For any space $\mathcal{V} \in \{\mathcal{X}, \mathcal{C}, \mathcal{W}, \mathcal{Y}\}$, let $k : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ be a positive semidefinite kernel function and let $\phi(v) = k(v, \cdot)$ for any $v \in \mathcal{V}$ be the feature map. We denote by $\mathcal{H}_V$ the RKHS on $\mathcal{V}$ associated with kernel function $k$. The RKHS has two properties: (i) for $f \in \mathcal{H}_V$, $f(v) = \langle f, k(v, \cdot) \rangle$ for all $v \in \mathcal{V}$, and (ii) $k(v, \cdot) \in \mathcal{H}_V$. We denote by $\langle \cdot, \cdot \rangle$ the inner product and by $\|\cdot\|_{\mathcal{H}_V}$ the induced norm. For notational simplicity, we denote the product space $\mathcal{H}_V \times \mathcal{H}_{V'}$ associated with the operation $\mathcal{H}_V \otimes \mathcal{H}_{V'}$ as $\mathcal{H}_{VV'}$. We define the kernel mean embedding as $\mu_V = E[\phi(V)] = \int k(v, \cdot)\, p(v)\, dv$ (Smola et al., 2007) and the conditional mean embedding as $\mu_{V \mid y} = \int k(v, \cdot)\, p(v \mid y)\, dv$ (Song et al., 2009; Singh et al., 2019). For $V \in \{W, X, C\}$, we denote the $a$-th batch of i.i.d. samples as $V_a = \{v_{a,i}\}_{i=1}^{n_a}$. Define the Gram matrices as $K_{V_a} = [k(v_{a,i}, v_{a,j})]_{i,j} \in \mathbb{R}^{n_a \times n_a}$ and $K_{V_{ab}} = [k(v_{a,i}, v_{b,j})]_{i,j} \in \mathbb{R}^{n_a \times n_b}$. Let $\Phi_{V_a} = [\phi(v_{a,1}), \dots, \phi(v_{a,n_a})]^\top \in \mathcal{H}_V^{n_a}$ be the vectorized feature map such that $\Phi_{V_a}(v') = [k(v_{a,1}, v'), \dots, k(v_{a,n_a}, v')]^\top \in \mathbb{R}^{n_a}$.
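As a concrete reference for this notation, the snippet below (ours; a Gaussian kernel with an arbitrary bandwidth) shows the objects behind it: a Gram matrix $K_V$, a vectorized feature map $\Phi_V(v')$, an empirical kernel mean embedding evaluated at a point, and the Hadamard products of Gram matrices used in the estimators of Section 5.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gram matrix K[i, j] = k(a_i, b_j) for the Gaussian (RBF) kernel."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))           # samples from some distribution P
v = rng.normal(size=(1, 2))             # evaluation point

K_X = gaussian_kernel(X, X)             # Gram matrix K_X in R^{n x n}
Phi_X_v = gaussian_kernel(X, v)[:, 0]   # vectorized feature map Phi_X(v)

# Empirical kernel mean embedding mu_X = E[phi(X)], evaluated at the point v:
mu_at_v = Phi_X_v.mean()

# Hadamard products K_X . K_C appear in the first-stage solves of Section 5:
C = rng.normal(size=(100, 1))
K_C = gaussian_kernel(C, C)
K_XC = K_X * K_C                        # elementwise (Hadamard) product
```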
5.1 Adaptation with Concepts

Suppose that the bridge function satisfies $h_0 \in \mathcal{H}_{WC}$, where $\mathcal{H}_{WC}$ is an RKHS. It follows from Theorem 4.1 that

$$E_q[Y \mid X = x] = E_q[h_0(W, C) \mid x] = E_q[\langle h_0, \phi(W) \otimes \phi(C) \rangle \mid x] = \langle h_0, \mu^q_{WC \mid x} \rangle. \tag{5.1}$$

To adapt to the distribution shifts, we estimate the bridge function $h_0$ in the source domain and the conditional mean embedding $\mu^q_{WC \mid x} = E_q[\phi(W) \otimes \phi(C) \mid x]$ in the target domain. The empirical estimate of the conditional mean embedding, along with its consistency proof, has been provided in Song et al. (2009) and Grünewälder et al. (2012); we therefore focus on the estimation procedure for the bridge function $h_0$.

To estimate the bridge function $h_0$, we employ the regression method developed in Mastouri et al. (2021). Recall $E[Y \mid c, x] = E[h_0(W, c) \mid c, x]$. We define the population risk function in the source domain as

$$R(h_0) = E_p[(Y - G_{h_0}(C, X))^2]; \qquad G_{h_0}(x, c) = \langle h_0, \mu^p_{W \mid c, x} \otimes \phi(c) \rangle. \tag{5.2}$$

The procedure to optimize (5.2) involves two stages. In the first stage, we estimate the conditional mean embedding $\mu^p_{W \mid c, x} = E_p[\phi(W) \mid c, x]$, which we will use as a plug-in estimator to estimate $h_0$ in the second step. Given $n_1$ i.i.d. samples $(X_1, W_1, C_1) = \{(x_{1,i}, w_{1,i}, c_{1,i})\}_{i=1}^{n_1}$ from the source distribution $p$ and a regularizing parameter $\lambda_1 > 0$, we denote by $K_{X_1} \in \mathbb{R}^{n_1 \times n_1}$, $K_{C_1} \in \mathbb{R}^{n_1 \times n_1}$ the Gram matrices and by $\Phi_{X_1} \in \mathcal{H}_X^{n_1}$, $\Phi_{C_1} \in \mathcal{H}_C^{n_1}$ the $n_1$-dimensional vectorized feature maps of $X_1$, $C_1$ respectively. Following the procedure developed in Song et al. (2009), the estimate of $\mu^p_{W \mid x, c}$ is

$$\hat{\mu}_{W \mid c, x} = \sum_{i=1}^{n_1} b_i(x, c)\, \phi(w_{1,i}); \qquad b(x, c) = (K_{X_1} \odot K_{C_1} + \lambda_1 n_1 I)^{-1} (\Phi_{X_1}(x) \odot \Phi_{C_1}(c)). \tag{5.3}$$
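Computationally, the first stage in (5.3) is a single regularized linear solve against the Hadamard product of Gram matrices. A minimal sketch (ours; the bandwidth and $\lambda_1$ are placeholder values, in practice chosen by cross-validation):

```python
import numpy as np

def rbf(a, b, s=1.0):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * s**2))

rng = np.random.default_rng(3)
n1 = 50
X1, C1, W1 = (rng.normal(size=(n1, d)) for d in (2, 1, 1))  # stage-1 samples
lam1 = 1e-3                                                  # placeholder

K_X1, K_C1 = rbf(X1, X1), rbf(C1, C1)
A = K_X1 * K_C1 + lam1 * n1 * np.eye(n1)    # (K_X1 . K_C1 + lam1*n1*I)

def b_weights(x, c):
    """Stage-1 weights b(x, c) of (5.3): mu_hat_{W|c,x} = sum_i b_i phi(w_i)."""
    return np.linalg.solve(A, rbf(X1, x)[:, 0] * rbf(C1, c)[:, 0])

# Evaluating the embedding at a proxy value w: mu_hat_{W|c,x}(w) = b(x,c) @ k(W1, w)
x, c, w = rng.normal(size=(1, 2)), rng.normal(size=(1, 1)), rng.normal(size=(1, 1))
mu_val = b_weights(x, c) @ rbf(W1, w)[:, 0]
```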
In the second stage, we replace $\mu^p_{W \mid x, c}$ with $\hat{\mu}_{W \mid x, c}$ in (5.2) and define the empirical risk. Consider $n_2$ i.i.d. samples $(X_2, Y_2, C_2) = \{(x_{2,i}, y_{2,i}, c_{2,i})\}_{i=1}^{n_2}$ from the source distribution and a regularization parameter $\lambda_2 > 0$; we want to minimize

$$\mathop{\mathrm{argmin}}_{h_0 \in \mathcal{H}_{WC}} \frac{1}{2n_2} \sum_{i=1}^{n_2} \left( y_{2,i} - \langle h_0, \phi(c_{2,i}) \otimes \hat{\mu}_{W \mid c_{2,i}, x_{2,i}} \rangle \right)^2 + \lambda_2 \|h_0\|^2_{\mathcal{H}_{WC}}. \tag{5.4}$$

We follow the same analysis procedure derived in Mastouri et al. (2021). The solution to (5.4) is shown in the following.

Proposition 5.1. Let $K_{W_1} \in \mathbb{R}^{n_1 \times n_1}$, $K_{C_2} \in \mathbb{R}^{n_2 \times n_2}$ be the Gram matrices of $W_1$ and $C_2$, respectively. Let $K_{X_{12}} \in \mathbb{R}^{n_1 \times n_2}$, $K_{C_{12}} \in \mathbb{R}^{n_1 \times n_2}$ be the cross Gram matrices of $(X_1, X_2)$ and $(C_1, C_2)$, respectively. For any $\lambda_2 > 0$, there exists a unique optimal solution to (5.4) of the form

$$\hat{h}_0 = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \alpha_{ij}\, \phi(w_{1,i}) \otimes \phi(c_{2,j}); \qquad \mathrm{vec}(\alpha) = (I \otimes \Gamma)(\lambda_2 n_2 I + \Sigma)^{-1} y_2,$$

where $\Sigma = (\Gamma^\top K_{W_1} \Gamma) \odot K_{C_2}$, $\Gamma = (K_{X_1} \odot K_{C_1} + \lambda_1 n_1 I)^{-1}(K_{X_{12}} \odot K_{C_{12}})$, and $y_2 = [y_{2,1}, \dots, y_{2,n_2}]^\top$.

Proposition 5.1 is an application of the Representer theorem (Schölkopf et al., 2001): the optimal estimate of the infinite dimensional operator is a finite rank operator spanned by the feature space of $W_1$ and $C_2$.

Finally, given the estimate $\hat{\mu}^q_{WC \mid x}$ and a new sample $x_{\text{new}}$, we can construct the empirical predictor of (5.1) as

$$\hat{y}_{\text{pred}} = \langle \hat{h}_0, \hat{\mu}^q_{WC \mid x_{\text{new}}} \rangle.$$

This completes the full adaptation procedure.

On classification tasks. For classification tasks, where the label is $Y \in \{1, \dots, k_Y\}$, we treat the multi-task regressor as a classifier. We encode $Y$ by a one-hot encoder and then regress on the encoded $\tilde{Y} \in \{0, 1\}^{k_Y}$. Each label $\ell$ has a corresponding bridge function $h_{0,\ell}$ for $\ell \in \{1, \dots, k_Y\}$. For $i = 1, \dots, n_2$, let the encoding of $y_{2,i}$ be $\tilde{y}_{2,i} = [\tilde{y}_{2,i,1}, \dots, \tilde{y}_{2,i,k_Y}]^\top \in \{0, 1\}^{k_Y}$. Then for each $\ell$, we can estimate $h_{0,\ell}$ by replacing $y_{2,i}$ in (5.4) with $\tilde{y}_{2,i,\ell} \in \{0, 1\}$. For each new sample $x_{\text{new}}$, the predicted score of label $\ell$ is $\hat{y}_{\text{pred},\ell} = \langle \hat{h}_{0,\ell}, \hat{\mu}^q_{WC \mid x_{\text{new}}} \rangle$, and we select the label that has the highest prediction score: $\mathop{\mathrm{argmax}}_\ell \hat{y}_{\text{pred},\ell}$.

5.2 Adaptation with Multiple Domains

In the multiple source domain setting, the estimation of $m_0$ follows similarly to that of $h_0$. Assuming that $m_0 \in \mathcal{H}_{WX}$, then (4.3) can be written as

$$E_r[Y \mid x] = E_p[\langle m_0, \mu_{W \mid r, x} \otimes \phi(x) \rangle \mid x],$$

for $r = 1, \dots, k_Z$. The task is to estimate $m_0$ from the source domains and then apply it to the target domain. We can define the population risk function as

$$R(m_0) = \sum_{r=1}^{k_Z} E_r[(Y - G_{m_0}(r, X))^2]; \qquad G_{m_0}(r, x) = \langle m_0, \mu_{W \mid r, x} \otimes \phi(x) \rangle. \tag{5.5}$$

We employ the two-stage estimation procedure as we did for estimating $h_0$: (i) we first estimate $\mu_{W \mid r, x}$ and then (ii) plug the estimate $\hat{\mu}_{W \mid r, x}$ in to estimate $m_0$. At the $r$-th domain, we observe the samples $\{(w_{r,i}, x_{r,i}, r)\}_{i=1}^{n_r}$. As with (5.3), we learn a conditional mean embedding $\hat{\mu}_{W \mid r, x} = \sum_{i=1}^{n_r} d_{r,i}(x)\, \phi(w_{r,i})$, where $d_r(x) = (K_{X_r} + \lambda_3 I)^{-1} \Phi_{X_r}(x) \in \mathbb{R}^{n_r}$ and $\lambda_3 > 0$, for $r = 1, \dots, k_Z$. In the second stage, given another batch of independent samples $\{(y_{r,i}, x_{r,i}, r)\}_{i=1}^{n_r}$ for $r = 1, \dots, k_Z$, we minimize

$$\frac{1}{2 \sum_{r=1}^{k_Z} n_r} \sum_{r=1}^{k_Z} \sum_{i=1}^{n_r} \left( y_{r,i} - \langle m_0, \phi(x_{r,i}) \otimes \hat{\mu}_{W \mid r, x_{r,i}} \rangle \right)^2 + \lambda_4 \|m_0\|^2_{\mathcal{H}_{WX}}. \tag{5.6}$$

Then $\hat{m}_0$ admits an analytical solution of similar form to $\hat{h}_0$ shown in Proposition 5.1 (see Appendix C.2 for details). Finally, with the estimated conditional mean embedding $\hat{\mu}^q_{W \mid x}$ and a new sample $x_{\text{new}}$ from the target test set, we have

$$\hat{y}_{\text{pred}} = \langle \hat{m}_0, \hat{\mu}^q_{W \mid x_{\text{new}}} \otimes \phi(x_{\text{new}}) \rangle.$$

We convert the regression task with $m_0$ to a classification task by learning $k_Y$ bridge functions, where each bridge function $m_{0,\ell}$ corresponds to label $\ell$.
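Putting the pieces of Section 5.1 together, the sketch below (ours; simulated data, bandwidths fixed at 1, untuned placeholder regularizers) runs the concept-bottleneck pipeline end to end: stage 1 builds $\Gamma$, stage 2 recovers $\alpha$ as in Proposition 5.1 (we use the equivalent per-column form $\alpha_{ij} = \Gamma_{ij} \beta_j$ with $\beta = (\Sigma + \lambda_2 n_2 I)^{-1} y_2$, which follows from the representer expansion), and prediction pairs $\hat{h}_0$ with one standard estimator of the target embedding $\hat{\mu}^q_{WC \mid x}$. The multi-domain estimator of Section 5.2 has the same structure, with the concept replaced by the domain index.

```python
import numpy as np

def rbf(a, b, s=1.0):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * s**2))

rng = np.random.default_rng(4)
n1 = n2 = 60
nt = 80
lam1 = lam2 = lam_t = 1e-2     # placeholder regularizers (use CV in practice)

# Toy latent-shift data: U shifts from source (p=0.2) to target (p=0.7).
def gen(n, p_u):
    u = rng.binomial(1, p_u, n)[:, None].astype(float)
    w = u + 0.3 * rng.normal(size=(n, 1))              # proxy of U
    x = u + rng.normal(size=(n, 1))                    # covariate
    c = x + u + 0.3 * rng.normal(size=(n, 1))          # concept mediates X -> Y
    y = (c + u + 0.3 * rng.normal(size=(n, 1)))[:, 0]  # outcome
    return w, x, c, y

W1, X1, C1, _ = gen(n1, 0.2)    # stage-1 source batch
W2, X2, C2, y2 = gen(n2, 0.2)   # stage-2 source batch
Wt, Xt, Ct, _ = gen(nt, 0.7)    # unlabeled target batch

# Stage 1 + Proposition 5.1: Gamma, Sigma, then alpha via beta.
A = rbf(X1, X1) * rbf(C1, C1) + lam1 * n1 * np.eye(n1)
Gamma = np.linalg.solve(A, rbf(X1, X2) * rbf(C1, C2))          # n1 x n2
Sigma = (Gamma.T @ rbf(W1, W1) @ Gamma) * rbf(C2, C2)          # n2 x n2
beta = np.linalg.solve(Sigma + lam2 * n2 * np.eye(n2), y2)
alpha = Gamma * beta[None, :]                                  # alpha_ij = Gamma_ij beta_j

# Target-side conditional mean embedding mu_hat^q_{WC|x} from (Wt, Ct, Xt).
K_Xt = rbf(Xt, Xt)

def predict(x_new):
    gamma = np.linalg.solve(K_Xt + lam_t * nt * np.eye(nt), rbf(Xt, x_new)[:, 0])
    B = rbf(Wt, W1) @ alpha                                    # nt x n2
    return gamma @ np.sum(B * rbf(Ct, C2), axis=1)             # <h0_hat, mu_hat>

print(predict(np.array([[0.5]])))
```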
6 Experiments

We verify our theory with both simulated and real data, demonstrating robustness to latent shifts and transferability of the bridge functions.

For the setting with concept variables present, we compare our method with the following baselines: Empirical Risk Minimization (ERM), Covariate shift weighting (COVAR) (Shimodaira, 2000), Label shift weighting (LABEL) (Buck et al., 1966), and the spectral (LSA-S) and Wasserstein Autoencoder (LSA-WAE) latent shift adaptation approaches (Alabdulmohsin et al., 2023). For the multi-domain setting, we compare our method with the baselines: Simple Adaptation (SA) (Mansour et al., 2008), Weighted Combination of Source Classifiers (WCSC) (Zhang et al., 2015), and Marginal Kernel (MK) (Blanchard et al., 2011). We also compare with multi-domain generalization baselines (Muandet et al., 2013): Domain Adversarial Neural Networks (DANN) (Ganin et al., 2016) and Maximum Mean Discrepancy (MMD) (Gretton et al., 2012). Additionally, we modify the ERM method for the multi-domain setting by concatenating the source samples to learn one ERM model (Cat-ERM), or by taking the average result of each source domain's ERM model (Avg-ERM). The ORACLE model is trained on target-distribution samples and evaluated on held-out target-distribution samples. The tuning parameters for all models, including the proposed model, are selected using five-fold cross-validation. Details regarding the setups are in Appendix D.
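As a reference for this model-selection step, a generic five-fold cross-validation loop for choosing a regularization parameter might look as follows (our sketch; `fit` and `predict` are hypothetical user-supplied closures, not functions from the paper's codebase):

```python
import numpy as np

def five_fold_best_lam(fit, predict, X, y, lams, seed=0):
    """Score each candidate regularizer lam by 5-fold CV MSE; return the best."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 5)
    scores = []
    for lam in lams:
        errs = []
        for k in range(5):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(5) if j != k])
            model = fit(X[tr], y[tr], lam)                 # train on 4 folds
            errs.append(np.mean((predict(model, X[te]) - y[te]) ** 2))
        scores.append(np.mean(errs))                       # average held-out MSE
    return lams[int(np.argmin(scores))]
```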
Classification task. The task designed in Alabdulmohsin et al. (2023) is a binary classification problem with $Y \in \{0, 1\}$, where the latent variable $U \in \{0, 1\}$ is a Bernoulli random variable. Additionally, $X \in \mathbb{R}^2$ and $W \in \mathbb{R}$ are continuous random variables and $C \in \mathbb{R}^3$ is a discrete variable. We have one source domain with $P(U = 1) = 0.1$. We evaluate the models on the target distribution with $Q(U)$ shifting over $Q(U = 1) \in \{0.1, \dots, 0.9\}$. The goal of this task is to investigate whether the adaptation method is robust to arbitrary shifts of $U$.

The ORACLE and ERM models are implemented as MultiLayer Perceptrons (MLP). The kernel function used in the proposed method is the Gaussian kernel.

We compare the proposed method with the LSA-S and Wasserstein Autoencoder (LSA-WAE) adaptation approaches developed in Alabdulmohsin et al. (2023). While all three methods are designed to adjust for shift under the same graph in Figure 1c, our method takes additional $W, C, X$ as training samples in the target domain, while LSA-S and LSA-WAE only take $X$. For all three methods, only $X$ is observed in the test data. While the identification theory developed in Alabdulmohsin et al. (2023) does not require $W, C$ in the target domain, we are aware that in practice, having more information in the target domain may improve estimation. To make the methods more directly comparable, we design an additional step to incorporate $W$ from the target in the LSA-S algorithm. We describe this procedure in more detail in Appendix D.1.

Results are shown in Figure 2a. The proposed method is more robust to the shift compared to the baselines and is close to the ORACLE model. With observed $W$ in the target domain, LSA-S does not improve its performance compared to LSA-S without $W$. We also compare results under different noise levels and observe similar trends, as discussed in Appendix D.

dSprites dataset regression task. We test the proposed procedure on the dSprites dataset (Matthey et al., 2017), an image dataset described by five latent parameters (shape, scale, rotation, posX, and posY). Motivated by Matthey et al. (2017)'s experiments, we design a regression task where the dSprites images (64 x 64 = 4096-dimensional) are $X \in \mathbb{R}^{64 \times 64}$ and subject to a nonlinear confounder $U \in [0, 2\pi]$, which is a rotation of the image. $W \in \mathbb{R}$ and $C \in \mathbb{R}$ are continuous random variables. For this experiment, we have 7000 training samples and 3000 test samples. Further details about the procedure are in Appendix D.

In the results in Figure 2b, we vary $a$, which controls the region of the source distribution on which the target distribution concentrates. We design the experiment such that increasing $a$ shifts the target distribution to increasingly low-mass regions of the source distribution. We compute the mean squared error of each method on test examples from the target distribution. We find that, while the baseline methods degrade as the target distribution shift increases, the proposed method adapts and maintains low error, nearly matching the error achieved by the oracle, which is trained on target distribution samples.

[Figure 2: Adaptation results with concept and proxy. (a) Classification task on simulated data: AUROC and accuracy versus $Q(U = 1)$. (b) Regression on the dSprites dataset: densities $p(U)$ and $q(U) \sim \mathrm{Uniform}(a, 2\pi)$ (left) and mean squared error versus $a$ (right). Shown is the average evaluation metric on held-out target distribution samples across 10 independent replicates of the data. The proposed method is robust to the latent shift compared to the baselines in both cases. (a) We set $P(U = 1) = 0.1$. Both the AUROC and accuracy remain nearly constant across varying degrees of shift, while the performance of the other baselines drops as $Q(U = 1)$ moves to 0.9. (b) The left panel denotes the density function of $U$; the overlapping area of the two distributions shrinks as $a$ moves rightward. The right panel shows that our method is robust even when the overlapping area between the two distributions is small.]

6.1 Multi-Domain Adaptation

In the multi-domain setting, we use the same classification dataset provided in Alabdulmohsin et al. (2023) as in Appendix D.6. We assume that $C$ is not observed in any domain, and generate multiple datasets drawn with different distributions on $U$.

Classification task. We construct three different tasks with different settings of $P(U)$ over the source and target domains. For each task, we construct three source domains and one target domain, drawing 3200 random training samples for each source domain and 9600 random training samples for the target domain. The sets of source domains for Tasks 1-3 have different combinations of distributions on $U$, documented in Appendix D.3.

Table 1: Multi-domain adaptation result. The values are the average AUROC over 10 independent replicates of the data. Each task has three source domains with different $P_r(U)$ and one target domain. The proposed method outperforms the other baselines and is close to the ORACLE in Task 2.

Task    ORACLE           Cat-ERM          Avg-ERM          SA               MK               WCSC             DANN             MMD              Proposed
Task 1  0.9425 ± 0.0039  0.8030 ± 0.0155  0.7916 ± 0.0148  0.7918 ± 0.0148  0.5848 ± 0.0593  0.5221 ± 0.0299  0.8039 ± 0.0229  0.8055 ± 0.0248  0.8848 ± 0.0120
Task 2  0.9431 ± 0.0061  0.8942 ± 0.0084  0.8953 ± 0.0079  0.8953 ± 0.0079  0.8054 ± 0.0204  0.8144 ± 0.0474  0.9158 ± 0.0125  0.9149 ± 0.0135  0.9318 ± 0.0063
Task 3  0.8876 ± 0.0085  0.8483 ± 0.0134  0.8427 ± 0.0130  0.8408 ± 0.0132  0.8002 ± 0.0311  0.7428 ± 0.0311  0.8480 ± 0.0166  0.8470 ± 0.0181  0.8569 ± 0.0095
The backbone models for ORACLE, Cat-ERM, Avg-ERM, and SA (Mansour et al., 2008) are simple MLPs; MK (Blanchard et al., 2011) is a weighted kernel support vector machine; WCSC (Zhang et al., 2015) is a re-weighted kernel density estimator. SA (Mansour et al., 2008) assumes that $Q(X)$ is a convex combination of $P_r(X)$ for $r = 1, \dots, k_Z$; WCSC (Zhang et al., 2015) assumes that $Q(X \mid Y)$ is a linear mixture of $P_r(X \mid Y)$ for $r = 1, \dots, k_Z$; MK (Blanchard et al., 2011) treats each domain as an i.i.d. realization from a general distribution over domains.

The results are shown in Table 1. Overall, we find our approach performs better than ERM and the baseline multi-domain adaptation methods. All methods perform better in the setting of Task 2 than for Task 1, informally demonstrating the effect of the closeness of the source domains to the target domain. For Task 3, while our proposed approach performs best, ERM also performs well, and substantially better than the domain adaptation baselines.

Regression task. We consider two regression tasks, where $U$ is either a Bernoulli or a Beta random variable. We present the results in Appendix D.

6.2 Concept and Multi-Domain Adaptation with MIMIC-CXR

We conduct a small-scale experiment using a sample of chest X-ray data extracted from the MIMIC-CXR dataset (Johnson et al., 2019). We briefly describe the experimental design and results here, and include a complete description in Appendix D.7. We consider classification of the absence of a radiological finding from low-dimensional embeddings of the X-rays (Sellergren et al., 2022), using the absence of a radiological finding in the radiology report as the target of prediction. This corresponds to the "No Finding" label defined by Irvin et al. (2019).

We consider distribution shifts similar to settings in Makar et al. (2022), where patient sex is considered as a possible "shortcut" in the classification of the absence of a radiological finding. We impose distribution shift through structured resampling of the data, where $P(U = 1) = P(Y = 1 \mid \mathrm{Sex} = \mathrm{Female}) = P(Y = 0 \mid \mathrm{Sex} = \mathrm{Male})$ and $P(\mathrm{Sex} = \mathrm{Female}) = P(\mathrm{Sex} = \mathrm{Male}) = 0.5$ is held constant. We perform both concept adaptation and multi-domain adaptation experiments with the MIMIC-CXR data. For the concept adaptation experiment, we consider the concept variable $C$ to be the embedding of the radiology report associated with the chest X-ray.
We experiment with the use of patient age as a potential proxy $W$ for $U$, due to a hypothesized correlation between the presence of radiological findings and patient age.

The results are summarized in Figure 3. For both experiments, we find that the performance of baseline models fit using only information from the source domain(s) degrades under distribution shift. In the concept adaptation experiment, adaptation is relatively successful, as much of the performance of comparator models fit using target domain data is recovered by the adaptation procedure.

However, we find that the multi-domain adaptation procedure is not successful. In this case, we find that while the multi-domain adaptation procedure marginally outperforms a model fit using the concatenated source domain data under distribution shift, it recovers substantially less of the performance of the target domain model than the concept adaptation procedure does. Furthermore, the adapted model does not outperform the kernel estimators that only leverage information from the source domains. The lack of success in this setting could potentially be explained by an insufficient number or diversity of domains relative to the level of noise induced by sampling variability and the limited sample size.

[Figure 3: Concept and multi-domain adaptation with MIMIC-CXR. Shown are the mean ± SD AUROC of concept (left) and multi-domain (right) adaptation for classification of "No finding" from embeddings of chest X-rays over five replicates of a sampling procedure that introduces a shift in the prevalence of "No finding" within patient sex subgroups, where radiology report embeddings serve as concept variables $C$ and patient age serves as the proxy $W$. In the concept adaptation experiment, the source domain corresponds to $P(U = 1) = P(Y = 1 \mid \mathrm{Sex} = \mathrm{Female}) = P(Y = 0 \mid \mathrm{Sex} = \mathrm{Male}) = 0.1$. In the multi-domain adaptation experiment, we consider two source domains $P(U = 1) \in \{0.1, 0.2\}$.]

7 Discussion

We propose a strategy for adaptation under distribution shift in a latent variable using a bridge function approach (Miao et al., 2018; Tchetgen et al., 2020). This approach allows for identification of the optimal predictor in the target domain without identifying the distribution of the latent variable and without distributional assumptions on the form of the latent. We require that proxies of the latent variable are present and that (i) mediating concepts are available or (ii) data from multiple source domains are present.

We argue our approach is useful for two reasons. First, the latent distribution in general is only identifiable under strict distributional assumptions (Locatello et al., 2019). Second, recovery of the latent variable may be challenging in practice even if it is identifiable (Rissanen and Marttinen, 2021). For example, because most latent variable estimation methods are designed to model the data generating process (Kingma and Welling, 2013), one might allocate substantial modeling capacity to variability in the data and the latent variable that are irrelevant to modeling the shift in the conditional distribution of $Y \mid X$. By contrast, we model only the components of the observable variables relevant to the adaptation.

Acknowledgments. We thank Zhu Li and Dimitri Meunier for helpful discussions. AG was partly supported by the Gatsby Charitable Foundation. OS was partly supported by the UIUC Beckman Institute Graduate Research Fellowship, NSF-NRT 1735252. KT was partly supported by the NSF Graduate Research Fellowship Program. SK was partly supported by NSF III 2046795, IIS 1909577, CCF 1934986, NIH 1R01MH116226-01A, NIFA award 2020-67021-32799, the Alfred P. Sloan Foundation, and Google Inc. This study was funded by Google LLC and/or a subsidiary thereof ('Google').

References

I. Alabdulmohsin, N. Chiou, A. D'Amour, A. Gretton, S. Koyejo, M. J. Kusner, S. R. Pfohl, O. Salaudeen, J. Schrouff, and K. Tsai. Adapting to latent subgroup shifts via concepts and proxies. In International Conference on Artificial Intelligence and Statistics, pages 9637-9661. PMLR, 2023.
M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79:151-175, 2010.
G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. Advances in Neural Information Processing Systems, 24, 2011.
A. Buck, J. Gart, et al. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 83(3):593-602, 1966.
Y. Cui, H. Pu, X. Shi, W. Miao, and E. Tchetgen Tchetgen. Semiparametric proximal causal inference. Journal of the American Statistical Association, pages 1-12, 2023.
B. Deaner. Proxy controls and panel data. arXiv preprint arXiv:1810.00283, 2018.
X. D'Haultfoeuille. On the completeness condition in nonparametric instrumental problems. Econometric Theory, 27(3):460-471, 2011.
Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.
Y. Goyal, A. Feder, U. Shalit, and B. Kim. Explaining classifiers with causal concept effect (CaCE). arXiv preprint arXiv:1907.07165, 2019.
A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012.
S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional mean embeddings as regressors - supplementary. arXiv preprint arXiv:1205.4656, 2012.
M. A. Hernán and J. M. Robins. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7):578-586, 2006.
J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 590-597, 2019.
A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338-5348. PMLR, 2020.
P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637-5664. PMLR, 2021.
M. Kuroki and J. Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423-437, 2014.
Z. Lipton, Y.-X. Wang, and A. Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122-3130. PMLR, 2018.
F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114-4124. PMLR, 2019.
S. Magliacane, T. Van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J. M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. Advances in Neural Information Processing Systems, 31, 2018.
M. Makar, B. Packer, D. Moldovan, D. Blalock, Y. Halpern, and A. D'Amour. Causally motivated shortcut removal using auxiliary labels. In International Conference on Artificial Intelligence and Statistics, pages 739-766. PMLR, 2022.
A. Malinin, N. Band, G. Chesnokov, Y. Gal, M. J. Gales, A. Noskov, A. Ploskonosov, L. Prokhorenkova, I. Provilkov, V. Raina, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. arXiv preprint arXiv:2107.07455, 2021.
Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. Advances in Neural Information Processing Systems, 21, 2008.
A. Mastouri, Y. Zhu, L. Gultchin, A. Korba, R. Silva, M. Kusner, A. Gretton, and K. Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. In International Conference on Machine Learning, pages 7512-7523. PMLR, 2021.
L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
W. Miao, Z. Geng, and E. J. Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987-993, 2018.
W. Miao, W. Hu, E. L. Ogburn, and X.-H. Zhou. Identifying effects of multiple treatments in the presence of unmeasured confounding. Journal of the American Statistical Association, pages 1-15, 2022.
K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10-18. PMLR, 2013.
M. Oberst, N. Thams, J. Peters, and D. Sontag. Regularizing towards causal invariance: Linear models with proxies. In International Conference on Machine Learning, pages 8260-8270. PMLR, 2021.
S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199-210, 2010.
J. Pearl. Causality. Cambridge University Press, 2nd edition, 2009.
J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals, 2015.
S. Rissanen and P. Marttinen. A critical look at the consistency of causal estimation with deep latent variable models. Advances in Neural Information Processing Systems, 34:4207-4217, 2021.
D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(2):215-246, 2021.
B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416-426. Springer, 2001.
B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
J. Schrouff, N. Harris, S. Koyejo, I. M. Alabdulmohsin, E. Schnider, K. Opsahl-Ong, A. Brown, S. Roy, D. Mincu, C. Chen, et al. Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. Advances in Neural Information Processing Systems, 35:19304-19318, 2022.
A. B. Sellergren, C. Chen, Z. Nabulsi, Y. Li, A. Maschinot, A. Sarna, J. Huang, C. Lau, S. R. Kalidindi, M. Etemadi, et al. Simplified transfer learning for chest radiography models using less data. Radiology, 305(2):454-465, 2022.
Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624, 2021.
H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
R. Singh, M. Sahani, and A. Gretton. Kernel instrumental variable regression. Advances in Neural Information Processing Systems, 32, 2019.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13-31. Springer, 2007.
L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961-968, 2009.
P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.
A. Subbaswamy, P. Schulam, and S. Saria. Preventing failures due to dataset shift: Learning predictive models that transport. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3118-3127. PMLR, 2019.
E. J. T. Tchetgen, A. Ying, Y. Cui, X. Shi, and W. Miao. An introduction to proximal causal learning. arXiv preprint arXiv:2009.10982, 2020.
V. Veitch, A. D'Amour, S. Yadlowsky, and J. Eisenstein. Counterfactual invariance to spurious correlations in text classification. Advances in Neural Information Processing Systems, 34:16196-16208, 2021.
J. Wang, C. Lan, C. Liu, Y. Ouyang, T. Qin, W. Lu, Y. Chen, W. Zeng, and P. Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 2022.
L. Xu and A. Gretton. Kernel single proxy control for deterministic confounding. arXiv preprint arXiv:2308.04585, 2023.
K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. C. Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Supplementary Materials

A Identification of the Distribution

In this section, we demonstrate the existence of the bridge functions $h_0$ and $m_0$ under certain regularity conditions. We first discuss the discrete case and then generalize to the continuous case.

A.1 The Discrete Case of the Bridge Function $h_0$

The idea of the bridge function $h_0$ may seem abstract in the continuous setting. When every variable is discrete, however, the construction of the bridge function reduces to solving a series of matrix problems. This idea originates from Miao et al. (2018), and we apply the technique to show the construction of the bridge function when every variable $(W, U, C, X, Y)$ is discrete.

Let $\mathbf{P}(W \mid u) = [P(w_1 \mid u), \dots, P(w_{k_W} \mid u)]^\top \in \mathbb{R}^{k_W}$ be a column vector and $\mathbf{P}(W \mid U) = [\mathbf{P}(W \mid u_1), \dots, \mathbf{P}(W \mid u_{k_U})] \in \mathbb{R}^{k_W \times k_U}$ a matrix. We define similarly $\mathbf{P}(U \mid x, c) = [P(u_1 \mid c, x), \dots, P(u_{k_U} \mid c, x)]^\top \in \mathbb{R}^{k_U}$ and $\mathbf{P}(U \mid X, c) = [\mathbf{P}(U \mid x_1, c), \dots, \mathbf{P}(U \mid x_{k_X}, c)] \in \mathbb{R}^{k_U \times k_X}$ for $c \in \mathcal{C}$. We define $\mathbf{P}(Y \mid X, c) = [\mathbf{P}(Y \mid x_1, c), \dots, \mathbf{P}(Y \mid x_{k_X}, c)] \in \mathbb{R}^{k_Y \times k_X}$, $\mathbf{P}(Y \mid U, c) = [\mathbf{P}(Y \mid u_1, c), \dots, \mathbf{P}(Y \mid u_{k_U}, c)] \in \mathbb{R}^{k_Y \times k_U}$, and $\mathbf{P}(W \mid X, c) = [\mathbf{P}(W \mid x_1, c), \dots, \mathbf{P}(W \mid x_{k_X}, c)] \in \mathbb{R}^{k_W \times k_X}$ analogously.

As an alternative to finding an $h_0(w, c)$ such that

$$E[Y \mid c, x] = \sum_{i=1}^{k_W} h_0(w_i, c)\, p(w_i \mid c, x),$$

the proxy problem is converted to finding an $\tilde{H}_0(Y, W, c)$ such that

$$\mathbf{P}(Y \mid X, c) = \tilde{H}_0(Y, W, c)\, \mathbf{P}(W \mid X, c), \quad c \in \mathcal{C}.$$

First, under the condition that $W \perp\!\!\!\perp \{X, C\} \mid U$, we can write

$$\mathbf{P}(W \mid X, c) = \mathbf{P}(W \mid U)\, \mathbf{P}(U \mid X, c). \tag{A.1}$$

Similarly, under the condition that $Y \perp\!\!\!\perp X \mid \{U, C\}$, we have

$$\mathbf{P}(Y \mid X, c) = \mathbf{P}(Y \mid U, c)\, \mathbf{P}(U \mid X, c). \tag{A.2}$$

We introduce the following assumption:

Assumption 7. Columns of $\mathbf{P}(W \mid U)$ are linearly independent. For every $c \in \mathcal{C}$, the columns of $\mathbf{P}(W \mid X, c)$ satisfy $\mathbf{P}(W \mid x, c) \in \mathcal{N}(\mathbf{P}(W \mid U)^*)^\perp$ for all $x \in \mathcal{X}$.

Assumption 7 is the requirement for the least-squares problem to have a unique solution. Hence, by Assumption 7, we have

$$\mathbf{P}(U \mid X, c) = \mathbf{P}(W \mid U)^\dagger\, \mathbf{P}(W \mid X, c),$$

where $\mathbf{P}(W \mid U)^\dagger$ is the generalized inverse of $\mathbf{P}(W \mid U)$. Plugging the above equation into (A.2), we see that

$$\mathbf{P}(Y \mid X, c) = \underbrace{\mathbf{P}(Y \mid U, c)\, \mathbf{P}(W \mid U)^\dagger}_{\tilde{H}(Y, W, c)}\, \mathbf{P}(W \mid X, c).$$
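As a numerical companion to this construction (ours; toy conditional probability tables), the snippet builds $\mathbf{P}(W \mid X, c)$ and $\mathbf{P}(Y \mid X, c)$ from shared mechanisms, forms $\tilde{H} = \mathbf{P}(Y \mid U, c)\, \mathbf{P}(W \mid U)^\dagger$ as above, and checks that the bridge equation holds without using $\mathbf{P}(U \mid X, c)$ at solve time.

```python
import numpy as np

rng = np.random.default_rng(5)
k_u, k_w, k_x, k_y = 3, 4, 5, 2          # k_w >= k_u, per Assumption 7

# Toy conditional probability tables at a fixed concept value c:
P_W_U = rng.dirichlet(np.ones(k_w), size=k_u).T    # P(W|U),   k_w x k_u
P_Y_Uc = rng.dirichlet(np.ones(k_y), size=k_u).T   # P(Y|U,c), k_y x k_u
P_U_Xc = rng.dirichlet(np.ones(k_u), size=k_x).T   # P(U|X,c), k_u x k_x

# Matrices of observables implied by (A.1) and (A.2):
P_W_Xc = P_W_U @ P_U_Xc                            # P(W|X,c)
P_Y_Xc = P_Y_Uc @ P_U_Xc                           # P(Y|X,c)

# Discrete bridge H = P(Y|U,c) P(W|U)^+; it satisfies the bridge equation
# P(Y|X,c) = H P(W|X,c) with no reference to P(U|X,c) when applied.
H = P_Y_Uc @ np.linalg.pinv(P_W_U)
print(np.allclose(H @ P_W_Xc, P_Y_Xc))             # True
```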
A.2 Existence of the Bridge Function $h_0$

The sufficient conditions for the existence of $h_0$ are originally discussed in Miao et al. (2018); we adapt them to our setting and provide a brief review in this section. We assume the following completeness assumption and regularity conditions. This assumption is equivalent to Condition (iii) in Miao et al. (2018).

Assumption 8. For any mean squared integrable function $g$ and for $c \in \mathcal{C}$, $E[g(X) \mid W, c] = 0$ almost surely if and only if $g(X) = 0$ almost surely.

Let $f$ be the distribution of either $p$ or $q$. We consider $K_c : L^2(W \mid c) \to L^2(X \mid c)$ as the conditional expectation operator associated with the kernel function

$$k(w, x, c) = \frac{f(w, x \mid c)}{f(w \mid c)\, f(x \mid c)}.$$

Then it follows that $E[Y \mid c, x] = K_c h_0$:

$$E[Y \mid c, x] = \int_{\mathcal{W}} h_0(w, c)\, f(w \mid x, c)\, dw = \int k(w, x, c)\, h_0(w, c)\, f(w \mid c)\, dw = K_c h_0.$$

To find the solution $h_0$, we assume the following.

Assumption 9. For any $c \in \mathcal{C}$, $\int_{\mathcal{W}} \int_{\mathcal{X}} f(w \mid c, x)\, f(x \mid c, w)\, dw\, dx < \infty$.

This is a sufficient condition to ensure that $K_c$ is a compact operator (Carrasco et al., 2007, Example 2.3). Hence, by the definition of a compact operator, there exists a singular system $\{\lambda_{c,i}, \phi_{c,i}, \psi_{c,i}\}_{i \in \mathbb{N}}$ of $K_c$ for every $c \in \mathcal{C}$.

Assumption 10. For fixed $c \in \mathcal{C}$: 1. $E[Y \mid X, c] \in L^2(X \mid c)$; 2. $\sum_{i \in \mathbb{N}} \lambda_{c,i}^{-2} |\langle E[Y \mid X, c], \psi_{c,i} \rangle|^2 < \infty$.

The above two assumptions are restatements of Conditions (v)-(vii) in Miao et al. (2018). We adapt the results from Proposition 1 in Miao et al. (2018) to the graph in Figure 1c, which replaces the node $X$ by $C$ and the node $Z$ by $X$.

Proposition A.1 (Existence of $h_0$, adapted from Proposition 1 in Miao et al. (2018)). Under Assumptions 2 and 8-10, the solution to (4.1) exists.

Proof. The proof follows directly from Picard's theorem. Assumption 9 implies that $K_c$ is a compact operator. Assumption 8 implies that $\mathcal{N}(K_c^*)^\perp = L^2(X \mid c)$. Therefore, under the first statement in Assumption 10, we have $E[Y \mid X, c] \in \mathcal{N}(K_c^*)^\perp$. Along with the second statement in Assumption 10, we can apply Lemma A.3.

A.3 Existence of the Bridge Function $m_0$

The proof of the existence of $m_0^p$ is similar to the analysis of $h_0$. Let $K_x : L^2(W \mid x) \to L^2(Z \mid x)$ be the integral operator associated with the kernel function $k(w, x, z) = p(w, z \mid x) / (p(w \mid x)\, p(z \mid x))$. Then, we can write

$$E_p[Y \mid x, z] = \int k(w, x, z)\, p(w \mid x)\, m_0(w, x)\, dw = K_x m_0.$$

Proposition A.2 (Existence of $m_0$, Proposition 1 in Miao et al. (2018)). Assume that: 1. for any mean squared integrable function $g$ and for $x \in \mathcal{X}$, $E[g(Z) \mid W, x] = 0$ almost surely if and only if $g(Z) = 0$ almost surely; 2. for any $x \in \mathcal{X}$, $\int_{\mathcal{W}} \int_{\mathcal{Z}} f(w \mid x, z)\, f(z \mid x, w)\, dw\, dz < \infty$; 3. for any $x \in \mathcal{X}$, $E[Y \mid Z, x] \in L^2(Z \mid x)$; 4. for any $x \in \mathcal{X}$, $\sum_{i \in \mathbb{N}} \lambda_{x,i}^{-2} |\langle E[Y \mid Z, x], \psi_{x,i} \rangle|^2 < \infty$, where $(\lambda_{x,i}, \phi_{x,i}, \psi_{x,i})$ is the singular system of $K_x$. Then the solution $m_0^p$ exists.
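These existence results rest on Picard's theorem, stated as Lemma A.3 in Section A.4 below. A small discretized illustration (ours; the matrix is an arbitrary surrogate for a compact operator): the solution of the first-kind equation $Kh = \varphi$ is $h = \sum_j \lambda_j^{-1} \langle \varphi, \psi_j \rangle \phi_j$, and truncating tiny singular values is the numerical analogue of requiring the Picard coefficient series to converge.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
# Matrix surrogate of a compact operator: rapidly decaying singular values.
K = rng.normal(size=(n, n)) @ np.diag(0.5 ** np.arange(n))
h_true = rng.normal(size=n)
phi = K @ h_true                       # right-hand side lies in range(K)

U, s, Vt = np.linalg.svd(K)
# Picard-style solution h = sum_j s_j^{-1} <phi, u_j> v_j. Coefficients with
# tiny s_j are discarded (truncated SVD), which regularizes the inversion.
keep = s > 1e-8
h = Vt[keep].T @ ((U[:, keep].T @ phi) / s[keep])
print(keep.sum(), np.allclose(K @ h, phi, atol=1e-6))
```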
The proof of Proposition A.2 is similar to the proof of Proposition 1 in Miao et al. (2018), where we replace $P(y \mid z, x)$ in Proposition 1 of Miao et al. (2018) with $E[Y \mid Z, x]$. The proof for the existence of $m_0^q$ also follows similarly to Proposition A.2.

A.4 Auxiliary Lemma

We introduce Picard's theorem as follows.

Lemma A.3 (Picard's Theorem). Let $K : H_1 \to H_2$ be a compact operator with singular system $\{\lambda_j, \phi_j, \psi_j\}_{j=1}^{\infty}$ and let $\varphi$ be a given function in $H_2$. Then the equation of the first kind $Kh = \varphi$ has solutions if and only if: 1. $\varphi \in \mathcal{N}(K^*)^\perp$, where $\mathcal{N}(K^*) = \{h : K^* h = 0\}$ is the null space of the adjoint operator $K^*$; 2. $\sum_{j=1}^{+\infty} \lambda_j^{-2} |\langle \varphi, \psi_j \rangle|^2 < \infty$.

B Transferring Bridge Functions

In this section, we discuss the identifiability results.

B.1 Proof of Theorem 4.1

For $f \in \{p, q\}$, recall that

$$E_f[Y \mid c, x] = \int_{\mathcal{W}} h_0^f(w, c)\, f(w \mid c, x)\, dw = \int_{\mathcal{W}} \int_{\mathcal{U}} h_0^f(w, c)\, f(w \mid c, u)\, f(u \mid c, x)\, du\, dw = \int_{\mathcal{W}} \int_{\mathcal{U}} h_0^f(w, c)\, f(w \mid u)\, f(u \mid c, x)\, du\, dw \quad (W \perp\!\!\!\perp C \mid U).$$

Similarly, we can write

$$E_f[Y \mid c, x] = \int_{\mathcal{U}} E_f[Y \mid c, u]\, f(u \mid c, x)\, du \quad (Y \perp\!\!\!\perp X \mid \{U, C\}).$$

Under Assumption 4, we have

$$E_f[Y \mid c, U] = \int_{\mathcal{W}} h_0^f(w, c)\, f(w \mid U)\, dw \tag{B.1}$$

almost surely with respect to $F(U)$, $F \in \{P, Q\}$.

Suppose that $u \in \mathcal{U}$ is such that $Q(u) > 0$. Then, by Assumption 5, we must have $P(u) > 0$. Hence, conditioned on the selected $u$ and $c$ and under Assumption 1, we have

$$E_p[Y \mid c, u] = \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, dw; \qquad E_q[Y \mid c, u] = \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, dw \quad (p(w \mid u) = q(w \mid u),\ \forall c \in \mathcal{C},\ w \in \mathcal{W},\ u \in \mathcal{U}).$$

We can then write

$$E_p[Y \mid c, u] - E_q[Y \mid c, u] = \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, dw - \int_{\mathcal{W}} h_0^q(w, c)\, q(w \mid u)\, dw.$$

Note that, by Assumption 1, we have $E_p[Y \mid c, u] = E_q[Y \mid c, u]$; hence the left hand side of the above equation is 0, and we can conclude that

$$\int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid U)\, dw = \int_{\mathcal{W}} h_0^q(w, c)\, q(w \mid U)\, dw$$

$Q(U)$ almost surely. This completes the first part of the proof.

To show the second part of the theorem, note that we can write $E_q[Y \mid x] = E_q[E_q[Y \mid C, x] \mid x] = E_q[E_q[h_0^q(W, C) \mid C, x] \mid x]$. Since $p(w \mid u) = q(w \mid u)$ by Assumption 1, we can factorize the above equation as

$$E_q[Y \mid x] = \int_{\mathcal{C}} \left[ \int_{\mathcal{U}} \left\{ \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, dw \right\} q(u \mid c, x)\, du \right] q(c \mid x)\, dc.$$

Let the support of $U$ conditioned on $c, x$ be $\mathcal{U}^1_{c,x} = \{u : Q(u \mid c, x) > 0\}$ and let $\mathcal{U}^0_{c,x} = \{u : Q(u \mid c, x) = 0\}$. Hence, we have $\mathcal{U} = \mathcal{U}^0_{c,x} \cup \mathcal{U}^1_{c,x}$ and $\mathcal{U}^0_{c,x} \cap \mathcal{U}^1_{c,x} = \emptyset$, such that $\int_{\mathcal{U}^0_{c,x}} q(u \mid c, x)\, du = 0$ and $\int_{\mathcal{U}^1_{c,x}} q(u \mid c, x)\, du = 1$. Then, we can further decompose the above as

$$E_q[Y \mid x] = \int_{\mathcal{C}} \left[ \int_{\mathcal{U}^0_{c,x}} \left\{ \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, dw \right\} q(u \mid c, x)\, du \right] q(c \mid x)\, dc + \int_{\mathcal{C}} \left[ \int_{\mathcal{U}^1_{c,x}} \left\{ \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, dw \right\} q(u \mid c, x)\, du \right] q(c \mid x)\, dc = \int_{\mathcal{C}} \left[ \int_{\mathcal{U}^1_{c,x}} \left\{ \int_{\mathcal{W}} h_0^q(w, c)\, p(w \mid u)\, dw \right\} q(u \mid c, x)\, du \right] q(c \mid x)\, dc.$$

Given $c, x$, since the support of $Q(U \mid c, x)$ is included in the support of $Q(U)$, if $u \in \mathcal{U}^1_{c,x}$ we must have $Q(u) > 0$ and hence $P(u) > 0$ by Assumption 5, and we can swap $h_0^q$ with $h_0^p$ by the first part of the proof:

$$= \int_{\mathcal{C}} \left[ \int_{\mathcal{U}^1_{c,x}} \left\{ \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, dw \right\} q(u \mid c, x)\, du \right] q(c \mid x)\, dc.$$

Since $\int_{\mathcal{U}^0_{c,x}} \{\int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, dw\}\, q(u \mid c, x)\, du = 0$, we can add it to the above term and arrive at

$$= \int_{\mathcal{C}} \left[ \int_{\mathcal{U}} \left\{ \int_{\mathcal{W}} h_0^p(w, c)\, p(w \mid u)\, dw \right\} q(u \mid c, x)\, du \right] q(c \mid x)\, dc = \int_{\mathcal{C}} \int_{\mathcal{W}} h_0^p(w, c)\, q(w, c \mid x)\, dw\, dc. \tag{B.2}$$

Since we can identify $h_0^p$ from the observables $(W, X, Y, C)$ of the source domain by solving the linear system (4.1), given observables $(W, C, X)$ from the target domain, we can identify $E_q[Y \mid x]$.

B.2 Proof of Proposition 4.2

The following proof is a generalization of the proof of Miao et al. (2018), suited to the multi-domain case.
All variables besides $Z$ are assumed to be discrete-valued and multivariate: $V$ can take $k_V$ values for $V \in \{U, X, Y, W\}$.

Let $\mathbf{P}(W \mid U) = [\mathbf{P}(W \mid u_1), \dots, \mathbf{P}(W \mid u_{k_U})] \in \mathbb{R}^{k_W \times k_U}$. Similarly, define $\mathbf{P}(Y \mid U, x) = [\mathbf{P}(Y \mid u_1, x), \dots, \mathbf{P}(Y \mid u_{k_U}, x)] \in \mathbb{R}^{k_Y \times k_U}$. This notation carries through to the remaining variables.

The approach we take differs from the concept case (and the standard proxy case) in the following way: we do not observe $Z$ in the training or test domains, nor do we know its true dimension (indeed, $Z$ may be continuous valued). Rather, we assume that we have at least $k_Z$ distinct draws $z_r$ from $Z$ in training, where $r \in \{1, \dots, k_Z\}$ is the domain index, and that $k_Z \geq k_U$. We also suppose that in test, we observe a distinct draw $z_{k_Z+1}$ which was not seen in training. Our goal is to obtain a bridge function, which in the categorical case will be a bridge matrix $M_{w,x} \in \mathbb{R}^{k_Y \times k_W}$.

Define $\mathbf{P}_r(V \mid x) := \mathbf{P}(V \mid x, z_r)$ for $V \in \{U, Y, W\}$. We assume that for each $x$,

$$\mathrm{rank}(\mathbf{P}_{1:k_Z}(U \mid x)) = k_U, \qquad \mathbf{P}_{1:k_Z}(U \mid x) := [\mathbf{P}_1(U \mid x), \dots, \mathbf{P}_{k_Z}(U \mid x)],$$

which implies that $P(U \mid x, z_r)$ varies with $z_r$, and that we see a sufficient diversity of domains to span the space of vectors on $U$. The graphical model supports the conditional independence relation $\{Y, X, W\} \perp\!\!\!\perp Z \mid U$; however, we will only require the standard proxy assumptions $W \perp\!\!\!\perp X, Z \mid U$ and $Y \perp\!\!\!\perp Z \mid X, U$.

Next, as in the concept case, we require $\mathbf{P}(Y \mid U, x) = M_{w,x}\, \mathbf{P}(W \mid U)$, where we assume $\mathrm{rank}(\mathbf{P}(W \mid U)) = k_U$ (as in the first condition of Assumption 7). The matrix $M_{w,x}$ is invariant to the distribution $P(U)$ by construction. If we can solve for $M_{w,x}$, then given a novel domain corresponding to the draw $z_{k_Z+1}$, we have

$$\mathbf{P}(Y \mid U, x)\, \mathbf{P}_{k_Z+1}(U \mid x) = M_{w,x}\, \mathbf{P}(W \mid U)\, \mathbf{P}_{k_Z+1}(U \mid x) \;\Longrightarrow\; \mathbf{P}_{k_Z+1}(Y \mid x) = M_{w,x}\, \mathbf{P}_{k_Z+1}(W \mid x).$$

This allows us to compute conditional expectations under $P(Y \mid x)$ in the novel domain, based on observations of $(W, X)$ in this domain. To solve for $M_{w,x}$, we project both sides on a basis over $U$ arising from the training domains,

$$\mathbf{P}(Y \mid U, x)\, \mathbf{P}_{1:k_Z}(U \mid x) = M_{w,x}\, \mathbf{P}(W \mid U)\, \mathbf{P}_{1:k_Z}(U \mid x),$$

where we define $\mathbf{P}_{1:k_Z}(Y \mid x) = [\mathbf{P}_1(Y \mid x), \dots, \mathbf{P}_{k_Z}(Y \mid x)]$, and likewise $\mathbf{P}_{1:k_Z}(W \mid x)$. Then the above becomes

$$\mathbf{P}_{1:k_Z}(Y \mid x) = M_{w,x}\, \mathbf{P}_{1:k_Z}(W \mid x), \qquad M_{w,x} = \mathbf{P}_{1:k_Z}(Y \mid x)\, \mathbf{P}_{1:k_Z}(W \mid x)^\dagger. \tag{B.3}$$

This demonstrates that we can recover the domain-invariant $M_{w,x}$ purely from observed data.

One domain is not enough: we illustrate with an example, where we again consider the case where all variables are categorical:

$$\mathbf{P}(Y \mid x) = M_{w,x}\, \mathbf{P}(W \mid x), \tag{B.4}$$

where $\mathbf{P}(Y \mid x)$ is a $k_Y \times 1$ vector of probabilities, $\mathbf{P}(W \mid x)$ is a $k_W \times 1$ vector of probabilities, and $M_{w,x}$ is a $k_Y \times k_W$ matrix for which we wish to solve. We have too few equations for the number of unknowns. One solution to (B.4) is the matrix of conditional probabilities $M_{w,x} = \mathbf{P}(Y \mid W, x)$. This matrix is not invariant to changes to $P(U)$, however:

$$\mathbf{P}(Y \mid W, x) = \mathbf{P}(Y \mid U, x)\, \mathbf{P}(U \mid W, x).$$

The posterior $\mathbf{P}(U \mid W, x)$ changes when the prior $P(U)$ changes. In contrast, the solution in (B.3) is guaranteed to be domain invariant.

B.3 Proof of Proposition 4.3

For all $r = 1, \dots, k_Z$, we can write

$$E_r[Y \mid x] = E[Y \mid x, z_r] = \int_{\mathcal{W}} m_0(w, x)\, dP(w \mid x, z_r) = \int_{\mathcal{U}} \int_{\mathcal{W}} m_0(w, x)\, dP(w \mid u)\, dP(u \mid x, z_r); \tag{B.5}$$

$$E[Y \mid x, z_r] = \int_{\mathcal{U}} E[Y \mid x, u]\, dP(u \mid x, z_r). \tag{B.6}$$

By Assumption 6, the integrands of (B.5)-(B.6) have the following proper
