Statistical Analysis of Twin Data PDF
Summary
This document analyzes statistical methods for twin data, focusing on the use of structural equation models. It details the applications of various models, the integration of traditional medical and psychological approaches, and the advantages and disadvantages of different types of studies, such as family, adoption and twin studies. The document further discusses the assumptions made in classical twin studies and methodologies encompassing structural equation modelling, software like Mx, and goodness-of-fit measures.
Statistical Analysis of Twin Data - Reading
31 October 2023 13:29
Source: Analytic approaches to twin data using structural equation models (Rijsdijk & Sham, 2002)

Notes

Introduction to Quantitative Genetics
Goal: Study the relative contribution of genetic and environmental influences to individual differences in traits
Methods: Family, adoption, and twin studies

Types of studies
Family studies: Investigate familial aggregation of a disease or trait
Adoption studies: Compare adopted and non-adopted children to control for shared environmental factors
Twin studies: Compare monozygotic (MZ) and dizygotic (DZ) twins to disentangle genetic from environmental factors

Applications
Behavioral traits: Cognitive ability, personality, etc.
Non-behavioral traits: Height, BMI, brain volume, etc.

Integration of approaches
Traditional medical model: Diseases defined as categorical entities
Psychological approach: Quantitative measures of traits
Trend to integrate the two approaches, especially for traits with both diagnostic criteria and quantitative measures (e.g., depression and anxiety)

Advantages and disadvantages of each study type
Family studies:
○ Advantage: Easy to conduct
○ Disadvantage: Cannot discriminate genetic from shared environmental factors
Adoption studies:
○ Advantage: Can control for shared environmental factors
○ Disadvantage: Difficult to obtain data; selective placement and prenatal influences can bias data
Twin studies:
○ Advantage: Can disentangle genetic from environmental factors
○ Disadvantage: Data can be difficult to interpret due to confounding factors

Twin Studies

Introduction to the Classical Twin Method
Utilizes data from MZ (monozygotic) and DZ (dizygotic) twin pairs to analyse genetic and environmental influences on traits.
Two main types of twin studies: those ascertained through affected probands and those based on population twin registers.
Basic principles of the twin method are shared, irrespective of the type of twin study.

Biometrical Genetics and the Twin Method
Structural equations link observed traits to underlying genotypes and environments.
Components of genetic and environmental variation:
○ Additive genetic influences (A)
○ Non-additive genetic influences (dominance, D, or epistasis)
○ Common environmental influences (C)
○ Unique environmental influences (E)
Total phenotypic variance (P) is the sum of these components (P = A + D + C + E).

Falconer's Formula
Heritability (h^2) is estimated using Falconer's formula: h^2 = 2(rMZ - rDZ), where r is the intraclass correlation coefficient.
Shared and non-shared environmental effects: c^2 = rMZ - h^2 (equivalently 2rDZ - rMZ), e^2 = 1 - h^2 - c^2 (equivalently 1 - rMZ).
More advanced methods are used for complex analyses such as sex differences and multivariate data.

Assumptions of the Twin Method
Assumptions in classical twin studies include:
○ MZ and DZ twins share environments equally.
○ Minimal gene-environment correlations and interactions.
○ Twins are representative of the general population.
○ Random mating in the population (no assortative mating).
Violation of these assumptions has consequences that should be considered.

Path Analysis and Structural Equations
Path analysis interprets correlations between variables in terms of causal relations.
The twin model is illustrated using a path diagram, with latent genetic and environmental variables.
Path estimates (a, c, d, e) represent the effects of latent variables on observed traits.
The genetic covariance between twins is calculated from the path estimates.
The total covariance between twins is derived from all the chains connecting their observed traits.
Variances and covariances within MZ and DZ pairs are expressed in terms of variance components.

Contrasting Effects of Genetic and Environmental Factors
Dominance is indicated when DZ correlations are less than half the MZ correlations.
Common environmental influences make DZ correlations greater than half the MZ correlations.
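
As a rough numerical illustration of the two patterns just described, the sketch below applies Falconer's formulas in their standard form (h^2 = 2(rMZ - rDZ), c^2 = rMZ - h^2 = 2rDZ - rMZ, e^2 = 1 - rMZ) together with the standard expected twin correlations for standardized components (rMZ = a^2 + d^2 + c^2, rDZ = a^2/2 + d^2/4 + c^2). The function names and the example values are illustrative assumptions and do not come from the source notes.

```python
def falconer(r_mz, r_dz):
    """Rough variance-component estimates from MZ and DZ twin correlations."""
    h2 = 2.0 * (r_mz - r_dz)   # heritability: twice the MZ-DZ difference
    c2 = r_mz - h2             # shared environment, same as 2*r_dz - r_mz
    e2 = 1.0 - r_mz            # unique environment, same as 1 - h2 - c2
    return h2, c2, e2


def expected_correlations(a2, c2=0.0, d2=0.0):
    """Expected MZ and DZ correlations for standardized components."""
    r_mz = a2 + d2 + c2                 # MZ twins share all A, D and C
    r_dz = 0.5 * a2 + 0.25 * d2 + c2    # DZ twins share 1/2 A, 1/4 D, all C
    return r_mz, r_dz


# Hypothetical example values, chosen only to show the two patterns.
# Additive genes plus common environment: rDZ exceeds half of rMZ.
r_mz_ace, r_dz_ace = expected_correlations(a2=0.6, c2=0.2)

# Additive genes plus dominance: rDZ falls below half of rMZ.
r_mz_ade, r_dz_ade = expected_correlations(a2=0.4, d2=0.4)

h2, c2, e2 = falconer(r_mz_ace, r_dz_ace)
```

When dominance is present, `falconer` returns a negative c^2, which is one informal sign that an ACE interpretation does not suit the data and that formal model fitting is needed.
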
DZ correlations around half the MZ correlations suggest additive genetic influences or a combination of common environmental and dominance genetic effects.
Additional data, such as adoptive siblings, can help estimate the effects of both components.
Indices of the relative contribution of genetic and environmental effects are reported as standardized values.

Structural equation model fitting

Structural Equation Modelling (SEM) and Twin Data Analysis
Path diagrams, structural equations, and covariance matrices are interchangeable in modelling.
SEM serves as a unified platform for analysing twin data.
SEM programs involve matrix calculations and numerical optimization routines.
SEM tests hypotheses about relations among observed and latent variables.
SEM is suitable for human and animal quantitative genetic data analysis.

Mx: A Software for Twin Data Analysis
The Mx package is specifically designed to model genetically sensitive data in a flexible way.
It offers a graphical user interface (MxGUI) for drawing path diagrams, or allows users to write scripts to specify models.
Mx supports various data input formats, including summary statistics and raw data.
Raw data offer greater flexibility and handle missing data, finite mixture distributions, and continuous moderator variables.
Mx can test both dichotomous (e.g., sex) and continuous moderator variables.

Fit Function in SEM Programs
SEM programs estimate model parameters by minimizing a goodness-of-fit statistic.
The maximum-likelihood criterion is a common and robust goodness-of-fit measure.
Parameters are adjusted iteratively to maximize the log-likelihood.
The log-likelihood function is calculated from the observed scores and the model parameters.
Likelihood-based confidence intervals (CIs) are used for parameter precision.

Goodness-of-Fit Measures
Goodness-of-fit compares the model to a perfectly fitting (saturated) model using a likelihood ratio chi-square statistic (χ^2).
A non-significant χ^2 value indicates the model fits the data well, while a significant χ^2 value suggests a poor fit.
Degrees of freedom (df) for the χ^2 test are calculated as the number of observed statistics minus the number of model parameters.

Statistical Significance in Twin Data Analysis
The significance of differences between competing models can be tested using the difference in χ^2 and df.
This requires the models to be nested, meaning one model's parameters are a subset of the other's.
Testing can determine whether components (A, D, C, and E) are significantly present.
A simpler, nested model is preferred if its fit is not significantly worse than that of the full model, as it provides a more parsimonious* explanation of the data.
*Law of parsimony: the principle that the simplest explanation of an event or observation is the preferred explanation

Multivariate genetic models
Multivariate genetic models can be used to investigate the genetic overlap between different disorders, the continuity of genetic factors at different stages of the illness, and the relation between genetic factors and mediating or environmental variables.
Within-individual cross-trait covariances imply common aetiological influences.
Cross-twin cross-trait covariances imply that these common aetiological influences are familial.
Whether these common familial aetiological influences are genetic or environmental is reflected in the MZ/DZ ratio of the cross-twin cross-trait covariances.

Depression and anxiety
For depression and anxiety, two very common disorders that often occur together, it was found that the substantial genetic components of both disorders were due to the same genetic factors.
However, the environmental factors were different and were thus responsible for shaping the different outcomes.

Multivariate ACE model
Total phenotypic variance: the total variability in how traits or characteristics are expressed in a population.
It is a sum of three components:
▪ Additive Genetic Effects: The genetic influence on the traits passed down from parents.
▪ Common Environmental Effects: Environmental factors that affect individuals in a similar way, often due to shared living conditions or experiences.
▪ Specific Environmental Effects: Unique environmental factors that affect individuals differently, making them differ from one another.
ΣP = ΣA + ΣC + ΣE
ΣP is the covariance matrix of the phenotypes.
ΣA is the covariance matrix of the additive genetic effects.
ΣC is the covariance matrix of the common environmental effects.
ΣE is the covariance matrix of the specific environmental effects.
PSYC0036 Genes and Behaviour Page 1

Phenotypic correlation (r21): a measure of how two traits or characteristics are related in a population. It is a number between -1 and 1, where 1 means they are perfectly correlated (as one increases, the other also increases), -1 means they are perfectly inversely correlated (as one increases, the other decreases), and 0 means there is no correlation.
▪ Variance: Measures how much an individual trait varies within a population.
▪ Covariance: Measures how the two traits change together, that is, whether they tend to increase or decrease together.
The phenotypic correlation is the covariance scaled by the two standard deviations: r21 = σ21 / sqrt(σ11 × σ22).
r21 is the phenotypic correlation between the two phenotypes.
σ11 is the variance of the first phenotype.
σ12 is the covariance between the first and second phenotypes.
σ21 is the covariance between the second and first phenotypes (equal to σ12).
σ22 is the variance of the second phenotype.

Genetic correlation (rA): similar to the phenotypic correlation, but computed only from the genetic factors that influence the two traits. It tells us how strongly the additive genetic effects on the two traits are correlated.
▪ Variance of Additive Genetic Effects: The counterpart of the variance (σ11) in the phenotypic correlation, but for the genetic component.
▪ Covariance of Additive Genetic Effects: Measures how the genetic effects on the two traits are related.
The genetic correlation is calculated in the same way as the phenotypic correlation: rA = σA12 / sqrt(σA11 × σA22).
rA is the genetic correlation between the two phenotypes.
σA11 is the variance of the additive genetic effects for the first phenotype.
σA12 is the covariance between the additive genetic effects for the first and second phenotypes.
σA22 is the variance of the additive genetic effects for the second phenotype.

Categorical Twin data

Data
Ordered categories: presence/absence of a disease, responses to a single item on a questionnaire
Counts: number of individuals within each response category

Model
Assume ordered categories reflect an imprecise measurement of an underlying normal distribution of liability
The liability distribution has one or more thresholds (cut-offs) to discriminate between the categories

Advantages and Disadvantages of Variance Component Genetic Models for Categorical Twin Data
Advantages:
○ Can be used to study traits that can only be measured in a small number of ordered categories
○ Can control for shared environmental factors by comparing monozygotic (MZ) and dizygotic (DZ) twins
Disadvantages:
○ Model assumptions may be violated (e.g., liability distribution may not be normal)
○ Results can be difficult to interpret due to confounding factors (e.g., age, sex)

Genetic model fitting of categorical twin data
What are tetrachoric correlations?
Correlations in liability for dichotomous traits, estimated from contingency tables of monozygotic (MZ) and dizygotic (DZ) twin pairs
How are tetrachoric correlations estimated?
By maximum likelihood, using programs such as Mx or PRELIS
How is the heritability of liability estimated?
Variance decomposition applied to liability, in which correlations in liability are determined by the path model
Advantages of tetrachoric correlations
○ Can be used to study dichotomous traits (e.g., presence/absence of a disease)
○ Can control for shared environmental factors by comparing MZ and DZ twins
Disadvantages of tetrachoric correlations
○ Model assumptions may be violated (e.g., liability distribution may not be normal)
○ Results can be difficult to interpret due to confounding factors (e.g., age, sex)

Maximum-likelihood analysis of contingency tables
Advantages of fitting models directly to contingency tables of twin data
○ Can be used to test the fit of a model to categorical twin data
○ Can be used to estimate the parameters of a model (e.g., heritability, correlation)
Disadvantages of fitting models directly to contingency tables of twin data
○ Can be computationally expensive
○ Can be sensitive to model assumptions (e.g., normality of liability distribution)

Co-morbidity
What is comorbidity?
The co-occurrence of two or more disorders in the same patient

Models of comorbidity
Correlated liability model: Individuals have disorder A if they are above threshold on the liability to A (LA) and disorder B if they are above threshold on the liability to B (LB). Comorbidity arises when the correlation between LA and LB is greater than 0.
Liability is a latent (hidden) variable that is thought to represent the underlying vulnerability to a disorder.

Likelihood function
The likelihood function in the correlated liability model involves integration over four dimensions: the liabilities for A and B in both members of the pair.

Contingency table
The data can be summarized in a 4 × 4 contingency table (four disorder categories for each twin).

Predicted proportions
Assuming multivariate normality, each of the predicted proportions in the 16 cells of the contingency table can be expressed as a quadruple integral in which only the limits are altered.
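
The same idea is easier to see in the simpler single-disorder case, where each twin pair falls into one cell of a 2 × 2 table and each predicted proportion is a double integral over a bivariate normal liability distribution. The sketch below is a minimal illustration under stated assumptions: liabilities are standardized, both twins share one threshold, and the integral is evaluated with Simpson's rule. The function names, the integration settings, and the example threshold and correlation are hypothetical and are not taken from the source or from Mx.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_both_affected(threshold, rho, upper=8.0, steps=2000):
    """P(L1 > t and L2 > t) for standard bivariate normal liabilities.

    Integrates over twin 1's liability from the threshold up to `upper`
    with Simpson's rule (`steps` must be even, |rho| < 1, threshold < upper).
    """
    s = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        # density of L1 = x times P(L2 > t | L1 = x)
        return normal_pdf(x) * normal_cdf((rho * x - threshold) / s)

    h = (upper - threshold) / steps
    total = integrand(threshold) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(threshold + i * h)
    return total * h / 3.0


def cell_proportions(threshold, rho):
    """Predicted proportions of concordant and discordant twin pairs."""
    prevalence = 1.0 - normal_cdf(threshold)
    both = prob_both_affected(threshold, rho)
    one_only = prevalence - both          # a given twin affected, co-twin not
    neither = 1.0 - 2.0 * prevalence + both
    return {"both": both, "discordant": 2.0 * one_only, "neither": neither}


# Hypothetical example: threshold at 1 SD and liability correlation 0.6.
cells = cell_proportions(threshold=1.0, rho=0.6)
```

In model fitting, the liability correlation would be replaced by its expectation under the path model (for example, a different value for MZ and DZ pairs), and the predicted proportions would be compared with the observed cell counts.
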

Implications for twin resemblance
The correlated liability model has implications for the resemblance of the disorders across MZ and DZ twins.
Quantitative predictions can be made about the relative proportions of twin pairs in which each member can fall into one of four categories:
○ neither disorder
○ disorder A but not B
○ disorder B but not A
○ both A and B

Categorical data from proband-ascertained samples

Complete selection
Categorical twin data obtained from random population samples
All four cells of the contingency table are represented in the same proportions as in the complete population

Proband-ascertained samples
Used when collection of random samples from the population is inefficient (e.g., for rare diseases)
Twin pairs are ascertained through a register of affected individuals (probands)
Several types of ascertainment:
○ Complete ascertainment: all affected twins in a community sample are registered and selected as probands
○ Single ascertainment: all twin pairs have just one proband
○ Multiple incomplete ascertainment: lies between complete and single ascertainment

Advantages of proband-ascertained samples
○ Can be used to study rare diseases
○ Can be more efficient than collecting random samples from the population
Disadvantages of proband-ascertained samples
○ Can lead to bias in the results, because the sample is not representative of the general population
○ Can be difficult to estimate the extent of incomplete ascertainment

Probandwise concordance rate
What is the probandwise concordance rate?
The probability that the co-twin of a proband twin will also have the disorder
How is the probandwise concordance rate calculated?
For complete ascertainment:
○ Number of probands whose co-twins are affected / Number of probands
Interpretation of the probandwise concordance rate
A difference between MZ and DZ concordance rates suggests a genetic component to the disorder

Pairwise concordance rate
Now obsolete
Calculated as the number of pairs where both twins are affected divided by the total number of twin pairs

Maximum-Likelihood Analysis of Categorical Data from Proband-Ascertained Samples
Structural equation modelling can be used to analyse proband-ascertained twin samples, but an ascertainment correction is needed
○ Effect of ascertainment
▪ Distorts the frequencies of the cells in the contingency table
▪ Concordant unaffected pairs will not be ascertained
▪ Twin pairs with one affected member may be under-represented
○ Correction for ascertainment
▪ Introduce an ascertainment probability for each cell
▪ Adjust the cell probabilities by multiplication with these ascertainment probabilities, divided by a scaling factor so that they sum to 1

Software for maximum-likelihood analysis
Mx can be used to conduct this type of analysis using the option for a user-defined fit function

Checking the assumption of the twin method

Equal environments
Assumption: MZ and DZ twins share their environment to the same extent.
Implications: If MZ twins are treated more similarly than DZ twins, this can lead to an overestimation of the genetic effect and an underestimation of the shared environmental effect.
How to detect:
○ Look at the phenotypic correlation between parents for the trait in question.
○ Compare the correlation between a measure of family environment (e.g., parental responsivity) and offspring traits in non-adoptive and adoptive families.

Genotype–environment effects
Assumption: Individuals are exposed to environments randomly, regardless of their genotype.
Implications: If individuals choose partners who are phenotypically like themselves, or if they select themselves into certain environments based on their genotype, this can lead to an overestimation of the shared environmental effect.
How to detect:
○ Look at the phenotypic correlation between parents for the trait in question.
○ Trace the change in spouse resemblance over time or analyse the resemblance between the spouses of biologically related individuals.
○ Active G × E correlation: Individuals create or evoke environments that are a function of their genotype.
○ Passive G × E correlation: Individuals are exposed to environments that are provided by their biological relatives, with whom they are genetically related.

Gene–environment interaction
Assumption: Different genotypes respond identically to the same environment.
Implications: If different genotypes respond differently to the same environment, this can lead to an overestimation of the shared environmental effect.
How to detect:
○ Look for a relationship between the sums and the absolute differences of twin pairs' scores (known as heteroscedasticity).
○ Look for a relationship between trait sum and absolute trait difference in DZ but not MZ twins.

Generalisability of twins to the general population
Assumption: Twins are representative of the target population from which the researcher has been sampling.
Implications: If twins are not representative of the general population, this can lead to an underestimation or overestimation of the heritability of the trait.
How to detect:
○ Compare the prevalence of the trait in twins and singletons.
○ Look for differences in the obstetric and paediatric complications of twins and singletons.

Power calculations

Results of power studies
Results of power studies show that at least 200 pairs are needed for obtaining a reasonable estimate of the degree of genetic influence on a highly heritable trait.
For traits of intermediate or low heritability, 10–20 times these numbers are required.
The same is true for detecting family environmental effects and non-additive genetic effects.

Conclusion

Assumptions of the Classical Twin Design
Equal environments assumption (EEA): Environmental similarity is roughly the same for both types of twin pairs reared in the same family.

Potential Violations of the EEA
More similar treatment of MZ twins: This can result in increased MZ correlations relative to DZ correlations, overestimating the genetic effect and underestimating the shared environmental effect.
Differential prenatal history of MZ vs. DZ twins: This may lead to differences in monochorionic and dichorionic MZ twin pairs, challenging the validity of the classical twin study.

SEM Methodology for Twin Data
Basic biometrical genetic models: Partition genetic from environmental factors on a trait.
Extensions to multivariate and categorical data: Common extensions to accommodate different types of data.
Additional extensions:
○ Testing quantitative sex differences in heritability.
○ Testing qualitative sex differences (different genetic factors operating across sexes).
○ Modelling measurement errors, reporting bias, reciprocal twin interaction, and twins–parents data.

Summary
The classical twin design is a powerful tool for partitioning genetic from environmental factors on a trait, but it relies on the EEA.
Potential violations of the EEA, such as more similar treatment of MZ twins and differential prenatal history, can lead to biased estimates of genetic and environmental effects.
SEM methodology can be extended in many ways to address these limitations, such as testing sex differences and modelling measurement errors.