University of Western Ontario
2024
Dr. Parisa Shooshtari
This document is a lecture on Genome-Wide Association Studies (GWAS). It covers the basics of GWAS, including its aims, experimental workflow, and visualization methods. It also discusses different types of analysis techniques commonly used for GWAS data.
Genome-Wide Association Studies (GWAS)
Instructor: Dr. Parisa Shooshtari
MBI 3100 – Lecture 11
November 13, 2024

Genome-Wide Association Studies (GWAS) in One Slide
[Figure: at a single-nucleotide polymorphism (SNP) with alleles A and C, cases (10,000 people with disease) carry 62% A and 38% C, whereas controls (10,000 people without disease) carry 49% A and 51% C.]
Is there a significant shift in allele frequency between cases and controls?
[Figure: Manhattan plot for schizophrenia showing the genetic association per SNP (P value) by chromosome, with hundreds of genomic regions (loci) significantly associated with disease at a significance level of P = 5 × 10^-8. Image from: Ripke et al., Nature 2014]

Zoom In To a GWAS Locus
[Figure: zoomed-in view of a single GWAS locus]

Outline
- Introduction
- Experimental Workflow of GWAS
- Selecting Study Population
- Genotyping
- Data Processing
- GWAS Results

What is GWAS?
Genome-wide association studies (GWAS) aim to identify associations of genotypes with phenotypes by testing for differences in the allele frequency of genetic variants between individuals who are ancestrally similar but differ phenotypically.
GWAS can consider copy-number variants or sequence variations in the human genome, although the most commonly studied genetic variants in GWAS are single-nucleotide polymorphisms (SNPs).

Difference Between Genome-Wide Association Studies (GWAS), Whole-Genome Sequencing (WGS), and Whole-Exome Sequencing (WES)
Genome-wide association studies (GWAS) generally involve targeted genotyping of specific, pre-selected variants using microarrays.
Whole-exome sequencing (WES) and whole-genome sequencing (WGS) studies aim to capture all genetic variation (within the protein-coding regions for WES, across the whole genome for WGS).
Strictly speaking, both WES and WGS studies are also GWAS, although in the literature ‘GWAS’ mostly refers to genome-wide studies of common variants and is sometimes considered separate from WGS and WES studies.

Common vs. Rare Genetic Variants
Declaring a variant as common or rare is population-specific and cannot be generalized across populations.
Common variants: variants with a minor allele frequency above 5%. As study population sizes grow, this threshold can be as low as 1%, because researchers typically adhere to a minimum minor allele count (for example, at least 100 individuals who carry at least one copy of the minor allele).
[Figure from: Manolio et al., “Finding the missing heritability of complex diseases”, 2009]

Question: What is the main difference between GWAS and WES/WGS?
Question: In GWAS, do we mostly consider rare or common variants? Why?

Some Statistics on GWAS Studies
- > 5,700 GWAS studies
- > 3,300 traits (phenotypes)
- > 1,000,000 participants
- Thousands of associated and replicable variants (SNPs)
- Hundreds of genomic loci

Challenges in Interpreting the Associations
- Individual variants confer very little risk
- These variants are often associated with many other traits
- Variants are correlated with causal and non-causal variants that are physically close
Direct biological, causal inference is very complicated.

Individual Variants Confer Very Little Risk
[Figure from: Manolio et al., “Finding the missing heritability of complex diseases”, 2009]

Variants Associated to Multiple Traits
These variants are often associated with many other traits.
[Figure: GWAS association signals (−log10(P)) by position on chromosome 6 (89.98 to 91.98 Mbp) for autoimmune thyroid disease (AITD), celiac disease (CEL), multiple sclerosis (MS) and type 1 diabetes (T1D), each annotated with a posterior value. Shooshtari et al., AJHG 2017]

Variants are correlated with causal and non-causal variants that are physically close due to linkage disequilibrium (LD).
Association should not be confused with causality.

Another Challenge in Interpreting the Associations
Another challenge is that genetic associations may differ across ancestries (e.g. European vs. African or Asian), complicating direct comparisons between groups of individuals.
These limitations result in unclear conclusions about the biological meaning of GWAS results, sometimes limiting their utility to produce mechanistic insights or to serve as starting points for drug development.

Genetic Associations May Differ Across Ancestries
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Question: What are the four main challenges that make interpretation of GWAS results complicated?
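
The question posed in the one-slide summary at the start of the lecture (is there a significant shift in allele frequency between cases and controls?) can be illustrated with a minimal sketch. It assumes that the percentages on that slide are allele frequencies, that each of the 10,000 cases and 10,000 controls contributes two alleles, and that a simple Pearson chi-square test on allele counts is an acceptable stand-in for the regression models used in practice. The function name and variables are hypothetical and chosen only for illustration.

```python
import math


def allelic_chi_square(case_a, case_c, control_a, control_c):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    n = case_a + case_c + control_a + control_c
    numerator = n * (case_a * control_c - case_c * control_a) ** 2
    denominator = ((case_a + case_c) * (control_a + control_c)
                   * (case_a + control_a) * (case_c + control_c))
    chi2 = numerator / denominator
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Odds of carrying allele A in cases relative to controls.
    odds_ratio = (case_a * control_c) / (case_c * control_a)
    return chi2, p_value, odds_ratio


# Assumption: 10,000 diploid individuals per group, so 20,000 alleles per group.
alleles_per_group = 2 * 10_000
case_a = round(0.62 * alleles_per_group)      # 62% A in cases
case_c = alleles_per_group - case_a           # 38% C in cases
control_a = round(0.49 * alleles_per_group)   # 49% A in controls
control_c = alleles_per_group - control_a     # 51% C in controls

chi2, p_value, odds_ratio = allelic_chi_square(case_a, case_c, control_a, control_c)
print(f"chi-square = {chi2:.1f}, P = {p_value:.3g}, odds ratio for allele A = {odds_ratio:.2f}")
print("genome-wide significant:", p_value < 5e-8)
```

With frequencies this far apart and samples this large, the shift is far beyond the genome-wide threshold; real GWAS signals are usually much subtler, which is one reason large cohorts are needed.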
Experimental Workflow of GWAS
- The collection of DNA and phenotypic information (such as disease status, and demographic information such as age and sex) from a group of individuals
- Genotyping of each individual using available GWAS arrays or sequencing strategies
- Quality control
- Imputation of untyped variants using haplotype phasing and reference populations
- Conducting the statistical test for association
- Conducting a meta-analysis (optional)
- Seeking an independent replication
- Interpreting the results by conducting multiple post-GWAS analyses
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Experimental Workflow: Data Collection
Data can be collected from study cohorts, or available genetic and phenotypic information can be used from biobanks or repositories.
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Experimental Workflow: Genotyping of Each Individual
Genotypic data can be collected using microarrays to capture common variants, or using next-generation sequencing methods for whole-genome sequencing (WGS) or whole-exome sequencing (WES).
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Experimental Workflow: Quality Control
Quality control includes steps at:
- The wet-laboratory stage, such as genotype calling and DNA switches
- The dry-laboratory stages on called genotypes, such as deletion of bad single-nucleotide polymorphisms (SNPs) and individuals, detection of population strata in the sample and calculation of principal components
[Figure: clustering of individuals according to genetic substrata. Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021]

Experimental Workflow: Imputation of Untyped Variants
Genotypic data can be phased, and untyped genotypes imputed using information from matched reference populations from repositories such as the 1000 Genomes Project or TOPMed.
In the illustrated example, the genotypes of SNP1 and SNP3 are imputed based on the directly assayed genotypes of other SNPs.
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Experimental Workflow: Genetic Association Test
Genetic association tests are run for each genetic variant, using an appropriate model (for example, additive or non-additive, linear or logistic regression). Confounders, including population strata, are corrected for, and multiple testing needs to be controlled. The output is inspected for unusual patterns and summary statistics are generated.
E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021
[Figure: the case/control allele-frequency comparison and the schizophrenia Manhattan plot from “GWAS in One Slide”. Image from: Ripke et al., Nature 2014]

Experimental Workflow: Meta-Analysis
Results from multiple smaller cohorts are combined using standardized statistical pipelines.
E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021
[Figure: GWAS of post-traumatic stress disorder (PTSD) in European-ancestry and African-ancestry cohorts. Nievergelt et al., “International meta-analysis of PTSD genome-wide association studies identifies sex- and ancestry-specific genetic risk loci”, Nat. Com. 2019]

Experimental Workflow: Replication
Results can be replicated using internal replication or external replication in an independent cohort.
For external replication, the independent cohort must be ancestrally matched and must not share individuals or family members with the discovery cohort.
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Experimental Workflow: Post-GWAS Analysis
Post-GWAS analysis covers in silico analysis of GWAS results using information from external resources. This can include:
- In silico fine-mapping
- SNP-to-gene mapping
- Gene-to-function mapping
- Pathway analysis
- Genetic correlation analysis
- Mendelian randomization
- Polygenic risk prediction
E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021
Also, after GWAS, functional hypotheses can be tested using experimental techniques such as CRISPR or massively parallel reporter assays, or results can be validated in a human trait/disease model.

Question: Can you explain each step of the GWAS experimental workflow? 5 minutes to review this in your group.
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Selecting Study Population
GWAS often require very large sample sizes to identify reproducible genome-wide significant associations.
- The desired sample size can be determined using power calculations in software tools such as CaTS or GPC.
Study designs:
- Cases and controls (when the trait is binary)
- Quantitative (when the trait is quantitative)
The choice of data resource and study design for a GWAS depends on the required sample size, the experimental question and the availability of pre-existing data or the ease with which new data can be collected.

Selecting Study Population (cont.)
Assembling data sets of a sufficient size to run a well-powered GWAS for a complex trait requires major investments of time and money that go beyond the capacity of most individual laboratories.
However, there are several excellent public resources available that provide access to large cohorts with both genotypic and phenotypic information, and the majority of GWAS are conducted using these pre-existing resources.
Even when new data have been collected in-house, these will typically be co-analysed with data from pre-existing resources.

Genotyping
- Microarrays for common variants
- NGS for both common and rare variants
Microarray-based genotyping is the most commonly used method for obtaining genotypes for GWAS owing to the current cost of next-generation sequencing.
WGS (which determines nearly every genotype of a full genome) is preferred over WES and microarrays.
WGS is expected to become the method of choice over the next couple of years with the increasing availability of low-cost WGS technology.

Data Processing: Input Files
Input files for a GWAS include
- Anonymized individual ID numbers
- Coded family relations between individuals
- Sex
- Phenotype information (e.g. case/control)
- Covariates
- Genotype calls for all called variants (e.g. SNPs)
- Information on the genotyping batch

Data Processing: Input Files (cont.)
Data format in PLINK (PED and MAP files)
The PED file is a white-space (space or tab) delimited file. The first six columns are mandatory:
1. Family ID
2. Individual ID
3. Paternal ID
4. Maternal ID
5. Sex (1=male; 2=female; other=unknown)
6. Phenotype (e.g. disease/control)
PLINK has been specifically designed to analyse genetic data.

Data Processing: Input Files (cont.)
Data format in PLINK (PED and MAP files)
Each line of the MAP file describes a single marker and must contain exactly 4 columns:
1. Chromosome (1-22, X, Y or 0 if unplaced)
2. rs# or SNP identifier
3. Genetic distance (morgans)
4. Base-pair position (bp units)
In genetics, a centimorgan (cM) is a unit for measuring genetic linkage. It is defined as the distance between chromosome positions (also termed loci or markers) for which the expected average number of intervening chromosomal crossovers in a single generation is 0.01. It is often used to infer distance along a chromosome. One centimorgan corresponds to about 1 million base pairs in humans on average.

Question: Can you give a list of information you can find in the input files from genotyping?

Data Processing: Quality Control
Following input of the data, generating reliable results from GWAS requires careful quality control. Some example steps include:
- Removing rare variants
- Filtering SNPs that are missing from a fraction of individuals in the cohort
- Identifying and removing genotyping errors
- Ensuring that phenotypes are well matched with genetic data, often by comparing self-reported sex versus sex based on the X and Y chromosomes
PLINK has been specifically designed to analyse genetic data and can be used to conduct many of these quality control steps.
Link to PLINK: http://zzz.bwh.harvard.edu/plink/

Data Processing: Imputation
After quality control, variants usually undergo phasing and are imputed using a sequenced haplotype reference panel such as the 1000 Genomes Project or TOPMed. Imputation is the statistical inference of genotypes that have not been assayed directly.
Imputation servers: Michigan Imputation Server and TOPMed
Local tools: IMPUTE2, BEAGLE, MACH and SHAPEIT2
E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Data Processing: Imputation (cont.)
Imputation involves several steps:
- Statistically phase individual genotypes
- Decide whether to use hard calls or weight for uncertainty
- Select an appropriate reference population panel
- Convert reference panel and target population into the same genomic build
- Resolve issues between different platforms, possibly remove ambiguous SNPs
- Check for unusual minor allele frequencies and patterns of linkage disequilibrium between reference panel and target data
- Impute missing genotypes against the selected population panel
- Check imputation quality and possibly remove badly imputed SNPs

Question: What is imputation?

Data Processing: Ancestry Consideration
Ancestry and relatedness must be carefully considered and accounted for in GWAS, and indeed in all genetic studies, particularly in data sets from participants of diverse backgrounds, to avoid false positive or negative genetic signals and biased test statistics owing to population stratification.
Cases and controls should be matched by ancestry to avoid confounding.
For example, a GWAS for chopstick use where cases are defined as ‘using chopsticks regularly’ and controls as ‘not using chopsticks’ would likely result in cases being drawn more often from an East Asian population than controls. Not accounting for ancestry in this study would produce spurious associations at variants that are simply more common in East Asian populations than in other populations.
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021

Data Processing: Ancestry Consideration (cont.)
Ancestry is usually considered in GWAS through an iterative process using principal component analysis. The genotypes of all individuals are used to define clusters of individuals with similar genotypes.
Image: E. Uffelmann et al., “Genome-wide association studies”, Nat. Rev., 2021
The clusters are used:
- To identify and exclude outliers, and then
- To compute and include principal components as covariates in subsequent GWAS regression models.
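
The principal component step described above can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the pipeline used in the lecture: the two populations, their allele frequencies, the sample sizes and all function names are hypothetical, and plain power iteration on a sample-by-sample relationship matrix stands in for the dedicated software a real analysis would use.

```python
import random


def simulate_genotypes(allele_freqs, n_individuals, rng):
    """Draw 0/1/2 allele counts per SNP as two independent allele draws per individual."""
    return [[int(rng.random() < p) + int(rng.random() < p) for p in allele_freqs]
            for _ in range(n_individuals)]


def first_principal_component(genotypes, n_iter=200):
    """Return each individual's score on the first principal component."""
    n, m = len(genotypes), len(genotypes[0])
    # Centre and scale each SNP column; skip monomorphic SNPs (zero variance).
    columns = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        var = sum((g - mean) ** 2 for g in col) / n
        if var > 0:
            sd = var ** 0.5
            columns.append([(g - mean) / sd for g in col])
    # Sample-by-sample relationship matrix, averaged over the SNPs kept.
    k = [[sum(c[a] * c[b] for c in columns) / len(columns) for b in range(n)]
         for a in range(n)]
    # Power iteration converges to the leading eigenvector of the matrix.
    start = random.Random(0)
    v = [start.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(k[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical data: two populations with independently drawn allele frequencies.
rng = random.Random(42)
n_snps = 200
freqs_pop_a = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
freqs_pop_b = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
genotypes = (simulate_genotypes(freqs_pop_a, 20, rng)
             + simulate_genotypes(freqs_pop_b, 20, rng))

pc1 = first_principal_component(genotypes)
for label, scores in (("population A", pc1[:20]), ("population B", pc1[20:])):
    print(label, "mean PC1 score:", round(sum(scores) / len(scores), 3))
```

In this toy setting the first component separates the two simulated populations, which is why such scores are useful both for spotting outliers and as covariates in the association model.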
Data Processing: Testing for Association
Association tests for GWAS:
- A linear model is used if the phenotype is continuous, such as height, blood pressure or body mass index.
- A logistic regression model is used if the phenotype is binary, such as the presence or absence of disease.
[Figure: the case/control allele-frequency comparison and the schizophrenia Manhattan plot from “GWAS in One Slide”. Image from: Ripke et al., Nature 2014]

Data Processing: Testing for Association (cont.)
To account for stratification and avoid confounding effects from demographic factors, covariates such as age, sex and ancestry are included. The caveat is that this may reduce statistical power for binary traits in some samples.
When conducting a GWAS, it should be noted that the genotypes of genetic variants that are physically close together are not independent, as they tend to be in linkage disequilibrium (LD). This dependency between tests should also be considered when conducting a GWAS.

Question: Which association test model is usually used for quantitative traits (such as height)?
Question: Which association test model is usually used for binary traits (disease status)?

Data Processing: Accounting for False Discovery
Testing millions of associations between individual genetic variants and a phenotype of interest requires a stringent multiple-testing threshold to avoid false positives.
Recap from statistics: if you run N independent tests, then to adjust for multiple testing using the Bonferroni correction you need to use a threshold of 0.05/N instead of 0.05.
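
The Bonferroni recap above can be made concrete with a short sketch. The number of independent tests (one million) and the example P values are assumptions chosen for illustration, and the function name is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumption: one million independent tests.
n_independent_tests = 1_000_000
threshold = bonferroni_threshold(n_independent_tests)

# Hypothetical P values from four association tests.
example_p_values = [3e-9, 4e-8, 6e-8, 1e-3]
significant = [i for i, p in enumerate(example_p_values) if p < threshold]

print(f"per-test threshold: {threshold:.1e}")
print("indices passing the threshold:", significant)
```

Note that a P value of 1e-3, which would look convincing in a single test, is nowhere near significant once the correction is applied.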
The International HapMap Project and other studies have shown that there are approximately 1 million independent common genetic variants across the human genome on average. Therefore, our GWAS Bonferroni testing threshold is 0.05/1,000,000, which is P < 5 × 10^-8.
[Figure: Manhattan plot for schizophrenia with hundreds of genomic regions (loci) significantly associated with disease at a significance level of 5 × 10^-8. Image from: Ripke et al., Nature 2014]

GWAS Results: Summary Statistics
The primary output of a GWAS analysis is a list of
- P values
- Effect sizes, and
- Their directions
generated from the association tests of all tested genetic variants with a phenotype of interest.
Link to GWAS Catalog: https://www.ebi.ac.uk/gwas/home

[Figure: summary results for diastolic blood pressure (quantitative trait), annotated with risk allele frequency, odds ratio and confidence interval]
[Figure: summary results for multiple sclerosis (disease), annotated with risk allele frequency, odds ratio and confidence interval]

GWAS Results: Visualization
These data are routinely visualized using Manhattan plots and quantile-quantile plots, generated using software tools such as R or web platforms such as FUMA or LocusZoom.
[Figure: Manhattan plot]
[Figure: quantile-quantile plot (QQ plot)]

Question: What is the name of a resource where you can find GWAS summary results?
Question: What kind of information can you obtain from this resource for each trait?
Question: What are the two plot types that are commonly used for visualizing GWAS results?
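
To make the two plot types concrete, the sketch below computes the coordinates that a Manhattan plot and a QQ plot would display, without drawing anything. The toy chromosome lengths, positions and P values are hypothetical, and the function names are illustrative rather than taken from R, FUMA or LocusZoom.

```python
import math


def manhattan_points(results, chromosome_lengths):
    """Map (chromosome, position, P) to (cumulative genomic position, -log10(P))."""
    offsets, running = {}, 0
    for chrom, length in chromosome_lengths.items():
        offsets[chrom] = running  # chromosomes are laid end to end on the x axis
        running += length
    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in results]


def qq_points(p_values):
    """Pair expected and observed -log10(P) values under a uniform null."""
    n = len(p_values)
    ordered = sorted(p_values)
    # The rank-th smallest of n uniform P values is expected near rank / (n + 1).
    return [(-math.log10(rank / (n + 1)), -math.log10(p))
            for rank, p in enumerate(ordered, start=1)]


# Hypothetical toy data: two short "chromosomes" and three tested SNPs.
chromosome_lengths = {"1": 1000, "2": 800}
results = [("1", 100, 0.5), ("1", 900, 1e-9), ("2", 50, 0.01)]

manhattan = manhattan_points(results, chromosome_lengths)
qq = qq_points([p for _, _, p in results])
genome_wide_line = -math.log10(5e-8)

for x, y in manhattan:
    flag = "above" if y > genome_wide_line else "below"
    print(f"x = {x}, -log10(P) = {y:.2f} ({flag} the genome-wide line)")
for expected, observed in qq:
    print(f"expected = {expected:.2f}, observed = {observed:.2f}")
```

In a Manhattan plot, points above the horizontal line at -log10(5 × 10^-8) mark genome-wide significant SNPs; in a QQ plot, points that rise above the diagonal indicate more small P values than expected by chance.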
PLINK Analysis Toolset
PLINK: Whole genome association analysis toolset (https://zzz.bwh.harvard.edu/plink/)
Link to installation for PLINK 1.90 beta: (https://www.cog-genomics.org/plink/)

References
Uffelmann, E., et al., “Genome-Wide Association Studies”, Nature Reviews Methods Primers, (1:59) 2021
Tam, V., et al., “Benefits and Limitations of Genome-Wide Association Studies”, Nature Reviews Genetics, (volume 20) 2019
Shooshtari, et al., “Integrative Genetic and Epigenetic Analysis Uncovers Regulatory Mechanisms of Autoimmune Disease”, AJHG 2017
Manolio, T. A., et al., “Finding the missing heritability of complex diseases”, Nature, 2009
Nievergelt, et al., “International meta-analysis of PTSD genome-wide association studies identifies sex- and ancestry-specific genetic risk loci”, Nat. Com. 2019
Huang, H., et al., “Fine-mapping inflammatory bowel disease loci to single variant resolution”, Nature 547, 173–178 (2017)
Ripke, S., et al., “Biological insights from 108 schizophrenia-associated genetic loci”, Nature, 2014
GWAS Catalog (https://www.ebi.ac.uk/gwas/)
PLINK: Whole genome association analysis toolset (https://zzz.bwh.harvard.edu/plink/)