Lecture 10a - Polymorphism Discovery PDF

Summary

This document discusses polymorphism discovery and genotyping, focusing on various methods and software tools. It touches upon different types of polymorphisms, such as SNPs and indels, and the challenges associated with genotyping. The document also presents different approaches for sequencing-based genotyping.

Full Transcript

Lecture 10a – Polymorphism Discovery and Genotyping
BIO4BI3 - Bioinformatics

Where are we going?

[Course roadmap figure: DNA Sequencing; Sequencing Quality Control; DNA Assembly; DNA Read Mapping; Genome Annotation; Expression Analysis; Marker-Trait Associations; Population Analysis; Polymorphism Discovery; Genotyping]

Learning Objectives

Know what polymorphisms are and the 4 broad types
Describe what genotyping is and the methods used for genotyping
Know some of the software tools used for genotyping
Understand the challenges associated with genotyping
Know what genome imputation is and three approaches

Types of Polymorphisms

A polymorphism is a difference in the genome of an individual relative to a selected reference genome which has been whole genome sequenced or genotyped.
We've already discussed that genomes evolve through the processes of mutation, duplication, and deletion.
This results in four broad categories of polymorphisms:
SNPs – Changes of individual bases in an individual relative to a reference. These are the most common type of polymorphism.
Indels – Insertions or deletions of 1 or more bases relative to a reference. Indels are typically defined as the addition/subtraction of fewer than 50 bp.
Copy Number Variants – Differences in the number of copies of a particular element of the genome. We often think of CNVs as they relate to genes.
Structural variants – Typically classified as changes that are greater than 1 kb. They can be insertions, deletions, inversions, translocations, or rearrangements.

What is Genotyping

Genotyping is the act of determining the alleles an individual has at particular positions relative to a reference.
The positions we are genotyping are usually predetermined, and genotyping is usually (but not always) separate from polymorphism discovery.
There are a number of methods used to genotype an individual:
Genome sequencing based methods – These methods can have a low cost per data point.
Microarray-based genotyping – Fixed panels of 10s to 100,000s of SNPs that are developed following the identification of polymorphisms in a species or population. Rapid and reliable.
PCR-derived methods such as TaqMan, KASP, and High Resolution DNA Melting (HRM) – These aren't as high throughput as the others and generally have a lower cost per sample.

DNA Sequencing Based Genotyping

As sequencing prices continue to decline, we are seeing more development in the area of sequencing-based genotyping.
There are a number of approaches:
Whole genome – The complete genome of an individual is sequenced, but at low rates of genome coverage (< 1X). Relies heavily on imputing missing data. This is called 'low-pass' sequencing or 'Skim-Seq'.
Targeted amplification – Regions of the genome are amplified with PCR primers, the products are sequenced, and genotypes relative to a reference are found.
Reduced representation – Sequence a consistent subset of the genome. Exome sequencing and Genotyping-By-Sequencing (GBS) are examples.

Genotyping-by-Sequencing

[Figure] https://www.sweetpotatoknowledge.org/wp-content/uploads/2016/07/SASHA_GT4SHP_Workshop_Bode.

Polymorphism Discovery vs Genotyping

Genotyping approaches that involve known polymorphic positions have the discovery and genotyping separated. This would be microarrays, PCR-based genotyping, and targeted amplification.
Sequencing-based approaches are often used to perform polymorphism discovery and genotyping at the same time (GBS, exome sequencing).
A benefit to genotyping for known positions is that those techniques offer consistent, high-quality genotype calls for an individual. They often involve an investment in developing the technology (creating the microarrays etc.).
A benefit to sequencing-based approaches is that the per-individual cost can be quite low compared to other methods. The disadvantage is that individuals may not be consistently genotyped at the same positions.

Tools for Discovery and Genotyping

Most modern polymorphism discovery efforts are centred around whole genome sequencing. As sequencing technologies continue to improve, there is a shift towards long-read technologies for discovery.
Sequencing-based genotyping (SBG) has its foundation in mapping sequencing reads to a reference genome and identifying the polymorphisms that exist.
There are many pieces of software used to call SNPs in SBG, but three are primarily used:
Samtools – This software is good for determining the alleles at particular positions but isn't strong for polymorphism discovery.
GATK – Likely the most popular software for polymorphism discovery and genotyping. Based on quality filtering of mapped reads and comparative alignment/processing of many individuals.
DeepVariant – A deep neural network approach from Google. It looks at 'pictures' of aligned reads for discovery and genotyping. Claims to be the most accurate genotyping software.
This is an area of continued development and improvement.

GATK Pipeline

[Figure] gatk.broadinstitute.org

DeepVariant

[Figure] https://google.github.io/deepvariant/posts/2022-06-09-adding-custom-channels/

Discovery Challenges

Sequencing errors – Short reads continue to dominate read mapping, and the errors intrinsic to those technologies materialize in polymorphism discovery/genotyping. Non-random errors result in mistakes in discovery and genotyping.
Alignment errors – When mapping reads there can be mistakes in where a read is aligned to a genome, or mistakes in aligning regions with indels. GATK involves a post-read-mapping step specifically to perform local realignments prior to SNP calling.
Genomic complexity – Polymorphism discovery/genotyping in polyploid genomes can be more complicated. One needs significantly more genome coverage to have a level of statistical certainty that enough sequencing has been done to sample all of the alleles at a position. Genome duplications in individuals can also cause issues, as one can end up genotyping 2 or more loci simultaneously while believing that only a single position is being genotyped.

Why We Genotype

The reasons we genotype individuals can be broken into two categories:
We want to understand the genetic relationship among individuals
We want to understand the relationship between genotypes and an individual's phenotype
Genetic relationships – This could be understanding the genetic diversity in a population, genealogical research, or intellectual property protection.
Genotype-Phenotype – This is used for predicting phenotypes based on the polymorphisms an individual has (disease classification, marker assisted backcrossing, genomic selection). We can also genotype with the goal of identifying a relationship between a phenotype and a genotype (genetic mapping and genome-wide association studies).

Phased Genotyping

Recall that most organisms get one copy of genetic material from their maternal parent and one copy from their paternal parent.
We may want to know which polymorphisms on a pair of chromosomes originate from the same molecule. These are called 'phased' genotypes.
There is a movement towards developing phased genome assemblies for heterozygous individuals.
Phasing allows one to understand the parental origin of polymorphisms and which ones move together during meiosis (in phase) or move in opposition (in repulsion).
Phased genomes also enable more accurate genome imputation.

Imputing Genotypes

All genotyping technologies result in some level of missing data for a sample.
Genome imputation is the process of filling in the missing data with an allele.
Most imputing algorithms rely on a reference panel.
A reference panel is a collection of individuals representing (hopefully) the diversity in the pedigree of the sample being genotyped. The individuals of the reference panel have high quality genotypes assigned.
The choice of which allele to use when filling in missing data can be derived from a number of algorithms:
Statistical relationship with the other allele calls in the individual
Most common allele called at that position among the population
Parental haplotype maximization
Common imputing software includes plink, BEAGLE, MACH, and GLIMPSE.

Imputing Genotypes

[Figure] Marchini and Howie, 2010
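
The size thresholds given under Types of Polymorphisms can be sketched as a small classifier. This is only an illustration: the function name classify_variant and the 'unclassified' label for events the lecture does not name (for example, length changes between 50 bp and 1 kb) are assumptions made here. Copy number variants and balanced events such as inversions are left out because they cannot be recognised from the lengths of a single pair of alleles.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a variant from its reference and alternate alleles.

    Thresholds follow the lecture: indels change length by fewer
    than 50 bp, structural variants by more than 1 kb. Anything
    else gets a hypothetical 'unclassified' label.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"  # a single base differs from the reference
    size = abs(len(alt) - len(ref))  # bases gained or lost
    if 0 < size < 50:
        return "indel"
    if size > 1000:
        return "structural variant"
    return "unclassified"


if __name__ == "__main__":
    print(classify_variant("A", "G"))         # single-base change
    print(classify_variant("A", "ATT"))       # 2 bp insertion
    print(classify_variant("A" * 1500, "A"))  # 1499 bp deletion
```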
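
The coverage point made under Genomic complexity can be made concrete with a small calculation. The sketch below assumes that every read covering a position comes from one of the chromosome copies with equal probability, and that sequencing errors are ignored. Both are simplifications introduced here, and the function names are hypothetical. Under those assumptions, the chance that a given number of reads samples every copy follows from inclusion-exclusion.

```python
from math import comb


def prob_all_copies_sampled(ploidy: int, reads: int) -> float:
    """Probability that `reads` reads hit every one of `ploidy` copies.

    Assumes each read is drawn uniformly from the copies
    (an illustrative simplification, not a model from the lecture).
    """
    return sum(
        (-1) ** j * comb(ploidy, j) * (1 - j / ploidy) ** reads
        for j in range(ploidy + 1)
    )


def min_reads_for_confidence(ploidy: int, confidence: float = 0.99) -> int:
    """Smallest read depth that samples all copies with the given confidence."""
    reads = ploidy  # fewer reads than copies can never cover them all
    while prob_all_copies_sampled(ploidy, reads) < confidence:
        reads += 1
    return reads


if __name__ == "__main__":
    for ploidy in (2, 4, 6):
        print(ploidy, min_reads_for_confidence(ploidy))
```

Under this simplified model a diploid position needs 8 reads for 99% confidence and a tetraploid position needs 21, which illustrates why polyploid genomes call for much deeper coverage.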

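The simplest of the imputation approaches listed under Imputing Genotypes, filling a gap with the most common allele called at that position among the population, can be sketched in a few lines. The data layout (one sequence of allele calls per individual, with '.' marking a missing call) and the function name are assumptions for illustration. This is not how plink, BEAGLE, MACH, or GLIMPSE are implemented.

```python
from collections import Counter

MISSING = "."  # hypothetical marker for a missing allele call


def impute_most_common(genotypes):
    """Fill missing calls with the most common allele at each position.

    `genotypes` holds one equal-length sequence of allele calls per
    individual. Positions with no calls at all are left missing.
    """
    n_sites = len(genotypes[0])
    fill = []
    for site in range(n_sites):
        calls = [g[site] for g in genotypes if g[site] != MISSING]
        fill.append(Counter(calls).most_common(1)[0][0] if calls else MISSING)
    return [
        [fill[i] if allele == MISSING else allele for i, allele in enumerate(g)]
        for g in genotypes
    ]


if __name__ == "__main__":
    panel = ["AC.", "A.T", ".CT", "GCT"]
    for row in impute_most_common(panel):
        print("".join(row))
```

Because every gap at a position receives the same allele, this approach ignores the individual's other calls. The statistical and haplotype-based approaches in the list above are meant to address that limitation.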