Lecture 10b - Marker Trait Associations PDF

Lecture 10b – Population Analysis and Marker Trait Associations
BIOTECH4BI3 - Bioinformatics

Where are we going?
[Figure: course roadmap covering DNA sequencing, read quality control, assembly, mapping, genome annotation, expression analysis, polymorphism discovery, genotyping, population analysis, and marker-trait associations]

Learning Objectives
Understand the relationship between genome recombination and genotyping
Discuss approaches we use to measure the relatedness of individuals in a population
Present various machine learning approaches to understand how individuals are assigned membership to sub-populations
Discuss genetic mapping populations and their strengths/weaknesses
Understand the approaches used to identify marker-trait associations and interpret the visuals created from the analysis

Meiosis
We can use our understanding of how genomes are ‘shuffled’ over generations to develop models of how groups of individuals are related.
Meiosis is a specialized form of cell division whose purpose is to produce gametes for reproduction.
It is a multi-stage process that results in the formation of four haploid daughter cells.
A haploid cell is one where only 1 copy of the genome is present. Most animals and plants are diploid (having 2 copies of the genome).
Two haploid cells (one from each parent) are brought together to form a diploid cell with a unique combination of DNA from its parents.
It is during the creation of the haploid daughter cells that the important process of genetic recombination takes place.

Meiosis
[Figure] Source: https://en.wikipedia.org/wiki/Meiosis#/media/

Recombination
Recombination is the process where new combinations of alleles are created in haploid cells.
Interchromosomal recombination: shuffling of whole chromosomes into haploid cells. This is known as independent assortment of alleles.
Intrachromosomal recombination: the exchange of chromosome segments between homologous chromosomes through crossing over.
When recombination is discussed, intrachromosomal recombination (crossing over) is usually what is meant.
Recombination typically occurs about 2-3 times per chromosome per meiosis.
Double recombination does occur, but it is much less frequent.
Some regions of the genome are more likely to experience recombination than others. Typically, recombination occurs more frequently towards the ends of the chromosome.

Recombination
[Figure] Source: https://opengenetics.pressbooks.tru.ca/chapter/recombination/

Recombination
Thomas Hunt Morgan determined that genes are organized in a row on the chromosome.
How often two genes are inherited together is directly related to how close they are to each other on the chromosome.
The closer two genes are to each other, the less likely it is that recombination will occur between them.
He discovered the phenomenon of crossovers.
Nobel Prize 1933
[Figure: single crossover and double crossover. Credits as printed on the slide: Morgan et al; Morgan et al, 1916]

Genotyping
Genotype states at a particular position are called ‘alleles’.
Alleles can be thought of as slightly different versions of the same gene or chromosome location.
When those different alleles are in regions controlling traits, they can result in different phenotypes.
We can use the alleles to follow segments of chromosomes as they are recombined over generations.
The more locations that are genotyped in an experiment, the closer we can come to locating the positions of recombination events.
When a genotype is experimentally or statistically associated with a phenotype, we call it a genetic marker for a trait.
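
Morgan's observation that closer genes recombine less often can be made concrete by counting recombinant gametes between pairs of genotyped markers. The sketch below is only an illustration: it assumes haploid gametes whose alleles are coded by parent of origin ("A" or "B"), with None for a missing call. The function name `recombination_fraction` and the toy genotypes are hypothetical and not taken from the lecture.

```python
def recombination_fraction(marker_1, marker_2):
    """Fraction of gametes whose alleles at two markers come from different parents.

    Each argument is a list with one entry per gamete: "A" or "B" for the
    parent of origin, or None for a missing genotype call (assumed coding).
    """
    if len(marker_1) != len(marker_2):
        raise ValueError("both markers must be scored on the same gametes")

    # Keep only the gametes that were genotyped at both markers.
    scored = [(a, b) for a, b in zip(marker_1, marker_2)
              if a is not None and b is not None]
    if not scored:
        raise ValueError("no gametes were genotyped at both markers")

    # A gamete is recombinant when the parental origin switches between markers.
    recombinants = sum(1 for a, b in scored if a != b)
    return recombinants / len(scored)


# Toy data: ten gametes scored at one anchor marker and two other markers.
anchor = list("AAAAABBBBB")
near = list("AAAAABBBBA")   # differs from the anchor in 1 of 10 gametes
far = list("ABAABBABBA")    # differs from the anchor in 4 of 10 gametes

print(recombination_fraction(anchor, near))  # small value: tightly linked
print(recombination_fraction(anchor, far))   # larger value: loosely linked
```

With real data the number of gametes would be far larger than ten, and the same count is only meaningful when the parental origin of each allele is known.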

Ancestry
An individual is often regarded as a mixture of the genes from its maternal and paternal parents.
By extension, an individual can be considered a mosaic of all of the generations contributing to its pedigree.
If we assume that alleles originated only once in a pedigree, then with enough genotyping resolution we could identify all of the ancestral recombination events that took place to arrive at a particular individual.
Through identification of shared ancestral alleles among a collection of individuals from a population, we can quantify the relatedness of individuals.
This information can be used to construct phylogenetic trees or classify individuals into groups based on shared ancestral DNA regions.

Individuals are Genome Mosaics
This figure illustrates the construction of a structured population from 8 parents.
Colours are used to indicate parental origin.
With enough polymorphisms one can identify the ancestral origins of the regions of the chromosome (different colours).
The individuals at the bottom of the figure contain all of the historic recombination events that went into creating them and are mosaics of their ancestors.

Population Structure
In a population of randomly mating individuals, the frequencies of alleles are expected to be approximately the same across the population.
An imbalance in allele frequency from non-random mating results in population structure.
Non-random mating can occur because of physical separation of members of the population, gene flow from migration, evolutionary pressure, culture, and random chance.
A polymorphism that is prevalent in a population that has a high incidence of a trait may be erroneously associated with having a role in the trait. This must be controlled for in marker-trait associations.
Two common methods used to capture population structure are the program ‘Structure’ by Pritchard and principal component analysis.

Pritchard’s Structure
A clustering program that uses genotype data to infer sub-populations.
It is an old program but still used by many to model groupings in their data.
You have to define the number of sub-populations, and the program will assign individuals to them.

Principal Component Analysis
PCA is an unsupervised machine learning technique to simplify high-dimensional data sets.
It identifies the axes in multidimensional space along which the data vary the most (the principal components).
With genetic data (thousands of SNPs across a population), PCA enables us to identify clusters of genetically similar individuals.
This is useful for visualizing the relationships among individuals, identifying groups sharing high levels of genetic similarity, and correcting for population structure in marker-trait analysis.
Although there are a number of dimensionality reduction algorithms (SVD, t-SNE, LDA), PCA has been shown to control for population structure.

Linkage Disequilibrium
Before performing a PCA, one wants to have a group of markers that are acting independently of each other in the population.
Linkage disequilibrium (LD) is a measure of the non-random association of alleles at different loci among a group of individuals.
An example would be a series of SNP alleles all in the same gene. Because of how physically close they are, it is unlikely that they will behave (sort) independently of each other.
We typically discuss LD in terms of r², which is the square of the correlation coefficient between two markers.
An r² of 1 between markers means that the markers communicate the same information, while an r² of zero means the markers are in perfect linkage equilibrium in the population (statistically independent).
The software PLINK is commonly used to prune marker sets on LD.

Example
[Figure] Source: Lyu, J. et al. BMC Plant Biol 14, 160 (2014)

Scree Plots
A scree plot is a line graph of the eigenvalues from a PCA.
The eigenvalues tell us how much of the observed variance can be explained by each principal component.
Used in population analysis to identify how many PCs to use for modeling the number of sub-populations among individuals.
Rule of thumb is that the number of populations to use is the PC number at which the curve starts to flatten out (the ‘elbow’).
[Figure] Source: https://sanchitamangale12.medium.com/scree-plot-733ed72c8608

Kinship Matrix
In some types of analyses, we need to control for the relatedness of the individuals to each other in the population.
Failing to account for this can lead to false associations between markers and a trait.
A kinship matrix captures the relationship between each pair of individuals as a numerical value.
The values capture the probability that two alleles sampled at random from a pair of individuals are identical by descent (IBD).
IBD means that the allele has been inherited from a common ancestor and is not the result of mutation.
Kinship is calculated as the proportion of shared alleles between individuals.
Accounting for the relatedness among members of the population can help distinguish between true associations and false ones.

Kinship Matrix
This is a heat map of kinship values among 50 mice.
The colours capture the amount of shared alleles between individuals (red is higher).
The dendrograms along the top and side show how the individuals cluster together based on kinship.
Source: https://smcclatchy.github.io/mapping/08-calc-kinship/#:~:text=By%20default%2C%20the%20genotype%20probabilities,the%20proportion%20of%20shared%20alleles.
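
Two of the statistics from the preceding slides, r² between a pair of markers and kinship as the proportion of shared alleles, are simple enough to sketch in a few lines. This is an illustrative sketch only, under stated assumptions: genotypes are coded as the count of one allele (0, 1 or 2) in diploid individuals, the function names `ld_r2` and `shared_allele_kinship` are hypothetical, and the shared-allele proportion here is an identity-by-state measure used as a simple stand-in for IBD. Dedicated tools such as PLINK or R/qtl2 use more careful estimators.

```python
def ld_r2(marker_1, marker_2):
    """Squared Pearson correlation between two markers' genotype codes."""
    n = len(marker_1)
    if n == 0 or n != len(marker_2):
        raise ValueError("markers must be scored on the same, non-empty set of individuals")
    mean_1 = sum(marker_1) / n
    mean_2 = sum(marker_2) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(marker_1, marker_2))
    var_1 = sum((a - mean_1) ** 2 for a in marker_1)
    var_2 = sum((b - mean_2) ** 2 for b in marker_2)
    if var_1 == 0 or var_2 == 0:
        raise ValueError("r2 is undefined for a marker with no variation")
    return (cov * cov) / (var_1 * var_2)


def shared_allele_kinship(genotypes):
    """Pairwise proportion of shared alleles, averaged over markers.

    genotypes: list of individuals, each a list of 0/1/2 codes per marker.
    Two individuals share 2, 1 or 0 alleles at a marker when their codes
    differ by 0, 1 or 2, giving a per-marker score of 1, 0.5 or 0.
    """
    n = len(genotypes)
    n_markers = len(genotypes[0])
    kinship = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = sum(1 - abs(a - b) / 2
                        for a, b in zip(genotypes[i], genotypes[j])) / n_markers
            kinship[i][j] = kinship[j][i] = score
    return kinship


# Toy data: 4 individuals (rows) by 3 markers (columns).
genotypes = [
    [0, 2, 0],
    [0, 2, 2],
    [2, 0, 0],
    [2, 0, 2],
]
m0, m1, m2 = (list(col) for col in zip(*genotypes))

print(ld_r2(m0, m1))  # markers 0 and 1 carry the same information
print(ld_r2(m0, m2))  # markers 0 and 2 are independent in this toy set
for row in shared_allele_kinship(genotypes):
    print(row)
```

In an LD-pruning step, one marker from each high-r² pair would be dropped before running the PCA; the kinship matrix would be passed to the association model as a covariance structure.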

Genetic Mapping
The first genetic map was developed by Alfred Sturtevant (a student of Morgan) in 1911.
It was based on phenotypic associations (not genotypes).
[Figure] Source: https://en.wikipedia.org/wiki/Thomas_Hunt_Morgan#/media/File:Drosophila_Gene_Linkage_Map.svg

Genetic Mapping
Constructing a genetic map, also called linkage mapping, is a technique to identify the location of genes/markers in a genome and the distance between the markers.
The distance isn’t a physical distance in base pairs but a statistical measure of how often there is recombination between adjacent markers.
The units are called centimorgans (cM). A distance of 1 cM between markers means there is a 1% chance of a recombination happening between them per meiosis.
Genetic mapping refers to the process of identifying which genes/markers in a genome contribute to a trait of interest.
You can perform genetic mapping using a genetic map or using a collection of more genetically diverse individuals.
Genetic mapping is the dominant way that we identify relationships between genes and traits.
Once we identify the genes involved in controlling a trait (or disease), scientists can work further to elucidate the biological mechanisms, leading to the creation of new drugs, treatments, or plant varieties.

Types of Mapping Populations
There are three classes of populations for mapping traits:
Bi-parental populations: These are individuals that have been created from the controlled cross of two parents. The parents are selected because they differ in the trait researchers are interested in (like disease resistance).
Genome-wide association panels: These are collections of individuals from the same species with variability in many traits.
Multi-parent populations: These are individuals created through the controlled crossing of a few, well-characterized individuals.
Examples include NAM (Nested Association Mapping) populations and MAGIC (Multi-parent Advanced Generation Inter-Cross) populations.
In modern mapping populations, the members of the population have been extensively genotyped using high-density genotyping arrays or DNA-sequencing based approaches like GBS or skim-seq.

Bi-Parental Populations
Created from two parents that differ in the traits under investigation.
A genetic map is created from the genotyped progeny, and statistical relationships are found between traits and regions of the genome. These regions are called Quantitative Trait Loci (QTL).
The larger the population of individuals, the closer we can come to the underlying gene(s) controlling the trait. Typically QTL span intervals of 5 cM to 10 cM (some millions of bases).
These populations are able to detect minor genetic contributions of genome regions towards complex traits.
They are limited in that you can only genetically map traits that differ between the two parents.
The size of these populations is typically in the hundreds. More individuals -> more recombination in your population -> better mapping resolution.

Association Mapping Populations
These populations are typically easy to create. One collects individuals from the species under investigation and selects for the phenotype(s) under investigation.
Tend to avoid closely related material (siblings).
You rely on the historic recombinations among the members of the population to associate markers with traits.
This means you have MUCH greater resolution for narrowing down regions controlling traits.
These populations are good at identifying regions that explain a large percentage of the variation in a trait.
Benefits are that these populations are easy to ‘construct’ and you can map any trait that varies among the members.
Population sizes can be massive, with some human studies having >100K individuals.
Downsides are that these populations tend to miss regions with a small impact on a trait and polymorphisms with low frequency (
