Bayesian Inference and Phylogenetics
Summary
This document covers Bayesian inference and phylogenetics, including Markov Chain Monte Carlo (MCMC) methods, phylogenetic tree construction, and models of trait evolution. It discusses methods for phylogenetic analysis, covering concepts such as gene tree discordance, coalescent probabilities, and statistical tests of tree topology, along with how to account for evolutionary biases and how the molecular clock can be useful. It also explores techniques for modelling trait evolution on phylogenies.
Full Transcript
FILE 8 BAYESIAN INFERENCE

The evaluation is done using the Q-matrix (see the notebook) for each proposed tree. Our aim is to obtain the tree that best models our dataset. We start from a quick starting tree calculated from the initial data. Then we explore other possible trees using three methods that rearrange the tree in different ways: Nearest Neighbour Interchange (NNI), Subtree Pruning and Regrafting (SPR), and Tree Bisection and Reconnection (TBR). Each new tree is evaluated with its likelihood, which tells us whether it is better than the current one. The final tree is the one that best represents our data.

- What is the difference between maximum likelihood and Bayesian inference?
Maximum likelihood calculates the probability of the observed data given the model; Bayesian inference calculates the opposite. So the aim of Bayesian inference is to obtain the posterior distribution, in other words the probability of a tree and its parameters given the dataset. In the formula there is the term P(Data), which sums over all possible trees and parameters for the given dataset; computing it directly is practically impossible. To get around this problem we use Markov Chain Monte Carlo (MCMC).

- What does the MCMC do?
It simulates steps in the "space" of trees and parameters, exploring states and calculating the tree and parameters at each point to decide whether that point is better than the previous one. The method is made up of two parts:
1) Markov chain: a sequence of states (tree and parameters) where the next step depends only on the current state.
2) Monte Carlo: random draws are used to propose which state to try next.

- Which are the steps of MCMC?
1. Initialization: choose a random tree and parameters as the starting point.
2. Proposal: change the tree and/or parameters to create a new state.
3. Evaluation: calculate R, the probability of the new state divided by that of the current state. If R > 1 the new state is accepted; if R < 1 it may still be accepted, with probability R.
4. Iteration: repeat steps 2 and 3 many times.

Metropolis-Hastings → if R = 1.20, always accept (step "uphill"); if R = 0.17, almost never accept (step "into a ravine"); if R = 0.92, often accept (step "slightly downhill", useful for exploring). See the sketch after this section.

- What happens during the MCMC simulation?
The log-likelihood, tree and parameters change over time. At the beginning all three change a lot, because we are far from the destination. Step by step we get closer and closer to the peak, so they change more slowly. To be sure that the peak we reached is the highest one, we should start from different points: if all the runs reach the same peak, we have the best parameters for that dataset (watch out for local peaks). The "peak" is called the stationary distribution.
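To make the acceptance rule concrete, here is a minimal Python sketch of the Metropolis step described above. It assumes symmetric proposals (so the Hastings correction for the proposal ratio is omitted) and takes precomputed log posterior values; in a real phylogenetics program those would come from the tree, the model and the data.

```python
import math
import random

def metropolis_accept(log_post_new, log_post_current):
    """Decide whether to accept a proposed state, given log posterior
    densities of the proposed and current states (hypothetical values;
    real tools compute them from the tree, model and data)."""
    # R is the posterior ratio P(new)/P(current), handled in log space
    # for numerical stability.
    log_r = log_post_new - log_post_current
    if log_r >= 0:          # R >= 1: always accept (step "uphill")
        return True
    # R < 1: accept with probability R (occasional "downhill" steps
    # let the chain explore instead of getting stuck).
    return random.random() < math.exp(log_r)

# Toy illustration with the ratios from the text:
# R = 1.20 -> always accepted; R = 0.17 -> rarely; R = 0.92 -> often.
for r in (1.20, 0.17, 0.92):
    acc = sum(metropolis_accept(math.log(r), 0.0) for _ in range(10_000))
    print(f"R = {r:.2f}: accepted {acc / 10_000:.0%} of proposals")
```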
- Describe the cold and heated chains (Metropolis-coupled MCMC, MC³)
The cold chain is used to find the peak, as described above. The heated chains are allowed to jump into states that have lower probability; this is useful for escaping local optima and moving to other states in search of the global optimum. A heated chain uses the modified likelihood L^(1/T), where L is the likelihood and T is the temperature. In what sense? If T = 1 the chain is cold and the likelihood is unchanged; if T > 1 the function becomes less extreme, allowing movement between different trees, because good trees and not-so-good trees differ less from each other.

- What is burn-in in MCMC?
It is the initial phase, before the chain reaches the stationary distribution. The parameters and trees sampled during this period are not reliable: they are too influenced by the starting point and don't represent the correct fit for our dataset. Knowing this, we usually discard 10-25% of the total samples. The exact percentage depends on how many steps the chain needed to reach the peak; to find out, we can use convergence diagnostics.

- What is autocorrelation? And why can it be a problem?
It leads to overestimating the number of independent samples. At the end of the inference we want as many independent samples as possible, so that they carry the most information about the trees visited during the run. For instance, 10,000 samples may look like a good result, but in most cases only about 500 of them are independent. The other samples are too similar to each other, because each state is calculated from the previous one; this makes the results less informative.

- What is ESS (effective sample size)?
It tells you how many of your samples are effectively independent. If the ESS is high your estimates are reliable; if it is low, there are too many correlated samples. Generally the ESS should be > 200 for each parameter. (A rough sketch of how ESS can be estimated from the autocorrelation follows this section.)

- How can I improve my ESS?
Better proposal strategies: while the MCMC explores parameter space we can tune the proposals to avoid steps that change the parameters too little or too much. Good proposals = better movement = less autocorrelation = higher ESS.
Increase MCMC run length: simply run a longer chain to get more chances for independent samples.

- Should all parameters have a high ESS?
No, just the crucial parameters, such as the tree likelihood, substitution rates and topology-related statistics.
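A rough sketch of how an ESS can be estimated from a chain's autocorrelation; tools such as Tracer or coda use refinements of this same idea. The AR(1) toy chain below stands in for correlated MCMC samples, not real phylogenetic output.

```python
import numpy as np

def effective_sample_size(samples):
    """Rough ESS estimate: N / (1 + 2 * sum of positive-lag
    autocorrelations), truncated at the first non-positive value."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
    tau = 1.0
    for lag in range(1, n):
        if acf[lag] <= 0:      # truncate once correlation dies out
            break
        tau += 2.0 * acf[lag]
    return n / tau

# Toy chain: an AR(1) process, where each sample depends on the
# previous one, just like successive MCMC states.
rng = np.random.default_rng(1)
chain = [0.0]
for _ in range(9_999):
    chain.append(0.95 * chain[-1] + rng.normal())
print(f"{len(chain)} samples, ESS ~ {effective_sample_size(chain):.0f}")
```

With strong correlation between successive states, 10,000 samples collapse to an ESS of only a few hundred, exactly the situation described above.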
- How should the trace plots look?
These graphs show the sampled values over the iterations of the MCMC chain. In the first one the explored range is wide, which means we are exploring the posterior well. The second one is stuck at one value most of the time: it is not exploring the possible values around the posterior. In the histogram view, a good run gives a smooth shape, with every plausible value sampled, providing a representative sample. A bad run gives a histogram with gaps, irregularities or multiple peaks, suggesting that the MCMC chain hasn't sampled the posterior distribution well.

- How to summarize the results of a Bayesian phylogenetic analysis?
→ Majority-rule consensus tree: includes only the clades that appear in the majority of the sampled trees, i.e. in at least 50% of them. It is useful because it identifies the well-supported groups across the posterior samples. Note that this tree can show a topology that was never specifically sampled in the analysis.
→ Maximum clade credibility (MCC) tree: the sampled tree that maximizes the product of posterior probabilities across its clades. It provides the best tree in terms of clade support, assuming that higher posterior probability means higher credibility. Because a single clade with low support drags down the whole product, this tree doesn't always represent the best possible tree.
→ Maximum sum of clade credibilities (MSCC) tree: similar to the previous one, but the clade supports are summed rather than multiplied. It tends to favour trees where all clades contribute to the total, so it is more balanced than the MCC tree, but it still doesn't guarantee the best tree.
→ Median clade credibility tree: tries to represent the "average" tree in terms of branch lengths, indicating the central tendency; in other words, it minimizes the implied genetic change across the taxa. It is useful if you want to study the consistency of branch lengths rather than the exact topology.

FILE 10 SUPPORT METRICS

To be confident in our findings we have to take into account more metrics than the likelihood alone, such as:
1) nonparametric bootstrap (BPs)
2) parametric bootstrap (BPs)
3) Transfer Bootstrap Expectation (TBE)
4) parametric & nonparametric jackknife
5) SH-like approximate Likelihood Ratio Test (SH-aLRT)
6) gene & site Concordance Factors (gCFs & sCFs)
7) Posterior Probabilities (PPs)
8) Quartet Support (QS)

NON-PARAMETRIC BOOTSTRAP (BPs)
- Start with an alignment of length n.
- Create a random pseudo-alignment by choosing columns at random from the original alignment (a column can be chosen more than once). The new alignment has the same length n.
- Infer a tree from it using the same method (likelihood or neighbour-joining).
- Repeat this procedure thousands of times.
- Count how often each clade shows up; this number is the bootstrap proportion (BP). For each branch in the original tree (from the real alignment), check the percentage of bootstrap trees that also contain that same branch.
This method checks how stable or consistent a branch is: if we resample the data randomly, do we still recover the same clade? (See the sketch below, after the parametric version.)

PARAMETRIC BOOTSTRAP (BPs)
Same process as above, with one difference: instead of estimating a new model each time we infer a tree (re-fitting the GTR, for example), we reuse the model estimated from the real alignment. Keep in mind that this model has to fit as well as possible; otherwise all the inferred trees will be wrong.
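A minimal sketch of the column-resampling step, assuming the alignment is stored as a simple taxon → sequence dictionary (the sequences are made up; tree inference on each replicate is left to your usual software):

```python
import random

def bootstrap_alignment(alignment):
    """Build one pseudo-alignment by sampling columns of the original
    alignment with replacement (same length n; a column can be drawn
    more than once). alignment: dict mapping taxon -> sequence string."""
    n = len(next(iter(alignment.values())))
    cols = [random.randrange(n) for _ in range(n)]
    return {taxon: "".join(seq[i] for i in cols)
            for taxon, seq in alignment.items()}

# Toy alignment (hypothetical sequences). In a real analysis you would
# infer a tree from each replicate with the same method used for the
# original data, then count how often each clade reappears (its BP).
aln = {"taxonA": "ACGTACGT", "taxonB": "ACGTACGA",
       "taxonC": "ACGAACGA", "taxonD": "TCGAACGA"}
for rep in range(3):
    print(bootstrap_alignment(aln))
```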
TRANSFER BOOTSTRAP EXPECTATION (TBE)
Definition → it measures how much of a branch's structure is preserved in the bootstrap trees, even when the exact branch doesn't appear.
- Take the reference branch R, which splits all the taxa into two groups (A and B).
- Look at all the bipartitions (branches) produced in the bootstrap trees built from the resampled alignments (as in the first technique) and compare them to the reference branch R.
- Calculate the transfer distance: the number of taxa you would have to move to turn a bootstrap bipartition into R.
- For each bootstrap tree, pick the bipartition closest to R (smallest transfer distance).
- Convert that transfer distance into a value from 0 to 1 (the transfer index), do this for every bootstrap tree, and average the indices.
The final average is the TBE value supporting branch R.

JACKKNIFE
This method differs from the others in how the alignment is resampled. Instead of resampling columns with replacement (where a site can appear more than once), we delete a chunk of the original alignment, so each position appears at most once. From each jackknife alignment we build a tree, repeat the procedure many times, and finally check how often the branch R appears in the resulting trees. The method exists in parametric and non-parametric versions.

APPROXIMATE LIKELIHOOD RATIO TEST (aLRT)
This method compares the likelihood of the full tree containing branch R (L1) with the likelihood of the tree without branch R (L0), computing LR = 2 × (L1 − L0). For a huge tree this is computationally expensive, so in practice the original tree is compared with the next-best local rearrangement around R: instead of L0 we use L2, the log-likelihood of the second-best NNI (nearest neighbour interchange) around R. The drawback is that only local modifications are considered, so better topologies far away in tree space are missed.

POSTERIOR PROBABILITIES (PPs)
After the burn-in part of the MCMC run is discarded, the remaining samples are the reliable ones. This method counts how often the clade R appears among those post-burn-in samples: under your model assumptions, that frequency is the probability that the clade is real. Never compare the values of BPs or aLRT directly with PPs.

CONCORDANCE FACTORS
"How much of my data actually agrees with this tree?"
- Gene concordance factor (gCF): split your dataset into genes for each taxon you are studying, build a tree for each gene, and compare them with a specific branch of your original tree to get a percentage.
- Site concordance factor (sCF): compare only the informative sites and get a percentage.
If the CF value is > 70 there is strong support; 40-70 indicates possible conflict; below 40, strong conflict.
A topology with a high expected likelihood weight (ELW, e.g. > 0.95) has strong support, while lower ELW values indicate less support, but not necessarily rejection.

FILE 13 DISCORDANCE, ILS AND THE COALESCENT

- Species tree: shows how the species have evolved over time; it tells us how the species diverged from a common ancestor (cladogenesis).
- Gene tree: shows how a certain gene has evolved over time. It is not necessarily equal to the species tree; this phenomenon is called gene tree discordance. Why does it happen?
➔ Incomplete lineage sorting (ILS): happens when a certain gene variant persists through a species split, or when species form and their descendants don't inherit the same gene lineages. This can happen when the population size is large or when there is a bottleneck.
➔ Horizontal gene transfer (HGT): a gene or a piece of genome is transferred from one organism to an unrelated organism without reproduction or speciation.
➔ Hybrid introgression (HI): two related organisms hybridize, generating an organism that carries genes from both the first population (A) and the second (B). When this hybrid mates back into population A, genes from population B enter population A as well.

- How can we compare two phylogenetic trees with the same set of taxa?
Robinson-Foulds (RF) distance: it quantifies the topological difference between two trees by focusing on their bipartitions. The method counts how many bipartitions are in the first tree but not in the second, and vice versa; the number of differing bipartitions is the RF distance. It can be normalized to a range from 0 to 1 (0 means the trees are identical). The limitation is that all bipartitions are considered equal, regardless of their biological significance. (See the sketch below.)
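A minimal sketch of the RF computation, assuming each tree has already been reduced to its set of internal-branch bipartitions and that both trees record each split by the same convention (real implementations canonicalize the splits first, e.g. always keeping the side that excludes a fixed reference taxon):

```python
def rf_distance(bipartitions1, bipartitions2, normalize=False):
    """Robinson-Foulds distance from two sets of bipartitions.
    Each bipartition is given as a frozenset holding one side of the
    split (the other side is implied by the full taxon set)."""
    diff = len(bipartitions1 ^ bipartitions2)   # symmetric difference
    if normalize:
        # 0 = identical trees, 1 = no shared bipartitions.
        return diff / (len(bipartitions1) + len(bipartitions2))
    return diff

# Toy example with 5 taxa A-E (internal branches only). Tree 1 groups
# (A,B) and (A,B,C); tree 2 groups (A,B) and (C,D).
t1 = {frozenset("AB"), frozenset("ABC")}
t2 = {frozenset("AB"), frozenset("CD")}
print(rf_distance(t1, t2))                  # 2 bipartitions differ
print(rf_distance(t1, t2, normalize=True))  # 0.5
```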
- How can we deal with discordance in species tree inference?
- Concatenation-based species tree inference: combines data from multiple genes into one large matrix (species × genes). It assumes all genes share the same evolutionary history (one tree for all genes).
- Coalescent-based species tree inference: uses individual gene trees to infer a species tree, allowing for discordance due to processes like incomplete lineage sorting (ILS). Gene trees are modelled independently and then combined into a species tree. In principle, this method traces the evolutionary history of gene alleles backwards in time, focusing on how mutations trace back to a common ancestor. A coalescence event is a point in time where genetic variants from different individuals come together in a common ancestor; these events give us insight into demographic history.

- How can we calculate the coalescent probability?
1) Coalescence in the previous generation: in each generation, each gene copy is chosen from a parent in the previous generation. If the effective population size is N, the total number of gene copies available is 2N (diploid), so the probability that two copies coalesce in the previous generation is 1/(2N).
2) Chance of two gene copies sharing a common ancestor t generations ago: multiply the probability of no coalescence, raised to the power t − 1 (all the generations in which coalescence does not happen), by the probability of coalescence:
P(t) = (1 − 1/(2Nₑ))^(t−1) × 1/(2Nₑ)
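The formula above as a small Python function; the Nₑ and t values are made up for illustration. Because the per-generation coalescence probability is constant, the waiting time is geometric with mean 2Nₑ generations.

```python
def coalescence_probability(t, ne):
    """Probability that two gene copies find their common ancestor
    exactly t generations ago in a diploid population of effective
    size Ne: no coalescence for t-1 generations, then coalescence.
    P(t) = (1 - 1/(2*Ne))**(t - 1) * 1/(2*Ne)"""
    p = 1.0 / (2 * ne)
    return (1 - p) ** (t - 1) * p

# With Ne = 1000, coalescence in any single generation has probability
# 1/2000 = 0.0005, and the mean waiting time should be 2*Ne = 2000.
ne = 1_000
print(coalescence_probability(1, ne))      # 0.0005
mean_t = sum(t * coalescence_probability(t, ne) for t in range(1, 50_000))
print(f"mean waiting time ~ {mean_t:.0f} generations (expect {2 * ne})")
```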
- Which tools can we use to infer species trees from gene trees?
Unrooted trees (no fixed starting point):
ASTRAL: the "gold standard" for multi-locus species tree inference, especially under incomplete lineage sorting (ILS).
ASTER: includes ASTRAL, plus more advanced tools.
BUCKy: detects gene tree conflicts by estimating concordance factors.
ASTRID: a fast, distance-based species tree method.
Rooted trees (trees with a fixed root):
MP-EST: uses rooted gene trees with a pseudo-likelihood framework; sensitive to rooting errors.
Site-based methods (work on the actual sequence data):
SVDQuartets: works directly with site patterns (sequence alignments), no need for gene trees.
SNAPP: a Bayesian, coalescent-based method for biallelic SNP data; accurate but slower.
CASTER: a newer, coalescent-aware method for large datasets using site patterns.
Bayesian co-estimation (estimate gene and species trees together):
StarBEAST2: full Bayesian co-estimation of gene and species trees.

- How can we be confident in the branches of a tree?
➔ Local Posterior Probability (LPP): how likely a branch is to be correct given the data; a confidence level from 0 (low) to 1 (high).
➔ Multi-locus Bootstrap (MLBS): a percentage that tells you how stable a branch is across different bootstrapped gene trees.
➔ Coalescent units: a way to scale branch lengths so that coalescent events can be compared between populations of different sizes. One coalescent unit = 2Nₑ generations (where Nₑ is the effective population size).

FILE 14 BIASES IN PHYLOGENETICS

What does stochastic mean? Something that is random.
Stochastic bias → your phylogenetic tree gets messed up by random noise, not by a bad method. But how?
- Not enough sequence data: with few genes or short sequences, random patterns can look meaningful.
- Weak signal: the DNA sequences don't contain enough useful differences.
- Crazy mutation patterns: if some parts of your sequence have very different evolutionary rates, your result will be messed up.
How can we address this problem?
- Use more data (100-500 genes can be good, but thousands are best).
- Pick informative sequences (wisely).
- Trim the useless parts of the alignment.
- Use bootstrap values or other branch support metrics to gauge confidence.
- Add more species.

Systematic bias → a consistent error that pushes results in the wrong direction due to flawed assumptions or methods: not just random noise, but a repeatable skew. Generally, when we build a phylogenetic tree we assume that evolution follows a few rules, abbreviated SRH:
1) Stationary: the base or amino acid frequencies don't change over time.
2) Reversible: the process of change looks the same forward and backward in time.
3) Homogeneous: the same evolutionary rules apply to every branch of the tree.
Reality is quite different, so these assumptions systematically bias our studies. Two examples:
Long branch attraction (LBA): the model doesn't account for some lineages evolving super fast, so they can end up looking like close relatives when they are not. This happens more often with long branches, because more mutations accumulate along the sequence and the same substitutions can occur in parallel.
Compositional heterogeneity: species with skewed nucleotide or amino acid frequencies can be grouped together even though they are not related. This can be caused by uneven rates of evolution, poor taxon sampling or deep divergence times.
How to deal with LBA?
1) Test before you build
2) Subsample smartly
3) Use better models
4) Partition your data and use different models for each part
5) Use mixture models
How to deal with compositional heterogeneity?
1) Add intermediate taxa to break long branches
2) Use better-fitting models
3) Recode the amino acids or nucleotides into groups to reduce the influence of compositional bias (for example, A and G as purines)

FILE 15 DIVERGENCE TIME ANALYSES

As we have already seen, the molecular clock assumption is wrong in most cases. So let's look at other assumptions we can make:
- Local molecular clock: each monophyletic group (a group descending from the same common ancestor) has its own evolutionary rate.
- Relaxed molecular clock: allows rate variation among lineages (or taxa) and comes in two flavours:
1) Uncorrelated clocks: each branch has its own independent evolutionary rate; the common statistical distributions for the rates are the lognormal and the exponential (see the sketch after this list).
2) Correlated clocks: each branch's rate is related to those of its neighbours.
- Autocorrelated relaxed clocks: rates change gradually along the tree. This model simulates "clock drift", where the evolutionary rate changes over time.
[Figure: lognormal distribution]
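A minimal sketch of the uncorrelated lognormal clock idea from point 1: each branch independently draws its own rate from a lognormal distribution, and branch lengths in expected substitutions per site are rate × time. All numbers here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

mean_rate = 1e-3            # substitutions/site/Myr (hypothetical)
sigma = 0.5                 # spread of rates across branches
branch_times = np.array([10.0, 4.0, 6.0, 2.0, 8.0])   # Myr, made up

# Parameterize the lognormal so that its mean equals mean_rate.
mu = np.log(mean_rate) - 0.5 * sigma**2
branch_rates = rng.lognormal(mean=mu, sigma=sigma, size=len(branch_times))

# Each branch gets its own independent rate: this is what makes the
# clock "uncorrelated" (a correlated clock would tie a branch's rate
# to its neighbours' instead).
branch_lengths = branch_rates * branch_times
for t, r, b in zip(branch_times, branch_rates, branch_lengths):
    print(f"time {t:4.1f} Myr  rate {r:.2e}  length {b:.4f} subs/site")
```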
The question is: how can we estimate divergence times on a phylogenetic tree?
Least squares dating (LSD): tries to minimize the difference between observed branch lengths and expected divergence times; in simple terms, it looks for the set of divergence times that makes the branch lengths make sense. It needs a known tree, it is fast, and it is not probabilistic.
Penalized maximum likelihood (PML): balances fit (how well the times match the data) with smoothness (not letting rates change too wildly between branches). It allows rate variation without jumps, needs a known tree, is efficient, and is a semi-parametric method.
Bayesian inference: estimates everything at once: tree topology, branch lengths, substitution rates and divergence times. Fully probabilistic.
Tip dating: you assign dates to the tips of the tree (like fossils or ancient DNA), and the tree is built using this time information directly. Used in paleogenomics; more sensitive.
Node dating: you assign dates (usually minimum ages) to internal nodes based on fossil evidence. Requires correct fossil placement.

- How can we use fossil information?
You have a fossil of some ancient species and want to use it to estimate when two lineages split apart (diverged) in your phylogenetic tree. But here's the thing: the fossil tells you when that species was alive, not the exact moment of divergence. The divergence must have happened before the fossil appeared, because the fossil is a descendant, not the ancestor itself. So, to calibrate a node you give the fossil's age as a minimum age (a lower bound: the split happened before that age). Sometimes, with additional information such as the absence of any older fossil, you can also set a maximum age (upper bound).

- Which prior distribution can I assign?
Simply a curve that reflects how likely you think the different divergence dates are, between that minimum and maximum.

Crown group: the group that includes the last common ancestor of all living members of a clade and all of its descendants.
Stem group: all the extinct species that are closer to the crown group than to any other group, but are not part of the crown group because they don't have all of its defining traits.

FILE 17 MODELLING TRAIT EVOLUTION ON PHYLOGENIES: DISCRETE TRAITS

What are phylogenetic comparative methods? A toolkit of mathematical and statistical techniques that use phylogenetic trees plus trait data to study how evolution has shaped life across time.
What can you do with phylogenetic comparative methods?
Model fitting
Ancestral state reconstruction
Diversification analyses
Biogeography
These methods let us test evolutionary theories quantitatively: how, why and when evolution shaped traits, species and regions.
Two character types matter when we compare species on a phylogenetic tree:
➔ Discrete characters: binary or categorical traits, such as present/absent.
➔ Continuous characters: traits measured on a numerical scale (file 18).

How can we model discrete characters? With Mk models, where M = Markov model and k = number of possible states. The core of the model is the Q-matrix, which gives the rate of change between states. Depending on how flexible we want the model to be, there are:
- ER (equal rates): every transition has the same rate
- SYM (symmetrical): different rates, but the rate from state i to j equals the rate from j to i
- ARD (all rates different): each transition has its own rate
(A sketch of the three Q-matrix flavours is given below.)
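A minimal sketch of the three Q-matrix flavours, with made-up rate values; the only real constraint is that each row of a rate matrix must sum to zero.

```python
import numpy as np

def mk_q_matrix(rates):
    """Build a Q matrix for an Mk model from a k x k array of
    off-diagonal transition rates. Each diagonal entry is set so that
    the row sums to zero, as required for a rate matrix."""
    q = np.array(rates, dtype=float)
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    return q

# Three states (0, 1, 2), illustrative rate values.
ER = mk_q_matrix([[0, .1, .1],
                  [.1, 0, .1],
                  [.1, .1, 0]])          # one shared rate everywhere
SYM = mk_q_matrix([[0, .1, .2],
                   [.1, 0, .3],
                   [.2, .3, 0]])         # q[i][j] == q[j][i]
ARD = mk_q_matrix([[0, .1, .2],
                   [.4, 0, .3],
                   [.5, .6, 0]])         # every transition has its own rate
print(ARD)
# Transition probabilities over a branch of length t would then be
# P(t) = expm(Q * t) (matrix exponential, e.g. scipy.linalg.expm).
```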
How can we model continuous characters?
- Brownian motion (BM): assumes every change over time is totally random.
- Ornstein-Uhlenbeck (OU): still random, but not totally: there is an optimum.

What traits did ancestral species have?
1) Joint reconstruction: tries to find the best combination of ancestral states for all nodes at once; one full story, the best explanation overall.
2) Marginal reconstruction: zooms in on one node at a time, asking: "given everything we know, what's the most likely trait here?"; the best local guess, accepting that we're unsure about the rest of the tree.
3) Stochastic character mapping (SCM): samples character histories in direct proportion to their posterior probability, under a model. First you sample a Q matrix, then you sample ancestral states at each node given that Q, and finally you simulate character evolution along the branches.
4) Parsimony reconstruction: assumes evolution is slow and changes are rare.

Do certain traits make lineages more likely to diversify or go extinct?
State-dependent speciation and extinction (SSE) models: they link trait evolution to the birth and death of species.

FILE 18 MODELLING TRAIT EVOLUTION ON PHYLOGENIES: CONTINUOUS TRAITS

Brownian motion (BM): we can describe it as "every generation, your trait might get slightly bigger, slightly smaller, whatever. Flip a coin, roll some dice. Evolution doesn't care." Each change is drawn from a normal distribution (bell curve).
Key features:
- Randomness: it's all stochastic.
- Variance increases with time: the longer two species have been evolving separately, the more different their traits can become.
- Close relatives = more similar: since they split more recently, they've had less time to randomly drift apart.
Why use BM? It is used as a null model: if your data fit BM, your traits are evolving neutrally, without selection.

Ornstein-Uhlenbeck (OU): imagine your trait (like body size or brain volume) evolving randomly as in Brownian motion... but now there is a magnetic pull dragging it toward a sweet spot, an optimal value called θ (theta).
Key features:
- A target trait value: the optimum θ.
- Three parameters → θ (the optimum), α (the strength of the pull, i.e. the strength of selection), σ² (the evolutionary rate, the size of the steps).
When to use OU? When your data look constrained and follow a certain pattern (selection), or when you want to check whether some taxa are converging on the same trait value. (A minimal simulation of both models is sketched at the end of this file.)

What does "similar traits" mean? It means there is phylogenetic signal. Why should I care? Because if you ignore phylogenetic signal (which can be strong or weak), you might treat species as independent data points when they're not. So the model has to be chosen appropriately.

Pagel's lambda: "how much do the internal branches of the phylogenetic tree matter for explaining trait similarity?" It tells you how much of the variation in your trait is explained by phylogeny, and whether you should model things with or without taking that into account.
If λ = 1 → species similarity is exactly what we would expect under Brownian motion (full signal).
If λ = 0 → the species evolved their traits independently of their shared history (no signal).

Blomberg's K: measures how strong the phylogenetic signal is in your data. It compares what you see (observed trait variation across species) with what you'd expect if traits were evolving under Brownian motion (BM).
K = 1 → traits behave as BM predicts.
K > 1 → traits are more conserved than expected.
K < 1 → traits are less conserved than expected.
This is important because our traits are not independent; they are linked by shared ancestry. So, what can we do?
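As a recap of the two models in this file, here is a minimal simulation sketch of a single trait evolving under BM and under OU along one lineage (a simple Euler discretization; all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_bm(x0, sigma2, steps, dt=1.0):
    """Brownian motion: each step is pure noise; variance grows with time."""
    x = [x0]
    for _ in range(steps):
        x.append(x[-1] + rng.normal(0.0, np.sqrt(sigma2 * dt)))
    return x

def simulate_ou(x0, theta, alpha, sigma2, steps, dt=1.0):
    """Ornstein-Uhlenbeck: the same noise as BM, plus a pull of
    strength alpha toward the optimum theta."""
    x = [x0]
    for _ in range(steps):
        pull = alpha * (theta - x[-1]) * dt
        x.append(x[-1] + pull + rng.normal(0.0, np.sqrt(sigma2 * dt)))
    return x

# Both traits start at 0; the OU trait is dragged toward theta = 5,
# while the BM trait just wanders.
bm = simulate_bm(0.0, sigma2=0.1, steps=1000)
ou = simulate_ou(0.0, theta=5.0, alpha=0.05, sigma2=0.1, steps=1000)
print(f"BM endpoint: {bm[-1]:+.2f}   OU endpoint: {ou[-1]:+.2f} (optimum 5)")
```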