LEC 10 - Phylogenetics II PDF
UWI Cave Hill
Dr. A. T Alleyne
Summary
This document from a university lecture discusses phylogenetics, covering topics such as tree building, DNA substitution models, and computational steps in phylogenetic analysis. It includes examples and summaries of different methods like UPGMA, Neighbor Joining, Maximum Parsimony, and Maximum Likelihood.
Full Transcript
Lecture 10: Phylogenetics II: Tree Building
BIOC 3265 - Principles of Bioinformatics
Dr. A. T. Alleyne, UWI Cave Hill

Recap
A phylogenetic tree is essentially a graphical representation of a multiple sequence alignment.

Learning outcomes
At the end of this lecture, you should be able to:
1. Describe the steps in plotting a phylogenetic tree
2. Explain DNA substitution models used in phylogenetic analysis
3. Distinguish between neighbor joining (NJOIN) and unweighted pair group (UPGMA) methods in tree construction
4. Distinguish between parsimony and distance methods
5. Describe the steps in NJOIN and UPGMA
6. Distinguish between maximum likelihood (ML) and maximum parsimony (MP) methods
7. Explain the term bootstrap and its significance in tree evaluation
8. Construct a phylogenetic tree using distance-based and character-based methods

Computational steps
Sequence acquisition -> Multiple sequence alignment -> Selection of substitution model -> Tree building -> Tree evaluation

Editing an MSA
- Analyze the MSA to ensure that all sequences are homologous.
- To maximize alignment of distantly related organisms or sequences, adjust the gap creation and extension penalties.
- Restrict phylogenetic analysis to regions of the MSA where the data are complete.
- Delete any column that contains gaps.

Evolutionary models
- Either distance-based or character-based; used to calculate branch lengths.
- Distance-based models employ statistics to estimate the number of amino acid changes that have occurred during a pairwise comparison of the sequences.
- Character-based models also use statistics to assess the best tree topology.
- P-distance model: the observed proportion of differences between sequences (MEGA).
- Other models, e.g. one-parameter or two-parameter models.

Nucleotide substitution models
Transition
- Changing a purine for another purine (A-G)
- Changing a pyrimidine for another pyrimidine (C-T)
Transversion
- Changing a purine for a pyrimidine (G-T)
- Changing a pyrimidine for a purine (C-A)

Jukes-Cantor model, 1969 (α = β)
In this early and simple one-parameter model, the rate of change to any of the three other nucleotides is the same, α. It can be used to calculate a distance matrix for phylogenetic analysis.

Kimura model, 1980 (α ≠ β)
- A later two-parameter model: the rate of change to another nucleotide is either α or β, so both transitions and transversions are considered.
- Gives more weight to transversions.
- Introduced the neutral theory of evolution: change occurs through mutation and random drift.

Tree building methods
Tree building draws on either genetic differences or a direct comparison of characters:
- Distance-based methods compare the pairwise distance between two data sets.
- Parsimony-based methods treat characters as existing in different evolutionary states.

Tree building methods
- Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor joining.
- Character-based methods include maximum parsimony and maximum likelihood. Parsimony analysis searches for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.

Genetic distance
- Simplest approach: align pairs of sequences, then count the number of differences (number of substitutions).
- This degree of divergence is called the Hamming distance (D).
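The counting described above, together with the transition/transversion distinction, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the sequences are assumed to be aligned DNA containing only A, C, G and T, and the Jukes-Cantor and Kimura two-parameter correction formulas are the standard published ones rather than anything given on the slides.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def hamming(seq1, seq2):
    """Count the positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def count_changes(seq1, seq2):
    """Return (transitions, transversions); assumes only A, C, G, T."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions, transversions


def jukes_cantor(p):
    """Standard JC69 correction of a p-distance (assumed, not from the slides)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(ts_fraction, tv_fraction):
    """Standard K2P distance from transition and transversion fractions."""
    return -0.5 * math.log(
        (1 - 2 * ts_fraction - tv_fraction) * math.sqrt(1 - 2 * tv_fraction)
    )


seq_a = "ACGTACGT"
seq_b = "ACGTACGA"
length = len(seq_a)

n_diff = hamming(seq_a, seq_b)        # Hamming distance
p_dist = n_diff / length              # observed proportion of differences
ts, tv = count_changes(seq_a, seq_b)
print(n_diff, p_dist, ts, tv)
print(jukes_cantor(p_dist), kimura_2p(ts / length, tv / length))
```

Both corrected distances come out slightly larger than the raw p-distance, which is the point of a substitution model: it allows for changes that the simple count cannot see.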
- For an alignment of length N with n sites that differ, the degree of divergence d is: $d = \frac{n}{N} \times 100$
- A matrix of pairwise distance scores is then used to generate the tree, from a tree file.
- Computationally fast, so it is used for large numbers of sequences (more than 50 or 100).
- Branch lengths may correspond to genetic distance.

Distance-based algorithms
Two common distance-based algorithms:
1. UPGMA: Unweighted Pair Group Method with Arithmetic Mean
2. NJ: Neighbor Joining

Distance matrix properties
- The distance from a point to itself is zero: D(x, x) = 0
- The distance from x to y is the same as from y to x: D(x, y) = D(y, x)
- The triangle inequality applies: D(x, y) ≤ D(x, z) + D(z, y)

UPGMA
- A UPGMA tree is usually rooted and is based on a hierarchical clustering technique.
- The UPGMA algorithm assumes that the molecular clock applies to the sequences in the tree, so all distances contribute equally to each average that is computed.
- Tree branch lengths can be used to estimate divergence rates.
- If there are unequal substitution rates, the tree may be wrong.
- The algorithm infers both tree topology and branch lengths.
- Used for species trees when the rate of gene substitution is constant and there are many OTUs.
- UPGMA is simple but less accurate than NJoin.

The UPGMA algorithm (Sokal & Michener 1958)
Evolutionary distance is derived from a distance matrix.
1. Identify the least dissimilar pair (the most closely related).
2. Combine them into a new group or cluster.
3. Recalculate the distance matrix based on this new group.
4. Connect them via a new node and repeat the process with the next smallest dissimilarity, building cluster groups as you go along.
5. Continue until only two groups remain.
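The clustering steps of UPGMA can be sketched as follows. This is a minimal illustration, not the lecturer's implementation: the function name `upgma`, the use of `frozenset` keys for the distance matrix, and the four-taxon example distances are all assumptions made for this sketch. The loop runs until a single cluster is left, so the final pass joins the last two remaining groups.

```python
from itertools import combinations


def upgma(labels, distances):
    """Cluster labels with UPGMA.

    distances maps frozenset({a, b}) to the pairwise distance.
    Returns (tree, joins): tree is a nested tuple, and joins lists
    (cluster_a, cluster_b, node_height) in the order of merging.
    """
    dist = dict(distances)
    size = {label: 1 for label in labels}  # leaves per current cluster
    joins = []
    while len(size) > 1:
        # Identify the least dissimilar pair of clusters.
        a, b = min(combinations(size, 2), key=lambda pair: dist[frozenset(pair)])
        d_ab = dist[frozenset((a, b))]
        # Combine them; the new node sits at half the distance between them.
        new = (a, b)
        joins.append((a, b, d_ab / 2))
        # Recalculate distances to the new cluster as a size-weighted
        # average, i.e. the arithmetic mean over all leaf pairs.
        for other in size:
            if other not in (a, b):
                dist[frozenset((new, other))] = (
                    size[a] * dist[frozenset((a, other))]
                    + size[b] * dist[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
    (tree,) = size
    return tree, joins


# Hypothetical distances for four taxa (chosen to be clock-like).
pairs = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
matrix = {frozenset(pair): value for pair, value in pairs.items()}
tree, joins = upgma(["A", "B", "C", "D"], matrix)
print(tree)
for a, b, height in joins:
    print(a, b, height)
```

With these example distances A and B are joined first, then C, then D, and each node height is half the distance between the two groups it joins.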
Branch points are placed at half the distance between the joined pair, d(A,B)/2.

UPGMA
Taken from: Calculation of Phylogeny: the UPGMA Clustering Method (nmsr.org)
The matrix is reduced every time a new relationship is formed, and the distance is halved between each pair that is joined together.

1. Compute the pairwise distances for 5 proteins (1, 2, 3, 4, 5).
2. Cluster the two proteins with the smallest pairwise distance.
3. Find the next two proteins with the smallest pairwise distance.
4. Repeat: find the next two sets of proteins with the smallest pairwise distance.
5. Repeat! This is your tree.

Summary: UPGMA algorithm
- Clustering: cluster the units with the smallest distance between them into one group.
- Compute a new distance matrix: use this cluster and the next two units with the next smallest distance between them.
- Repeat: repeat the clustering process until all groups are clustered in relation to each other.

Examples of UPGMA analysis
- Microarray data analysis
- Phylogenetic analysis of simple molecular data, e.g. RAPDs, AFLPs, RFLPs. The molecular data are used for scoring and for computing a distance matrix or similarity matrix.

NJoin concept
- The principle of the NJ method is to find neighbors by sequentially minimizing the total length of the tree.
- A neighbor is a pair of OTUs or taxa connected by a single interior node.

NJOIN: Saitou and Nei (1987)
- NJoin produces an unrooted tree.
- Does not assume the molecular clock or a constant evolutionary rate.
- Uses a minimum evolution method (a greedy algorithm: it chooses the locally optimal solution at each stage, which can approximate the global solution in a reasonable amount of time).
- Produces a tree with branch lengths based on the sum of distances.

Neighbor joining: step 1
- Place all the taxa (OTUs) in a star-like structure.
- Assumes no initial clusters; considers all pairs as potential neighbors.

Neighbor joining: step 2
- Identify the neighbors (1 and 2) that are most closely related (sum of branch lengths). Pairwise comparisons are made.
- Connect these neighbors (1, 2) to the other OTUs via an internal branch, XY.
- At each successive stage, minimize the sum of the branch lengths until the tree topology is completed.
- Define the distance from X to Y by $d_{XY} = \frac{1}{2}(d_{1Y} + d_{2Y} - d_{12})$

Summary: Neighbor joining algorithm
Start from a star phylogeny (left); find the nearest pair of nodes (according to the distance matrix, either A-B or D-E) (middle); recalculate the distance matrix using the new node (AB); repeat until the tree is fully resolved (right).

Summary: NJoin trees
- Generates a final tree topology with branch length estimates
- Based on the principle of minimum evolution
- Appropriate for large data sets and very quick
- Permits correction for multiple substitutions
- Suited to datasets comprising lineages with widely varying rates of evolution

Character-based methods
Maximum parsimony
- Based on the fewest possible evolutionary changes; no explicit evolutionary model.
- Represents the best relationship between taxa, based on the shortest branch length.
Maximum likelihood
- Based on tree topology (shape) and branch lengths; explicit evolutionary model.
- Calculates the probability of each residue in an alignment.
- Determines the best tree based on the likelihood of the data seen.

Maximum parsimony algorithm
- Goal: find the tree with the shortest branch lengths possible, the most parsimonious ("simple") tree.
- Identify informative sites, e.g. changes in character states. Constant states (no change) are not informative.
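Picking out informative sites can be sketched as below. The slide only says that constant sites are uninformative; the stricter rule used here (at least two different states, each present in at least two sequences) is the usual textbook definition of a parsimony-informative site and is an assumption layered on top of the lecture. The function name and the five short peptide strings are illustrative, and gaps are not treated specially.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 0-based column indices that are parsimony-informative.

    A column qualifies when at least two different character states
    each occur in at least two sequences (assumed textbook rule).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        shared_states = [state for state, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            sites.append(col)
    return sites


# Toy peptide alignment, one string per taxon.
toy_alignment = ["LKGH", "LKGF", "LKSH", "MLGF", "THSF"]
print(informative_sites(toy_alignment))
```

In the toy alignment the first two columns vary, but each alternative state appears only once, so they cannot favour one tree over another; only the last two columns are counted.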
- Parsimony-informative sites are assumed to be independent of each other.
- Count the number of changes required to create each tree.
- For 12 taxa a heuristic search is done.
- Every tree is assigned a cost.
- Select the tree (or trees) with the fewest molecular changes.
- Parsimony analysis may fail if there are too many changes in the data set.

This tree is preferred as it explains the data with the fewest evolutionary changes.

Example: maximum parsimony
LKGH - Kangaroo
LKGF - Turtle
LKSH - Seal
MLGF - Horse
THSF - Kangaroo α
- The tree with the lowest cost is chosen.
- If there are trees with the same cost, the consensus tree is chosen.

Maximum likelihood
- Computationally intensive.
- Determines which tree topology (shape) and branch lengths have the greatest likelihood of producing the observed data set.
- A likelihood is the probability of each residue in an alignment based upon a substitution model: hypothesis testing of observed events.
- Provides a statistical model of evolutionary changes across branches.
- Used by the MEGA program, PAUP, TREE-PUZZLE and PHYLIP.

Summary: MP vs ML
- MP seeks the tree topology that requires the fewest changes in character state (shortest distance).
- ML seeks the tree topology that confers the highest probability on the observed characteristics.

Bayesian inference and phylogenetics
- Bayesian inference and maximum likelihood are widely used for phylogenetic analyses.
- Pr[Tree | Data] is the posterior probability distribution of all possible trees. This involves a summation over all possible trees.
- Markov chain Monte Carlo (MCMC) runs are used to estimate the posterior probability distribution.

Tree evaluation
- Robustness
- Consistency
- Efficiency

Bootstrapping
- Used to measure the statistical strength of a tree topology.
- Shows how consistently an algorithm finds a particular branching order in random resamplings of the original data set.
- Bootstrap values show the percentage of times each clade is supported after a large number (n > 500) of repeated samplings of the data.
- Infers the frequency with which each clade is observed from repeated sampling (randomizing).

Bootstrap method
1. Make an artificial dataset by randomly sampling columns, with replacement, from your MSA.
2. Make the dataset the same size as the original.
3. Compute 100-1,000 bootstrap replicates.
4. Observe the percentage of replicates that support each clade in the original tree.
5. A value above 70% is considered significant.

Computational stages in phylogenetic analysis
1. Selection of sequences for analysis: DNA, RNA or protein
2. Multiple sequence alignment: perform the MSA and edit it (gap extension penalties etc.; sequence integrity and homology)
3. Selection of a substitution model: JC or Kimura model (transitions vs transversions)
4. Tree building: UPGMA or NJ; MP or ML
5. Tree evaluation: bootstrap; randomization test

References
1. Krane, D. E. and Raymer (2003) Fundamental Concepts of Bioinformatics. Benjamin Cummings, CA, USA
2. Felsenstein, J. (2004) Inferring Phylogenies. Sinauer Associates, MA, USA
3. Pevsner, J. (2009) Bioinformatics and Functional Genomics, 2nd ed. Wiley and Sons
4. Phylogenetics Primer, NCBI
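The resampling in the bootstrap method (sample alignment columns at random, keep each replicate the same size as the original, then count how often a grouping recurs) can be sketched as follows. To keep the example self-contained it does not build a full tree for each replicate: as a deliberately simplified stand-in, "support" is the percentage of replicates in which a chosen pair of sequences is still the closest pair by Hamming distance. All function names, the seed and the toy alignment are assumptions for illustration only.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement; the replicate keeps the original size."""
    n_cols = len(alignment[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in chosen) for seq in alignment]


def closest_pair(alignment):
    """Return the indices (i, j) of the two sequences with the fewest differences."""
    best = None
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            d = sum(a != b for a, b in zip(alignment[i], alignment[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]


def pair_support(alignment, pair, replicates=500, seed=0):
    """Percentage of bootstrap replicates in which `pair` is the closest pair.

    A stand-in for clade support: a real analysis would rebuild the
    whole tree for every replicate and compare clades.
    """
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_alignment(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACCTAT", "TCGAAGCTTT"]
print(pair_support(toy, closest_pair(toy)))
```

A grouping that depends on only one or two columns will drop out of many replicates and score low, whereas one supported by many columns survives most resamplings.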