Lecture 7 - Multiple Sequence Alignment (MSA) - BIOC 3265

Lecture 7: Multiple Sequence Alignment (MSA)
BIOC 3265 - PRINCIPLES OF BIOINFORMATICS
DR ANGELA T ALLEYNE

LEARNING OUTCOMES
At the end of this lecture you should be able to:
1. Define a multiple sequence alignment (MSA)
2. Compare the algorithmic methods for MSA analysis
3. Discuss benchmarking
4. Rationalize the limitations of the exact technique for MSA
5. Describe the steps in the progressive method of MSA analysis
6. Compute a distance matrix using the progressive algorithm
7. Explain the steps in the CLUSTALW program
8. Evaluate the progressive technique for MSA construction

MULTIPLE SEQUENCE ALIGNMENT
Whole or partial alignments involving three or more nucleotide or protein sequences (N > 2).
Used for:
i. Identification of protein or gene families (group membership)
ii. Conservation analysis of protein or nucleotide residues, motifs etc., and conserved secondary structure features
iii. Phylogenetic analysis (distant relationships)
iv. Reconstruction of gene fragments (e.g. regulatory regions) in sequencing projects
v. Analysis of population data
- Homologous residues are aligned in columns across the length of the sequences

MSA ADVANTAGES
- Visual examination gives a quick analysis of homology
- Shows conservation patterns
- Can discern distant relationships more accurately and quickly than single sequences can

THE GOAL
- To write each sequence along the others to express any similarity between the sequences
- Each element of a sequence is placed alongside either a corresponding element in the other sequences or a gap character
Example: TGCG, AGCTG, and AGCG can be aligned as follows:
[Figure: alignment of the three example sequences]

A MULTIPLE SEQUENCE ALIGNMENT
Similarly aligned sequences are color coded.
[Figure: color-coded multiple sequence alignment]

MSA: ALGORITHMIC TECHNIQUES
Technique       Description
Progressive     Hierarchical; uses pairwise alignments
Iterative       Uses the progressive method but builds a profile
Probabilistic   Uses probability statistics, e.g. Hidden Markov models

METHOD          STEPS
Exact           Sum of pairs (uses a DP algorithm)
Progressive     1. Align the most closely related sequences, then less related sequences
                2. Plot a phylogenetic tree to quantify similarities
Iterative       1. Start with a progressive method (an initial global alignment)
                2. Realign sequences after leaving one out
                3. Add the left-out sequence
                4. Repeat until an acceptable alignment is achieved
Consistency     Consistency and probability-based methods. Uses contextual information about the progressive alignment in progress to guide individual pairwise alignments, which improves accuracy

MSA COMPLEXITY
- Adding additional sequences results in an exponential increase in the number of computations required to find the optimal alignment
- For m sequences, even the dynamic programming methods quickly break down

m (no. of sequences)*    No. of comparisons
2                        90,000
3                        2.7 x 10^7
4                        Approx. 8 x 10^9
5                        Approx. 2.4 x 10^12
*m protein sequences, each 300 amino acids in length (300^m comparisons)

EXACT METHODS
- Exact methods: dynamic programming (DP), which solves a problem recursively from smaller sub-problems.
- Instead of the 2-D matrix used in the Needleman-Wunsch (NW) technique, a 3-D, 4-D or higher-order matrix is needed.
- Gives optimal alignments but is not feasible in time or space for more than ~10 sequences.

LIMITATIONS OF SUM OF PAIRS METHOD
- N individual sequences: requires constructing the N-dimensional equivalent of the score and trace-back matrices.
- Large search space: increases exponentially with increasing N and is also strongly dependent on sequence length.
- Long computational time: with N = no. of sequences and L = average sequence length, the computational time required is O(2^N L^N).

HOW IS DP USED IN MSA COMPUTATIONS?
Benchmarking!
- DP gives an extremely high-quality alignment of a small number of sequences.
- This MSA benchmarking standard is used to evaluate new heuristic techniques or refine existing ones.
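
The sum-of-pairs idea and the growth of the exact method's search space can be illustrated with a minimal sketch. It rests on assumptions that are not in the lecture: the scoring values (match +1, mismatch -1, gap -2, gap against gap 0) are hypothetical, and the three-row alignment is only one plausible way to align TGCG, AGCTG and AGCG. The function names are illustrative.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (hypothetical scoring values)."""
    if a == GAP and b == GAP:
        return 0  # assumption: a gap aligned to a gap is not penalised
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Add up the pair scores over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


def lattice_cells(n_sequences, length=300):
    """Cells in the N-dimensional DP matrix for sequences of equal length."""
    return length ** n_sequences


# Assumed alignment of the lecture's example sequences (not from the slides).
example = ["TGC-G", "AGCTG", "AGC-G"]
print("sum-of-pairs score:", sum_of_pairs(example))

# Search space for m sequences of 300 residues each.
for m in range(2, 6):
    print(m, f"{lattice_cells(m):.1e}")
```

The exact method would have to search every cell of that lattice for the alignment with the best sum-of-pairs score, which is why it is reserved for small benchmark-quality alignments.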
BENCHMARKING
- Benchmarking provides a "gold standard" approach for comparing MSA methods, to reduce the variability in multiple analyses
- It assists in determining the correct analysis when different approaches or algorithms provide different results
- The gold standard consists of true biological relationships
- Replication is built into benchmark analysis
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995051/pdf/gkq625.pdf

BENCHMARK DATABASES
A benchmark dataset is a database of separate categories of MSAs.
Properties:
1. Relevance: includes practical tasks usually encountered
2. Solvability: includes solutions to tasks that are neither too easy nor too hard
3. Scalability: includes a range of tasks, small or large
4. Accessibility: publicly available
5. Independence: the method used to construct the data set should differ from the method used to construct the MSA
6. Evolution: is not static and should be adapted to new problems

HTTP://SMART.EMBL-HEIDELBERG.DE/
HTTPS://PFAM.XFAM.ORG/

PROGRESSIVE, HIERARCHICAL OR TREE METHODS
- Generate an MSA by first aligning the most similar sequences (pairwise) and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution.
- Automatically construct a phylogenetic tree or dendrogram as well as an alignment.
- Weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and improves alignment accuracy.

CLUSTAL
- CLUSTAL: a progressive method for multiple sequence alignment
- Performs a global multiple sequence alignment
- CLUSTALW: cluster analysis by weight
- CLUSTALW replaced by CLUSTALO (Clustal Omega), 2019
https://www.ebi.ac.uk/Tools/msa/clustalo/

THE CLUSTAL ALGORITHM
Step 1: Pair-wise alignments between all sequences, using the Needleman-Wunsch (NW) algorithm
Step 2: Construct a guide tree from the similarity matrix
Step 3: Combine alignments sequentially, starting from the most closely related sequences or groups and moving to the most distantly related, using the guide tree from step 2
CLUSTALW is a commonly used progressive method.

CLUSTALW: STEP 1
- Use a pair-wise alignment method (NW) to compute pair-wise alignments amongst the sequences
- Using the pair-wise alignments, compute a "distance score" between all pairs of sequences.
EXAMPLE:
o Count the number of mismatches in the pair-wise alignment
o Count the number of non-gapped positions in the pair-wise alignment
o Compute a distance matrix: Distance (D) = (No. of mismatches)/(No. of non-gapped positions)

CALCULATING DISTANCE
For each pairwise alignment, look at the non-gapped positions and count the number of differences per site.
[Figure: pairwise alignment with one mismatch and four non-gapped positions marked]
Example: the best alignment for the two sequences has a distance of 1/4 = 0.25, from the single mismatch between the M and V over four non-gapped positions.
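
The distance calculation described above can be sketched in a few lines. This is a minimal illustration, not CLUSTALW's actual code: the sequence names and aligned strings are hypothetical, chosen so that the first pair reproduces the 1/4 = 0.25 case with an M/V mismatch, and the pairwise alignments are assumed to have been computed already.

```python
from itertools import combinations

GAP = "-"


def pairwise_distance(aligned_a, aligned_b):
    """Distance D = mismatches / non-gapped positions for one pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    non_gapped = 0
    mismatches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == GAP or y == GAP:
            continue  # positions with a gap in either sequence are ignored
        non_gapped += 1
        if x != y:
            mismatches += 1
    if non_gapped == 0:
        raise ValueError("no non-gapped positions to compare")
    return mismatches / non_gapped


def distance_matrix(names, pairwise_alignments):
    """Build a symmetric matrix from precomputed pairwise alignments."""
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = pairwise_alignments[(names[i], names[j])]
        matrix[i][j] = matrix[j][i] = pairwise_distance(a, b)
    return matrix


# Hypothetical pairwise alignments (illustrative data, not from the lecture).
names = ["s1", "s2", "s3"]
alignments = {
    ("s1", "s2"): ("MKL-A", "VKLGA"),  # 1 mismatch over 4 non-gapped positions
    ("s1", "s3"): ("MKLA", "MRLA"),    # 1 mismatch over 4 non-gapped positions
    ("s2", "s3"): ("VKLGA", "MRL-A"),  # 2 mismatches over 4 non-gapped positions
}

for name, row in zip(names, distance_matrix(names, alignments)):
    print(name, row)
```

Every value falls between 0 and 1, and the smallest off-diagonal entries mark the pairs a progressive method would align first.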
"ONCE A GAP, ALWAYS A GAP"
Feng and Doolittle (1987) used the rule "once a gap, always a gap" (fixed gaps) in computing distance scores in the distance matrix.
- Gaps are often added to the first two (closest) sequences
- To change the initial gap choices later would be to give more weight to distantly related sequences
- To maintain the initial gap choices is to trust that those gaps are the most believable

After a pairwise alignment has been completed, any gap symbols are replaced by a neutral X character. This is the "once a gap, always a gap" rule of Feng and Doolittle. It allows the sequences already aligned to guide the alignment of future sequences.

FENG & DOOLITTLE
o Use clusters to build up groups. (PAIRWISE)
o Take the sequence you want to add and pairwise align it to all the sequences already in the group. (TREE)
o The highest scoring pair is then used to determine how to add the new sequence to the group. (PROGRESSIVE ALIGN)
Implication: the first two, most closely related sequences to be aligned are given more weight in assigning gaps.

FENG AND DOOLITTLE (1987)
Because there is no cost in aligning anything to an X, the rule has the side effect of encouraging gaps to occur in the same column in subsequent alignments.

CLUSTALW: STEP 1
A. After computing the distance between all pairs of sequences, put the distances into a matrix: a distance matrix (Feng and Doolittle, 1990).
Note: computed distances are in the range 0 to 1, with smaller values indicating more closely related sequences.

CLUSTAL: STEP 2
Construct a guide tree
- Uses the similarity or distance matrix and either the Neighbor Joining (NJOIN) or the UPGMA method to construct a guide tree
- If two or more sequences share a branch, this may indicate an evolutionary relationship between the sequences.
- Guide tree construction is usually O(N^2) for N sequences
- The guide tree can be displayed graphically using JalView

THE GUIDE TREE (STEP 2)
- The guide tree has branches of different lengths.
- Each branch represents the divergence from the common ancestor as a proportional distance.
- Progressive alignment follows the branch order in the tree

CLUSTAL: STEP 3
- Pairwise alignments are created starting from the most closely related sequences or groups and moving to the most distantly related sequences or groups in the guide tree
- The alignment is continued progressively, in the order given by the guide tree, until an MSA is achieved

CLUSTAL PROGRAMMING STEPS: IN SUMMARY
Step 1: A pair-wise alignment method (pairwise alignment; compute distance "d")
Step 2: Construct a guide tree (neighbour joining)
Step 3: Progressive alignment (use the guide tree)

FEATURES OF CLUSTAL
CLUSTALW generates good MSAs because:
- Individual weights are assigned to sequences: very closely related sequences are given less weight, while distantly related sequences are given more weight
- Scoring matrices are varied based on the presence of conserved or divergent sequences
- Residue-specific gap penalties are applied
- Gaps are fixed
However, the progressive approach is a "greedy algorithm": mistakes made at the initial alignment stages cannot be corrected later.

CLUSTAL O (OMEGA)
- Fast, accurate and relatively new (2009) MSA algorithm based on CLUSTAL
o Allows alignments of almost any size to be produced.
- Uses a scalable progressive alignment
- Alignments can be re-used
- Can compare protein families with >50 000 sequences from genome projects
- Was a response to the increasing numbers of available sequences and the need to make big alignments quickly and accurately
- Alignments on this scale were otherwise handled only by MAFFT

CLUSTALO ALGORITHM
- Uses mBed: each sequence is "emBedded" in a space of n dimensions, where n is proportional to log N.
- These vectors can then be clustered extremely quickly by standard methods such as K-means or UPGMA.
- Algorithm complexity is O(N log N): guide trees of hundreds of thousands of sequences can be made because the number of sequence alignment scores calculated is restricted to N log(N).
- It aligns based on profiles: it uses profile hidden Markov models (HMMs) to compare sequences instead of conventional dynamic programming and profile alignment. This gives Clustal Omega greatly increased accuracy compared to earlier Clustal programs.
http://msb.embopress.org/content/msb/7/1/539.full.pdf

REFERENCES
Baxevanis, A. D. and Ouellette, B. F. F. (2005) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. 3rd ed. Wiley. Chapter 12.
Pevsner, J. (2009) Bioinformatics and Functional Genomics. Wiley and Sons. Chapter 6.
http://bioweb.cbm.uam.es/courses/MasterVirol2013/alignment/Review_MSA.pdf
https://www.embopress.org/doi/pdf/10.1038/msb.2011.75
