Multiple Sequence Alignment (MSA)
Bicol University
Joseph Martin Q. Paet
Summary
This document explains multiple sequence alignment (MSA). It describes the different types of MSA algorithms and methods, such as progressive, iterative, consistency-based, and structure-based methods. This document also includes examples of these algorithms using different sequence alignment programs.
Full Transcript
Bio16 Computational Biology
Multiple Sequence Alignment (MSA)
Prepared by: Joseph Martin Q. Paet
Biology Department, College of Science, Bicol University

Multiple Sequence Alignment
Similar to pairwise sequence alignment (PSA), MSA aims to find proteins or genes that are related
Family of sequences = orthologs and paralogs
Homologs = retain similar structure and function
More powerful than PSA = two sequences whose pairwise alignment is ambiguous can be aligned through their relationship to a third sequence, leading to the identification of conserved regions and the grouping of sequences into a family
MSA is a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned.
Domains and motifs that characterize a protein family are defined by the existence of an MSA.
Pevsner, J. (2015). Bioinformatics and Functional Genomics (3rd ed.). John Wiley & Sons Inc.

Cyclic Nucleotide-Gated Ion Channel
Homologous residues
Evolution
3D position
Talke, I. et al. (2003). CNGCs: Prime targets of plant cyclic nucleotide signalling? Trends in Plant Science, 8(6), 286–293. https://doi.org/10.1016/s1360-1385(03)00099-2

When do we use MSA?
Group membership = provides insight into a sequence's function, structure, and evolution
More sensitive in detecting homologs
Reveal conserved residues or motifs
Detection of deleterious variants
Comparison of whole genomes and mining of novel genes/proteins
Generating phylogenetic trees
Identifying regulatory regions of genes through consensus sequences
Designing primers
(Pevsner, 2015)

Algorithmic Approaches to MSA
Exact Methods
Progressive Alignments
Iterative Approaches
Consistency-Based Methods
Structure-Based Methods
(Pevsner, 2015)
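
Two of the uses listed earlier, revealing conserved residues and deriving consensus sequences, can be illustrated with a small sketch. The example below is not taken from the slides: the four aligned sequences are a made-up toy alignment, and the names msa and column_profile are hypothetical. It simply reads the alignment column by column, takes the most common character in each column as the consensus, and reports the columns in which every sequence has the same residue.

```python
from collections import Counter

# Hypothetical toy MSA (not from the slides): four aligned sequences of
# equal length, with '-' marking a gap.
msa = [
    "MKT-LLV",
    "MKTALLV",
    "MRTALIV",
    "MKSALLV",
]


def column_profile(alignment):
    """Return, for each column, (most common character, fraction of rows with it)."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    # zip(*alignment) walks the alignment column by column.
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        profile.append((residue, count / len(column)))
    return profile


profile = column_profile(msa)

# Consensus: the most common character of every column, in order.
consensus = "".join(residue for residue, _ in profile)

# Fully conserved columns: every row carries the same non-gap residue.
conserved = [
    index
    for index, (residue, fraction) in enumerate(profile)
    if fraction == 1.0 and residue != "-"
]

print("consensus:", consensus)
print("fully conserved columns (0-based):", conserved)
```

Real alignments would normally be read from a file produced by an alignment program, and ties between equally common residues would need an explicit rule. The toy data above was chosen so that no column is tied.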

Exact Methods
Employ dynamic programming similar to the Needleman-Wunsch PSA algorithm
Multidimensional comparison
The goal is to maximize the summed alignment score of each pair of sequences
Not feasible in time or space for more than a few sequences
O(2^N L^N), where N is the number of sequences and L is the mean sequence length
(Pevsner, 2015)

Progressive Sequence Alignment
Most commonly used; proposed by Fitch and Yasunobu (1975) and popularized by Da-Fei Feng and Russell Doolittle (1987, 1990)
The strategy entails calculating pairwise sequence alignment scores between all the sequences being aligned, then beginning the alignment with the two closest sequences and progressively adding more sequences to the alignment
“Once a gap, always a gap” rule: gaps introduced in an earlier step are kept as further sequences are added
Rapid alignment, but not guaranteed to provide the most accurate alignment
E.g., ClustalW
(Pevsner, 2015)

Progressive Sequence Alignment (ClustalW)
(1) Series of PSA
(2) Create a guide tree
(3) Generate the MSA
(Pevsner, 2015)

Iterative Approaches
Compute a suboptimal solution using a progressive alignment strategy, then modify the alignment using dynamic programming or other methods until the solution converges
Overcome a weakness of progressive alignment: errors introduced early cannot be corrected in later steps
E.g., Multiple Alignment using Fast Fourier Transform (MAFFT), Iteralign, Profile Alignment (PRALINE), and Multiple Sequence Comparison by Log-Expectation (MUSCLE)
(Pevsner, 2015)

Iterative Approaches
Designed for command-line use
(Pevsner, 2015)

Iterative Approaches: Stages of MUSCLE Alignment
(1) A draft progressive alignment is generated
(2) MUSCLE improves the tree and builds a new progressive alignment
(3) The guide tree is iteratively refined by systematically partitioning the tree to obtain subsets
(Pevsner, 2015)

Consistency-Based Methods
Adopt a different approach by using information about the multiple sequence alignment as it is being generated to guide the pairwise alignments
By incorporating evidence from multiple sequences, this technique aims to improve the accuracy and reliability of the alignment process
E.g., ProbCons and T-COFFEE (Tree-based Consistency Objective Function For alignmEnt Evaluation)
ProbCons outperformed six other MSA programs, including ClustalW, MAFFT, T-COFFEE, and MUSCLE
(Pevsner, 2015)

Consistency-Based Methods: Stages of ProbCons Alignment
(1) Calculation of posterior probability matrices
(2) Computation of expected accuracy
(3) Re-estimation of quality scores
(4) Construction of an expected accuracy guide tree
(5) Progressive alignment of sequences
(Pevsner, 2015)

Structure-Based Methods
Improve the accuracy of multiple sequence alignments by including information about the three-dimensional structure of one or more members of the group of proteins being aligned
E.g., PRALINE and Expresso (module of T-COFFEE)
Expresso = each sequence is automatically searched by BLAST against the Protein Data Bank (PDB), and matches (sharing >60% amino acid identity) are used to provide a template to guide the creation of the MSA
(Pevsner, 2015)

References:
Pevsner, J. (2015). Bioinformatics and Functional Genomics (3rd ed.). John Wiley & Sons Inc.
Talke, I. et al. (2003). CNGCs: Prime targets of plant cyclic nucleotide signalling? Trends in Plant Science, 8(6), 286–293. https://doi.org/10.1016/s1360-1385(03)00099-2