Lecture 8- MSA II PDF

LECTURE 8
BIOC 3265 - Principles of Bioinformatics
MULTIPLE SEQUENCE ALIGNMENTS II: ITERATIONS
Dr. A. T. Alleyne - UWI Cave Hill

LEARNING OUTCOMES
At the end of this lecture you should be able to:
1. Define and describe iterative models of MSA
2. Define and describe probabilistic models of MSA
3. Define a Hidden Markov Model (HMM) - Profile search
4. Explain the importance of the Viterbi algorithm in constructing an HMM
5. Explain the uses, advantages and disadvantages of the HMM
6. Evaluate the efficiency and best use of each MSA method

MSA METHODS (REVIEW)
Progressive: align similar sequences, then plot a guide tree, e.g. ClustalW, T-Coffee
Iterative: iterate the progressive method, e.g. MUSCLE, MAFFT
Probability/Consistency: compute probabilities, then align, e.g. HHalign, ProbCons
The progressive methods have a computational complexity of O(N^2)

ITERATIVE METHOD
1. Assign a global alignment using a scoring method, then re-align the sequence subsets
2. Re-align subsets, then align again to produce the next iteration's MSA
3. Repeat until an optimal alignment is produced, by comparing each iteration

ITERATIONS
An initial alignment is created and then modified until it is improved
An initial progressive alignment may be sub-optimal
Iterative methods can overcome and correct an initial sub-optimal alignment
Two examples of this approach are:
MUSCLE - Multiple Sequence Comparison by Log-Expectation
MAFFT - Multiple Alignment using Fast Fourier Transform

MUSCLE (ROBERT EDGAR 2004)
Step 1: Build a draft progressive alignment
1a. Determine pairwise similarity or distance through k-mer counting (not by alignment) and compute a triangular distance matrix
1b. Construct a rooted tree using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) or neighbor joining
1c. Construct a draft progressive alignment following the tree
Step 2: Improve the tree and build a new progressive alignment
2a. Improve the guide tree by building a new progressive alignment again through a distance calculation (using Kimura distance)
2b. Each new tree is compared to the initial tree from step 1
Step 3: Refine the guide tree
The guide tree is refined by obtaining subsets of the tree, deleting and creating new branches, and then creating profiles for new iterations of the best tree

MUSCLE
Uses a k-mer distance (unaligned pairs) and a Kimura distance measure (aligned pairs)
A k-mer is a contiguous subsequence of length k. Related sequences tend to have more k-mers in common
K-mer = k-tuple or word size (k), used for an unaligned pair of sequences
Kimura = a pairwise distance measure for an aligned pair of sequences
Kimura distance is based on evolutionary base substitutions

MUSCLE
1. Compute the pairwise percent identities (the percentage of each pair of sequences that matches) and convert them to a distance matrix using k-mer counting
2. Build a tree from the distance matrix using the UPGMA method (a method of phylogenetic tree construction that assumes a fixed and constant mutation rate). This gives TREE1, which is followed by a progressive alignment to form MSA1
3. Compute pairwise percent identities from MSA1 and construct a Kimura distance matrix
4. Apply the UPGMA method again; this gives TREE2

MUSCLE
6. From the last obtained tree, delete an edge, which splits the tree into two subtrees
7. Compute the profile of each subtree and align the two subtree profiles
8. This gives a new MSA, for which the SP (Sum of Pairs) score is calculated
9. If the SP score is better, save that last obtained MSA as MSA3; otherwise discard it
10. Repeat from step 6 onwards to give the final MSA
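Steps 1 and 2 of the MUSCLE outline above (a k-mer distance matrix followed by UPGMA clustering) can be sketched in a few lines of Python. This is only an illustrative sketch: the distance formula (one minus the fraction of shared k-mers), the word size k=3, the function names, and the three toy sequences are assumptions made for the example, not MUSCLE's actual formula, defaults, or code.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k=3):
    """Count every contiguous subsequence (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Simplified k-mer distance in [0, 1]: 1 minus the fraction of shared k-mers.

    Assumption: an illustrative formula, not the exact one used by MUSCLE.
    """
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    possible = min(len(a), len(b)) - k + 1
    if possible <= 0:
        return 1.0  # a sequence shorter than k has no k-mers to share
    return 1.0 - shared / possible


def distance_matrix(seqs, k=3):
    """Triangular distance matrix stored as {frozenset({name1, name2}): distance}."""
    return {frozenset((x, y)): kmer_distance(seqs[x], seqs[y], k)
            for x, y in combinations(seqs, 2)}


def upgma(names, dist):
    """Cluster with UPGMA and return the rooted guide tree as nested tuples."""
    d = dict(dist)
    sizes = {name: 1 for name in names}  # cluster -> number of leaves it holds
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Hypothetical toy sequences: s1 and s2 differ only in their last base.
seqs = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "TTTTGGGGCC"}
tree = upgma(list(seqs), distance_matrix(seqs))
print(tree)
```

Because no alignment is needed to count k-mers, this first distance matrix is cheap to compute, which is why it is used for the draft tree before any Kimura distances are available.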
SUMMARY
Computational steps in the MUSCLE algorithm
[Figure: By Luisren64 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=117692993]

MULTIPLE ALIGNMENT BY FAST FOURIER TRANSFORMATION (MAFFT)
Identifies key regions of similarity using the fast Fourier transform (FFT); a progressive alignment method
An MSA is created from refined distances and a second progressive alignment is done
PartTree, a fast progressive alignment tool, is then used to re-align large numbers of sequences (k-mer counting)

MAFFT
Uses two heuristic methods: progressive alignment and iterations
Progressive alignment uses 6-tuples (k-mers of length 6) to calculate pairwise distances by k-mer counting
Transforms sequences by representing them by their physicochemical properties, namely polarity and volume (size)
Computational time is O(N log N)
MAFFT is useful for hard-to-align sequences such as those containing large gaps (e.g., rRNA sequences containing variable loop regions)
MAFFT allows local and global alignments

PROBABILISTIC METHODS
Statistical methods of probability and likelihood
Assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely MSA or set of possible MSAs
They do not produce the same solution every time on the same dataset
Find conserved patterns in an already constructed MSA, or construct a partial MSA by considering every pattern

PROBABILISTIC METHODS
Use unaligned sequences and statistical analysis
Require large computer resources/space
The cost can be reduced by limiting the analysis to short continuous stretches of sequence

CONSISTENCY METHOD
Combines iterative and progressive approaches with a unique probabilistic model.
Uses Hidden Markov Models to calculate probability matrices for matching residues, and uses these to construct a guide tree
Progressive alignment proceeds hierarchically along the guide tree
Examples: T-COFFEE, DIALIGN, ProbCons
http://tcoffee.crg.cat/apps/tcoffee/index.html

CONSISTENCY BASED METHOD
The consistency-based approach uses information about the MSA to adjust the scoring and hence the eventual alignment. Five steps:
1. Calculate a probability matrix for each pair of sequences (uses a Hidden Markov Model)
2. The expected accuracy of each pairwise alignment is calculated (uses NM scoring)
3. Quality scores are re-estimated (an interaction), using information about conserved residues identified through the pairwise alignment in step 1
4. A guide tree is constructed from the expected accuracies
5. The MSA is produced from the guide tree using a progressive alignment (may be refined through iterations)

T-COFFEE (NOTREDAME ET AL. 2000)
Tree-based Consistency Objective Function for alignment Evaluation
Consistency-type algorithm based on a progressive approach
Construction of libraries of alignment: global or local alignment, e.g. CLUSTALW and LALIGN
Library extended to weigh position in the library
Progressive alignment: the most alike sequences are aligned based on score weights in the library

T-COFFEE STEPS
1. Generate 2 sets of pairwise alignments, one global (ClustalW) and one local (Lalign)
2. Weight, compare, and combine
3. Library extension (reweighting, position-specific scoring)
4. Progressive alignment (no gap penalties necessary, due to the weighting strategy)

T-COFFEE
Considers both local and global alignments
Improves on errors of the Clustal algorithm
Uses information about the alignment as it is being built to refine the alignment
[Figure: taken from Fig. 1 in PMID: 10964570]

MSA EVALUATION - TRANSITIVE CONSISTENCY SCORE
TCS (transitive consistency score) is suitable for the identification of correctly aligned residues, by structural analysis, and for the improvement of phylogenetic reconstruction and MSA evaluation.

Program    BAliBASE   PREFAB
TCS        94.44      89.24
GUIDANCE   90.28      85.74
HoT        82.66      80.30

HIDDEN MARKOV MODELS (HMMS)
A probabilistic method
Hidden Markov models (HMMs) are "probability states" that describe the probability of having an amino acid residue/nucleotide arranged in a column of a multiple sequence alignment
HMMs may give more sensitive alignments than traditional techniques such as progressive alignment
HMMs can produce both global and local alignments
An HMM assesses the likelihood of all gaps, matches, mismatches, insertions and deletions possible in an alignment

MARKOV MODELS (CHAIN)
Consists of a data structure that includes a start, possible transition states, and an end (final) state
Each transition state has a probability of occurrence
Each transition depends only on the current state, not on the history of previous states
Any path taken from start to finish can produce a sequence

A SIMPLE HIDDEN MARKOV MODEL
P(dog goes out in rain) = 0.20
P(dog goes out in sun) = 0.80
Observation: YNNNYYNNNYN (Y = goes out, N = doesn't go out)
Whether the dog goes outside (the observed state) may be a predictor of the weather, which is the hidden state.

HMM EXAMPLES
Based on a physical system that goes stepwise through a change.
A protein (or DNA) sequence alignment has some hidden process (transitions) that is described by transition probabilities.
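As a rough illustration of how the hidden weather could be decoded from the dog observations in the simple HMM above, here is a minimal Viterbi sketch in Python. Only the two "goes out" probabilities (0.80 in sun, 0.20 in rain) come from the slide. The start probabilities (0.5 each) and the weather transition probabilities (0.7 stay, 0.3 switch) are assumptions invented for this example, as are all names in the code.

```python
import math

STATES = ("sun", "rain")

# From the slide: P(goes out | sun) = 0.80, P(goes out | rain) = 0.20.
EMIT = {"sun": {"Y": 0.8, "N": 0.2}, "rain": {"Y": 0.2, "N": 0.8}}

# ASSUMED values (not given in the lecture): start and transition probabilities.
START = {"sun": 0.5, "rain": 0.5}
TRANS = {"sun": {"sun": 0.7, "rain": 0.3}, "rain": {"sun": 0.3, "rain": 0.7}}


def viterbi(obs):
    """Return the most probable hidden weather path for a string of Y/N observations."""
    # Work in log space so long observation strings do not underflow.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


print(viterbi("YNNNYYNNNYN"))
```

Note that the outgoing transition probabilities of each state in `TRANS` sum to 1, as do the emission probabilities in `EMIT`. With different assumed transition values the decoded path may change, which is the sense in which the weather is "hidden".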
Chance (based on specific probabilities) plays an essential role in determining the exact alignment produced, by specifying each residue's position in the MSA
A profile HMM can produce a model of an MSA

HMM TRANSITIONS AND STATES
Transitions occur from any one state to another
A state may be a match that aligns a residue at a specific position in an alignment column
All transitions have an associated probability of converting to another state at each position
The sum of these probabilities must be equal to 1
Some states may also be silent
The end state has no outgoing transitions
The probabilities for all transition is [...]

[...] 1000 sequences MAFFT Large alignment >30,000 sequences

REFERENCES
Baxevanis, A. D. and Ouellette, B. F. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. 3rd ed. Wiley.
Pevsner, J. Bioinformatics and Functional Genomics. 2nd ed. (2009). Wiley and Sons.
http://www.tcoffee.org/Publications/Pdf/tcoffee.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC390337/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC135756/
