Unit-3 PDF
Document Details
Uploaded by TopsRationality8859
SRM Institute of Science and Technology
Tags
Summary
This document covers sequence alignment, including global and local pairwise alignments, database searching methods like BLAST and FASTA, and molecular phylogeny. It also provides definitions and introduces important concepts in bioinformatics, especially concerning protein structure and function.
Full Transcript
Unit- 3 Sequence alignment- Global Pairwise Alignment Algorithm- Solving problems- Local Pairwise Alignment Algorithm Database searching- BLAST- FASTA Multiple Sequence Alignment:- Progressive and Iterative Alignment- Tools for pairwise alignment- tools for multiple sequence alignment- A...
Unit- 3 Sequence alignment- Global Pairwise Alignment Algorithm- Solving problems- Local Pairwise Alignment Algorithm Database searching- BLAST- FASTA Multiple Sequence Alignment:- Progressive and Iterative Alignment- Tools for pairwise alignment- tools for multiple sequence alignment- Application of Multiple Sequence Alignment- Databases Of Multiple Alignment- Molecular Phylogeny- Methods of phylogeny- types of trees - Tools for phylogeny PAM and BLOSUM Sequence alignment introduction There is a long tradition in biology of comparative analysis leading to discovery. The aim of inferring structural, functional, and evolutionary relationships among the sequences under study. Why sequence alignment Lots of sequences with unknown structure and function vs. a few (but growing number) sequences with known structure and function If they align, they are “similar“ If they are similar, then they might have similar structure and/or function. Identify conserved patterns (motifs) If one of them has known structure/function, then alignment of other might yield insight about how the structure/functions works. Similar motif content might hint to similar function Define evolutionary relationships Importance of Sequence Comparison Protein Structure Prediction Similar sequence have similar structure & function Phylogenetic Tree Homology based protein structure prediction Genome Annotation Homology based gene prediction Function assignment & evolutionary studies Searching drug targets Searching sequence present or absent across genomes Some definitions 6 Protein Sequence Alignment and Database Searching Alignment of Two Sequences (Pair-wise Alignment) – The Scoring Schemes or Weight Matrices – Techniques of Alignments – DOTPLOT Multiple Sequence Alignment (Alignment of > 2 Sequences) –Extending Dynamic Programming to more sequences –Progressive Alignment (Tree or Hierarchical Methods) –Iterative Techniques Database Scanning – FASTA, BLAST, PSIBLAST, ISS Alignment of Whole Genomes – MUMmer (Maximal Unique Match) Pairwise alignment Used to find the best-matching piecewise local or global alignments of two query sequences. It can only be used between 2 sequences at a time Efficient to calculate and are often used for methods such as searching a database for sequences with high homology to a query. Pair-Wise Sequence Alignment Scoring Schemes or Weight Matrices Identity Scoring Genetic Code Scoring Chemical Similarity Scoring Observed Substitution or PAM Matrices BLOSUM: Matrix Derived from Ungapped Alignment Matrices Derived from Structure Techniques of Alignment Simple Alignment, Alignment with Gaps Application of DOTPLOT (Repeats, Inverse Repeats, Alignment) Dynamic Programming (DP) for Global Alignment Local Alignment (Smith-Waterman algorithm) Important Terms Gap Penalty (Opening, Extended) Similarity/Dissimilarity Score Significance Score (e.g. Z & E ) The Scoring Schemes or Weight Matrices For any alignment one need scoring scheme and weight matrix Important Point All algorithms to compare protein sequences rely on some scheme to score the equivalencing of each 210 possible pairs. 190 different pairs + 20 identical pairs Higher scores for identical/similar amino acids (e.g. A,A or I, L) Lower scores to different character (e.g. I, D) Identity Scoring Simplest Scoring scheme Score 1 for Identical pairs Score 0 for Non-Identical pairs Unable to detect similarity Percent Identity DNA scoring systems Sequence 1 ACTACCAGTTCATTTGATACTTCTCAAA | | | || Sequence 2 TACCATTACCGTGTTAACTGAAAGGACTTAAAGACT A C G T A 1 0 0 0 C 0 1 0 0 Match: 5 x 1 = 5 G 0 0 1 0 Mismatch: 19 x 0 = 0 T 0 0 0 1 Score: 5 The Scoring Schemes or Weight Matrices Genetic Code Scoring Fitch 1966 based on Nucleotide Base change required (0,1,2,3) Required to interconvert the codons for the two amino acids Rarely used nowadays Complication: „inexact“ is not binary (1|0) but something relative Amino acids have different physical and biochemical properties that are/are not important for function and thus influence their probability to be replaced in evolution The Scoring Schemes or Weight Matrices Chemical Similarity Scoring ❖Similarity based on Physio-chemical properties ❖MacLachlan 1972, Based on size, shape, charge and polar ❖Score 0 for opposite (e.g. E & F) and 6 for identical character The Scoring Schemes or Weight Matrices Observed Substitutions or PAM matrices ❖ Based on Observed Substitutions ❖Chicken and Egg problem ❖Dayhoff group in 1977 align sequence manually ❖Observed Substitutions or point mutation frequency ❖MATRICES are PAM30, PAM250, PAM100 etc AILDCTGRTG…… ALLDCTGR--…… SLIDCSAR-G…… AILNCTL-RG…… PAM (Percent Accepted Mutations) matrices Derived from global alignments of protein families. Family members sharing at least 85% identity (Dayhoff et al., 1978). Construction of phylogenetic tree and ancestral sequences of each protein family Computation of number of substitutions for each pair of amino acids How are substitution matrices generated ? Manually align protein structures (or, more risky, sequences) Look for frequency of amino acid substitutions at structurally constant sites. Entry -log(freq(observed/freq(expected)) + → more likely than random 0 → At random base rate - → less likely than random The Math Score matrix entry for time t given by: Conditional probability that a is s(a,b|t) = log P(b|a,t) substituted by b in time t qb Frequency of amino acid b PAM250 PAM Matrices: salient points Derived from global alignments of closely related sequences. Matrices for greater evolutionary distances are extrapolated from those for lesser ones. The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances. Does not take into account different evolutionary rates between conserved and non-conserved regions. The Scoring Schemes or Weight Matrices BLOSUM- Matrix derived from Ungapped Alignment Similar idea to PAM matrices Derived from Local Alignment instead of Global Blocks represent structurally conserved regions Henikoff and Henikoff derived matric from conserved blocks BLOSUM80, BLOSUM62, BLOSUM35 BLOSUM (Blocks Substitution Matrix) Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff, 1992) A A C E C Occurrences of each amino acid pair in A each column of each block alignment is A A - A = 1 C A - C = 4 counted E A - E = 2 The numbers derived from all blocks were C C - E = 2 C - C = 1 used to compute the BLOSUM matrices BLOSUM (Blocks Substitution Matrix) Sequences within blocks are clustered according to their level of identity Clusters are counted as a single sequence Different BLOSUM matrices differ in the percentage of sequence identity used in clustering The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix Greater numbers mean smaller evolutionary distance BLOSUM Matrices: Salient points Derived from local, ungapped alignments of distantly related sequences All matrices are directly calculated; no extrapolations are used – no explicit model The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances. The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches (Proteins 17:49). Protein scoring systems Sequence 1 PTHPLASKTQILPEDLASEDLTI |||||| | || || PTHPLAGERAIGLARLAEEDFGM Sequence 2 substitution matrix C S T P A G N D.. C 9 S -1 4 T:G = -2 T -1 1 5 T:T = 5 P -3 -1 -1 7 A 0 1 0 -1 4... G -3 0 -2 -2 0 6 Score = 48 N -3 1 0 -2 -2 0 5 D -3 0 -1 -1 -2 -1 1 6.. substitution (scoring) matrix displaying the score matrix blosum62... A R N D C Q E G H IGrouping L K of M side F chains P S by T charge, W Y polarity V B Z... X A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 di-sulphide bridges – important for protein structure Q -1 E -1 1 0 0 0 0 2 -3 -4 5 2 5 Exchange of D (Asp) by E (Glu) is „better“ G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 (both are negatively charged) than 8 often in reactive center I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 replacement e.g. by F (Phe) (aromatic) 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 C (Cys) makes disulphide bridges and 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 cannot be exchanged by other residue -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7Helix breaker – secondary structure S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 Both substrates for S/T kinases T 0 -1 0 -1 → high score of 9. -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 bulky aromatic Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 Different substitution matrices for different alignments more stringent less stringent BLOSUM matrices usually perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993) When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices For database searching the commonly used matrix (default) is BLOSUM62 The Scoring Schemes or Weight Matrices PET91: An Updated PAM matrix Matrices Derived from Structure Structure alignment is true/reference alignment Allow to compare distant proteins Risler 1988, derived from 32 protein structures Which Matrix one should use Matrices derived from Observed substitutions are better BLOSUM and Dayhoff (PAM) BLOSUM62 or PAM250 Alignment of Two Sequences Dealing Gaps in Pair-wise Alignment Sequence Comparison without Gaps Slide Windos method to got maximum score ALGAWDE ALATWDE Total score= 1+1+0+0+1+1+1=5 ; (PID) = (5*100)/7 Sequence with variable length should use dynamic programming Sequence Comparison with Gaps Insertion and deletion is common Slide Window method fails Generate all possible alignment 100 residue alignment require > 1075 PAM Vs. BLOSUM PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM250 More distant sequences = BLOSUM45 ⚫PAM120 for general use ⚫BLOSUM62 for general use ⚫PAM60 for close relations ⚫BLOSUM80 for close relations ⚫PAM250 for distant relations ⚫BLOSUM45 for distant relations TIPS on choosing a scoring matrix Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993). When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices. For database searching the commonly used matrix is BLOSUM62. Scoring Insertions and Deletions A T G T A A T G C A T A T G T G G A A T G A A T G T - - A A T G C A T A T G T G G A A T G A insertion / deletion The creation of a gap is penalized with a negative score value. Why Gap Penalties? Gaps not permitted Score: 0 1 GTGATAGACACAGACCGGTGGCATTGTGG 29 ||| | | ||| | || || | 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29 Match = 5 Mismatch = -4 Gaps allowed but not penalized Score: 88 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | ||| || | | || || | 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29 Why Gap Penalties? The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two - adding gaps reduces mismatches Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps. Scoring Insertions and Deletions match = 1 mismatch = 0 Total Score: 4 A T G T T A T A C T A T G T G C G T A T A Total Score: 8 - 3.2 = 4.8 A T G T - - - T A T A C Gap parameters: T A T G T G C G T A T A d = 3 (gap opening) e = 0.1 (gap extension) insertion / deletion g = 3 (gap length) (g)=-d –(g-1)e (g) = -3 - (3 -1) 0.1 = -3.2 Techniques of Alignment Simple Alignment, Alignment with Gaps Application of DOTPLOT (Repeats, Inverse Repeats, Alignment) Dynamic Programming (DP) for Global Alignment Local Alignment (Smith-Waterman algorithm) Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection. GCG-TCCATCAGGTAGTTGGTGTG GCGATCCATCAGGTGGTTGGTGTG 39 Advantages of manual alignment: (1) use of a powerful and trainable tool (the brain, well… some brains). (2) ability to integrate additional data, e.g., domain structure, biological function. 40 41 Protein Alignment may be guided by Secondary and Tertiary Structures Escherichia coli DjlA protein Homo sapiens DjlA protein 42 Disadvantages of manual alignment: The method is subjective and unscalable. 43 The dot-matrix method (Gibbs and McIntyre, 1970): The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot- matrix plot at a position where the nucleotides in the two sequences are identical. 44 The alignment is defined by a path from the upper- left element to the lower-right element. 45 There are 4 possible steps in the path: (1) a diagonal step through a dot = match. (2) a diagonal step through an empty element of the matrix = mismatch. (3) a horizontal step = a gap in the sequence on the top of the matrix. (4) a vertical step = a gap in the sequence on the left of the matrix. 46 Dot-matrix methods: Advantages: May unravel information on the evolution of sequences. Disadvantages: May not identify the best possible alignment. 47 Window size = 60 amino acids; Stringency = 24 matches Advantages: Highlighting Information The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene. 48 Window size = 60 amino acids; Stringency = 24 matches Advantages: Highlighting Information The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene. 49 Disadvantages: Not possible to identify the best alignment. 50 exercise TGTACTGCATGCCATTCGCAATT TGTACCTCATGCCATACGCAATT PROBLEM ACTGATTCA ACGCATCA COELACANTH PELICAN Gap penalty= -2 Score for mismatch= -3 Score for match = 2 Database Similarity Searching INTRODUCTION A main application of Pairwise alignment is retrieving biological sequences in databases based on similarity. It involves submission of a query sequence and performing a Pairwise comparison with all individual sequences in a database. Database similarity searching is Pairwise alignment on a large scale. The dynamic programming method is slow and impractical to use in most cases. Special search methods are needed to speed up the computational process of sequence comparison. UNIQUE REQUIREMENTS OF DATABASE SEARCHING There are unique requirements for implementing algorithms for sequence database searching. They are Sensitivity Selectivity or Specificity, Speed HEURISTIC DATABASE SEARCHING The heuristic algorithms perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming. Heuristic Methods: FASTA and BLAST FASTA ( FAST ALL) First fast sequence searching algorithm for comparing a query sequence against a database. BLAST Basic Local Alignment Search Technique improvement of FASTA: Search speed, ease of use, statistical rigor. 83 These methods are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster than dynamic programming. The increased computational speed comes at a moderate expense of sensitivity and specificity of the search, which is easily tolerated by working molecular biologists. Both BLAST and FASTA use a heuristic word method for fast pairwise sequence alignment. It works by finding short stretches of identical or nearly identical letters in two sequences. These short strings of characters are called words The basic assumption is that two related sequences must have at least one word in common. Basic idea: a good alignment contains subsequences of absolute identity (short lengths of exact matches): First, identify very short exact matches. Next, the best short hits from the first step are extended to longer regions of similarity. Finally, the best hits are optimized. FASTA (FAST ALL) FASTA (FAST ALL, http://www.ebi.ac.uk/Tools/sss/fasta/ ) FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985 It was in fact the first database similarity search tool developed, preceding the development of BLAST. FASTA uses a “hashing” strategy to find matches for a short stretch of identical residues with a length of k. The string of residues is known as ktuples or ktups, which are equivalent to words in BLAST, but are normally shorter than the words. Typically, a ktup is composed of two residues for protein sequences and six residues for DNA sequences. FASTA Algorithm Derived from logic of the dot plot compute best diagonals from all frames of alignment The method looks for exact matches between words in query and test sequence DNA words are usually 6 nucleotides long protein words are 2 amino acids long 89 Steps of the FASTA alignment procedure In step 1-all possible ungapped alignments are found between two sequences with the hashing method. In step 2- the alignments are scored according to a particular scoring matrix. Only the ten best alignments are selected. In step 3- the alignments in the same diagonal are selected and joined to form a single gapped alignment, which is optimized using the dynamic programming approach. FASTA Algorithm 92 Makes Longest Diagonal After all diagonals are found, tries to join diagonals by adding gaps Computes alignments in regions of best diagonals 93 FASTA Alignments 94 Statistical Significance Z-score : This describes the number of standard deviations from the mean score for the database search. Because most of the alignments with the query sequence are with unrelated sequences. The higher the Z-score for a reported match, the further away from the mean of the score distribution, hence, the more significant the match. For a Z-score > 15, the match can be considered extremely significant, with certainty of a homologous relationship. If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs. If Z < 5, their relationships is described as less certain. FASTA Results - Histogram !!SEQUENCE_LIST 1.0 (Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02 TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6 Searching with both strands of the query. Scoring matrix: GenRunData:fastadna.cmp Constant pamfactor used Gap creation penalty: 16 Gap extension penalty: 4 Histogram Key: Each histogram symbol represents 4 search set sequences Each inset symbol represents 1 search set sequences z-scores computed from opt scores z-score obs exp (=) (*) < 20 0 0: 22 0 0: 24 3 0:= 26 2 0:= 28 5 0:== 30 11 3:*== 32 19 11:==*== 34 38 30:=======*== 36 58 61:===============* 38 79 100:==================== * 40 134 140:==================================* 42 167 171:==========================================* 44 205 189:===============================================*==== 46 209 192:===============================================*===== 48 177 184:=============================================* 97 FASTA Results - List The best scores are: init1 initn opt z-sc E(1018780).. SW:PPI1_HUMAN Begin: 1 End: 269 ! Q00169 homo sapiens (human). phosph... 1854 1854 1854 2249.3 1.8e-117 SW:PPI1_RABIT Begin: 1 End: 269 ! P48738 oryctolagus cuniculus (rabbi... 1840 1840 1840 2232.4 1.6e-116 SW:PPI1_RAT Begin: 1 End: 270 ! P16446 rattus norvegicus (rat). pho... 1543 1543 1837 2228.7 2.5e-116 SW:PPI1_MOUSE Begin: 1 End: 270 ! P53810 mus musculus (mouse). phosph... 1542 1542 1836 2227.5 2.9e-116 SW:PPI2_HUMAN Begin: 1 End: 270 ! P48739 homo sapiens (human). phosph... 1533 1533 1533 1861.0 7.7e-96 SPTREMBL_NEW:BAC25830 Begin: 1 End: 270 ! Bac25830 mus musculus (mouse). 10,... 1488 1488 1522 1847.6 4.2e-95 SP_TREMBL:Q8N5W1 Begin: 1 End: 268 ! Q8n5w1 homo sapiens (human). simila... 1477 1477 1522 1847.6 4.3e-95 SW:PPI2_RAT Begin: 1 End: 269 ! P53812 rattus norvegicus (rat). pho... 1482 1482 1516 1840.4 1.1e-94 98 FASTA Results - Alignment SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58 >>GB_IN3:DMU09374 (2038 nt) initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58 66.2% identity in 875 nt overlap (83-957:151-1022) 60 70 80 90 100 110 u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC || ||| | ||||| | ||| ||||| DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC 130 140 150 160 170 180 120 130 140 150 160 170 u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA ||||||||| || ||| | | || ||| | || || ||||| || DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC 190 200 210 220 230 240 180 190 200 210 220 230 u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC ||| | ||||| || ||| |||| | || | |||||||| || ||| || DMU09374 AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC 250 260 270 280 290 300 240 250 260 270 280 290 u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC |||||||||| ||||| | |||||| |||| ||| || ||| || | DMU09374 AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT 310 320 330 340 350 360 99 FASTA on the Web Many websites offer FASTA searches Each server has its limits Be aware that you depend “on the kindness of strangers.” 100 Institut de Génétique Humaine, Montpellier France, GeneStream server http://www2.igh.cnrs.fr/bin/fasta-guess.cgi Oak Ridge National Laboratory GenQuest server http://avalon.epm.ornl.gov/ European Bioinformatics Institute, Cambridge, UK http://www.ebi.ac.uk/htbin/fasta.py?request EMBL, Heidelberg, Germany http://www.embl-heidelberg.de/cgi/fasta-wrapper-free Munich Information Center for Protein Sequences (MIPS) at Max-Planck-Institut, Germany http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html Institute of Biology and Chemistry of Proteins Lyon, France http://www.ibcp.fr/serv_main.html Institute Pasteur, France http://central.pasteur.fr/seqanal/interfaces/fasta.html GenQuest at The Johns Hopkins University http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html National Cancer Center of Japan http://bioinfo.ncc.go.jp 101 The Basic Local Alignment Search Tool (BLAST) The BLAST program was developed by Stephen Altschul of NCBI in 1990 and has since become one of the most popular programs for sequence analysis. BLAST uses heuristics to align a query sequence with all sequences in a database The objective is to find high-scoring ungapped segments among related sequences. The existence of ungapped segments above a given threshold which helps to discriminate related sequences from unrelated sequences in a database. BLAST benefits Speed User friendly Statistical rigor More sensitive BLAST Algorithm The resulting contiguous aligned segment pair without gaps is called high-scoring segment pair. In the original version of BLAST, the highest scored HSPs are presented as the final report. They are also called maximum scoring pairs. BLAST Statistics After an HSP is identified, it is important to determine whether this match is significant or not. Two BLAST statistics, The score (S) and The E-value (E) (expectation value) Score (S) The score (S) is a good measure of the quality of an alignment because it is calculated as the sum of substitution and gap scores for each aligned residue. E-value (expectation value) It indicates the probability that the resulting alignments from a database search are caused by random chance. BLAST compares a query sequence against all database sequences, and so the E-value is determined by the following formula: E=m×n×P ◦ where m is the total number of residues in a database, ◦ n is the number of residues in the query sequence, and ◦ P is the probability that an HSP alignment is a result of random chance. For example, query sequence of 100 residues to a database containing a total of 1012 residues Results in a P-value for the ungapped HSP region in one of the database matches of 1 × 1−20. The E-value: 100 × 1012 × 10−20 =10−6. It is expressed as 1e − 6 in BLAST output. This indicates that the probability of this database sequence match occurring due to random chance is 10−6. The E-value provides information about the likelihood that a given sequence match is purely by chance. The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is. Empirical interpretation of the E-value is as follows. ◦ If E < 1e − 50 (or 1 × 10−50), there should be an extremely high confidence that the database match is a result of homologous relationships. ◦ If E is between 0.01 and 1e − 50, the match can be considered a result of homology. ◦ If E is between 0.01 and 10, the match is considered not significant, but may hint at a tentative remote homology relationship. Additional evidence is needed to confirm the tentative relationship. ◦ If E > 10, the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection with the current method. Bit score (S’) The bit score measures sequence similarity independent of query sequence length and database size and is normalized based on the raw pairwise alignment score. The bit score (S’) is determined by the following formula S’ = (λ × S − lnK)/ ln2 where λ is the Gumble distribution constant, S is the raw alignment score, and K is a constant associated with the scoring matrix used. Clearly, the bit score (S) is linearly related to the raw alignment score (S). Thus, the higher the bit score, the more highly significant the match is. Variants There are many different BLAST programs available. 1. blastp: 2. blastn: 3. blastx: 4. tblastn 5. tblastx: 114 115 116 117 118 119 120 121 122 Choosing the right parameters 123 124 125 126 Controlling the output 127 128 129 130 131 More on BLAST NCBI Blast Glossary http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html Education: Blast Information http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html Steve Altschul's Blast Course http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html 132 COMPARISON OF FASTA AND BLAST The major difference is in the seeding step; BLAST FASTA BLAST uses a substitution FASTA identifies identical matrix to find matching matching words using the words, whereas. hashing procedure Faster Slower The use of low-complexity FASTA scans smaller masking in the BLAST window sizes. Thus, it gives procedure means that it may more sensitive results than have higher specificity than BLAST, with a better FASTA because potential coverage rate for homologs. false positives are reduced. FASTA returns only one final BLAST sometimes gives alignment. multiple best-scoring alignments from the same sequence; Multiple Sequence Alignment introdution A natural extension of pairwise alignment is multiple sequence alignment, which is to align multiple related sequences to achieve optimal matching of the sequences. As the process generates multiple matching sequence pairs, it is often necessary to convert the numerous pairwise alignments into a single alignment, which arranges sequences in such a way that evolutionarily equivalent positions across all sequences are matched. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. baboon venter Multiple Alignment versus Pairwise Alignment venter Up until now we have baboon only tried to align two sequences. What about more than chicken _1768316_tiger2 two? And what for? Multiple Alignment versus Pairwise Alignment Up until now we have only tried baboon venter to align two sequences. What about more than two? And what for? A faint similarity between two chicken _1768316_tiger2 sequences becomes significant if present in many Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal Why we do multiple alignments? Multiple nucleotide or amino sequence alignment techniques are usually performed to fit one of the following scopes : In order to characterize protein families, identify shared regions of homology in a multiple sequence alignment; (this happens generally when a sequence search revealed homologies to several sequences) It allows the identification of conserved sequence patterns and motifs in the whole sequence family, Many conserved and functionally critical amino acid residues can be identified in a protein multiple alignment. Multiple sequence alignment is also an essential prerequisite to carrying out phylogenetic analysis of sequence families and prediction of protein secondary and tertiary structures. Multiple sequence alignment also has applications in designing degenerate polymerase chain reaction (PCR) primers based on multiple related sequences. An example of Multiple Alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWWSNG-- EXHAUSTIVE ALGORITHMS The exhaustive alignment method involves examining all possible aligned positions simultaneously. Similar to dynamic programming in Pairwise alignment, which involves the use of a two-dimensional matrix to search for an optimal alignment, to use dynamic programming for multiple sequence alignment, extra dimensions are needed to take all possible ways of sequence matching into consideration. This means to establish a multidimensional search matrix. Back-tracking is applied through the three-dimensional matrix to find the highest scored path that represents the optimal alignment. For this reason, full dynamic programming is limited to small datasets of less than ten short sequences DCA (Divide-and-Conquer Alignment) DCA (divide-and-conquer alignment, http://bibiserv.Techfak.Uni-bielefeld.De/ dca/) is a web-based program that is in fact semi exhaustive because certain steps of computation are reduced to heuristics. It works by breaking each of the sequences into two smaller sections. The breaking points are determined based on regional similarity of the sequences Dynamic programming is applied for aligning each set of subsequences. The resulting short alignments are joined together head to tail to yield a multiple alignment of the entire length of all sequences. It performs global alignment and requires the input sequences to be of similar lengths and domain structures. Despite the use of heuristics, the program is still extremely computationally intensive and can handle only datasets of a very limited number of sequences. HEURISTIC ALGORITHMS The heuristic algorithms fall into three categories: progressive alignment type, iterative alignment type and block-based alignment type. Progressive Alignment It is multiple sequence alignment strategy that uses a stepwise approach to assemble an alignment. It first performs all possible pairwise alignments using the dynamic programming approach Determines the relative distances between each pair of sequences to construct a distance matrix. which is subsequently used to build a guide tree. It then realigns the two most closely related sequences using the dynamic programming approach. Other sequences are progressively added to the alignment according to the degree of similarity suggested by the guide tree. Clustal The most well-known progressive alignment program is Clustal. Clustal (www.ebi.ac.uk/clustalw/) is a progressive multiple alignment program available either as a stand-alone or on- line program. The stand-alone program, which runs on UNIX and Macintosh, has two variants, ClustalW- a simple text-based interface ClustalX- user-friendly graphical interface. ClustalW- for multiple alignment ClustaW is a general purpose multiple alignment program for DNA or proteins. ClustalW is produced by Julie D. Thompson, Toby Gibson of European Molecular Biology Laboratory, Germany and Desmond Higgins of European Bioinformatics Institute, Cambridge, UK. Algorithmic ClustalW is cited: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680. ClustalW- for multiple alignment ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogentic trees. Alignment can be done by 2 methods: - slow/accurate - fast/approximate Running ClustalW [~]% clustalw ************************************************************** ******** CLUSTAL W (1.7) Multiple Sequence Alignments ******** ************************************************************** 1. Sequence Input From Disc 2. Multiple Alignments 3. Profile / Structure Alignments 4. Phylogenetic trees S. Execute a system command H. HELP X. EXIT (leave program) Your choice: Running ClustalW The input file for clustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF. Using ClustalW ****** MULTIPLE ALIGNMENT MENU ****** 1. Do complete multiple alignment now (Slow/Accurate) 2. Produce guide tree file only 3. Do alignment using old guide tree file 4. Toggle Slow/Fast pairwise alignments = SLOW 5. Pairwise alignment parameters 6. Multiple alignment parameters 7. Reset gaps between alignments? = OFF 8. Toggle screen display = ON 9. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice: Output of ClustalW CLUSTAL W (1.7) multiple sequence alignment HSTNFR GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG SYNTNFTRP GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG CFTNFA -------------------------------------------TGTCCAG------ACAG CATTNFAA GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC RABTNFM AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC RNTNFAA AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC OATNFA1 GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC OATNFAR GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC BSPTNFA GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC CEU14683 GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC ** * ClustalW options Your choice: 5 ********* PAIRWISE ALIGNMENT PARAMETERS ********* Slow/Accurate alignments: 1. Gap Open Penalty :15.00 2. Gap Extension Penalty :6.66 3. Protein weight matrix :BLOSUM30 4. DNA weight matrix :IUB Fast/Approximate alignments: 5. Gap penalty :5 6. K-tuple (word) size :2 7. No. of top diagonals :4 8. Window size :4 9. Toggle Slow/Fast pairwise alignments = SLOW H. HELP Enter number (or [RETURN] to exit): ClustalX - Multiple Sequence Alignment Program ClustalX provides a new window-based user interface to the ClustalW program. It uses the Vibrant multi-platform user interface development library, developed by the National Center for Biotechnology Information (Bldg 38A, NIH 8600 Rockville Pike,Bethesda, MD 20894) as part of their NCBI SOFTWARE DEVELOPEMENT TOOLKIT. ClustalX ClustalX ClustalX ClustalX ClustalX ClustalX ClustalW Popular multiple alignment tool today ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently). Three-step process 1.) Construct pairwise alignments 2.) Build Guide Tree 3.) Progressive Alignment guided by the tree Step 1: Pairwise Alignment Aligns each sequence again each other giving a similarity matrix Similarity = exact matches / sequence length (percent identity) v1 v2 v3 v4 v1 - v2.17 - v3.87.28 - v4.59.33.62 - (.17 means 17 % identical) Step 2: Guide Tree Create Guide Tree using the similarity matrix ClustalW uses the neighbor-joining method Guide tree roughly reflects evolutionary relations Step 2: Guide Tree (cont’d) v1 v1 v2 v3 v4 v3 v1 - v4 v2.17 - v2 v3.87.28 - v4.59.33.62 - Calculate: v1,3 = alignment (v1, v3) v1,3,4 = alignment((v1,3),v4) v1,2,3,4 = alignment((v1,3,4),v2) Step 3: Progressive Alignment Start by aligning the two most similar sequences Following the guide tree, add in the next sequences, aligning to the existing alignment Insert gaps as necessary FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ.. : **. :.. *:.* *. * **: Dots and stars show how well-conserved a column is. Iterative Alignment The iterative approach is based on the idea that an optimal solution can be found by repeatedly modifying existing suboptimal solutions. The procedure starts by producing a low-quality alignment and gradually improves it by iterative realignment through well-defined procedures until no more improvements in the alignment scores can be achieved. However, this method is also heuristic in nature and does not have guarantees for finding the optimal alignment. An example of iterative alignment PRRN (http://www.genome.jp/tools/prrn/ )is a web-based program that uses a double nested iterative strategy for multiple alignments. It performs multiple alignments through two sets of iterations: In the outer iteration, an initial random alignment is generated that is used to derive a UPGMA tree. (unweighted pair group method using arithmetic average).Weights are subsequently applied to optimize the alignment. In the inner iteration, the sequences are randomly divided into two groups. Randomized alignment is used for each group in the initial cycle, after which the alignment positions in each group are fixed. The two groups, each treated as a single sequence, are then aligned to each other using global dynamic programming. The process is repeated through many cycles until the total SP score no longer increases. At this point, the resulting alignment is used to construct a new UPGMA tree. New weights are applied to optimize alignment scores. The newly optimized alignment is subject to further realignment in the inner iteration. This process is repeated over many cycles until there is no further improvement in the overall alignment scores Block-Based Alignment The progressive and iterative alignment strategies fail to recognize conserved domains and motifs among highly divergent sequences of varying lengths. For such divergent sequences that share only regional similarities, a local alignment based approach has to be used. The strategy identifies a block of ungapped alignment shared by all the sequences, block-based alignment programs DIALIGN2 ( http://bioweb.pasteur.fr/seqanal/interfaces/dialign2.html ) Match-Box ( www.sciences.fundp.ac.be/biologie/bms/matchbox ) submit.shtml DIALIGN2 It is a web based program designed to detect local similarities. It does not apply gap penalties and thus is not sensitive to long gaps. The method breaks each of the sequences down to smaller segments and performs all possible pairwise alignments between the segments. High-scoring segments, called blocks, among different sequences are then compiled in a progressive manner to assemble a full multiple alignment. It places emphasis on block-to-block comparison rather than residue-to-residue comparison. The sequence regions between the blocks are left unaligned. The program has been shown to be especially suitable for aligning divergent sequences with only local similarity. Match-Box It is also a web-based server that also aims to identify conserved blocks (or boxes) among sequences. The program compares segments of every nine residues of all possible pairwise alignments. If the similarity of particular segments is above a certain threshold across all sequences, they are used as an anchor to assemble multiple alignments; residues between blocks are unaligned. The server requires the user to submit a set of sequences in the FASTA format and the results are returned by e-mail. Molecular Phylogeny Phylogenetics is the science of estimating and analyzing evolutionary relationships. Phylogenetic relationships among micro-organisms are especially difficult to discern. Molecular biology often helps in determining genetic relationships between different organisms. Nucleic acids (DNA and RNA) and proteins are 'information molecules' in that they retain a record of an organism's evolutionary history. The approach is to compare nucleic acid or protein sequences from different organisms using computer programs and estimate the evolutionary relationships based on the degree of homology between the sequences. The nucleotide or amino acid differences within a gene reflect the evolutionary distance between two organisms. Closely related organisms will exhibit fewer sequence differences than distantly related organisms. Small-subunit ribosomal RNA (rRNA) is widely used in molecular phylogeny. One advantage of the molecular approach in determining phylogenetic relationships over the more classical approaches, such as those based on morphology or life cycle traits, is that the differences are readily quantifiable. Sequences from different organisms can be compared and the number of differences can be established. These data are often expressed in the form of 'trees' in which the positions and lengths of the 'branches' depict the relatedness between organisms. Methods of phylogeny 1. Distance-Based Methods 1. Distance-Based Methods Neighbor-Joining (NJ): Constructs trees by minimizing the total branch length using pairwise distance data between taxa. UPGMA (Unweighted Pair Group Method with Arithmetic Mean): A simple method assuming a constant rate of evolution (molecular clock hypothesis), forming a rooted tree based on pairwise distances. 2. Character-Based Methods Maximum Parsimony (MP): Selects the tree that minimizes the total number of evolutionary changes (mutations) needed to explain the observed data. Maximum Likelihood (ML): Finds the tree that maximizes the probability of the observed data under a given model of evolution. It considers branch lengths and substitution rates. Bayesian Inference: A probabilistic approach that estimates the posterior distribution of trees based on prior knowledge and the likelihood of the observed data. 3. Clustering-Based Methods Hierarchical Clustering: Groups sequences into clusters without a strict evolutionary model, producing dendrograms that resemble phylogenetic trees but may not reflect true evolutionary relationships. 4. Network-Based Methods Split Networks: Can handle more complex relationships, such as hybridization or gene flow, which are not easily represented by a simple tree. Phylogenetic trees represent the evolutionary relationships among various species or organisms. There are several types of phylogenetic trees, each based on how the data is structured and interpreted: 1. Cladogram 2. Phylogram 3. Ultrametric Tree (or Chronogram) 4. Dendrogram 5. Rooted Tree 6. Unrooted Tree 7. Network Phylogeny 1. Cladogram Structure: Branch lengths are not scaled to evolutionary time. Purpose: Shows the relative order of branching, reflecting shared common ancestors. Use: Focuses on the relationship between clades (groups sharing a common ancestor). These relationships are based on observable physical characteristics. Cladograms show the relationships in a graphic that looks like a tree, with branches connected to a common ancestry.. Cladogram of birds 2. Phylogram Structure: Branch lengths are proportional to the amount of evolutionary change (e.g., mutations). Purpose: Reflects the amount of evolutionary divergence, showing both relationships and genetic differences. Phylogram of Costasiella 3. Ultrametric Tree (or Chronogram) Structure: Branch lengths are scaled to represent actual time. Purpose: Illustrates the timing of evolutionary events. All tips (leaves) are equidistant from the root, indicating that species diverged at the same time. 4. Dendrogram Structure: General term for tree-like diagrams. Purpose: Used to illustrate clustering or hierarchical relationships, but less specific than a phylogenetic tree. The dendrogram below shows the hierarchical clustering of six observations shown on the scatterplot to the left. 5. Rooted Tree Structure: Contains a single common ancestor (root), showing the direction of evolutionary time or ancestry. Purpose: Depicts an ancestral lineage from which all other organisms in the tree have descended. 6. Unrooted Tree Structure: Does not indicate a common ancestor, only shows relationships between organisms. Purpose: Focuses on relative relatedness without suggesting a time-based evolutionary path. 7. Network Phylogeny Structure: Represents complex evolutionary histories involving hybridization, horizontal gene transfer, or recombination. Purpose: Accounts for non-tree-like evolutionary events that can't be represented in traditional trees. Each type serves different purposes depending on the focus of the evolutionary study, such as time, divergence, or genetic differences. Tools of phylogeny Some of the tools used for phylogeny Canopy , CGRphyl, CITUP, ClustalW, CoalEvol, CodABC, Dendroscope, EXACT, EzEditor, fastDNAml ,FastTree Fitmodel ,Geneious, HyPhy, IQPNNI, IQ-TREE, jModelTest, JolyTree, LisBeth, MEGA ,MegAlign Pro Thank you