Pairwise Sequence Alignment PDF
Document Details
Tags
Summary
This document provides an overview of pairwise sequence alignment, focusing on the concepts of sequence homology, orthologs, and paralogs. It discusses how sequence comparison is informative for identifying evolutionary and functional relationships between biological sequences. It introduces different alignment methods, such as global and local alignments, and mentions their applications in bioinformatics.
Full Transcript
Pairwise sequence alignment SIO1003 | Bioinformatics Concepts SEQUENCE HOMOLOGY SIO1003 | Bioinformatics Concepts Sequence comparison is most informative when it detects homologs Homologs are sequences that have common origins i.e. they share a common ancestor...
Pairwise sequence alignment SIO1003 | Bioinformatics Concepts SEQUENCE HOMOLOGY SIO1003 | Bioinformatics Concepts Sequence comparison is most informative when it detects homologs Homologs are sequences that have common origins i.e. they share a common ancestor They may or may not have common activity Common Time Ancestor CTCGTTA Can be used to establish evolutionary Recent relationships Species CACTGTA CATGTTA SIO1003 | Bioinformatics Concepts Key terms When we talk about related sequences, we use specific terminology. Homologous sequences may be either: Orthologs or Paralogs (Note. these are all or nothing relationships!) Any pair of sequences may share a certain level of: Identity and/or Similarity (Note. if these metrics are above a certain level we often infer homology) SIO1003 | Bioinformatics Concepts Orthologs Orthologs tend to have similar function Orthologs: are homologs produced by speciation that have diverged due to divergence of the organisms they are associated with. – Ortho = [greek: straight]... implies direct descent Common Time Ancestor CTCGTTA Speciation Recent Species CACTGTA CATGTTA SIO1003 | Bioinformatics Concepts Paralogs Paralogs tend to have slightly different functions Paralogs: are homologs produced by gene duplication. They represent genes derived from a common ancestral gene that duplicated within an organism and then subsequently diverged by accumulated mutation. – Para = [greek: along side of] Single Species CTCGTTA Duplication CTCGTTA CACGTTA Divergence CACTGTA CATGTTA SIO1003 | Bioinformatics Concepts Orthologs and paralogs are often viewed in a single tree Homologs Paralogs Orthologs Orthologs Frog 𝛂 Chick 𝛂 Mouse 𝛂 Mouse 𝛃 Chick 𝛃 Frog 𝛃 𝛂-chain gene 𝛃-chain gene Gene Duplication Early Gene of Interest SIO1003 | Bioinformatics Concepts Orthologs vs Paralogs In practice, determining ortholog vs paralog can be a complex problem: gene loss after duplication, lack of knowledge of evolutionary history, weak similarity because of evolutionary distance Homology does not necessarily imply exact same function may have similar function at very crude level but play a different physiological role SIO1003 | Bioinformatics Concepts Homologous proteins. Share very similar 3D structure. Pairwise sequence alignment shows very limited amino acid identity SIO1003 | Bioinformatics Concepts Myoglobin orthologs Human and rat myoglobin both function to transfer oxygen to muscle cells SIO1003 | Bioinformatics Concepts Human globin paralogs All globins have distinct but related functions: as oxygen carrier proteins SIO1003 | Bioinformatics Concepts hemoglobin subunit beta [Homo sapiens] NCBI Reference Sequence: NP_000509.1 >NP_000509.1 hemoglobin subunit beta [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKK VLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNV LVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH myoglobin [Homo sapiens] NCBI Reference Sequence: NP_005359.1 >NP_005359.1 myoglobin [Homo sapiens] MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGAT VLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQ VLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG SIO1003 | Bioinformatics Concepts hemoglobin subunit alpha [Homo sapiens] (alpha 1) NCBI Reference Sequence: NP_000549.1 >NP_000549.1 hemoglobin subunit alpha [Homo sapiens] MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALT NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTS KYR hemoglobin subunit alpha [Homo sapiens] (alpha 2) NCBI Reference Sequence: NP_000508.1 >NP_000508.1 hemoglobin subunit alpha [Homo sapiens] MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALT NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTS KYR SIO1003 | Bioinformatics Concepts Sequence changes during evolution There are three major types of sequence change that can occur during evolution. Mutations/Substitutions Deletions Insertions (C) CTCGTTA (B) (A) CATGTTA CACTGTA SIO1003 | Bioinformatics Concepts Mutations, deletions and insertions There are three major types of sequence change that can occur during evolution. Mutations/Substitutions Deletions CTCGTTA CACGTTA Insertions CTCGTTA Mutation CACGTTA CATGTTA CACTGTA Likely occurred prior to speciation SIO1003 | Bioinformatics Concepts Mutations, deletions and insertions There are three major types of sequence change that can occur during evolution. Mutations/Substitutions Deletions CTCGTTA CACGTTA Insertions CTCGTTA Mutation CACGTTA CACGTTA CACGTTA (speciation) CATGTTA CACTGTA SIO1003 | Bioinformatics Concepts Mutations, deletions and insertions There are three major types of sequence change that can occur during evolution. Mutations/Substitutions Deletions CTCGTTA CACGTTA Insertions CACGTTA CACTTA CTCGTTA Mutation CACGTTA CACGTTA X CACGTTA Deletion CACTTA CATGTTA CACTGTA SIO1003 | Bioinformatics Concepts Mutations, deletions and insertions There are three major types of sequence change that can occur during evolution. Mutations/Substitutions Deletions CTCGTTA CACGTTA Insertions CACGTTA CACTTA CACTTA CACTGTA CTCGTTA Mutation CACGTTA CACGTTA X CACGTTA Deletion CACTTA CATGTTA CACTGTA Insertion SIO1003 | Bioinformatics Concepts Mutations, deletions and insertions There are three major types of sequence change that can occur during evolution. Mutations/Substitutions Deletions CTCGTTA CACGTTA Insertions CACGTTA CATGTTA CTCGTTA Mutation CACGTTA CACGTTA CACGTTA Mutation CATGTTA CACTGTA SIO1003 | Bioinformatics Concepts Alignment view Alignments are great tools to visualize sequence similarity and evolutionary changes in homologous sequences. Mismatches represent mutations/substitutions Gaps represent insertions and deletions (indels) (C) Substitution Indels CTCGTTA (A) CAC-TGTA ||: | || (B) CATGT-TA (B) (A) Match Mismatch Gap CATGTTA CACTGTA 5 1 2 SIO1003 | Bioinformatics Concepts Alternative alignments Unfortunately, finding the correct alignment is difficult if we do not know the evolutionary history of the two sequences There are many possible alignments Which alignment is best? CACTGTA CACTGT-A CAC-TGTA ||:::|| || ||| | ||: | || CATGTTA CA-TGTTA CATGT-TA SIO1003 | Bioinformatics Concepts Alternative alignments One way to judge alignments is to compare their number of matches, insertions, deletions and mutations 4 matches 6 matches 5 matches 3 mismatches 0 mismatches 1 mismatches 0 gaps 2 gaps 2 gaps CACTGTA CACTGT-A CAC-TGTA ||:::|| || ||| | ||: | || CATGTTA CA-TGTTA CATGT-TA SIO1003 | Bioinformatics Concepts Scoring alignments We can assign a score for each match (+3), mismatch (+1) and indel (-1) to identify the optimal alignment for this scoring scheme 4 (+3) 6 (+3) 5 (+3) 3 (+1) 0 (+1) 1 (+1) 0 (-1) = 15 2 (-1) = 16 2 (-1) = 14 CACTGTA CACTGT-A CAC-TGTA ||:::|| || ||| | ||: | || CATGTTA CA-TGTTA CATGT-TA SIO1003 | Bioinformatics Concepts Optimal alignments Biologists often prefer parsimonious alignments, where the number of postulated sequence changes is minimized. 4 matches 6 matches 5 matches 3 mismatches 0 mismatches 1 mismatches 0 gaps 2 gaps 2 gaps CACTGTA CACTGT-A CAC-TGTA ||:::|| || ||| | ||: | || CATGTTA CA-TGTTA CATGT-TA SIO1003 | Bioinformatics Concepts Optimal alignments Biologists often prefer parsimonious alignments, where the number of postulated sequence changes is minimized. Warning: There may be more than one optimal alignment and these may not reflect the true evolutionary history of our sequences! 4 matches 6 matches 5 matches 3 mismatches 0 mismatches 1 mismatches 0 gaps 2 gaps 2 gaps CACTGTA CACTGT-A CAC-TGTA ||:::|| || ||| | ||: | || CATGTTA CA-TGTTA CATGT-TA SIO1003 | Bioinformatics Concepts Sequence identity and similarity Two commonly quoted metrics for pairs of aligned sequences. Sequence identity: typically quotes the percent of identical characters in the aligned region of two sequences Sequence similarity: typically the score resulting from optimal pair-wise alignment (note dependence on parameters used: i.e. scoring scheme) In contrast, homology is an all or nothing relationship, you can not have a percent homology! SIO1003 | Bioinformatics Concepts Sequence identity and similarity High sequence similarity is frequently used as an indicator of homology Use to find genes and/or proteins with potentially similar or identical function Can query a database of sequences by performing a series of pair-wise alignments Knowledge of the difference between sequences can also yield valuable functional and mechanistic insights A gene from a normal and an affected subject – possible cause of a heritable disease Similar proteins with different substrate specificities – what amino acid changes might be responsible for this? SIO1003 | Bioinformatics Concepts Pairwise sequence alignment SIO1003 | Bioinformatics Concepts Sequence Alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. SIO1003 | Bioinformatics Concepts Why compare biological sequences? To obtain functional or mechanistic insight about a sequence by inference from another potentially better characterized sequence To find whether two (or more) genes or proteins are evolutionarily related To find structurally or functionally similar regions within sequences (e.g. catalytic sites, binding sites for other molecules, etc.) Many practical bioinformatics applications… SIO1003 | Bioinformatics Concepts Applications of sequence alignment Similarity searching of databases Protein structure prediction, annotation, etc... Assembly of sequence reads into a longer construct such as a genomic sequence Mapping sequencing reads to a known genome "Resequencing", looking for differences from reference genome - SNPs, indels (insertions or deletions) Mapping transcription factor binding sites via ChIP-Seq (chromatin immuno-precipitation sequencing) Pretty much all next-gen sequencing data analysis Pairwise sequence alignment is arguably the most fundamental operation of bioinformatics! SIO1003 | Bioinformatics Concepts Pairwise Sequence Alignment Objective: arrange two sequences in such a fashion that pairs of matching characters between the two sequences are maximized Match does not have to be identity, can be defined by a function that ranks or scores the characters being compared (often termed a substitution matrix) Ungapped alignment example – bars indicate matching characters Seq1: GTAATCTG- ||| ||| Seq2: -TAAGCTGA SIO1003 | Bioinformatics Concepts Simplest case – brute force alignments In the simplest case we can simply slide one sequence across the other and count matching characters for each possible alignment Chose a scoring scheme and do not allow internal gaps within sequences Algorithmic complexity is linear N + M alignments to consider (where N and M are the length of each sequence) SIO1003 | Bioinformatics Concepts GTAATCTG GTAATCTG Brute Force Alignment, | | No Gaps TTAAGCTGA TTAAGCTGA GTAATCTG GTAATCTG | ||| ||| TTAAGCTGA TTAAGCTGA GTAATCTG GTAATCTG | | TTAAGCTGA TTAAGCTGA GTAATCTG GTAATCTG | | TTAAGCTGA TTAAGCTGA GTAATCTG GTAATCTG | | TTAAGCTGA TTAAGCTGA GTAATCTG GTAATCTG | TTAAGCTGA TTAAGCTGA Etc... SIO1003 | Bioinformatics Concepts Gaps make the brute force method unusable for all but the shortest sequences Pairs of related sequences often have insertions or deletions relative to one-another, we therefore require gapped pair-wise alignment Need to generate all the possible gap lengths and combinations of gaps at all possible positions in both sequences For two sequences of equal length, the formula is: 2𝑁 ! 2!" N = 10: 184756 2𝑁 N = 50: ~1.00E29 = ! ≅ 𝑁 𝑁! 𝜋𝑁 N = 250: ~1.17E149 SIO1003 | Bioinformatics Concepts Three general solutions to the alignment problem 1. The dot plot or dot matrix approach A simple graphical method for pair-wise alignment No scoring, so difficult to compare alternative alignments Can give visual clues to sequence structure but requires human interaction 2. Dynamic programming algorithms Provides Optimal solutions (but not necessarily unique solutions) 3. Heuristic word or k-tuple approaches Much faster (e.g. BLAST and FASTA) Widely used for database searches May miss some pairs with low similarity SIO1003 | Bioinformatics Concepts 1. The dot plot or dot matrix approach SIO1003 | Bioinformatics Concepts Dot plots: simple graphical approach Place one sequence on the vertical axis of a 2D grid (or matrix) and the other on the horizontal A C G C G A C A C G SIO1003 | Bioinformatics Concepts Dot plots: simple graphical approach Now simply put dots where the horizontal and vertical sequence values match A C G C G A C A C G SIO1003 | Bioinformatics Concepts Dot plots: simple graphical approach Diagonal runs of dots indicate matched segments of sequence A C G C G A C A C G SIO1003 | Bioinformatics Concepts Dot plots: simple graphical approach Q. What would the dot matrix of a two identical sequences look like? A C G C G A C G C G SIO1003 | Bioinformatics Concepts Pairwise alignment with dot plots A human globin searched against itself produces a unit diagonal on a dot plot (NCBI BLASTP, aligning 2 sequences). SIO1003 | Bioinformatics Concepts Dot plots: simple graphical approach Identical Highly similar SIO1003 | Bioinformatics Concepts Dot plots: simple graphical approach Different but related sequences SIO1003 | Bioinformatics Concepts Dot plots: simple graphical approach Dot matrices for long sequences can be noisy SIO1003 | Bioinformatics Concepts Dot plots: window size and match stringency Solution: use a window and a threshold compare character by character within a window require certain fraction of matches within window in order to display it with a dot. You have to choose window size and stringency Filter Window = 3 Stringency = 3 SIO1003 | Bioinformatics Concepts Dot plots: window size and match stringency Solution: use a window and a threshold compare character by character within a window require certain fraction of matches within window in order to display it with a dot. You have to choose window size and stringency Filter Window = 3 Stringency = 2 SIO1003 | Bioinformatics Concepts Window size = 5 bases A dot plot simply puts a dot where two sequences match. In this example, dots are placed in the plot if 5 bases in a row match perfectly. Requiring a 5 base perfect match is a heuristic – only look at regions that have a certain degree of identity. SIO1003 | Bioinformatics Concepts Window size = 7 bases This is a dot plot of the same sequence pair. Now 7 bases in a row must match for a dot to be placed. Noise is reduced. Using windows of a certain length is very similar to using words (kmers) of N characters in the heuristic alignment search tools Bigger window (kmer) fewer matches to consider SIO1003 | Bioinformatics Concepts Ungapped alignments Only diagonals can be followed Downward or rightward paths represent insertion or deletions (gaps in one sequence or the other) indels SIO1003 | Bioinformatics Concepts Global alignments Global alignments go from end to end, i.e. from the upper left corner to the lower right corner. Global alignments do not have good statistical characterization and are not used for database searches. More on this later SIO1003 | Bioinformatics Concepts Uses for dot matrices Visually assessing the similarity of two protein or two nucleic acid sequences Finding local repeat sequences within a larger sequence by comparing a sequence to itself Repeats appear as a set of diagonal runs stacked vertically and/or horizontally SIO1003 | Bioinformatics Concepts Finding repeats (a) Human LDL receptor protein sequence (GenBank P01130) W =1 S =1 (Figure 3.6 from Mount, “Bioinformatics: sequence and genome analysis) SIO1003 | Bioinformatics Concepts Finding repeats (b) Human LDL receptor protein sequence (GenBank P01130) W = 23 S =7 (Figure 3.6 from Mount, “Bioinformatics: sequence and genome analysis) SIO1003 | Bioinformatics Concepts Finding repeats SIO1003 | Bioinformatics Concepts Finding repeats Search human cytoglobin (190 aa) against a large snail globin (2148 aa) having many globin repeats. More repeats are observed using PAM250 than BLOSUM62. To “read” this plot note that cytoglobin (x-axis) matches the snail globin (y-axis) at about a dozen locations across the snail protein. Red arrows indicate that the first few and last few amino acids of cytoglobin do not participate in this repeat structure. SIO1003 | Bioinformatics Concepts Finding repeats BLASTP output includes the various sequence alignments. One is shown here: human cytoglobin (residues 18-154) aligns to the snail globin (at residues 1529- 1669). The expect value is convincing (4e-13), and this is one of a dozen sequence alignments. Conclusion: the dot plot is an excellent way to visualize complex repeats. SIO1003 | Bioinformatics Concepts Dots can have “weights” Some matches can be rewarded more than others, depending on likelihood Use PAM or BLOSUM substitution matrix (more on these later) Put a dot only if a minimum total or average weight is achieved SIO1003 | Bioinformatics Concepts 2. Dynamic programming algorithms SIO1003 | Bioinformatics Concepts The Dynamic Programming Algorithm The dynamic programming algorithm can be thought of an extension to the dot plot approach One sequence is placed down the side of a grid and another across the top Instead of placing a dot in the grid, we compute a score for each position Finding the optimal alignment corresponds to finding the path through the grid with the highest possible score Seq2 Seq2 Seq1 Seq1 SIO1003 | Bioinformatics Concepts Four possible outcomes in aligning two sequences identity (stay along a diagonal) mismatch (stay along a diagonal) gap in one sequence (move vertically!) gap in the other sequence (move horizontally!) SIO1003 | Bioinformatics Concepts Different paths represent different alignments Seq1: D P L E Seq1: D P L E Seq1: D P M E Seq1: D P - E | | | | | | : | | | | | | | Seq2: D P L E Seq2: D P M E Seq2: D P - E Seq2: D P L E Matches are represented by diagonal paths and indels with horizontal or vertical path segments SIO1003 | Bioinformatics Concepts Alignment methods Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. SIO1003 | Bioinformatics Concepts Alignment methods A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like dynamic programming. These also include efficient, heuristic algorithms or probabilistic methods designed for large-scale database search, that do not guarantee to find best matches. Many algorithms but generally: Incremental positive scores- identical residues Negative scores- for gaps and substitutions SIO1003 | Bioinformatics Concepts Alignment methods Calculating a global alignment is a form of global optimization , which Global attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size By contrast, local alignments identify regions of similarity within long Local sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. SIO1003 | Bioinformatics Concepts Global alignments “Forces" the alignment to span the entire length of all query sequences. Attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming. The Needleman–Wunsch algorithm was the first application of dynamic programming for biological sequence comparison. It is sometimes referred to as the Optimal matching algorithm Global FTFTALILLAVAV F--TAL-LLA-AV Local FTFTALILL-SVSV --FTAL-LLAAV-- SIO1003 | Bioinformatics Concepts Local alignments Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method also based on dynamic programming. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. Global FTFTALILLAVAV F--TAL-LLA-AV Local FTFTALILL-SVSV --FTAL-LLAAV-- SIO1003 | Bioinformatics Concepts Local alignments If only using global alignment – some important matches are not detected E.g. Local alignment must be used when comparing spliced mRNA sequences with original genomic sequences With sufficiently similar sequences, there is no difference between local and global alignments. SIO1003 | Bioinformatics Concepts Local alignment (bottom) includes matches ignored by global alignment (top) Bacterial globin family: NP_824492 Streptomyces avermitilis NP_337032 Myobacterium tuberculosis Global: 14.7% identity 39/266 22.6% similarity 51.9% gaps Local: 30% identity 35/115 SIO1003 | Bioinformatics Concepts Algorithm of Needleman and Wunsch Two sequences can be compared in a matrix along x- and y-axes. If they are identical, a path along a diagonal can be drawn Find the optimal subpaths, and add them up to achieve the best score. This involves adding gaps when needed allowing for conservative substitutions choosing a scoring system (simple or complicated) N-W is guaranteed to find optimal alignment(s) SIO1003 | Bioinformatics Concepts Algorithm of Needleman and Wunsch The Needleman–Wunsch approach to global sequence alignment has three basic steps: 1) setting up a 2D-grid (or alignment matrix), 2) scoring the matrix, and 3) identifying the optimal path through the matrix (1) (2) (3) Needleman, S.B. & Wunsch, C.D. (1970) “A general method applicable to the search for similarities in the amino acid sequences of two proteins.” J. Mol. Biol. 48:443-453. SIO1003 | Bioinformatics Concepts Scoring the alignment matrix To find the best alignment, we retrace the arrows starting from the bottom right cell (called the trace-back procedure) The optimal alignment score and alignment are dependent on the chosen scoring system - D P L E Scores: match = +1, mismatch = -1, indel = -2 - 0 -2 -4 -6 -8 D -2 1 -1 -3 -5 Alignment P -4 -1 2 0 -2 DPME M -6 -3 0 1 -1 DPLE E -8 -5 -2 -1 2 SIO1003 | Bioinformatics Concepts Algorithm of Needleman and Wunsch N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments. It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score. SIO1003 | Bioinformatics Concepts Global vs local alignments Needleman-Wunsch is a global alignment algorithm Global Resulting alignment spans the complete sequences end to end This is appropriate for closely related Local sequences that are similar in length For many practical applications we require local alignments Local alignments highlight sub- regions (e.g. protein domains) in the two sequences that align well SIO1003 | Bioinformatics Concepts Local alignment: Definition Smith & Waterman proposed simply that a local alignment of two sequences allow arbitrary-length segments of each sequence to be aligned, with no penalty for the unaligned portions of the sequences. Otherwise, the score for a local alignment is calculated the same way as that for a global alignment Smith, T.F. & Waterman, M.S. (1981) “Identification of common molecular subsequences.” J. Mol. Biol. 147:195-197. SIO1003 | Bioinformatics Concepts - - Scoring system: Match = +1 Mismatch = -1/3 Gap = difference between a match and a mismatch Highest scoring value Trace-back from here SIO1003 | Bioinformatics Concepts - - Local alignment GCC-UCG GCCAUUG Global alignment CA-GCC-UCGCUUAG AAUGCCAUUGACG-G SIO1003 | Bioinformatics Concepts Local alignments can be used for database searching Goal: Given a query sequence (Q) and a sequence database (D), find a list of sequences from D that are most similar to Q Input: Q, D and scoring scheme Output: Ranked list of hits Input Output Query sequence Database Score Ranked hit list Annotation GTATGGTCA 100 GTATGGTCA Ras CTATGGTCA 90 TGATGGTCA Ras GTATGGTCA TAGTGGTCA 40 CGATCTGCA HSP90 TGATGGTCA GTATGGTCA 38 TCGTTGCTA P450 SIO1003 | Bioinformatics Concepts The database search problem Due to the rapid growth of sequence databases, search algorithms have to be both efficient and sensitive Time to search with SW is proportional to m x n (m is length of query, n is length of database), too slow for large databases! To reduce search time heuristic algorithms, such as BLAST, first remove database sequences without a strong local similarity to the query sequence in a quick initial scan. SIO1003 | Bioinformatics Concepts The database search problem Due to the rapid growth of sequence databases, search algorithms have to be both efficient and sensitive Time to search with SW is proportional to m x n (m is length of query, n is length of database), too slow for large databases! Query RGGVKRIKLMR To reduce search time heuristic algorithms, such as BLAST, first Smaller remove database sequences GAQRGLA database without a strong local similarity to RGGVKRI the query sequence in a quick FKLLGRI initial scan. MGLGVKA MGLGVKA MPQRGLA RGGVKRI SIO1003 | Bioinformatics Concepts SIO1003 | Bioinformatics Concepts SIO1003 | Bioinformatics Concepts SIO1003 | Bioinformatics Concepts SCORING MATRICES SIO1003 | Bioinformatics Concepts Scoring Matrix A substitution matrix assigns a score for aligning any possible pair of residues. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees. A single matrix may be reasonably efficient over a relatively broad range of evolutionary change. SIO1003 | Bioinformatics Concepts Scoring Matrix There are two main families of amino acids substitution scoring matrices: PAM substitution matrices based on the rate of divergence between sequences BLOSUM substitution matrices based on the conservation of domains in proteins SIO1003 | Bioinformatics Concepts BLOSUM (BLOck SUbstitution Matrix) The BLOSUM (BLOck SUbstitution Matrix) series of matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins. The BLOSUM62 matrix is calculated from observed substitutions between proteins that share 62% sequence identity or more - clustering together proteins showing 62% sequence identity or more and giving weight 1 to each such cluster (Henikoff and Henikoff). SIO1003 | Bioinformatics Concepts BLOSUM (BLOck SUbstitution Matrix) Collection of protein blocks clustering BLOSUM62 62% MOTIF 80% BLOSUM80... 504 groups of 2205 blocks (short clusters of related proteins domains of ungapped sequences sharing multiple alignment) a high identity Henikoff S and Henikoff JG (1991). Automated assembly of protein blocks for database searching. Nucl Acids Res 23:6565-6572. Web: http://blocks.fhcrc.org/blocks/ SIO1003 | Bioinformatics Concepts BLOSUM (BLOck SUbstitution Matrix) BLOSUM100 matrix is calculated from alignments between proteins showing 100% identity the proteins in the BLOCKS database. The BLOSUM62 matrix is one of the best substitution matrixes for detecting weak protein similarities/ distant sequences. This is the matrix used by default in most recent alignment applications such as BLAST. Aligning two closely related sequences - use a higher numbered BLOSUM matrix. More divergent sequences - use a lower number BLOSUM. SIO1003 | Bioinformatics Concepts BLOSUM (BLOck SUbstitution Matrix) Eg., BLOSUM80 is used for alignments that are more similar in sequence. BLOSUM45 is used for alignments that have diverged from each other. For particularly long and weak alignments, the BLOSUM45 matrix may provide the best results. Strong, short alignments (alignments that are short but have a higher % of matching residues) are more easily detected using a matrix with a higher ‘relative entropy’ than that of BLOSUM62. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries. (-better use PAM) SIO1003 | Bioinformatics Concepts PAM (Point Accepted Mutation) Developed by Margaret Dayhoff in the 1970s. Point Accepted Mutation (PAM) is the point mutation per 100 amino acids. Point of accepted mutation – a replacement of one amino acid in a protein by another residue that has been accepted by natural selection An amino acid change that is accepted by natural selection occurs when: a gene undergoes a DNA mutation such that it encodes a different amino acid the entire species adopts that change as the predominant form of the protein SIO1003 | Bioinformatics Concepts PAM (Point Accepted Mutation) Eg. PAM1 means a 1 point mutation/100 amino acids. A PAM matrix is usually a 20 by 20 matrix (different rates of mutations between pairs of amino acids). This PAM matrix is a type of substitution matrix that is used in sequence alignment and multiple sequence alignment. This matrix is calculated by observing the differences in closely related proteins. The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed. SIO1003 | Bioinformatics Concepts PAM (Point Accepted Mutation) The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM 30 and the PAM70 are used. But Dayhoff's methodology does not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. (– better use BLOSUM) SIO1003 | Bioinformatics Concepts Differences between PAM and BLOSUM PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution. The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps. SIO1003 | Bioinformatics Concepts Differences between PAM and BLOSUM The method used to count the replacements is different, unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same. Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50. SIO1003 | Bioinformatics Concepts PAM vs BLOSUM matrices A comparison of the matrices can be done on the basis of their “information content” More conserved PAM100 = BLOSUM90 proteins PAM120 = BLOSUM80 PAM160 = BLOSUM60 More distant PAM200 = BLOSUM52 proteins PAM250 = BLOSUM45 SIO1003 | Bioinformatics Concepts PAM vs BLOSUM matrices BLOSUM 80 BLOSUM 62 BLOSUM 45 PAM 1 PAM 120 PAM 250 Less divergent More divergent Human vs Human vs chimpanzee bacterial beta globin globin SIO1003 | Bioinformatics Concepts PAM vs BLOSUM matrices SIO1003 | Bioinformatics Concepts Superiority of BLOSUM for database searches (according to Henikoff and Henikoff) SIO1003 | Bioinformatics Concepts