Lecture 5 - Pairwise Sequence Alignments PDF
UWI Cave Hill
Dr. A. T Alleyne
Summary
This document covers Lecture 5 on pairwise sequence alignments from a Bioinformatics course. It details methods like graphical approaches, dynamic programming, and heuristic algorithms (FASTA and BLAST). The concepts are explained, alongside examples within the context of biological sequences (e.g., DNA, protein).
Full Transcript
Lecture 5: Pairwise Sequence Alignments
BIOC 3265 - Principles of Bioinformatics
Dr. A. T. Alleyne - UWI Cave Hill

Learning Outcomes
At the end of this lecture you should be able to:
1. Compute a score for a pairwise alignment given gap penalties
2. Describe the three methods of aligning 2 sequences
3. Use a simple dot matrix for aligning two sequences
4. Distinguish between PAM and BLOSUM matrices
5. Explain the meaning of dynamic programming
6. Explain the differences between the Needleman-Wunsch and the Smith-Waterman algorithms

Alignment Methods
There are three methods of computing sequence alignments:
1. Graphical: visual assessment of the type and quality of the sequence alignment
2. Dynamic programming: a recursive solution that provably finds the best possible sequence alignment; computationally demanding
3. Heuristic: an approximation of the exact dynamic programming solution, but much faster than dynamic programming

#1 - Dot Plot
A graphical method which gives an overview of similarity between two strings of sequence data
- Compiled using a table or matrix consisting of rows and columns
- Stretches of similarity appear as dots while mismatches are left blank
- Used to initially confirm an obvious alignment, but more sensitive methods are required for greater accuracy

Graphical: The Dot Plot Method
- Take 2 sequences (Seq 1 and Seq 2) and write each along one side of a 2D matrix
- Every place where the sequences match, place a dot
- To find the maximal alignment, find the longest diagonal runs
Example:
seq #1: aagtcccgtg
seq #2: aggtccgttc

Alignment #1 (8 matches, 4 gaps):
aag-tcccg-tc
a-ggtcc-gttc
Alignment #2 (8 matches, 4 gaps):
aa-gtcccg-tc
-aggtcc-gttc
Alignment #3 (8 matches, 4 gaps):
aa-gtcccgt-c
-aggt-ccgttc
Theoretically, all possible sequence pairs can be aligned by the introduction of gaps.
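The dot plot procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`dot_matrix`, `windowed_dot_matrix`, `render`) and the window size of 3 are hypothetical choices, not part of the lecture. The windowed version keeps a dot only where several consecutive residues match, which is one simple way to suppress background noise.

```python
def dot_matrix(seq1, seq2):
    """Return rows (one per residue of seq2) of 1/0 flags: 1 where residues match."""
    return [[1 if a == b else 0 for a in seq1] for b in seq2]


def windowed_dot_matrix(seq1, seq2, window=3):
    """Keep a dot only where `window` consecutive residues match on the diagonal."""
    rows = len(seq2) - window + 1
    cols = len(seq1) - window + 1
    return [
        [1 if seq1[j:j + window] == seq2[i:i + window] else 0 for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix, seq1, seq2):
    """Draw the matrix as text: '*' for a dot, '.' for a blank cell."""
    width = len(matrix[0])
    lines = ["  " + " ".join(seq1[:width])]
    for label, row in zip(seq2, matrix):
        lines.append(label + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


seq1 = "aagtcccgtg"  # example sequence #1 from the slide
seq2 = "aggtccgttc"  # example sequence #2 from the slide

raw = dot_matrix(seq1, seq2)
filtered = windowed_dot_matrix(seq1, seq2, window=3)

print(render(raw, seq1, seq2))
print()
print(render(filtered, seq1, seq2))
```

In the raw plot every single-residue match gets a dot, so short sequences over a four-letter alphabet look noisy. In the windowed plot only the diagonal runs survive, and the break between the runs marks where a gap would be needed.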
Dot Plot Method
- Connect the diagonals with a jump to get the alignment.
- Each jump corresponds to a gap, which is an insertion or a deletion.
- Background noise in biological sequences is high.
- Increasing the window size can reduce the noise and focus in on the aligned strings or areas with continuous diagonals.

Adding a Score to an Alignment
Strategy:
1. Instead of dots, devise a score matrix using 1 and 0
2. To score the alignment, add the scores along the optimal path

Substitution Matrices
- A substitution matrix describes the rate at which one character in a sequence string changes to another character state over time.
- Amino acids are not exchanged with equal probability, as might be assumed theoretically. For example, an exchange of Asp for Glu is frequently observed, whereas a change from Asp to Trp is rare.
- Most algorithms use substitution matrices to align protein sequences.
- Amino acid substitution matrices describe the probability that amino acids will be exchanged in the course of evolution.
- Each entry is the logarithm of the ratio of two probabilities for a pair of amino acids or nucleotides appearing together in an alignment: the probability that the pairing results from an evolutionary event, and the probability that it is a coincidental co-occurrence.
- Negative values in the matrix mean that the occurrence is more likely coincidental, whereas positive values suggest an evolutionary event.

Scoring Matrices
- A 2-dimensional matrix containing all possible pairwise scores of 2 sequences
- The basic rules for scoring matrices were first put forward by Margaret Dayhoff in 1978
- Two commonly used scoring (substitution) matrices are PAM and BLOSUM

#2 - Dynamic Programming
- A recursive process that seeks to solve an otherwise intractable problem.
- A dynamic programming (DP) algorithm is an algorithmic technique usually based on a recurrence formula and one (or some) starting states.
- A sub-solution of the problem (divide and conquer) is constructed from previously found ones.
- Dynamic programming gives an exact solution to an otherwise intractable problem.
- Sequence comparison, gene recognition and RNA structure prediction can all be solved by DP.

Dynamic Programming (DP)
- Gaps in a sequence make the alignment of 2 sequences very difficult, so dynamic programming (DP) methods are used instead
- In DP, the 2 sequences are broken up into small sub-sequences, and single nucleotides or amino acids are compared and scored individually
- Low-scoring alignments are discarded
- It works by (i) starting from the simplest instance of a problem, (ii) finding an optimal solution for it, and (iii) extending the optimal solution to bigger instances.
- The solution is the best diagonal path corresponding to the optimal alignment.

Graphs
- DP methods can be represented by graphical solutions: the optimal path is a series of connections via vertices.
[Figure labels: vertex, path, score]

Biological Considerations of Score Matrices
- Conservation: accounts for conservative substitutions as well as absolute conservation of proteins. Maintains protein function.
- Frequency: accounts for the frequency with which residues occur (log odds ratio or LOD score)
- Evolution: accounts for evolutionary patterns between closely related or unrelated proteins.
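As an illustration of the dynamic programming steps described earlier (start from the simplest instance, then extend the optimal solution to bigger instances), here is a minimal Python sketch of a global alignment in the style of Needleman-Wunsch, the algorithm named in the learning outcomes. The scoring values are assumptions: match = 1 and mismatch = 0 follow the 1/0 score matrix idea from the scoring slide, while the linear gap penalty of -1 and all function names are hypothetical choices for this example. For proteins, a substitution matrix such as PAM or BLOSUM would normally replace the simple match/mismatch score.

```python
def score_alignment(a, b, match=1, mismatch=0, gap=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def needleman_wunsch(s, t, match=1, mismatch=0, gap=-1):
    """Return (best score, aligned s, aligned t) for a global alignment."""
    n, m = len(s), len(t)
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # simplest instances: prefix vs. nothing
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):          # extend optimal sub-solutions
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,   # align two residues
                          F[i - 1][j] + gap,        # gap in t
                          F[i][j - 1] + gap)        # gap in s
    # Trace back from the bottom-right corner to recover one optimal path.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))


# Alignment #1 from the dot plot example: 8 matches and 4 gaps.
slide_score = score_alignment("aag-tcccg-tc", "a-ggtcc-gttc")

# Let dynamic programming find the best alignment of the same two strings.
best, top, bottom = needleman_wunsch("aagtcccgtc", "aggtccgttc")
print(slide_score, best)
print(top)
print(bottom)
```

With these assumed scores the slide alignment evaluates to 8 x 1 + 4 x (-1) = 4. The table `F` is filled once, so the work grows with the product of the two sequence lengths. Changing the gap penalty changes which path through the table is optimal, which is why the penalty has to be stated whenever an alignment score is reported.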
Popular Score Matrix Programs
- PAM (Point Accepted Mutation): used in evolutionary or phylogenetic studies (also called the Dayhoff matrix)
- BLOSUM (Blocks Substitution Matrix): used in finding common motifs
- K-tuple: heuristic method used for comparison of a single sequence with an entire database; the basis of BLAST (Basic Local Alignment Search Tool)

Point Accepted Mutation (PAM) Matrix
A point accepted mutation (PAM) is the replacement of a single amino acid with another in a protein; this occurs naturally throughout evolution.

PAM Matrix in Bioinformatics
- A matrix is used to represent the twenty amino acids in rows and columns
- Each amino acid in a protein sequence is then compared for substitutions
- So, they are used as substitution matrices
- Began with Margaret Dayhoff in the 1970s

M. O. Dayhoff PAM Matrix
PAM matrix | % identity
0 | 100
30 | 75
80 | 50
110 | 40
200 | 25
250 | 20
Used 71 protein families and examined approx. 1572 changes.

PAM Matrices
- PAM matrices compare two sequences which are a specific number of PAM units apart.
- Two sequences are at 1 PAM distance if one can be converted into the other with one accepted mutation per 100 amino acids.
- Accepted means that these mutations are not lethal for the organism.
- A PAM 1 matrix means that the sequences have ~99% identical amino acids
- A PAM 250 matrix means that the sequences have ~20% identical amino acids
- A high PAM number indicates less similarity because of the possibility of multiple mutations occurring at a single position

PAM Matrix Construction
- The PAM 1 matrix was initially constructed by Margaret Dayhoff from a set of proteins with 85% identity
- Other PAM matrices are constructed by multiplying the PAM 1 matrix by itself
- PAM 100 is PAM 1 raised to the 100th power (PAM 1 x PAM 1 x ... with 100 factors), PAM 160 is PAM 1 raised to the 160th power, etc.
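The construction by repeated multiplication can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not Dayhoff's data: it uses a hypothetical 3-letter alphabet with made-up mutation probabilities (the real PAM 1 matrix is 20 x 20), and probabilities are written as fractions, so each column sums to 1 rather than 100. The helper names `mat_mul`, `mat_pow` and `column_sums` are also hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return m x m x ... x m with n factors (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


def column_sums(m):
    """Sum each column; a mutation probability matrix keeps these at 1."""
    return [sum(row[j] for row in m) for j in range(len(m))]


# Hypothetical "PAM 1"-like matrix for a 3-letter alphabet.
# Entry [i][j] is the assumed probability that residue j is replaced by residue i
# over one PAM unit; about 1% of residues change, so the diagonal is 0.99.
TOY_PAM1 = [
    [0.990, 0.004, 0.003],
    [0.006, 0.990, 0.007],
    [0.004, 0.006, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

print("diagonal at 1 PAM:  ", [TOY_PAM1[i][i] for i in range(3)])
print("diagonal at 250 PAM:", [round(toy_pam250[i][i], 3) for i in range(3)])
print("column sums at 250: ", [round(s, 6) for s in column_sums(toy_pam250)])
```

The diagonal entries (the chance that a residue is unchanged) shrink as the power grows, while the columns still sum to 1. This mirrors the statement that a higher PAM number corresponds to lower sequence identity.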
** Building PAM matrices this way was feasible in the 1970s, when the database held few proteins.

PAM 1 Matrix
- A mutation probability matrix
- Shows possible substitution of each amino acid with another in the matrix
- Highest scores are along a diagonal from top left to bottom right
- Values in each column sum to 100
- Based on closely related proteins; evolutionary origin is therefore assumed
- Assumes a 1% substitution rate; computes the probability for every 100 amino acids

PAM 1 Scores
- A score exactly equal to 1 indicates amino acid pairs that are found as alternatives at exactly the frequency predicted by chance.
- Residue pairs with scores less than 1 are found as alternatives less often than predicted by chance.
- Residue pairs with scores greater than 1 are found more often than predicted by chance, which indicates that both residues can carry out similar functions.

PAM x PAM
- If substitutions occur randomly, evolution is approximated by simply multiplying the PAM 1 matrix by itself.
- PAM_N = (PAM_1)^N
- Larger N corresponds to greater evolutionary separation
- A common choice for alignment is the PAM250 matrix
- PAM 250 means 250 amino acid mutations for every 100 amino acids
https://youtu.be/68lF71zEUF8?si=1xjrVQ2voOiYivx3

PAM 250 Scoring LOD (Log Odds) Matrix
https://youtu.be/68lF71zEUF8?si=g0XpwPmzUwUYfnW5

Assumptions of the PAM Matrix
- The matrix is based on prediction, not direct observation, so similarity is usually an extrapolation.
- A change in an amino acid is independent of earlier mutations at the same position in a protein.
- All sites are equally mutable and independent of surrounding amino acid residues.
- Time is not taken into consideration: all changes are deemed equal even if they occurred over short or long time spans.

BLOSUM
- Derived by S. Henikoff and J. G. Henikoff (1992)
- BLOSUM: Blocks Substitution Matrix
- Replaced the PAM matrix with a better solution for identifying distant relationships
- Uses closely related proteins aligned without gaps: domains, conserved regions or blocks
- Ratios are expressed as log odds, like the PAM matrix

BLOSUM - Blocks Substitution Matrix
- BLOSUM matrices are constructed by using ungapped segments or BLOCKS (conserved regions) from a set of multiple aligned sequences.
- Sequences within the blocks are clustered on the basis of percent identity.
- BLOSUM matrices are generated from direct observation of conserved regions, not extrapolation.
- The matrix number refers to the clustering threshold: the highest level of identity sequences may share and still contribute independently to the model. For example, in BLOSUM62, sequences in a block that are 62% or more identical are clustered and counted as one.

BLOSUM 62
- Useful for proteins with < 62% identity
- Default BLAST matrix
- Uses a LOD score in log base 2

BLOSUM vs PAM
- BLOSUM matrices provide a more accurate measure of substitution patterns.
- BLOSUM matrices are calculated directly from conserved regions.
- BLOSUM gives a better approximation of evolutionary distance.
- Each number in the BLOSUM matrix represents a level of conservation.
- The PAM matrix is based on assumptions of mutations.

BLOSUM 62 Scoring LOD (Log Odds) Matrix
https://youtu.be/njva17LwhsE?si=UQxo-VfWs6j0sxeC
[Figure: BLOSUM 62 matrix, annotated "Most frequent occurrence". This Photo by Unknown Author is licensed under CC BY-SA]

Implications of BLOSUM and PAM
- Less divergent: BLOSUM 90, PAM 30
- Intermediate: BLOSUM 62 (default matrix in BLAST) or PAM 120
- More divergent: BLOSUM 45, PAM 250

Choosing a Matrix
Matrix | Best use | Similarity (%)
PAM 40 | Short alignments that are highly similar | 70-90
PAM 160 | Detecting members of a protein family | 50-60
PAM 250 | Longer alignments of more diverse sequences | 30
BLOSUM 90 | Short alignments that are highly similar | 70-90
BLOSUM 80 | Detecting members of a protein family | 50-60
BLOSUM 62 | Finding all potential similarities | 30-40
BLOSUM 30 | Longer alignment of a more diverse