Lecture 2 - Protein Sequence Analysis
Imperial College London
Mike Sternberg
Summary
This document details protein sequence analysis, including primary DNA sequence databases, metagenomics, and methods for detecting evolutionary relationships. It explains concepts like orthologues, paralogues, and pairwise protein sequence alignment. The document provides an overview of these topics and discusses databases, algorithms, and data analysis methods for proteins.
Full Transcript
Lecture 2 – Protein Sequence Analysis

Primary DNA Sequence Databases
- GenBank (NCBI)
- ENA (EMBL)
- DDBJ
Data is exchanged between these sites nightly, so they hold the same core data.

The initial DNA depositions are translated into protein sequences:
- GenBank to GenPept
- EMBL to TrEMBL
In parallel, Swiss-Prot is a high-quality source of annotation for some sequences.

UniProtKB
This incorporates Swiss-Prot, with its high-quality annotation. There are now 240M entries, reflecting a huge expansion in sequences.

Problems and Errors in Databases
- The organisation of databases changes rapidly.
- The names of proteins are very variable.
- Errors are very slow to correct.
- Some errors are never corrected, because the organisation will not change a submission without action by the submitter.
- Sometimes a database will not work in a browser.

Metagenomics
Work pioneered by Craig Venter to obtain sequences in batch from microorganisms in exotic locations, such as the middle of the ocean or the human gut, to give an insight into biodiversity. However, many sequences are of poor quality and are often fragments. The MGnify database at EBI has 350,000 amplicons from 33,000 metagenomes.

Detecting Evolutionary Relationships
To detect evolutionary relationships, we need to quantify the similarity between the species at the level of their sequences. We can do this by pairwise protein sequence alignment.

Orthologues and Paralogues
There are two different events during evolution: gene duplication and speciation.

Gene duplication is when a gene is duplicated within a genome. The two resulting proteins are called paralogues.
- This can result in a change of function: only one copy is required to provide the original protein, so the second gene/protein can evolve a new function.

Speciation is when a new species is created. As a result, each of the two species has a single copy of the same gene, and the two proteins are called orthologues.
- Both species have only a single copy, so the function is less likely to change.
Pairwise Protein Sequence Alignment

Requirements
A scoring scheme for the similarity of amino acid residues, and an algorithm to establish the alignment. The aim is that the algorithm, combined with the scoring scheme, generates the best alignment in terms of the biology and has the potential to be extended to database scoring.

Scoring Scheme
The simplest way is to score 1 for identical amino acids and 0 for different ones (identical bases can be scored in the same way). However, for proteins, evolution imposes constraints on the types of amino acid change that generally occur: changes tend to modify, but not destroy, protein function. Residues tend to keep their chemical property, e.g. the tendency to be buried (i.e. non-polar or hydrophobic residues). The maintenance of chemical property is called conservative substitution.

Point Accepted Mutation (PAM)
This was developed by Dayhoff in around 1978 (a founder of bioinformatics). It is based on counting the number of times residue types changed in alignments of closely homologous sequences. It can be extended to more distant relationships by assuming the matrix can be multiplied by itself. PAM250 was developed to model sequences with 20% identity.
First, it quantifies the odds that one residue is mutated from another:
$$\text{odds} = \frac{\text{observed probability of the amino acid change}}{\text{probability of the change expected by chance}}$$
The matrix score is the logarithm of these odds.

BLOSUM62
This was derived from aligned sequences clustered at a sequence identity of 62%. It looked at the conserved blocks and ignored the loops, to try to amplify the signal over the noise. The calculation is similar to that for the PAM250 matrix. However, BLOSUM62 is considered the most accurate and sensitive, so it is the most widely used matrix and is included in the BLAST/PSI-BLAST family of database searching algorithms.

Gap Penalties
Penalise gaps (indels).
Penalty = o + e × l
where
- o = the gap opening constant (10)
- e = the gap extension constant (1)
- l = the length of the gap extension
o > e because the evolutionary event is the making of the gap, and we often see long gaps.
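As a rough illustration of the gap penalty formula above, the following Python sketch computes o + e × l with the example constants (o = 10, e = 1) and applies it when scoring a pair of sequences that has already been aligned. It uses the simple identity scheme (1 for identical residues, 0 for different ones) rather than a PAM or BLOSUM matrix. The function names, the "-" gap character, and the reading of l as the total length of the gap are assumptions made for this example.

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Gap penalty o + e * l, with l assumed to be the total gap length."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def score_alignment(a, b, gap_open=10, gap_extend=1):
    """Score two pre-aligned, equal-length strings that use '-' for gaps.

    Identity scoring: 1 for identical residues, 0 for different ones.
    Each run of consecutive gaps in one sequence is penalised once.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    run = 0        # length of the gap run currently being read
    run_in = None  # which sequence ("a" or "b") holds that run
    for x, y in zip(a, b):
        gap_in = "a" if x == "-" else "b" if y == "-" else None
        if gap_in is not None and gap_in == run_in:
            run += 1  # the current gap continues
            continue
        # The previous run (if any) has ended: charge it once.
        score -= gap_penalty(run, gap_open, gap_extend)
        run, run_in = (1, gap_in) if gap_in else (0, None)
        if gap_in is None and x == y:
            score += 1
    # Charge a gap run that reaches the end of the alignment.
    score -= gap_penalty(run, gap_open, gap_extend)
    return score


if __name__ == "__main__":
    print(gap_penalty(1), gap_penalty(5))
    print(score_alignment("AC--E", "ACDDE"))
```

With these constants one gap of length 2 costs 12, whereas two separate gaps of length 1 cost 22, which is how o > e favours a few long gaps over many short ones.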
Alignment Methods

Protein Domains
Often a protein sequence is formed from parts known as domains, where each domain belongs to a different homologous family. Domains are the evolutionary unit.

Local vs Global Alignment

Needleman-Wunsch Algorithm
This is a general algorithm for sequence comparison. It maximises a similarity score to give a maximum match, and finds the best GLOBAL alignment of any two sequences.
Maximum match = the largest number of residues of one sequence that can be matched with another, allowing for all possible indels.
The N-W algorithm involves an iterative matrix method of calculation:
- all possible pairs of residues (one from each sequence) are represented in a 2-dimensional array.
- all possible alignments are represented by pathways through this array.
Three Main Steps:
1. Assign similarity values.
2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing for indels) and give that cell the value of the maximum scoring pathway.
3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment.

Similarity Values
Sij is the numerical value assigned to each cell in the array, depending on the similarity/dissimilarity of the two residues. The values are taken from the BLOSUM62 matrix.

Constructing the Alignment
The alignment score is cumulative, obtained by adding along a path through the array. The best alignment has the highest score, i.e. the maximum match.
Maximum match = the largest number resulting from summing the cell values along any pathway. The maximum match will always be somewhere in the outer row or column of the array.
The alignment is constructed by working backwards from the maximum match, tracing the best path back through the matrix and introducing gap penalties.
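The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's exact formulation: it uses a simple match/mismatch score in place of BLOSUM62, a single linear gap penalty in place of the opening/extension scheme, and the common textbook variant in which end gaps are penalised, so the traceback starts from the bottom-right cell. The function name and default scores are invented for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Steps 1 and 2: fill every cell with its best-scoring pathway.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # continue the alignment
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )

    # Step 3: trace back from the final cell to build the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The three terms inside `max` correspond to continuing the alignment or adding a gap in one sequence or the other.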
When a gap penalty is introduced, the next step is the best of:
- just continue the alignment
- add a gap in the vertical sequence
- add a gap in the horizontal sequence

Smith-Waterman Algorithm
Instead of looking at each sequence in its entirety, this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximises the similarity measure.

Dynamic Programming
Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for "rigorous" alignment of DNA and protein sequences. For a number of useful alignment-scoring schemes, this method is guaranteed to produce an alignment of two given sequences having the highest possible score. It also allows gaps.

Database Searching
Query sequence + database of sequences → comparison algorithm → list of similar protein sequences → infer homology: similar structure and often similar function.

Fast Pairwise Search Algorithms
A single query is aligned independently to each similar database entry, and the method must perform a local search. Smith-Waterman is guaranteed to find a mathematically optimal solution, but it is too slow for database searching except on specialist parallel processing computers. Various fast methods have been developed, based on finding short local matches and then building up the alignment. These methods are good, but they are not guaranteed to find a mathematically optimal solution.
- FASTA: a popular method developed in 1985, but no longer widely used.
- BLAST: now the major sequence search tool in protein and DNA bioinformatics.

BLAST
This is a highly sophisticated approach developed by Altschul and colleagues in 1990. It is a very fast local search program (50x the speed of the Smith-Waterman algorithm).
1. First it finds short segments or seeds (known as words) in the query that have matches in the database, using the BLOSUM62 score.
2. Then it extends suitable seeds to form HSPs (high-scoring segment pairs) using ungapped and gapped alignments.
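The two BLAST steps above can be illustrated with a much simplified Python sketch. It is not the BLAST implementation: seeds here are exact matches of 3-letter words, whereas BLAST also accepts similar words that score well under BLOSUM62, and the extension is ungapped only, with a +1/-1 identity score and a small drop-off threshold. All function names, parameters and example sequences are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where identical k-letter words occur."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best segment pair.
    Extension stops once the score falls more than x_drop below the best seen.
    """
    score = best = k * match
    right = step = 0
    while i + k + step < len(query) and j + k + step < len(subject):
        same = query[i + k + step] == subject[j + k + step]
        score += match if same else mismatch
        step += 1
        if score > best:
            best, right = score, step
        elif best - score > x_drop:
            break
    score = best  # restart from the best right-hand extension
    left = step = 0
    while i - step - 1 >= 0 and j - step - 1 >= 0:
        same = query[i - step - 1] == subject[j - step - 1]
        score += match if same else mismatch
        step += 1
        if score > best:
            best, left = score, step
        elif best - score > x_drop:
            break
    return best, i - left, j - left, k + left + right


if __name__ == "__main__":
    q, s = "ACDEFGHIK", "WWDEFGHWW"
    for qi, sj in find_seeds(q, s):
        print((qi, sj), extend_ungapped(q, s, qi, sj))
```

Because only positions sharing a word are examined, most of the database is never aligned at all, which is where the speed comes from and also why the result is not guaranteed to be optimal.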
The significance of an HSP match of a given length is evaluated by precise statistics. BLAST is also used for DNA/DNA searches and for protein against 6-frame DNA translation. PSI-BLAST has also been developed, which uses multiple sequences.

Accuracy of Database Searching
A cut-off score is needed to assign positives and negatives. This is done with P and E values.

Reliability of a Match
P(S) is the probability of achieving a score S or a better score by chance (P is cumulative). A related measure is the expectation of an error in the database scan (the E-value): E(S) is the expected number of chance occurrences of scores equal to or better than S.

E-Values
The E-value is the expected number of matches that are errors if you searched and took all matches up to and including S. Essentially, the E-value is the estimated number of false positives found using S as the cut-off.
Most search programs return one or both of these values, and the values take into account the size of the database searched and the score of the match. BLAST also considers the length of the match, as short matches are easier to find by chance.
For matches < 20 residues you must be very cautious in suggesting true homology, and you cannot infer that short matches will have a similar 3D structure.
You can be confident if P or E < 10^-3, but these are estimated values and could be wrong. Also, you cannot compare E-values from different programs, as they all calculate them differently.
Note that P is a probability and so P cannot be greater than 1.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
where TP is the number of true positives and FN is the number of false negatives. This is the fraction of the positives that you correctly called positive (the true positive rate).

$$\text{Precision} = \frac{TP}{TP + FP}$$
where FP is the number of false positives.
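As a small worked illustration of these measures, the Python sketch below applies an E-value cut-off to a list of hits and then computes sensitivity and precision from the resulting counts. The hit identifiers, E-values and the set of known homologues are invented for the example, and the function names are assumptions rather than part of any search program.

```python
def count_calls(hits, true_homologues, e_cutoff=1e-3):
    """Count true positives, false positives and false negatives.

    hits: dict mapping sequence id -> E-value reported by a search
    true_homologues: set of ids known (independently) to be homologous
    A hit is called positive when its E-value is below e_cutoff.
    """
    called = {seq_id for seq_id, e in hits.items() if e < e_cutoff}
    tp = len(called & true_homologues)
    fp = len(called - true_homologues)
    fn = len(true_homologues - called)
    return tp, fp, fn


def sensitivity(tp, fn):
    """Fraction of real positives that were called positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of called positives that are real positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    # Invented example data, for illustration only.
    hits = {"seqA": 1e-10, "seqB": 1e-5, "seqC": 0.5, "seqD": 1e-4}
    known = {"seqA", "seqB", "seqC"}
    tp, fp, fn = count_calls(hits, known)
    print(tp, fp, fn, sensitivity(tp, fn), precision(tp, fp))
```

Loosening the cut-off would recover seqC and raise sensitivity, but in a real search it would also be expected to admit more chance matches and so lower precision.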