BLAST Searching - Lecture Notes PDF

Summary

These lecture notes cover BLAST searching, a fundamental bioinformatics tool. The material discusses the function, usage, algorithms, and examples of BLAST. It also addresses related topics such as HSPs, E-values, and scoring matrices within the context of biological sequences.

Full Transcript

Lecture 6: Principles of BLAST searching
BIOC 3265 Bioinformatics
Dr. A. T. Alleyne, UWI Cave Hill

LEARNING OUTCOMES
After this lecture you should be able to:
1. Describe the functions of the BLAST search.
2. Describe the main BLAST programs and their uses.
3. Explain how BLAST works.
4. Define a high-scoring segment pair (HSP).
5. Explain the significance of E-values and S-scores.
6. Discuss BLAST alternatives and improvements.
7. Compare BLAST and FASTA searches.
8. Conduct a BLAST search given a query sequence.

HIGH-SCORING SEGMENT PAIR (HSP)
The basic BLAST unit is a high-scoring segment pair, or HSP.
- An HSP is a local alignment with no gaps that achieves one of the highest alignment scores in a search.
- An HSP consists of two sequence fragments of equal length whose alignment is locally maximal and whose alignment score meets or exceeds a cutoff value.
- An HSP is defined by two sequences, a scoring system, and a cutoff score.

HSP: High-scoring segment pair
K-mers (words) that match with a score above a selected threshold (T) are extended to form an HSP.

A list of words (w = 3)
Words as listed on the slide: VTA TAL ALW GAW GTW GSW NTW ATW GTY NTW ANA GAS
Query word: GTW. Threshold: T = 11.

Word   Residue scores   Total   Status
GTW    6, 5, 11         22      neighborhood word hit (above threshold)
GSW    6, 1, 11         18      neighborhood word hit (above threshold)
ATW    0, 5, 11         16      neighborhood word hit (above threshold)
NTW    0, 5, 11         16      neighborhood word hit (above threshold)
GTY    6, 5, 2          13      neighborhood word hit (above threshold)
ANA                     5       below threshold
GAS                     7       below threshold
(Fig. 4.11, page 116)

HOW BLAST WORKS
1. Compile or list: a list of high-scoring "words" (pairwise alignments) is compiled.
2. Scan and extend: the database is scanned for word matches, or hits. Hits are extended to differentiate random hits from meaningful hits (by score).
3. Evaluate and trace back: insertions, deletions and mismatches are assigned by traceback.
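The word-compilation step and the neighborhood scores for the query word GTW can be reproduced with a short sketch. This is an illustration only, not BLAST's actual implementation: the function names are hypothetical, the dictionary holds just the handful of BLOSUM62 entries needed for the above-threshold rows of the table, and "at or above T" is assumed as the cutoff rule.

```python
# Minimal sketch of BLAST phase 1 (word list and neighborhood words).
# Only the BLOSUM62 entries needed for this example are included.
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def words(seq, w=3):
    """All overlapping words of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def word_score(query_word, candidate):
    """Sum of per-residue substitution scores for two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Candidates scoring at or above the threshold (assumed rule), best first."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda item: -item[1])


if __name__ == "__main__":
    print(words("VTALW"))  # overlapping 3-letter words of an illustrative fragment
    for word, score in neighborhood("GTW", ["GTW", "GSW", "ATW", "NTW", "GTY"], 11):
        print(word, score)
```

A real search would score every possible 3-letter word against each query word using the full matrix; the subset here keeps the example small enough to check by hand against the table.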
Phase 1: compile list of words
(Figure taken from Bioinformatics & Functional Genomics 3e by Pevsner)

Phase 2: scan the database for matches and extend
(Figure taken from Bioinformatics & Functional Genomics 3e by Pevsner)

Extension
Extend the exact matches to high-scoring segment pairs (HSPs).
- Matching k-mers are extended into stretches called high-scoring segment pairs (HSPs), resulting in matches that are longer than K.
- Two or more of these HSPs are combined to form a longer alignment.

Phase 3: Traceback
- Calculate the locations of insertions, deletions and matches (from phase 2).
- Apply composition-based statistics for BLASTP etc.
- Generate a gapped alignment.

SCORING
Based on a preset "word" size.
- BLASTN: +2 for a match and -3 for a mismatch.
- BLASTP: BLOSUM62, BLOSUM45, BLOSUM80, PAM70 or PAM30 substitution matrices.
Raising the word size to w = 15 gives fewer matches and is faster than w = 11 or w = 7.

Hits and extensions
If the word size or threshold values are raised, the speed increases but fewer hits are received; the reverse is true when they are lowered.

USES OF BLAST SEARCHES
- Determines function of paralogs and orthologs
- Exploration of protein function or amino acid residues
- Identifies a nucleotide or protein sequence
- Discovery of new genes
- Determines whether particular genes or proteins are found in a particular organism
- Investigation of ESTs or other specialized databases
- Determination of gene or protein variants
- PCR primer selection

BLASTP search at NCBI: overview of web-based search
- query: FASTA format or accession
- database
- Entrez query
- algorithm
- parameters

QUERY SEQUENCE FORMAT
Enter the query sequence in one of two ways:
1. Cut and paste a file in FASTA format.
2. Enter an accession number from NCBI.

FASTA FILE
- A text-based format for writing nucleotide or protein sequences.
- Begins with a (>) sign in the description line.
- This is followed by the nucleotide or peptide sequence written in single-letter code, in upper or lowercase.
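The FASTA layout just described (a description line starting with ">", then sequence lines in either case) can be read with a few lines of code. The following is a minimal sketch, not a full parser; the function name and the example record are hypothetical, and converting sequences to uppercase is a choice made here for convenience.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be upper or lowercase; normalize to uppercase.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical record; the sequence is split over two lines in mixed case.
    example = ">example_seq hypothetical record\nTGATGATGAAGA\ncatcag\n"
    for description, sequence in parse_fasta(example):
        print(description, sequence, len(sequence))
```

For real work an established library parser is usually preferable, since it handles edge cases this sketch ignores.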
New Core-nt BLAST default database
- Enables faster searches
- Returns similar top results for most searches
- Reduces redundancy for some highly represented organisms
- Allows easier download and requires less storage space for database download for standalone BLAST
- Core_nt contains the same eukaryotic transcript and gene-related sequences as nt.
Source: Get Faster, More Focused Search Results with NCBI's New BLAST Core Nucleotide Database (core_nt) - NCBI Insights (nih.gov)

SEARCH PARAMETERS
- The default search parameters are usually used.
- The search results format can be changed.
- A search submission results in:
  - an RID number (Request identifier), and
  - an estimated time for the search to be completed.
- Searches can be saved at myNCBI to see recent results.

Choose optional BLASTP search parameters
max sequences, short queries, expect threshold, word size, max matches, scoring matrix, gap costs, compositional adjustment, filter, mask
(B&FG 3e Fig. 4-4, page 128)

Results page
Contains:
- the Summary,
- Graphical Overview,
- Descriptions table,
- and Alignments sections.

BLAST output includes a list of matches; links to the NCBI protein entry; score and E value; and download options.

Graphic summary

ALIGNMENTS: THE BULK OF THE REPORT

Alignments
Provide a summary of the database sequences identified by BLAST to be similar to the input query. For each match the summary lists:
- description/title of the matched database sequence
- highest alignment score (Max score) from that database sequence
- total alignment scores (Total score) from all alignment segments
- percentage of the query covered by alignment to the database sequence
- best (lowest) Expect value (E value) of all alignments from that database sequence
- highest percent identity (Max ident) of all query-subject alignments, and
- Accession of the matched database sequence

EXPECT OR E-VALUES
- Scores are reported by BLAST for each high-scoring segment pair (HSP) as E-values.
- E-values approximate the number of HSPs with a score of at least S that are expected by chance.

E-VALUE
- A measure of how likely the alignment is to have occurred by chance: the number of alignments scoring at least as well that are expected by chance.
- A statistical calculation based on the quality of the alignment (the score) and the size of the database.
- The lower the E-value, or the closer it is to zero, the more "significant" the match is. A low E-value therefore represents a better match for an alignment.
- As E approaches zero, there is less chance of the alignment being a random event.
- E-values ≤ 0.05 are generally treated as statistically significant.

GENERAL GUIDE TO E-VALUES
- An E-value of 1e-3 means that about 0.001 alignments with a similar score are expected by chance, which corresponds to roughly a 1 in 1000 chance that the alignment would exist in the database by chance.
- E ≤ 0.02: sequences probably homologous. An E-value < 1e-05 (0.00001) is usually considered a significant match and provides high confidence for a homologous relationship.
- 0.02 ≤ E ≤ 1.0: homology needs to be proven but should not be dismissed. An E-value of 1 means that one expects by chance to see 1 match with a similar score.
- E > 1.0: a match this good is expected just by chance.
- E-values are used in conjunction with careful observation and analysis of the biological questions being asked.
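The rule-of-thumb bands in the guide above can be written as a small lookup. This is only a sketch of the guide as stated in these notes; the function name and the label strings are hypothetical, and the cutoffs are guidelines rather than hard rules.

```python
def interpret_e_value(e):
    """Map an E-value to the rule-of-thumb categories from the guide above."""
    if e < 0:
        raise ValueError("E-values cannot be negative")
    if e <= 0.02:
        return "probably homologous"
    if e <= 1.0:
        return "homology needs to be proven"
    return "expected by chance"


if __name__ == "__main__":
    for e in (1e-5, 0.02, 0.5, 1.0, 10):
        print(e, interpret_e_value(e))
```

As the last bullet of the guide stresses, a label like this is a starting point and should be weighed against the biological question being asked.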
$E = K m n \, e^{-\lambda S}$
- S = the score.
- E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S.
- m is the query length; n is the sum of the lengths of all the sequences in the database.
- K and λ are the Karlin-Altschul parameters. λ normalizes the alignment score for the scoring system; K is a scaling constant applied to the search space size m n.
- The E-value thus depends on the database size.

How to interpret BLAST: E values and p values
The two are related by $p = 1 - e^{-E}$.

E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.00010000

- E values of about 1 to 10 are far easier to interpret than the corresponding p values.
- Very small E values are very similar to p values.
- Small E-value: low number of hits, but of high quality.
- Large E-value: many hits, partly of low quality.
- In BLAST, the p-value represents the probability of obtaining an alignment with a bit score >= S' by chance in a database search.

SCORES
- The score S is a statistical measure of the similarity of each pairwise alignment.
- A high S-score therefore represents a better match; a low score suggests a chance match.
- S-scores are listed from high to low in the report.
- Bit scores can be compared across searches.
- The scores follow an extreme value distribution (EVD) rather than a normal distribution.

Raw scores and bit scores
- There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores).
- Bit scores are comparable between different database searches, even when different scoring matrices are used, because they are normalized to account for the scoring matrix and the database size.
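The relationships between raw score, bit score, E-value and p-value can be checked numerically. The sketch below uses the E-value equation from these notes together with the standard Karlin-Altschul bit-score normalization, S' = (λS - ln K) / ln 2, under which E = m n 2^(-S'). The function names are hypothetical, and the values of K, λ, m and n in the usage section are illustrative assumptions rather than values taken from the lecture.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of chance HSPs scoring at least raw_score: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized (bit) score: S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """The same E-value expressed through the bit score: E = m*n*2**(-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit: p = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


if __name__ == "__main__":
    # Illustrative assumptions only: scoring parameters and search-space sizes.
    K, lam = 0.041, 0.267
    m, n = 250, 1_000_000_000
    S = 100
    print("bit score:", bit_score(S, K, lam))
    print("E-value:  ", e_value(S, m, n, K, lam))
    # Reproduce the E versus p table from the notes.
    for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
        print(f"{e:<8} {p_value(e):.8f}")
```

Because E is proportional to m n, doubling the database size doubles the E-value for the same raw score, which is why the same alignment can look less significant against a larger database.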
MAX and TOTAL SCORES
- Max Score: the score of the single best aligned segment from a database sequence.
- Total Score: the sum of the scores of all aligned segments from that database sequence.

IN SUMMARY: E VALUE AND S-SCORE
- Increasing S decreases the E-value = good alignment.
- Decreasing S increases the E-value = poor alignment.

PSI BLAST
- Protein-protein BLAST
- Position-Specific Iterated BLAST
- Finds more distantly related matches
- Iterates: initial search results provide information on "allowed" mutations; subsequent searches use these to create a custom substitution matrix.
Source: PSI-BLAST < Sequence Similarity Searching < EMBL-EBI

COMPARISON OF BLAST AND PSI-BLAST
BLAST:
- Slow for complete searches against the database
- Returns close matches to the query sequence
- Good for general homology searches
PSI-BLAST:
- Useful for protein sequences
- Used for whole genomes
- Identifies more homologues below 30% sequence identity than BLAST (distant relationships)

FASTA (PEARSON AND LIPMAN 1988)
- FASTA (Fast-All): rapid heuristic alignment of pairs of protein or DNA sequences
- Uses a k-tuple algorithm (matches sequence patterns or words)
- Builds a local alignment based on word matches
- Uses tables for searching and comparing query sequences
- Slower than BLAST but may be more accurate, depending on word size

K-TUPLE
- A word (short sequence segment) used in the hash table process in FASTA.
- k refers to the number of elements in the word, e.g. the number of nucleotides or amino acids.
- E.g. for the sequence TGATGATGAAGACATCAG, TGATGATG and GATGATGA are two words or k-tuples of k = 8.

FASTA STEPS
1. Hashing: identify common k-tuples (k-tup) between two sequences (overlapping words or k-tups).
2. First scoring: score diagonals (pairwise alignments) with k-tuples and identify the best 10 diagonals.
3. Second scoring: re-score the initial regions with a substitution score matrix (Smith-Waterman algorithm). An optimal pairwise alignment is produced.
4. DP and statistical analysis: perform dynamic programming to assess the significance of the alignment; an E-value and an S-score are calculated.

BLAST V FASTA
BLAST:
- Uses a heuristic algorithm: it breaks down the query sequence into smaller words and searches for matches in a pre-constructed database.
- It then extends these matches to find local alignments.
- Seed matches do not need to match exactly.
FASTA:
- Uses a DP algorithm to perform pairwise sequence alignments, comparing each query sequence residue to every database residue.
Taken from: http://ab.inf.uni-tuebingen.de/teaching/ws06/albi1/script/BLAST_slides.pdf

REFERENCES
- Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565.
- Baxevanis, A. D. and Ouellette, B. F. (2005). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Wiley & Sons. Chapter 11.
- Pevsner, J. (2016). Bioinformatics and Functional Genomics. Wiley. Chapter 4.
- BLAST Glossary - BLAST® Help - NCBI Bookshelf (nih.gov)
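As a supplement to the k-tuple and FASTA steps slides, the hashing and diagonal-counting idea can be sketched in a few lines. This is a simplified illustration, not the FASTA program itself: the function names are hypothetical, diagonals are scored by a simple count of shared k-tuples, and the later re-scoring and dynamic programming stages are left out.

```python
from collections import Counter, defaultdict


def k_tuples(seq, k):
    """All overlapping words (k-tuples) of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_lookup(seq, k):
    """Hash table mapping each k-tuple to the positions where it starts in seq."""
    table = defaultdict(list)
    for pos, word in enumerate(k_tuples(seq, k)):
        table[word].append(pos)
    return table


def diagonal_hits(query, subject, k):
    """Count shared k-tuples per diagonal (subject position minus query position)."""
    table = build_lookup(query, k)
    diagonals = Counter()
    for j, word in enumerate(k_tuples(subject, k)):
        for i in table.get(word, []):
            diagonals[j - i] += 1
    return diagonals


if __name__ == "__main__":
    seq = "TGATGATGAAGACATCAG"  # example sequence from the K-TUPLE slide
    print(k_tuples(seq, 8)[:2])  # the first two words of length 8
    # Best diagonals for an illustrative subject that embeds part of the query.
    print(diagonal_hits(seq, "CC" + seq[:12], 4).most_common(3))
```

Diagonals that collect many shared k-tuples mark regions where the two sequences are likely to align, which is why FASTA keeps only the best-scoring diagonals for the more expensive later steps.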
