Lectures 21-22 - BLAST and Sequence Alignment PDF

Document Details

Uploaded by FriendlyTrust

University of KwaZulu-Natal

Cassie Upton

Tags

BLAST sequence alignment bioinformatics molecular biology

Summary

These lecture notes cover BLAST (Basic Local Alignment Search Tool) used in bioinformatics to analyze biological sequences and determine sequence similarity and alignment. They compare DNA and protein sequences, discuss the different types of BLAST procedures, and highlight the parameters and algorithm of BLAST.

Full Transcript

Lecture 21 & 22: BLAST & Sequence Alignment
RDNA202
Cassie [email protected]
Biotechnology and Genomics

Lecture 20: Introduction to Bioinformatics
Lecture 21 & 22: BLAST and Sequence Alignment

BLAST

Basic Local Alignment Search Tool
One of the most commonly used tools for:
◦ Comparing sequence information
◦ Retrieving sequences from databases

BLAST - Types

Used to analyse:
◦ DNA via nucleotide comparisons
◦ Proteins via amino acid comparisons
◦ DNA and proteins via translation comparisons

Blastp (p for protein)
◦ Compares a protein query against a protein sequence database
◦ So: protein compared with protein

tBlastn (t for translated, n for nucleotide)
◦ Compares a protein query against all six reading frames of a translated nucleotide sequence database
◦ So: protein compared with translated nucleotide

Blastn (n for nucleotide)
◦ Compares a nucleotide query against a nucleotide sequence database
◦ So: nucleotide compared with nucleotide

Blastx
◦ Compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
◦ So: translated nucleotide compared with protein

tBlastx (t for translated)
◦ Compares a nucleotide query sequence against a translated nucleotide sequence database
◦ Translates both to six-frame amino acid sequences and compares them at the amino acid level

What does it look like?

BLAST algorithm

BLAST is a heuristic program, meaning it relies on smart shortcuts to perform the search faster
Different parameters:
1. General parameters: E-value, word size
2. Scoring parameters
3. Filter and masking

BLAST algorithm
1. General Parameters:

E-value
◦ Gives an indication of the statistical significance of a given pairwise alignment
◦ The lower the E-value, the more significant the hit
◦ If a sequence alignment has an E-value of 0.05, about 0.05 hits with a score at least this good are expected by chance alone in a search of that database (roughly 5 in 100, or 1 in 20); it is not a measure of percentage similarity
◦ An E-value greater than 1 indicates that the alignment likely occurred by chance

Word size
◦ The length of the seed that initiates an alignment

BLAST algorithm
2. Scoring Parameters:
◦ Reward and penalty for matching and mismatching bases
◦ Cost to create and extend a gap in an alignment

BLAST algorithm
3. Filter and Masking:
◦ Mask regions of low compositional complexity that may cause spurious or misleading results
◦ Mask the query while producing the seeds used to scan the database, but not for extensions

BLAST Results

The topmost hit = the best match to the query sequence

Why is BLAST popular?
1. The flexibility of the search algorithm
2. Reliable statistical reports
3. Continual software development
4. The speed attained by the heuristic search methods

Sequence Alignment

Alignment algorithms
◦ Sequence alignment is the most essential step in comparing biological sequences
◦ Identifies regions of similarity between sequences

Two commonly used sequence alignment algorithms:

1. Global alignment
◦ Compares two sequences by aligning the entire length of the sequences
◦ Used when sequences are of similar length

2. Local alignment
◦ Does not align the entire sequence lengths
◦ Aligns regions with the highest density of matches
◦ Useful in identifying short conserved regions in nucleotide or protein sequences

Sequence Alignment
◦ Process of comparing two (pairwise alignment) or more (multiple sequence alignment) DNA or protein sequences
◦ Done by searching for a series of individual characters/residues or character patterns that are in the same order in the sequences

Alignment types

1. Pairwise alignment
Pairwise alignment is the most fundamental operation of bioinformatics:
◦ Involves aligning two sequences together
◦ Main goal: obtain the highest possible score
◦ The score indicates the degree of similarity between the two sequences
Uses:
◦ In genome analysis
◦ To decide if two proteins/genes are structurally or functionally related
◦ To identify domains or motifs shared between proteins
◦ Is the basis of BLAST searching

Pairwise Sequence Alignment of the best hit

Alignment types

2. Multiple Sequence Alignment
◦ Involves aligning multiple (3 or more) biological sequences to achieve optimal sequence matching
Used to:
◦ Identify conserved sequence regions
◦ Construct phylogenetic trees
◦ Helps us understand functional and evolutionary relationships between different species or groups of organisms

Multiple Sequence Alignment

Why Compare Sequences?
◦ DNA and proteins are products of evolution
◦ Molecular sequences undergo random changes over time: substitutions, insertions, deletions
◦ Some of these changes are selected for during evolution
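The changes listed above (substitutions, insertions, deletions) appear as distinct column types once two sequences are aligned, and a pairwise alignment score is simply a sum over those columns. The following is a minimal sketch of that idea. The function name `classify_columns`, the example sequences, and the scoring values (+1 match, -1 mismatch, -2 gap) are illustrative assumptions, not values taken from the lecture or from BLAST defaults.

```python
def classify_columns(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Count column types in an already-aligned pair and sum a simple score.

    Both inputs must be the same length and use '-' for gap characters.
    Scoring values are illustrative only.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    counts = {"match": 0, "substitution": 0, "indel": 0}
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            # A gap in either row means an insertion or a deletion.
            counts["indel"] += 1
            score += gap
        elif x == y:
            counts["match"] += 1
            score += match
        else:
            counts["substitution"] += 1
            score += mismatch
    return counts, score


if __name__ == "__main__":
    # Hypothetical aligned pair: one gap column and one substitution.
    counts, score = classify_columns("AATGATCCGATT", "A-TGATTCGATT")
    print(counts, score)
```

An alignment algorithm searches for the arrangement of gaps that makes this kind of score as high as possible, which is the "highest possible score" goal of pairwise alignment.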
◦ Detection of similarities indicates homology
◦ Detection of similarities between sequences is done by sequence alignment
◦ Allows us to infer the roles and functions of newly isolated sequences
◦ Using well-known, already characterised sequences

Homology
◦ Homology: term used when two sequences share a common ancestor that is recent enough that it is still detectable in their sequence
◦ Simply: we must compare the same nucleotide sequence in all organisms in our comparison

Orthologs
◦ Genes related by vertical descent from a common ancestor
◦ Genes in different species that evolved from a common ancestral gene through a speciation event
◦ Encode proteins with the same function in different species

Paralogs
◦ Genes that have evolved within the same species by gene duplication events
◦ When a gene is duplicated, the two copies can evolve independently, leading to the development of paralogs
◦ Code for proteins with similar, but not necessarily identical, functions

Orthologs vs Paralogs

Feature         Orthologs                             Paralogs
Origin          Result from speciation events         Result from gene duplication events
Species         Found in different species            Found within the same species
Functionality   Typically retain similar functions    Functions may diverge over time

Homology
◦ When two sequences share a common ancestor
◦ Recent enough that it is still detectable in their sequence
◦ Must compare the same nucleotide sequence in all organisms in our comparison

Similarity
◦ Any 2 sequences can be compared and similarity calculated (% nt or aa identity)
◦ BUT this is meaningless unless they are homologous

Alignments: Positional Homology

AATGATCCGATT
ATGATCCGATT
AATGATTCTTCT
ATTGATTCGATTCTA

How do you compare these sequences?
Which are most similar?
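To see why the question above is harder than it looks, here is a minimal sketch that compares the first two sequences position by position without aligning them. The function name `percent_identity` and the choice to compare only over the length of the shorter sequence are assumptions made for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical characters, compared position by position.

    No alignment is performed: position 1 is compared with position 1,
    position 2 with position 2, and so on, up to the shorter length.
    """
    compared = min(len(seq_a), len(seq_b))
    if compared == 0:
        raise ValueError("both sequences must be non-empty")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return 100.0 * matches / compared


if __name__ == "__main__":
    first = "AATGATCCGATT"
    second = "ATGATCCGATT"   # same as `first` without its leading A

    # Naive comparison: the single missing base shifts every later position.
    print(round(percent_identity(first, second), 1))

    # Shifting `first` by one position lines the equivalent bases up again.
    print(percent_identity(first[1:], second))
```

The two sequences differ by a single base at the start, yet the unaligned comparison reports low identity because every later position is out of register. Placing equivalent positions under each other restores the true similarity.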
Align them
◦ An alignment involves creating positional homology
◦ Nucleotides at equivalent positions are placed under each other
◦ This allows comparison and identification of mutations
◦ A good alignment is essential for a good analysis

Alignments: Positional Homology

An algorithm is used to create an alignment, e.g. Clustal W

AATGATCCGATT
ATGATCCGAGT
AATGATTCGTCT
ATTGATTCGAGTCTA

Questions:
◦ Are these sequences an alignment?
◦ What is the best way to align them?

Sometimes it is necessary to add gaps to the sequence to get a better alignment

AATGATCCGATT
AATGATCCGAGT
AATGATTCAAGTCT
ATTGATTCGAGTCTA

Alignments: Positional Homology

TRIM: ensure sequences are all the same length

AATGATCCGATT
AATGATCCGAGT
AATGATTCAAGTCT
ATTGATTCGAGTCTA

Analyse: the quality of the analysis depends on the quality of the alignment

AATGATCCGATT
AATGATCCGAGT
AATGATTC - - GTCAT
ATTGATTCGAGTCTA

Importance of Sequence Alignments
◦ BLAST finds matches
◦ CLUSTAL aligns matches
◦ It is easy to make comparisons when sequences are aligned
◦ E.g. examine how a gene sequence varies among people with and without a disease
◦ E.g. cystic fibrosis: a person affected by the disease is missing a three-base DNA sequence
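As an illustration of how adding gaps exposes a missing three-base sequence, here is a minimal sketch of global alignment using the classic Needleman-Wunsch dynamic-programming method. This is a textbook exact algorithm, not the method used by BLAST or Clustal W. The function name `global_align`, the scoring values, and both sequences are hypothetical toy values chosen for the example; they are not real gene sequences.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). Scoring values are illustrative.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or substitution
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    # Hypothetical toy sequences: `affected` is `reference` minus three bases.
    reference = "ATGGCCAAGTTTCTGGAC"
    affected = reference[:9] + reference[12:]
    top, bottom, total = global_align(reference, affected)
    print(top)
    print(bottom)
    print(total)
```

With the gaps in place, the bases on either side of the deletion line up under each other, so the missing three bases are visible as gap characters in the second row.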
