Biotech 4BI3 Bioinformatics Lecture Notes PDF

Summary

These lecture notes provide an overview of bioinformatics techniques for sequence analysis, including dot plots, BLAST, and alignment scoring matrices (PAM and BLOSUM). The document focuses on the principles and applications of these methods for comparing and analyzing biological sequences.

Full Transcript

BIOTECH 4BI3 - Bioinformatics
Lecture 4 – Scoring Alignments and Similarity Searches

Where are we going?

[Figure: course roadmap. Topics shown: DNA Sequencing, Sequencing Quality Control, DNA Assembly, DNA Read Mapping, Genome Annotation, Expression Analysis, Polymorphism Discovery, Genotyping, Marker-Trait Associations, Population Analysis]

Learning Outcomes

- Describe what a dotplot is and what information can be learned from it
- Learn how similarity scores are calculated between DNA sequences
- Understand how similarity scoring matrices are created and how they are used in comparing the relatedness of proteins
- Detail the steps of the BLAST algorithm
- Interpret BLAST output, including E-values, P-values, and alignment scores, to assess biological relevance
- Describe the Smith-Waterman alignment algorithm and be able to implement it
- Understand the difference between global and local alignments and when to apply each

Dotplots

Definition: A dotplot is a graphical tool used to visualize the relationship between two sequences by plotting matching residues or nucleotides in a matrix.

Axes Representation: One sequence is placed along the x-axis and the other along the y-axis. A dot is drawn at every position where the two sequences agree.

Visualization: Diagonal lines indicate regions of similarity or identity between the sequences. Useful for spotting duplications, inversions, and repeats.

Dotplots

Advantages:
1. Easy visualization of regions shared between sequences.
2. Identifies structural features:
   - Repeats
   - Duplications
   - Inversions
   - Palindromes

Software: MUMmer, a widely used tool for generating dotplots, particularly for genome comparisons.

Limitations of Dotplots

1. Limited Scalability:
   a. Dotplots work well for small to moderately sized sequences but become cluttered and hard to interpret with large genomic sequences.
   b. Memory and computational constraints. The dotplot matrix grows with the product of the two sequence lengths (quadratically for sequences of similar size), which can lead to performance issues.
2. Sensitivity to Noise:
   a. Noise in repetitive or low-complexity regions. Repetitive sequences or regions with low complexity can produce misleading results, creating false positives that appear as diagonal streaks or blocks.
   b. High sensitivity to random similarities. Even non-homologous sequences can show matches, which can confuse the analysis.
3. Lack of Quantitative Measure:
   a. No weighted scoring system. Dotplots provide a visual representation but do not generate numerical scores for similarity or alignment quality, making them less useful for detailed analysis.
   b. Subjective interpretation. Interpretation of dotplots can be subjective, especially when patterns are not obvious or regions of similarity are unclear.
4. Gaps and Insertions Handling:
   a. Difficulty visualizing complex indels. While small gaps or insertions can be observed, dotplots struggle to visualize large or complex insertions and deletions in an informative way.
   b. Does not show alignments directly. Dotplots only show similarity without explicitly aligning the sequences or considering gap penalties.
5. Limited Use in Divergent Sequences:
   a. Dotplots are more effective for comparing closely related sequences. As sequence divergence increases, meaningful patterns become harder to detect.

Interpreting Dotplots

[Figure panels: Perfect Agreement; Duplicated Region]
[Figure panels: Palindromic Sequences; Homologous Sequences]
[Figure panels: Microsatellites; Sequence Inversion]

Dotplot - Comparing Two Sequences

[Figure] Source: https://mummer4.github.io/tutorial/tutorial.html

Dotplot – Self Comparison

- Compare a sequence against itself
- Note the reflection across the diagonal
- Useful to visualize duplicated regions, which appear as lines parallel to the diagonal
- Regions of low complexity appear as blocks

Source: https://en.wikipedia.org/wiki/Dot_plot_%28bioinformatics%29

Sequence Similarity

Definition: Sequence similarity refers to the degree of likeness or matching between two DNA, RNA, or protein sequences. It can be quantified by various methods, depending on the nature of the comparison (e.g., exact matches, evolutionary distances). Sequence similarity is a fundamental concept in bioinformatics that underpins many types of analysis, from gene annotation to evolutionary studies.

Applications:
- Identifying Homologs: Determining if two sequences are evolutionarily related.
- Functional Inference: Inferring gene or protein function based on similarity.
- Phylogenetic Analysis: Reconstructing evolutionary trees based on sequence data.

Sequence Similarity

Methods to Assess Sequence Similarity:
1. Hamming Distance: Measures the number of mismatches between two sequences of equal length. Useful when no insertions or deletions (indels) are involved.
2. Edit Distance (Levenshtein Distance): Measures the minimum number of operations (insertions, deletions, or substitutions) required to transform one sequence into another. Accounts for indels and substitutions.

Hamming Distance

    ATCGGGATGCCAGAGCCC
    ATCAGGATGTTAGAGCGC

Hamming Distance is 4

Edit Distance

    ATCGGGATGC-CAGAGCCC
    AT--GGTTGCCCAGAGCGC

Insertions: 2
Deletions: 1
Transversions: 2
Total = 5

Scoring Matrices

Definition: Matrices that assign scores to different types of matches and mismatches based on the evolutionary or functional significance of the changes. Commonly used in protein alignments.

Examples:
- PAM (Point Accepted Mutation) matrix: Focuses on mutations over evolutionary time.
- BLOSUM (Blocks Substitution Matrix): More commonly used for distant sequence relationships.

DNA Alignment Scoring

Definition: DNA alignment scoring refers to assigning numerical values (scores) to the matches, mismatches, and gaps between two or more DNA sequences during alignment.

Goal: Maximize the alignment score to represent the best possible sequence comparison, balancing matches and differences (mismatches/gaps).

Why Scoring is Important:
- Provides a quantitative way to compare sequences and assess similarity.
- Helps identify evolutionary relationships or functional similarities between sequences.

DNA Alignment Scoring

Identity Matrix: This simple scoring matrix assumes that the likelihood of mutating from one nucleotide to any other is equal.

         A   G   T   C
    A    1   0   0   0
    G    0   1   0   0
    T    0   0   1   0
    C    0   0   0   1

Key Feature: Assumes that all mismatches are equally likely and doesn't account for biological or evolutionary significance.

Limitations: Simplistic and does not reflect the biological reality that certain mutations are more common than others.

DNA Alignment Scoring

Transition-Transversion Matrix: A biologically informed scoring matrix.

         A   G   T   C
    A    4   2   1   1
    G    2   4   1   1
    T    1   1   4   2
    C    1   1   2   4

Transitions (A↔G, C↔T): More common in evolution and therefore given higher scores.

Transversions (purine↔pyrimidine): Less common in evolution and given lower scores.

Benefit: More biologically accurate than the equal scoring of the identity matrix.

Gap Penalties

Gaps represent insertions or deletions and should be penalized to prevent introducing too many gaps in the alignment.

Types of Gap Penalties:
1. Constant Gap Penalty: A fixed penalty for each gap, regardless of length.
2. Affine Gap Penalty: Penalizes opening a gap more heavily than extending it, encouraging the use of fewer, longer gaps rather than many shorter ones.

Why Use Gap Penalties:
- Avoids excessive use of gaps that would artificially inflate alignment scores.
- More biologically realistic, as long gaps (indels) are often rarer than small mutations.

Gap Penalties

In the context of affine gap penalty, the word affine refers to a linear relationship between the gap length and the total penalty applied for introducing a gap in a sequence alignment.

An affine gap penalty has two components:
1. Gap opening penalty: A fixed cost or penalty for starting a gap.
2. Gap extension penalty: A smaller, recurring cost for each additional unit (base or amino acid) in the gap after the initial one.
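
The scoring pieces above (a substitution matrix plus gap penalties) can be combined in a few lines. The following is a minimal sketch, not part of the original slides: it scores an already-aligned pair of DNA sequences with the transition-transversion matrix shown above and an affine gap penalty. The function names and the penalty values (gap_open=5, gap_extend=1) are illustrative assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y):
    """Transition-transversion scores from the matrix above: 4 / 2 / 1."""
    if x == y:
        return 4  # identical bases
    is_transition = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 2 if is_transition else 1  # transition vs. transversion


def affine_gap_penalty(length, gap_open=5, gap_extend=1):
    """Penalty for one gap: opening cost plus extension cost for every
    position after the first. The default values are assumed, not from the slides."""
    return gap_open + gap_extend * (length - 1)


def score_alignment(top, bottom, gap_open=5, gap_extend=1):
    """Score two aligned sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap = None  # which row held a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            current_gap = "top" if x == "-" else "bottom"
            # Charge the opening cost for the first gap position in a run,
            # and the cheaper extension cost for each position after it.
            score -= gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += substitution_score(x, y)
            previous_gap = None
    return score


# The gapped alignment from the edit distance example.
example_score = score_alignment("ATCGGGATGC-CAGAGCCC", "AT--GGTTGCCCAGAGCGC")
print(example_score)
```

With these assumed penalties, the two-base gap costs 5 + 1 = 6 and the single-base gap costs 5, so the alignment scores 14 matches × 4 + 2 transversions × 1 − 11 = 47. Changing gap_open or gap_extend changes which alignments score best, which is why gap penalties are listed among the factors affecting alignment scores.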
Affine gap penalty example: $W(L) = o + e\,(L - 1)$, where $o$ is the penalty for opening a gap, $e$ is the penalty for extending it, and $L$ is the length of the gap.

Interpreting an Alignment Score

Key Concepts:
- Higher scores mean a better alignment, indicating closer evolutionary or functional relationships between the sequences.
- Lower scores mean more mismatches or gaps, suggesting more divergence between the sequences.

Factors Affecting Alignment Scores:
1. Choice of scoring matrix
2. Gap penalties applied

Biological context: Depending on the organisms or sequences being compared, different matrices and gap penalties may be more appropriate.

PAMn Matrices

- Point Accepted Mutations
- The first positional scoring matrix used to determine protein relatedness
- A quantitative way to measure the evolutionary distance between two protein sequences
- 1,572 mutations observed within 71 groups of closely related proteins (at least 85% identity) were used to generate PAM1. All other PAMs are powers of PAM1
- The idea is that some protein mutations are more likely than others, based on what is observed in nature
- n represents the number of accepted mutations per 100 amino acids

Point Accepted Mutation (PAM) Matrix

Definition: PAM matrices are substitution matrices that score alignments based on the likelihood of one amino acid replacing another during evolution. They measure evolutionary distances between protein sequences.

Key Concept: A PAM matrix is built on the idea that some mutations are more likely to occur than others due to evolutionary pressure.

Origin: The first PAM matrix (PAM1) was generated by Margaret Dayhoff by studying 1,572 mutations in 71 groups of closely related proteins with at least 85% sequence identity. Other PAM matrices (PAMn) are derived by raising the PAM1 matrix to the nth power (multiplying it by itself).

Nomenclature:
- PAM1: Represents 1 accepted mutation per 100 amino acids (a 1% change).
- PAM250: Represents 250 accepted mutations per 100 amino acids.

Creating a PAM Matrix

1. Align Protein Sequences: Perform global alignments on sequences with at least 85% identity.
2. Identify Accepted Mutations: Look for mutations in the alignments that are accepted evolutionarily (i.e., they have been preserved in nature).
3. Calculate Mutation Probabilities: Each entry in the matrix reflects the probability of one amino acid mutating into another over time.
4. Extrapolation to Higher PAMs: Multiply the PAM1 matrix by itself to create matrices for more distant evolutionary events, assuming that mutations follow a Markov process (the future state depends only on the present, not on past mutations).

Interpreting a PAM Matrix

Positive Scores: Indicate amino acid substitutions that occur more frequently than expected by chance (conservative mutations).

Negative Scores: Indicate substitutions that occur less frequently than expected (non-conservative mutations).

Usage: Low-numbered PAM matrices are typically used to align sequences that are more closely related evolutionarily (closer in time), with PAM250 commonly used for more divergent proteins.

PAM 250

- PAM 250 means we expect 250 amino acid changes per 100 amino acids (a single site can change more than once)
- A positive number means an amino acid change occurs more often than expected by chance
- A negative number means an amino acid change occurs less often than expected by chance

Blocks Substitution (BLOSUM) Matrix

Definition: BLOSUM matrices score alignments between proteins by considering how frequently pairs of amino acids are substituted in conserved regions of protein families. Unlike PAM matrices, BLOSUM matrices are based on observed substitutions in blocks of conserved sequences without gaps.

Key Concept: BLOSUM matrices are derived from regions of proteins that are highly conserved, meaning they do not change much over evolutionary time, and are typically used for detecting more distant evolutionary relationships.

Nomenclature:
- BLOSUM62: The most widely used matrix, created by clustering sequences that share at least 62% identity.
- BLOSUMn: Different BLOSUM matrices are built by clustering sequences at various identity thresholds. A lower number (e.g., BLOSUM45) is used for comparing more distantly related sequences, while a higher number (e.g., BLOSUM80) is used for more closely related sequences.

How to Create a BLOSUM Matrix

1. Align Conserved Protein Blocks: Use alignments from highly conserved regions of proteins, which are generally free from gaps.
2. Cluster Sequences: Group sequences based on a specific identity threshold. For instance, BLOSUM62 clusters sequences with at least 62% identity.
3. Count Amino Acid Pairs: Within these clusters, count the number of times pairs of amino acids are aligned at a specific position.
4. Calculate Substitution Frequencies: For each pair, calculate the frequency of observed amino acid pairs compared to the expected frequency (based on amino acid abundance in proteins).
5. Compute Log-Odds Ratios: For each amino acid pair, calculate a log-odds score, which reflects how much more likely the pair is to be found in conserved sequences than by random chance.

Interpreting BLOSUM Matrices

Positive Scores: Amino acid substitutions that occur more frequently than expected by chance, indicating "conservative" substitutions that maintain protein structure and function.

Negative Scores: Substitutions that occur less frequently than expected, often representing "non-conservative" changes that are less likely to be evolutionarily preserved.

Usage: BLOSUM matrices are used to compare protein sequences. Depending on the evolutionary distance of interest, different matrices (BLOSUM45, BLOSUM62, BLOSUM80) can be used. Lower numbers (e.g., BLOSUM45) are used for more distantly related sequences, while higher numbers (e.g., BLOSUM80) are better for closely related sequences.

How to Create a BLOSUM Matrix (worked example)

    Sequence   Position 1   Position 2   Position 3
    Seq-A      T            T            A
    Seq-B      A            A            Q
    Seq-C      A            Q            Q
    Seq-D      A            T            A

1. Find the observed amino acid (AA) frequency

    Amino Acid   Count
    A            6
    Q            3
    T            3
    Total        12

2. Calculate the count of pairs of AA (all pairs within each column)

    AA Pairs   Count
    AA         4
    QQ         1
    TT         1
    AT         5
    AQ         5
    QT         2
    Total      18

3. Calculate the observed frequency of pairs of AA

    AA Pairs   Count   Observed
    AA         4       4/18
    QQ         1       1/18
    TT         1       1/18
    AT         5       5/18
    AQ         5       5/18
    QT         2       2/18

4. Calculate the expected frequency of pairs of AA (from the amino acid counts in step 1)

    AA Pairs   Expected
    AA         6/12 * 6/12
    QQ         3/12 * 3/12
    TT         3/12 * 3/12
    AT         6/12 * 3/12 * 2
    AQ         6/12 * 3/12 * 2
    QT         3/12 * 3/12 * 2

5. Calculate the log-odds ratio: 1/alpha * log2(observed/expected), here with 1/alpha = 2

    AA Pairs   Expected   Observed   2log2(O/E)
    AA         0.250      0.222      2log2(0.222/0.250) = -0.340
    QQ         0.063      0.056      2log2(0.056/0.063) = -0.340
    TT         0.063      0.056      2log2(0.056/0.063) = -0.340
    AT         0.250      0.278      2log2(0.278/0.250) = 0.304
    AQ         0.250      0.278      2log2(0.278/0.250) = 0.304
    QT         0.125      0.111      2log2(0.111/0.125) = -0.340

(Values are computed from the exact fractions; the decimals shown are rounded.)

[Figure: BLOSUM62 matrix] Source: http://courses.washington.edu/bioinfo/Pabio536/Data/Blossum%2062.jpg

BLOSUM Matrix

Read the article "Where did the BLOSUM62 alignment score matrix come from?" by Sean R Eddy.

Basic Local Alignment Search Tool (BLAST)

Definition: BLAST is a widely used algorithm that searches a database of sequences to find regions of similarity to a query sequence. It is crucial for identifying homologous sequences and assessing biological relevance.

Key Concept: BLAST works by comparing a sequence (nucleotide or protein) to a large database to identify similar sequences. It is fast and efficient for both local alignments and large-scale searches.

Applications:
- Identifying homologous genes or proteins.
- Inferring evolutionary relationships.
- Discovering functional similarities between sequences.
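
Before moving further into BLAST, the BLOSUM worked example (steps 1 to 5 earlier in these notes) can be checked with a short script. This is a minimal sketch under simplifying assumptions: it treats the four sequences as a single block, skips the identity-threshold clustering that real BLOSUM construction uses, and does not round scores to integers. The function name blosum_log_odds and the scale parameter (standing in for 1/alpha) are illustrative names, not from the slides.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Seq-A to Seq-D from the worked example, one string per sequence.
block = ["TTA", "AAQ", "AQQ", "ATA"]


def blosum_log_odds(block, scale=2.0):
    """Return (pair_counts, scores) for one ungapped block of aligned sequences."""
    # Step 1: amino acid frequencies over the whole block.
    residue_counts = Counter("".join(block))
    total_residues = sum(residue_counts.values())
    freq = {aa: n / total_residues for aa, n in residue_counts.items()}

    # Step 2: count every pair of residues within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts["".join(sorted((x, y)))] += 1
    total_pairs = sum(pair_counts.values())

    # Steps 3 to 5: observed vs. expected frequency, then the log-odds score.
    scores = {}
    for pair, n in pair_counts.items():
        x, y = pair
        observed = n / total_pairs
        # A mixed pair can arise in two orders (x,y or y,x), hence the factor 2.
        expected = freq[x] * freq[y] * (1 if x == y else 2)
        scores[pair] = scale * log2(observed / expected)
    return pair_counts, scores


pair_counts, scores = blosum_log_odds(block)
for pair in sorted(scores):
    print(pair, pair_counts[pair], round(scores[pair], 3))
```

Pairs that are seen more often than their amino acid frequencies predict (AT and AQ here) get positive scores, and the rest get negative scores, which matches the interpretation given for BLOSUM matrices.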
Common Types of BLAST

- blastn: Compares a nucleotide sequence against a nucleotide database.
- blastp: Compares a protein sequence against a protein database.
- blastx: Compares a translated nucleotide sequence against a protein database.
- tblastn: Compares a protein sequence against a translated nucleotide database.
- tblastx: Compares translated nucleotide sequences against each other.
- PSI-BLAST: Iteratively searches for distant protein relationships by updating the search profile after each round.

How BLAST Works

1. Query Segmentation (Word Size): The query sequence is divided into shorter, fixed-length words (k-mers). For DNA, words are typically 11 nucleotides long. For proteins, words are usually 3 amino acids long. These words represent potential alignment "seeds".
2. Word Matching (Hits): BLAST scans the database to identify sequences that contain exact or near-exact matches to the query words. Matches (hits) between the query and the database sequences are identified at this stage.
3. Word Extension: Once hits are found, BLAST attempts to extend the matches in both directions, forming high-scoring segment pairs (HSPs). Extension occurs by adding adjacent nucleotides or amino acids until the score starts to drop significantly. During this phase, BLAST allows for mismatches or gaps, which help to improve the alignment over longer regions.
4. Scoring and Filtering: The alignment is scored based on the scoring matrix (e.g., PAM or BLOSUM for protein alignments). BLAST filters out low-scoring HSPs, retaining only the highest-scoring matches for further evaluation.
5. Statistical Evaluation: The remaining alignments are evaluated using statistical methods to calculate the E-value (expected value). The E-value measures the likelihood that the match occurred by chance. A lower E-value indicates a more significant result.
6. Output: BLAST outputs the results in a ranked list, with the highest-scoring alignments at the top. For each hit, BLAST provides the alignment score, E-value, and a detailed alignment showing where matches, mismatches, and gaps occur.

How a BLAST Report is Organized

- Query sequence: Your sequence.
- The database: The collection of sequences you are comparing the query sequence to.
- Hits: The sequences in the database that share similarity with the query sequence.
- Alignment: For each hit, the report shows an alignment of the hit sequence with the query sequence. This includes a score for the alignment and an estimate of the statistical significance of the match. These alignments are presented as high-scoring segment pairs (HSPs). The HSPs are segment pairs whose scores cannot be improved by extension or trimming of the alignment.
- Score: The score reflects the quality of the alignment. Higher scores indicate better alignments.
- p-value: The probability that the observed match could have happened by chance in the database.
- e-value: The E-value estimates the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the score (S) of the match increases. Essentially, the lower the E-value, the more significant the score.

BLAST Report

[Figure: example BLAST report] Source: http://bioinformatics.bc.edu/~marth/BI820/images/cshlBlastReport.png

BLAST Report Statistics

- p-value: The probability that the observed match could have happened by chance in the database. The value depends on the e-value.
- e-value: The number of matches as good as the observed one that would have arisen by chance. The e-value depends on the length of the query sequence, the size of the database, and the scoring matrix used.

Suggestions to Interpret a BLAST Report

    P-value   Meaning
    10^-1     Likely insignificant

What is DNA alignment

Definition: DNA alignment is the process of arranging two or more DNA sequences to identify regions of similarity. The goal is to line up the sequences so that similar or identical bases are in the same columns, maximizing the match score between them.

Purpose: DNA alignment helps to:
1. Detect evolutionary relationships: Alignments can reveal how closely related two sequences are, helping to trace ancestry and speciation events.
2. Identify functional regions: Highly conserved (unchanged) regions in DNA are often important functional elements, such as genes or regulatory sequences.
3. Predict mutations and polymorphisms: Alignments highlight differences (mutations, SNPs, indels) between sequences, which are important in evolutionary biology and medical genetics.

Types of DNA Alignment

Global Alignment:
1. Attempts to align the sequences across their entire length, even if it requires introducing gaps to match the sequences fully
2. Used when comparing sequences of similar lengths that are expected to align across their entirety (e.g., two complete genes or genomes)
3. Algorithm: Needleman-Wunsch

Local Alignment:
1. Finds regions of similarity within two sequences and aligns only those segments. This approach is more flexible, as it allows for partial matches
2. Used for sequences that are only partially similar or contain large regions of divergence (e.g., when comparing genes from different species)
3. Algorithm: Smith-Waterman

Needleman-Wunsch

- General method of performing sequence comparison
- Goal is to maximize the alignment score between two sequences
- Finds the best global alignment. This means that all bases from both sequences are included in the alignment
- Two-step process:
  - All pairs of residues are compared and scored in a 2-D matrix using a dynamic programming approach
  - All alignments are represented by pathways through this matrix
- The highest-scoring path is the best global alignment
- The highest-scoring path begins at the largest value in the last row or column and ends in the top left cell

Needleman-Wunsch

[Figure] Source: https://upload.wikimedia.org/wikipedia/commons/3/3f/Needleman-Wunsch_pairwise_sequence_alignment.png

Smith-Waterman Alignment

Definition: The Smith-Waterman algorithm is a dynamic programming algorithm used for local sequence alignment. It identifies the best matching subsequences between two sequences by maximizing the alignment score and finding the highest-scoring local region.

Key Features:
- Local Alignment: Finds regions of similarity within two sequences, rather than aligning them end-to-end (global alignment).
- Optimal for Substrings: Used when you want to find the best-matching local regions between sequences with large dissimilar segments.
- Gaps and Mismatches: Allows gaps (insertions or deletions) and mismatches in the alignment, with penalties applied to prevent overuse.

More Smith-Waterman

Example scoring: w(gap) = −1, w(match) = +2, w(mismatch) = −1

Source: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

Demonstrate SW

NW vs SW

    Needleman-Wunsch                          Smith-Waterman
    Global alignments                         Local alignments
    Alignment scores can be negative          Alignment scores can only be zero or positive
    No gap penalty required                   Works best with a gap penalty
    First row and column are calculated       First row and column are set to zero
    with the gap penalty

Source: https://upload.wikimedia.org/wikipedia/commons/0/03/Alignment-Comparison-En.png

What is Multiple Sequence Alignment (MSA)

Definition: Multiple sequence alignment (MSA) is the process of aligning three or more biological sequences (DNA, RNA, or proteins) to identify regions of similarity across all sequences.

Key Objective: The goal of MSA is to align sequences to maximize similarity, which can reveal evolutionary relationships, conserved regions, and functional elements shared by the sequences.

Applications:
- Phylogenetics: Helps build evolutionary trees by identifying homologous sequences.
- Functional Analysis: Reveals conserved motifs or domains important for the function of proteins or genes.
- Protein Structure Prediction: Conserved regions may indicate structural or functional importance.

Methods for MSA

Progressive Alignment (e.g., ClustalW): Aligns sequences two at a time, starting with the most similar pair, then progressively aligns more sequences. Builds a guide tree to determine the order in which sequences are aligned.
- Advantage: Fast and easy to implement.
- Limitation: Sensitive to initial alignment errors, which can propagate as more sequences are aligned.

Iterative Alignment (e.g., MUSCLE): Starts with an initial alignment, then refines it by repeatedly realigning subgroups to improve the overall score.
- Advantage: Often more accurate than progressive methods.
- Limitation: Requires more computational resources and time.

Consensus-Based Alignment (e.g., T-Coffee): Uses a combination of different alignment methods and creates a consensus by weighting and combining the results.
- Advantage: Typically more reliable when aligning divergent sequences.
- Limitation: Can be slower due to multiple alignment steps.

Challenges in MSA

- Gaps and Indels: Introducing gaps for insertions or deletions in one sequence can create difficulty when aligning sequences of varying lengths.
- Divergence: Highly divergent sequences can be challenging to align due to large evolutionary distances.
- Computational Complexity: Aligning multiple sequences requires significantly more computational power compared to pairwise alignments.

MSA

Source: https://ugene.net/multiple-sequence-alignment-overview/
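
Since one of the learning outcomes is to implement Smith-Waterman, here is a closing sketch of the algorithm using the example scoring given in these notes (match +2, mismatch −1, gap −1, applied per gap position). It is a minimal illustration rather than a reference implementation: the function name, the return format, and the two short test sequences are assumptions chosen for the example, and only one best local alignment is reported.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    # The first row and column stay at zero, as in the NW vs SW comparison.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # a[i-1] aligned to a gap in b
            left = H[i][j - 1] + gap  # b[j-1] aligned to a gap in a
            H[i][j] = max(0, diag, up, left)  # scores never drop below zero
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the highest-scoring cell and stop at the first zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("ACACACTA", "AGCACACA")
print(score)
print(top)
print(bottom)
```

For these two sequences the best local alignment has seven matches and two single-base gaps, giving a score of 7 × 2 − 2 = 12. Initializing the first row and column with gap penalties, removing the zero floor, and tracing back from the bottom-right cell would turn the same recurrence into a global (Needleman-Wunsch style) alignment.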
