Dynamic Programming in Sequence Alignment
52 Questions
0 Views

Choose a study mode

Play Quiz
Study Flashcards
Spaced Repetition
Chat to lesson

Podcast

Play an AI-generated podcast conversation about this lesson

Questions and Answers

What is the time complexity for dynamic programming algorithms used in sequence alignment?

  • O(n^2)
  • O(n + m)
  • O(log(nm))
  • O(nm) (correct)
  • What is the main advantage of linear space alignment methods compared to traditional dynamic programming?

  • They eliminate the need for traceback pointers.
  • They can handle protein sequences only.
  • They require significantly less memory. (correct)
  • They provide faster alignment without losing accuracy.
  • Why is memory usage a critical factor in dynamic programming alignment for long sequences?

  • It is always manageable for genomic DNA alignments.
  • It can cause the algorithm to fail completely.
  • It reduces the speed of alignment significantly.
  • It may exceed the machine’s physical capacity. (correct)
  • What occurs when rows are discarded to limit memory usage during alignment?

    <p>Traceback pointers for alignment are lost. (C)</p> Signup and view all the answers

    Which of the following best describes the relationship between speed and dynamic programming algorithms?

    <p>They are slower but more accurate than heuristic algorithms. (D)</p> Signup and view all the answers

    What is the primary purpose of heuristic algorithms like FASTA and BLAST?

    <p>To find high-scoring local alignments quickly (D)</p> Signup and view all the answers

    What advantage does the FASTA algorithm provide regarding speed and sensitivity?

    <p>It improves speed at the cost of sensitivity (D)</p> Signup and view all the answers

    For which type of sequences is FASTA best suited?

    <p>Global alignments of DNA sequences (A)</p> Signup and view all the answers

    How does FASTA initially identify potential alignments within the sequences?

    <p>Through hash tables for k-tuple matches (B)</p> Signup and view all the answers

    What is the default k value used for DNA sequences in the FASTA algorithm?

    <p>6 (B)</p> Signup and view all the answers

    What is the most important criteria for sequences to be grouped into the same cluster when calculating BLOSUM matrices?

    <p>The sequences should contain a minimum 50% identical residues. (D)</p> Signup and view all the answers

    What does the 'expected pair frequency' refer to in the context of BLOSUM matrices?

    <p>The frequency of an amino acid pair based on the average frequency of those amino acids across all sequences. (C)</p> Signup and view all the answers

    How are substitution scores in BLOSUM matrices calculated?

    <p>By comparing the observed frequency of an amino acid pair to the expected frequency based on their individual frequencies. (D)</p> Signup and view all the answers

    What information does the BLOSUM score for a particular amino acid pair represent?

    <p>The likelihood of a particular amino acid pair being found in the aligned sequences. (C)</p> Signup and view all the answers

    Why might BLOSUM62 be considered more appropriate for aligning more distantly related sequences compared to BLOSUM50?

    <p>BLOSUM62 is constructed from blocks with higher sequence identity which means it captures a larger evolutionary distance. (A)</p> Signup and view all the answers

    What is the implication of a positive BLOSUM score for a particular amino acid pair?

    <p>The two amino acids are structurally similar. (A)</p> Signup and view all the answers

    What is the purpose of using blocks of protein fragments (the BLOCKS database) to construct BLOSUM matrices?

    <p>Blocks represent conserved regions in proteins, which are important for identifying structural and functional relationships. (C)</p> Signup and view all the answers

    How are BLOSUM matrices different from PAM matrices?

    <p>BLOSUM matrices are based on observed frequencies of amino acid pairs, while PAM matrices are based on evolutionary distances. (B)</p> Signup and view all the answers

    What could a negative BLOSUM score for a particular amino acid pair suggest?

    <p>The two amino acids are structurally dissimilar and less likely to replace each other during evolution. (D)</p> Signup and view all the answers

    How does the clustering procedure with a minimum identity threshold affect the construction of BLOSUM matrices?

    <p>It ensures that only sequences with similar evolutionary histories are used for calculating pair frequencies. (A)</p> Signup and view all the answers

    Why is it important that BLOSUM matrices are directly calculated without extrapolations?

    <p>Direct calculation ensures that the scores are consistent with the observed frequencies of amino acid pairs. (C)</p> Signup and view all the answers

    How are BLOSUM matrices used in biological research?

    <p>They are used to identify and characterize homologous proteins within a database. (C)</p> Signup and view all the answers

    What is the main advantage of using a logarithmic scale for BLOSUM scores?

    <p>It enables the use of a smaller range of integers to represent the scores. (A)</p> Signup and view all the answers

    What does the number after the BLOSUM matrix name (e.g., BLOSUM62) refer to?

    <p>The minimum percentage identity of the blocks used to construct the matrix. (C)</p> Signup and view all the answers

    Which of these statements about BLOSUM matrices is TRUE?

    <p>Higher BLOSUM scores for a pair always indicate a stronger evolutionary relationship. (D)</p> Signup and view all the answers

    How does the BLOSUM matrix differ from a scoring matrix used for aligning DNA sequences?

    <p>BLOSUM matrices consider the chemical properties of amino acids, while DNA scoring matrices focus on base pairing rules. (A)</p> Signup and view all the answers

    What is the primary goal of PSI-BLAST?

    <p>To find remote homologues (A)</p> Signup and view all the answers

    What technique does PSI-BLAST use to enhance its search results?

    <p>Position specific scoring matrices (A)</p> Signup and view all the answers

    In which range of sequence identity levels is PSI-BLAST effective?

    <p>15-25% (C)</p> Signup and view all the answers

    What is the function of PHI-BLAST?

    <p>To identify motifs in given protein sequences (B)</p> Signup and view all the answers

    What approach does BLAST use to compute the statistical significance of alignments?

    <p>Random sequence modeling (C)</p> Signup and view all the answers

    Which algorithm is utilized to align random sequences in the statistical evaluation by BLAST?

    <p>Smith-Waterman (D)</p> Signup and view all the answers

    What type of distribution does the maximum score of independent identically distributed random variables follow?

    <p>Extreme value distribution (B)</p> Signup and view all the answers

    How does PSI-BLAST adapt its scoring matrices throughout its iterations?

    <p>It updates based on the alignment of hits from the last search (B)</p> Signup and view all the answers

    What is the purpose of the alignment scoring matrix in the alignment process?

    <p>To compute scores for each diagonal run (B)</p> Signup and view all the answers

    Which feature distinguishes BLAST from other alignment tools according to the provided content?

    <p>Use of short word sequences for viewing alignments (D)</p> Signup and view all the answers

    What is the first step in the BLAST alignment process?

    <p>Look for short matching segments (D)</p> Signup and view all the answers

    How does the FASTA algorithm score regions of identity?

    <p>By scanning regions with a scoring matrix and saving the best (A)</p> Signup and view all the answers

    What is the optimal strategy for extending matches in BLAST?

    <p>To extend in both directions as long as the score is above a threshold (C)</p> Signup and view all the answers

    What does the term 'hot spots' refer to in the context of FASTA alignments?

    <p>Regions of identity with high scoring potential (C)</p> Signup and view all the answers

    What is the main motivation behind the development of the BLAST algorithm?

    <p>To balance speed and sensitivity in aligning sequences (D)</p> Signup and view all the answers

    What happens when overlapping hits are identified in the BLAST process?

    <p>They form a larger segment for potential alignment (B)</p> Signup and view all the answers

    In the content provided, what threshold is critical for continuing alignment extensions?

    <p>The drop-off threshold from the maximum score (B)</p> Signup and view all the answers

    What does the FASTA algorithm utilize to evaluate its hot spots before final alignment?

    <p>A scoring matrix to validate the initial regions (B)</p> Signup and view all the answers

    What does the E-value indicate about the result of a BLAST search?

    <p>The number of times a sequence with the same exact score would be expected to occur by chance in the database (A)</p> Signup and view all the answers

    Which of the following is NOT a factor that affects the E-value of a BLAST search?

    <p>The number of sequences in the database (C)</p> Signup and view all the answers

    What is the primary purpose of normalizing a raw score into a bit score?

    <p>Making scores from different alignments easier to compare, even if they are based on different scoring matrices. (B)</p> Signup and view all the answers

    What is the rule of thumb for database searching?

    <p>Search a larger database whenever possible (B)</p> Signup and view all the answers

    Which of the following is TRUE about the statistical significance of a hit?

    <p>A higher score typically indicates a more statistically significant alignment (C)</p> Signup and view all the answers

    What does the statement "E-value depends on the size of query sequence as well as that of the database" imply about BLAST search results?

    <p>The E-value can vary even for identical sequence pairs when searched against different databases. (C)</p> Signup and view all the answers

    What does performing multiple searches contribute to the interpretation of homology?

    <p>Multiple searches provide independent confirmation of a potential homology relationship. (C)</p> Signup and view all the answers

    Which of the following parameters helps determine the statistical significance of a BLAST alignment?

    <p>The E-value (A)</p> Signup and view all the answers

    Study Notes

    Pairwise Sequence Alignment - Heuristic Algorithms

    • Dynamic programming algorithms are sensitive but not the fastest sequence alignment methods
    • Time complexity is approximately O(nm), and speed is often an issue, especially with database searches
    • Protein databases contain ~100 million residues, requiring ~1011 matrix cells to completely search a database for a 1000-residue query. DNA databases are even larger.
    • At 10 million matrix cells per second, searching a 1000-residue query against a complete database takes approximately 3 hours.

    Pairwise Sequence Alignment - Memory Usage

    • Memory requirements for dynamic programming alignment are approximately O(nm), a product of sequence lengths.
    • For sequences with few hundred residues, memory usage is manageable.
    • Genomic DNA sequences with hundreds of thousands of residues can exceed machine's physical memory capacity.

    Pairwise Sequence Alignment - Linear Space Alignments

    • Techniques exist to achieve optimal sequence alignment with limited memory (n + m)
    • These methods come with the cost of doubling the execution time.

    Pairwise Sequence Alignment - Linear Space Alignments (Maximum Score)

    • Finding the maximum score in dynamic programming alignment is simplified
    • The recurrence relation only depends on the row immediately preceding the current row.
    • Data from rows further away can be discarded.
    • For local alignment, the maximum score in the entire matrix is needed, making it easy to track the maximum value during calculation.

    Pairwise Sequence Alignment - Linear Space Alignments (Alignment Itself)

    • Discarding rows in linear space alignments to save memory results in the loss of traceback pointers
    • A divide and conquer approach is used to recover alignments.

    Pairwise Sequence Alignment - Linear Space Alignments (Hirschberg's Algorithm)

    • A snapshot of Hirschberg's alignment algorithm is shown.

    Pairwise Sequence Alignment - Linear Space Alignments (Steps)

    • Split the DP problem into two subproblems: (0, 0) to (u, v), and (u, v) to (m, n), where u = m/2 (integer part).
    • Find v, the column where the alignment crosses row u=m/2.
    • Combine the optimal alignments of the two subproblems.
    • This process repeats recursively until the sequences are short enough.

    Pairwise Sequence Alignment - Linear Space Alignments (Finding V)

    • Combining "forward" and "backward" DP passes at row u enables finding v.
    • Compute the optimal score from (0, 0) to each point in row u, and from each point to (m,n).
    • Identify the point where the sum of these optimal scores from (0, 0) and from (u, v) give the optimal alignment score.

    Pairwise Sequence Alignment - Linear Space Alignments (EMBOSS Programs)

    • Emboss matcher: a linear-space local alignment algorithm developed by Waterman and Eggert
    • Emboss stretcher: a linear-space global alignment algorithm.

    Database Search - Need for Heuristics

    • Dynamic programming algorithms are computationally intensive and time-consuming (O(nm)).
    • Searching large databases using these algorithms is impractical.
    • For example, if comparing two 1 kb sequences takes milliseconds, comparing a 1 kb sequence to the entire human genome would take 3 hours.

    Database Search - Solutions for Heuristics

    • Solution 1 (Hardware implementation): Implementing the algorithm in hardware can be expensive.
    • Solution 2 (Parallel processing): Distributing the job among multiple processors for parallel processing can also be expensive as the number of processors increases (to 1000).
    • Solution 3 (Heuristics): Use heuristic algorithms to find approximately optimal solutions faster than dynamic programming, at the cost of potentially missing the best alignment.
    • Examples include FASTA and BLAST

    Heuristic Algorithms

    • Basic idea: locate high-scoring short stretches and extend them.
    • FASTA: fast alignment tool, is approximately 50 times faster than dynamic programming
    • BLAST: Basic Local Alignment Search Tool
    • These tools find high-scoring local alignments between a query sequence and a target database.

    FASTA

    • Developed by Pearson and Lipman in 1988
    • Compares sequences pairwise using heuristics
    • A good alignment will have exact matching subsequences
    • It provides speed at the cost of sensitivity
    • Works best for global alignments with DNA sequences

    FASTA: The Algorithm

    • Based on dot matrix method
    • View sequences as sequences of short words (k-tuples).
    • Good alignments have many exact matches.
    • Hashing finds exact matches in O(n) time
    • Diagonals can be formed from exact matches quickly and sorted.
    • Precise alignment is applied to the small search space.
    • Look for all k-tuple matches.
    • For DNA, k=2-6 (default 6).
    • For proteins, k=1, 2 (default 2).
    • A lookup table or hash is used.
    • Database is preprocessed to find the positions of each k-tuple.
    • Lookup the table to identify matches in the database

    FASTA: The Algorithm (Continued)

    • Look for adjacent hot spots (substrings of exact matches with the same offset).
    • Join them to form larger segments (-ve score between segments).
    • Find 10 highest scoring diagonals.
    • Find the best diagonal run and filter out low-scoring runs.
    • Recompute scores, including indels.
    • Combine nearby diagonal runs, including indels.
    • Calculate alternative local alignments, using a band around the best diagonal run.
    • Apply dynamic programming to the query against the top-ranked segments

    FASTA: The Algorithm (final)

    • Convert the initial sequences into hash tables of short words.
    • Look for matching words in the second sequence, finding the locations and saving them.
    • Identify hot spots (regions of identity) using a scoring matrix, preserving the best initial regions and discarding those with a score below the threshold.
    • Join the appropriate initial regions to form larger, more highly scoring regions (joining based on a threshold, with negative scores between hot spots).

    BLAST

    • Basic Local Alignment Search Tool, developed by Altschul et al. in 1990
    • Balances speed and sensitivity in sequence alignment
    • Designed for local alignments (now incorporates gaps).
    • It is a popular tool within the NCBI suite for sequence analysis

    BLAST: Motivation

    • Good alignments will have many close matches.
    • Statistics determine significant matches.
    • Much more sensitive than % identity
    • Extending matches in both directions.
    • High-scoring segment pairs (HSPs/MSPs).

    BLAST: The Algorithm

    • View sequences as short words (k-tuples)
    • DNA: 7, 11, 15 (default 11); protein: 2, 3, 6 (default 6)
    • Create a hash table for neighborhood (closely matching) words
    • Use statistics to set closeness threshold
    • First, find short matching segments
    • Choose matching segments that exceed a threshold
    • Overlapping hits form larger segments
    • Extend in either direction if score exceeds a threshold
    • Continue extension until score falls below "drop-off" threshold from the maximum

    BLAST: The Algorithm (Continued)

    • Convert the first sequence into a word list (considering all frames to capture different patterns).
    • Calculate "neighborhood" words for each word in the list, using a scoring threshold (T). Create a dictionary.
    • Scan the second sequence, find matching words, and store their locations.
    • Extend alignment from matches in both directions. Merge segments if the score above the threshold (S).
    • Align best segments using dynamic programming. Report statistically significant matches

    BLAST: Selecting a Program

    • Blastn: Nucleotide sequence to nucleotide database.
    • Blastp: Protein sequence to protein database.

    BLAST: Specialised Techniques

    • PSI-BLAST: Position-Specific Iterated BLAST (designed to find remote homologs (15-25% identity level))
      • Constructs scoring matrices from multiple alignments of hits.
      • Searches the database iteratively with new scoring matrices and repeats until convergence.
      • The scoring matrix is tailor-made to find query sequences. Detects homologues with 15-25% sequence identity levels.
    • PHI-BLAST: Pattern Hit Iterated BLAST (designed to locate motifs).
      • Given a protein sequence and a motif/pattern.
      • Finds proteins with that pattern and similar surrounding residues.
      • Combines regular expressions with local alignment around the matching region

    BLAST: Statistical Significance

    • Method:
      • Generate random sequences
      • Align with initial sequences
      • Compute the score of optimal alignment
      • These scores follow extreme value distributions with well-known distributions.
    • Expected number of HSPs (with score at least S):
      • E = K(mn)e-−AS.
      • K and λ are empirical parameters.
    • E-value: depends on query & database size

    BLAST: Significance of a hit (Example)

    • Search against 10,000 sequences
    • A fitted extreme-value distribution shows the score distribution, where 99.9% fall below 112.
    • An E-value of 10 indicates a hit with a score of 112 has a 1% chance of occurring by random chance given the database. E-values much smaller than 1 are considered significant.

    BLAST: Bit Scores

    • Raw scores are meaningless without understanding the scoring system (e.g., K, λ).
    • Bit Score: Normalizes a raw score (S) using a formula (S' = [AS - ln K] / ln 2). This allows comparison across different scoring matrices
    • E-value (corresponding to a bit score): E = mn 2−S`

    BLAST: Nucleotide Scoring Matrix

    • Parameters for match (M) and mismatch (N) in scoring matrices are relative and determine the sensitivity of the algorithm to conserved regions.
    • A lower M/N ratio is preferable for high sequence conservation.

    Substitution Matrices - PAM Matrices (Construction)

    • Based on the hypothesis of uncorrelated mutations accumulating to divergence.
    • Align closely-related sequences (typically > 85 % identity)
    • Observe AA change probabilities and calculate Log Odds Ratio values
    • Normalize matrix to obtain PAM-1 (1% average change).
    • Derive scoring matrices for longer evolutionary distances by iterating the PAM-1 matrix (e.g., PAM 250, results of the 250th matrix multiplication). The most common values for this approach are PAM100 and PAM250

    Substitution Matrices - PAM Matrices (Properties)

    • Reflect evolutionary distance between two protein sequences
    • Assumes that substitutional changes occurring over short evolutionary time periods can be extrapolated to longer times.

    Substitution Matrices - BLOSUM Matrices (Construction)

    • Derived from local, ungapped alignments from distantly related sequences, using the BLOCKS database.
    • Amino acid pair frequencies are calculated by summing across all pairs.
    • Different evolutionary distances are incorporated using a clustering procedure (where sequences with similar identities are grouped).

    Substitution Matrices - BLOSUM Matrices Properties

    • Directly calculated from data (no extrapolation)
    • Use minimum percent identity for blocks to construct matrix
    • BLOSUM 62 is the default matrix in BLAST

    Using BLAST to Infer Homology

    • Rules of thumb for percent identity and similarity in optimal alignments:
      • 45% identity: very similar structures and common function.
      • 25% identity: similar general folding pattern.
      • 18-25% identity: "twilight zone," lower degree of sequence similarity doesn't rule out homology.
    • High sequence similarity over the entire length suggests homology.
    • 50% similarity over a short sequence is common by chance.
    • Low complexity regions can have high similarity.
    • Homologous sequences aren't always highly similar.
    • Suggested BLAST cutoffs (for nucleotide-based searches look for hits with e-value ≤ 10−6 and sequence identity ≥ 70%; for protein-based searches, look for hits with e-value ≤ 10−3 and sequence identity ≥ 25%).

    Studying That Suits You

    Use AI to generate personalized quizzes and flashcards to suit your learning preferences.

    Quiz Team

    Related Documents

    Pairwise Sequence Alignment PDF

    Description

    This quiz explores the complexities of dynamic programming algorithms used in sequence alignment. It covers the advantages of linear space alignment methods, the importance of memory usage, and the role of heuristic algorithms like FASTA and BLAST. Test your understanding of these critical concepts and methods in bioinformatics.

    More Like This

    Use Quizgecko on...
    Browser
    Browser