Podcast
Questions and Answers
What is the time complexity for dynamic programming algorithms used in sequence alignment?
What is the time complexity for dynamic programming algorithms used in sequence alignment?
What is the main advantage of linear space alignment methods compared to traditional dynamic programming?
What is the main advantage of linear space alignment methods compared to traditional dynamic programming?
Why is memory usage a critical factor in dynamic programming alignment for long sequences?
Why is memory usage a critical factor in dynamic programming alignment for long sequences?
What occurs when rows are discarded to limit memory usage during alignment?
What occurs when rows are discarded to limit memory usage during alignment?
Signup and view all the answers
Which of the following best describes the relationship between speed and dynamic programming algorithms?
Which of the following best describes the relationship between speed and dynamic programming algorithms?
Signup and view all the answers
What is the primary purpose of heuristic algorithms like FASTA and BLAST?
What is the primary purpose of heuristic algorithms like FASTA and BLAST?
Signup and view all the answers
What advantage does the FASTA algorithm provide regarding speed and sensitivity?
What advantage does the FASTA algorithm provide regarding speed and sensitivity?
Signup and view all the answers
For which type of sequences is FASTA best suited?
For which type of sequences is FASTA best suited?
Signup and view all the answers
How does FASTA initially identify potential alignments within the sequences?
How does FASTA initially identify potential alignments within the sequences?
Signup and view all the answers
What is the default k value used for DNA sequences in the FASTA algorithm?
What is the default k value used for DNA sequences in the FASTA algorithm?
Signup and view all the answers
What is the most important criteria for sequences to be grouped into the same cluster when calculating BLOSUM matrices?
What is the most important criteria for sequences to be grouped into the same cluster when calculating BLOSUM matrices?
Signup and view all the answers
What does the 'expected pair frequency' refer to in the context of BLOSUM matrices?
What does the 'expected pair frequency' refer to in the context of BLOSUM matrices?
Signup and view all the answers
How are substitution scores in BLOSUM matrices calculated?
How are substitution scores in BLOSUM matrices calculated?
Signup and view all the answers
What information does the BLOSUM score for a particular amino acid pair represent?
What information does the BLOSUM score for a particular amino acid pair represent?
Signup and view all the answers
Why might BLOSUM62 be considered more appropriate for aligning more distantly related sequences compared to BLOSUM50?
Why might BLOSUM62 be considered more appropriate for aligning more distantly related sequences compared to BLOSUM50?
Signup and view all the answers
What is the implication of a positive BLOSUM score for a particular amino acid pair?
What is the implication of a positive BLOSUM score for a particular amino acid pair?
Signup and view all the answers
What is the purpose of using blocks of protein fragments (the BLOCKS database) to construct BLOSUM matrices?
What is the purpose of using blocks of protein fragments (the BLOCKS database) to construct BLOSUM matrices?
Signup and view all the answers
How are BLOSUM matrices different from PAM matrices?
How are BLOSUM matrices different from PAM matrices?
Signup and view all the answers
What could a negative BLOSUM score for a particular amino acid pair suggest?
What could a negative BLOSUM score for a particular amino acid pair suggest?
Signup and view all the answers
How does the clustering procedure with a minimum identity threshold affect the construction of BLOSUM matrices?
How does the clustering procedure with a minimum identity threshold affect the construction of BLOSUM matrices?
Signup and view all the answers
Why is it important that BLOSUM matrices are directly calculated without extrapolations?
Why is it important that BLOSUM matrices are directly calculated without extrapolations?
Signup and view all the answers
How are BLOSUM matrices used in biological research?
How are BLOSUM matrices used in biological research?
Signup and view all the answers
What is the main advantage of using a logarithmic scale for BLOSUM scores?
What is the main advantage of using a logarithmic scale for BLOSUM scores?
Signup and view all the answers
What does the number after the BLOSUM matrix name (e.g., BLOSUM62) refer to?
What does the number after the BLOSUM matrix name (e.g., BLOSUM62) refer to?
Signup and view all the answers
Which of these statements about BLOSUM matrices is TRUE?
Which of these statements about BLOSUM matrices is TRUE?
Signup and view all the answers
How does the BLOSUM matrix differ from a scoring matrix used for aligning DNA sequences?
How does the BLOSUM matrix differ from a scoring matrix used for aligning DNA sequences?
Signup and view all the answers
What is the primary goal of PSI-BLAST?
What is the primary goal of PSI-BLAST?
Signup and view all the answers
What technique does PSI-BLAST use to enhance its search results?
What technique does PSI-BLAST use to enhance its search results?
Signup and view all the answers
In which range of sequence identity levels is PSI-BLAST effective?
In which range of sequence identity levels is PSI-BLAST effective?
Signup and view all the answers
What is the function of PHI-BLAST?
What is the function of PHI-BLAST?
Signup and view all the answers
What approach does BLAST use to compute the statistical significance of alignments?
What approach does BLAST use to compute the statistical significance of alignments?
Signup and view all the answers
Which algorithm is utilized to align random sequences in the statistical evaluation by BLAST?
Which algorithm is utilized to align random sequences in the statistical evaluation by BLAST?
Signup and view all the answers
What type of distribution does the maximum score of independent identically distributed random variables follow?
What type of distribution does the maximum score of independent identically distributed random variables follow?
Signup and view all the answers
How does PSI-BLAST adapt its scoring matrices throughout its iterations?
How does PSI-BLAST adapt its scoring matrices throughout its iterations?
Signup and view all the answers
What is the purpose of the alignment scoring matrix in the alignment process?
What is the purpose of the alignment scoring matrix in the alignment process?
Signup and view all the answers
Which feature distinguishes BLAST from other alignment tools according to the provided content?
Which feature distinguishes BLAST from other alignment tools according to the provided content?
Signup and view all the answers
What is the first step in the BLAST alignment process?
What is the first step in the BLAST alignment process?
Signup and view all the answers
How does the FASTA algorithm score regions of identity?
How does the FASTA algorithm score regions of identity?
Signup and view all the answers
What is the optimal strategy for extending matches in BLAST?
What is the optimal strategy for extending matches in BLAST?
Signup and view all the answers
What does the term 'hot spots' refer to in the context of FASTA alignments?
What does the term 'hot spots' refer to in the context of FASTA alignments?
Signup and view all the answers
What is the main motivation behind the development of the BLAST algorithm?
What is the main motivation behind the development of the BLAST algorithm?
Signup and view all the answers
What happens when overlapping hits are identified in the BLAST process?
What happens when overlapping hits are identified in the BLAST process?
Signup and view all the answers
In the content provided, what threshold is critical for continuing alignment extensions?
In the content provided, what threshold is critical for continuing alignment extensions?
Signup and view all the answers
What does the FASTA algorithm utilize to evaluate its hot spots before final alignment?
What does the FASTA algorithm utilize to evaluate its hot spots before final alignment?
Signup and view all the answers
What does the E-value indicate about the result of a BLAST search?
What does the E-value indicate about the result of a BLAST search?
Signup and view all the answers
Which of the following is NOT a factor that affects the E-value of a BLAST search?
Which of the following is NOT a factor that affects the E-value of a BLAST search?
Signup and view all the answers
What is the primary purpose of normalizing a raw score into a bit score?
What is the primary purpose of normalizing a raw score into a bit score?
Signup and view all the answers
What is the rule of thumb for database searching?
What is the rule of thumb for database searching?
Signup and view all the answers
Which of the following is TRUE about the statistical significance of a hit?
Which of the following is TRUE about the statistical significance of a hit?
Signup and view all the answers
What does the statement "E-value depends on the size of query sequence as well as that of the database" imply about BLAST search results?
What does the statement "E-value depends on the size of query sequence as well as that of the database" imply about BLAST search results?
Signup and view all the answers
What does performing multiple searches contribute to the interpretation of homology?
What does performing multiple searches contribute to the interpretation of homology?
Signup and view all the answers
Which of the following parameters helps determine the statistical significance of a BLAST alignment?
Which of the following parameters helps determine the statistical significance of a BLAST alignment?
Signup and view all the answers
Study Notes
Pairwise Sequence Alignment - Heuristic Algorithms
- Dynamic programming algorithms are sensitive but not the fastest sequence alignment methods
- Time complexity is approximately O(nm), and speed is often an issue, especially with database searches
- Protein databases contain ~100 million residues, requiring ~1011 matrix cells to completely search a database for a 1000-residue query. DNA databases are even larger.
- At 10 million matrix cells per second, searching a 1000-residue query against a complete database takes approximately 3 hours.
Pairwise Sequence Alignment - Memory Usage
- Memory requirements for dynamic programming alignment are approximately O(nm), a product of sequence lengths.
- For sequences with few hundred residues, memory usage is manageable.
- Genomic DNA sequences with hundreds of thousands of residues can exceed machine's physical memory capacity.
Pairwise Sequence Alignment - Linear Space Alignments
- Techniques exist to achieve optimal sequence alignment with limited memory (n + m)
- These methods come with the cost of doubling the execution time.
Pairwise Sequence Alignment - Linear Space Alignments (Maximum Score)
- Finding the maximum score in dynamic programming alignment is simplified
- The recurrence relation only depends on the row immediately preceding the current row.
- Data from rows further away can be discarded.
- For local alignment, the maximum score in the entire matrix is needed, making it easy to track the maximum value during calculation.
Pairwise Sequence Alignment - Linear Space Alignments (Alignment Itself)
- Discarding rows in linear space alignments to save memory results in the loss of traceback pointers
- A divide and conquer approach is used to recover alignments.
Pairwise Sequence Alignment - Linear Space Alignments (Hirschberg's Algorithm)
- A snapshot of Hirschberg's alignment algorithm is shown.
Pairwise Sequence Alignment - Linear Space Alignments (Steps)
- Split the DP problem into two subproblems: (0, 0) to (u, v), and (u, v) to (m, n), where u = m/2 (integer part).
- Find v, the column where the alignment crosses row u=m/2.
- Combine the optimal alignments of the two subproblems.
- This process repeats recursively until the sequences are short enough.
Pairwise Sequence Alignment - Linear Space Alignments (Finding V)
- Combining "forward" and "backward" DP passes at row u enables finding v.
- Compute the optimal score from (0, 0) to each point in row u, and from each point to (m,n).
- Identify the point where the sum of these optimal scores from (0, 0) and from (u, v) give the optimal alignment score.
Pairwise Sequence Alignment - Linear Space Alignments (EMBOSS Programs)
- Emboss matcher: a linear-space local alignment algorithm developed by Waterman and Eggert
- Emboss stretcher: a linear-space global alignment algorithm.
Database Search - Need for Heuristics
- Dynamic programming algorithms are computationally intensive and time-consuming (O(nm)).
- Searching large databases using these algorithms is impractical.
- For example, if comparing two 1 kb sequences takes milliseconds, comparing a 1 kb sequence to the entire human genome would take 3 hours.
Database Search - Solutions for Heuristics
- Solution 1 (Hardware implementation): Implementing the algorithm in hardware can be expensive.
- Solution 2 (Parallel processing): Distributing the job among multiple processors for parallel processing can also be expensive as the number of processors increases (to 1000).
- Solution 3 (Heuristics): Use heuristic algorithms to find approximately optimal solutions faster than dynamic programming, at the cost of potentially missing the best alignment.
- Examples include FASTA and BLAST
Heuristic Algorithms
- Basic idea: locate high-scoring short stretches and extend them.
- FASTA: fast alignment tool, is approximately 50 times faster than dynamic programming
- BLAST: Basic Local Alignment Search Tool
- These tools find high-scoring local alignments between a query sequence and a target database.
FASTA
- Developed by Pearson and Lipman in 1988
- Compares sequences pairwise using heuristics
- A good alignment will have exact matching subsequences
- It provides speed at the cost of sensitivity
- Works best for global alignments with DNA sequences
FASTA: The Algorithm
- Based on dot matrix method
- View sequences as sequences of short words (k-tuples).
- Good alignments have many exact matches.
- Hashing finds exact matches in O(n) time
- Diagonals can be formed from exact matches quickly and sorted.
- Precise alignment is applied to the small search space.
- Look for all k-tuple matches.
- For DNA, k=2-6 (default 6).
- For proteins, k=1, 2 (default 2).
- A lookup table or hash is used.
- Database is preprocessed to find the positions of each k-tuple.
- Lookup the table to identify matches in the database
FASTA: The Algorithm (Continued)
- Look for adjacent hot spots (substrings of exact matches with the same offset).
- Join them to form larger segments (-ve score between segments).
- Find 10 highest scoring diagonals.
- Find the best diagonal run and filter out low-scoring runs.
- Recompute scores, including indels.
- Combine nearby diagonal runs, including indels.
- Calculate alternative local alignments, using a band around the best diagonal run.
- Apply dynamic programming to the query against the top-ranked segments
FASTA: The Algorithm (final)
- Convert the initial sequences into hash tables of short words.
- Look for matching words in the second sequence, finding the locations and saving them.
- Identify hot spots (regions of identity) using a scoring matrix, preserving the best initial regions and discarding those with a score below the threshold.
- Join the appropriate initial regions to form larger, more highly scoring regions (joining based on a threshold, with negative scores between hot spots).
BLAST
- Basic Local Alignment Search Tool, developed by Altschul et al. in 1990
- Balances speed and sensitivity in sequence alignment
- Designed for local alignments (now incorporates gaps).
- It is a popular tool within the NCBI suite for sequence analysis
BLAST: Motivation
- Good alignments will have many close matches.
- Statistics determine significant matches.
- Much more sensitive than % identity
- Extending matches in both directions.
- High-scoring segment pairs (HSPs/MSPs).
BLAST: The Algorithm
- View sequences as short words (k-tuples)
- DNA: 7, 11, 15 (default 11); protein: 2, 3, 6 (default 6)
- Create a hash table for neighborhood (closely matching) words
- Use statistics to set closeness threshold
- First, find short matching segments
- Choose matching segments that exceed a threshold
- Overlapping hits form larger segments
- Extend in either direction if score exceeds a threshold
- Continue extension until score falls below "drop-off" threshold from the maximum
BLAST: The Algorithm (Continued)
- Convert the first sequence into a word list (considering all frames to capture different patterns).
- Calculate "neighborhood" words for each word in the list, using a scoring threshold (T). Create a dictionary.
- Scan the second sequence, find matching words, and store their locations.
- Extend alignment from matches in both directions. Merge segments if the score above the threshold (S).
- Align best segments using dynamic programming. Report statistically significant matches
BLAST: Selecting a Program
- Blastn: Nucleotide sequence to nucleotide database.
- Blastp: Protein sequence to protein database.
BLAST: Specialised Techniques
-
PSI-BLAST: Position-Specific Iterated BLAST (designed to find remote homologs (15-25% identity level))
- Constructs scoring matrices from multiple alignments of hits.
- Searches the database iteratively with new scoring matrices and repeats until convergence.
- The scoring matrix is tailor-made to find query sequences. Detects homologues with 15-25% sequence identity levels.
-
PHI-BLAST: Pattern Hit Iterated BLAST (designed to locate motifs).
- Given a protein sequence and a motif/pattern.
- Finds proteins with that pattern and similar surrounding residues.
- Combines regular expressions with local alignment around the matching region
BLAST: Statistical Significance
-
Method:
- Generate random sequences
- Align with initial sequences
- Compute the score of optimal alignment
- These scores follow extreme value distributions with well-known distributions.
-
Expected number of HSPs (with score at least S):
- E = K(mn)e-−AS.
- K and λ are empirical parameters.
- E-value: depends on query & database size
BLAST: Significance of a hit (Example)
- Search against 10,000 sequences
- A fitted extreme-value distribution shows the score distribution, where 99.9% fall below 112.
- An E-value of 10 indicates a hit with a score of 112 has a 1% chance of occurring by random chance given the database. E-values much smaller than 1 are considered significant.
BLAST: Bit Scores
- Raw scores are meaningless without understanding the scoring system (e.g., K, λ).
- Bit Score: Normalizes a raw score (S) using a formula (S' = [AS - ln K] / ln 2). This allows comparison across different scoring matrices
- E-value (corresponding to a bit score): E = mn 2−S`
BLAST: Nucleotide Scoring Matrix
- Parameters for match (M) and mismatch (N) in scoring matrices are relative and determine the sensitivity of the algorithm to conserved regions.
- A lower M/N ratio is preferable for high sequence conservation.
Substitution Matrices - PAM Matrices (Construction)
- Based on the hypothesis of uncorrelated mutations accumulating to divergence.
- Align closely-related sequences (typically > 85 % identity)
- Observe AA change probabilities and calculate Log Odds Ratio values
- Normalize matrix to obtain PAM-1 (1% average change).
- Derive scoring matrices for longer evolutionary distances by iterating the PAM-1 matrix (e.g., PAM 250, results of the 250th matrix multiplication). The most common values for this approach are PAM100 and PAM250
Substitution Matrices - PAM Matrices (Properties)
- Reflect evolutionary distance between two protein sequences
- Assumes that substitutional changes occurring over short evolutionary time periods can be extrapolated to longer times.
Substitution Matrices - BLOSUM Matrices (Construction)
- Derived from local, ungapped alignments from distantly related sequences, using the BLOCKS database.
- Amino acid pair frequencies are calculated by summing across all pairs.
- Different evolutionary distances are incorporated using a clustering procedure (where sequences with similar identities are grouped).
Substitution Matrices - BLOSUM Matrices Properties
- Directly calculated from data (no extrapolation)
- Use minimum percent identity for blocks to construct matrix
- BLOSUM 62 is the default matrix in BLAST
Using BLAST to Infer Homology
- Rules of thumb for percent identity and similarity in optimal alignments:
- 45% identity: very similar structures and common function.
- 25% identity: similar general folding pattern.
- 18-25% identity: "twilight zone," lower degree of sequence similarity doesn't rule out homology.
- High sequence similarity over the entire length suggests homology.
- 50% similarity over a short sequence is common by chance.
- Low complexity regions can have high similarity.
- Homologous sequences aren't always highly similar.
- Suggested BLAST cutoffs (for nucleotide-based searches look for hits with e-value ≤ 10−6 and sequence identity ≥ 70%; for protein-based searches, look for hits with e-value ≤ 10−3 and sequence identity ≥ 25%).
Studying That Suits You
Use AI to generate personalized quizzes and flashcards to suit your learning preferences.
Related Documents
Description
This quiz explores the complexities of dynamic programming algorithms used in sequence alignment. It covers the advantages of linear space alignment methods, the importance of memory usage, and the role of heuristic algorithms like FASTA and BLAST. Test your understanding of these critical concepts and methods in bioinformatics.