Podcast
Questions and Answers
What is the time complexity for dynamic programming algorithms used in sequence alignment?
What is the time complexity for dynamic programming algorithms used in sequence alignment?
- O(n^2)
- O(n + m)
- O(log(nm))
- O(nm) (correct)
What is the main advantage of linear space alignment methods compared to traditional dynamic programming?
What is the main advantage of linear space alignment methods compared to traditional dynamic programming?
- They eliminate the need for traceback pointers.
- They can handle protein sequences only.
- They require significantly less memory. (correct)
- They provide faster alignment without losing accuracy.
Why is memory usage a critical factor in dynamic programming alignment for long sequences?
Why is memory usage a critical factor in dynamic programming alignment for long sequences?
- It is always manageable for genomic DNA alignments.
- It can cause the algorithm to fail completely.
- It reduces the speed of alignment significantly.
- It may exceed the machine’s physical capacity. (correct)
What occurs when rows are discarded to limit memory usage during alignment?
What occurs when rows are discarded to limit memory usage during alignment?
Which of the following best describes the relationship between speed and dynamic programming algorithms?
Which of the following best describes the relationship between speed and dynamic programming algorithms?
What is the primary purpose of heuristic algorithms like FASTA and BLAST?
What is the primary purpose of heuristic algorithms like FASTA and BLAST?
What advantage does the FASTA algorithm provide regarding speed and sensitivity?
What advantage does the FASTA algorithm provide regarding speed and sensitivity?
For which type of sequences is FASTA best suited?
For which type of sequences is FASTA best suited?
How does FASTA initially identify potential alignments within the sequences?
How does FASTA initially identify potential alignments within the sequences?
What is the default k value used for DNA sequences in the FASTA algorithm?
What is the default k value used for DNA sequences in the FASTA algorithm?
What is the most important criteria for sequences to be grouped into the same cluster when calculating BLOSUM matrices?
What is the most important criteria for sequences to be grouped into the same cluster when calculating BLOSUM matrices?
What does the 'expected pair frequency' refer to in the context of BLOSUM matrices?
What does the 'expected pair frequency' refer to in the context of BLOSUM matrices?
How are substitution scores in BLOSUM matrices calculated?
How are substitution scores in BLOSUM matrices calculated?
What information does the BLOSUM score for a particular amino acid pair represent?
What information does the BLOSUM score for a particular amino acid pair represent?
Why might BLOSUM62 be considered more appropriate for aligning more distantly related sequences compared to BLOSUM50?
Why might BLOSUM62 be considered more appropriate for aligning more distantly related sequences compared to BLOSUM50?
What is the implication of a positive BLOSUM score for a particular amino acid pair?
What is the implication of a positive BLOSUM score for a particular amino acid pair?
What is the purpose of using blocks of protein fragments (the BLOCKS database) to construct BLOSUM matrices?
What is the purpose of using blocks of protein fragments (the BLOCKS database) to construct BLOSUM matrices?
How are BLOSUM matrices different from PAM matrices?
How are BLOSUM matrices different from PAM matrices?
What could a negative BLOSUM score for a particular amino acid pair suggest?
What could a negative BLOSUM score for a particular amino acid pair suggest?
How does the clustering procedure with a minimum identity threshold affect the construction of BLOSUM matrices?
How does the clustering procedure with a minimum identity threshold affect the construction of BLOSUM matrices?
Why is it important that BLOSUM matrices are directly calculated without extrapolations?
Why is it important that BLOSUM matrices are directly calculated without extrapolations?
How are BLOSUM matrices used in biological research?
How are BLOSUM matrices used in biological research?
What is the main advantage of using a logarithmic scale for BLOSUM scores?
What is the main advantage of using a logarithmic scale for BLOSUM scores?
What does the number after the BLOSUM matrix name (e.g., BLOSUM62) refer to?
What does the number after the BLOSUM matrix name (e.g., BLOSUM62) refer to?
Which of these statements about BLOSUM matrices is TRUE?
Which of these statements about BLOSUM matrices is TRUE?
How does the BLOSUM matrix differ from a scoring matrix used for aligning DNA sequences?
How does the BLOSUM matrix differ from a scoring matrix used for aligning DNA sequences?
What is the primary goal of PSI-BLAST?
What is the primary goal of PSI-BLAST?
What technique does PSI-BLAST use to enhance its search results?
What technique does PSI-BLAST use to enhance its search results?
In which range of sequence identity levels is PSI-BLAST effective?
In which range of sequence identity levels is PSI-BLAST effective?
What is the function of PHI-BLAST?
What is the function of PHI-BLAST?
What approach does BLAST use to compute the statistical significance of alignments?
What approach does BLAST use to compute the statistical significance of alignments?
Which algorithm is utilized to align random sequences in the statistical evaluation by BLAST?
Which algorithm is utilized to align random sequences in the statistical evaluation by BLAST?
What type of distribution does the maximum score of independent identically distributed random variables follow?
What type of distribution does the maximum score of independent identically distributed random variables follow?
How does PSI-BLAST adapt its scoring matrices throughout its iterations?
How does PSI-BLAST adapt its scoring matrices throughout its iterations?
What is the purpose of the alignment scoring matrix in the alignment process?
What is the purpose of the alignment scoring matrix in the alignment process?
Which feature distinguishes BLAST from other alignment tools according to the provided content?
Which feature distinguishes BLAST from other alignment tools according to the provided content?
What is the first step in the BLAST alignment process?
What is the first step in the BLAST alignment process?
How does the FASTA algorithm score regions of identity?
How does the FASTA algorithm score regions of identity?
What is the optimal strategy for extending matches in BLAST?
What is the optimal strategy for extending matches in BLAST?
What does the term 'hot spots' refer to in the context of FASTA alignments?
What does the term 'hot spots' refer to in the context of FASTA alignments?
What is the main motivation behind the development of the BLAST algorithm?
What is the main motivation behind the development of the BLAST algorithm?
What happens when overlapping hits are identified in the BLAST process?
What happens when overlapping hits are identified in the BLAST process?
In the content provided, what threshold is critical for continuing alignment extensions?
In the content provided, what threshold is critical for continuing alignment extensions?
What does the FASTA algorithm utilize to evaluate its hot spots before final alignment?
What does the FASTA algorithm utilize to evaluate its hot spots before final alignment?
What does the E-value indicate about the result of a BLAST search?
What does the E-value indicate about the result of a BLAST search?
Which of the following is NOT a factor that affects the E-value of a BLAST search?
Which of the following is NOT a factor that affects the E-value of a BLAST search?
What is the primary purpose of normalizing a raw score into a bit score?
What is the primary purpose of normalizing a raw score into a bit score?
What is the rule of thumb for database searching?
What is the rule of thumb for database searching?
Which of the following is TRUE about the statistical significance of a hit?
Which of the following is TRUE about the statistical significance of a hit?
What does the statement "E-value depends on the size of query sequence as well as that of the database" imply about BLAST search results?
What does the statement "E-value depends on the size of query sequence as well as that of the database" imply about BLAST search results?
What does performing multiple searches contribute to the interpretation of homology?
What does performing multiple searches contribute to the interpretation of homology?
Which of the following parameters helps determine the statistical significance of a BLAST alignment?
Which of the following parameters helps determine the statistical significance of a BLAST alignment?
Flashcards
Heuristic Alignment Algorithms
Heuristic Alignment Algorithms
Faster sequence alignment methods using approximations rather than exhaustive searches.
Dynamic Programming Algorithms
Dynamic Programming Algorithms
Alignment methods with time complexity O(nm); sensitive but slow for large databases.
Linear Space Alignments
Linear Space Alignments
Techniques that optimize alignment memory usage to O(n + m) at the cost of speed.
Recurrence Relation F(i, j)
Recurrence Relation F(i, j)
Signup and view all the flashcards
Traceback Pointers
Traceback Pointers
Signup and view all the flashcards
Heuristic Algorithms
Heuristic Algorithms
Signup and view all the flashcards
FASTA
FASTA
Signup and view all the flashcards
k-tuple
k-tuple
Signup and view all the flashcards
Dot Matrix Method
Dot Matrix Method
Signup and view all the flashcards
Lookup Table
Lookup Table
Signup and view all the flashcards
FASTA Algorithm
FASTA Algorithm
Signup and view all the flashcards
Hot Spots
Hot Spots
Signup and view all the flashcards
Diagonal Runs
Diagonal Runs
Signup and view all the flashcards
Threshold Score
Threshold Score
Signup and view all the flashcards
BLAST
BLAST
Signup and view all the flashcards
High-Scoring Segment Pairs (HSP)
High-Scoring Segment Pairs (HSP)
Signup and view all the flashcards
Extension of Alignment
Extension of Alignment
Signup and view all the flashcards
Blosum62 Matrix
Blosum62 Matrix
Signup and view all the flashcards
Local Ungapped Alignments
Local Ungapped Alignments
Signup and view all the flashcards
PSI-BLAST
PSI-BLAST
Signup and view all the flashcards
Remote Homologues
Remote Homologues
Signup and view all the flashcards
Scoring Matrix
Scoring Matrix
Signup and view all the flashcards
Iterations in PSI-BLAST
Iterations in PSI-BLAST
Signup and view all the flashcards
PHI-BLAST
PHI-BLAST
Signup and view all the flashcards
Statistical Significance in BLAST
Statistical Significance in BLAST
Signup and view all the flashcards
Smith-Waterman Algorithm
Smith-Waterman Algorithm
Signup and view all the flashcards
Extreme Value Distribution
Extreme Value Distribution
Signup and view all the flashcards
K and λ parameters
K and λ parameters
Signup and view all the flashcards
E-value
E-value
Signup and view all the flashcards
Significance of a hit
Significance of a hit
Signup and view all the flashcards
Bit score
Bit score
Signup and view all the flashcards
Raw score (S)
Raw score (S)
Signup and view all the flashcards
Normalization effect
Normalization effect
Signup and view all the flashcards
Comparative significance
Comparative significance
Signup and view all the flashcards
BLOSUM50
BLOSUM50
Signup and view all the flashcards
Observed Pair Frequencies
Observed Pair Frequencies
Signup and view all the flashcards
Expected Pair Frequencies
Expected Pair Frequencies
Signup and view all the flashcards
Substitution Scores
Substitution Scores
Signup and view all the flashcards
Scaling in Si,j
Scaling in Si,j
Signup and view all the flashcards
Positive Scores on Diagonal
Positive Scores on Diagonal
Signup and view all the flashcards
Similar Residues
Similar Residues
Signup and view all the flashcards
Evolutionary Distances
Evolutionary Distances
Signup and view all the flashcards
Blocks Database
Blocks Database
Signup and view all the flashcards
Minimum Percent Identity
Minimum Percent Identity
Signup and view all the flashcards
BLOSUM62
BLOSUM62
Signup and view all the flashcards
Distantly Related Sequences
Distantly Related Sequences
Signup and view all the flashcards
Local Alignments
Local Alignments
Signup and view all the flashcards
Clustering Procedure
Clustering Procedure
Signup and view all the flashcards
Identity Threshold
Identity Threshold
Signup and view all the flashcards
Study Notes
Pairwise Sequence Alignment - Heuristic Algorithms
- Dynamic programming algorithms are sensitive but not the fastest sequence alignment methods
- Time complexity is approximately O(nm), and speed is often an issue, especially with database searches
- Protein databases contain ~100 million residues, requiring ~1011 matrix cells to completely search a database for a 1000-residue query. DNA databases are even larger.
- At 10 million matrix cells per second, searching a 1000-residue query against a complete database takes approximately 3 hours.
Pairwise Sequence Alignment - Memory Usage
- Memory requirements for dynamic programming alignment are approximately O(nm), a product of sequence lengths.
- For sequences with few hundred residues, memory usage is manageable.
- Genomic DNA sequences with hundreds of thousands of residues can exceed machine's physical memory capacity.
Pairwise Sequence Alignment - Linear Space Alignments
- Techniques exist to achieve optimal sequence alignment with limited memory (n + m)
- These methods come with the cost of doubling the execution time.
Pairwise Sequence Alignment - Linear Space Alignments (Maximum Score)
- Finding the maximum score in dynamic programming alignment is simplified
- The recurrence relation only depends on the row immediately preceding the current row.
- Data from rows further away can be discarded.
- For local alignment, the maximum score in the entire matrix is needed, making it easy to track the maximum value during calculation.
Pairwise Sequence Alignment - Linear Space Alignments (Alignment Itself)
- Discarding rows in linear space alignments to save memory results in the loss of traceback pointers
- A divide and conquer approach is used to recover alignments.
Pairwise Sequence Alignment - Linear Space Alignments (Hirschberg's Algorithm)
- A snapshot of Hirschberg's alignment algorithm is shown.
Pairwise Sequence Alignment - Linear Space Alignments (Steps)
- Split the DP problem into two subproblems: (0, 0) to (u, v), and (u, v) to (m, n), where u = m/2 (integer part).
- Find v, the column where the alignment crosses row u=m/2.
- Combine the optimal alignments of the two subproblems.
- This process repeats recursively until the sequences are short enough.
Pairwise Sequence Alignment - Linear Space Alignments (Finding V)
- Combining "forward" and "backward" DP passes at row u enables finding v.
- Compute the optimal score from (0, 0) to each point in row u, and from each point to (m,n).
- Identify the point where the sum of these optimal scores from (0, 0) and from (u, v) give the optimal alignment score.
Pairwise Sequence Alignment - Linear Space Alignments (EMBOSS Programs)
- Emboss matcher: a linear-space local alignment algorithm developed by Waterman and Eggert
- Emboss stretcher: a linear-space global alignment algorithm.
Database Search - Need for Heuristics
- Dynamic programming algorithms are computationally intensive and time-consuming (O(nm)).
- Searching large databases using these algorithms is impractical.
- For example, if comparing two 1 kb sequences takes milliseconds, comparing a 1 kb sequence to the entire human genome would take 3 hours.
Database Search - Solutions for Heuristics
- Solution 1 (Hardware implementation): Implementing the algorithm in hardware can be expensive.
- Solution 2 (Parallel processing): Distributing the job among multiple processors for parallel processing can also be expensive as the number of processors increases (to 1000).
- Solution 3 (Heuristics): Use heuristic algorithms to find approximately optimal solutions faster than dynamic programming, at the cost of potentially missing the best alignment.
- Examples include FASTA and BLAST
Heuristic Algorithms
- Basic idea: locate high-scoring short stretches and extend them.
- FASTA: fast alignment tool, is approximately 50 times faster than dynamic programming
- BLAST: Basic Local Alignment Search Tool
- These tools find high-scoring local alignments between a query sequence and a target database.
FASTA
- Developed by Pearson and Lipman in 1988
- Compares sequences pairwise using heuristics
- A good alignment will have exact matching subsequences
- It provides speed at the cost of sensitivity
- Works best for global alignments with DNA sequences
FASTA: The Algorithm
- Based on dot matrix method
- View sequences as sequences of short words (k-tuples).
- Good alignments have many exact matches.
- Hashing finds exact matches in O(n) time
- Diagonals can be formed from exact matches quickly and sorted.
- Precise alignment is applied to the small search space.
- Look for all k-tuple matches.
- For DNA, k=2-6 (default 6).
- For proteins, k=1, 2 (default 2).
- A lookup table or hash is used.
- Database is preprocessed to find the positions of each k-tuple.
- Lookup the table to identify matches in the database
FASTA: The Algorithm (Continued)
- Look for adjacent hot spots (substrings of exact matches with the same offset).
- Join them to form larger segments (-ve score between segments).
- Find 10 highest scoring diagonals.
- Find the best diagonal run and filter out low-scoring runs.
- Recompute scores, including indels.
- Combine nearby diagonal runs, including indels.
- Calculate alternative local alignments, using a band around the best diagonal run.
- Apply dynamic programming to the query against the top-ranked segments
FASTA: The Algorithm (final)
- Convert the initial sequences into hash tables of short words.
- Look for matching words in the second sequence, finding the locations and saving them.
- Identify hot spots (regions of identity) using a scoring matrix, preserving the best initial regions and discarding those with a score below the threshold.
- Join the appropriate initial regions to form larger, more highly scoring regions (joining based on a threshold, with negative scores between hot spots).
BLAST
- Basic Local Alignment Search Tool, developed by Altschul et al. in 1990
- Balances speed and sensitivity in sequence alignment
- Designed for local alignments (now incorporates gaps).
- It is a popular tool within the NCBI suite for sequence analysis
BLAST: Motivation
- Good alignments will have many close matches.
- Statistics determine significant matches.
- Much more sensitive than % identity
- Extending matches in both directions.
- High-scoring segment pairs (HSPs/MSPs).
BLAST: The Algorithm
- View sequences as short words (k-tuples)
- DNA: 7, 11, 15 (default 11); protein: 2, 3, 6 (default 6)
- Create a hash table for neighborhood (closely matching) words
- Use statistics to set closeness threshold
- First, find short matching segments
- Choose matching segments that exceed a threshold
- Overlapping hits form larger segments
- Extend in either direction if score exceeds a threshold
- Continue extension until score falls below "drop-off" threshold from the maximum
BLAST: The Algorithm (Continued)
- Convert the first sequence into a word list (considering all frames to capture different patterns).
- Calculate "neighborhood" words for each word in the list, using a scoring threshold (T). Create a dictionary.
- Scan the second sequence, find matching words, and store their locations.
- Extend alignment from matches in both directions. Merge segments if the score above the threshold (S).
- Align best segments using dynamic programming. Report statistically significant matches
BLAST: Selecting a Program
- Blastn: Nucleotide sequence to nucleotide database.
- Blastp: Protein sequence to protein database.
BLAST: Specialised Techniques
- PSI-BLAST: Position-Specific Iterated BLAST (designed to find remote homologs (15-25% identity level))
- Constructs scoring matrices from multiple alignments of hits.
- Searches the database iteratively with new scoring matrices and repeats until convergence.
- The scoring matrix is tailor-made to find query sequences. Detects homologues with 15-25% sequence identity levels.
- PHI-BLAST: Pattern Hit Iterated BLAST (designed to locate motifs).
- Given a protein sequence and a motif/pattern.
- Finds proteins with that pattern and similar surrounding residues.
- Combines regular expressions with local alignment around the matching region
BLAST: Statistical Significance
- Method:
- Generate random sequences
- Align with initial sequences
- Compute the score of optimal alignment
- These scores follow extreme value distributions with well-known distributions.
- Expected number of HSPs (with score at least S):
- E = K(mn)e-−AS.
- K and λ are empirical parameters.
- E-value: depends on query & database size
BLAST: Significance of a hit (Example)
- Search against 10,000 sequences
- A fitted extreme-value distribution shows the score distribution, where 99.9% fall below 112.
- An E-value of 10 indicates a hit with a score of 112 has a 1% chance of occurring by random chance given the database. E-values much smaller than 1 are considered significant.
BLAST: Bit Scores
- Raw scores are meaningless without understanding the scoring system (e.g., K, λ).
- Bit Score: Normalizes a raw score (S) using a formula (S' = [AS - ln K] / ln 2). This allows comparison across different scoring matrices
- E-value (corresponding to a bit score): E = mn 2−S`
BLAST: Nucleotide Scoring Matrix
- Parameters for match (M) and mismatch (N) in scoring matrices are relative and determine the sensitivity of the algorithm to conserved regions.
- A lower M/N ratio is preferable for high sequence conservation.
Substitution Matrices - PAM Matrices (Construction)
- Based on the hypothesis of uncorrelated mutations accumulating to divergence.
- Align closely-related sequences (typically > 85 % identity)
- Observe AA change probabilities and calculate Log Odds Ratio values
- Normalize matrix to obtain PAM-1 (1% average change).
- Derive scoring matrices for longer evolutionary distances by iterating the PAM-1 matrix (e.g., PAM 250, results of the 250th matrix multiplication). The most common values for this approach are PAM100 and PAM250
Substitution Matrices - PAM Matrices (Properties)
- Reflect evolutionary distance between two protein sequences
- Assumes that substitutional changes occurring over short evolutionary time periods can be extrapolated to longer times.
Substitution Matrices - BLOSUM Matrices (Construction)
- Derived from local, ungapped alignments from distantly related sequences, using the BLOCKS database.
- Amino acid pair frequencies are calculated by summing across all pairs.
- Different evolutionary distances are incorporated using a clustering procedure (where sequences with similar identities are grouped).
Substitution Matrices - BLOSUM Matrices Properties
- Directly calculated from data (no extrapolation)
- Use minimum percent identity for blocks to construct matrix
- BLOSUM 62 is the default matrix in BLAST
Using BLAST to Infer Homology
- Rules of thumb for percent identity and similarity in optimal alignments:
- 45% identity: very similar structures and common function.
- 25% identity: similar general folding pattern.
- 18-25% identity: "twilight zone," lower degree of sequence similarity doesn't rule out homology.
- High sequence similarity over the entire length suggests homology.
- 50% similarity over a short sequence is common by chance.
- Low complexity regions can have high similarity.
- Homologous sequences aren't always highly similar.
- Suggested BLAST cutoffs (for nucleotide-based searches look for hits with e-value ≤ 10−6 and sequence identity ≥ 70%; for protein-based searches, look for hits with e-value ≤ 10−3 and sequence identity ≥ 25%).
Studying That Suits You
Use AI to generate personalized quizzes and flashcards to suit your learning preferences.