Questions and Answers
Which database is known for high-quality annotation of some sequences?
- MGnify
- GenBank
- SwissProt (correct)
- DDBJ
The MGnify database at EBI contains only perfect quality amplicons.
False (B)
What scoring scheme is simplest for scoring amino acid residues?
Score 1 for identical amino acids and 0 for different ones.
GenBank translates initially deposited DNA into ___ sequences.
Match the following terms with their definitions or descriptions:
What is the primary purpose of pairwise protein sequence alignment?
Errors in protein sequence databases are corrected quickly.
Who pioneered the work in metagenomics?
Which algorithm is known for finding mathematically optimal solutions but is too slow for general searching?
BLAST is significantly slower than the Smith-Waterman algorithm.
What is the purpose of the E-value in database searching?
The BLAST algorithm uses the ____ score to find short segments or seeds in the query.
Match the following terms with their definitions:
For which type of sequences is BLAST typically used?
Short matches of less than 20 residues can be confidently suggested as true homology.
What does the term HSP stand for in the context of the BLAST algorithm?
What method does the Smith-Waterman algorithm primarily use for sequence alignment?
The BLOSUM62 matrix is used to assign numerical values for every cell in the alignment array.
What is the primary objective of the Needleman and Wunsch method introduced in 1970?
The __________ penalty is introduced when constructing the alignment, affecting the best scoring path.
Match the following sequence alignment methods with their primary characteristics:
Where will the maximum match, or highest alignment score, always be found in the matrix?
What does conservative substitution in proteins refer to?
PAM250 was created to model sequences with 50% identity.
What is the main purpose of the Needleman-Wunsch Algorithm?
The gap penalty formula is given by Penalty = o + e × l, where o is the gap opening constant and e is the gap __________ constant.
Which of the following is true about BLOSUM62?
What is the significance of gaps in sequence alignments?
Flashcards
GenBank
A primary DNA sequence database maintained by NCBI (National Center for Biotechnology Information).
ENA (EMBL)
Another primary DNA sequence database, maintained by EMBL (European Molecular Biology Laboratory).
DDBJ
The DNA Data Bank of Japan: the Japanese counterpart of GenBank and ENA, and likewise a primary DNA sequence database.
GenPept
TrEMBL
SwissProt
UniProtKB
Metagenomics
Conservative Substitution
Point Accepted Mutation (PAM)
BLOSUM62
Gap Penalties
Protein Domains
Global Alignment
Local Alignment
Needleman-Wunsch Algorithm
Smith-Waterman Algorithm
Dynamic Programming
BLOSUM62 Matrix
Similarity Value (Sij)
Alignment Score
Constructing an Alignment
BLAST (Basic Local Alignment Search Tool)
P-value
E-value
PSI-BLAST (Position-Specific Iterative BLAST)
Seeds (Words)
HSPs (High-scoring Segment Pairs)
Cut-off Score
Study Notes
Protein Sequence Analysis
- The primary DNA databases (GenBank, ENA, DDBJ) exchange data nightly, so their core data stays consistent.
- Initial DNA data is translated into protein sequences (e.g., GenBank to GenPept, EMBL to TrEMBL).
- SwissProt, part of UniProtKB, provides high-quality annotation for some sequences.
- UniProtKB, which grew out of SwissProt, has expanded enormously and currently holds about 240 million entries.
Database Issues
- Database organization changes rapidly.
- Protein naming is inconsistent, and errors are corrected slowly.
- Errors in submitted entries are sometimes not corrected unless the submitter acts.
- Database access may be browser-dependent (problematic).
Metagenomics
- Craig Venter pioneered large-scale acquisition of microbial sequences from diverse environments (such as the ocean or the gut).
- Metagenomics aims to study biodiversity.
- Poor data quality and fragmentary sequences are common in metagenomic studies.
- MGnify (at the EBI) holds 350,000 amplicons from 33,000 metagenomes.
Evolutionary Relationships
- Evaluating sequence similarity is necessary for detecting evolutionary relationships.
- Pairwise protein sequence alignment is used to quantify similarity.
- Gene duplication and speciation are two key evolutionary processes that drive sequence change.
Gene Duplication
- Duplication occurs within a genome, and the duplicate copies become paralogs.
- One copy can maintain the original function while the other is free to change.
- These changes in structure or function allow the duplicated gene to take on new functions.
Speciation
- Speciation results in new species, each carrying a single copy of the same ancestral gene; these copies are orthologues.
- Orthologues are therefore the sequences in different species that descend from one original gene through a speciation event.
- Orthologues often have similar functions, reflecting the common ancestry of the new species.
Pairwise Protein Sequence Alignment
- A scoring scheme and algorithm are required to quantify similarity between amino acid residues in sequences.
- The aim is to create the best alignment possible that captures relevant biological information.
Scoring Scheme
- The simplest scheme scores 1 point for identical amino acids and 0 for different ones (identical nucleotides or bases can be scored the same way).
- Protein evolution constrains which amino acid changes are allowed, so most substitutions preserve chemical properties.
- This maintains protein structure and function during evolution.
- Preserving chemical properties in this way is known as conservative substitution (e.g., non-polar residues tend to stay non-polar).
- The Point Accepted Mutation (PAM) matrix, from Dayhoff, models residue changes in closely related sequences.
- It is extended to more distant sequences by assuming that the matrices can be multiplied together.
- PAM250 models sequences with about 20% identity.
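The simplest scheme in the list above (1 for a match, 0 for a mismatch) can be sketched as follows. The function names are illustrative, and the sketch assumes the two sequences are already aligned to the same length, with "-" marking gaps.

```python
def identity_score(a: str, b: str) -> int:
    """Score 1 for each identical aligned pair of residues and 0 otherwise."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # A gap aligned with a gap is not counted as an identity.
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def percent_identity(a: str, b: str) -> float:
    """Identity score as a percentage of the aligned columns."""
    return 100.0 * identity_score(a, b) / len(a)


if __name__ == "__main__":
    print(identity_score("ACDE", "ACDF"))
    print(percent_identity("ACDE", "ACDF"))
```

This scheme treats every mismatch as equally bad, which is why substitution matrices such as PAM are preferred for proteins.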
PAM250 Matrix
- Quantifies the odds of one residue mutating to another.
- "Odds equals observed probability of amino acid residue exchanging over probability of the exchange occurring by chance"
- For identical residues (the diagonal of the matrix), the odds represent how strongly a residue resists mutation.
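The odds definition quoted above can be turned into a matrix entry by taking a logarithm. The sketch below assumes the observed exchange is given as a joint probability and uses a scale of 10 × log10, the convention usually associated with Dayhoff-style matrices; the probabilities in the example are made up for illustration.

```python
import math


def odds(observed_exchange: float, freq_a: float, freq_b: float) -> float:
    """Observed probability of the exchange over the probability expected by chance."""
    return observed_exchange / (freq_a * freq_b)


def log_odds_score(observed_exchange: float, freq_a: float, freq_b: float,
                   scale: float = 10.0) -> int:
    """Rounded log-odds score; the scale factor is an assumption of this sketch."""
    return round(scale * math.log10(odds(observed_exchange, freq_a, freq_b)))


if __name__ == "__main__":
    # Hypothetical numbers: an exchange seen twice as often as chance predicts.
    print(log_odds_score(0.02, 0.1, 0.1))
    # An exchange seen half as often as chance predicts.
    print(log_odds_score(0.005, 0.1, 0.1))
```

Positive scores mark exchanges seen more often than chance, negative scores mark exchanges seen less often, and a score of 0 means the exchange occurs at the chance rate.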
BLOSUM62 Matrix
- Derived from ungapped alignment blocks of protein families (the BLOCKS database), with sequences clustered at 62% identity, focusing on conserved regions.
- It is widely regarded as an accurate and sensitive general-purpose matrix.
- Used in the BLAST/PSI-BLAST family of database searching algorithms.
- Identifies favourable residues and substitutions to find related sequences.
- More accurate for finding distantly related sequences because the loop regions of protein structure, which do not align reliably, are excluded.
Gap Penalties
- Gaps, or indels (insertions or deletions), are penalized, since they reflect evolutionary changes.
- The gap opening penalty (o) is larger than the gap extension penalty (e), giving Penalty = o + e × l for a gap of length l.
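A minimal sketch of this penalty, using the form Penalty = o + e × l given in the quiz section. The default values of 10 and 1 are placeholders chosen for illustration, not values taken from the notes.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Penalty = o + e * l for a gap of l residues; no gap costs nothing."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    print(gap_penalty(4))        # one gap of four residues
    print(4 * gap_penalty(1))    # four separate single-residue gaps
```

Because o is larger than e, one long gap is cheaper than several short ones, which matches the idea that a single insertion or deletion event can span many residues.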
Alignment Methods
- Proteins often consist of domains that are evolutionary units.
- Proteins are composed of structurally distinct parts/domains.
Global vs Local Alignment
- Global alignment aims to align entire sequences (suitable for sequences assumed to be homologous along their whole length).
- Local alignment finds similar regions within sequences (more suitable for distantly related sequences).
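To illustrate local alignment, here is a sketch of Smith-Waterman-style scoring that returns only the best local score. The match, mismatch and gap values are arbitrary assumptions, and a simple linear gap penalty is used in place of separate opening and extension penalties.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best score of any local alignment between a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Flooring at 0 lets an alignment start afresh anywhere,
            # which is what makes the alignment local rather than global.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # The shared region ACDEF is found despite unrelated flanking residues.
    print(local_alignment_score("XXXACDEFXXX", "YYACDEFYY"))
```

A global method would have to pay for the unrelated flanking residues, whereas the local score reflects only the shared region.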
Needleman-Wunsch Algorithm
- General algorithm for sequence comparison to find maximum matches.
- Assesses maximal possible matches, accounting for gaps.
- The iterative matrix method uses a 2D array whose cells represent every possible pairing of residues from the two sequences.
- Three main steps: assign a value to every cell, find the best pathways by working from the initial values back to the beginning of the matrix, and construct an alignment starting from the highest score.
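The three steps can be sketched in the form most often used today: fill the matrix forwards, then trace back from the final cell. This differs in direction from the 1970 description, and the match, mismatch and linear gap values are assumptions made for the example rather than a substitution matrix such as BLOSUM62.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Steps 1 and 2: assign values and accumulate the best pathways.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Step 3: construct the alignment by tracing back from the final cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
    print(total)
    print(row_a)
    print(row_b)
```

When several pathways tie, this sketch returns just one of them; a fuller implementation might report all optimal alignments.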
Database Searching
- A query sequence and database of sequences are required to find similar sequences.
- PSI-BLAST is a widely used program for advanced sequence analysis and searches.
- BLAST rapidly finds short matching segments in database sequences but does not guarantee optimal alignments.
- Matches can occur at several levels, from whole sequences (high-level matches) down to specific regions with specific functions.
BLAST
- A tool that finds high-scoring segment pairs (HSPs) between short segments of the query and of database sequences.
- Uses a fast heuristic local search: short seeds (words) are matched first and then extended.
- Search performance can be judged by precision and recall.
- Scores the quality of each match to the query sequence.
- Uses statistical significance (P-values and E-values) to separate true matches from chance hits.
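The seed-and-extend idea can be sketched as below. This is a simplification, not BLAST itself: it uses exact word matches only, whereas protein BLAST also accepts similar words that score above a threshold under a substitution matrix. The word length, scores and `x_drop` cut-off are hypothetical values chosen for the example.

```python
from collections import defaultdict


def index_words(seq: str, k: int = 3) -> dict:
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) for every exact shared word."""
    subject_index = index_words(subject, k)
    return [(q, s)
            for q in range(len(query) - k + 1)
            for s in subject_index.get(query[q:q + k], [])]


def extend_seed(query: str, subject: str, q: int, s: int, k: int = 3,
                match: int = 1, mismatch: int = -1, x_drop: int = 2):
    """Ungapped extension of a seed in both directions into an HSP."""
    best = score = k * match
    q_start, q_end = q, q + k
    i, j = q + k, s + k
    # Extend right until the score falls too far below the best seen.
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        i += 1; j += 1
        if score > best:
            best, q_end = score, i
    score = best
    i, j = q - 1, s - 1
    # Extend left in the same way.
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        i -= 1; j -= 1
    offset = s - q   # an ungapped HSP stays on one diagonal
    return best, query[q_start:q_end], subject[q_start + offset:q_end + offset]


if __name__ == "__main__":
    query, subject = "MKTAYIAKQR", "GGTAYIAKGG"
    seeds = find_seeds(query, subject)
    print(seeds)
    print(extend_seed(query, subject, *seeds[0]))
```

Only sequences that share a seed are examined further, which is where most of the speed over a full dynamic programming search comes from.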
Accuracy of Database Searches
- P-value: Probability of getting a given score by chance (smaller is better).
- E-value: Expected number of matches with equivalent or higher scores by random chance.
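For local alignment scores, the two measures are commonly related through Karlin-Altschul statistics: E = K × m × n × e^(−λS) and P = 1 − e^(−E), where m and n are the query and database lengths and S is the raw score. In the sketch below, the values of λ (`lam`) and K are illustrative assumptions; in practice they depend on the scoring matrix and gap penalties.

```python
import math


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float = 0.267, k: float = 0.041) -> float:
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e: float) -> float:
    """Probability of at least one chance hit, given the E-value."""
    return -math.expm1(-e)   # same as 1 - exp(-e), but accurate for small e


if __name__ == "__main__":
    e = e_value(raw_score=100, query_len=300, db_len=1_000_000)
    print(e, p_value(e))
```

The E-value grows in proportion to database size, so the same score is less significant in a larger database. For small values, E and P are almost equal.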
MMseqs2
- Designed to keep pace with the increase in DNA sequencing throughput.
- Designed for large-scale metagenomic and related DNA studies.
- Improves the speed-sensitivity trade-off for searches and provides improved metagenomic analysis.
- Three-stage process: k-mer matching, ungapped alignment, then full gapped alignment. The early stages filter candidates quickly, which accelerates the search while still allowing accurate results for remote matches.
Recall and Precision
- Recall (or Sensitivity): Fraction of actual positives that are identified, TP / (TP + FN).
- Precision: Fraction of predicted positives that are actually positive, TP / (TP + FP).
- Specificity: Fraction of negatives that are correctly identified as negatives, TN / (TN + FP).
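These three measures can be computed directly from the counts of true positives (tp), false positives (fp), true negatives (tn) and false negatives (fn). The counts in the example are invented for illustration.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that are identified."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


def specificity(tn: int, fp: int) -> float:
    """Fraction of actual negatives correctly identified as negatives."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical search: 100 true homologues among 1000 database sequences.
    tp, fn, fp, tn = 80, 20, 10, 890
    print(recall(tp, fn), precision(tp, fp), specificity(tn, fp))
```

In database searching the negatives usually far outnumber the positives, so specificity can look high even when precision is modest.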
ROC Curves
- Plot of sensitivity against 1 - specificity at different thresholds/cut-offs; useful for comparing algorithms.
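A sketch of how the points of a ROC curve can be computed from a ranked list of hits. It assumes each hit has a score and a known true/false label, lowers the cut-off one hit at a time, and does not treat tied scores specially. The function name and example data are hypothetical.

```python
def roc_points(scores: list, labels: list) -> list:
    """(false positive rate, true positive rate) as the cut-off is lowered."""
    positives = sum(1 for is_true in labels if is_true)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]   # strictest cut-off: nothing is called positive
    tp = fp = 0
    for _, is_true in sorted(zip(scores, labels), reverse=True):
        if is_true:
            tp += 1
        else:
            fp += 1
        # False positive rate is 1 - specificity; true positive rate is sensitivity.
        points.append((fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    for point in roc_points([90, 80, 70, 60], [True, True, False, True]):
        print(point)
```

A curve that rises steeply towards the top-left corner indicates a method that ranks true matches ahead of false ones.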
Protein Families
- Related proteins with similar sequences, structures, and functions.
- Useful for studying evolutionary relationships and structural/functional analysis.
- Key residues tend to be conserved to maintain function.
- Multiple sequence alignment is a useful tool for studying protein families.
CLUSTAL
- A heuristic algorithm (educated guess) to create multiple alignments.
- Iterative (progressive) approach that starts with the most closely related sequences.
Other Programs
- Clustal Omega (more advanced).
- MUSCLE (a fast algorithm that uses short sequence words for approximate matching).
- T-Coffee.
PROSITE
- Database of protein motifs.
- Useful for searching for specific sequence patterns with motifs.
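PROSITE patterns are written in a compact syntax: `x` matches any residue, `[ ]` lists allowed residues, `{ }` lists excluded residues, `(n)` or `(n,m)` gives a repeat count, and `<` and `>` anchor the pattern to the sequence ends. The sketch below converts the common cases to a regular expression. The pattern and sequence are illustrative, and rarer syntax (such as anchors inside brackets) is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        repeat = ""
        if "(" in element:
            # "x(2,4)" becomes element "x" with repeat "{2,4}".
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                              # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"       # excluded residues
        else:
            core = element                          # one residue or [ABC]
        parts.append(("^" if at_start else "") + core + repeat
                     + ("$" if at_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    # A zinc-finger-style pattern written in PROSITE syntax.
    regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
    print(regex)
    sequence = "AAC" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "AA"
    print(re.search(regex, sequence) is not None)
```

A pattern either matches or does not, so it works well for short, sharply defined motifs but cannot express graded preferences at a position.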
Sequence Profiles and Hidden Markov Models (HMMs)
- Sequence profiles (PSSMs): Position-specific scoring matrices.
- HMMs: Probabilistic models that build on position-specific scoring (as in PSSMs) and also model insertions and deletions.
- Both are more sensitive than PROSITE patterns over large sequence regions.
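A position-specific scoring matrix can be sketched as per-column log-odds scores derived from a multiple alignment. This minimal version assumes a uniform background frequency and a simple pseudocount, both simplifications, and the toy alignment is invented for the example.

```python
import math
from collections import Counter


def build_pssm(alignment: list, pseudocount: float = 1.0) -> list:
    """One dict of log-odds scores per alignment column."""
    alphabet = sorted(set("".join(alignment)) - {"-"})
    background = 1.0 / len(alphabet)   # assumed uniform background frequency
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm


def score_sequence(pssm: list, seq: str) -> float:
    """Sum the position-specific scores of seq (same length as the profile)."""
    # Residues absent from the profile get the lowest score in that column.
    return sum(col.get(aa, min(col.values())) for col, aa in zip(pssm, seq))


if __name__ == "__main__":
    profile = build_pssm(["ACD", "ACE", "ACD", "GCD"])
    print(score_sequence(profile, "ACD"))
    print(score_sequence(profile, "GCE"))
```

Unlike a single substitution matrix, the same residue receives a different score at each position, which is what lets profiles reward conserved key residues.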
PSI-BLAST
- Advanced BLAST variant that uses the alignments found in earlier rounds to improve searches for distant relationships.
- Iterative procedure: each round searches with a profile rather than a single sequence, improving accuracy.
- Useful tool for identifying and testing homology relationships.
- Each iteration builds a new profile from the matches found so far, refining the previous one.
Next-Generation Sequence Analysis
- More sensitive methods to find distant homologues: HHblits, jackhmmer.