Questions and Answers
What is the primary purpose of introducing gaps in sequence alignment?
- To decrease the number of mismatches.
- To shorten the length of the sequences.
- To increase the number of matches. (correct)
- To penalize the alignment score.
What does the score in a sequence alignment matrix generally represent?
- The length of the sequences being aligned.
- The number of gaps in the alignment.
- The evolutionary distance between sequences.
- The similarity or dissimilarity between sequences. (correct)
In the context of sequence alignment, what does a 'gap extension' refer to?
- The addition of a gap in one sequence to align with a known region of another sequence.
- Any gap added at the beginning of a sequence.
- The first gap in a series of consecutive gaps.
- All gaps other than the first gap in a consecutive series of gaps. (correct)
Which of the following statements accurately describes a 'transition' in the context of nucleic acid sequence comparison?
What is the primary rationale for using different scoring matrices in sequence alignment?
In a nucleic acid alignment matrix, if an identity score is +1 and a gap score is 0, what does a score of 0 for similarity imply?
What is the main advantage of using amino acid substitution matrices in protein sequence alignment?
In the context of protein sequence alignment, what do matrices that consider the physico-chemical properties of amino acids primarily aim to capture?
What is the fundamental principle behind constructing protein substitution matrices based on the observed substitutions in homologous protein families?
What is the significance of the Dayhoff matrix in the context of sequence alignment?
If a PAM1 matrix is multiplied by itself 100 times, resulting in a PAM100 matrix, what does this imply about the sequences being compared?
What is a key limitation of the PAM matrices?
BLOSUM matrices are derived from what type of sequence data?
What does the number in a BLOSUM matrix (e.g., BLOSUM62) indicate?
Which matrix would be most suitable for aligning sequences that have considerable evolutionary distance?
Flashcards
Distance (sequence alignment)
Measure of evolutionary changes between two genomes after divergence from a common ancestor.
Elementary Score
Score between two amino acids or nucleotides.
Global Score
Sum of the elementary scores on a sequence.
Gaps
Gap opening penalty
Gap extension penalty
Identity Matrix
Transition (genetics)
Transversion (genetics)
BLAST
Physico-chemical matrices
Evolution-based matrices
Log-odds matrix
PAM Matrices
BLOSUM Matrices
Study Notes
- The text discusses score matrices for nucleic and protein sequences in bioinformatics, used in sequence alignment.
Score Matrices
- The simplest elementary score matrices treat all differences as equally important and assign each position a score of 1 (identical) or 0 (different).
- Realistically, similarities between different letters (e.g., C and U in nucleic acids) should be considered, because converting cytosine to uracil requires only a deamination.
Biological Sequence Distance Matrices
- Distance measures the number of evolutionary changes between two genomes since their divergence from a common ancestor.
- Elementary score is the score between two amino acids or nucleotides.
- Elementary scores are defined for every possible pair of symbols in the alphabet used to describe the sequences.
- Global score is the sum of elementary scores on a sequence.
- Scores are generally calculated to qualify and quantify the similarity between sequences, measuring either how close or how remote the sequences are.
- These scores rely on a system that assigns an elementary score to each position when the sequences are written one under the other.
Gaps
- Inserting gaps mainly aims to increase the number of matches in the alignment.
- With two or more consecutive gaps in a sequence alignment, the gap penalty distinguishes the first gap (higher weight) from the subsequent gaps (noticeably lower weight).
- For example, with four consecutive gaps, the first gap is penalized at -1, while the other three are each penalized at -0.1.
- Gap penalty calculation: -1 + (-0.1) + (-0.1) + (-0.1) = -1.3
- Gap extension refers to all gaps in a consecutive series other than the first one, which is the most heavily penalized.
- The equation for a gap is Pgap = Po + k × Pe, where Po is the penalty for the first gap, Pe is the penalty for each extension gap, and k is the number of extension gaps.
- Gaps incur penalties during score calculation.
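The gap equation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `gap_penalty` and its parameter names are hypothetical, and the default values are simply the -1 and -0.1 used in the example.

```python
def gap_penalty(length, open_penalty=-1.0, extend_penalty=-0.1):
    """Penalty for a run of `length` consecutive gaps: Pgap = Po + k * Pe.

    k is the number of extension gaps, i.e. every gap after the first one.
    The default penalties are the illustrative values from the notes.
    """
    if length <= 0:
        return 0.0
    k = length - 1  # gaps other than the opening gap
    return open_penalty + k * extend_penalty


# Four consecutive gaps: one opening gap plus three extension gaps.
print(round(gap_penalty(4), 2))
```

Separating the opening and extension penalties makes one long gap cheaper than several short ones of the same total length.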
Nucleic Matrices
- Because DNA is written with a small alphabet of bases, only a few elementary score matrices exist for nucleic acids.
Identity Matrix
- The most widely used matrix is the identity matrix in all its forms, which measures how close or how remote two sequences are.
- In this matrix, identity is scored as +1, while similarity and gaps are scored as 0.
- Given two aligned sequences Seq1 and Seq2, the identity matrix assigns +1 for an identity (Id), 0 for a substitution (Sub), and 0 for a gap (Gap).
- The alignment score is obtained by reading the score of each aligned position from the matrix and summing the values.
- For an alignment with 6 identical positions out of 9, the identity percentage is (6/9) × 100 = 66.67%.
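The identity scoring described above can be sketched as follows. The two sequences are hypothetical, chosen only so that 6 of 9 positions are identical; they are not the sequences from the original example, and the function names are illustrative.

```python
def identity_score(seq1, seq2):
    """Sum of elementary scores with identity = +1, substitution = 0, gap = 0."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq1, seq2) if a == b and a != "-")


def percent_identity(seq1, seq2):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * identity_score(seq1, seq2) / len(seq1)


# Hypothetical aligned sequences with 6 identities over 9 positions.
seq1 = "ACGTTGCAA"
seq2 = "ACGATGGAT"
print(identity_score(seq1, seq2))
print(round(percent_identity(seq1, seq2), 2))
```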
Transversion/Transition Matrix
- In other matrices, certain associations are favored over others.
- Distinguishing purines from pyrimidines yields a matrix that penalizes transversions more than transitions.
- Purine bases derive from a purine core, an aromatic core with 9 atoms (5 C and 4 N); the bases result from substituting hydrogen atoms of the heterocycle with hydroxyl, amine, or methyl groups.
- Pyrimidine bases derive from a pyrimidine core, an aromatic core with 6 atoms (4 C and 2 N), in which hydrogen atoms of the heterocycle are substituted with hydroxyl, amine, or methyl groups.
Nucleic Matrices
- Nucleic matrices can be unitary or use three possible scores: identity (3), transition (1), and transversion (0).
- Identity is scored as +3, and substitutions are of two types: transitions and transversions.
- A transition is a substitution between nucleotides of the same family, either purines (A with G) or pyrimidines (C with T), and scores +1.
- A transversion is a substitution between nucleotides of different families (A with C or T, G with C or T) and scores 0.
- In the worked example, summing the scores for matches, transitions, and transversions yields an S-value of +19.
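A minimal sketch of the three-score scheme above (identity +3, transition +1, transversion 0). The sequences are hypothetical and do not reproduce the +19 example; gaps are not handled here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pair_score(x, y):
    """Elementary score for one aligned pair of nucleotides."""
    if x == y:
        return 3  # identity
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return 1  # transition: both bases in the same family
    return 0      # transversion: purine paired with pyrimidine


def alignment_score(seq1, seq2):
    """Global score: sum of elementary scores over an ungapped alignment."""
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


# A/A identity, G/A transition, C/C identity, T/G transversion.
print(alignment_score("AGCT", "AACG"))
```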
Kimura's 2-Parameter Matrix (K80)
- Observed transition rates are generally higher than transversion rates.
- Base frequencies are equal (πA = πT = πC = πG = 0.25).
- The rate parameter equation is λ = α + 2β, where α is the transition rate, β is the transversion rate, and λ is the instantaneous rate of change for any base.
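To see where λ = α + 2β comes from, note that each base has exactly one transition partner and two transversion partners. The sketch below builds a rate table on that basis; the function name and the numeric rates are hypothetical, and setting the diagonal to minus the row sum is a common convention for rate matrices that is assumed here rather than stated in the notes.

```python
def k80_rate_matrix(alpha, beta):
    """Rates of change between bases: alpha for transitions, beta for transversions."""
    bases = "AGCT"
    purines = {"A", "G"}
    rates = {}
    for x in bases:
        row = {}
        for y in bases:
            if x == y:
                continue
            same_family = (x in purines) == (y in purines)
            row[y] = alpha if same_family else beta
        # Diagonal entry: minus the total rate of leaving base x (assumed convention).
        row[x] = -sum(row.values())
        rates[x] = row
    return rates


q = k80_rate_matrix(alpha=0.2, beta=0.05)  # illustrative rates only
total_rate = -q["A"]["A"]                  # equals alpha + 2 * beta
print(round(total_rate, 3))
```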
BLAST Matrix
- The Basic Local Alignment Search Tool (BLAST) identifies similar regions between two or more nucleotide or amino acid sequences and generates an alignment of the homologous regions.
- Default settings for BLAST use +1 for matches and -3 for mismatches.
- Gap opening is penalized at -5 in BLASTN and at -11 in BLASTP, BLASTX, TBLASTX, and TBLASTN.
- Gap extension is penalized at -2 in BLASTN and at -1 in the other BLAST programs.
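The nucleotide values quoted above can be combined into a simple scoring sketch. This assumes the gap convention used earlier in these notes (the first gap of a run takes the opening penalty, each later gap takes the extension penalty); BLAST's own accounting of gap costs may differ, so treat the result as an illustration rather than a reproduction of BLAST output. The function name and the aligned sequences are hypothetical.

```python
def score_alignment(seq1, seq2, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score a pairwise alignment given as two equal-length strings with '-' for gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap_side = None  # which sequence carried a gap at the previous position
    for x, y in zip(seq1, seq2):
        if x == "-" or y == "-":
            side = 1 if x == "-" else 2
            # Continue an existing run of gaps, or open a new one.
            score += gap_extend if prev_gap_side == side else gap_open
            prev_gap_side = side
        else:
            score += match if x == y else mismatch
            prev_gap_side = None
    return score


# Five matches, one mismatch, and one gap of length two.
print(score_alignment("ACGTACGT", "ACG--CGA"))
```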
Protein Matrices
- Protein sequences differ significantly in alignment score calculation because they comprise a combination of 20 amino acids, unlike nucleic acids with only four nucleotides.
- Protein matrices, like nucleic matrices, assign a score to each amino acid residue pair comparison.
- Scoring is no longer simply 1 for identity and 0 for non-identity, because some amino acids are similar to one another.
- For example, Arg and Lys have similar physicochemical properties, so exchanging one for the other may not affect structure or function.
- Because an amino acid can be substituted with another without greatly altering a protein's structure or function, amino acids are classified into families, and the scoring system accounts for the affinity between protein residues.
- Resulting score matrices increase the reliability of protein similarity searches.
- Some matrices were derived from the degeneracy of the genetic code.
- In these matrices, elementary scores are determined by the number of identical nucleotides shared by the codons of two amino acids, which amounts to counting the minimum number of base changes required to convert one amino acid into another.
- Protein substitution matrices are built by analyzing substitution frequencies in alignments of known protein families.
- The matrix score also reflects the frequency at which a given amino acid occurs naturally, because certain amino acids are more abundant than others.
- Protein matrices fall into two categories. Evolution-linked matrices come from studies of amino acid substitution during evolution and represent the possible and acceptable amino acid exchanges during protein evolution. The second category relies on the physicochemical properties of amino acids, such as hydrophilicity/hydrophobicity and secondary/tertiary structure.
- Evolution-linked matrices are used to align protein sequences.
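The genetic-code idea mentioned above (the minimum number of base changes needed to convert one amino acid into another) can be sketched with the standard codon table. The function name is hypothetical, and turning this distance into an actual matrix score is left open because the notes do not specify the conversion.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_base_changes(aa1, aa2):
    """Minimum number of nucleotide differences between any codons of aa1 and aa2."""
    codons1 = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    codons2 = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons1
        for c2 in codons2
    )


print(min_base_changes("K", "R"))  # Lys and Arg codons can differ by a single base
print(min_base_changes("F", "K"))  # Phe and Lys codons differ at all three positions
```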
Protein Matrices Linked to Physicochemical Characteristics
- These matrices group amino acids according to their chemical and structural characteristics.
- In some instances, other matrices might not adequately reveal physicochemical characteristics shared by two proteins; therefore, matrices based mainly on these characteristics were determined.
- Common examples are hydrophobicity matrices, based on free energy measurements for the water-to-ethanol transfer of amino acids, and secondary structure matrices, based on the tendency of an amino acid to be in a given conformation.
- The recent increase in the number of determined three-dimensional structures has allowed matrices to be established from structural comparisons; these can be used to compare relatively distant proteins.
- One such matrix was established by superimposing the 3-D structures of 32 proteins grouped into 11 families of very close sequences.
Protein Matrices Linked to Evolution
- These matrices are built on the amino acid substitutions that occur during evolution, as observed in alignments of homologous proteins.
- They use observed and theoretical substitution frequencies.
- Examples: BLOSUM, PAM, JTT, and WAG.
- Log-odds matrices compute each score as the logarithm of an odds ratio: how often a particular residue is replaced by another in nature, relative to how often that replacement would occur by random chance.
Logarithmic Probability Matrix
- Sij = log2(qij / eij), where qij is the observed substitution frequency and eij is the expected substitution frequency.
- The log-odds ratio is the ratio of the observed frequency over the expected frequency.
- Substitutions observed more often than expected have a positive score, and those observed less often have a negative score.
- The most frequently used matrices are PAM and BLOSUM.
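The log-odds formula above is easy to check numerically. The frequencies below are made-up values chosen only to show the sign behaviour; real matrices also apply scaling and rounding, which are not covered here.

```python
import math


def log_odds(observed, expected):
    """Sij = log2(qij / eij) for one pair of residues."""
    return math.log2(observed / expected)


# Hypothetical frequencies: a pair seen twice as often as chance predicts,
# and a pair seen half as often as chance predicts.
print(log_odds(observed=0.02, expected=0.01))   # positive score
print(log_odds(observed=0.005, expected=0.01))  # negative score
```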
PAM (Point Accepted Mutation)
- PAM matrices were the first matrices used for sequence analysis and are credited to Margaret Dayhoff.
- They originate from a global manual alignment of 3000 proteins belonging to the same families.
- The PAM matrix helps to infer evolutionary relationships between proteins within families, extrapolating from closely related sequences to much more divergent ones.
- These matrices reflect the possible or acceptable substitutions of one amino acid for another during protein evolution.
- From these alignments, a probability matrix was calculated in which each entry gives the likelihood that amino acid A has changed to amino acid B within one evolutionary step.
- The mutation probability matrix corresponds to one accepted substitution per 100 sites over a given evolutionary period.
- An accepted substitution is one that does not impair the protein.
- Such a matrix is called the 1 Percent Accepted Mutation (PAM1) matrix.
- The unit of measurement resulting from this analysis is called the mutation unit or PAM unit.
Matrices
- Multiplying the PAM1 matrix by itself X times yields the X-PAM matrix, the substitution probability matrix for larger evolutionary distances.
- Each X-PAM probability matrix is then converted into a new similarity matrix, PAM-X.
- In PAMn, n is the number of accepted substitutions per 100 residues (e.g., PAM100).
- Several matrices exist with different parameters: PAM1, PAM30, PAM50, PAM100, PAM250.
- Higher values indicate greater evolutionary distances.
- The standard PAM250 matrix is well suited to distinguishing related proteins from unrelated sequences.
- Mutation matrices represent a very wide sample range and capture the expected amino acid exchanges.
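The repeated multiplication described above can be illustrated with a toy example. The 3-letter probability matrix below is entirely hypothetical (a real PAM1 matrix is 20 × 20 and derived from data); it only mimics the property that about 1% of residues change per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, exponent):
    """Multiply matrix m by itself `exponent` times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step substitution probabilities for a 3-letter alphabet.
# Each row sums to 1; the diagonal is the probability of no change.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

As the exponent grows, the diagonal entries shrink and the off-diagonal entries grow, which is how larger PAM numbers come to represent greater evolutionary distances.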
Limitations of PAM Matrixes
- PAM assumes that mutations are equally probable at every position in the protein, whereas some regions are in fact more variable than others.
- The initial alignments were global, covering entire sequences and therefore including poorly conserved regions.
- The molecules used in 1978 represent only part of the known proteins, mainly small soluble proteins.
PAM-250 Log Odds Substitution Matrix
- A later study took into account 16,130 sequences from version 15 of Swiss-Prot, resulting in 2621 families.
- A study on this scale can represent substitutions more fully than the 1978 data.
BLOSUM (BLOcks SUbstitution Matrix)
- BLOSUM matrices were designed to compare distantly related proteins.
- Unlike PAM, BLOSUM matrices are derived from patterns found within blocks of protein families, covering both conserved and divergent proteins.
- Blocks are multiple alignments of sites without insertions; they make up the BLOCKS database.
- Motifs are groups of sequences that encode a particular structure.
- Blocks generalize certain motifs.
- Each block is built from different sequences and is an alignment, without deletions or insertions, of a short conserved region. When building BLOSUM, sequence segments that exceed a minimum percentage of identity are grouped together to reduce their weight.
- Substitution rates are then converted to log-odds scores, giving the BLOcks SUbstitution Matrix.
- These matrices describe substitution patterns in the most conserved regions, in contrast to PAM.
- Many more proteins were known in 1992 than in 1978, which gives a more robust basis for deriving the matrices.
- Each BLOSUM matrix is calculated directly from sequences at a given evolutionary distance, whereas PAM matrices for large distances are extrapolated.
- A BLOSUM matrix depends on the amino acid frequencies and on the alignments used.
- BLOSUM matrices are effective at finding similar sequences.
BLOSUM K
- The K factor is the sequence conservation (percent identity) threshold used to build the matrix. Higher values of K suit similar sequences, while lower values of K suit more divergent sequences.
- The level of conservation of sequences guides matrix construction.
BLOSUM versus PAM
BLOSUM | Identity | Evolutionary Distance |
---|---|---|
BLOSUM45 | Up to 45% | Largest |
BLOSUM50 | Up to 50% | Large |
BLOSUM52 | Up to 52% | Medium |
BLOSUM60 | Up to 60% | Short |
BLOSUM80 | Up to 80% | Shorter |
BLOSUM90 | Up to 90% | Shortest |
- PAM matrices have several limitations.
- BLOSUM matrices avoid some of the limitations of traditional PAM matrices.
- BLOSUM works on conserved parts of proteins to extract the features those sequences have in common (such as highly similar amino acid patterns), whereas PAM relies on global alignments of entire sequences.
- BLOSUM results also tend to be more reliable because the matrices are built from more data.
Protein Matrix Choice
- The choice of matrix depends on the sequences being compared and on the parameters of the comparison.
- In direct sequence comparisons, BLOSUM matrices generally give better results than Dayhoff (PAM) matrices.
- The matrix should be chosen according to the objective of the analysis.
Selection Matrix
- PAM1 or BLOSUM80 suits very similar, closely related sequences.
- PAM120 or BLOSUM62 suits intermediate evolutionary distances.
- PAM250 or BLOSUM45 suits distantly related sequences.
Algorithms
- Algorithms for generating sequence alignments remain computationally intensive.
- Techniques are required to limit the number of meaningless alignments generated.
Matrix Recommendation
- The algorithm must be tested with the chosen matrix.