Score Matrices in Bioinformatics

Questions and Answers

What is the primary purpose of introducing gaps in sequence alignment?

  • To decrease the number of mismatches.
  • To shorten the length of the sequences.
  • To increase the number of matches. (correct)
  • To penalize the alignment score.

What does the score in a sequence alignment matrix generally represent?

  • The length of the sequences being aligned.
  • The number of gaps in the alignment.
  • The evolutionary distance between sequences.
  • The similarity or dissimilarity between sequences. (correct)

In the context of sequence alignment, what does a 'gap extension' refer to?

  • The addition of a gap in one sequence to align with a known region of another sequence.
  • Any gap added at the beginning of a sequence.
  • The first gap in a series of consecutive gaps.
  • All gaps other than the first gap in a consecutive series of gaps. (correct)

Which of the following statements accurately describes a 'transition' in the context of nucleic acid sequence comparison?

  • Substitution of a purine for another purine or a pyrimidine for another pyrimidine. (correct)

What is the primary rationale for using different scoring matrices in sequence alignment?

  • To account for varying degrees of evolutionary divergence between sequences. (correct)

In a nucleic acid alignment matrix, if an identity score is +1 and a gap score is 0, what does a score of 0 for similarity imply?

  • The sequences are unrelated at that position. (correct)

What is the main advantage of using amino acid substitution matrices in protein sequence alignment?

  • They allow non-identity to be scored, recognizing the similarity between certain amino acids. (correct)

In the context of protein sequence alignment, what do matrices that consider the physico-chemical properties of amino acids primarily aim to capture?

  • The structural and functional compatibility of amino acids. (correct)

What is the fundamental principle behind constructing protein substitution matrices based on the observed substitutions in homologous protein families?

  • To favor substitutions that occur frequently in nature. (correct)

What is the significance of the Dayhoff matrix in the context of sequence alignment?

  • It was among the first matrices developed for sequence alignment, based on observed mutation rates. (correct)

If a PAM1 matrix is multiplied by itself 100 times, resulting in a PAM100 matrix, what does this imply about the sequences being compared?

  • The sequences are more divergent, representing a greater evolutionary distance. (correct)

What is a key limitation of the PAM matrices?

  • They assume that all positions within a protein are equally mutable, which is not always the case. (correct)

BLOSUM matrices are derived from what type of sequence data?

  • Directly calculated from conserved blocks of aligned sequences. (correct)

What does the number in a BLOSUM matrix (e.g., BLOSUM62) indicate?

  • The percentage identity of the sequences used to construct the matrix. (correct)

Which matrix would be most suitable for aligning sequences that have considerable evolutionary distance?

  • BLOSUM45 (correct)

Flashcards

Distance (sequence alignment)

Measure of evolutionary changes between two genomes after divergence from a common ancestor.

Elementary Score

Score between two amino acids or nucleotides.

Global Score

Sum of the elementary scores on a sequence.

Gaps

Positions inserted into an alignment to account for insertions or deletions; each gap is penalized in the alignment score.


Gap opening penalty

Initial high penalty for the first gap in a series of consecutive gaps.


Gap extension penalty

Lower penalty applied to subsequent gaps after the first one.


Identity Matrix

Type of matrix that scores identity as +1 and substitutions and gaps as 0.


Transition (genetics)

Substitution within the same family (purine to purine or pyrimidine to pyrimidine).


Transversion (genetics)

Substitution between different families (purine to pyrimidine or vice versa).


BLAST

Algorithm to find similar regions between sequences.


Physico-chemical matrices

Protein matrices that consider the physicochemical properties of amino acids.


Evolution-based matrices

Substitution matrices from observed substitutions in homologous protein alignments.


Log-odds matrix

Matrix whose scores are the logarithm of an odds ratio: how frequently a particular substitution is observed relative to how frequently it would occur by chance.


PAM Matrices

Matrices derived from global alignments of closely related sequences.


BLOSUM Matrices

Block substitution matrices, derived from local alignments of conserved regions.


Study Notes

  • The text discusses score matrices for nucleic and protein sequences in bioinformatics, used in sequence alignment.

Score Matrices

  • Elementary score matrices previously treated all similarities and differences as equally important, assigning a score of 1 or 0.
  • Realistically, similarities between different letters (e.g., C and U) should be considered, because converting cytosine to uracil only requires a deamination.

Biological Sequence Distance Matrices

  • Distance measures the number of evolutionary changes between two genomes since their divergence from a common ancestor.
  • Elementary score is the score between two amino acids or nucleotides.
  • Elementary score represents all possible states based on the alphabet used in sequence descriptions.
  • Global score is the sum of elementary scores on a sequence.
  • Scores are generally calculated to qualify and quantify the similarity between sequences, measuring either how close or how distant the sequences are.
  • These scores rely on a system that assigns an elementary score to each position when the sequences are written one under the other.

Gaps

  • Inserting gaps mainly aims to increase the number of matches in the alignment.
  • With two or more consecutive gaps in a sequence alignment, the gap penalty distinguishes the first gap (higher weight) from the subsequent gaps (noticeably lower weight).
  • For example, with four consecutive gaps, the first gap is penalized at -1, while the other three are each penalized at -0.1.
  • Gap penalty calculation: -1 + (-0.1) + (-0.1) + (-0.1) = -1.3
  • Gap extension refers to the gaps after the first one; the first gap (gap opening) is the one that is heavily penalized.
  • The equation for a gap penalty is Pgap = Po + Pe × k, where Po is the penalty for the first gap, Pe is the penalty for each extension gap, and k is the number of extension gaps.
  • Gap penalties are applied to the alignment during score calculation.
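
The gap penalty formula above can be sketched in a few lines of Python. This is only an illustration: the function name `gap_penalty` and its default values (taken from the worked example, Po = -1 and Pe = -0.1) are assumptions, not part of any particular alignment tool.

```python
def gap_penalty(length, opening=-1.0, extension=-0.1):
    """Penalty for a run of `length` consecutive gaps: Pgap = Po + Pe * k.

    `opening` (Po) and `extension` (Pe) default to the values of the
    worked example; k is the number of gaps after the first one.
    """
    if length <= 0:
        return 0.0  # no gap, no penalty
    k = length - 1  # extension gaps: every gap except the first
    return opening + extension * k


# Four consecutive gaps: one opening gap plus three extension gaps.
print(round(gap_penalty(4), 2))
```

With these defaults, a run of four gaps costs about as much as the calculation above (-1.3), whereas four isolated gaps would cost -4, which is why affine penalties favor grouping gaps together.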

Nucleic Matrices

  • Because the bases composing DNA molecules have a limited alphabet, only a few nucleic matrices like elementary score matrices exist.

Identity Matrix

  • The most widely used matrix is the identity matrix in all its forms, which measures both closeness and distance between two sequences.
  • In this matrix, identity is scored as +1, while substitutions and gaps are scored as 0.
  • Given two aligned sequences Seq1 and Seq2, each position is scored +1 for an identity (Id), 0 for a substitution (Sub), and 0 for a gap (Gap).
  • The alignment score is the sum of these values over all aligned positions.
  • For an alignment with 6 identities over 9 positions, the identity percentage is (6/9) x 100 = 66.67%.
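
As a minimal sketch of identity scoring, the Python below counts identities in a pairwise alignment and derives the identity percentage. The two aligned sequences are hypothetical, chosen only so that they give 6 identities over 9 positions; the function names are likewise illustrative.

```python
def identity_score(aligned1, aligned2):
    """Sum of elementary scores: +1 per identity, 0 per substitution or gap."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(aligned1, aligned2) if a == b and a != "-")


def percent_identity(aligned1, aligned2):
    """Identities divided by the number of aligned positions, times 100."""
    return 100.0 * identity_score(aligned1, aligned2) / len(aligned1)


# Hypothetical alignment: 6 identities, 2 substitutions, 1 gap.
seq1 = "ACGTTGCAA"
seq2 = "ACGATG-AC"
print(identity_score(seq1, seq2))
print(round(percent_identity(seq1, seq2), 2))
```

Here the percentage is computed over the full alignment length, gaps included; other conventions (for example dividing by the shorter sequence length) exist and would give a different figure.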

Transversion/Transition Matrix

  • In other matrices, certain associations are favored over others.
  • Distinguishing purines from pyrimidines yields a matrix that penalizes transversions more than transitions.
  • Purine bases derive from a purine core, which is an aromatic core with 9 atoms (5C and 4N), resulting from the substitution of hydrogen atoms of heterocycles by hydroxyl, amine, or methyl radicals.
  • Pyrimidine bases derive from a pyrimidine core, an aromatic core with 6 atoms (4C and 2N), substituted at hydrogen atoms of the heterocycle with hydroxyl, amine, or methyl substituents.

Nucleic Matrices

  • Nucleic matrices can be unitary or use three possible scores: identity (3), transition (1), and transversion (0).
  • Identity is scored as +3, and substitutions are of two types: transition and transversion.
  • Transition is when the two sequences have nucleotides from the same family, either purine (A with G) or pyrimidine (C with T), giving a score = +1.
  • Transversion involves substituting nucleotides from different families (A with C or T, G with C or T), resulting in a score of 0.
  • The matrix calculation example yields an S-value of +19 based on matches, transitions, and transversions.
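
The three-score scheme can be illustrated with a short Python sketch. The sequences below are hypothetical and are not the ones behind the +19 example; the sketch also assumes ungapped sequences over A, C, G, T, since no gap score is given for this matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def elementary_score(a, b):
    """Identity = 3, transition = 1, transversion = 0."""
    bases = PURINES | PYRIMIDINES
    if a not in bases or b not in bases:
        raise ValueError("only ungapped A, C, G, T are handled in this sketch")
    if a == b:
        return 3  # identity
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return 1  # transition: both bases in the same family
    return 0      # transversion: purine paired with pyrimidine


def global_score(seq1, seq2):
    """Sum of elementary scores over two sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(elementary_score(a, b) for a, b in zip(seq1, seq2))


# A/A identity (3) + G/A transition (1) + C/C identity (3) + T/G transversion (0)
print(global_score("AGCT", "AACG"))
```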

Kimura's 2-Parameter Matrix (K80)

  • Observed transition rates are generally higher than transversion rates.
  • Base frequencies are equal (πA = πT = πC = πG = 0.25).
  • The rate parameter equation: λ = α + 2β (α = transitions rate; β = transversions rate; λ= instantaneous change rate for any base).
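
To see where λ = α + 2β comes from, the sketch below builds a K80-style rate table in Python: each base has one transition partner (rate α) and two transversion partners (rate β each). The function name and the numeric rates are illustrative assumptions.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k80_rate_matrix(alpha, beta):
    """Return a dict {(from, to): rate} for the two-parameter model.

    alpha is the transition rate and beta the transversion rate; the
    diagonal holds -(alpha + 2 * beta) so that every row sums to zero.
    """
    rates = {}
    for x in BASES:
        for y in BASES:
            if x != y:
                rates[(x, y)] = alpha if (x, y) in TRANSITIONS else beta
        rates[(x, x)] = -(alpha + 2 * beta)
    return rates


# Illustrative rates only: transitions four times as fast as transversions.
q = k80_rate_matrix(alpha=0.2, beta=0.05)
leaving_a = sum(rate for (x, y), rate in q.items() if x == "A" and y != "A")
print(leaving_a)  # total rate of change away from A, i.e. alpha + 2 * beta
```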

BLAST Matrix

  • The Basic Local Alignment Search Tool (BLAST) identifies similar regions between two or more nucleotide or amino acid sequences and generates an alignment of the homologous regions.
  • Default settings for BLAST use +1 for matches and -3 for mismatches.
  • Gap opening is penalized at -5 in BlastN (-11 in BlastP, BlastX, TBlastX, and TBlastN).
  • Gap extension is penalized at -2 in BlastN and -1 in the other BLAST programs.

Protein Matrices

  • Protein sequences differ significantly in alignment score calculation because they comprise a combination of 20 amino acids, unlike nucleic acids with only four nucleotides.
  • Protein matrices, like nucleic matrices, assign a score to each amino acid residue pair comparison.
  • Identity is no longer simply scored as 1 and non-identity as 0, because similarities between certain amino acids are taken into account in the alignment score.
  • For example, Arg and Lys have similar physicochemical properties, so exchanging one for the other may not affect structure or function.
  • Because an amino acid can often be substituted with another without greatly altering a protein's structure or function, amino acids are classified into families, and the scoring system accounts for the affinity between residues.
  • The resulting score matrices increase the reliability of protein similarity searches.
  • The first such matrices were established from the degeneracy of the genetic code.
  • Elementary scores were determined by the number of identical nucleotides shared by the amino acids' codons, which amounts to considering the minimum number of base changes required to convert one amino acid into another.
  • Protein substitution matrices are built by analyzing known protein family alignments' substitution frequencies.
  • The matrix score will reflect the frequency at which a given a.a. occurs naturally because certain amino acids are more abundant than others.
  • Protein matrices come in two categories: one groups matrices from studies illustrating amino acid substitution during evolution (evolution-linked matrices), which represent possible and acceptable amino acid exchanges during protein evolution; the other relies more on the physicochemical properties of amino acids, like hydrophilicity/hydrophobicity and secondary/tertiary structures.
  • Evolution-linked matrices are utilized to align protein sequences.

Protein Matrices Linked to Physicochemical Characteristics

  • These matrices group amino acids by their chemical and structural characteristics.
  • Evolution-based matrices may not adequately reveal certain physicochemical characteristics shared by two proteins; matrices based mainly on these characteristics were therefore developed.
  • Common examples are hydrophobicity matrices, based on free energy measurements for the water-to-ethanol transfer of amino acids, and secondary structure matrices, based on the tendency of an amino acid to adopt a given conformation.
  • The increase in determined three-dimensional structures recently allowed establishing matrices based on structural comparisons that can be utilized to compare relatively distant proteins.
  • One matrix was established by superimposing 3-D structures of 32 proteins grouped into 11 very close sequences.

Protein Matrices Linked to Evolution

  • Built on the amino acid substitutions that occur during evolution, as observed in alignments of homologous proteins.
  • They use observed and theoretical substitution frequencies.
  • Examples: BLOSUM, PAM, JTT, and WAG.
  • Log-odds matrices calculate the score as the logarithm of an odds ratio: the number of times a particular residue is replaced by another in nature, compared with the number of times the replacement would occur by random chance.

Logarithmic Probability Matrix

  • Sij = log2 (qij/eij), where qij is the observed substitution frequency, and eij is the expected substitution frequency.
  • The log-odds ratio is the ratio of the observed frequency over the expected frequency.
  • Substitutions observed more often than expected have a positive score, and those observed less often have a negative score.
  • The most frequently used matrices are PAM and BLOSUM.
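
The log-odds formula Sij = log2(qij/eij) can be checked numerically with a short Python sketch. The frequencies used here are made-up illustrative values, not entries from a real PAM or BLOSUM matrix.

```python
import math


def log_odds(observed, expected):
    """Sij = log2(qij / eij) for one pair of residues."""
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(observed / expected)


# Hypothetical pair observed four times more often than expected by chance.
frequent = log_odds(observed=0.02, expected=0.005)
# Hypothetical pair observed four times less often than expected by chance.
rare = log_odds(observed=0.001, expected=0.004)

print(frequent > 0, rare < 0)
```

A score of 0 corresponds to a substitution observed exactly as often as chance alone would predict.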

PAM (Point Accepted Mutation)

  • PAM matrices were among the first matrices used for sequence analysis and are credited to Margaret Dayhoff.
  • They originate from a manual global alignment of 3000 proteins belonging to the same family.
  • The PAM matrix helps infer evolutionary relationships between proteins within families, extrapolating from closely related sequences to more divergent ones.
  • These matrices reflect the possible or acceptable substitutions of one amino acid for another during protein evolution.
  • From these alignments, a probability matrix was calculated in which each entry gives the probability that amino acid A has changed into amino acid B within one evolutionary step.
  • This mutation probability matrix corresponds to one accepted substitution per 100 sites over a given evolutionary period.
  • An accepted mutation is one that does not impair the protein.
  • This matrix is called the 1 PAM (Percent Accepted Mutations) matrix.
  • The unit of measurement resulting from this analysis is called the mutation unit, or PAM unit.

Matrices

  • Multiplying the 1 PAM matrix by itself X times gives the X PAM matrix, the substitution probability matrix for larger evolutionary distances.
  • Each X PAM probability matrix is then converted into a similarity matrix, PAM-X.
  • In PAMn, n is the number of accepted substitutions per 100 residues (e.g., PAM100).
  • Several matrices exist with different parameters: PAM1, PAM30, PAM50, PAM100, PAM250.
  • Higher values indicate greater evolutionary distances.
  • The standard PAM-250 matrix is best at distinguishing related proteins from unrelated sequences.
  • Mutation matrices are built from a very wide sample and capture the expected amino acid (AA) exchanges.
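
The repeated multiplication that produces higher-numbered PAM matrices can be sketched in plain Python. The 3-state matrix below is a toy stand-in (a real PAM1 matrix is 20 x 20), and its values are invented purely for illustration.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, x):
    """Multiply matrix m by itself x times (x = 0 gives the identity)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, m)
    return result


# Toy "1 PAM-like" mutation probability matrix: each row sums to 1 and
# each residue has a 99% chance of staying unchanged in one step.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_power(toy_pam1, 250)
# The probability of remaining unchanged drops as the distance grows.
print(toy_pam1[0][0], round(toy_pam250[0][0], 3))
```

The result is still a probability matrix; turning it into a similarity matrix would require the additional log-odds step described earlier.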

Limitations of PAM Matrices

  • PAM treats all positions in a protein as equally mutable, whereas variability actually differs from site to site.
  • The initial alignments are global: they consider the entire sequences, including regions that are not conserved.
  • The molecules available in 1978 may represent only part of the protein world, mostly small soluble proteins.

PAM-250 Log Odds Substitution Matrix

  • The study took into account 16,30 sequences from release 15 of Swiss-Prot, grouped into 2621 families.
  • This type of study may represent all substitutions within 1978.

BLOSUM (BLOcks SUbstitution Matrix)

  • The goal of BLOSUM matrices is to compare distantly related proteins.
  • BLOSUM matrices differ from PAM in that they are derived from patterns found inside blocks of protein families, covering both conserved and divergent proteins.
  • Blocks are ungapped multiple alignments of short regions (no insertions or deletions); they are collected in the BLOCKS database.
  • Motifs are groups of sequences that encode a particular structure.
  • Blocks generalize certain motifs.
  • Each block is built from different sequences and is aligned without any insertions or deletions over a short conserved region. Within a block, BLOSUM groups sequence segments that share at least a minimum percentage of identity.
  • Substitution rates are then converted to log-odds scores, giving the BLOcks SUbstitution Matrix.
  • These matrices describe the substitution patterns of more conserved regions than PAM does.
  • Far more proteins were known in 1992 than in 1978, providing a more robust basis for deriving these newer matrices.
  • BLOSUM matrices are calculated directly at distinct evolutionary distances, whereas PAM matrices are extrapolated.
  • A BLOSUM matrix depends on amino acid frequencies and on the alignment.
  • BLOSUM can find the most similar sequences.

BLOSUM K

  • The K in BLOSUM K is the sequence conservation (percentage identity) threshold used to build the matrix. A higher K corresponds to more similar sequences, while a lower K is suited to assessing more divergent sequences.
  • The level of conservation of sequences guides matrix construction.

BLOSUM versus PAM

BLOSUM matrix    Sequence identity    Evolutionary distance
BLOSUM45         Up to 45%            Largest
BLOSUM50         Up to 50%            Large
BLOSUM52         Up to 52%            Medium
BLOSUM60         Up to 60%            Short
BLOSUM80         Up to 80%            Shorter
BLOSUM90         Up to 90%            Shortest

  • PAM matrices have several limitations.
  • BLOSUM matrices avoid some of the limitations of traditional PAM matrices.
  • BLOSUM is built from local alignments of conserved protein regions, extracting the features common to those sequences, whereas PAM is built from global alignments of whole sequences.
  • BLOSUM matrices also tend to give more reliable results because they are derived from more data.

Protein Matrix Choice

  • The choice depends on the parameters of the analysis and the sequences being compared.
  • For direct sequence comparisons, BLOSUM matrices generally give better results than Dayhoff (PAM) matrices.
  • The matrix should be chosen according to the objective of the comparison.

Selection Matrix

  • PAM1 or BLOSUM80 is for closely related sequences that have diverged very little.
  • PAM120 or BLOSUM62 suits intermediate evolutionary distances.
  • PAM250 or BLOSUM45 is for distant sequences that have diverged considerably.

Algorithms

  • Generating sequence alignments remains computationally intensive.
  • Techniques are required to limit the number of meaningless alignments generated.

Matrix Recommendation

  • Algorithm must be tested
