Bioinformatics lecture 3+4 Bi4999en

Questions and Answers

Protein synthesis is a process that occurs in three steps: Transcription, Splicing, and Translation.

False (B)

UniProtKB is a central repository of protein sequences and is supported by a collaboration between EBI, the Swiss Institute of Bioinformatics, and the Protein Information Resource (PIR).

True (A)

Post-translational modifications refer to the changes that occur to proteins after they have been synthesized, transforming them into mature proteins.

True (A)

Motifs or profiles databases contain exhaustive primary sequences of proteins with no abstractions or patterns.

False (B)

Generalist databases, such as UniProtKB, only include sequences from very specifically defined sources.

False (B)

There is only one type of protein sequence database available for researchers.

False (B)

The quality level of annotation in databases like UniProtKB can vary between manual and automatic entry.

True (A)

The primary sequence of proteins is unrelated to annotations and cross-references in databases.

False (B)

PAM matrices represent evolutionary information based on distant protein relationships.

False (B)

BLOSUM62 is derived from sequences clustered at 62% identity or greater.

True (A)

Higher numbers in BLOSUM matrices indicate more evolutionary distance between sequences.

False (B)

PAM250 corresponds to a residue identity of about 20% between proteins.

True (A)

BLOSUM1 corresponds to 1% identity and evaluates highly diverse protein alignments.

True (A)

PAM1 corresponds to a residue identity of 99%.

True (A)

The BLOSUM matrices are derived from individual sequences without any clustering.

False (B)

PAM matrices are extrapolated from PAM1 to represent various evolutionary distances.

True (A)

Introducing a gap in sequence alignment results in a negative score penalty.

True (A)

The identity matrix for protein similarity uses a score of 1 for different amino acids.

False (B)

Substitution models evaluate the likelihood of one specific amino acid replacing another during mutation.

True (A)

1 PAM is defined as the time it takes for 1 out of 100 amino acids to mutate.

True (A)

The Dayhoff Mutation Data Matrix is based on inferred evolutionary distances derived from genome sequencing.

False (B)

The PAM matrix product allows for inference of homology in proteins beyond the twilight zone.

False (B)

Gaps introduced in sequence alignments are beneficial as they eliminate the need for substitution models.

False (B)

Scores in substitution models are based exclusively on the identity of the amino acids involved.

False (B)

PDB format is advantageous because it is rarely supported by the majority of tools.

False (B)

A significant disadvantage of the PDB format is the absolute limits on the size of certain items of data.

True (A)

The mmCIF format was developed to simplify the handling of complicated structure data.

False (B)

One disadvantage of the mmCIF format is that it is easily readable by humans and computers.

False (B)

A notable feature of PDB format is its consistency across individual entries.

False (B)

The mmCIF format is more suitable for accessing individual entries compared to the PDB format.

False (B)

Hydrogen bonding and active sites are part of the data captured in the PDB format.

True (A)

The maximum number of chains allowed in the PDB format is over 30.

False (B)

R-factor should always be ≤ 0.4 for reliable models.

False (B)

DRESS and RECOORD web servers provide improved versions of NMR models.

True (A)

Local errors in a structure are indicated by residue B-factors < 50.

False (B)

Predictions of atomic resolution in NMR structures can be made using the ResProx tool.

True (A)

No guidelines exist for selecting NMR structures unlike X-ray structures.

True (A)

Quality checks involve only comparisons against high-resolution structures of nucleic acids.

False (B)

A structure showing a high number of outliers is likely to be problematic.

True (A)

B-factor values are irrelevant for assessing the reliability of a structure.

False (B)

The Ramachandran plot is used to check the stereochemical quality of protein structures by plotting the Ψ versus the Φ main chain torsion angles.

True (A)

In a well-defined protein structure, residues are typically dispersed in the 'disallowed' regions of the Ramachandran plot.

False (B)

Bad atom-atom contacts in protein structures are defined as two nonbonded atoms that have a center-to-center distance greater than the sum of their van der Waals radii.

False (B)

Counts of unsatisfied hydrogen bond donors are a parameter evaluated in validating protein structures.

True (A)

A real space R-factor is used to express how poorly each residue fits its electron density in a protein structure.

False (B)

Knowledge-based potentials assess how 'happy' each residue is in its local environment according to predefined criteria.

True (A)

The databases EDS and PDBREPORT provide pre-computed quality criteria for every structure in the Protein Data Bank (PDB).

True (A)

Poorly defined protein structures generally show residues clustered tightly in the most favored regions of the Ramachandran plot.

False (B)

Protein synthesis includes four steps: Transcription, Splicing, Translation, and Elimination.

False (B)

Bioinformatics relies on databases that can provide sequences from any source, such as UniProtKB.

True (A)

PAM and BLOSUM matrices are interchangeable for evaluating amino acid substitutions across all evolutionary distances.

False (B)

The dynamic programming algorithm used for sequence alignments is optimized for both global and local alignments.

False (B)

Post-translational modifications occur before protein synthesis is completed, altering proteins into their mature forms.

False (B)

Transmembrane beta-strand barrels (TMB) typically contain 10 - 30 residues.

False (B)

UniProtKB annotations may vary in quality depending on whether they are created manually or automatically.

True (A)

Word-based methods for sequence alignments guarantee optimal alignments each time they are applied.

False (B)

In the context of multiple sequence alignments, progressive methods begin by aligning the least similar sequences first.

False (B)

Low B-factor values indicate that residues are likely stable in their local environment.

True (A)

The positive-inside rule indicates that positively charged residues are more prevalent in loop regions outside the membrane.

False (B)

Diagonal transitions in the dynamic programming matrix represent gaps in the sequence alignment.

False (B)

Motif databases derive information solely from full primary sequences without abstract representation.

False (B)

PDB format is a flexible format that allows variable lengths for its entries.

False (B)

The Ramachandran plot illustrates the steric arrangement of amino acid residues based on the angles of the main chain torsion.

True (A)

Hydrophobicity analysis is particularly useful for predicting transmembrane beta-strand barrels.

False (B)

The final alignment in dynamic programming corresponds to the path in the matrix that minimizes the score.

False (B)

Methods for solubility and expressability prediction do not rely on machine learning techniques.

False (B)

Back-tracing in sequence alignment starts from the top-left corner of the scoring matrix.

False (B)

The mmCIF format is specifically designed to complicate the handling of structure data.

False (B)

Gaps in sequence alignments are always beneficial as they improve alignment scores.

False (B)

The substitution model scores are based solely on the identity of the corresponding amino acids.

False (B)

The PDB format is the least supported format for 3D structure data representation.

False (B)

ResProx tool is used to make predictions about atomic resolution in NMR structures.

True (A)

Using an identity matrix, a score of 1 is assigned when two different amino acids are present.

False (B)

The Dayhoff Mutation Data Matrix is based on a large sample of observed mutations for estimating evolutionary distances.

True (A)

A gap in sequence alignment is treated as a positive score penalty to encourage shorter alignments.

False (B)

Evolutionary distance in PAM is measured as the time for 1 out of 100 amino acids to remain unchanged.

False (B)

The PAM250 matrix represents a scenario where the proteins considered have approximately 20% residue identity.

True (A)

Substitution models assess the probability of observing mutations without considering evolutionary relations.

False (B)

The introduction of more gaps in sequence alignment can enhance the accuracy of biologically meaningful alignments.

True (A)

A Markov chain model is utilized to derive the PAM matrix product, which helps infer protein homology.

True (A)

The maximum number of atom records in a PDB file is limited to 99,999.

True (A)

The mmCIF format is rarely supported by visualization and computational tools.

True (A)

PDB format is deemed suitable for computer extraction of information due to its consistency.

False (B)

Each field of information in the mmCIF format is linked to other fields using a designated syntax.

True (A)

PDB format allows for a maximum of 30 chains in a single file.

False (B)

Inconsistencies within a single PDB entry include different residue numbering in the SEQRES and ATOM sections.

True (A)

The advantages of the PDB format include being difficult to read and use.

False (B)

The mmCIF format is suitable for accessing individual entries as it is easily readable.

False (B)

In a Ramachandran plot, residues of a well-defined protein structure are typically dispersed in the 'disallowed' regions.

False (B)

Bad atom-atom contacts are defined as two nonbonded atoms with a center-to-center distance less than the sum of their van der Waals radii.

True (A)

Hydrogen bonding energies are not assessed during the validation of protein structures.

False (B)

The Ramachandran plot is only useful in evaluating RNA structures, not protein structures.

False (B)

A high number of unsatisfied hydrogen bond donors in a protein structure is a sign of good structural quality.

False (B)

The real space R-factor is a metric that expresses how well each residue fits its electron density.

True (A)

Knowledge-based potentials evaluate how 'unhappy' each residue is in its local environment, indicating a problematic overall structure.

True (A)

All major databases provide pre-computed quality criteria for every structure in the Protein Data Bank (PDB).

False (B)

Alternative splicing can result in multiple isoforms of proteins that share identical sequences.

False (B)

The evolutionary information can enhance the accuracy of predictions related to protein properties.

True (A)

The process of sequence alignment aims to assess the differences exclusively without considering evolutionary relationships.

False (B)

Darwinian evolution posits that variations that enhance an individual's biological fitness will likely be inherited by future generations.

True (A)

The assumption of large inter-individual differences is essential for Darwinian evolutionary theory.

False (B)

Homology can be inferred solely from matching the primary sequences of proteins without any additional information.

False (B)

Proteins can exhibit properties such as transmembrane regions solely based on their secondary structure.

False (B)

Speciation is a direct result of the accumulation of inherited variations over time due to natural selection.

True (A)

Function is solely dictated by sequence without regard for 3D structure.

False (B)

Selective pressure operates primarily at the sequence level in proteins.

False (B)

Homologous proteins arise from genes that evolved from a common ancestor.

True (A)

Innovation in proteins occurs exclusively through large-scale genetic changes.

False (B)

3D structures of proteins are unaffected by their amino acid sequences.

False (B)

Adaptation in proteins leads to improved function in a given environment.

True (A)

Mutations cannot be passed down to subsequent generations.

False (B)

The sequence-structure-function paradigm emphasizes the relationship between these three aspects in proteins.

True (A)

Protein synthesis involves processes including Transcription, Splicing, and Translation, followed by Post-translational modifications to form mature proteins.

True (A)

UniProtKB is exclusively a specialist database that focuses solely on sequences from a limited biological pathway.

False (B)

Motifs or profiles databases do not provide abstracted information from primary sequences of proteins.

False (B)

Post-translational modifications occur prior to the synthesis of proteins and are essential for their final functional state.

False (B)

The quality of annotations in databases like UniProtKB is only determined by automatic processes, with no human intervention.

False (B)

BLOSUM matrices are used to evaluate evolutionary information based on proteins that share at least 62% identity.

True (A)

Multiple databases such as WormBase exclusively provide exhaustive primary sequences without any additional annotations.

False (B)

The mmCIF format is specifically designed to restrict access to individual entries, unlike PDB format.

False (B)

A pairwise alignment technique is only associated with Global alignments.

False (B)

Local alignments only consider similarity across the entire sequence of proteins.

False (B)

Substitution scores in amino-acid alignments are fixed and do not vary.

False (B)

Homologous proteins are those that share structural, functional, or sequence similarities regardless of their evolutionary background.

False (B)

Iterative methods are the only techniques used for multiple sequence alignments.

False (B)

Gaps in sequence alignments receive a positive score, encouraging their introduction.

False (B)

The purpose of a substitution matrix in sequence alignment is to optimize the total alignment score by pairing amino acids.

True (A)

The concept of homology in proteins is irrelevant to their structure and function.

False (B)

The dynamic programming algorithm for sequence alignments allows back-tracing from the top-left corner of the matrix for global alignment.

False (B)

Progressive methods for multiple sequence alignments first align the most divergent sequences before adding similar ones.

False (B)

Word methods in sequence alignment guarantee an optimal alignment by matching short non-overlapping sequence stretches.

False (B)

In local alignment, the Smith & Waterman algorithm allows back-tracing from any position in the alignment matrix.

True (A)

Dynamic programming algorithms for sequence alignments are known for being computationally efficient at all times.

False (B)

The relative positions of matching regions in word methods define an offset, which is the sum of corresponding coordinates.

False (B)

Substitution models assess the likelihood of residue pairs being aligned based solely on their sequence identity.

False (B)

Access to the PDB archive is available only through paid subscriptions.

False (B)

Systematic errors in model structures contribute to the overall accuracy of the data.

False (B)

Most structures in the PDB are of high quality, typically containing only systematic errors.

False (B)

Completely wrong structures can be caused by misinterpretation of the electron density map.

True (A)

All structures in the PDB are guaranteed to be correct and free from any type of error.

False (B)

Sequence-based and text-based queries are available through the wwPDB sites.

True (A)

Random errors are less common than systematic errors in structural models.

False (B)

Quality checks on structures require critical assessment before being used for specific purposes.

True (A)

Flashcards

Protein Synthesis Steps

Protein synthesis involves transcription (DNA to RNA), splicing (RNA to mRNA), translation (mRNA to protein), and post-translational modifications (protein to mature protein).

Protein Sequence Databases

Databases that store protein sequences, often with annotations and cross-references to other information. Types include generalist (like UniProtKB) and specialist databases (like WormBase) with different scopes.

UniProtKB

A central repository of protein sequences and functional information, known for detailed and quality annotations.

Protein Sequence Sources

Multiple databases hold protein sequences, categorized by scope (e.g., general or specific organism) and content (e.g., primary sequences and derived motifs).

Transcription

The process of copying DNA information into RNA, a crucial step in protein synthesis.

Translation

Converting mRNA information into a protein chain.

Splicing

The step in protein synthesis where non-coding parts of RNA are removed to create mature mRNA.

Protein Structure Levels

Proteins have different structural levels, with primary structure being the amino acid sequence.

Sequence Alignments

Methods used to compare similarities between different protein sequences.

Identity Matrix

A method for measuring protein similarity where 1 indicates identical amino acids and 0 otherwise.
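
The identity scheme can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the two sequences are assumed to be already aligned with no gaps.

```python
def identity_score(a: str, b: str) -> int:
    """Identity-matrix score for one aligned pair: 1 if identical, else 0."""
    return 1 if a == b else 0


def ungapped_identity_score(seq1: str, seq2: str) -> int:
    """Sum identity scores over two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length (gaps are not handled)")
    return sum(identity_score(a, b) for a, b in zip(seq1, seq2))
```

Because every mismatch scores the same 0, this scheme cannot tell a conservative substitution from a drastic one.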

Substitution Models

Methods for measuring protein similarity that consider the probability of amino acid substitutions.

PAM Matrix

A substitution model derived from observed mutations, useful for identifying evolutionary relationships among protein sequences.

Point Accepted Mutation (PAM)

A unit of evolutionary time, where 1 PAM signifies the time needed for one-hundredth of the amino acids to mutate.
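
The quiz notes that larger PAM matrices are extrapolated from PAM1 using a Markov chain model. A minimal sketch of that idea is repeated matrix multiplication. The 3-letter alphabet and the `TOY_PAM1` values below are invented for illustration and are not Dayhoff's data; the function names are hypothetical as well.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(pam1, n):
    """Return pam1 raised to the n-th power, i.e. n PAM units of change."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Hypothetical mutation-probability matrix for a toy 3-letter alphabet:
# each residue stays unchanged with probability 0.99 (1 change per 100).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = extrapolate(TOY_PAM1, 250)
```

Each row of the result is still a probability distribution, and the diagonal entries shrink as the number of PAM units grows, reflecting lower expected identity at larger evolutionary distances.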

Evolutionary Distance

The degree of difference between two sequences, often measured by PAM (Point Accepted Mutation) units.

Gap Penalty

A negative score given for introducing gaps in sequence alignments, to promote biologically meaningful alignments.
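
As a rough sketch of how a gap penalty enters the total score, the function below sums a substitution score for each aligned pair and a fixed negative penalty for each gap column. The linear penalty of -2 and the +1/-1 `match_mismatch` scheme are assumptions chosen for illustration, not values from the lecture.

```python
def match_mismatch(a: str, b: str) -> int:
    """Toy substitution score (assumed): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def score_alignment(row1: str, row2: str, substitution=match_mismatch,
                    gap_penalty: int = -2) -> int:
    """Score two aligned rows of equal length, where '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += gap_penalty          # each gap column lowers the score
        else:
            total += substitution(a, b)   # aligned residues use the substitution score
    return total
```

For example, `score_alignment("GATTACA", "GAT-ACA")` gives six matches and one gap, so 6 - 2 = 4.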

250 PAM matrix

A substitution matrix suited to distantly related sequences, near the limit at which homology can still be reliably inferred.

BLOSUM Matrices

Substitution matrices based on aligned fragments of conserved protein blocks (BLOCKS database).

PAM1

A PAM matrix derived from comparisons of closely related proteins; it represents one amino acid change per 100 residues.

BLOSUM62

A BLOSUM matrix derived from sequences clustered at 62% or greater identity.

Protein Block

Conserved fragments of related proteins that align well, suggesting that they have been retained from a common ancestor.

R-factor

A measure of how well a model fits experimental data.
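
The flashcard gives no formula. The conventional crystallographic definition, stated here as background rather than taken from the lecture, compares observed and calculated structure-factor amplitudes over all reflections hkl:

```latex
R = \frac{\sum_{hkl} \bigl|\, |F_{\mathrm{obs}}(hkl)| - |F_{\mathrm{calc}}(hkl)| \,\bigr|}
         {\sum_{hkl} |F_{\mathrm{obs}}(hkl)|}
```

Lower values mean better agreement between the model and the experimental data.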

Rfree

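An R-factor computed on a subset of reflections excluded from refinement; it is less prone to overfitting and therefore a more reliable measure of model quality than the R-factor.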

Side chain torsional conformers

Different shapes a side chain in a protein can take.

Protein structure quality

Assessing if a protein structure is reliable based on various checks.

NMR structures quality

Checking for quality of a structure created by Nuclear Magnetic Resonance.

Structure Validation

Process of checking protein or nucleic acid structures for accuracy and reliability.

B-Factors

Indicate the mobility or flexibility of atoms in a protein structure.

ResProx/DRESS/RECOORD

Tools used for validating and improving NMR protein structures.

PDB format

A file format storing atomic coordinates, chemical features, experimental details, and structural features of molecules, such as protein structures. Includes secondary structure, hydrogen bonds, biological assemblies, and active sites.

PDB format advantages

Widely supported by tools and easy to read and use, which makes it suitable for accessing individual entries.

PDB format disadvantages

Inconsistent information between entries and within an entry (e.g., differing residue numbering), not good for extracting structured data programmatically. Data item size limitations exist, unsuitable for large or complex molecular structures.
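
The size limits follow from the fixed-column layout of PDB records. The sketch below writes and reads a simplified ATOM line using the standard column positions (serial in columns 7-11, chain ID in column 22, coordinates in columns 31-54). The helper names are hypothetical, and atom-name alignment is simplified to left-justified, so treat this as an illustration rather than a complete parser.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, b_factor):
    """Build a simplified fixed-width ATOM record (66 columns)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b_factor:6.2f}"
    )


def parse_atom(line):
    """Extract fields from an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),      # 5 columns: at most 99,999 atoms
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],              # 1 column: one character per chain
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


example = format_atom(1, "CA", "ALA", "A", 10, 1.0, 2.0, 3.0, 1.0, 20.5)
record = parse_atom(example)
```

A serial number of 100000 needs six characters and no longer fits its five-column field, which is why very large structures cannot be stored in a single PDB-format file.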

mmCIF format

Macromolecular Crystallographic Information File; a format designed for increasingly complex molecular data. Uses tags to explicitly assign and link information, improving data consistency.

mmCIF format advantages

Easily parsed by software, consistent data across the database making it suitable for large-scale analysis and comparison.

mmCIF format disadvantages

Not easily read by humans, and often not supported by visualisation tools or software.

Data consistency

Ensuring uniformity and accuracy of data across entire data sets. Important in analyzing many structures.

Database size limitation

Restrictions on the amount of data that can be stored in a database. This can limit the analysis of large structures, necessitating the division of structures into smaller components if stored in a format like PDB.

Ramachandran Plot

A 2D plot of the main-chain torsion angles Ψ versus Φ for each residue of a protein. It helps assess the quality of a structure by showing which residues fall in allowed and disallowed regions.
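
Each point on the plot is a pair of torsion (dihedral) angles computed from four consecutive backbone atoms: Φ from C(i-1), N(i), CA(i), C(i) and Ψ from N(i), CA(i), C(i), N(i+1). A minimal sketch of the dihedral calculation follows; the function name and the test coordinates are made up for illustration.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four points, in (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Computing both angles for every residue and plotting Ψ against Φ gives the Ramachandran plot.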

Disallowed Regions

Areas on the Ramachandran plot where protein backbone angles are highly unlikely or impossible, indicating potential errors in the structure.

Bad Contacts

Unfavorable interactions between atoms in a protein structure, often detected by comparing distances to van der Waals radii.
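
The distance test can be sketched as follows. The radii are commonly quoted Bondi van der Waals values used here only for illustration, and the `tolerance` parameter is a hypothetical allowance; validation programs apply their own radii and thresholds.

```python
import math

# Illustrative van der Waals radii in angstroms (assumed values).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def is_bad_contact(elem1, xyz1, elem2, xyz2, tolerance=0.0):
    """True if two nonbonded atoms are closer than the sum of their radii.

    The caller is responsible for passing only nonbonded atom pairs.
    """
    distance = math.dist(xyz1, xyz2)
    limit = VDW_RADII[elem1] + VDW_RADII[elem2] - tolerance
    return distance < limit
```

Two carbon atoms 3.0 Å apart are flagged because the sum of their radii is 3.4 Å, while the same pair at 4.0 Å is not.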

Hydrogen Bond Energy

A measure of the strength of hydrogen bonds within a protein structure, indicating important interactions for stability.

Knowledge-Based Potentials

Scores that evaluate how 'happy' each amino acid residue is in its local environment within a protein structure.

EDS

A database that provides pre-computed quality criteria for protein structures deposited in the Protein Data Bank (PDB).

PDBsum

A database that provides a summary of information about protein structures, including quality and structural features.

Real-Space R-Factor

A measure of how well each residue in a protein structure fits its electron density, reflecting the accuracy of the structure.

Protein Synthesis

The process of creating proteins from DNA instructions, involving transcription, splicing, translation, and post-translational modifications.

Protein Sequence Databases (Specialist)

Databases that focus on specific protein types or conditions, such as pathways, diseases, or specific organisms.

Motifs or Profiles Databases

Databases storing patterns derived from protein sequences, representing conserved features among related proteins.

Levels of Protein Structure

Proteins have different structural levels: Primary (amino acid sequence), Secondary (alpha helices and beta sheets), Tertiary (3D shape of a single chain), and Quaternary (interactions between multiple chains).

Solubility Property Prediction

Predicting whether a protein sequence will fold into a soluble protein in a specific expression system. This also involves predicting the likelihood of aggregation.

Transmembrane Region Prediction

Identifying regions within a protein sequence that span cell membranes. These regions are crucial for proteins that interact with the cell's outer boundary.

Transmembrane Helix (TMH)

A type of transmembrane region where the protein folds into a helical structure, passing through the membrane.

Transmembrane Beta-Strand Barrel (TMB)

A type of transmembrane region where the protein forms a beta-barrel structure, creating a pore through the membrane.

Positive-Inside Rule

The principle that the loops connecting transmembrane helices carry more positively charged amino acids on the intracellular side of the membrane than on the outside.

Why is data consistency important in structural databases?

Data consistency ensures uniformity and accuracy across a dataset, allowing for reliable comparisons and analyses of multiple structures.

What are the limitations of certain database formats for storing large complex molecules?

Some formats have limitations on the amount of data they can store, such as limited numbers of atoms or chains, restricting the storage of large and complex molecular structures.

Why are gaps important?

Gaps in sequence alignments are necessary to account for insertions and deletions (indels) that occur during evolution. They allow for the alignment of sequences that have diverged over time, revealing similarities that might otherwise be missed.

What are the limitations of Identity Matrices?

Identity matrices, which simply score identical amino acids as 1 and different ones as 0, are limited in their ability to accurately reflect protein similarity. They often force the introduction of too many gaps in alignments, making it difficult to determine the true relationship between sequences.

What is the purpose of substitution models?

Substitution models are a way to overcome the limitations of identity matrices by considering the probability of amino acid substitutions. They allow us to more accurately measure protein similarity by taking into account the likelihood of certain mutations occurring over time.

Dynamic Programming Algorithm

A method for finding the optimal alignment between two protein sequences. Instead of enumerating every possible alignment, it fills a matrix with the best scores for aligning successively longer prefixes of the two sequences and then traces back the highest-scoring path.

Global Alignment

A sequence alignment method that aligns two sequences over their entire length, inserting gaps where needed to maximize the overall score.

Local Alignment

A sequence alignment method that identifies regions of high similarity within two sequences, finding the best matching segments even if the sequences are not overall very similar. It seeks out the most similar sections within the sequences.

Word Methods for Alignment

A method for efficiently comparing protein sequences by identifying short, non-overlapping subsequences (words) in the query sequence and finding matching words in the target (database) sequences. The matching words define regions for further analysis and alignment.

Heuristic Methods

Computational methods that use approximations and simplifying assumptions to find solutions quickly. They are often used for large-scale comparisons of protein sequences, but may not guarantee the best solution.

Progressive Alignment

A method for aligning multiple protein sequences by first aligning the most similar pairs, and then gradually adding less similar sequences to the alignment. This build-up approach helps to find relationships between multiple sequences.

Why are databases important for structure prediction?

Databases store and organize a vast amount of information about protein structures (3D models) and sequences, aiding researchers in understanding protein function and predicting their structures.

What is UniProtKB?

A comprehensive database that houses protein sequences from various organisms, including detailed annotations and cross-references to other databases.

What are the benefits of using evolutionary information in structure predictions?

Incorporating evolutionary information, such as sequence alignments and mutation rates, improves the accuracy of predictions because it provides insights into the relationships between sequences and how they have changed over time.

What is the purpose of sequence alignment?

It compares protein sequences, identifying similarities and differences to understand biological functions and evolutionary relationships.

How does Darwinian evolution relate to protein sequence variation?

Variations in protein sequences that make an individual better adapted to its environment are more likely to be passed down, leading to evolution and the development of new protein functions.

What is a PAM matrix?

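A substitution matrix derived from mutations observed in closely related proteins. PAM1 corresponds to one accepted point mutation per 100 residues; matrices for longer evolutionary times (such as PAM250) are extrapolated from it.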

How do BLOSUM matrices differ from PAM matrices?

BLOSUM matrices are based on aligned fragments of conserved protein blocks, while PAM matrices use observed mutations.

What is the purpose of structure validation?

It assesses the quality and reliability of predicted or experimentally determined protein structures by checking for errors and inconsistencies.

Protein Function

The specific task a protein performs. This is determined by the protein's 3D structure.

Protein Structure & Evolution

Changes in a protein's sequence (mutations) can lead to changes in its 3D structure. These structural changes can affect its function, impacting the organism's survival and adaptation.

Homologous Proteins

Proteins that share a common ancestor. They often have similar sequences and structures, which suggests that they perform similar or related functions.

Sequence/Structure/Function Paradigm

This principle states that a protein's sequence determines its 3D structure, which in turn determines its function.

Selective Pressure

Environmental factors that favor certain traits, leading to the survival and reproduction of organisms with advantageous adaptations.

Adaptation

A change in an organism's traits over time that allows it to better survive and reproduce in a specific environment.

Evolution

The gradual process of change in living organisms over generations due to factors like mutations and environmental pressures.

Diversity

The variation in traits within a population of organisms, which allows for adaptation and evolution.

What are the two main types of protein sequence databases?

Protein sequence databases can be divided into two primary categories: generalist databases, which store sequences from any source, and specialist databases, which focus on specific conditions, such as biological pathways, diseases, or organisms.

What's a primary source for protein sequences and annotations?

UniProtKB is a central repository for protein sequences and their related information. It's a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics, and the Protein Information Resource.

What are motifs or profiles databases?

These databases store patterns derived from primary protein sequences, representing conserved features found in families of related proteins. They offer a way to understand protein functions by analyzing recurring patterns.

What are the main steps of protein synthesis?

Protein synthesis occurs in two main steps: transcription, in which DNA is copied into RNA, and translation, in which mRNA is used to build a protein chain. In eukaryotes, non-coding parts of the RNA are removed by splicing before translation, and post-translational modifications afterwards convert the protein into its mature, functional form.

What are the levels of protein structure?

Proteins have four levels of structure: primary structure refers to the amino acid sequence; secondary structure describes local folds like alpha-helices and beta-sheets; tertiary structure is the 3D shape of a single protein chain; and quaternary structure involves the arrangement of multiple protein chains.

What are substitution models?

Substitution models are used in sequence alignment to account for amino acid substitutions that occur during evolution. They provide a more refined way to measure protein similarity than simple identity matrices.

What is the difference between global and local alignment?

Global alignment attempts to align the entire length of two sequences, while local alignment focuses on finding regions of high similarity within the sequences. Global alignment seeks overall similarity, while local alignment identifies best matching segments.
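
As a minimal sketch of the local case, the following Python function follows the Smith-Waterman scheme: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the matrix corner. The function name and the scoring values (match 2, mismatch -1, gap -2) are illustrative assumptions, not values from the lecture.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `smith_waterman("AAAGATTACATTT", "CCGATTACAGG")` picks out the shared `GATTACA` segment and ignores the unrelated flanks, which a global alignment would be forced to align as well.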

Molecular Evolution

The study of how protein sequences change over time due to mutations and natural selection. It helps us understand the relationships between different species and how proteins evolved.

Homology

Two proteins are homologous if they share a common ancestor. This means they descend from the same ancestral gene and have diverged over time.

Annotation Problem

The challenge of accurately labeling protein sequences with their functions. There are often many unknowns, making it difficult to assign the right function to a protein.

Substitution Matrix

A table that assigns scores to different pairs of amino acids, reflecting the likelihood of one amino acid being replaced by another during evolution.
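
To illustrate why such a table is more informative than an identity matrix, here is a small Python sketch. The scores in `TOY_SCORES` are made up for illustration and are not values from PAM or BLOSUM; all function names are hypothetical.

```python
# Toy scores for illustration only; NOT taken from a published matrix.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("I", "L"): 2,    # chemically similar pair: mild reward
    ("D", "L"): -4,   # dissimilar pair: penalty
    ("D", "I"): -3,
}

def pair_score(a, b, matrix=TOY_SCORES):
    """Look up a symmetric substitution score for residues a and b."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]

def identity_score(a, b):
    """Identity matrix: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0

def score_ungapped(seq1, seq2, scorer):
    """Sum the per-position scores of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(scorer(a, b) for a, b in zip(seq1, seq2))
```

With the identity scorer, `LID` versus `IID` and `LID` versus `DID` both score 2. With the toy substitution scores the conservative L/I change scores 12 and the L/D change scores 6, so the two cases are no longer indistinguishable.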

Dynamic Programming

A method used to solve complex problems by breaking them down into smaller, overlapping subproblems. Applied to sequence alignment, it helps find the best alignment between two sequences.

What is dynamic programming?

A method for finding the best alignment between two sequences by building up the scores of optimal partial alignments in a matrix, which implicitly considers all possible alignments, and then tracing back the highest-scoring path.

Word methods

An efficient method for comparing sequences by identifying short, non-overlapping subsequences (words) in the query and finding matching words in the target sequences.

What is a substitution model?

A method for measuring protein similarity that considers the probability of amino acid substitutions, taking into account the likelihood of mutations over time.

wwPDB

A collaboration founded by three organizations (RCSB PDB, PDBe, and PDBj) and later joined by BMRB and EMDB. It manages and distributes the Protein Data Bank (PDB) archive, a database containing 3D structures of biomolecules.

PDB Archive Access

The PDB archive is freely available to the public through the RCSB PDB, PDBe, and PDBj websites. Data is updated weekly and can be accessed via FTP or web interfaces.

Structural Quality Assurance

The process of evaluating the accuracy and reliability of protein structures, ensuring their suitability for research and applications.

Systematic Errors

Errors in protein structures that stem from fundamental flaws in the modeling process, often due to limitations in experimental data or interpretation.

Random Errors

Errors in protein structures that occur randomly due to limitations in experimental data or the inherent nature of protein dynamics.

Mistracing

A type of systematic error where the protein chain is incorrectly traced through the electron density map, leading to a distorted structure.

Frame-shift Errors

A type of systematic error where the protein chain is shifted by one or more residues in the electron density map, leading to an incorrect structure.

Incorrect Fold

A significant error in protein structure where the overall 3D shape is completely wrong.

Study Notes

Bioinformatics Protein Sequences and Databases

  • Protein sequences and the databases that store them are a core topic in bioinformatics
  • Databases are used to store and manage protein sequences

Structure Prediction

  • Artificial intelligence (AI) algorithms (like AlphaFold) can predict protein 3D structures from their amino acid sequences
  • Google DeepMind developed AlphaFold
  • This is a significant advancement in protein science

Protein Synthesis

  • Protein synthesis occurs in two steps:
    • Transcription: DNA → RNA
    • Translation: mRNA → protein
  • Post-translational modifications occur after translation, transforming the protein into a mature form.

Levels of Protein Structure

  • Primary structure: sequence of amino acids
  • Secondary structure: local folding patterns (alpha-helices, beta-sheets)
  • Tertiary structure: overall 3D folding of the polypeptide chain
  • Quaternary structure: arrangement of multiple polypeptide chains in a protein complex

Sources of Protein Sequences

  • UniProtKB: comprehensive generalist database of protein sequences and annotations
    • Maintained by the UniProt consortium: EBI, the Swiss Institute of Bioinformatics, and the Protein Information Resource
    • Combines the primary sequence with annotations (biological information, annotation quality level) and cross-references to other databases
    • Includes manually annotated entries
  • WormBase: specialist resource focusing on the nematode worm C. elegans and related species
  • Pfam: motif/profile database that captures the most conserved features among related proteins

UniProtKB

  • Contains reviewed protein entries (Swiss-Prot) and automatically annotated entries (TrEMBL)
  • High-quality manual annotations provide reliable information
  • ~570,000 curated protein records (2024)
  • ~250,000,000 automatically translated and annotated protein records of lower quality (2024)
  • Rich human-readable information about functions, names, taxonomy, subcellular locations, phenotypes, PTMs, expression, interactions, family/domains, sequence, and similar proteins.

UniProtKB - Specific Features

  • Human-readable explanations of protein function and information about enzymatic parameters (activity, kinetics)
  • Pathways and biological interpretations
  • Access to mutations and their effect on protein activity
  • Displays available 3D structures, links to AlphaFold predictions and other 3D structure databases

UniProtKB - Access

  • Each protein entry has a unique accession number (e.g., P21397).
  • Isoforms (sequence variants) of an entry carry a serial suffix (P21397-1, P21397-2).
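
A small Python sketch of how such identifiers could be split into the base accession and the isoform number. The helper name `split_isoform` is hypothetical, and the code only assumes the `ACCESSION-N` pattern shown above.

```python
def split_isoform(identifier):
    """Split 'P21397-2' into ('P21397', 2); plain accessions give (acc, None)."""
    accession, sep, suffix = identifier.partition("-")
    if not sep:
        return accession, None
    return accession, int(suffix)
```

For example, `split_isoform("P21397-2")` returns `("P21397", 2)`, while `split_isoform("P21397")` returns `("P21397", None)`.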

Summary of 1D Predictions

  • Properties such as solvent accessibility, solubility, transmembrane regions, and secondary structure can be predicted from the primary sequence.
  • Evolutionary information enhances accuracy of prediction methods.

Introduction to Sequence Alignment

  • Alignments model similarities and differences between protein sequences to identify conserved regions
  • Sequence similarity allows evolutionary relationships (homology) to be inferred

A Few Words on Evolution

  • Species develop through natural selection.
  • Variations are small and heritable, and they contribute to evolutionary fitness.
  • Selection pressure favors traits that improve functions allowing better survival and reproduction. This leads to speciation.
  • Evolutionary adaptation is a key concept.

A Few Words on Molecular Evolution

  • Function is determined by 3D shape.
  • Adaptations are influenced by environmental pressure.
  • Structure is determined by the sequence of a protein.

Sequence, Structure, Function Paradigm

  • 3D structure is determined by amino acid sequence.
  • Function is determined by 3D structure of the protein.
  • This is a core principle of protein science.

A Few Words on Molecular Evolution (2)

  • Evolutionary changes occur at the sequence level as mutations.
  • Selective pressure favors variations that enhance function and adaptation (natural selection)

A Few Words on Molecular Evolution (3)

  • Homology means two proteins trace back to a common ancestor.
  • Paralogs are homologous proteins that arose from the same ancestral gene through gene duplication.

Sequence Alignments

  • Alignments aim to align similar regions of different protein sequences.
  • Global alignments = consider similarity of the entire sequence.
  • Local alignments = consider similarity in small parts of the sequence.
  • Pairwise alignments = compare two sequences
  • Multiple sequence alignments = align more than two sequences.
  • Methods for performing these alignments include:
    • Dynamic programming (algorithms such as Needleman-Wunsch and Smith-Waterman)
    • Progressive methods and iterative methods
    • Dot plot methods
    • Word methods
  • Scoring matrices are used to score similarity (identity matrix, substitution models)
  • Matrices such as Dayhoff’s PAM matrices and the BLOSUM matrices are used for scoring.

Dynamic Programming Algorithm

  • Finds the optimal pairwise alignment and its similarity score.
  • Each dimension corresponds to one of the proteins to be aligned. Each square contains the score based on substitutions.
  • Diagonal transitions show aligned positions, vertical and horizontal transitions identify gaps.
  • The optimal path in the matrix shows the best match.
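
The matrix filling and traceback described above can be sketched in Python as a Needleman-Wunsch style global alignment. The simple match/mismatch/gap scores are assumptions chosen for illustration; a real protein alignment would typically take its substitution scores from a PAM or BLOSUM matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # diagonal: aligned pair
                              score[i - 1][j] + gap,    # vertical: gap in b
                              score[i][j - 1] + gap)    # horizontal: gap in a
    # Trace the optimal path back from the bottom-right corner.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` returns a score of 1 with the alignment `ACGT` over `A-GT`: three diagonal steps for the matches and one vertical step for the gap.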

Word Methods

  • Small sequence-stretches (k-tuples or words) in the query are matched across target sequences.
  • The position of matches defines alignment offsets.
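
A minimal Python sketch of this idea: index every word of length k in the target, look up each query word, and count hits per offset (target position minus query position). For simplicity the sketch uses overlapping words, which is an assumption of this example; the function names are hypothetical.

```python
from collections import defaultdict

def index_words(seq, k):
    """Map each length-k word in seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def word_hits(query, target, k=3):
    """Count word matches per alignment offset (target pos - query pos)."""
    target_index = index_words(target, k)
    offsets = defaultdict(int)
    for q in range(len(query) - k + 1):
        for t in target_index.get(query[q:q + k], []):
            offsets[t - q] += 1
    return dict(offsets)
```

An offset that collects many hits marks a diagonal worth aligning in detail. For example, `word_hits("MKTAYIA", "GGMKTAYIAGG")` gives `{2: 5}`: all five query words match at offset 2.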

Multiple Sequence Alignments (MSA)

  • MSA methods include dynamic programming, progressive methods, and iterative methods.
  • Progressive methods start by aligning the most similar pairs and then add less similar sequences one at a time, accounting for differences in sequence length.
  • Tools such as T-Coffee use additional information from pairwise alignments to guide this process

Beyond Pure Sequences: Patterns and Models

  • Aligned sequences help define patterns for searches in databases.
  • Useful tools include Position-Specific Scoring Matrices and Hidden Markov Models.
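
As a rough sketch of how a Position-Specific Scoring Matrix could be derived from aligned sequences, the Python code below computes log-odds scores per column. It assumes an ungapped alignment, a uniform background frequency, and a simple pseudocount; these are simplifications for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def build_pssm(aligned, alphabet, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(alphabet)   # assumption: uniform background
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm

def score_window(pssm, window):
    """Score a sequence window of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, window))
```

Residues that are conserved in a column receive positive scores and rarely seen residues negative ones, so sliding `score_window` along a database sequence highlights segments that fit the pattern.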

Secondary Structure Prediction

  • Prediction of the conformational state of each amino acid residue.
  • Common states include helix (H), strand (S), and coil (C)
  • Example prediction software: PSIPRED

Solvent Accessibility Prediction

  • Predict the extent to which a residue is accessible to solvent.
  • Accessibility values differ between amino acid types.
  • Simplified description is exposed vs buried residues

Solubility and Expressability Prediction

  • Predicting solubility in an expression system or propensity for aggregation.
  • Major methods rely heavily on machine-learning.

Transmembrane Region Prediction

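  • Predicts which regions of a protein span the membrane.
  • TMH (transmembrane helix): 12–35 residues.
  • TMB (transmembrane beta-strand): 10–25 residues.
  • Hydrophobicity alone is not helpful for TMB prediction, since barrel strands alternate hydrophobic (lipid-facing) and polar (pore-facing) residues. Analysis of charged residues and the positive-inside rule is used for topology prediction.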
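
For transmembrane helices, by contrast, a hydrophobicity scan is a classic starting point. The Python sketch below averages Kyte-Doolittle hydropathy values over a sliding window. The choice of this scale, the 19-residue window, and the 1.6 threshold are assumptions taken from the usual Kyte-Doolittle convention and are not named in these notes; modern predictors are considerably more sophisticated.

```python
# Kyte-Doolittle hydropathy scale (an assumption; the notes name no scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Average hydropathy for every window of the given length."""
    return [
        sum(KD[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to be TMH candidates."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value >= threshold]
```

Overlapping candidate windows would normally be merged into a single predicted segment, and the positive-inside rule can then suggest which flanking loop faces the cytoplasm.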

Structural Databases

  • A variety of databases exist for storage and analysis.
  • Examples include wwPDB

Data Formats

  • PDB
  • mmCIF
  • PDBML
  • Each format contains atom coordinates and associated data.

PDB Format

  • Designed in the early 1970s.
  • Rigid structure, 80 characters per line.
  • Most widely supported.
  • Contains atom coordinates, biological features and experimental details.
    • Advantages: Ease of reading and use, majority of tools support this format; often easy to access individual records.
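    • Disadvantages: inconsistency between records, limited information, and data constraints from the fixed-width fields (for example, atom serial numbers have five columns and chain identifiers a single character).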
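
Because the format is column-based, a record can be read by slicing fixed character positions. The Python sketch below parses one ATOM/HETATM line using the column layout of the PDB format specification (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, coordinates 31-54, B-factor 61-66). The `Atom` class and the function name are hypothetical, and only a subset of the fields is extracted.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    b_factor: float

def parse_atom_line(line):
    """Parse a single fixed-width ATOM or HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        b_factor=float(line[60:66]),   # columns 61-66
    )
```

The slices also make the data constraints visible: a five-character serial field cannot hold more than 99,999 atoms, and a one-character chain field limits the number of distinct chains.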

mmCIF Format

  • Designed to handle complex data; every data item is explicitly identified by a tag, and related items are linked through the file syntax
  • Advantages: Computer parsable, database-wide consistency
  • Disadvantage: Difficult to read.

Structural Databases (2)

  • wwPDB: Extensive database of 3D structures (proteins, nucleic acids, oligosaccharides)
    • Members: RCSB PDB, EMDB, PDBe, PDBj, BMRB
    • Includes data from X-ray crystallography, NMR, and electron microscopy.
  • Other databases include PDBsum, SCOP, Proteopedia, and Structural Biology KnowledgeBase

wwPDB - Data Deposition

  • All data are deposited in a single central archive
  • Common standards are used ensuring consistency and quality of structures.
  • PDB-IDs are unique identifiers for each structure.

wwPDB - Data Access

  • Free access to the archive's resources
  • Sites distribute archives that are updated regularly.
  • Different interfaces facilitate varied searches

Structural Quality Assurance

  • Structures are models that satisfy experimental data.
  • These models can have random and systematic errors.
  • Validation tests are needed to identify errors before a structure is interpreted.

Errors in Deposited Structures

  • Systematic errors (affect accuracy relative to the true structure)
    • Examples include misinterpretations, tracing errors, and errors in spectral interpretation
  • Random errors (uncertainties in atomic positions)
    • Examples include flipped side-chains or imprecise atomic positions

Examples of Systematic Errors

  • Completely wrong structures: tracing the protein chain through electron density yields an entirely incorrect fold.
  • Incorrect connectivity between secondary elements: a false order of secondary components, and residues in improper places in the 3D model
  • "Frame-shifts": Residues fitted into electron density of the next residue resulting in incorrect structure interpretation

Examples of Random Errors

  • Uncertainties in atomic positions (e.g., 0.01–1.27 Å)
  • Side-chain flips: symmetry in some amino acid side-chain shapes can lead to incorrect placements.

Rules of Thumb for Selecting Structures (X-ray)

  • High resolution (≤ 2.0 Å) and low R-factor (≤ 0.2) indicate higher accuracy.
  • The required quality depends on the type of analysis: comparing overall folds tolerates lower resolution, whereas analyzing details such as side-chain interactions requires higher resolution.
  • R-factor is a parameter to consider for reliability
    • R-free is an R-factor calculated from a small fraction of the data that is excluded from refinement, which makes it a less biased measure of reliability.
  • A residue B-factor > 50 indicates possible local errors; quality checks should be used
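
For reference, the conventional crystallographic R-factor compares observed and calculated structure-factor amplitudes: R = Σ| |F_obs| − |F_calc| | / Σ|F_obs|. A minimal Python sketch, with a hypothetical function name and plain lists of amplitudes as input (an assumption of this example), could look like this:

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over paired reflections."""
    if len(f_obs) != len(f_calc):
        raise ValueError("need one calculated amplitude per observed one")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator
```

R-free uses the same formula but is evaluated only on the reflections that were set aside and never used during refinement.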

Rules of Thumb for Selecting Structures (NMR)

  • No simple rule of thumb exists due to differences in accuracy.
  • Information on the quality of the structure is found in the original paper, and quality checks are often required.

Quality Checks of Structures

  • Testing for normality compares a given structure against values typical of known, well-determined structures.
  • Assessing outliers: multiple outliers suggest problems.
    • Examples: Ramachandran plot (favorable/disallowed regions), unfavorable atom-atom contacts analysis, potential energy evaluation, real-space R-factor, and various other parameters.

Validation of Protein Structures

  • Ramachandran plot (favored/disallowed regions)
  • Unfavorable Atom-Atom Contacts
  • Parameters like: H-bond donor counts, H-bond energies, real-space R-factor and others

Quality Information on the Web

  • Several databases provide pre-calculated quality criteria for all wwPDB structures, including EDS, PDBsum, PDBREPORT, and RCSB PDB.
  • The Electron Density Server (EDS) and PDBsum provide various quality parameters.

Description

This quiz covers essential topics in bioinformatics, focusing on protein sequences and databases, structure prediction using AI algorithms like AlphaFold, and the protein synthesis process. Test your understanding of protein structure levels, including primary, secondary, tertiary, and quaternary structures.
