Bioinformatics Protein Sequences and Databases PDF

Bioinformatics: protein sequences and databases
Structure prediction

Protein synthesis
Protein synthesis occurs in two main steps, transcription and translation, each followed by a processing step:
- Transcription: DNA -> RNA
- Splicing: RNA -> mRNA
- Translation: mRNA -> protein
- Post-translational modifications: protein -> mature protein

Translation

Levels of protein structure

Sources of protein sequences
- Multiple databases are available
  - With different scope or focus:
    - Generalist: sequences from any source (UniProtKB)
    - Specialist: sequences focusing on one or more specific conditions (e.g., biological pathway, disease, organism) (WormBase)
  - With different types of sequence content:
    - Primary sequences of proteins, with annotations and cross-references to each sequence (UniProtKB)
    - Motif or profile databases: contain information derived from the primary sequence, in the form of abstractions (patterns) that distil the most conserved features among related proteins (Pfam)
- UniProtKB
  - Collaboration between EBI, the Swiss Institute of Bioinformatics, and the Protein Information Resource (PIR)
  - Central repository of protein sequences and functional information
  - Quality annotations: information on protein function and individual amino acids, experimental information, biological ontologies, classification, links to other databases
  - Quality level of the annotation (manual vs. automatic)

UniProtKB
- Main components of the database
  - Reviewed protein entries (Swiss-Prot): high-quality manual annotations
    - Manual annotations -> reliable information
    - >570,000 protein records (2024)
  - Automatic protein entries (TrEMBL): automatic translation of protein sequences from the EMBL data bank
    - Automatic annotations -> lower quality, chance of errors
 ~250,000,000 protein records (2024) (400x info ammount) 3-Binf DB & Str. Pred -> protein seq. databases 25 UniProt KB Human readable explanation of the protein function Wealth of systematically organized information. In the illustrated example: Catalytic activity: with details of the enzymatic reaction and cross-links to chemical databases Activity regulation: competitive inhibitors Kinetics: experimental measurements towards n substrates Optimal pH Implication in biological pathways Catalytic and Key Residues (active/binding sites) Gene Ontology (GO) annotations (enrichment values) Enzyme/Pathways and Protein Family DBs Keywords 3-Binf DB & Str. Pred -> protein seq. databases 28 UniProt KB Unique accession numbers Serialized for sequence variants (later) 3-Binf DB & Str. Pred -> protein seq. databases 32 UniProt KB Describe the effect of mutations in the activity of the protein Mutations mapped on the protein sequence 3-Binf DB & Str. Pred -> protein seq. databases 34 UniProt KB Displays available tertiary structures (experimentally determined) for the protein. Links to AlphaFold predictions if available (cover later) Describes secondary structure content mapped to seq. Links to databases with 3D structure models 3-Binf DB & Str. Pred -> protein seq. databases 37 UniProt KB When multiple isoforms are avaliable due to alternative splicing the different sequences are available here, with serialized accession codes (i.e. P21397-1, P21397-2) 3-Binf DB & Str. Pred -> protein seq. databases 40 Summary of 1D predictions Different protein properties or characteristics can be predicted from its primary sequence: Secondary structure Solvent accessibility Solubility/expressability Transmembrane regions The methods that do such predictions improve if they consider evolutionary information Bioinformatics databases & Structure prediction 44 Introduction to sequence alignment Protein sequences can also be directly “compared” among them. 
Their similarities or differences can be assessed.
Alignments are models that aim to pair the most similar parts among different proteins.
If the model considers evolutionary information (and biologically relevant protein alignments do), evolutionary relationships (homology) can be inferred from sequence similarity.

A few words on evolution
Darwinian ideas on evolution: all species of organisms arise and develop through the natural selection of small, inherited variations that increase the individual's ability to compete, survive, and reproduce (biological fitness).
- Inter-individual differences need to be:
  - Small
  - Inheritable
- There exists a natural selective pressure.
- Variations that make an individual fitter (improve its functions) under the conditions of the selective pressure are more likely to be transmitted to the next generations.
- Accumulation of variation causes speciation.

A few words on molecular evolution
Improved function in a given environment (adaptation) is a key concept in evolution. How does this apply to proteins? How do proteins function?
- Molecular catalyst, molecular pore (figures: [gift box], [tube])
- Function is dictated by shape (3D structure).
- Structure is determined by sequence.
Sequence, Structure, Function Paradigm
- 3D structure is determined by the sequence
- Function is dictated by 3D structure

MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIMPHCA
GLGRLIACDLIGMGDSDKLDPSGPERYAYAEHRDYLDALWEALDLGDRVVLVV
HDWGSALGFDWARRHRERVQGIAYMEAIAMPIEWADFPEQDRDLFQAFRS
QAGEELVLQD
[Figure labels: sequence, structure, function]

A few words on molecular evolution
- Innovation happens at the sequence level
  - Mutations (small changes) are introduced in DNA (inheritable)
  - They are subsequently transcribed, processed, and translated into polypeptide chains (proteins)
- Selective pressure operates at the function level
  - Proteins working better in their environments make individuals fitter; such adaptation has occurred in the human lineage
Schaffner S. & Sabeti P. (2008) Evolutionary adaptation in human lineage. Nature Education 1:14.

A few words on molecular evolution
Homology: two proteins are homologous if they are the products of genes that evolved from the same ancestor.
[Figure panels relating sequence, structure and function: Diversity; Paralogs; Annotation problem]

Sequence alignments
Alignments are models that aim to pair the most similar parts among different proteins.
- Global alignments: consider similarity across the entire sequence
- Local alignments: consider similarity across sequence fragments
- Pairwise alignments: two sequences compared
- Multiple sequence alignments: multiple sequences compared

Sequence alignments
Pairwise alignment techniques:
- Dot-plot methods
- Dynamic programming algorithms
  - Needleman & Wunsch (global)
  - Smith & Waterman (local)
- Word methods
Multiple sequence alignment techniques:
- Dynamic programming
- Progressive methods
- Iterative methods

Sequence alignments
How can similarity among different parts of proteins be measured?
Assessing similarity in pairs of amino acids:
- Each possible pair of amino acids is given a substitution score (substitution matrix).
- Amino acids from the (two) sequences should be paired such that the total alignment score is optimized.
- Sometimes no good pairing can be found and a gap needs to be introduced.
- Gaps require a special penalty (negative score) in order to force longer and biologically meaningful alignments.

Sequence alignments
How can similarity among different parts of proteins be measured?
- Identity matrix (dot-matrix plots): 1 if same amino acid, 0 otherwise
  → Limited model: forces the introduction of too many gaps.
- Substitution models: the score depends on the probability of observing a substitution (mutation) of one particular amino acid for another (e.g., Arg → Lys should score better than Arg → Glu).
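
The scoring idea above (a substitution score per residue pair plus a negative score per gap) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table holds a handful of pairs with illustrative values, the linear gap penalty of -4 is an arbitrary choice, and the function names are hypothetical.

```python
# Toy substitution scores for a few residue pairs (R = Arg, K = Lys, E = Glu).
# The values are illustrative; a real matrix covers all 20 x 20 pairs.
SUBSTITUTION = {
    ("R", "R"): 5, ("K", "K"): 5, ("E", "E"): 5,
    ("R", "K"): 2, ("K", "E"): 1, ("R", "E"): 0,
}
GAP_PENALTY = -4  # assumed linear penalty per gap position


def pair_score(a, b):
    """Look up the score of a residue pair; the table is symmetric."""
    if (a, b) in SUBSTITUTION:
        return SUBSTITUTION[(a, b)]
    return SUBSTITUTION[(b, a)]


def alignment_score(row1, row2):
    """Sum the column scores of two equally long, already aligned rows ('-' = gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        total += GAP_PENALTY if "-" in (a, b) else pair_score(a, b)
    return total


score_arg_lys = alignment_score("RKE", "KKE")  # Arg paired with Lys
score_arg_glu = alignment_score("RKE", "EKE")  # Arg paired with Glu
score_gapped = alignment_score("RKE", "K-E")   # one gap column
```

With these toy values the Arg/Lys pairing scores higher than the Arg/Glu pairing, and the gapped alignment scores lowest, which is the behaviour the substitution model and gap penalty are meant to produce.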
Sequence alignments
Substitution models include evolutionary information.
Dayhoff Mutation Data Matrix
- The score is based on the concept of Point Accepted Mutation (PAM).
- Evolutionary distance: 1 PAM = time in which 1 in 100 amino acids is expected to mutate.
- Longer evolutionary times are inferred from a Markov chain model: PAM matrix product.
- PAM250 matrix: targets the limit where it is still safe to infer homology in proteins (twilight zone).
- Limitation: derived from 1572 observed mutations in (manual) alignments of sequences >85% identical.

Sequence alignments
Substitution models include evolutionary information: PAM250

Sequence alignments
Substitution models include evolutionary information.
BLOSUM matrices (BLOcks SUbstitution Matrix)
- Derived from blocks of aligned sequences in the BLOCKS database: implicitly represent distant relationships.
- Bias from identical sequences is removed by clustering at a sequence identity threshold.
- BLOSUM62 = matrix derived from sequences clustered at 62% or greater identity.

Sequence alignments
PAM                                          BLOSUM
Similar proteins compared as a whole         Conserved BLOCKS (fragments) compared
PAM1 corresponds to 1 differing residue      BLOSUM1 corresponds to 1% ID
  in 100 → 99% ID
Other PAM matrices extrapolated from PAM1    Each matrix based on observed alignments
Higher numbers, more evolutionary distance   Higher numbers, more similarity (less evolutionary distance)

Corresponding matrices:
PAM    BLOSUM
100    90
120    80
160    62
200    50
250    45

Sequence alignments
Dynamic programming algorithm
- Matrix: each dimension corresponds to one of the proteins to be aligned.
- Each cell contains the score value from the substitution model corresponding to the residue pair.
- Diagonal transitions represent aligned positions.
- Vertical and horizontal transitions represent gaps and are penalized.
- The final alignment corresponds to the path in the matrix that maximizes the score.

Sequence alignments
Dynamic programming algorithm
- Back-trace from bottom-right
- Global: Needleman & Wunsch. From the corner.
- Local: Smith & Waterman. From any position.
→ DETERMINISTIC
→ Computationally expensive

Sequence alignments
Word methods
- Short non-overlapping sequence stretches (k-tuples or words) are identified in the query sequence and matched in the target sequence(s).
- The relative positions of the matching region define an offset (subtraction).
- Multiple words matching with a similar offset define a region prone to alignment.
- Alignments are subsequently extended in alignment-prone regions.
→ HEURISTIC, optimal alignment not guaranteed.
→ Efficient for database searches. BLAST, FASTA.

Sequence alignments
Multiple sequence alignments
- Dynamic programming algorithm
- Progressive methods
  - First align the most similar pair
  - Subsequently add less similar sequences
  - Sensitive to similarity inaccuracy (e.g., due to differences in sequence length)
  - CLUSTAL
  - Additional information considered: T-Coffee (slow)
- Iterative methods
  - Initial global alignment
  - Objective function (based on score) to optimise the similarity assessment. Choose the best.
  - All possible remaining sequence subsets are re-aligned and re-scored
  - The best subset is included in the alignment at each iteration
  - Typically slower, more accurate
  - MUSCLE, MAFFT
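
As an illustration of the dynamic programming procedure for pairwise alignment described above (fill the matrix, then back-trace from the bottom-right corner), here is a minimal Python sketch of a global alignment in the style of Needleman & Wunsch. It is a simplification under stated assumptions: a match/mismatch score stands in for a real substitution matrix, the gap penalty is linear, and the parameter values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,   # diagonal: residues paired
                score[i - 1][j] + gap,     # vertical: gap in b
                score[i][j - 1] + gap,     # horizontal: gap in a
            )
    # Back-trace from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
```

The nested loops visit every cell once, which is why the method is deterministic but computationally expensive for long sequences or database searches. A local (Smith & Waterman style) variant would additionally clamp cell scores at zero and start the back-trace from the highest-scoring cell; that variant is not shown here.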
Sequence alignments
Beyond pure sequences: patterns and models
Aligned sequences can be used to define patterns, which can then be used to perform searches in databases.
- Position-specific scoring matrices
- Hidden Markov models

Secondary structure prediction
- prediction of the conformational state of each amino acid (AA) residue of a protein sequence as one of the possible states:
  - helix (H)
  - strand (S)
  - coil (C)

Secondary structure prediction programs
- PSI-PRED
  - http://bioinf.cs.ucl.ac.uk/psipred/
  - combination of PSI-BLAST profiles and neural networks
  - careful selection of sequences used for profile construction

Solvent accessibility prediction
- prediction of the extent to which a residue embedded in a protein structure is accessible to solvent
- comparison of accessibility of different amino acids: relative values (actual area as a percentage of the maximally accessible area)
- simplified two-state description: buried vs. exposed residues

Solubility and expressability prediction
- Complicated definition of the property
  - Prediction of the extent to which a given sequence will produce a soluble protein in a given expression system, or
  - Prediction of aggregation propensity
- Methods rely heavily on machine learning.

Transmembrane region prediction
- transmembrane (TM) proteins: a challenge for experimental determination of 3D structure → structure prediction is needed even more than for globular water-soluble proteins
- two major classes of integral membrane proteins
  - transmembrane helices (TMH)
  - transmembrane beta-strand barrels (TMB)
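
The relative accessibility and two-state description mentioned under solvent accessibility prediction can be made concrete with a small Python sketch. The numbers below are hypothetical: the maximal area used in the example and the 25% threshold separating buried from exposed residues are assumptions chosen for illustration only.

```python
def relative_accessibility(area, max_area):
    """Actual accessible area as a percentage of the maximally accessible area."""
    if max_area <= 0:
        raise ValueError("max_area must be positive")
    return 100.0 * area / max_area


def two_state(rel_acc, threshold=25.0):
    """Simplified two-state description; the threshold is an assumed cutoff."""
    return "exposed" if rel_acc >= threshold else "buried"


# Hypothetical residue: 30 square angstroms accessible out of a maximum of 120.
rel = relative_accessibility(30.0, 120.0)
state = two_state(rel)
```

Working with relative rather than absolute areas is what makes residues of different sizes comparable.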
Transmembrane region prediction
- prediction of TMH is simplified by strong environmental constraints: the lipid bilayer of the membrane
  - TMHs are predominantly apolar (hydrophobicity) and 12-35 residues long
  - specific distribution of Arg and Lys (positively charged) → connecting loop regions on the inside of the membrane have more positive charges than loop regions on the outside = positive-inside rule

Transmembrane region prediction
- prediction of TMB
  - transmembrane beta-strands contain 10-25 residues
  - only every second residue faces the lipid bilayer and is hydrophobic; the other residues face the pore of the β-barrel and are more hydrophilic → analysis of hydrophobicity is NOT useful for TMB prediction

Structural databases

Data formats
- different formats are used to represent primary macromolecular 3D structure data
  - PDB
  - mmCIF
  - PDBML
  - ...
- the spatial 3D coordinates of each atom are recorded

PDB format
- designed in the early 1970s: first entries of the PDB database
- rigid structure of 80 characters per line, including spaces
- still the most widely supported format

PDB format
- atomic coordinates
- chemical and biological features
- experimental details of the structure determination
- structural features
  - secondary structure assignments
  - hydrogen bonding
  - biological assemblies (REMARK 350)
  - active sites
  - ...

PDB format
- advantages
  - widely used → supported by the majority of tools
  - easy to read and easy to use → suitable for accessing individual entries
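
The rigid, fixed-column layout of the PDB format can be illustrated by reading one ATOM record by character position. The following Python sketch assumes the standard ATOM column layout; the sample record is made up for illustration and is assembled from its fields so that the column widths stay visible. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record (columns are 1-based in the format)."""
    if line[0:6] not in ("ATOM  ", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain=line[21],                 # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
        occupancy=float(line[54:60]),   # columns 55-60
        b_factor=float(line[60:66]),    # columns 61-66
        element=line[76:78].strip(),    # columns 77-78
    )


# Made-up record, written field by field (adjacent literals are concatenated).
SAMPLE = (
    "ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" " " "   "
    "  27.340" "  24.430" "   2.614" "  1.00" "  9.67" "          " " N"
)
atom = parse_atom_line(SAMPLE)
```

Because every field is located purely by position, a single shifted character breaks the parse, which is one practical consequence of the format's rigidity.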
PDB format
- disadvantages
  - inconsistency between individual PDB entries as well as between PDB records within one entry (e.g., different residue numbering in SEQRES and ATOM sections) → not suitable for computer extraction of information
  - absolute limits on the size of certain items of data, e.g., the maximum number of atom records is 99,999 and the maximum number of chains is 26 → large systems such as the ribosomal subunit must be divided into multiple PDB files → not suitable for analysis and comparison of experimental and structure data across the entire database

mmCIF format
- macromolecular Crystallographic Information File (mmCIF)
- developed to handle increasingly complicated structure data
- each field of information is explicitly assigned by a tag and linked to other fields through a special syntax

mmCIF format
- advantages
  - easily parsable by computer software
  - consistency of data across the database
- disadvantages
  - difficult to read
  - rarely supported by visualization and computational tools
→ suitable for analysis and comparison of experimental and structure data across the entire database
→ not suitable for accessing individual entries

Structural databases
- Primary
  - wwPDB: 3D structures of biopolymers
  - BMRB: Nuclear Magnetic Resonance specific
  - EMDB: Electron Microscopy specific
  - NDB: 3D structures of nucleic acids: http://ndbserver.rutgers.edu/
  - CSD: 3D structures of small molecules (commercial): http://www.ccdc.cam.ac.uk/products/csd/
- Other sources
  - PDBsum, SCOP, Proteopedia, Structural Biology KnowledgeBase
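
To illustrate the tag-based layout of mmCIF described above, the following Python sketch reads simple "tag value" items into a nested dictionary. It is deliberately incomplete: it ignores `loop_` tables and multi-line values, which real mmCIF files use heavily, so a dedicated parser should be preferred in practice. The data block name and the values are made up, and the function name is hypothetical.

```python
import shlex

EXAMPLE = """\
data_EXAMPLE
_entry.id          EXAMPLE
_exptl.method      'X-RAY DIFFRACTION'
_cell.length_a     61.80
"""


def parse_simple_items(text):
    """Collect single-line '_category.item value' pairs; loops are skipped."""
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue                      # data_ header, comments, loops, blanks
        parts = shlex.split(line)         # handles quoted values with spaces
        if len(parts) != 2:
            continue                      # not a simple tag/value pair
        tag, value = parts
        category, _, item = tag[1:].partition(".")
        items.setdefault(category, {})[item] = value
    return items


record = parse_simple_items(EXAMPLE)
```

Each value is addressed by its category and item name rather than by character position, which is what makes the format easy for software to parse even though it is harder to read by eye.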
wwPDB
- joint initiative of four organizations
  - Research Collaboratory for Structural Bioinformatics (RCSB PDB)
  - Protein Data Bank in Europe (PDBe)
  - Protein Data Bank Japan (PDBj)
  - Biological Magnetic Resonance Data Bank (BMRB)

wwPDB
- worldwide Protein Data Bank (wwPDB)
- http://www.wwpdb.org/
- central repository of experimental macromolecular structures
- more than 225,000 structures (October 2024), updated every week
- mostly protein structures (87%), structures of protein/nucleic acid or protein/oligosaccharide complexes (11%), and nucleic acid structures (2%)
- majority of structures from X-ray crystallography (84%); the rest from EM (10%) or NMR (6%)
- deposition of the structure into wwPDB is a requirement for its publication

wwPDB: data deposition
- all data can be deposited at the RCSB PDB, PDBe or PDBj site
- same requirements on the content and format of the final files:
  - structures of biopolymers
  - structures determined by experimental techniques
  - structures containing the required information
- same validation methods → uniformity of the final archive
- PDB-ID
  - assigned to each deposition
  - unique identifier of each structure
  - four-character code

wwPDB: data access
- access to the PDB archive is free and publicly available from the RCSB PDB, PDBe or PDBj site
- FTP
  - the RCSB PDB, PDBe and PDBj sites distribute the same PDB archive
  - updated weekly
- web sites
  - each wwPDB site provides its own services and resources → different views and analyses of the structural data
  - sequence-based and text-based queries

Structural quality assurance
Important truths about structures
- all structures are just models devised to satisfy experimental data → random and systematic errors
- individual structures differ in quality
- most structures are reasonably accurate, containing "only" random errors, but some structures are seriously incorrect
- structures should be carefully selected and critically assessed before being used for a specific purpose → quality checks of structures

Errors in deposited structures
- systematic errors
- random errors

Systematic errors
- relate to the accuracy of the model: how well it corresponds to the "true" structure of the molecule in question
- often include errors of interpretation
  - low quality of the electron density map → difficult to find the correct tracing of the molecule(s) through it → mistracing and "frame-shift" errors
  - spectral interpretations (assignment of individual NMR signals to individual atoms)
- may lead to a completely wrong final structure

Examples of systematic errors
- completely wrong structures
  - trace of the protein chain following the wrong path through the electron density → completely incorrect fold
  - Incorrect model (1PHY), corrected model (2PHY)

Examples of systematic errors
- wrong connectivity between secondary structure elements
  - incorrect order of secondary structure elements → many of the protein's residues in the wrong place in the 3D structure
  - Incorrect model (1PTE), corrected model (3PTE)
Examples of systematic errors
- frame-shift errors
  - occur where a residue is fitted into the electron density that belongs to the next residue; the error persists until a compensating error is made (two residues are fitted into the density of a single residue)
  - occur almost exclusively at very low resolution (> 3.0 Å), often in loop regions
- fitting of incorrect main chain or side chain conformations into the density
  - usually the least serious, but can still affect biological interpretations

Random errors
- depend on how precisely a given measurement can be made
- all measurements contain errors at some degree of precision → uncertainties in atomic positions
- less serious than systematic errors
- if a structure is essentially correct, the sizes of the random errors determine how precise the structure is

Examples of random errors
- uncertainties in atomic positions
  - typically in the range of 0.01-1.27 Å, median 0.28 Å

Examples of random errors
- side chain flips
  - His/Asn/Gln: symmetrical in terms of shape → fit the electron density equally well when rotated by 180°
  - it is difficult to distinguish the N and O atoms of the side-chain amide from X-ray data

Rules of thumb for selecting structures
- X-ray structures
  - reasonably accurate structure: resolution ≤ 2.0 Å and R-factor ≤ 0.2
  - selection criteria always depend on the type of analysis required (e.g., comparison of folds: 3.0 Å resolution is sufficient, vs. analysis of side chain torsional conformers: resolution ≤ 1.2 Å is required)
  - the R-factor can easily be fooled → a better indicator of model reliability is Rfree, calculated in the same way as the R-factor but using only a small fraction of the experimental data; Rfree should be ≤ 0.4
  - local errors are indicated by residue B-factors > 50, but quality checks should always be performed to assess possible local problems in a structure

Rules of thumb for selecting structures
- NMR structures
  - no simple rule of thumb as in the case of X-ray structures
  - information on structure quality can be found in the original paper or obtained by quality checks
  - ResProx (http://www.resprox.ca/): predicts the atomic resolution of NMR protein structures using machine learning
  - DRESS (http://www.cmbi.ru.nl/dress/) and RECOORD (http://www.ebi.ac.uk/pdbe-apps/nmr/recoord/main.html) web servers: provide improved versions of old NMR models (obtained by re-refinement of the original experimental data using more up-to-date force fields and refinement protocols)

Quality checks of structures
- checks of structure geometry, stereochemistry and other structural properties
- tests of normality
  - comparison of a given protein or nucleic acid structure against what is already known about these molecules
  - knowledge comes from high-resolution structures of small molecules and systematic analyses of existing protein and nucleic acid structures
- not all outliers from the norm are errors (e.g., an unusual torsion angle of a single residue); however, a structure exhibiting a large number of outliers and oddities is probably problematic
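
The X-ray rules of thumb given above (resolution ≤ 2.0 Å, R-factor ≤ 0.2, Rfree ≤ 0.4) can be written as a simple filter. This Python sketch uses hypothetical entries with made-up identifiers and values; the thresholds are parameters because, as noted above, suitable criteria depend on the intended analysis.

```python
from dataclasses import dataclass


@dataclass
class XrayEntry:
    label: str          # made-up identifier, not a real PDB-ID
    resolution: float   # in angstroms; lower is better
    r_factor: float
    r_free: float


def passes_rules_of_thumb(entry, max_resolution=2.0, max_r=0.2, max_r_free=0.4):
    """Apply the three global thresholds; local quality checks are still needed."""
    return (
        entry.resolution <= max_resolution
        and entry.r_factor <= max_r
        and entry.r_free <= max_r_free
    )


candidates = [
    XrayEntry("entry_a", resolution=1.6, r_factor=0.18, r_free=0.22),
    XrayEntry("entry_b", resolution=2.8, r_factor=0.19, r_free=0.27),
    XrayEntry("entry_c", resolution=1.9, r_factor=0.24, r_free=0.31),
]
selected = [e.label for e in candidates if passes_rules_of_thumb(e)]

# A fold comparison could relax the resolution limit to 3.0 angstroms.
selected_for_folds = [
    e.label for e in candidates if passes_rules_of_thumb(e, max_resolution=3.0)
]
```

Such a filter only screens global statistics; it says nothing about local problems, which is why the quality checks described above remain necessary.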
Validation of protein structures
- Ramachandran plot
  - check of the stereochemical quality of protein structures
  - plot of the Ψ versus the Φ main chain torsion angles for every amino acid residue in the protein (except the two terminal residues)
  - favorable and "disallowed" regions of the plot are determined from analyses of existing structures
  - typical protein structures: residues tightly clustered in the most favored regions, with few or no residues in the "disallowed" regions
  - poorly defined protein structures: residues more dispersed, with many of them lying in the "disallowed" regions of the Ramachandran plot
  [Figure: Ramachandran plots (Ψ versus Φ) of a typical protein structure and a poorly defined protein structure]

Validation of protein structures
- bad and unfavorable atom-atom contacts
  - "simple" count of bad contacts, e.g., two nonbonded atoms with a center-to-center distance < sum of their van der Waals radii
  - evaluation of the environment of individual atoms or residue fragments with respect to the environments found in high-resolution crystal structures

Validation of protein structures
- other parameters
  - counts of unsatisfied hydrogen bond donors
  - hydrogen bonding energies
  - knowledge-based potentials assessing how "happy" each residue is in its local environment: many unhappy residues → "sad" overall structure
  - real-space R-factor expressing how well each residue fits its electron density; can also be expressed as a real-space correlation coefficient
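
The "simple" count of bad contacts mentioned above (two nonbonded atoms closer than the sum of their van der Waals radii) can be sketched in Python as follows. The radii are commonly quoted approximate values, the coordinates are made up, and the caller is assumed to supply which atom pairs are covalently bonded; validation programs typically apply more refined criteria than this.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in angstroms (assumed values for illustration).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def bad_contacts(atoms, bonded=frozenset()):
    """Return (i, j, distance) for nonbonded pairs closer than the sum of radii.

    atoms:  list of (element, (x, y, z)) tuples
    bonded: set of frozenset({i, j}) index pairs that are covalently bonded
    """
    clashes = []
    for (i, (el_i, xyz_i)), (j, (el_j, xyz_j)) in combinations(enumerate(atoms), 2):
        if frozenset((i, j)) in bonded:
            continue                        # bonded atoms are expected to be close
        distance = math.dist(xyz_i, xyz_j)
        if distance < VDW_RADII[el_i] + VDW_RADII[el_j]:
            clashes.append((i, j, distance))
    return clashes


# Made-up coordinates: atoms 0 and 1 are bonded; atom 3 sits too close to atom 0.
atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("N", (1.3, 0.0, 0.0)),
    ("O", (5.0, 0.0, 0.0)),
    ("O", (-2.5, 0.0, 0.0)),
]
clashes = bad_contacts(atoms, bonded={frozenset((0, 1))})
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a sketch but would need a spatial grid or neighbor search for whole structures.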
Quality information on the web
- several databases provide pre-computed quality criteria for all wwPDB structures
  - EDS
  - PDBsum
  - PDBREPORT
  - RCSB PDB

Quality information on the web
- Electron Density Server (EDS)
  - http://eds.bmc.uu.se/eds/, also available via the PDBe site
  - information about the local quality of the structure for all structures from wwPDB with deposited experimental data
    - plot of real-space R-factor (RSR): how well each residue fits its electron density
    - plot of Z-score: large positive spike → the residue has a considerably worse RSR than the average residue of the same type in structures determined at similar resolution
    - Ramachandran plot
    - ...

Quality information on the web
- PDBsum
  - http://www.ebi.ac.uk/pdbsum/
  - provides numerous structural analyses of all wwPDB structures, including full PROCHECK output (for all protein-containing entries)
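
The Z-score plot described for EDS compares the RSR of each residue with the average for residues of the same type at similar resolution. A minimal Python sketch of that comparison is given below; the reference means and standard deviations, the residue values, and the cutoff of 2.0 are all made up for illustration, and the function names are hypothetical.

```python
def rsr_z_score(rsr, mean_rsr, sd_rsr):
    """Number of standard deviations by which a residue's RSR exceeds the reference mean."""
    return (rsr - mean_rsr) / sd_rsr


def flag_poorly_fitting(residues, reference, cutoff=2.0):
    """Return labels of residues whose Z-score exceeds the (assumed) cutoff.

    residues:  list of (label, residue_type, rsr)
    reference: {residue_type: (mean_rsr, sd_rsr)} for a similar resolution range
    """
    flagged = []
    for label, res_type, rsr in residues:
        mean_rsr, sd_rsr = reference[res_type]
        if rsr_z_score(rsr, mean_rsr, sd_rsr) > cutoff:
            flagged.append(label)
    return flagged


# Hypothetical reference statistics and residue values.
reference = {"LEU": (0.10, 0.04), "LYS": (0.15, 0.05)}
residues = [("LEU 12", "LEU", 0.11), ("LYS 45", "LYS", 0.32)]
flagged = flag_poorly_fitting(residues, reference)
```

Normalizing by residue type matters because long, flexible side chains tend to fit their density less well than small ones even in good structures.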
