Introduction to the Structure of Macromolecules PDF
Summary
This document provides an introduction to the structure of macromolecules, covering visualization techniques, energetics, and different types of molecular interactions. It details methods for structure determination. It includes information on X-ray crystallography, NMR spectroscopy, and electron microscopy, detailing the advantages and disadvantages of each technique.
Introduction to the structure of macromolecules

Course information
❑ Examination
▪ Written exam, multiple choice, 25 questions, 25 points
▪ Grading: A: 25-22, B: 21-19, C: 18-16, D: 15-13, E: 12-10, F (fail): < 10
❑ 3 exam dates; you can attend them all
▪ 10/17 Dec. 2024 (to be voted)
▪ Jan. 2025
▪ Feb. 2025
❑ Slides with essential information are marked with a dedicated sign

Structure visualization
❑ Bonds-based representation (wire, stick, ball and stick)
▪ Fast, not resource-demanding
▪ Suitable for detailed analysis
▪ Gives an incorrect impression of atom packing (empty space) and interatomic distances
❑ Hydrogen atoms are often omitted for simplicity
❑ Backbone-based representation (ribbon, cartoon; shows helices, strands, loops, etc.)
▪ Moderately fast, not very resource-demanding
▪ Suitable for investigating secondary structure and protein folds
▪ Shows the main landmarks; good for overall orientation in the structure
❑ Surface-based representation (CPK/spheres, surface)
▪ Very slow, very resource-demanding
▪ Suitable for studying shapes, volume, cavities and molecular contacts

Energetics of structures
❑ Energy
▪ Internal energy U (constant V); enthalpy H (constant P), …
▪ Total energy is often inaccessible -> work with differences in energy
▪ Convention: negative energy is favorable, positive is unfavorable
▪ Potential energy Ep: interactions of atoms in a system
▪ Kinetic energy Ek: movement of atoms
$$U = E_p + E_k, \qquad H = U + PV$$
❑ Entropy
▪ Related to the thermal disorder or conformational availability (degrees of freedom)
▪ Total entropy S > 0
▪ Higher entropy is more favorable
❑ Free energy
▪ Helmholtz A or F (constant V), Gibbs G (constant P)
▪ Combination of internal energy or enthalpy with entropy S (T = temperature)
$$A = U - TS, \qquad G = H - TS \;\rightarrow\; \Delta G = \Delta H - T\,\Delta S$$
▪ A negative change of free energy (ΔG < 0) is favorable
▪ Lower H (H↓) and higher S (S↑) are favorable; higher H (H↑) and lower S (S↓) are unfavorable

Energy landscape
❑ Relationship between a structure and its potential energy
▪ Structure dictates potential energy: how strong the individual interactions are
▪ Potential energy reflects the probability of finding the different structures: lower energy ➔ more frequent occurrence
❑ Potential/free energy surface
▪ Minima: stable structures (local minima and the global minimum)
▪ Saddle points: transient structures (transition states)
▪ Maxima: unstable structures (local maxima and the global maximum)
▪ Energy barriers separate the states (State 1, Intermediate, State 2)
▪ Multidimensional surface

Molecular interactions
❑ Covalent interactions (chemical bonds)
▪ Between two atoms sharing electrons
▪ Very stable under standard conditions
❑ Non-covalent interactions
▪ Much weaker than covalent bonds
▪ Electrostatic interactions
▪ Polar interactions
▪ Non-polar interactions

Electrostatic interactions
❑ Charge-charge or ionic interactions
▪ Coulomb's law: acts between any two charges
▪ Attractive (opposite signs) or repulsive (same sign)
▪ Long-range interactions (up to 10 Å): the force decreases with the square of the distance
$$F = \frac{q_1 q_2}{4 \pi \varepsilon r^2}$$
▪ r = distance, ε = permittivity
❑ Environment-dependent
▪ Permittivity ε = ε0·εr, where ε0 = vacuum permittivity
▪ Relative permittivity (εr) = dielectric constant
▪ Non-polar environment: stronger force; highly polar environment: weaker force
▪ Salt concentration: presence of counter-ions (Na+, K+, Cl-, etc.)
▪ pH: may induce a change of charge
[Figure: charge state at low pH, high pH and very high pH (>10)]

Polar interactions
❑ Hydrogen bonds (H-bonds)
▪ Only between highly electronegative atoms: fluorine, oxygen, nitrogen (F, O, N)
▪ Donor (+) and acceptor (-) atoms share a hydrogen
▪ H-bond distance: 2.8 - 3.4 Å
❑ Aromatic (π-π) interactions
▪ Attractive interaction between aromatic rings
▪ Distance between the centers of mass of the rings: ~ 5 Å
▪ Geometries: parallel displaced, T-shaped, sandwich
❑ Van der Waals (vdW) interactions
▪ Between any two atoms
▪ Permanent dipole-dipole (in polar molecules)

Non-polar interactions
❑ Van der Waals (vdW) interactions
▪ Between any two atoms
▪ London dispersion forces, or temporary dipole-induced dipole (in non-polar molecules)
▪ Short-range interactions: up to 5 Å
[Figure: vdW interaction as a function of the distance r; R1, R2 are the van der Waals radii]
❑ Hydrophobic interactions
▪ Entropic origin: water molecules ordered around a hydrophobic moiety -> unfavorable
▪ Hydrophobic packing -> favorable release of some ordered water molecules
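As a rough illustration of Coulomb's law and its dependence on the environment, the sketch below evaluates the force between two unit charges in media of different relative permittivity. The function name `coulomb_force` and the example values of εr (4 for a non-polar protein interior, 80 for water) are assumptions chosen for illustration, not values taken from the slides.

```python
import math

EPSILON_0 = 8.8541878128e-12  # vacuum permittivity, F/m
E_CHARGE = 1.602176634e-19    # elementary charge, C
ANGSTROM = 1e-10              # metres per angstrom


def coulomb_force(q1, q2, r_angstrom, eps_r=1.0):
    """Coulomb force in newtons between two point charges.

    q1, q2     -- charges in units of the elementary charge
    r_angstrom -- distance in angstroms
    eps_r      -- relative permittivity (dielectric constant) of the medium
    A negative value means attraction, a positive value repulsion.
    """
    r = r_angstrom * ANGSTROM
    eps = EPSILON_0 * eps_r  # permittivity = eps_0 * eps_r
    return (q1 * E_CHARGE) * (q2 * E_CHARGE) / (4 * math.pi * eps * r ** 2)


if __name__ == "__main__":
    # Assumed, illustrative dielectric constants.
    environments = [("vacuum", 1.0), ("non-polar interior", 4.0), ("water", 80.0)]
    for name, eps_r in environments:
        force = coulomb_force(+1, -1, 5.0, eps_r)
        print(f"{name:20s} eps_r = {eps_r:5.1f}  F = {force:.3e} N")
```

The same pair of charges feels a force that is weaker by exactly the factor εr in the polar medium, which is the "non-polar: stronger force, highly polar: weaker force" statement in numerical form.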

Structure determination
❑ Established methods
▪ X-ray crystallography
▪ NMR spectroscopy
▪ Electron microscopy
▪ Bioinformatics predictions (theoretical)

Parameters of an X-ray structure
❑ Resolution
▪ Measure of the level of detail present in the diffraction pattern
▪ 3 Å (bad), 2 Å (acceptable), 1 Å (exceptional)
❑ R-factor (residual factor; R-value)
▪ Measure of model quality, i.e. the agreement between the crystallographic model and the diffraction data
▪ Varies from 0 (ideal) to 0.63 (random structure), typically about 0.2
❑ B-factors (thermal factors)
▪ Measure of how much an atom oscillates or vibrates around the position specified in the model
▪ Considered a measure of flexibility

X-ray crystallography
❑ Advantages
▪ No limitations in size
▪ Possibility to obtain atomic resolution
❑ Disadvantages
▪ Requires a crystal
▪ Structure in a crystalline state (non-native)
▪ Static picture of the macromolecule
▪ Positions of hydrogen atoms are usually not detected

Parameters of an NMR structure
❑ RMSD
▪ Root-mean-square deviation of atomic positions across the ensemble of solutions
▪ Reveals the mean differences between individual conformations
▪ Important parameter for comparing different structures of the same molecule
$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\delta_i^{2}}$$
▪ δ = atom displacement, N = total number of atoms

NMR spectroscopy
❑ Advantages
▪ Structure in solution state (native)
▪ Possibility to investigate the dynamics of macromolecules
▪ Positions of hydrogen atoms are detected
❑ Disadvantages
▪ Size limited to approximately 40 kDa (~ 400 amino acid proteins)
▪ Requires an isotopically labeled sample

Electron microscopy
❑ Advantages
▪ Applicable to extremely large systems
▪ Complements other methods, e.g. X-ray, NMR
❑ Disadvantages
▪ Lower resolution (2-3 Å at best)

Bioinformatics predictions
❑ Homology modeling
▪ Comparison of sequences in databases: multiple sequence alignment (MSA)
❑ Machine learning
❑ Ab initio prediction
[Diagram: comparative modelling (amino acid sequence → find similar proteins in 3D structure databases by homology search, profile method, threading or machine learning → predicted structure) versus ab initio prediction (amino acid sequence → energy minimization, molecular dynamics → predicted structure)]
❑ Advantages
▪ Very fast (except ab initio)
▪ Low cost
❑ Disadvantages
▪ Ab initio is very demanding
▪ Theoretical model: experimental validation is needed

Structure of biomolecules

Outline
❑ Proteins
▪ Primary structure
▪ Secondary structure
▪ Tertiary structure
▪ Motifs and folds
▪ Quaternary structure
❑ Nucleic acids
▪ Main types of structures
❑ Primary structural databases
❑ Structural data formats
❑ PDB and mmCIF formats

Protein structure
❑ Hierarchy of protein structure

Amino acids
❑ 20 L-amino acids (natural)
❑ Side chains
▪ Charged, polar, hydrophobic
[Figure: amino acid backbone with amino group, acid group, chiral centre and side chain]

Primary structure
❑ Linear chain of amino acid residues
MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIM
❑ Protein backbone
▪ From N-terminus to C-terminus
▪ Connected by covalent bonds
❑ Peptide bond (amide bond)
▪ Formed by condensation (-H2O)
▪ Partial double bond character → planar geometry

Geometry of protein backbone
❑ Conformation of the peptide chain
▪ Defined by the Φ (phi) and Ψ (psi) dihedral angles
▪ φ (phi) = dihedral angle {C-1 − N − Cα − C}
▪ ψ (psi) = dihedral angle {N − Cα − C − N+1}
❑ Ramachandran plot (Φ, Ψ), both axes from -180 to 180 → the majority of proteins follow this distribution

Secondary structure
❑ Local three-dimensional structure of the polypeptide chain
❑ Governed by hydrogen bonding between backbone atoms
❑ Types of structures
▪ Helices (regular pattern)
▪ β-structures (regular pattern)
▪ Loops and coils (irregular patterns)

Helices
❑ Types of helices
▪ 3.6₁₃ helix (α-helix): most common
▪ 3₁₀ helix: less frequent, at the ends of α-helices
▪ 4.1₁₆ helix (π-helix): rare
▪ Left-handed helix: very rare
→ Represented by helical cartoons or cylinders
❑ Right-handed (mostly)
❑ Hydrogen bonding
▪ Within a single chain
[Figure: Ramachandran regions of the α-helix, 3₁₀ helix, π-helix and left-handed α-helix; H-bond patterns of the individual helix types]

β-structures
❑ Types of typical β-structures
▪ β-sheets
▪ β-turns
▪ β-bulge
▪ Polyproline helices
❑ Hydrogen bonding
▪ Between adjacent chains
❑ Types of β-sheets
▪ Parallel
▪ Antiparallel (stronger)
▪ Mixed
→ Represented by ribbons with arrows indicating the sequence direction
❑ Side chains
▪ Point towards the sides of the sheets
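The φ and ψ angles defined in the backbone-geometry slides are ordinary dihedral angles over four atoms, so they can be computed from atomic coordinates. The following is a minimal sketch using only the standard library; the helper names (`dihedral`, `phi`, `psi`) are hypothetical, and the coordinates in the usage example are made-up points chosen to give a simple angle, not real protein coordinates.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees (-180..180) for four points in 3D."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal of the plane through p2, p3, p4
    b2_len = math.sqrt(_dot(b2, b2))
    b2_unit = tuple(x / b2_len for x in b2)
    x = _dot(n1, n2)
    y = _dot(b2_unit, _cross(n1, n2))
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """phi = dihedral {C(i-1), N, CA, C}."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """psi = dihedral {N, CA, C, N(i+1)}."""
    return dihedral(n, ca, c, n_next)


if __name__ == "__main__":
    # Made-up coordinates: the fourth point is rotated by a quarter turn
    # about the central bond relative to the first point.
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing (φ, ψ) for every residue of a structure and plotting the pairs gives the Ramachandran plot.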

Tertiary structure
❑ Global three-dimensional structure of a protein
❑ Governed mainly by hydrophobic interactions involving the side chains of amino acid residues
❑ Supersecondary structures (motifs)
▪ Small substructures formed by several secondary structures
❑ Domain
▪ Structurally (functionally) independent regions
▪ Compact parts of the structure, each around a single hydrophobic core
▪ Formed as separate folding units (fold independently)
❑ Fold
▪ General architecture of the protein
▪ Type of protein structure

Quaternary structure
❑ Association of several protein chains (monomers/subunits) into oligomers (multimers)
▪ Homomeric protein: from identical monomers
▪ Heteromeric protein: from different types of monomers
[Figure: examples of a homotetramer, a heterodimer and a heterotetramer; proteins shown: hemoglobin, tryptophan synthase, immunoglobulin]

Nucleotides
❑ Composition
▪ Phosphate (carries the charge)
▪ Pentose sugar
▪ Heterocyclic (nitrogenous) base
❑ DNA bases: A, T; G, C
❑ RNA bases: A, U; G, C
❑ Rotation about the glycosidic bond
▪ The anti conformation is dominant in DNA, with rare exceptions

Primary structure
❑ Linear chain of nucleotides (oligonucleotides or polynucleotides)
CGCGAATTCGCG
❑ Sugar-phosphate backbone
▪ Covalent character
▪ Phosphodiester bond
▪ From 5'-end to 3'-end
▪ Example: oligonucleotide dGCAT (d indicates deoxyribose sugar, i.e. a DNA sequence)

Sugar-phosphate backbone
❑ Very flexible backbone
▪ Six torsion angles
❑ Ribose is not planar → sugar puckering
▪ Determines the phosphate-phosphate proximity
▪ Two main types of conformation
[Figure: 2′-deoxyribose (in DNA)]

Secondary structure
❑ Local interactions between nucleotide bases → base pairs
❑ DNA base pairs: Adenine - Thymine, Cytosine - Guanine
❑ RNA base pairs: Adenine - Uracil, Cytosine - Guanine
❑ Complementarity due to hydrogen bonds

Tertiary structure of DNA
❑ Overall three-dimensional arrangement and folding
❑ Three types: A-DNA (rare), B-DNA (predominant), Z-DNA (rarer)
❑ B-DNA is the most common (described by Watson & Crick)

Type                              A-DNA      B-DNA      Z-DNA
Helix sense                       Right      Right      Left
Bases per turn                    11         10.5       12
Helical rise per nucleotide (Å)   2.6        3.4        3.7
Sugar pucker                      C3'-endo   C2'-endo   C2'-endo / C3'-endo

❑ Grooves: crucial for DNA-protein interactions
❑ Major groove: wide and deep, where most proteins interact
❑ Minor groove: narrower and shallower
[Figure: sugar-phosphate backbone with major groove (22 Å) and minor groove (12 Å); one 360° turn ≈ 10 base pairs = 34 Å]

Structural data formats
❑ Different file formats are used to represent 3D structure data
▪ PDB
▪ mmCIF
▪ PDBML
▪ MOL2
▪ ...
❑ The spatial 3D coordinates and other information are recorded for each atom

PDB format
❑ Designed in the early 1970s for the first entries of the PDB database
❑ Rigid structure of 80 characters per line, including spaces
❑ Still the most widely supported format
❑ Atomic coordinates
❑ Chemical and biological features
❑ Experimental details of the structure determination
❑ Structural features
▪ Secondary structure assignments
▪ Hydrogen bonding
▪ Biological assemblies
▪ Active sites
▪ ...
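Because the PDB format has a rigid fixed-column layout, an atom record is read by slicing character positions rather than by splitting on whitespace. The sketch below assumes the column positions of the ATOM record in the wwPDB format description (version 3.3); the function name `parse_atom_record` and the sample values are illustrative, and the sample line is assembled field by field so that the column widths are explicit.

```python
# One ATOM record assembled from its fixed-width fields (78 columns used here).
SAMPLE = (
    "ATOM  "      # columns  1-6   record name
    + "    1"     # columns  7-11  atom serial number
    + " "
    + " N  "      # columns 13-16  atom name
    + " "         # column  17     alternate location indicator
    + "MET"       # columns 18-20  residue name
    + " "
    + "A"         # column  22     chain identifier
    + "   1"      # columns 23-26  residue sequence number
    + " "         # column  27     insertion code
    + "   "
    + "  27.340"  # columns 31-38  x coordinate (Å)
    + "  24.430"  # columns 39-46  y coordinate (Å)
    + "   2.614"  # columns 47-54  z coordinate (Å)
    + "  1.00"    # columns 55-60  occupancy
    + "  9.67"    # columns 61-66  B-factor (temperature factor)
    + " " * 10
    + " N"        # columns 77-78  element symbol
)


def parse_atom_record(line):
    """Parse one ATOM/HETATM line into a dictionary by column position."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


if __name__ == "__main__":
    print(parse_atom_record(SAMPLE))
```

The five-character serial field and the single-character chain field are where the size limits of the format come from.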
❑ Format documentation
▪ https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
▪ https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html
❑ Advantages
▪ Widely used → supported by the majority of tools
▪ Easy to read and easy to use
▪ Can be manually edited
→ Suitable for accessing individual entries
❑ Disadvantages
▪ Potential inconsistency between individual PDB entries as well as between PDB records within one entry, e.g. different residue numbering in the SEQRES section (primary sequence) and the ATOM section (atoms and residues in the file)
→ Not suitable for computer extraction of information
▪ Absolute limits on the size of certain items of data, e.g. max. number of atom records limited to 99,999; max. number of chains limited to 26, etc.
→ Large systems such as the ribosomal subunit must be divided into multiple PDB files
→ Not suitable for analysis and comparison of experimental and structural data across the entire database

mmCIF format
❑ Macromolecular crystallographic information file (mmCIF)
❑ Developed to handle increasingly complicated structural data
❑ Each field of information is explicitly assigned by a tag and linked to other fields through a special syntax
❑ Advantages
▪ Easily parsable by computer software
▪ Consistency of data across the database
→ Suitable for analysis and comparison of experimental and structural data across the entire database
❑ Disadvantages
▪ Difficult to read
▪ Rarely supported by visualization and computational tools
→ Not suitable for accessing individual entries

Bioinformatics: protein sequences and databases

Outline
❑ Introduction
❑ Primary sequence of proteins
❑ Protein sequence databases
❑ Sequence alignments
▪ Evolution of proteins
▪ Sequence-structure-function paradigm
▪ Alignment of sequences
❑ Prediction of protein properties from sequence
❑ Structure prediction

Protein synthesis
❑ Protein synthesis occurs in the following steps:
▪ Transcription: DNA -> RNA
▪ Splicing: RNA -> mRNA
▪ Translation: mRNA -> protein
▪ Post-translational modifications: protein -> mature protein

Levels of protein structure

Sources of protein sequences
❑ Multiple databases are available, with different scope:
▪ Generalist: sequences from any source (UniProtKB)
▪ Specialist: sequences focusing on one or more specific conditions, e.g. biological pathway, disease, organism (WormBase)
❑ With different types of sequence content:
▪ Primary sequences of proteins, with annotations and cross-references to each sequence (UniProtKB)
▪ Motif or profile databases: contain information derived from the primary sequence, in the form of abstractions (patterns) that distil the most conserved features among related proteins (Pfam)

UniProtKB
❑ Collaboration between EBI, the Swiss Institute of Bioinformatics and the Protein Information Resource
❑ Central repository of protein sequences and functional information
❑ Quality annotations: information on protein function and individual amino acids, experimental information, biological ontologies, classification, links to other databases
❑ Quality level of the annotation (manual vs. automatic)

UniProtKB
❑ Main component of the database
❑ Reviewed protein entries (Swiss-Prot)
▪ High-quality manual annotations
▪ Manual annotations → reliable information
▪ >570,000 protein records (2024)
❑ Automatic protein entries (TrEMBL)
▪ Automatic translation of protein sequences from the EMBL data bank
▪ Automatic annotations → lower quality, chance of errors
▪ ~250,000,000 protein records (2024), roughly 400x the amount of information
❑ Human-readable explanation of the protein function
❑ Wealth of systematically organized information. In the illustrated example:
▪ Catalytic activity: with details of the enzymatic reaction and cross-links to chemical databases
▪ Activity regulation: competitive inhibitors
▪ Kinetics: experimental measurements towards n substrates
▪ Optimal pH
▪ Implication in biological pathways
▪ Catalytic and key residues (active/binding sites)
▪ Gene Ontology (GO) annotations (enrichment values)
▪ Enzyme/pathway and protein family databases
▪ Keywords
❑ Unique accession numbers
▪ Serialized for sequence variants (see below)
❑ Describes the effect of mutations on the activity of the protein
▪ Mutations mapped onto the protein sequence
❑ Displays the available experimentally determined tertiary structures of the protein
▪ Links to AlphaFold predictions if available (covered later)
▪ Describes secondary structure content mapped to the sequence
▪ Links to databases with 3D structure models
❑ When multiple isoforms are available due to alternative splicing, the different sequences are listed with serialized accession codes (e.g. P21397-1, P21397-2)

Summary of 1D predictions
❑ Different protein properties or characteristics can be predicted from the primary sequence:
▪ Secondary structure
▪ Solvent accessibility
▪ Solubility/expressibility
▪ Transmembrane regions
❑ The methods that make such predictions improve if they consider evolutionary information

Introduction to sequence alignment
❑ Protein sequences can also be directly compared with each other; their similarities or differences can be assessed
❑ Alignments are models that aim to pair the most similar parts among different proteins
❑ If the model considers evolutionary information (and biologically relevant protein alignments do), evolutionary relationships (homology) can be inferred from sequence similarity

A few words on evolution
❑ Darwinian ideas on evolution: all species of organisms arise and develop through the natural selection of small, inherited variations that increase the individual's ability to compete, survive, and reproduce (biological fitness)
❑ Inter-individual differences need to be:
▪ Small
▪ Inheritable
❑ There exists a natural selective pressure. Variations that make an individual fitter (improve its functions) under the conditions of the selective pressure are more likely to be transmitted to the next generations
❑ Accumulation of variation causes speciation

A few words on molecular evolution
❑ Improved function in a given environment (adaptation) is a key concept in evolution. How does this apply to proteins?
❑ How do proteins function?
▪ Examples: molecular catalyst (gift-box shape), molecular pore (tube shape)
▪ Function is dictated by shape (3D structure)
▪ Structure is determined by sequence

Sequence, Structure, Function Paradigm
❑ 3D structure is determined by the sequence
❑ Function is dictated by the 3D structure
MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIMPHCA
GLGRLIACDLIGMGDSDKLDPSGPERYAYAEHRDYLDALWEALDLGDRVVLVV
HDWGSALGFDWARRHRERVQGIAYMEAIAMPIEWADFPEQDRDLFQAFRS
QAGEELVLQD
[Figure: sequence → structure → function]

A few words on molecular evolution
❑ Innovation happens at the sequence level
▪ Mutations (small changes) are introduced in DNA (inheritable)
▪ Subsequently transcribed, processed, and translated into polypeptide chains (proteins)
❑ Selective pressure operates at the function level
▪ Proteins working better in their environments make individuals fitter; such adaptation occurred in the human lineage
▪ Schaffner S. & Sabeti P (2008) Evolutionary adaptation in human lineage. Nature Education 1:14.
❑ Homology: two proteins are homologous if they are the products of genes that evolved from the same ancestor
▪ Topics illustrated on the sequence-structure-function scheme: diversity, paralogs, the annotation problem

Sequence alignments
❑ Alignments are models that aim to pair the most similar parts among different proteins
▪ Global alignments: consider similarity across the entire sequence
▪ Local alignments: consider similarity across sequence fragments
▪ Pairwise alignments: two sequences compared
▪ Multiple sequence alignments: multiple sequences compared
❑ Pairwise alignment techniques
▪ Dot-plot methods
▪ Dynamic programming algorithm: Needleman & Wunsch (global), Smith & Waterman (local)
▪ Word methods
❑ Multiple sequence alignment techniques
▪ Dynamic programming
▪ Progressive methods
▪ Iterative methods

Sequence alignments: substitution models
❑ How can similarity among different parts of proteins be measured?
❑ Assessing similarity in pairs of amino acids:
▪ Each possible pair of amino acids is given a substitution score (substitution matrix)
▪ Amino acids from the (two) sequences should be paired such that the total alignment score is optimized
▪ Sometimes no good pairing can be found and a gap needs to be introduced
▪ Gaps require a special penalty (negative score) in order to force longer and biologically meaningful alignments
❑ Identity matrix (dot-matrix plots)
▪ 1 if same amino acid, 0 otherwise
▪ Limited model: forces the introduction of too many gaps
❑ Substitution models
▪ Score depends on the probability of observing a substitution (mutation) of one particular amino acid for another (e.g. Arg → Lys should score better than Arg → Glu)
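The idea of pairing residues so that the total score is optimized, with a penalty for each gap, can be sketched with the global dynamic programming algorithm of Needleman & Wunsch. This is a minimal illustration under stated assumptions: it uses the identity scoring described above (1 for identical residues, 0 otherwise) and a linear gap penalty of -1, both chosen for simplicity; the function name `needleman_wunsch` is hypothetical, and a realistic alignment would use a substitution matrix instead of identity scores.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment of sequences a and b.

    Returns (score, aligned_a, aligned_b); '-' marks a gap.
    """
    n, m = len(a), len(b)

    def pair_score(x, y):
        return match if x == y else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # diagonal: aligned pair
                score[i - 1][j] + gap,                                 # vertical: gap in b
                score[i][j - 1] + gap,                                 # horizontal: gap in a
            )

    # Back-trace from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("MSLGAK", "MSGAK")
    print(total)
    print(top)
    print(bottom)
```

Replacing `pair_score` with a lookup in a substitution matrix is the only change needed to move from the identity model to an evolutionary scoring model.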
❑ Substitution models include evolutionary information
❑ Dayhoff Mutation Data Matrix
▪ Score is based on the concept of Point Accepted Mutation (PAM)
▪ Evolutionary distance: 1 PAM = time in which 1/100 amino acids are expected to mutate
▪ Longer evolutionary times are inferred from a Markov chain model: PAM matrix product
▪ PAM250 matrix: targets the limit where it is still safe to infer homology in proteins (twilight zone)
▪ Limitation: derived from 1572 observed mutations in (manual) alignments of sequences >85% identical
❑ BLOSUM matrices (BLOcks SUbstitution Matrix)
▪ Derived from blocks of aligned sequences in the BLOCKS database: implicitly represent distant relationships
▪ Bias from identical sequences is removed by clustering at a sequence identity threshold
▪ BLOSUM62 = matrix derived from sequences clustered at 62% or greater identity

PAM                                                  | BLOSUM
Similar proteins compared as a whole                 | Conserved blocks (fragments) compared
PAM1 corresponds to 1 ≠ residue in 100 (99% ID)      | BLOSUM1 corresponds to 1% ID
Other PAM matrices extrapolated from PAM1            | Each matrix based on observed alignments
Higher numbers, more evolutionary distance           | Higher numbers, more similarity (less evolutionary distance)
PAM100                                               | BLOSUM90
PAM120                                               | BLOSUM80
PAM160                                               | BLOSUM62
PAM200                                               | BLOSUM50
PAM250                                               | BLOSUM45

Sequence alignments: pairwise alignment
❑ Dynamic programming algorithm
▪ Matrix: each dimension corresponds to one of the proteins to be aligned
▪ Each cell contains the score value from the substitution model corresponding to the residue pair
▪ Diagonal transitions represent aligned positions
▪ Vertical and horizontal transitions represent gaps and are penalized
▪ The final alignment corresponds to the path in the matrix that maximizes the score
▪ Back-trace from the bottom-right
▪ Global: Needleman & Wunsch, from the corner
▪ Local: Smith & Waterman, from any position
▪ Deterministic; computationally expensive
❑ Word methods
▪ Short non-overlapping sequence stretches (k-tuples or words) are identified in the query sequence and matched in the target sequence(s)
▪ Relative positions of the matching region define an offset (subtraction)
▪ Multiple words matching with a similar offset define a region prone to alignment
▪ Alignments are subsequently extended in alignment-prone regions
▪ Heuristic; an optimal alignment is not guaranteed
▪ Efficient for database searches
▪ BLAST, FASTA

Sequence alignments: multiple sequence alignments
❑ Dynamic programming algorithm
❑ Progressive methods
▪ First align the most similar pair
▪ Subsequently add less similar sequences
▪ Sensitive to similarity inaccuracy (e.g. due to differences in sequence length)
▪ CLUSTAL
▪ Additional information considered: T-Coffee (slow)
❑ Iterative methods
▪ Initial global alignment
▪ Objective function (based on score) to optimise the similarity assessment; choose the best
▪ All possible remaining sequence subsets are re-aligned and re-scored
▪ The best subset is included in the alignment at each iteration
▪ Typically slower, more accurate
▪ MUSCLE, MAFFT

Sequence alignments: patterns and models
❑ Beyond pure sequences: aligned sequences can be used to define patterns, which can then be used to perform searches in databases
▪ Position-Specific Scoring Matrices
▪ Hidden Markov Models

Secondary structure prediction
❑ Prediction of the conformational state of each amino acid (AA) residue of a protein sequence as one of the possible states:
▪ helix (H)
▪ strand (S)
▪ coil (C)
❑ Secondary structure prediction programs
▪ PSI-PRED: http://bioinf.cs.ucl.ac.uk/psipred/
▪ Combination of PSI-BLAST profiles and neural networks
▪ Careful selection of the sequences used for profile construction

Solvent accessibility prediction
❑ Prediction of the extent to which a residue embedded in a protein structure is accessible to solvent
❑ Comparison of the accessibility of different amino acids: relative values (actual area as a percentage of the maximally accessible area)
❑ Simplified two-state description: buried vs. exposed residues

Solubility and expressibility prediction
❑ Complicated definition of the property
▪ Prediction of the extent to which a given sequence will produce a soluble protein in a given expression system, or
▪ Prediction of aggregation propensity
❑ Methods rely heavily on machine learning

Transmembrane region prediction
❑ Transmembrane (TM) proteins: a challenge for experimental determination of 3D structure → structure prediction is needed even more than for globular water-soluble proteins
❑ Two major classes of integral membrane proteins
▪ Transmembrane helices (TMH)
▪ Transmembrane beta-strand barrels (TMB)
Transmembrane region prediction
❑ prediction of TMH
▪ simplified by strong environmental constraints – lipid bilayer of the membrane
▪ TMHs are predominantly apolar and 12–35 residues long (hydrophobicity)
▪ specific distribution of Arg and Lys (positively charged) → connecting loop regions at the inside of the membrane have more positive charges than loop regions at the outside = positive-inside rule
❑ prediction of TMB
▪ transmembrane beta-strands contain 10–25 residues
▪ only every second residue faces the lipid bilayer and is hydrophobic; the other residues face the pore of the β-barrel and are more hydrophilic → analysis of hydrophobicity NOT useful for TMB prediction

Structural databases

Outline
❑ Structural databases
▪ Data formats (PDB, mmCIF, PDBML)
▪ wwPDB
▪ Other resources
❑ 3D data validation
❑ 3D protein modelling
❑ Models validation and databases

Data formats
❑ different formats are used to represent primary macromolecular 3D structure data
▪ PDB
▪ mmCIF
▪ PDBML
▪ ...
❑ The spatial 3D coordinates for each atom are recorded

PDB format
❑ designed in the early 1970s – first entries of PDB database
❑ rigid structure of 80 characters per line, including spaces
❑ still the most widely supported format

PDB format
❑ atomic coordinates
❑ chemical and biological features
❑ experimental details of the structure determination
❑ structural features
▪ secondary structure assignments
▪ hydrogen bonding
▪ biological assemblies (REMARK 350)
▪ active sites
▪ ...
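To make the "rigid structure of 80 characters per line" concrete, the following sketch reads one coordinate record by slicing fixed columns rather than splitting on whitespace. The column positions follow the published PDB ATOM/HETATM layout; the function name, the returned dictionary keys and the example coordinates are assumptions made for illustration.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record of a PDB file by fixed column positions.

    Returns None for any other record type (e.g. REMARK, SEQRES).
    """
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain_id": line[21],             # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "x": float(line[30:38]),          # columns 31-38: x coordinate (Å)
        "y": float(line[38:46]),          # columns 39-46: y coordinate (Å)
        "z": float(line[46:54]),          # columns 47-54: z coordinate (Å)
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66: temperature factor
    }


# Made-up example record, assembled with a format string so that every
# field is guaranteed to sit in the right columns.
EXAMPLE_LINE = (
    "ATOM  {:5d} {:<4s}{:1s}{:3s} {:1s}{:4d}{:1s}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
).format(1, " N", " ", "MET", "A", 1, " ", 27.340, 24.430, 2.614, 1.00, 9.67)

atom = parse_atom_line(EXAMPLE_LINE)
```

The five-column serial field and the single-character chain field are exactly where the size limits of the format come from.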
PDB format
❑ advantages
▪ widely used → supported by majority of tools
▪ easy to read and easy to use → suitable for accessing individual entries
❑ disadvantages
▪ inconsistency between individual PDB entries as well as PDB records within one entry (e.g., different residue numbering in SEQRES and ATOM sections) → not suitable for computer extraction of information
▪ absolute limits on the size of certain items of data, e.g.: max. number of atom records limited to 99,999; max. number of chains limited to 26 → large systems such as the ribosomal subunit must be divided into multiple PDB files
▪ → not suitable for analysis and comparison of experimental and structure data across the entire database

mmCIF format
❑ macromolecular Crystallographic Information File (mmCIF)
❑ developed to handle increasingly complicated structure data
❑ each field of information is explicitly assigned by a tag and linked to other fields through a special syntax

mmCIF format
❑ advantages
▪ easily parsable by computer software
▪ consistency of data across the database
❑ disadvantages
▪ difficult to read
▪ rarely supported by visualization and computational tools
❑ → suitable for analysis and comparison of experimental and structure data across the entire database
❑ → not suitable for accessing individual entries
Structural databases
❑ Primary
▪ wwPDB: 3D structure of biopolymers
▪ BMRB: Nuclear Magnetic Resonance specific
▪ EMDB: Electron-Microscopy specific
▪ NDB: 3D structure of nucleic acids: http://ndbserver.rutgers.edu/
▪ CSD: 3D structure of small molecules (commercial): http://www.ccdc.cam.ac.uk/products/csd/
❑ Other sources
▪ PDBsum, SCOP, Proteopedia, Structural Biology KnowledgeBase

wwPDB
❑ joint initiative of four organizations
▪ Research Collaboratory for Structural Bioinformatics (RCSB PDB)
▪ Protein Data Bank in Europe (PDBe)
▪ Protein Data Bank Japan (PDBj)
▪ Biological Magnetic Resonance Data Bank (BMRB)

wwPDB
❑ worldwide Protein Data Bank (wwPDB)
▪ http://www.wwpdb.org/
▪ central repository of experimental macromolecular structures
▪ more than 225,000 structures (October 2024), updated every week
▪ mostly protein structures (87 %), structures of protein/nucleic acid or oligosaccharide complexes (11 %) and nucleic acid structures (2 %)
▪ majority of structures from X-ray crystallography (84 %), NMR (6 %), or EM (10 %)
▪ deposition of the structure into wwPDB is a requirement for its publication

wwPDB – data deposition
❑ All data can be deposited at the RCSB PDB, PDBe or PDBj site
❑ Same requirements on content and format of the final files:
▪ structures of biopolymers
▪ structures determined by experimental techniques
▪ structures containing required information
❑ Same validation methods → uniformity of the final archive
❑ PDB-ID assigned to each deposition
▪ unique identifier of each structure
▪ four-character code
wwPDB – data access
❑ access to the PDB archive is free and publicly available from the RCSB PDB site, PDBe site or PDBj site
❑ FTP
▪ RCSB PDB, PDBe and PDBj sites distribute the same PDB archive
▪ updated weekly
❑ web sites
▪ each wwPDB site provides its own services and resources → different views and analyses of the structural data
▪ sequence-based and text-based queries

Structural quality assurance

Outline
❑ Revision of concepts
❑ Important truths about structures
❑ Errors in deposited structures
▪ systematic errors
▪ random errors
❑ Selecting reliable structure
▪ rules of thumb
▪ quality checks
▪ programs and databases

Important truths about structures
❑ all structures are just models devised to satisfy experimental data → random and systematic errors
❑ individual structures differ in quality
❑ most structures are reasonably accurate, containing “only” random errors, but some structures are seriously incorrect
❑ structures should be carefully selected and critically assessed before being used for a specific purpose → quality checks of structures

Errors in deposited structures
❑ systematic errors
❑ random errors

Systematic errors
❑ relate to the accuracy of the model – how well it corresponds to the “true” structure of the molecule in question
❑ often include errors of interpretation
▪ low quality of electron density map → difficult to find the correct tracing of the molecule(s) through it → mistracing and “frame-shift” errors
▪ spectral interpretations (assignment of individual NMR signals to individual atoms) may lead to a completely wrong final structure
Examples of systematic errors
❑ completely wrong structures
▪ trace of the protein chain following the wrong path through the electron density → completely incorrect fold
▪ incorrect model (1PHY), corrected model (2PHY)
❑ wrong connectivity between secondary structure elements
▪ incorrect order of secondary structure elements → many of the protein’s residues in the wrong place in the 3D structure
▪ incorrect model (1PTE), corrected model (3PTE)
❑ frame-shift errors
▪ occur where a residue is fitted into the electron density that belongs to the next residue, and persist until a compensating error is made (two residues are fitted into the density of a single residue)
▪ occur almost exclusively at very low resolution (> 3.0 Å), often in loop regions
❑ fitting of incorrect main chain or side chain conformations into the density
▪ usually the least serious, but can still affect biological interpretations

Random errors
❑ depend on how precisely a given measurement can be made
❑ all measurements contain errors at some degree of precision → uncertainties in atomic positions
❑ less serious than systematic errors
❑ if a structure is essentially correct, the sizes of the random errors determine how precise the structure is

Examples of random errors
❑ uncertainties in atomic positions
▪ typically in range of 0.01–1.27 Å, median 0.28 Å
Examples of random errors
❑ side chain flips
▪ His/Asn/Gln – symmetrical in terms of shape → fit electron density equally well when rotated by 180°
▪ difficult to distinguish N and O atoms of the side-chain amide from X-ray data

Rules of thumb for selecting structures
❑ X-ray structures
▪ reasonably accurate structure: resolution ≤ 2.0 Å and R-factor ≤ 0.2
▪ selection criteria always depend on the type of analysis required (e.g., comparison of folds – 3.0 Å resolution is sufficient vs. analysis of side chain torsional conformers – resolution ≤ 1.2 Å is required)
▪ R-factor can easily be fooled → a better indicator of model reliability is Rfree – calculated in the same way as R-factor but using only a small fraction of the experimental data that is set aside and not used in refinement; Rfree should be ≤ 0.4
▪ local errors indicated by residue B-factors > 50
▪ but quality checks should always be performed to assess possible local problems in a structure
❑ NMR structures
▪ no simple rule of thumb as in the case of X-ray structures
▪ information on structure quality can be found in the original paper or obtained by quality checks
▪ ResProx (http://www.resprox.ca/) – predicts the atomic resolution of NMR protein structures using machine learning
▪ DRESS (http://www.cmbi.ru.nl/dress/) and RECOORD (http://www.ebi.ac.uk/pdbe-apps/nmr/recoord/main.html) web servers – provide improved versions of old NMR models (obtained by re-refinement of the original experimental data using more up-to-date force fields and refinement protocols)
Quality checks of structures
❑ checks of structure geometry, stereochemistry and other structural properties
❑ tests of normality
▪ comparison of a given protein or nucleic acid structure against what is already known about these molecules
▪ knowledge comes from high-resolution structures of small molecules and systematic analyses of existing protein and nucleic acid structures
❑ not all outliers from the norm are errors (e.g., an unusual torsion angle of a single residue); however, a structure exhibiting a large number of outliers and oddities is probably problematic

Validation of protein structures
❑ Ramachandran plot
▪ check of stereochemical quality of protein structures
▪ plot of the Ψ versus the Φ main chain torsion angles for every amino acid residue in the protein (except the two terminal residues)
▪ favorable and “disallowed” regions of the plot determined from analyses of existing structures
▪ typical protein structures – residues tightly clustered in the most favored regions, few or no residues in the “disallowed” regions
▪ poorly defined protein structures – residues more dispersed and many of them lie in the “disallowed” regions of the Ramachandran plot
[Figure: Ramachandran plots (Ψ vs. Φ) of a typical protein structure and a poorly defined protein structure]

Validation of protein structures
❑ bad and unfavorable atom-atom contacts
▪ “simple” count of bad contacts, e.g., two nonbonded atoms with a center-to-center distance < sum of their van der Waals radii
▪ evaluation of the environment of individual atoms or residue fragments with respect to the environments found in high-resolution crystal structures
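The "simple" count of bad contacts described above can be sketched in a few lines. This is only an illustration of the distance criterion, not the procedure of any specific validation program: the van der Waals radii are approximate literature values, and the function name, the `bonded` argument and the `tolerance` parameter are assumptions introduced for the example.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in Å (illustrative values only).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_bad_contacts(atoms, bonded=frozenset(), tolerance=0.0):
    """Count nonbonded atom pairs closer than the sum of their vdW radii.

    atoms     -- list of (element, (x, y, z)) tuples, coordinates in Å
    bonded    -- set of frozenset({i, j}) index pairs to skip (bonded atoms)
    tolerance -- allowed overlap in Å before a contact counts as bad
    """
    bad = 0
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(
        enumerate(atoms), 2
    ):
        if frozenset((i, j)) in bonded:
            continue
        limit = VDW_RADII[elem_i] + VDW_RADII[elem_j] - tolerance
        if math.dist(pos_i, pos_j) < limit:
            bad += 1
    return bad


# Toy coordinates (made up): the C and O atoms are 2.0 Å apart,
# well below the 3.22 Å sum of their radii.
toy_atoms = [("C", (0.0, 0.0, 0.0)), ("O", (2.0, 0.0, 0.0)), ("N", (10.0, 0.0, 0.0))]
n_bad = count_bad_contacts(toy_atoms)
```

In practice, atoms separated by only a few covalent bonds would also need to be excluded and some overlap tolerated, which is why the sketch exposes `bonded` and `tolerance`.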
Validation of protein structures
❑ other parameters
▪ counts of unsatisfied hydrogen bond donors
▪ hydrogen bonding energies
▪ knowledge-based potentials assessing how “happy” each residue is in its local environment – many unhappy residues → “sad” overall structure
▪ real-space R-factor expressing how well each residue fits its electron density; can also be expressed as a real-space correlation coefficient

Quality information on the web
❑ several databases provide pre-computed quality criteria for all wwPDB structures
▪ EDS
▪ PDBsum
▪ PDBREPORT
▪ RCSB PDB

Quality information on the web
❑ Electron Density Server (EDS)
▪ http://eds.bmc.uu.se/eds/, also available via the PDBe site
▪ information about local quality of the structure
▪ for all structures from wwPDB with deposited experimental data
▪ plot of real-space R-factor (RSR) – how well each residue fits its electron density
▪ plot of Z-score – large positive spike → residue has considerably worse RSR than the average residue of the same type in structures determined at similar resolution
▪ Ramachandran plot
▪ ...
❑ PDBsum
▪ http://www.ebi.ac.uk/pdbsum/
▪ provides numerous structural analyses of all wwPDB structures, including full PROCHECK output (for all protein-containing entries)
Models of structures

3D structure prediction
❑ homology modeling
❑ fold recognition
❑ ab initio prediction
❑ “hybrid” approaches
❑ assessment
❑ databases of protein models

Importance of structure
❑ no experimental structure for most of the sequences
[Figure: number of entries (millions) by year]

Homology modeling
❑ basic principle – structure is more conserved than sequence
▪ similar sequences adopt practically identical structures, e.g., haloalkane dehalogenase LinB (PDB-ID 1iz7) and haloalkane dehalogenase DhaA (PDB-ID 1cqw), sequence identity ~50 %
▪ distantly related sequences still fold into similar structures, e.g., haloalkane dehalogenase LinB (PDB-ID 1iz7) and chloroperoxidase L (PDB-ID 1a88), sequence identity ~15 %
❑ number of folds in SCOP database
[Figure: number of folds, per year and total, by year]
❑ builds an atomic-resolution model of the target protein based on the experimental 3D structure (template) of a homologous protein
❑ the most accurate 3D prediction approach
❑ if no reliable template is available → fold recognition or ab initio prediction
❑ the quality of the model depends on the sequence identity/similarity between the target and template proteins
❑ for a standard-length protein it should be > 25 % / > 40 %
[Figure: safe homology modeling zone vs. twilight zone]

Homology modelling – steps: target sequence (...MSLGAKPFGE...)
Homology modelling – steps: target sequence → database search → selection of template

Selection of template
❑ wrong template = wrong model
❑ more than one possible template may be identified → a combination of different criteria to select the final template:
▪ sequence identity between the template and target protein
▪ coverage between the template and query sequences
▪ the resolution of the template structure, number of errors
▪ the portion of conserved residues in the region of interest (e.g., binding site residues)
▪ ...
❑ multiple templates can be used to create a combined model

Homology modelling – steps: target sequence → database search → selection of template → sequence alignment (...MSLGAKPFGE... aligned with ...MGV-AKTYGE...)

Sequence alignments
❑ reliability of alignment decreases with decreasing similarity of the target and template sequences
❑ quality of alignment is crucial – it determines the quality of the final model
❑ the pairwise target-template alignment provided by the database search methods is almost guaranteed to contain errors → more sophisticated methods needed
▪ multiple sequence alignment
▪ profile-driven alignments
▪ correction of alignment based on the template structure

Homology modelling – steps: target sequence → database search → selection of template → sequence alignment → building model framework → loop and side-chain modeling
Homology modelling – steps: target sequence → database search → selection of template → sequence alignment → building model framework → loop and side-chain modeling → model optimization → model validation

Model validation
❑ a finished model contains errors (like any other structure) – the number of errors (for a given method) mainly depends on:
▪ the percentage of sequence identity between template and target sequence, e.g., 90 %: accuracy of the model comparable to X-ray structures; 50–90 %: larger local errors; identity < 25 %: often very large errors
▪ the number of errors in the template structure
❑ problems that occur far from the site of interest may be ignored, others should be tackled

Homology modelling – steps: target sequence → database search → selection of template → sequence alignment → building model framework → loop and side-chain modeling → model optimization → model validation → iteration

Iteration
❑ portions of the homology modeling process can be iterated to correct identified errors
▪ small errors introduced during the optimization → running a shorter molecular dynamics simulation
▪ error in a loop → choosing another loop conformation in the loop modeling step
▪ large mistakes in the backbone conformation → repeating the whole process with another alignment or even a different template
▪ ...
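Because the expected number of errors is tied to target-template sequence identity, it can help to see how that percentage is obtained from an alignment. The sketch below uses the two aligned fragments shown in the step diagrams; note that several definitions of identity exist, and counting matches over non-gap columns is an assumption made here, as are the function names. The bands returned by `expected_accuracy` simply restate the ranges listed under model validation.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def expected_accuracy(identity):
    """Map identity (%) onto the qualitative bands quoted in the slides."""
    if identity >= 90:
        return "comparable to X-ray structures"
    if identity >= 50:
        return "larger local errors"
    if identity < 25:
        return "often very large errors"
    return "range not covered by the slide"


# The fragments from the step diagrams: 5 matches over 9 gap-free columns.
identity = percent_identity("MSLGAKPFGE", "MGV-AKTYGE")
verdict = expected_accuracy(identity)
```

A ten-residue fragment is far too short for a meaningful estimate; the numbers only illustrate the arithmetic.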
Homology modeling programs
❑ MODELLER
▪ http://salilab.org/modeller/
▪ models built by satisfying the spatial restraints of the Cα–Cα bond lengths and angles, the dihedral angles of the side-chains, and van der Waals interactions
▪ restraints calculated from the template structures
▪ available as a web server at different sites, e.g., part of: ModWeb workflow https://modbase.compbio.ucsf.edu/modweb/, GeneSilico server https://genesilico.pl/toolkit/unimod?method=Modeller or Bioinformatics toolkit http://toolkit.lmb.uni-muenchen.de/modeller
❑ SWISS-MODEL
▪ http://swissmodel.expasy.org/
▪ fully automated protein structure homology modeling server

Model validation
❑ mostly the same principles as used for the validation of experimental structures
❑ always check both model and template
▪ the model cannot improve on the template in regions where the template itself is “bad”
❑ checks of normality
▪ inside/outside distributions of polar and apolar residues
▪ bad contacts
▪ evaluation of atom/residue environment
❑ energy-based checks
▪ side-chain clashes
▪ bond lengths and angles

Model validation programs
❑ QMEAN
▪ https://swissmodel.expasy.org/qmean/
▪ composite scoring function for the quality estimation of protein structure models; evaluates torsion angles, solvation and non-bonded interactions and the agreement between predicted and calculated secondary structure and solvent accessibility
❑ Verify3D
❑ ANOLEA
❑ PROCHECK
❑ WHATCHECK
❑ PROSA II
❑ …

Fold recognition (Threading)
❑ predicts the fold of a protein by fitting its sequence into a structural database and selecting the best fitting fold
❑ provides a rough
approximation of the overall topology of the native structure → does not generate fully refined atomic models for the query sequence
❑ can be used when no suitable template structures are available for homology modeling
❑ fails if the correct protein fold does not exist in the database
❑ high rates of false positives

Fold recognition (Threading)
❑ pairwise energy-based methods (threading) – protein sequence is searched for in a structural database to find the best matching structural fold using energy-based criteria
1. alignment of the query sequence with each structural fold in the fold library (essentially performed at the sequence profile level)
2. building a crude model for the target sequence (replacing aligned residues in the template structure with the corresponding residues in the query)
3. calculating energy of the raw model
[Figure: pairwise energy (kcal/mol) as a function of Cβ–Cβ distance for Glu-Asp and Glu-Arg pairs with l > 10, where l is the distance in sequence; such energies can be calculated from collections of known structures (density normalization required)]
4. ranking of the models based on the energetics – the lowest energy fold represents the structurally most compatible fold

Ab initio prediction
❑ attempts to generate a structure by using physicochemical principles only
❑ used when neither homology modeling nor fold recognition can be applied
❑ search for the structure in the global free-energy minimum
❑ so far still limited success in getting correct structures

Ab initio prediction programs
❑ Rosetta
▪ http://www.rosettacommons.org/
▪ software suite for predicting and designing protein structures, protein folding mechanisms, and protein-protein interactions

“Hybrid” 3D structure prediction programs
❑ I-TASSER
▪ http://zhanglab.ccmb.med.umich.edu/I-TASSER/
▪ combines homology modeling, threading and ab initio predictions
▪ No.
1 server for protein structure prediction in previous CASP experiments
❑ Robetta
▪ http://robetta.bakerlab.org/
▪ combines homology modeling and ab initio predictions
▪ implements ROSETTA software

Assessment of prediction methods
❑ CASP (Critical Assessment of techniques for protein Structure Prediction)
▪ http://predictioncenter.org/
▪ biennial international contest providing objective evaluation of the performance of individual prediction methods
▪ evaluation based on a large number of blind predictions – contestants are given protein sequences whose structures have been solved but not yet published; results of the predictions are compared with the newly determined structures
▪ competition in several categories
❑ CAMEO (Continuous Automated Model EvaluatiOn)
▪ https://www.cameo3d.org/
▪ weekly assessment of new structures in the PDB
▪ registered prediction servers are sent weekly requests on not-so-easy new structures in the weekly PDB pre-release
▪ multiple scores considered, normalized average (lDDT) reported
▪ category 3D: prediction of the 3D coordinates of a protein from sequence
▪ category QE: model quality estimation, i.e., assessment of quality measures reported by participant servers on real 3D structures
Databases of protein models
❑ ModBase
▪ http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi
▪ database of annotated protein models generated by the automated pipeline including the MODELLER program
▪ contains ~38 million models for ~6.5 million unique sequences

Protein folding, stability and dynamics

Outline
❑ Revisions
❑ Protein folding
❑ Protein stability
❑ Protein dynamics

Levinthal’s paradox
❑ Cyrus Levinthal
▪ 1968 – impossibility of random folding
▪ random folding
▪ 100-residue protein (average sized)
▪ 3 conformations per residue (in reality many more)
▪ 0.1 ps sampling time per conformation (in reality much longer)
▪ folding time = 3¹⁰⁰ × 10⁻¹³ s ≈ 5 × 10³⁴ s ≈ 1 634 251 397 552 039 990 billions of years
❑ Experimental folding rates
▪ 1 ms to 10 min

Anfinsen’s thermodynamic hypothesis
❑ Christian Anfinsen
▪ 1973 – protein folding in vitro
▪ refolding of ribonuclease
❑ Findings
▪ native structure of a protein is the thermodynamically stable structure
▪ folding depends only on the amino acid sequence and on the conditions of solution, and not on the kinetic folding route

Mechanisms of protein folding
❑ Nucleation-growth (propagation) model
▪ continuous growth of tertiary structure from initial nucleus of local secondary structure
▪ it did not account for folding intermediates -> model dismissed
❑ Framework model
▪ secondary structure folds first -> coalescence of secondary structural units to the native protein
❑ Hydrophobic collapse model
▪ compaction of the protein -> folding in a confined volume -> narrowing the conformational search to the native state
❑ Nucleation-condensation model
▪ concerted & cooperative secondary and tertiary
structure formation
▪ transition state resembles distorted form of the native structure
▪ the least distorted part called folding nucleus or molten globule

Energetics of protein folding
❑ Free energy of folding (ΔGfold = ΔH - T·ΔS)
▪ protein more structured -> ΔS↓ – unfavorable
▪ solvent less structured -> ΔS↑ – favorable
▪ hydrophobic interactions are driving “force”
▪ more non-covalent interactions -> ΔH↓ – favorable

Basics of protein stability
❑ Tertiary structure of protein
▪ sum of non-covalent weak interactions vs conformational entropy
▪ folded protein = thermodynamic compromise
▪ folded protein marginally more stable than unfolded (10–80 kJ/mol)
▪ weak interactions are frequently disrupted
▪ denaturation – disrupted bonds replaced by bonds with solvent
▪ dynamics – disrupted bonds reformed between protein atoms

Introduction to protein dynamics
❑ Origin of dynamics – disruption of weak interactions by
▪ thermal kinetic energy (kB·T)
▪ binding interactions (ligands or other proteins) – induced fit
❑ Protein atoms fluctuate around their average positions
▪ in tightly packed interior – movement restricted
▪ near surface – movement promoted by solvent movements
▪ -> proteins considered as “semi-liquids”

Characteristics of protein motions
❑ Divisions of protein motions
Type of motion | Moving moiety | Functionality
Local | atoms; side-chains | bond vibration; ligand flexibility; temporal diffusion pathways
Medium-scale | secondary structures | active site conformational changes; motion of hinge; peptide bond rotation
Large-scale | domains | hinge facilitated domain movements; allosteric transition
Global | subunits | helix-loop transition; folding/unfolding

Time scales of protein motions
❑ Time scales governed by local environment
▪ interior – motions coupled due to packing restraints
▪ surface – no coupling of motions
❑ Example: aromatic ring flipping
▪ can occur on ps time scale, but often observed on ms time scale
▪ aromatic residues -> hydrophobic -> inside protein -> tightly packed
▪ -> low probability of synchronized movement of surrounding atoms
▪ -> prolonged time scale

NMR spectroscopy
❑ Ensemble of possible low energy conformations
❑ Directly shows possible amplitudes of motion
❑ Limited applicability to larger proteins
❑ Does not describe
▪ very fast motions & transition states
▪ time scales & energetics of motions

High resolution X-ray crystallography
❑ Average low energy structure - more conformations:
▪ in one structure only if both are separated by barrier
▪ in multiple structures
❑ Crystalline state
▪ non-native contacts
▪ artificially lower amplitudes of motions
❑ Range of fluctuations – B-factors
❑ Does not describe
▪ very flexible regions
▪ collectiveness of motions
▪ time scales & energetics of motions

Normal mode analysis
❑ Principle
▪ motion of system as harmonic vibration around a local minimum
▪ coarse-grained model, residues connected with springs
❑ Small number of low-frequency normal modes
▪ shows directionality, collectiveness and sequence of global motions
❑ Does not describe
▪ local movements
amplitudes & time scales ▪ energetics of motions Protein dynamics – approaches to study dynamics 40 Molecular dynamics ❑ Principle ▪ physical description of interactions within the system (force field) ▪ Newton’s laws of motions ❑ Provides information on energetics, amplitudes, and time scales of local motions on the atomic level ❑ Does not describe ▪ slower large-scale motions (> ms) Protein dynamics – approaches to study dynamics 43 Databases of dynamics ❑ Molecular Dynamics Extended Library (MoDEL) ❑ Dynameomics ❑ Molecular Movements Database (MolMovDB) ❑ ProMode-Elastic Protein dynamics – databases 48 A n a l y s i s o f p ro te i n st r u c t u re s Outline ❑ Residue solvent accessibility ❑ Protein solubility ❑ Molecular interactions ❑ Functional sites ▪ Binding sites ▪ Transport pathways Analysis of protein structures 2 Residue solvent accessibility ❑ Solvent accessible surface area What is it? Why do we care? Residue solvent accessibility 3 Residue solvent accessibility ❑ Solvent accessible surface area (ASA, SASA or SAS, in Å2) → It quantifies the extent to which a residue in a protein structure is accessible to the solvent ❑ Typically calculated by rolling a spherical probe of a particular radius over a protein surface and summing the area that can be accessed by this probe on each residue ➔ Residue solvent accessibility 4 Residue solvent accessibility ❑ Solvent accessible surface area (ASA, SASA or SAS, in Å2) ❑ Solvent excluded surface (SES) – also known as molecular surface, or Connolly surface area Water radius 1.4 Å VdW VdW = Van der Waals radius Residue solvent accessibility 5 Residue solvent accessibility ❑ Solvent accessible surface area (ASA, SASA or SAS, in Å2) ❑ Solvent excluded surface (SES) – also known as molecular surface, or Connolly surface area – usually represented in “surface” visualization SASA SES Residue solvent accessibility 6 Residue solvent accessibility ❑ Relative accessible surface area (rASA) ▪ Ratio of the actual accessible 
area of a given residue rASA = ASA / ASAMAX ▪ Enables comparison of accessibility of different amino acids (e.g., long extended vs. spherical amino acids) ❑ Simplified two state description ▪ Buried vs. exposed residues ▪ Threshold for differentiating surface residues vs. buried is not well defined (usually rASA = 15–25 %) ▪ rASA < threshold => buried rASA ≥ threshold => exposed Residue solvent accessibility 7 Residue solvent accessibility – programs ❑ POLYVIEW-2D (PDB) /