Modelling and Simulation of Biological Macromolecules

Table of Contents

Computer Aided Drug Design
Proteins vs. Small molecules
Protein Design for Drug Discovery
  Levels of Structure
  Experimental analysis of the 3D protein structure
  Protein Data Bank
  Structural Alignment of 2 Proteins: RMSD
Prediction of Protein Structure
  Anfinsen’s Dogma
  Levinthal’s Paradox
  Structure Prediction
  AlphaFold2
    Neural network
    Evolutionary constraints of the protein structure
    Geometric constraints of the protein structure
    Spatial Graph
    How does AlphaFold work?
  Other AI Applications for Structure Prediction
    RoseTTAFold
    ESMFold
    ColabFold
Biomolecular Interactions
  Covalent Interaction in the polypeptide chain (Primary structure)
  Non-covalent interactions (Secondary & Tertiary Structure)
    Hydrogen bonds
    Hydrophobic interactions (Tertiary Structure)
    Van der Waals interactions (Tertiary Structure)
    Electrostatic interactions (Tertiary Structure)
    Aromatic Interaction (Pi-Stacking or Pi-Pi-Stacking)
    Disulfide Bridges
    Ionic Interaction (Salt Bridges)
  Secondary Structure Prediction
    DSSP (Define Secondary Structure of Proteins) Algorithm
  Classical vs. Quantum Mechanics
    Classical Mechanics
    Quantum Mechanics
    Covalent vs. non-covalent interactions
    Molecular Force Fields
    Charge-Charge Interactions (Monopole-Monopole)
    Dipole interactions
    Hydrophobic Interactions
Ligand-Protein Interactions
  Molecular Docking (Structure-Based Methods)
    Representation of the system/Site Characterization
    The search algorithm (Searching for lead compounds)
    The scoring or energy evaluation routine
    General understanding of binding
    Summary: Ligand Docking Programs
    Thermodynamics
Proteins in Motion
  Molecular Dynamics (MD) Simulations
    MD and Force Fields
    Ensembles
Chemoinformatics
  Ligand-Based
    Quantitative Structure-Activity Relationships (QSAR)
    Pharmacophore
  The Lipinski’s Rule of 5
Ligand vs. Structure Based Drug Design
  Virtual Screening
    Advantages & Disadvantages of Virtual Screening
AI in Drug Design
Enhanced Sampling in MD
Machine Learning in Protein Design
Modelling & Simulation of Biological Macromolecules

Drug Discovery = the process of identifying chemical entities that have the potential to become therapeutic agents
• Empirical => finding a compound that produces a desired therapeutic effect in vitro; no understanding of the drug’s mechanism is needed
• Rational => finding a compound that interacts with the target of interest

Computer Aided Drug Design:
CADD: Computer-Aided Drug Discovery
• Ligand-based => Chemoinformatics: design of molecules that are structurally similar to the known ligand of the protein (analogy: replicating a key)
• Structure-based => Structural Bioinformatics & Bioinformatics: design of a molecule that binds to the protein structure

Proteins vs. Small molecules:
• Proteins: higher specificity, modulation of the activity, target proteins, difficult to produce and stabilize, frequent administration
• Small molecules: chemically synthesized, modulate activity, target proteins or biomolecules, easy to produce, oral administration, less specific, potential off-target effects and toxicity

Selectivity Loop:
A specific amino acid sequence in a protein that determines the specificity of the protein for its binding partners. It is often located in the active site, where it plays a critical role in determining which substrate or ligand the protein will bind. The loop can physically obstruct the binding of certain molecules or change its conformation upon binding, allowing the protein to bind a specific ligand while inhibiting the binding of other molecules. In some cases, small-molecule drugs have been designed to target the selectivity loop and modulate protein activity by binding to it.
Protein Design for Drug Discovery:
Protein Folding = physical process by which a protein chain acquires its native three-dimensional (biologically functional) structure
=> Goal: to understand the biologically active conformation and the energetic interactions in the secondary structure, and to decode the folding levels in order to decode structures

Levels of Structure:
• Primary – Amino acid sequence
• Secondary – Local 3D patterns (helices, sheets, loops)
• Tertiary – 3D fold of the protein chain
• Quaternary – Interaction of more than one protein chain

Experimental analysis of the 3D protein structure:

| | NMR | X-Ray | Cryo-EM |
|---|---|---|---|
| Method | Measuring the response of the nuclei of the atoms in a molecule to a specific magnetic field; different nuclei => different chemical shifts | Scattering of X-rays by the atoms in a crystal; determination of atom positions | EM of the frozen sample (native structure); collection of scattered electrons to create a 3D image; different orientations => different shades |
| Resolution | Good (> 1 Å) | Good (> 1 Å) | Medium (> 3 Å) |
| Analytes | Sidechains, hydrogens, backbone, secondary structure, tertiary structure | Backbone, sidechains, secondary structure, tertiary structure, quaternary structure | Backbone, secondary structure, tertiary structure, quaternary structure |
| Molecule size | Small to medium | No restriction | Big complexes |
| Sample | Aqueous sol. | Crystal | Frozen aq. sol. |
| Dynamics | Multiple models | Static snapshot | Static snapshot |
| Pros | Versatility; no need for crystals; dynamic information; detailed structural info | High resolution; versatility (a wide range of materials) | Versatility; no need for crystals; high resolution; study of dynamic processes |
| Cons | Expensive; limited resolution; requires isotopic labeling; limited sensitivity (large amount of sample is needed) | A large, high-quality crystal is required, which is difficult to obtain, especially for proteins; time-consuming; expensive | Requires a large amount of high-quality sample; time-consuming; expensive; computationally intensive data processing |

Ramachandran Plot:
• A graphical representation of the backbone conformations of the amino acid residues in a protein
• Based on the backbone dihedral (torsion) angles φ and ψ of each residue
  o Phi (φ): rotation about the N-Cα bond (dihedral defined by the atoms C(i-1), N, Cα, C)
  o Psi (ψ): rotation about the Cα-C bond (dihedral defined by the atoms N, Cα, C, N(i+1))
• Different AAs have different preferences for secondary structures
• Each point on the plot represents the (φ, ψ) combination of one residue, and the plot can be divided into favored, allowed, and disallowed regions. Favored regions are where most residues are found and correspond to the regular secondary structures (e.g., α-helix, β-sheet); allowed regions are less populated but still sterically possible; disallowed regions are conformations in which atoms would clash, so (almost) no residues are found there.
• Widely used in protein structure prediction, protein modeling and validation, and the analysis of protein structures
• Provides a simple yet powerful visual representation of the backbone conformations in proteins
• Helps to identify potential errors or problems in protein structure models

Protein Data Bank:
The Protein Data Bank (PDB) is a publicly available repository that stores and provides access to the structural data of biological macromolecules, primarily proteins and nucleic acids, obtained by techniques such as X-ray crystallography, NMR spectroscopy, and Cryo-EM.
PDB File Format: a text file containing:
• Properties and info (resolution, method used, references, etc.)
• Description of the 3D structure of the protein (protein sequence, protein chains, domains, etc.)
• Additional molecules (water/ions/peptides) used to stabilize the structure
Visualizing protein structures using: PyMOL, ChimeraX, Maestro

Structural Alignment of 2 Proteins: RMSD
RMSD (Root Mean Square Deviation) = measure of the difference between the structures of two or more molecules
• for comparing an experimentally determined structure of a protein or other biomolecule to a theoretical model
• for comparing the structures of different proteins or other biomolecules to each other
The RMSD is calculated by superimposing the atoms of the two structures as closely as possible and then taking the square root of the mean squared distance between corresponding atoms: $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2}$, where $d_i$ is the distance between the $i$-th pair of corresponding atoms. The RMSD value is a measure of the structural similarity between the two molecules and is typically reported in angstroms (Å). A lower RMSD value indicates a higher degree of structural similarity.
• by C-alpha for the structural alignment
• by backbone for superimposition
• high sequence identity => high structural similarity => low RMSD
When comparing two structures, it is important to ensure that the corresponding atoms in the two structures are correctly matched. This is usually done by applying a structural alignment algorithm, which optimizes the superposition of the two structures by minimizing the RMSD between the corresponding atoms.

Alternative: GDT-TS
• Global Distance Test (GDT): percentage of residues that can be superimposed under given distance cutoffs (good alignment within the cutoffs)
• GDT-Total Score (TS): $\mathrm{GDT\_TS} = \frac{1}{4}\left(P_1 + P_2 + P_4 + P_8\right)$, where $P_d$ is the fraction of residues that can be superimposed under a distance cutoff of $d$ Å. Averaging over four different cutoff values reduces the dependence on the choice of the cutoff.

Prediction of Protein Structure:

Anfinsen’s Dogma:
• A principle in molecular biology stating that the native 3D structure of a protein is determined solely by its amino acid sequence.
• It states that the unique sequence of amino acids in a protein determines the specific interactions between its atoms and the way in which the protein folds into its final, biologically active conformation. This is accomplished through non-covalent interactions between amino acids, such as hydrogen bonding, electrostatic interactions, and hydrophobic interactions.
• Anfinsen's Dogma is not absolute: some proteins require specific chaperones, or other helper molecules, to fold into their native conformation, and some proteins are unable to refold into their native conformation once they are denatured. Additionally, post-translational modifications, such as phosphorylation or glycosylation, can affect protein folding and stability.

Anfinsen’s Experiment:
• Small protein ribonuclease (RNaseA), composed of 124 amino acids and 4 disulfide bonds, as a model system
• Reduction of the S-S bonds to eight -SH groups using mercaptoethanol
• Denaturation of RNaseA by heating it in a solution of 6 M urea, which breaks the non-covalent interactions that hold the protein in its native conformation
• Removal of the denaturing agent by dialysis, allowing the protein to refold to its native conformation at physiological conditions; -SH groups oxidized back to S-S bonds
• The refolded protein retained full enzymatic activity, indicating that the native conformation and activity of the protein had been restored
But maybe RNaseA was not completely unfolded in urea?
Control:
• RNaseA was first reduced and denatured as above
• The enzyme was oxidized to form S-S bonds (while still denatured in urea)
• Urea was removed
• Activity was only about 1-2% of the untreated enzyme

Levinthal’s Paradox:
• Small protein of 100 AA
• 2 conformations for each AA => $2^{100} \approx 10^{30}$ conformations of the protein
• If we assume that a conformational change takes a constant time of $10^{-10}$ s, the time required to sample all conformations of this protein is $10^{30} \times 10^{-10}\,\mathrm{s} = 10^{20}\,\mathrm{s} \approx 10^{12}$ years (far older than the universe)
• BUT small proteins fold spontaneously in seconds and even the largest proteins fold within minutes. The difference between the theoretical calculation and observational data is known as Levinthal’s paradox.
=> Proteins must have pathways to achieve their native conformation.

Structure Prediction:
1. De novo
2. Homology Modeling (Comparative or Template-based Modeling)
• Constructing a model of the target protein based on its AA sequence and an experimental 3D structure of a related homologous protein (“template”)
• Similar sequences suggest similar structure (modelling proteins with unknown structure starting from known proteins)
• Software: Modeller, Prime, RosettaCM
  o SWISS-MODEL: after uploading the target FASTA sequence, the program offers possible templates, from which models can then be built
  o I-TASSER: uses multiple templates for each target to create an accurate model (threading) – C-score (measure of confidence for each model, based on the significance of the threading template alignments and the convergence parameters of the structure assembly simulations)
Process:
2.1 Template selection
• Sequence identity > 30%
• Sequence or phylogenetic similarity
• Environmental factors: pH, solvent type
• Sequence alignment using database search techniques such as FASTA, BLAST
2.2 Target-template sequence alignment
• Structure-based sequence alignment
• Incorrect alignment => incorrect model
• One template & one target => Pairwise Alignment: comparing 2 sequences
• Methods:
  o Dot Plot: manual, prone to human error
  o Dynamic Programming: optimizing sub-problems to optimize the overall solution, slow
  o Heuristic Methods: only an approximate solution, fast, e.g. BLAST; required for searching in databases (template selection)
• Global vs. Local:
  o Global => calculation of the optimal global similarity score (Needleman-Wunsch algorithm) => finding the best alignment over the full length of both sequences
  o Local => finding regions of similarity; compares segments of all possible lengths and optimizes the similarity score (Smith-Waterman algorithm)
• Multiple templates => Multiple Sequence Alignment (MSA): alignment of more than 2 sequences
• Methods:
  o Progressive Alignment Methods: alignment beginning with the most similar pair and progressing to the most distantly related (ClustalW2, Cobalt)
  o Iterative Methods: like progressive methods, but repeatedly realign the initial sequences while adding new sequences to the growing MSA, which reduces the errors of progressive methods (Clustal Omega, MAFFT, MUSCLE)
  o Consensus Methods: finding the optimal MSA given multiple different alignments of the same set of sequences (MergeAlign)
2.3 Model construction (Backbone, Loops, Sidechains)
2.4 Assessment
• Check model quality using the Ramachandran plot
3. AI
• Critical Assessment of Techniques for Protein Structure Prediction (CASP) => a competition, and a resource to find the best program
  o Tencent
  o AlphaFold2

AlphaFold2:
• A machine-learning-based model for predicting the 3D structure of proteins using only the sequence as input.
• Uses a deep neural network trained on known sequences and structures from the PDB & large databases of protein sequences to make predictions
• Incorporates novel neural network architectures and training procedures based on the evolutionary, physical, and geometric constraints of protein structure
• The algorithm finds sequences similar to the input, extracts the information using a special neural network architecture, and passes that information to another neural network that produces a structure
• Highly accurate
• AlphaFold Database => for protein structure predictions

Neural network:
• Machine learning model inspired by the structure and function of the human brain
• Can learn to perform various tasks
How do neural networks work?
• Information processing through a series of interconnected nodes (artificial neurons), which are organized into layers
• Each node receives input from other neurons, processes the input using a simple mathematical operation, and then transmits the result to the neurons in the next layer
• Weight = strength of the connection between neurons
  o determines the influence that each neuron has on others => weights are adjusted during training for a specific task
1. Input: data, e.g., an image
2. Processing: hidden layers calculate the weighted sum of the inputs, add the bias (a trainable offset that shifts the activation threshold), and execute an activation function. Net input = Σ (Input × Weight) + Bias
3. Activation: transformation of the output of each neuron by an activation function (fire or not)
4. Training: comparing the output to the desired output for a given input => adjusting the weights of the connections between neurons to minimize the difference
5. Output: after the training, the network can be used to make predictions

Machine Learning & Deep Learning:
Subfields of artificial intelligence that focus on training computers to perform tasks without being explicitly programmed to do so
Feature Extraction = identifying the important variables in the data

| | Machine Learning | Deep Learning |
|---|---|---|
| Complexity | Simpler | More complex, multiple layers of artificial neural networks |
| Data Requirement | Small to moderate datasets | Very large datasets => more suited for complex problems |
| Feature Extraction | Performed by a human engineer | The model learns to extract features automatically from raw data |
| Automation | Requires more human input & supervision | Highly automated |

Evolutionary constraints of the protein structure:
• The structure of a protein is conserved across different species
• By measuring coevolution, we can infer contacts in proteins => sequences can change over time, but the link between 2 molecules (e.g., a receptor and its ligand) will stay the same
• Contacts in proteins are evolutionarily conserved and encoded in an MSA due to coevolution
  o Coevolution refers to the evolution of protein sequences in response to changes in their interacting partners, leading to mutual adaptations
  o Coevolution information can be used in protein structure prediction to constrain the search space and improve the accuracy of predictions
  o By analyzing the covariation between residues in evolutionarily related sequences, coevolution information can identify residues that are likely to be in proximity in the three-dimensional structure, which helps to predict the protein's structure

Geometric constraints of the protein structure:
• Contact distance matrix as a tool to describe the spatial relationship between residues in a protein
• Represents the minimum distance between pairs of residues in a protein 3D structure
• If two residues are in proximity, the entry in the matrix is set to a small value

Spatial Graph:
• A folded protein ~ spatial graph; residues are the nodes and distances between them are the edges
• Important for understanding the physical interactions within proteins and their evolutionary history
• Captures the local and global structural features of a protein
• Can be used as an input for various computational methods
  o The AF2 neural network system interprets the structure of this graph by using evolutionarily related sequences, the MSA, and a representation of AA residue pairs
  o AF2 = AlphaFold2
• Nodes can be assigned various attributes, such as AA type, secondary structure, etc., which can be used to incorporate additional info about the protein's structure
• Edges can also be weighted to represent the strength of the interaction between residues => to identify structural interactions, such as hydrogen bonds, etc.

How does AlphaFold work?
1. User input
2. Database search
• Encoding the AA sequence into a compact and dense representation using a neural network (a representation of the MSA) => input to another neural network, which predicts the spatial relationships between residues in the protein (a representation of the pairs of residues). These relationships are represented by a contact map (shows which residues interact with each other).
• AF2 takes the AA sequence and an MSA as input to learn a rich “pairwise representation” that is informative about which residue pairs are close in 3D space (highly relevant for model accuracy!).
  o The sequence of the target protein is compared across a large database. The underlying idea is that, if two amino acids are in close contact, mutations in one of them will be closely followed by mutations of the other, to preserve the structure.
• Performing a template search to identify proteins that may have a similar structure to the input.
  o Proteins mutate and evolve, but their structures remain similar. The conservation occurs on a smaller scale, where pieces of the protein remain mostly unchanged while their surroundings evolve. It is possible to identify these conserved fragments and use them as a guide to construct the structure.
• The top 4 templates serve as the starting position for the prediction model
3. Prediction model
• Next, the contact map (proximity of residue pairs) is converted into a more detailed representation of the protein's structure, the distance map, which describes the exact distances between pairs of residues in the protein.
• The prediction model network has two main parts:
  o Transformer (Evoformer blocks), which identifies which pieces of information are more informative and refines both the MSA and the pair representation
  o Structure module, which uses the refined MSA and pair representation to build a 3D model
• The modules are repeated via a recycling process, where the predicted 3D structure is used as input for a new iteration. By default, three recycling runs are done.
  o The structure module treats each amino acid residue as a triangle, representing the three atoms of the backbone. The triangles float around in space and are moved to form the structure.
  o At the beginning, all residues are placed at the coordinate origin. At every step of the iterative process, a set of transformations displaces and rotates the residues in space. This representation does not reflect any physical or geometrical assumptions, and can therefore result in structural violations.
  o The structure module also generates a model of the side chains, whose positions are parametrized by a list of torsion angles
4. Relaxation + Output
• 3D structure generation is an iterative process and yields five models, which can be relaxed and are ranked according to the model confidence.
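
The contact map and distance map mentioned above can be illustrated with a small sketch. This is not how AlphaFold2 computes its pair representation; it only shows the two data structures. It assumes that one Cα coordinate per residue is available as an (x, y, z) tuple in Å, and that two residues count as "in contact" when their Cα atoms are closer than 8 Å. Both the 8 Å cutoff and the function names `distance_map` and `contact_map` are assumptions made for this illustration and are not taken from the notes.

```python
import math


def distance_map(ca_coords):
    """Return the symmetric matrix of pairwise C-alpha distances (in Å)."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]


def contact_map(dist, cutoff=8.0, min_separation=1):
    """Binarize a distance map: 1 if two residues are closer than `cutoff` Å.

    Pairs fewer than `min_separation` positions apart in the sequence are
    set to 0, because a residue is trivially close to itself.
    The cutoff is an illustrative assumption, not a fixed standard.
    """
    n = len(dist)
    return [[1 if abs(i - j) >= min_separation and dist[i][j] < cutoff else 0
             for j in range(n)]
            for i in range(n)]


# Toy example: four residues on a straight line, 3.8 Å apart
# (roughly the spacing of consecutive C-alpha atoms).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dmap = distance_map(coords)
cmap = contact_map(dmap)

for row in cmap:
    print(row)
```

In this toy chain, residue 1 is in contact with residues 2 and 3 but not with residue 4, which is 11.4 Å away. The distance map keeps the exact values, while the contact map only keeps the yes/no information.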
Confidence metrics of AlphaFold:
• Predicted local distance difference test (pLDDT): a local confidence score per position
  o An estimate, made by the model itself, of the lDDT score that the prediction would obtain if it were compared with the experimentally determined structure
  o lDDT: calculation of the pairwise distances between residues in the predicted and the experimental structure => calculation of the deviation between them => lower deviation, more accurate prediction
  o Range 0 to 100 (the higher, the better)
  o Useful for deciding which local features (loops etc.) are poorly modeled

Other AI Applications for Structure Prediction

RoseTTAFold:
• Combines multiple prediction methods to improve the accuracy of structure predictions
• Uses an MSA to identify homologous proteins and to predict the secondary structure of the target protein => the 2D structure info is then used to predict torsion angles, which describe the orientation of each residue in the protein's 3D structure
  o PSSM: Position-specific scoring matrix => summarizes the conservation of each position in order to correctly predict the secondary structure (just the sequence is not enough!)
▪ Matrix that summarizes the conservation of each position in a MSA of homologous proteins ▪ Each row represents a residue, and each column represents a position in the MSA ▪ Used to identify conserved residues => to infer evolutionary relationships between proteins o Psipred: uses AA sequence to predict the secondary structure of proteins ▪ Encodes the AA sequence into a numerical representation (feature vector) ▪ Feature vector is used as input to a neural network, which predicts the secondary structure Promising method for improving the accuracy of protein structure predictions ESMFold: Enables atomic resolution structure prediction Transformer-based language model => trained on a large dataset of protein structures and can capture the relationships between AA sequences and protein structures at an evolutionary scale No need to retrieve and align MSAs and templates Prediction without access to any external databases More accurate predictions than AlphaFold2 and RoseTTAFold ColabFold: Making protein folding accessible to all via Google Colab Sample from uncertainty Custom MSA/Template inputs Fast MSA generation using MMseqs2 (Many against many sequence searching) => avoids recompilation and adds an early stop criterion Increase number of recycles Biomolecular Interactions Covalent Interaction in the polypeptide chain (Primary structure) Peptide bonds linking the AA together in the chain => between the carboxyl group of one amino acid and the amino group of another  Electronegative oxygen pulls on carbonyl bound electrons, making the peptide bond stronger  Resonance (double bond character)  Strong covalent bonds  Highly stable Cis/Trans: Peptide bonds are planar => no free rotation due to the partial double bond character between N and C atoms, which creates a rigidity in the bond, preventing rotation and allowing for a planar conformation Cis: Side chains of the AA are on the same side of the peptide bond => folded/kinked shape Trans: Side chains are on 
opposite sides of the peptide bond => linear/extended shape
Unique properties of the polypeptide chain:
• Rigidity
• Variability (due to different AA side chains)
• Polar atoms at set distances

Non-covalent interactions (Secondary & Tertiary Structure)

Hydrogen bonds
Electrostatic attraction between an H atom that is covalently bound to a highly electronegative NOF atom (Nitrogen, Oxygen, Fluorine) and another NOF atom.
In the peptide chain:
• With water molecules
• Between backbone and side chain => Tertiary Structure?
• Between backbone and itself (intra-backbone) => Secondary Structure
Intra-backbone H-bonds: play a role in determining the secondary structure of proteins => they hold the protein backbone in a particular conformation (avoiding steric clashes)
• Alpha-Helix:
o Consecutive stretch of 5-40 AAs
o Right-handed spiral conformation
o 3.6 AAs per turn
• Beta-Sheet:
o Different polypeptide chain sections run alongside each other and are linked together by hydrogen bonds (each section is called a β-strand)
o β-strands consist of 5-10 AAs
o The primary structure has an alternating pattern of hydrophobic and polar AAs => extended structure, where the AA side chains alternate between the two faces of the strand, extending into space at an angle of ~90° from the face of the strand
o Adjacent β-strands form a β-sheet (parallel: both strands in N=>C direction, or anti-parallel: N=>C and C=>N)
o Twist: Most β-sheets are twisted
▪ Parallel sheets are less twisted than anti-parallel ones
▪ Anti-parallel sheets can withstand greater distortions (more stable)
o β-turn: a sharp bend in a β-strand that reverses the direction of the AA backbone (4 AAs) => commonly contains glycine and proline (small size & rigid structure)
• Loops:
o Unstructured areas connecting secondary structure elements
o Have various lengths and shapes
• Turn:
o A non-regular secondary structure that causes a reversal of direction and connects secondary structure elements
o Usually 2-6 AAs
o Internal H-bonds enable the turn
o Several types: gamma turn, inverse gamma turn, β-turn etc.

Hydrophobic interactions (Tertiary Structure)
• Due to the nonpolar nature of the side chains of certain AAs
• These AAs associate with each other and avoid water
• Formation of hydrophobic pockets/clusters within the protein structure
Hydrophobic Effect: When a hydrophobic residue or surface enters water, it disrupts the hydrogen bonds around it. The H-bonds reorganize and create a cage around the hydrophobic surface. The H2O molecules that form the cage have restricted mobility. This leads to significant losses in the translational and rotational entropy of the water molecules and makes the process unfavorable in terms of the free energy of the system. If multiple hydrophobic molecules aggregate, their water-exposed surface area is lowered, so fewer water molecules are caged (the disruptive effect is minimized => higher entropy).
In Proteins:
• Globular proteins have a hydrophobic core
• Hydrophobic AAs point inward and hydrophilic AAs point outward
• This allows the protein to stay soluble in water

Van der Waals interactions (Tertiary Structure)
• Weak attractive or repulsive forces between non-bonded atoms
• Due to fluctuations in the distribution of electrons in the orbitals, which result in temporary dipoles

Electrostatic interactions (Tertiary Structure)
• Due to the presence of positively/negatively charged AA side chains
• These charges can form ionic bonds/hydrogen bonds with other charged residues

Aromatic Interaction (Pi-Stacking or Pi-Pi-Stacking)
• Attractive, non-covalent interactions between aromatic rings, which are cyclic and planar with resonance between the ring bonds

Disulfide Bridges
• Coupling of 2 thiol groups
• Can connect areas that are distant in the sequence

Ionic Interaction (Salt Bridges)
• Arise from electrostatic attraction between 2 groups of opposite charge (acidic x basic)

These interactions enable:
• Higher-order Structures/Assemblies
• Macromolecule/Solvent & Solvent/Solvent Interactions
• Stability & 3D Structure Formation

Role of side chains
in proteins:
• Determining the protein's overall charge and hydrophobicity
• Catalyzing chemical reactions
• Determining the specificity of interactions
• Contributing to protein-protein interactions
• Regulating protein activity
Side chains:
• Nonpolar
• Polar
• Electrically charged (acidic, basic)
Glycine: Side chain is a single H atom (smallest AA)
Proline: Side chain has a cyclic structure => high rigidity compared to other AAs
=> Glycine and proline may act as disruptors of alpha-helices and β-sheets => Turns

Secondary Structure Prediction
DSSP (Define Secondary Structure of Proteins) Algorithm: Assigns a secondary structure to the AAs of a protein, given the atomic-resolution coordinates of the protein
• 80% accuracy
• Evolutionary conservation of secondary structures
• Some AAs are found more often in specific forms of secondary structure, such as proline and glycine in β-turns and valine and isoleucine in β-sheets
• Secondary structure also depends on tertiary structure

How to physically define/describe these interactions?

Classical vs. Quantum Mechanics
Model: Classical
• Interest: Protein motion
• Reason of choice: No changes in molecular 2D topology
Model: Quantum
• Interest: Reaction mechanism & spectroscopy
• Reason of choice: Bond breaking & formation => changes in molecular 2D topology; electronic excitations => electronic states need to be considered

Classical Mechanics:
Motion of a macroscopic object = motion of its center of mass
Newton's Second Law of Motion: The acceleration of an object is directly proportional to the net force acting on it and inversely proportional to its mass (F = m*a)
• Does not work in the relativistic regime (speed of light is constant for all observers)
• Does not work at the microscopic level and cannot explain subatomic motion (movement of electrons, protons etc.
within an atom)

Born-Oppenheimer Approximation:
• Separates the motion of the nuclei and the electrons in a molecule
Assumption: Nuclei are >1000 times heavier than electrons, so they move more slowly, and the electrons respond instantaneously to the changing positions of the nuclei. The nuclei therefore experience only the averaged field of the electron movement. The movement of the nuclei can be described by classical mechanics within the static field of the electrons.
• If the electronic structure of a system does not change and only the positions of the nuclei change, THEN the system's properties can be described by the laws of classical mechanics!

Quantum Mechanics:
• Describes the behavior of matter and energy at the smallest scales (microscopic level)
• Uses probability and Wave-Particle-Duality to describe the behavior of subatomic particles
Heisenberg's Uncertainty Principle/Heisenberg Equation: Certain pairs of properties of a particle (position, momentum) cannot be precisely known at the same time
Schrödinger's Equation: Yields the wavefunction, which gives the probabilities of finding a particle in a particular location or state
Hamiltonian Operator: a set of operations concerning the interactions that dictate the state of the system

Covalent vs. non-covalent interactions (Type: Origin):
Covalent structure-based interaction
• Bond (1-2): Bond orbital
• Angle (1-3): Orbital hybridization (hybrid orbitals with different energy levels & geometric shapes, e.g., carbon)
• Dihedral (1-4): Resonance stabilization (multiple resonance forms => lower overall energy)
Non-covalent structure-based interaction
• Dihedral: Steric overlap
• Hydrophilic/hydrophobic: Hydrophilic: monopole, dipole, quadrupole etc.
• Electrostatic (short and long range) => distance dependent: charge-charge, charge-dipole, dipole-dipole, charge-induced dipole, dispersion
=> For a bond, 2 atoms interact with each other; for an angle, 3 atoms; and for a dihedral angle, 4 atoms!
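The 1-2, 1-3 and 1-4 terms listed above can be made concrete with a little vector geometry. The following is a minimal Python sketch (not taken from the lecture) that computes a bond length, a bond angle and a signed dihedral angle from Cartesian coordinates. The function names and the four example coordinates are illustrative assumptions.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def bond_length(p1, p2):
    """1-2 term: distance between two atoms."""
    return _norm(_sub(p2, p1))

def bond_angle(p1, p2, p3):
    """1-3 term: angle in degrees at the central atom p2."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cos_theta = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding errors
    return math.degrees(math.acos(cos_theta))

def dihedral_angle(p1, p2, p3, p4):
    """1-4 term: signed torsion in degrees around the p2-p3 bond."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = _norm(b2) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical four-atom chain (coordinates in Angstrom, for illustration only)
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.5), (0.0, 1.0, 1.5)]
print(bond_length(atoms[1], atoms[2]))           # uses 2 atoms
print(bond_angle(atoms[0], atoms[1], atoms[2]))  # uses 3 atoms
print(dihedral_angle(*atoms))                    # uses 4 atoms
```

In this convention a planar trans arrangement gives a dihedral of 180° and a cis arrangement gives 0°, which matches the cis/trans description of the peptide bond.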
Electronic structure => Atomic orbitals => Specific bond types
Geometry restrictions => Specific spatial positions of atoms & their nuclei

Quantum Mechanics (nucleus + orbitals) vs. Classical Mechanics (nucleus + parameterized interactions):
Covalent
• Quantum Mechanics: Sharing of electrons to form electron pairs between atoms => Wave-Particle-Duality. Atomic orbitals have a specific 3D shape => induction of a specific geometry of the bond and its surroundings
Bonds
• Quantum Mechanics: Optimal bond length, depending on the potential energy of the bond (repulsion vs. attraction); optimal bond orders; optimal bond angles
• Classical Mechanics: Optimal atom-atom distance; optimal atom-atom-atom (1-3) angles
Non-covalent
• Quantum Mechanics: Electronic interaction of atomic orbitals leads to restrictions in angles, bond lengths and torsions
Dihedral (1-4) Interactions
• Quantum Mechanics: Changes in configuration (cis-trans or D/L isoform) => require bond breakage; changes in conformation => no bond breakage
• Classical Mechanics: Optimal fixed atom-atom-atom-atom (1-4) dihedral angles

Polypeptide Chain:
• Has a resonance structure and would need quantum mechanics to describe the interaction, BUT
• Can be described by classical mechanics due to its quasi-rigid trans configuration (nearly always), because this restricts the overall conformational space considerably

Molecular Force Fields:
Simplified representation of a molecule in terms of its geometry: bond lengths, bond angles and torsion angles. The force field defines the interatomic potential energy as a function of these parameters, which can then be used to predict the behavior of the molecule in response to external forces. Force fields are useful for simulating the behavior of molecules in drug-target interactions, protein-protein interactions etc. All parameters of the final force field must be adjusted together at the end to reproduce experimental values.
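As a rough illustration of how a force field turns geometry into potential energy, the sketch below writes out commonly used textbook functional forms: harmonic bond and angle terms, a periodic torsion term, a Lennard-Jones 12-6 term and a Coulomb term. This is an assumption about a typical functional form, not the definition of any specific published force field, and all parameter values in the usage example are invented for illustration.

```python
import math

# 1/(4*pi*eps0) in kJ mol^-1 Angstrom e^-2 (charges in units of e, distances in Angstrom)
COULOMB_CONSTANT = 1389.35

def harmonic_bond(r, r0, k):
    """Bond stretching: penalty for deviating from the optimal length r0."""
    return 0.5 * k * (r - r0) ** 2

def harmonic_angle(theta, theta0, k):
    """Angle bending: penalty for deviating from the optimal angle theta0 (radians)."""
    return 0.5 * k * (theta - theta0) ** 2

def periodic_torsion(phi, k, n, delta):
    """Torsional rotation: periodic in the dihedral angle phi (radians)."""
    return k * (1.0 + math.cos(n * phi - delta))

def lennard_jones(r, epsilon, sigma):
    """Van der Waals: short-range repulsion (r^-12) and dispersion attraction (r^-6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, q_i, q_j, eps_r=1.0):
    """Electrostatics between two fixed point charges in a medium with dielectric eps_r."""
    return COULOMB_CONSTANT * q_i * q_j / (eps_r * r)

# Toy example: sum of the terms for one hypothetical set of internal coordinates.
# Every number below is made up for illustration, not a real parameter set.
total = (
    harmonic_bond(r=1.55, r0=1.53, k=2500.0)
    + harmonic_angle(theta=math.radians(112.0), theta0=math.radians(109.5), k=400.0)
    + periodic_torsion(phi=math.radians(60.0), k=5.0, n=3, delta=0.0)
    + lennard_jones(r=4.0, epsilon=0.5, sigma=3.4)
    + coulomb(r=4.0, q_i=0.3, q_j=-0.3)
)
print(total)
```

The point of the sketch is the additive structure: each term depends on only two, three or four atoms, so the total energy is cheap to evaluate compared with a quantum mechanical calculation.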
Examples of non-polarizable atomistic force fields: CHARMM (small molecules and macromolecules), GROMOS (aqueous or non-polar solutions of proteins etc.), AMBER (proteins/DNA), OPLS (parameterized to reproduce experimental properties of liquids)
• Represent the interactions between atoms in a molecular system (molecular model based on classical mechanics to study geometric properties of molecules)
• Do not explicitly consider polarization effects (fixed and averaged charges, which do not change in response to electric fields)
• Faster simulations & less computationally demanding

Charge-Charge Interactions (Monopole-Monopole):
Monopole: Ions => Molecules with a permanent & localized charge
Dipole: Molecules with permanent & induced dipoles
• Long-range (1/r)
• Found in all biochemical systems
Estimating the strength (energy) of the Glu-/Lys+ ion pair:
Dielectric constant: Measure of the ability of a material to store electrical energy in an electric field, e.g., $\varepsilon_{\text{water}} = 80$ => a charge is shielded by a factor of 80 in water (good insulator). Charge-charge interactions are ca. 8-20 x stronger in the protein interior than in water ($\varepsilon_{\text{protein interior}} = 4 - 10$).

Ionic Interactions (Salt Bridges):
• Arise from electrostatic attraction between two groups of opposite charge (e.g., side chains of the polypeptide chain)

Dipole interactions:
• Polar molecules have a permanent electric dipole moment
Dipole moment: Measure of the separation of positive and negative electrical charges in a system (measure of the system's overall polarity)
Dipole: Separation of the positive and negative electrical charges
Linear Ion-Permanent Dipole Interaction: Between a linear polar molecule and an ion. It results from the permanent electric dipole moment of the polar molecule and the electric field created by the charged ion, leading to an attractive or repulsive force between them.
Dipole-Dipole Interaction: When two dipolar molecules interact with each other.
The partially negative portion of one polar molecule is attracted to the partially positive portion of the second.
Dipole-induced Dipole Interaction: When a polar molecule induces a dipole in an atom or nonpolar molecule by disturbing the arrangement of its electrons. The direction of the permanent dipole affects the direction of the induced dipole.
Two induced dipoles (Temporary dipole/Dispersion Interaction): Between two neutral molecules/atoms due to their fluctuating or temporary electric dipoles. The random motion of the electrons in a molecule creates a momentary dipole. If two such molecules are close enough, the fluctuating dipole of one molecule induces a dipole in the other (London equation).
=> All electrostatic interactions are treated by the classical Coulomb potential:
$E_{\text{Coulomb}} = \frac{q_i q_j}{4 \pi \varepsilon_0 \varepsilon_r r_{ij}}$

Hydrogen Bond:
• Electrostatic attraction between a hydrogen atom covalently bound to a highly electronegative NOF atom and another NOF atom
• Reason: Permanent dipole and dispersion interactions and/or partial orbital overlap (lone pair: pair of electrons in the outermost shell of an atom which is not involved in bonding; when the electron cloud of one atom overlaps with that of the other => partial orbital overlap)
• Can be described by CM by applying a $1/r^6$-dependent functional form

Hydrophobic Interactions:
No direct attractive interactions, mainly caused by:
• Optimization of the surrounding hydrophilic interactions
• Increase of the internal entropy of the system
They are intermolecular forces that arise between nonpolar substances in the presence of water. They are driven by the tendency of nonpolar molecules to minimize their contact with (polar) water => formation of clusters or aggregates of nonpolar molecules.
Surface tension: Cohesive forces that exist between the molecules of a liquid at its surface, causing the surface to behave like an elastic sheet. A natural tendency of a system is to increase its entropy (2nd law of thermodynamics) over time.
In the case of a liquid, the surface molecules have a higher degree of randomness (entropy) than the bulk molecules, because they are exposed to the environment and are free to move in any direction => Surface tension is the balance between the entropy and the cohesive forces. By forming clusters (hydrophobic interaction), the randomness of the nonpolar molecules is reduced.
Example: Formation of lipid bilayers in the cell membrane => When the lipids are placed in water (cytoplasm), they tend to aggregate or cluster together due to hydrophobic interactions. This lowers the entropy of the lipids themselves by decreasing the number of their possible arrangements, but it releases caged water molecules, so the entropy of the system as a whole increases.

Ligand-Protein Interactions

Molecular Docking (Structure-Based Methods)
Docking: Computational methods to predict the binding of a small molecule to a protein target
• Determination of the optimal orientation & conformation of the small molecule within the active site of the protein target
• Consideration of the complex interactions between these 2 molecules
• Starting point: atomic coordinates of the two molecules
• Additional data (biochemical, mutational, conservation etc.) may be provided to improve the performance
Different types: Protein docking, peptide docking, small molecule docking

Docking methodology:
Representation of the system/Site Characterization
Ligand Binding Site: Region on a macromolecule that binds to another molecule with specificity.
Binding partner = Ligand

DOCK: molecular docking software program
• Predicts the binding mode and affinity of small molecules to proteins
• Systematic search of the possible binding conformations and energies of small molecules with respect to a protein structure, to identify the most energetically favorable binding mode
• Represents the binding site as a collection of overlapping spheres
• DOCK 6.9:
o Several different scoring schemes
o Ligand flexibility
o Chemical properties of the receptor (each sphere is assigned a chemical characteristic)

Glide (Schrödinger)
• Represents the binding site as a grid
• Grid map: Each type of atom is placed at each individual grid point and the change in free energy is calculated

The search algorithm (Searching for lead compounds)
Conformational Search: Exploring the possible 3D orientations and conformations of a small molecule in relation to a target protein to find the binding pose with the most energetically favorable interaction
• The number of ways 2 molecules can be put together makes docking difficult
• The size of this search grows exponentially with the size of the ligand molecule
• Flexibility of the binding partners makes it even more complex
During the search, the docking software generates a series of different conformations for the small molecule and then evaluates the interaction energy between target and ligand, considering factors such as electrostatic interactions etc.
Minimization: Process of refining the 3D orientation or conformation of a small molecule in its binding pose with the target protein to reduce the total energy of the protein-ligand complex. Minimization is performed after a conformational search has identified a set of binding poses that are energetically favorable. This can involve adjusting the positions of atoms to minimize the repulsive forces between atoms and to maximize the attractive forces between atoms, such as van der Waals forces and hydrogen bonding.
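The idea of minimization described above can be shown on the smallest possible example. The sketch below is an illustrative assumption, not the algorithm of any particular docking program: it uses plain steepest descent to relax the distance between two atoms that interact through a Lennard-Jones potential (reduced units, epsilon = sigma = 1). Real docking codes minimize over many coordinates and often use more elaborate optimizers.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 energy of two atoms at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative dE/dr of the Lennard-Jones energy."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r

def steepest_descent(r0, step=0.01, tol=1e-8, max_iter=10000):
    """Move downhill along the negative gradient until the gradient vanishes."""
    r = r0
    for _ in range(max_iter):
        g = lj_gradient(r)
        if abs(g) < tol:
            break
        r -= step * g  # a step against the gradient lowers the energy
    return r

r_start = 1.5  # hypothetical starting distance from a coarse search
r_min = steepest_descent(r_start)
print(r_start, lj_energy(r_start))
print(r_min, lj_energy(r_min))
```

The minimizer balances repulsion and attraction and ends at the analytic optimum r = 2^(1/6) * sigma. Because it only ever moves downhill, it finds the minimum nearest to its starting point, which is why it is combined with a broader conformational search.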
Local minima & Global minima: A local minimum is a conformation at which the energy of the protein-ligand complex reaches a minimum value relative to neighboring conformations, but this minimum is not necessarily the global (lowest overall) minimum. During the conformational search & minimization process, the docking software may encounter multiple minima.

Monte Carlo Simulated Annealing (Metropolis Monte Carlo): Optimization technique to find the global optimum of a complex problem. It is a combination of Monte Carlo methods and the simulated annealing algorithm.
• The ligand performs a random walk around the protein (initial solution & temperature)
• At each step, a random displacement, rotation etc. is applied, and the energy is evaluated and compared to the previous energy (generation of candidate solution & energy evaluation)
• If the new energy is lower, the step is accepted (acceptance criterion & update of the solution)
• If the new energy is higher, the step is accepted with the probability $\exp(-\Delta E / kT)$
• Reduce the temperature according to a pre-defined cooling schedule
• Check the termination criteria, such as max iterations, min temperature etc.
• Repeat these steps until the termination criteria are met & return the final solution
If performed at a constant temperature => Basic Monte Carlo
In simulated annealing, the temperature is lowered after a specified number of steps and the search is repeated. As the temperature continues to go down, steps that increase the energy become less likely and the system moves to the minimum (or minima).
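The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the one-dimensional `energy` function is a made-up landscape with several local minima standing in for a docking score, and the cooling schedule, step size and seed are arbitrary choices rather than the settings of any docking program.

```python
import math
import random

def energy(x):
    """Hypothetical 1D energy landscape with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)

def acceptance_probability(delta_e, kT):
    """Metropolis criterion: always accept downhill, sometimes accept uphill."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / kT)

def simulated_annealing(x0, kT_start=2.0, kT_end=0.01, cooling=0.95,
                        steps_per_temperature=100, max_step=0.5, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    kT = kT_start
    while kT > kT_end:                        # termination: minimum temperature
        for _ in range(steps_per_temperature):
            x_new = x + rng.uniform(-max_step, max_step)  # random displacement
            e_new = energy(x_new)
            if rng.random() < acceptance_probability(e_new - e, kT):
                x, e = x_new, e_new           # accept the step
                if e < best_e:
                    best_x, best_e = x, e     # remember the best state seen
        kT *= cooling                         # geometric cooling schedule
    return best_x, best_e

start = 4.0
best_x, best_e = simulated_annealing(start)
print(start, energy(start))
print(best_x, best_e)
```

Keeping `kT` fixed instead of multiplying it by `cooling` turns the same loop into basic Metropolis Monte Carlo. At high `kT` almost every uphill step is accepted, so the walker can leave local minima; at low `kT` the loop behaves almost like a pure downhill search.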
Genetic algorithms: Estimating the effect of modifications
• Stochastic search method (heuristic search algorithm) to generate solutions to optimization and search problems
• Operate on a population, based on natural selection and genetics: individuals within a population exhibit variation and pass their traits on to the next generations; new mutations can be included
• Encoding of the solution, structure or problem as a genome or chromosome represented by a bit array: the conformations are treated as individuals in a population, each with a fitness score that reflects its energy
• State variables (genes): Parameters describing the translation, rotation, and conformation of the ligand
Procedure:
1. Generation of a population of genomes (random initial population)
2. Application of crossover and mutation to the individuals to create new populations
• Single Point Crossover: A point on both parents' chromosomes is picked randomly and designated a 'crossover point'. Bits to the right of that point are swapped between the two parent chromosomes. This results in two offspring, each carrying some genetic information from both parents.
• Two Point Crossover: Two points are picked randomly on the parent chromosomes. The bits in between the two points are swapped between the parents.
• Uniform Crossover: Each bit is chosen from either parent with equal probability. Other mixing ratios are sometimes used, resulting in offspring which inherit more genetic information from one parent than the other. The chromosome is not divided into segments; each gene is treated separately. In essence, a coin is flipped for each gene to decide which parent it is taken from.
At each step, the algorithm uses the individuals in the current generation to create the next generation
3. The fitness of each structure is evaluated by an estimate of the binding free energy
4. The best member(s) survive to the next generation
5. This procedure is repeated for some number of generations or energy evaluations
Binary crossover and mutation can introduce inefficiencies into the algorithm, since they can easily drive the system away from the region of interest => the algorithm gets stuck in a suboptimal solution, because the binary crossover only explores small regions of the solution space
Solution: mapping the local search back onto the genetic representation
• Genetic searches are often combined with local searches
• Lamarckian Genetic Algorithm: Allows the transfer of information between generations, so that the algorithm can learn and evolve over time.

The scoring or energy evaluation routine
All search methods involve evaluating the fitness/energy of a given binding conformation.
• A good scoring function is needed that can give an accurate estimate of the binding free energy
• Must be fast and efficient
1. First Principles Methods
2. Semiempirical Methods
3. Empirical Methods
4. Knowledge-based potentials

Clustering:
Docking algorithms produce an ensemble of predictions. Each predicted structure has an associated energy.
• Consideration of the binding free energy (or enthalpy) and the relative population
• Clustering the data based on some "distance" criterion, such as RMSD (standard choice)
• Gives a sense of the similarity between predictions
• General understanding of binding

Bound vs. Unbound Docking
Bound docking: computational method used to predict the binding mode of a small molecule (ligand) to a protein (receptor) when the binding site is already known. The protein and ligand structures are fixed, and the program calculates the binding pose and energetics of the complex. The goal is to find the lowest energy conformation of the complex.
• Reproduces a known complex (bound conformation), where the starting point is the atomic structures from a co-crystal
• No conformational change needed for binding
Unbound docking: predicts the binding mode of a small molecule to a protein in the absence of any bound ligand. Unlike bound docking, this considers the flexibility of both the ligand and the protein, allowing for the prediction of the most probable binding mode in a dynamic, unrestrained system.
• Starting point is the structures in their unbound conformation (such as the native conformation from experimental data or a model)
• Predicting the binding mode (exploring the conformations of the ligand within the target)

Rigid vs. Flexible Docking
Rigid:
• Does not allow conformational changes in the protein structure
• Computationally easier
• Faster
Flexible:
• Allows conformational changes within the protein structure
• Computationally intensive
• Slower

Protein Flexibility:
• A protein can adopt multiple conformations, so it is challenging to predict the most biologically relevant conformation for binding
• Dynamic interactions between the 2 proteins can be complex
• Flexible regions of proteins can change shape during the binding process
Conformation of unbound protein ≠ Conformation of bound protein
The docking problem for 2 proteins is very difficult, since the search space is extremely large (all possible relative conformations). BUT for a small molecule (drug, peptide, or ligand) binding to a protein, we have a chance of exploring the conformational space!

Induced fit docking: Models the changes in protein conformation that occur during the binding process
• Unbound (initial) state of the protein target
• Docking of the ligand to the protein
• Change of conformation (optimizing the binding mode)
• Repeat multiple times until the optimal final binding mode is achieved
Induced fit: biological concept by which a protein changes its conformation in response to the binding of the ligand.
Summary: Ligand Docking Programs
• To date, >60 docking programs are available
• Validation of a docking protocol is needed through:
o Reproduction of an experimental binding geometry: the result of the docking calculation should be comparable with the conformation retrieved from X-ray analysis
o Comparison by calculating the RMSD
• A well-working program can reproduce 70% of experimental data
• The binding geometry of a ligand (orientation & conformation) is called a "pose". A docking calculation can provide several poses for the same ligand, which are ranked based on their scoring value.
Examples: AUTODOCK, GLIDE, DOCK, SWISSDOCK etc.

Thermodynamics
For the macroscopic system:
$E_{\text{total}} = E_{\text{kin}} + E_{\text{pot}}$
Kinetic energy = Energy of motion
Potential energy = Static energy resulting from the position & internal properties
Macroscopic system = Motion uniquely determined by the force acting on the system. The macroscopic properties of a system can be calculated from the microscopic properties of its constituent particles.
Microscopic system = Statistical distribution of the motion of the individual particles (molecular, atomic level); the behavior of individual particles, such as atoms/molecules, is considered rather than the average behavior of many particles.
Statistical thermodynamics: Relationship between thermodynamic properties and the statistical behavior of many particles in a thermodynamic system.
Internal energy (U): total energy of a thermodynamic system, including the kinetic and the potential energy of the microscopic system and the random, disordered motion of molecules in the microscopic state. It is a state function, meaning that its change depends only on the initial (A) and final (E) states of a system. Changes in internal energy can be due to the transfer of heat (q; energy transferred to the motion of atoms and molecules) or to work (w; energy transferred to the motion of objects).
$\Delta U = U_E - U_A$
$\Delta U = q + w$
First law of thermodynamics: Energy is conserved; it can be neither created nor destroyed.
Enthalpy: The sum of the internal energy of a thermodynamic system and the product of its pressure and volume. It is a state function. It is often used for expressing the total energy of a system.
$H = U + PV$
In chemical reactions, the change in enthalpy is equal to the heat absorbed or released by the system under constant pressure.
$\Delta H = q + w + \Delta(PV)$
Under constant pressure: $\Delta H = q + w + P \Delta V$
The only work is done by pressure ($w = -P \Delta V$): $\Delta H = q - P \Delta V + P \Delta V = q$
Second law of thermodynamics: Any spontaneous process increases the disorder of the universe.
Entropy = Measure of molecular disorder; it is a state function. The sum of the entropies of a system and its surroundings is constantly increasing unless external work is performed to counterbalance this increase.
$\Delta S = S_E - S_A$
Gibbs Free Energy (Free enthalpy): Enthalpy of a system minus the product of its temperature and entropy. It represents the maximum amount of work that can be performed by the system under constant temperature and pressure conditions.
$\Delta G = \Delta H - T \Delta S$
NPT (isothermal-isobaric ensemble): constant number of particles, pressure, temperature (standard lab conditions)

Ligand-Protein Binding Energy: The energy released or absorbed during the formation of a complex between a ligand (inhibitor, I) and a protein (enzyme, E). The energy is due to the attractive forces between them. A negative binding free energy indicates that free energy is released upon binding (spontaneous). The more negative the binding free energy, the stronger and more stable the complex.
$\Delta G_{\text{Binding/Solution}} = \Delta G_{\text{Binding/Vacuum}} + \Delta G_{\text{Solvation}}(EI) - \Delta G_{\text{Solvation}}(E+I)$
Loss of the conformational entropy of the ligand: Disorder associated with the different possible conformations; the solvent can influence the freedom of movement of the ligand.
• Hydrophobic solvent: large degree of conformational freedom => high conformational entropy
• Hydrophilic solvent: restricted conformational freedom => lower conformational entropy
Increase in the entropy of the solvent molecules through the ligand-protein binding (release of the ligand from the solvent cage; the solvent molecules are now free to move)
=> These entropic contributions are movement/temperature-dependent properties (kinetic energy related). They are approximated by fitting the calculated energies to experimental data => Scoring function
Stabilization of the protein-ligand complex through non-bonded ligand-protein interactions and the torsional energy of the ligand
=> These contributions are static properties (potential energy). They are calculated explicitly by placing the ligand in the binding site

Scoring function: Mathematical model used to estimate the binding affinity between a protein and a ligand. Scoring functions are based on various physical and chemical properties of the protein and ligand, such as their size, shape, electrostatic interactions, hydrogen bonding, and van der Waals interactions. The function calculates a numerical score for each interaction, and the ligand-protein complexes with the highest scores are the most favorable.
• Take non-bonded force field terms and parameters (the AMBER force field provides parameters for the description of protein molecules etc.)
• Approximate the entropy loss of the ligand and the entropy increase of the solvent
• Fit the binding terms against experimental binding data
=> Simplifying the calculation of the free enthalpy => Calculation of a modified potential energy (scoring function-based)

Types of scoring functions:
Empirical: Based on experimental data; use statistical models to predict the binding affinity of a protein-ligand complex. Simple functional form, fitted to experimental data. Accurate predictions for well-understood systems, but they do not generalize well to new systems.
o Based on counting the number of various types of interactions between the two binding partners (sets of weighted energy terms)
o Simplified additive interaction potentials for the ligand-protein interaction + dependency on atom-atom distances and angles included via a simple functional form (geometrical parameter optimization)
o $\Delta G$ parameter optimization: Calculating the score for a series of X-ray structures for which the binding affinities are available and then comparing the results (measured values vs. scoring function parameters)
Knowledge-based/Statistical-potential based: Based on prior knowledge and experience to predict the binding affinity of a protein-ligand complex. Statistical potentials based on structural data from the PDB. Useful for understanding the underlying principles but may not provide accurate predictions.
o Based on statistical observations of intermolecular close contacts in large 3D databases
o Energy potentials that are derived from the structural information embedded in experimentally determined atomic structures
Physical force field-based: Based on the principles of molecular mechanics and on precise physics-based descriptors. Most accurate, but CPU demanding (CPU: central processing unit, the "brain" of the computer).
o Protein-ligand electrostatic interactions
o Van der Waals interactions
o Lipophilic contact interactions
o Polar and nonpolar solvation contributions
o Ligand torsional entropy contribution (entropy associated with rotational degrees of freedom in a molecule): A ligand free in solution has many possible rotational conformations => higher entropy. When the ligand binds to a protein, some of this freedom may be restricted, leading to a decrease in entropy => an unfavorable contribution to the binding free energy that must be compensated by the other terms.
Assumption: Contributions to binding are linearly combined.
Machine Learning based: Use of machine learning algorithms to predict the binding affinity.
They are trained on large datasets of experimentally determined binding affinities. They can provide highly accurate predictions for systems similar to their training data.

Energy landscape: graphical representation of the potential energy of a system as a function of its conformation.

Proteins in Motion

Molecular Dynamics (MD) Simulations
The motion of an object can be fully described by knowing its position as a function of time.
• The position and velocity of every atom within the system must be known as a function of time
• The interatomic forces and their local and global effects must be accounted for (force field description)
• Velocity changes resulting from forces acting on an object => Acceleration
Newton's Second Law: F = m*a
MD simulations rely on Newton's equations of motion for every particle
o F is the force acting on a particle i
o m is the mass
o a is the resulting acceleration
Forces acting on each atom are calculated iteratively (iterative: repeating a process)
MD and Force Fields:
The force field energies are calculated as the sum of the energy functions accounting for bonds, angles, and electronic effects.
• Bond stretching
• Bond angle bending
• Torsional rotation
• Non-bonded interactions
• The coupling of stretching, bending and rotation

Verlet Algorithm:
• Simulation of the motion of particles: calculation of the position and velocity of particles over time based on their interactions with each other and with external forces
• Uses the positions of the particles at two points in time (the current and the previous time step), together with the current accelerations, to calculate the positions at the next time step
• Computationally simple, but the velocities are not propagated explicitly and have to be estimated from position differences, so the velocity approximation can be erroneous
• Two adaptations to reduce the errors:
  o Leap Frog Algorithm (different approach to integrating the equations of motion: velocities are evaluated at half time steps, positions at full time steps)
  o Velocity Verlet Algorithm (takes the acceleration at the start and at the end of each time step into account, giving more accurate results)
• Calculation (position update of the basic Verlet scheme): $r(t + \Delta t) = 2\,r(t) - r(t - \Delta t) + a(t)\,\Delta t^2$

Periodic Boundary Conditions (PBCs): Boundary conditions used for approximating a large system by a unit cell that can be tiled to fill 3D space like a crystal lattice. This unit cell is copied in all directions to produce an infinitely repeating system. Any particle that crosses the boundary of the simulation box re-enters at the opposite boundary.

Water Models: Used to study the interactions and dynamics of water molecules; important for mimicking the behavior of real water.
• SPC/E (Extended Simple Point Charge)
• TIP3P, TIP4P (Transferable Intermolecular Potential with … Points)
• OPC (Optimal Point Charge)

Ensembles: The measurement conditions of any molecular dynamics system can be defined using a statistical ensemble. A statistical ensemble describes the probability distribution of the state of a system based on the average of certain conserved variables over infinite time.
Variable properties:
• Pressure
• Energy
• Volume
• Temperature
A statistical ensemble is a collection of microstates, each representing a possible state of the system.
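Returning to the integration schemes: as a minimal sketch of the Velocity Verlet variant named under the Verlet algorithm, the following Python example propagates a single particle in one dimension. The harmonic force, the unit mass, the time step and all function names are illustrative assumptions and not part of these notes; an MD engine applies the same update to every atom using forces derived from the force field.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one particle in 1D with the Velocity Verlet scheme."""
    a = force(x) / mass                      # Newton's second law: a = F / m
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # new position from current v and a
        a_new = force(x) / mass              # force evaluated at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity from the averaged acceleration
        a = a_new
        trajectory.append((x, v))
    return trajectory


def harmonic_force(x, k=1.0):
    """Toy force (assumption): a spring with force constant k."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                       mass=1.0, dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
print("numerical x(t=10):", x_end, "analytical:", math.cos(10.0))
print("energy drift:", total_energy(x_end, v_end) - total_energy(1.0, 0.0))
```

Comparing the final position with the analytical solution and monitoring the total energy are common sanity checks for an integrator: with a sufficiently small time step the energy should fluctuate around its initial value instead of drifting.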
Types of Ensembles:
• Microcanonical Ensemble (NVE): The total energy of the system is fixed (no energy transfer with the outside, isolated), and the number of microstates is proportional to the volume in phase space. Best suited for isolated systems where the energy is conserved.
  o Total number of particles, total volume and total energy are constant
• Canonical Ensemble (NVT): The temperature is fixed, and the number of microstates is proportional to the volume in phase space. Used to describe the thermodynamics of systems in thermal contact with a heat bath (energy can be exchanged with the surroundings).
  o Total number of particles, volume and temperature remain constant
• Grand Canonical Ensemble (µVT): Chemical potential (µ), volume and temperature are fixed, and the number of microstates is proportional to the volume in phase space. Used to describe systems in contact with a reservoir of particles (energy and particles can be exchanged).
  o Volume remains constant; the number of particles can change
  o Chemical potential can be used to calculate the rate of a chemical reaction or phase transition through changes in particle numbers
• Gibbs or NPT (isothermal-isobaric ensemble): Energy can be exchanged with a heat bath to control the temperature. The volume of the system can change, with the pressure of the system being controlled by a barostat to match the pressure exerted by its surroundings.
  o Number of particles remains constant
  o Used for free energy calculations of chemical reactions (constant temperature and pressure) or integration?

Straightforward MD Simulation: Models the behavior of molecules by solving Newton's equations of motion for each atom in the system. Based on the principles of classical mechanics. The conditions of the surrounding environment are not considered. These simulations are not suited to calculating the Gibbs free energy of the protein-ligand complex in solution.
Because:
• Timescale not accessible
• Energy barriers might be high => bad statistical sampling
• Special techniques are needed
• They do not account for the effect of the environment on the molecule’s behavior

Approximating Methods (Enhanced Sampling Techniques):
• Umbrella Sampling: Calculation of the free energy of a system as a function of a certain collective variable (reaction coordinate). A bias is imposed on the reaction coordinate, forcing it to stay within the desired range. This yields a complete picture of the free energy profile even for complex systems.
• Thermodynamic Integration: Gradually transforming the system from a known reference state to the desired state and using this information to estimate the free energy difference between the 2 states.
• Metadynamics: Calculation of the free energy as a function of one or more collective variables. A time-dependent bias potential is introduced, designed to explore regions of the collective variable space that are poorly sampled in a standard MD simulation.
=> Very time consuming, 1 ligand takes days to weeks to be calculated!
=> Not useful for practical drug design applications

Chemoinformatics

Ligand-Based:

Quantitative Structure-Activity Relationships (QSAR): Relation of the physical and chemical properties of a molecule to its biological activity or toxicity. Predictive models are developed from the available data on the chemical structure of a molecule and its activity. QSAR models can be developed using a variety of statistical methods, including linear and non-linear regression, principal component analysis (PCA), and artificial neural networks (ANNs).

Computational representation of chemical structures:

Molecular Fingerprints (bitstring): Encoding of the molecular structure in a series of binary digits (1, 0) that represent the presence or absence of substructures in the molecule (chemical fragments). Fragments may be small groups of atoms, functional groups, or rings. These are defined beforehand.
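The bitstring idea can be illustrated with a minimal Python sketch. It assumes a tiny, hand-picked list of fragment keys and describes each molecule simply as the set of fragments it contains; both are hypothetical simplifications for illustration, since real fingerprinting software detects the fragments by substructure matching on the molecular graph.

```python
# Hypothetical fragment keys, defined beforehand; real key sets are much larger.
FRAGMENT_KEYS = ["hydroxyl", "carboxyl", "amine", "benzene_ring", "halogen"]


def structural_key_fingerprint(fragments_present, keys=FRAGMENT_KEYS):
    """Return a bitstring with one bit per predefined fragment key.

    fragments_present: set of fragment names found in the molecule
    (assumed to be known here; normally obtained by substructure search).
    """
    present = set(fragments_present)
    return "".join("1" if key in present else "0" for key in keys)


# A phenol-like molecule: one hydroxyl group on a benzene ring.
phenol_like = structural_key_fingerprint({"hydroxyl", "benzene_ring"})
# A benzoic-acid-like molecule: one carboxyl group on a benzene ring.
benzoic_acid_like = structural_key_fingerprint({"carboxyl", "benzene_ring"})

print(phenol_like)
print(benzoic_acid_like)
```

Because every bit position has a fixed meaning, two fingerprints built from the same key list can be compared bit by bit, which is what makes them useful for similarity searching.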
MACCS (Molecular ACCess System) Keys: Set of molecular descriptors for molecular fingerprinting and similarity searching. They are a set of 166 binary variables that describe various molecular features (166-bit 2D structure fingerprints). The MACCS keys are a set of questions about a chemical structure, e.g.:
• Are there fewer than 3 oxygen atoms?
• Is there an S-S bond?
• Is there a ring of size 4?
• Is at least one F, Cl, Br, or I present?
The result is a list of binary values, either true (1) or false (0). The answers are frequently written as a list of bits (also called a bitstring). For the four example questions above, the bitstring for cyclobutane is "1010".

Types:
• Non-hashed fingerprints: Encode precisely defined structural patterns.
• Hashed fingerprints: No assigned meaning for each bit. The fingerprints encode the presence or absence of substructures that are not defined beforehand.
  o Patterns (target, substructure) refer to the distribution of binary values in the fingerprint.
  o Each pattern activates a certain number of positions (bits) in the fingerprint.
  o An algorithm determines which bits are activated by a pattern. The same pattern always activates the same bits.
  o The algorithm is designed in such a way that it is always possible to assign bits to a pattern.
  o There may be collisions (different patterns activating the same bit).
  o It is not possible to interpret the fingerprints (a bit cannot be traced back to one specific substructure).
  o H-atoms are omitted.
  o Stereochemistry is not considered.
  o Compact and efficient representation of molecular structure

Parameters:
• Fingerprint length (hashing binary values into a fixed-length fingerprint) => preserves the molecular similarity information while reducing the size of the fingerprint
  o Too short => poor discrimination of molecules, almost all bits = 1
  o Too large => too much disk space required, too many bits = 0
• Size of patterns
  o Too short => poor discrimination of molecules
  o Too large => able to discriminate molecules, but many bits = 1
• Number of bits activated by each pattern
  o Too few => poor ability to discriminate between patterns
  o Too many => able to discriminate between patterns, but many bits = 1
=> For similarity search in large databases

Molecular descriptors: Ensemble of topological, electronic and geometric parameters calculated directly from the molecular structure.
A descriptor must have:
• Invariance with respect to labeling and numbering of atoms
• Invariance with respect to roto-translation (combination of rotational and translational movement in 3D)
• A specific, algorithmically computable definition
• Values in a suitable numerical range for the set of molecules

1D: Generated from a linear representation, e.g., C6H5NO2
• Constitutional descriptors:
  o Number of atoms, N
  o Absolute and relative numbers of atoms
  o Molecular weight, LogP

2D: Generated from the molecular graph (connectivity based)
• Topological indices (TI):
  o Connectivity indices
  o Based on the adjacency matrix (representation of the molecular graph of a molecule): a molecular structure with n atoms may be represented by a square n × n matrix (H-atoms are often omitted). The adjacency matrix indicates which atoms are bonded. It is constructed by defining a binary value for each pair of vertices, where a value of 1 indicates the presence of an edge (bond) between the two vertices and a value of 0 indicates the absence of an edge.
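The construction of the adjacency matrix described above can be sketched in a few lines of Python. The example molecule (the hydrogen-suppressed graph of isobutane), the atom numbering and the function names are illustrative assumptions; the vertex degrees obtained from the row sums are the quantities on which connectivity-based indices are built.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build the symmetric n x n adjacency matrix from a list of bonds.

    bonds: pairs of atom indices (i, j) that are connected by a bond.
    """
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1   # 1 = edge (bond) between vertex i and vertex j
        matrix[j][i] = 1   # the molecular graph is undirected
    return matrix


def vertex_degrees(matrix):
    """Degree of each vertex = number of bonded neighbours = row sum."""
    return [sum(row) for row in matrix]


# Isobutane without H-atoms: central carbon 0 bonded to carbons 1, 2 and 3.
isobutane_bonds = [(0, 1), (0, 2), (0, 3)]
A = adjacency_matrix(4, isobutane_bonds)
degrees = vertex_degrees(A)

for row in A:
    print(row)
print("degrees:", degrees)
```

Renumbering the atoms permutes the rows and columns of the matrix, which is why descriptors derived from it must be defined in a way that does not depend on the atom numbering (for example as a sum over all vertices).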
  o Adjacency indices
  o Zagreb group indices: describe the connectivity of a molecule based on the degrees of the vertices (atoms) in the molecular graph. Two types:
    - M1: sum of the squared degrees of all vertices
    - M2: sum of the products of the degrees of all pairs of adjacent (bonded) vertices

3D: Generated from 3D molecular structures
• Dependent on the connectivity and conformation of the molecule
• Requires 3D coordinates of the atoms
• Ovality: how closely the shape of a molecule approaches a sphere
• Polar surface area: total area of the part of the molecular surface that corresponds to polar atoms (mainly O and N, including the hydrogens attached to them)

Lipophilic Descriptors: Describe the hydrophobicity of a molecule and help predict its behavior in different environments.
• Partition coefficient n-octanol/water: LogP, LogD
• Partition coefficient (LogP): measure of the distribution of a solute (in its neutral form) between two immiscible solvents, such as water and octanol.
• Distribution coefficient (LogD): measure of the distribution of all forms of a solute (ionized and neutral) between the two immiscible solvents at a given pH.

Software: DRAGON to calculate molecular descriptors

QSAR Modeling:

1. Development

Experimental Data Preparation: Curated dataset, data cleaning
o Removal of duplicates
o Removal of mixtures
o Standardizing or normalizing the data (structure standardization)
o Similar experimental conditions
A carefully selected and cleaned dataset is used for building predictive models. The raw data are cleaned and transformed into a format suitable for QSAR modelling. This step can affect the quality/performance of the predictive model.

Descriptors: A good model should be interpretable!
• Select the most relevant descriptors
• Normalize the descriptors
• Discard highly correlated descriptors

Statistical Models: Fitting the model’s parameters. They provide a way to quantitatively describe the relationship between the molecular structure and its biological activity.
• Linear Regression Model: Use of a linear equation
• Multilinear Regression Model: Use of a linear equation with several descriptors as independent variables
• Machine Learning Methods:
  o Decision Tree: represents the relationship using a tree-based approach
  o Artificial neural network: uses a network of interconnected nodes for the representation
  o Others: Support Vector Machine, Random Forest, K Nearest Neighbors

2. Validation

Residual Sum of Squares (RSS): Used to assess the goodness of fit of a regression model. It is the sum of the squared differences between the observed values and the predicted values of the dependent variable, and it quantifies the amount of variability in the data that is not explained by the model.
$RSS = \sum_{i=1}^{N} \left(Y_i - Y_{calc,i}\right)^2$
Y_i … observed value & Y_calc … predicted value

Statistical Parameters for Regression:
• RMSE (Root Mean Squared Error): square root of the mean squared difference between the predicted values and the actual values of a data set.
• MAE (Mean Absolute Error): average magnitude of the errors in the model’s predictions.
• R² (Determination Coefficient): proportion of the variation in the dependent variable that is explained by the regression model (value between 0 and 1). A higher R² value indicates that the regression model is a better fit for the data, while a lower R² value suggests that the regression model may not be capturing all the relevant relationships in the data.

The data set is divided into 2 parts:
• Training set => used to build the model (~80 %)
• Test set => used to evaluate the model and its performance (~20 %)

Estimation of the model’s predictive performance through:
• 5-Fold Cross-Validation: The data set is divided into 5 equal-sized folds, and each fold is used once for testing while the other 4 folds are used for training. More robust evaluation.
• Leave-One-Out Cross-Validation: N separate times (N: number of observations), the function is trained on all data except for one point and a prediction is made for that point. The result is obtained by averaging the performance measures. On each fold, the test set contains only one molecule. Each observation is used for testing exactly once.
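The validation quantities above can be written down directly. The following Python sketch computes RSS, RMSE, MAE and R² and generates the index splits for k-fold cross-validation (leave-one-out is the special case k = N). The activity values are made up for illustration, and the function names and the interleaved way of assigning samples to folds are assumptions of this sketch.

```python
import math


def regression_metrics(observed, predicted):
    """Return RSS, RMSE, MAE and R2 for paired observed/predicted values."""
    n = len(observed)
    residuals = [y - y_calc for y, y_calc in zip(observed, predicted)]
    rss = sum(r * r for r in residuals)                  # residual sum of squares
    mean_obs = sum(observed) / n
    tss = sum((y - mean_obs) ** 2 for y in observed)     # total sum of squares
    return {
        "RSS": rss,
        "RMSE": math.sqrt(rss / n),
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": 1.0 - rss / tss,
    }


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


# Made-up activity values (e.g. on a pIC50-like scale), for illustration only.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
metrics = regression_metrics(observed, predicted)
print(metrics)

five_fold = list(k_fold_indices(10, 5))      # 5-fold CV on 10 samples
leave_one_out = list(k_fold_indices(4, 4))   # LOOCV: k equals the number of samples
print(five_fold[0])
```

In a real workflow the model would be refitted on every training split and the metrics averaged over the folds; the sketch only shows how the splits and the metrics are defined.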
Leave-one-out cross-validation has a high computational cost.

3. Application

Applicability Domain: The biological activity of the compound i will be predicted by the model, only if…

3D QSAR: QSAR analysis with 3D structures of molecules
• Models should be interpretable
• Higher information content from raw data
• Comparative Molecular Field Analysis (CoMFA) => based on the concept that the biological activity of a small molecule depends on its molecular fields

2D-QSAR techniques are not able to accurately explain the correlation between the physicochemical properties, the 3D spatial arrangement and the biological activities. Therefore, 3D-QSAR approaches were introduced to overcome this limitation of 2D-QSAR.

Molecular Fields: A molecular field is a function of spatial coordinates that characterizes a molecule; it is used to capture the molecular properties related to the biological activity of the molecule.

Molecular Interaction Fields: Interaction energy between probe atoms and a molecule

Comparative Molecular Field Analysis:
1. Structure alignment: A molecule is chosen as a fixed reference template and the other molecules are moved to new positions corresponding to the minimum of the sum of the squared distances between a chosen set of atom pairs.
2. Probe atoms are placed at the nodes of a grid.
3. Electrostatic and steric interaction energies are calculated (molecular field calculation) => the number of descriptors exceeds the number of molecules by a factor of several thousand.
4. Partial least squares regression (regression model building): PLS is suited to handling complex multivariate problems by eliminating the correlation between the descriptors, reducing their number, and enabling the generation of a linear relationship between the field parameters and the biological activities (prediction of biological activity).
5. The regression coefficients are used to draw contour plots.

Contour plot: Graphical representation of 3D data that displays the relationship between three variables.
In the context of QSAR, contour plots are used to visualize the relationship between the molecular structure of a set of compounds and their biological activity. The CoMFA equation can be exploited for the construction of contour maps by connecting points of the 3D grid with similar favorable or unfavorable coefficient values of a given field. The visualization of the contour map highlights regions where interactions are critical for activity. Contour plots are generated by plotting the molecular descriptors on the x and y axes and the biological activity on the z axis.

Comparative Molecular Similarity Index Analysis: A modified version of CoMFA, based on a standard set of five physicochemical molecular fields:
• Electrostatic field
• Steric field
• Hydrophobic field
• Hydrogen-bond donor field
• Hydrogen-bond acceptor field

GRID: Each point or node in the grid represents a possible binding site, and the properties of the surrounding environment, such as the shape, electrostatic potential, and hydrophobicity, are calculated for each node. These properties are then used to rank the nodes based on their potential as a ligand-binding site, and the nodes with the highest scores are considered potential binding sites.

Pharmacophore: The ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response.
Pharmacophore modeling is popular in industry because it maps out the preferred as well as the undesired functional groups for obtaining good results.
A pharmacophore is not a chemical molecule, but a specification of the physicochemical features desired in a structure. A typical pharmacophore is a table of features specified in a 3D coordinate system. The pharmacophore can be considered as the largest common substructure shared by a set of active molecules.
A pharmacophore gives insight into the important interactions in the active site of the target. This is helpful for finding new or novel compounds against the target.

Features:
• Electrostatic
• H-bonding
• Aromatic interactions
• Hydrophobic regions
• Coordination to metal ions, etc.
+ Excluded volumes, which do not have specific interactions.

1. Pharmacophoric features in each ligand are identified
2. Ligands are aligned such that corresponding features are overlaid
3. Pharmacophore hypotheses are scored (parameters: number of features, goodness of fit to features, conformational energy, volume of the overlay)

Lipinski’s Rule of 5:
1. Molecular weight ≤ 500 Da
2. LogP ≤ 5 (the molecule should not be too hydrophobic)
3. ≤ 5 H-bond donors (e.g. -OH)
4. ≤ 10 H-bond acceptors (e.g. -O-)
If there were too many H-bond donors and acceptors, the small molecule would be too hydrophilic => very soluble in blood but not able to pass through the membrane/blood-brain barrier (it must be just right in between).
5. < 5 freely rotating bonds (an additional criterion, not one of Lipinski's original four) => freely rotating bonds are frozen upon binding, which causes a large reduction in entropy => not good for the free energy

Ligand vs. Structure Based Drug Design

Ligand-Based: The 3D structure of the biological target is unknown, and a set of geometric rules and/or physicochemical properties (pharmacophore model) obtained by QSAR (quantitative structure-activity relationship) studies is used to screen the library of chemicals.

Structure-Based: Involves molecular docking calculations between each molecule to be tested and the biological target (usually a protein). A scoring function is applied to evaluate the affinity. The 3D structure of the target must be known.

Virtual Screening

Computational approach to assess the interaction of an in silico library of small molecules with the structure of a target macromolecule in order to rapidly identify new molecules with a certain desired activity.
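One of the simplest filters applied when screening a compound library is the Rule of 5 listed above. As a minimal sketch, the following Python function counts violations of the four classic criteria. The property values are assumed to be precomputed elsewhere, the two example compounds are hypothetical, and treating the thresholds as inclusive limits (a violation only above 500 Da, 5, 5 and 10) is an assumption of this sketch.

```python
def rule_of_five_violations(mol_weight, log_p, h_donors, h_acceptors):
    """Return the list of violated Rule-of-5 criteria (empty list = passes)."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if log_p > 5:
        violations.append("LogP > 5")
    if h_donors > 5:
        violations.append("more than 5 H-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


# Hypothetical compounds with made-up property values, for illustration only.
library = {
    "compound_A": dict(mol_weight=320.4, log_p=2.1, h_donors=2, h_acceptors=5),
    "compound_B": dict(mol_weight=712.9, log_p=6.3, h_donors=4, h_acceptors=12),
}

for name, props in library.items():
    found = rule_of_five_violations(**props)
    status = "passes" if not found else "violates: " + ", ".join(found)
    print(name, status)
```

In practice such a filter is usually applied with some tolerance (for example allowing one violation), because hard cutoffs discard compounds that lie just outside a threshold.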
Virtual screening involves the docking and screening of a compound database against the drug target, followed by scoring based on the binding free energy with the target. Many software packages are available for screening compound databases against the selected target.

1. Databases: SciFinder, ZINC, PubChem, ChemSpider etc.
2. Filters to reduce the number of compounds
   a. Filters for applicability domain (focused database)
   b. Filters to estimate “drug-likeness”
      i. A ligand is not a drug!
      ii. Properties of a drug: high affinity to a protein target, soluble, permeable, absorbable, high bioavailability, specific metabolic rate, low toxicity
      iii. (L)ADME(T): liberation, absorption, distribution, metabolism, elimination/excretion and toxicity => a drug must travel from the site of administration, reach its site of action in sufficient quantity, remain there long enough to exert the required pharmacological response, and be excreted without causing toxic side effects
      iv. Lipinski’s Rule of 5 (RO5) & beyond the rule of five => cutoff problems, new drug targets such as macromolecules, need for diversity
   c. Filter for synthetic accessibility
      i. SAscore = fragmentScore - complexityPenalty => the fragment score is proportional to the fragment’s occurrence in the PubChem database
   d. Other filters to remove:
      i. Compounds containing too many rings
      ii. Compounds with toxic or reactive groups. Structural alert: molecular patterns that are associated with particular types of toxicity, either directly or after metabolic activation in vivo (toxicophore), such as aromatic nitro groups
      iii. Poorly soluble compounds
3. Similarity Search
   a. Properties to describe elements (fingerprints, descriptors)
   b. Distance measure (“metric”): distance coefficients
      i. Tanimoto coefficient (Tc): similarity assessment for structures A & B encoded by bitstrings (fingerprints), $T_c = \frac{c}{a + b - c}$, where a and b are the numbers of bits set in A and in B and c is the number of bits set in both
   => generally improves if multiple reference compounds are available
4. Pharmacophore Models
   a. Structure-Based: Screen multi-conformational databases to find other compounds that can match the features
   b. Ligand-Based
5. QSAR
6. Docking: By docking a small molecule to a protein, we can predict:
   a. Docking score: usually based on energetic calculations; shows how likely it is that this molecule will bind to the protein
   b. Docking pose: the orientation of the ligand relative to the receptor as well as the conformation of the ligand and receptor when bound to each other
7. Hit Validation: To validate the predictive power of a model, a library of active and inactive (or decoy) compounds is screened against the selected overlay solution and enrichment metrics are calculated.
   a. Evaluation of virtual screening:
      i. Sorting the results by score (docking or pharmacophore scores)
      ii. Calculation of the hit rate in a set of top-ranked molecules
      iii. Calculation of the enrichment factor: $EF_{TopN\%} = \frac{HR_{TopN\%}}{HR_{Random}}$
      Every virtual screening calculation must reach at least EF > 1.0, and EF > 2.0 to be considered efficient enough. This means that the screening must perform at least 2-fold better than random selection!
   b. Hit Rate Evaluation: Measure of the probability of finding active ligands in a set of molecules; it can be calculated by the following equation: $HR = \frac{\text{Active molecules}}{\text{All molecules}} \times 100$
      i. Random Hit Rate: probability of finding an active compound by random choice
   c. Enrichment Factor: Ratio of the virtual screening success rate => how much work one saves by performing virtual screening followed by experimental tests of the compounds in the hit list, compared with random screening of the whole collection: $EF = \frac{\text{Proportion of actives in hit list}}{\text{Proportion of actives in database}}$
   d. AUC ROC curve: Performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how well the model is capable of distinguishing between classes.
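The hit rate and enrichment factor defined in items b and c can be computed directly from a ranked screening result. The Python sketch below assumes the molecules are already sorted by score (best first) and labeled 1 for active and 0 for inactive/decoy; the toy ranking, the function names and the rounding used to determine the size of the top fraction are assumptions of this sketch.

```python
def hit_rate(n_actives, n_molecules):
    """HR = active molecules / all molecules * 100 (in percent)."""
    return 100.0 * n_actives / n_molecules


def enrichment_factor(ranked_labels, top_fraction):
    """EF = hit rate in the top-ranked fraction / random hit rate.

    ranked_labels: 1 (active) or 0 (inactive), sorted by score, best first.
    top_fraction: e.g. 0.25 for the top 25 % of the ranked list.
    """
    n_total = len(ranked_labels)
    n_top = max(1, round(n_total * top_fraction))
    hr_random = hit_rate(sum(ranked_labels), n_total)
    if hr_random == 0:
        raise ValueError("the library contains no actives")
    hr_top = hit_rate(sum(ranked_labels[:n_top]), n_top)
    return hr_top / hr_random


# Toy screening result: 20 molecules, 4 actives, 3 of them ranked in the top 5.
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0] + [0] * 10

hr_random = hit_rate(sum(ranked), len(ranked))
ef_top25 = enrichment_factor(ranked, top_fraction=0.25)
print("random hit rate (%):", hr_random)
print("EF for the top 25 %:", ef_top25)
```

An EF above 1 means the top of the ranked list contains proportionally more actives than the whole library, so fewer compounds need to be tested experimentally to find the same number of hits.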
The higher the AUC, the better the model is at predicting (AUC > 0.5: better than random).

Binding residue predictions are validated by comparing the predicted output with the experimentally labeled dataset. A residue is a
• true positive (TP) if it is predicted as binding and is binding,
• true negative (TN) if it is predicted as non-binding and is non-binding,
• false negative (FN) if it is predicted as non-binding but is binding,
• false positive (FP) if it is predicted as binding but is non-binding.
How do you find false negatives or positives in the curve??

Advantages & Disadvantages of Virtual Screening:

AI in Drug Design:
• Chemical space (concept, methodology, applicability)
  o Much more data is available now, and it is better annotated, BUT drug discovery relies on proxy assays (early-stage, cheap assays) => the challenge is understanding which assay is predictive of human in vivo efficacy and safety
  o AI discovers molecules in a chemical space that is already known (the area where you are sure that your compound succeeds)
• Molecular representation (standard vs. more advanced directions)
  o A compound structure is not an image: it can have different conformations, can be protonated, can be reactive etc. (much more than a cat picture) => dynamic entity
  o A compound is also very conditional (patient dependent, environment dependent)
  o Characterizing the biological system is the big question
• Model evaluation and applicability (confusion matrix, applicability domain)
  o Confusion matrix: you label your images and predict the labels with computational methods; the confusion matrix tells you how often you are wrong/right
  o Applicability domain: in vivo data is very conditional (you cannot label your data; even water is toxic if you drink too much), and in vitro data you can label, but it does not really matter (does it translate to the in vivo situation? The translation is very conditional)
  o Validation is costly, and retrospective validation is equally futile (splitting the data into smaller pieces and using one of them as the test set is tricky => the test set is very similar to the training data)
  o The context of the project influences the model and the prediction; follow-up assays afterwards confirm the prediction => useful for drug discovery
• Novelty vs. success (chemistry focused)
  o Very novel is risky (no prior knowledge), so work focuses on target classes that are already in use, such as kinases and other enzymes, which are data-rich and have less complex pharmacology
• Bioactivity (agonists, antagonists, partial agonists...)
  o A compound is not simply a ligand of a protein; it can be an agonist, antagonist etc.
• Machine Learning
  o Supervised models (labelled data => predictive models => decision making) & unsupervised models (unlabeled data => e.g. clustering)

Enhanced Sampling in MD:
• MD timescale
  o A protein works on a time scale of ms (electrophysiological recording), but we need to integrate in steps of fs (for atomic resolution?) => on the order of 10^12 repetitions of the integration step to cover 1 ms with 1 fs steps
• Concept of sampling: statistical distribution vs. time
  o Coarse-grained: several atoms are grouped => increased time scales at the price of lost resolution, far fewer interactions to compute (more efficient); water is described as 1 bead => no resolution to describe the water molecules individually; good for describing membranes
  o Enhanced sampling: modification of the MD simulation => the time information is lost
  o Ergodic principle: sampling over a period long enough => approximates the distribution of the property of interest (series over time and calculation of the average of the property). This makes it possible to estimate the proportion of time the receptor spends in the outward-facing rather than the inward-facing state and how this changes with or without sugar => description of equilibrium, not of time!
  o Collective variables (CV) => properties which you can describe as a function
  o Evolution

Target structure and druggability

Machine Learning in Protein Design:
• Protein space - novelty
• Applicability of protein design in drug design - pros and cons
• Protein design - peptide design
• Complexity of proteins vs. structured regions of proteins
