Computational Structural Biology PDF
Document Details
Uploaded by YouthfulGothicArt
Summary
This document provides lecture notes on computational structural biology. It discusses topics such as structure prediction methods, molecular modeling, and computational approaches for understanding biological systems. The lectures cover aspects like experimental determination of structures, and using computational models such as molecular dynamics to simulate protein structure and function.
Full Transcript
Topic 09: Computational structural biology

Structure without experiments
Structures offer powerful insights into protein function.
However, structural experiments are resource-intensive, require specialist expertise, and fail for many samples.
Obtaining structures from first principles has long been a “holy grail” of structural bioinformatics.
Simulating protein behaviour gives many insights, but the time scales of folding are too long to be practical.
Obtaining structures from knowledge of homologs (homology modelling) or complexes from components (docking) are important sub-problems that have proven more tractable.
The realization that homologous sequences give powerful insight into interactions has enabled a recent breakthrough: many structures can now be predicted quite accurately.

Structural biology – just applied physics?
The forces that govern protein structure, dynamics and energetics are well understood.
These physical laws can be embedded in a computer program.
One can then simulate the behaviour of a protein as a function of time.
In principle, all important aspects of protein behaviour can be computed from these energetics.
In practice, the computational cost of simulating such complex objects forces us to take many shortcuts.
Various heuristics allow one to apply the power of these approaches to some complex problems, including ab initio structure prediction and protein design.

Force fields
[Figure source: Wikipedia]
A force field is a set of equations and parameters that allow the calculation of the potential energy of a molecular system.
A force field describes how to calculate the overall energy of the system (and each atom) from the bond lengths, angles, torsions, van der Waals interactions, electrostatic interactions, etc.
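As a rough illustration of how a force field turns geometry into an energy, the sketch below sums a harmonic bond-stretching term, a Lennard-Jones (van der Waals) term and a Coulomb (electrostatic) term over all atom pairs. It is a minimal sketch, not any particular published force field: the function names, the parameter values, the single shared `epsilon`/`sigma`, and the unit convention (kcal/mol, Å, elementary charges) are all assumptions made for the example, and angle and torsion terms are left out for brevity.

```python
import math
from itertools import combinations

# Approximate Coulomb constant in kcal*Angstrom/(mol*e^2) (assumed unit convention).
COULOMB_K = 332.06


def bond_energy(r, r0, k):
    """Harmonic bond stretching: zero at the ideal bond length r0."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, epsilon, sigma):
    """van der Waals term: strongly repulsive at short range, weakly attractive beyond."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2):
    """Electrostatic term between two point charges."""
    return COULOMB_K * q1 * q2 / r


def total_energy(coords, charges, bonds, epsilon=0.1, sigma=3.4):
    """Sum bonded and non-bonded terms over every pair of atoms.

    coords:  list of (x, y, z) positions
    charges: list of partial charges, one per atom
    bonds:   dict mapping (i, j) with i < j to (r0, k) for bonded pairs
    epsilon, sigma: illustrative Lennard-Jones parameters shared by all atoms
    """
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        if (i, j) in bonds:
            r0, k = bonds[(i, j)]
            energy += bond_energy(r, r0, k)
        else:
            # Non-bonded pairs feel van der Waals and electrostatic interactions.
            energy += lennard_jones(r, epsilon, sigma)
            energy += coulomb(r, charges[i], charges[j])
    return energy
```

Note that the pair loop visits every pair of atoms, so the cost grows as the square of the number of atoms. Real force fields also assign parameters per atom type rather than sharing one `epsilon` and `sigma`.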
In essence, a force field sums the influence of all individual types of interactions (as discussed in topic 1) acting on a given atom.

Newtonian molecular mechanics
Using a force field, we can calculate the net force being exerted on a given atom in a simulated molecular system.
[Figure, t=0: net force on an atom from covalent bond stretching, electrostatic repulsion, electrostatic attraction and v.d.W. repulsion; charges labelled -1/2, +1, -1/2 and d-]
This calculation can be repeated for every atom in the system.
Given the force experienced by each atom, the acceleration of each atom can be calculated according to Newton’s second law.
Assuming we know the current velocity of each atom, we can calculate where each atom will be after some small time-step t.
This yields a new state of the system (at t = 1).
Once you have the new state, you can again calculate the net force on each atom.
This will allow you to calculate the position at the next time point (remembering to keep track of each atom’s velocity).
Repeat the calculation iteratively until the desired time period has been modelled.
Showing the state of the system successively at each time point gives you a “movie” of the object’s motion.
This movie should show plausible behaviour of the molecule.

Molecular dynamics
Temperature, or the mean kinetic energy of the component atoms, is a key attribute of any atomic system.
Temperature dictates how much energy is available to the system, and therefore which states are energetically accessible.
Molecular dynamics tries to provide a realistic simulation of temperature by providing each atom with an initial random velocity.
The velocities are assigned according to the Boltzmann distribution.
The idea is that if we give each atom an initial kick, and then evolve the system according to the rules of Newtonian mechanics, we will get a realistic picture of how this protein will behave at this temperature.

Molecular dynamics – setting up the system
Protein simulations require that you also simulate the protein’s environment: include water molecules, ions, and possibly lipids (for membrane proteins).
You can only handle a small number of atoms, but having a small ball of water in a virtual vacuum will result in water “boiling off”.
Instead, you put the protein in a box filled with medium (generally mostly water).
Critically, the box acts as though the universe were filled with repeating instances of identical boxes.
Atoms interact with copies of other atoms across the box boundary, and atoms exiting one side of the box reappear at the opposite side.

Molecular dynamics gives a window into protein motion
E.g. gating in a voltage-sensitive channel.
Note that this is a long MD (250 µs); 10-500 ns would be more typical.
Side chains, solvent and membrane are all simulated, but not shown.
On this time scale domains partly unfold and reorganize.

Force fields require trade-offs
The force fields are classical approximations of the quantum reality.
They are inherently imperfect, and are highly non-trivial to optimize so that they behave well yet remain computationally tractable.
Water molecules are very challenging: no force field fully captures their experimental behaviour.
Electrostatic interactions weaken with distance, but remain non-zero until separation is infinite.
Because the number of interactions goes up as the square of the number of atoms, you need to ignore interactions beyond some range, but then need to dampen artifacts caused by interactions suddenly kicking in as atoms cross range cutoffs.

Computational resources limit simulations
Proteins are very large, with thousands of atoms, all of which interact with each other.
Covalent bond formation and breaking needs quantum-level simulation.
Interactions with water are also critical; this is computationally a very complex, messy system.
Accurate simulations require modelling dynamics on a femtosecond (10^-15 s) time scale (bond stretching, angle bending, etc.).
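To get a feel for what a femtosecond time step implies, the short calculation below counts the integration steps needed for the simulation lengths quoted above (10 ns, 500 ns and 250 µs). It is a sketch under one stated assumption: a fixed time step of 1 fs, chosen here only because the notes speak of a femtosecond time scale. The helper name `steps_needed` is hypothetical.

```python
FS_PER_NS = 1_000_000        # 1 ns = 10^6 fs
FS_PER_US = 1_000_000_000    # 1 µs = 10^9 fs


def steps_needed(duration_fs, timestep_fs=1):
    """Integration steps required to cover duration_fs (rounded up)."""
    return -(-duration_fs // timestep_fs)  # ceiling division on integers


# Simulation lengths mentioned in the notes, expressed in femtoseconds.
runs = {
    "10 ns": 10 * FS_PER_NS,
    "500 ns": 500 * FS_PER_NS,
    "250 µs": 250 * FS_PER_US,
}

for label, duration in runs.items():
    # Each step requires a force evaluation for every atom in the box.
    print(f"{label}: {steps_needed(duration):.1e} steps")
```

Under this assumption a 10 ns run already needs 10^7 steps, and the 250 µs run needs 2.5 × 10^11, each step involving a force calculation for every atom in the system.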
But biologically interesting stuff (folding, binding, catalysis) happens on a microsecond to second (10^-6 to 10^0 s) time scale.
Simulating even 10^-9 s requires enormous computational resources.
You can simplify the calculation by approximating certain details, but this compromises accuracy.

Uses of molecular dynamics
MD simulations allow one to observe the behaviour of a protein over a span of time.
The technique can give a sense of the natural range of motions of a given protein (e.g. domain closure, the degree of natural mobility of a protein chain).
Models (e.g. substrate complexes) can be assessed for stability.
One can also extract approximate binding energetics.
It is, however, more difficult to study enzyme reactions.

Molecular dynamics can be extended to chemistry
Molecular dynamics simulations do not allow bonds to be made or broken.
This means that they cannot model enzyme reactions, or subtle electrostatic effects.
Understanding reaction mechanisms requires simulating the system on a quantum level.
One solution is to use hybrid classical/quantum models.
Only key atoms are treated at the quantum level of detail; all other atoms are treated by standard MD.
Hybrid quantum/classical simulation (commonly called QM/MM) comes at a significantly increased computational cost.

Protein Folding by MD simulation
Molecular dynamics can be used to simulate protein folding.
This is only doable for small proteins ( … 50 % and good coverage, you will get a fairly reliable model.
Predicted structures are generally no more similar to the true structure than the starting template (i.e. often worse r.m.s.d.).
Inserted or structurally divergent loops are likely to be wrongly modelled.
Homology models look like plausible structures, but generally are not accurate enough to tell you something new.

Docking and ligand binding
Given two structures, can we predict what the complex looks like?
In principle this is a relatively “simple” problem, dependent on 6 variables (orientation and position of domains).
[Figure: Cyclin A bound to CDK2]
However, conformational changes in the protein and ligand upon binding can add considerable complexity.
Protein-protein and protein-ligand interactions have different challenges and are treated independently.
[Figure: lysozyme with peptidoglycan]

Protein docking
[Figure labels: rigid interface – straightforward; plastic interface – highly challenging]
Commonly, docking takes structures of two proteins that are known to bind one another, and tries to predict the complex.
Structures can be experimental, homology models, or ab initio models; less accurate structures can reduce reliability.
Docking tends to work best when proteins reorganize the least versus the starting model.

Protein-protein docking
Protein docking tries to predict molecular complexes, their interaction geometries and energetics.
Typically docking requires several steps:
– First we sample all possible interaction geometries to find promising ones
– Then we optimize the most promising interaction interfaces by repacking
– The interaction is then scored, and top hits taken as candidate complexes
Side chains typically reorganize during protein interactions, so modelling side chain flexibility is important.
The backbone may also shift significantly, which requires more challenging backbone repacking.

Ab initio structure prediction
The ultimate goal of protein modelling is to model protein structures from first principles, without prior knowledge of the structure.
Such “ab initio” modelling was considered one of the hardest problems in science.
The challenge with trying to solve this with purely physical models of proteins is that physics does not push models towards being correct until they are very nearly correct.
Molecular dynamics is too computationally costly to simulate folding.
The key turns out to be finding ways to extract useful information about the structure from pure sequence
information.

Constraints
“Constraints” are specific features we predict should be true of our model.
Experimental structures are, in a sense, structures modelled using a large number of experimental constraints.
Examples of constraints that might be established from experiments include known disulfide bonds, residue proximity from cross-links, or surface exposure of residues.
Constraints can alternatively be based on bioinformatics analysis, e.g. secondary structure predictions.
Known constraints can be incorporated into modelling to push the models towards the correct structure.
To be truly useful for ab initio predictions, we need a method that generates a useful number of constraints from sequence data.

Amino acid covariance as a source of structural constraints
Suppose two amino acids pack in the core of a structure.
Modifying aa #1 will generally require compensatory mutations in aa #2 to maintain optimal interactions.
E.g. Ala.Phe -> Leu.Ile
If you compare enough sequences, you may be able to identify pairs of residues that co-vary significantly more often than you would expect by chance; these are candidates for interacting pairs.
One complication is that interactions may be intermolecular.
Also, A-B and B-C contacts may produce an A->C covariance (transitivity).

Covariance analysis
Extracting robust covariance predictions using standard statistical methods requires tens to hundreds of thousands of sequences.
Widespread bacterial proteins generally have enough sequence representation to be modelled (esp.
if metagenomic data is included; many proteins have ~100k homologs known).
Proteins that occur only in eukaryotes, however, generally do not have enough homologs for this approach to give a robust prediction.
The average amino acid interacts with 1.5 others, so the top 1.5 × length covariances are accepted as being predictive of interactions.

Co-varying residues constrain structure predictions
[Figure labels: blue dots – covarying residues; distance constraints as yellow/green lines]
Strongly co-varying residues (represented as a pseudo-distance matrix) are generally in physical contact.
Covariance can therefore be used as proximity constraints (green/yellow lines on right) for modelling.
This pushes the model towards the correct fold.
For homo-oligomers, contacts between protomers will result in unsatisfied covariance constraints.
These can be used to drive docking of the oligomer.
Hetero complexes can be modelled by considering both sequences.

Examples of predicted structures
Covariance + Rosetta structures (left) correctly predict x-ray structures (right).
This method worked for dimeric proteins and membrane proteins.
However, success rate and model accuracy are moderate; even these structures are wrong in many details.

Ab initio predictions by deep learning

Structural predictions by machine learning
There are subtle patterns in protein sequences that turn out to be strongly predictive of the structure they fold into.
Most of these would be very challenging to discover using conventional methods, but can be used by machine learning (artificial intelligence) methods.
AlphaFold 2 (from Alphabet) was the first tool to solve this problem generally, and more recently AlphaFold 3 extends its capabilities.
However, there are several alternative algorithms/approaches, and developments in this area continue to be rapid.

Neural networks
Neural networks are a class of algorithms that mimic aspects of the organization of neuron signalling.
They excel at identifying patterns in data without guidance.
When
trained on sample data they will discover key patterns without explicit guidance.
Once trained, they can apply previously learned patterns to predict features of previously unseen data.
Neural networks with >3 hidden layers are called deep learning networks.
These algorithms underpin most “artificial intelligence” applications.
Neural networks process inputs into outputs by passing them through a series of layers.
In each layer, each node calculates a weighted sum of all of its inputs (i1w1 + i2w2 + i3w3 …).
If this sum exceeds some threshold, it passes 1 to all nodes of the next layer; if not it passes 0.
The network is tuned by “back propagation”.
For a given set of training inputs, the weights are iteratively adjusted until the output matches the expected training outputs.

Large language models
Large language models are NNs that are trained on large numbers of “texts” to predict masked words in new texts.
This can produce a NN capable of generating new, coherent-sounding sentences, e.g. ChatGPT.
You can train a large NN using simple amino acid sequences as “text”, then predicting a randomly masked amino acid.
Sufficiently large LLMs trained on protein sequences eventually “discover” that the identity of an amino acid is linked to those of residues distant in the primary sequence.
The algorithm learns to predict key features of structures, without ever seeing one.
Note – LLMs will make predictions given only a single sequence; no MSA or structural homologs are required.

LLM - ESMfold
ESMfold is an alternative, purely LLM-based structure prediction tool from Meta.
De novo predictions can be run at https://esmatlas.com/resources?action=fold
Note – 400 a.a. max; ~15 s
Predictions take seconds, but are not as accurate as AF.
AF 3 seems to use LLM-like approaches in the pairformer, allowing predictions for sequences with no homologs.

AlphaFold 3
The researchers used known experimental structures (from the PDB) and sequence alignments to “train” a neural network to convert sequence information into structures.
To solve this problem efficiently, it is broken down into distinct sub-problems.
Each sub-problem is handled by a distinct module.
The whole process is iterated several times until it converges upon a solution.

AlphaFold 3 algorithm
(Abramson et al., Nature, 2024)
AlphaFold takes an input sequence, finds homologs and constructs pairwise sequence alignments.
It will also look for structural templates, but these are not required.
AlphaFold then reasons at two levels:
– The pairformer, where information is represented as sequence alignments and residue-residue interaction matrices
– … and the diffusion module, where a 3D model of the protein is generated
Information is iteratively cycled between the pairformer and diffusion module, allowing both predictions to be iteratively improved.

AlphaFold 3 algorithm – first steps
AlphaFold 3 takes as its input protein and nucleic acid sequences, and the chemical structures of ligands (as SMILES strings).
Sequences are used to find homolog sequences, with which pairwise sequence alignments are constructed (up to tens of thousands).
The algorithm searches the PDB for any structural templates.
Conformers (low energy conformations) are generated for any small molecule ligands specified.
(Abramson et al., Nature, 2024)

AlphaFold 3 – pairwise distances
The AlphaFold 3 pairformer module predicts inter-residue distances.
For each pair of residues, the algorithm generates and updates a distance probability distribution.
Homologous structures, if available, are used to help define the initial distance distribution.
Information from covariance-like reasoning around large numbers of
sequence pairs also helps condition the distribution.
Patterns inherent in the sequence itself (see LLM) likely also help.

Reasoning about distances – triangle rule
The triangle rule states that for any triangle ABC, the distance AB … 70 corresponds to a generally correct backbone prediction.
Lower scores indicate unreliably predicted regions.

Extended low pLDDT scores predict disorder
[Figure: Human Striatin 1]
AlphaFold predicts atomic positions for every residue included in the sequence.
Extended regions with pLDDT < 50 generally correspond to disordered residues.
AlphaFold predicts disorder with accuracy comparable to the best disorder predictors.
Note: do not treat extended pLDDT < 50 regions as indicating actual structure.
These regions generally do not pack on anything else, though they may have preferred secondary structure.
Note: AF3 is prone to hallucinating excess order, and was trained on AF2 disorder predictions.

Pre-computed AF2 structures in UniProt
AlphaFold 2 was used to create a database of predicted structures that presently includes essentially all of UniProt (200 million sequences).
Proteins between 16 and 2700 a.a. have been predicted.
If oligomeric, only the protomeric structure is predicted.
E.g.
https://alphafold.ebi.ac.uk/entry/Q5VSL9
These predictions are linked through the UniProt database.

Free to use AlphaFold 3 website
https://alphafoldserver.com
[Screenshot labels: protein sequence(s); ligand(s)]
To run a prediction, simply enter all sequences and state the number of copies.
Up to 5000 residues, but only a few ligands are available.
Predictions take a couple of minutes.

Example: Ribf transferase
[Figure labels: x-ray (7SHG); alphafold; domain rotated; loop over-predicted; termini overpredicted, blocking active site; AF missed critical cis-peptide]
The X-ray structure (white) is compared to an AlphaFold model (pLDDT colouring).
No precedent was available for most of the structure.
The AlphaFold structure is generally very accurate, though some global shifts and local details are wrong.
Generally, these areas have poor pLDDT.
With care, this structure gives powerful insight.

Example: using AF2 to find a key fertilization protein
Researchers knew of two key proteins that are required for sperm-egg fusion, but suspected they were missing a third partner.
They used AlphaFold 2 to build models of these 2 proteins, plus each of 1400 testis-expressed membrane proteins.
They identified TMEM81 as the third member of the complex.
PAE scores suggest that this complex interacts.
In vivo experiments confirmed it is an essential component required for fertility.
(Deneke et al., Cell, 2024)
Note that AlphaFold 3 is much faster than AF2.
With AF3 it should be possible to screen the whole proteome for interactors.

Predicted structure limitations
These structures are generally accurate, but may differ from the true structure in details.
Oligomeric complexes are not automatically generated; you need to specify the number of chains of each type.
The structures are produced in essentially one state, generally the one with the most interactions.
Nucleic acid and small molecule predictions are less accurate than protein predictions.
The currently available version
of AlphaFold 3 does not allow you to include general ligands.

AlphaFold makes many older approaches to gaining structural insights near obsolete
AlphaFold models significantly outperform the following modelling methods, and should essentially replace them:
– Homology modelling
– Protein-protein docking
– All previous ab initio structure prediction methods
It should also outperform most sequence-based structure feature predictions including:
– Secondary structure
– Membrane topology; complementary insight into transmembrane or peripheral membrane helices
– Protein disorder (still more or less a tie)
– Domain boundaries (these predictions are very useful for cloning subdomains)

Deep learning looks poised to revolutionize (structural) biology
Being able to predict a structure for essentially any protein gives a powerful shortcut to understanding its function.
The ability to compare folds of proteins will help map out the evolutionary history of all proteins; this is already happening for viruses, where proteins have extremely divergent sequences.
The ability to reliably predict protein interactions will lead to the identification of all important multi-protein complexes in humans.
This will help, for example, map new signalling pathways, and understand how known ones interact.
The ability to model protein-ligand complexes will help identify substrates and allosteric modifiers for enzymes/proteins.
AF 3, trained on the proprietary structural databases in drug companies, may allow rapid computational screening of potential drugs against known targets.

De novo protein design
Building on our increasing understanding of protein structure, computational methods have been developed to design proteins wholly different from those in nature, fulfilling new roles.
Making proteins by de novo gene synthesis, followed by biophysical characterization and/or determination of an experimental structure, confirms that the structures are often as predicted.
The ultimate aim of this field is to design proteins
that fulfill tasks not seen in nature (e.g. proteins that bind a specific target, or catalyze useful reactions not seen in nature).
Recently developed machine learning methods show a huge amount of promise in this regard.

Rosettafold diffusion
RFdiffusion is an “A.I.” algorithm that seeks to design de novo structural models.
RFdiffusion initiates de-noising with randomly placed atoms.
This produces a random, wholly de novo candidate fold.
A separate Rosetta algorithm then chooses and packs side chains to produce a sequence that will adopt this biophysically plausible fold.
AlphaFold then checks if this sequence folds as predicted.
(Watson et al., Nature, 2023)

Additional criteria can be built into the design
RFdiffusion can accept constraints to produce targeted proteins.
Oligomeric proteins will emerge from symmetric noise.
Molecules that will bind set protein or small molecule targets can be designed.
Pre-chosen structural motifs can be incorporated and extended.
(Watson et al., Nature, 2023)

Proteins can be designed to bind a specific target
RFdiffusion can build the new protein around an existing protein to design an interacting protein.
(Watson et al., Nature, 2023)

ProteinMPNN
Rosettafold diffusion produces a biophysically plausible model of the backbone only.
Side chains are added by a separate program, ProteinMPNN.
This is a neural network program trained on known PDB structures.

ProteinMPNN – algorithm
This program encodes the structure as distances between pairs of nearby backbone atoms (Cα, C, O, N, Cβ).
The network identifies which residues will allow packing into the available spaces between each nearby pair of backbone atoms.
The sequence is then constructed from residues that are favoured between each pair of atoms.
This algorithm recovers > 50% of amino acids for native proteins.

Designed hemagglutinin binder
(Watson et al., Nature, 2023)
Here, a protein was designed to bind influenza hemagglutinin.
Experiments show that it binds the receptor with nanomolar affinity, and its
structure and binding are exactly as expected.
Note that this was the best of a few tens of candidate designs.

Computational structural biology - conclusions
Macromolecules obey known physical chemistry laws, and so can be computationally modelled.
The challenge is that the size of proteins and their slowness tax available computational resources in all-atom molecular dynamics simulations.
The solution turned out to be extracting predictive patterns from large numbers of known structures and sequences.
AlphaFold has largely solved the “hard” problem of predicting a structure from (multi)sequence data.
However, these structural models lack ligands, show only one state, and are generally inaccurate in places.
The size of predictions achievable is currently limited by GPU resources.
Currently, an experimental structure is still important for establishing the “ground truth”, for getting the details right, for determining reliable ligand complexes and for discovering alternative states.
It is important to note that the key advances that have made structure prediction and design possible are only 3 years old.
This area has a huge amount of potential; expect both the methods and the applications to develop rapidly.