Lecture 3 - Protein Structure Prediction PDF

Summary

This document details protein structure prediction, covering the protein folding problem and different prediction approaches. It discusses energy functions and various prediction methods, including ab initio energy calculations, sequence-structure gaps, and secondary structure prediction.

Full Transcript

Lecture 3 – Protein Structure Prediction

Protein Folding Problem

There are >250,000,000 protein sequences but only [...]

TM > 0.5 means the overall fold of the protein is good
TM > 0.75 means a good predicted structure

These predictions are evaluated on known structures. RMSD is very good for similar structures, and having a unit makes it easier to interpret. However, as the differences between the predicted and the real structure grow, RMSD becomes a poor method of evaluation. It is also difficult to compare the RMSD of a 50-residue protein with that of a 250-residue protein, whereas the TM score gives a uniform scale.

CASP

This is a blind test of prediction, required to evaluate the different approaches. The sequences are sent to predictors before the experimental coordinates are revealed. This occurs every two years, with manual evaluation of results.

Ab initio Energy Calculations

This was the original idea: describe the interactions between atoms and search for the conformation of lowest energy. The methods are energy minimisation and molecular dynamics, now called ab initio physics-based methods.

Energy Function

Potential energy of a protein in a particular conformation:

E = bond length + bond angle + bond dihedral rotation + van der Waals contacts + electrostatic interactions

Energy Minimisation

The idea is to calculate a protein's potential energy, then adjust its bond geometry and see whether the potential energy decreases. If it does, keep going downhill until you reach the energy minimum (the thermodynamically stable position).

Problem 1

The search can get stuck in a local minimum. Solution: molecular dynamics addresses this because it simulates motion and can more readily jump over barriers. Molecular dynamics solves Newton's laws of motion for the protein.

Problem 2

We don't know in detail what the energy surface looks like, so simulating protein folding over such a huge landscape is impractical.
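The local-minimum problem (Problem 1) can be seen on a toy example. The sketch below is not a protein force field: the one-dimensional `energy` curve, the step size and the starting points are all hypothetical choices made only to show that simple downhill minimisation ends in whichever minimum is nearest to the start.

```python
def energy(x):
    """Toy 1D 'energy surface' with two minima (hypothetical, not a force field)."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    """Analytical derivative of the toy energy."""
    return 4 * x**3 - 6 * x + 1


def minimise(x, step=0.01, iterations=5000):
    """Plain steepest descent: always move downhill from the current point."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


# Two different starting conformations on the same energy surface.
trapped = minimise(1.5)    # rolls into the shallower minimum on the right
lowest = minimise(-1.5)    # rolls into the deeper minimum on the left

print(f"start  1.5 -> x = {trapped:.3f}, energy = {energy(trapped):.3f}")
print(f"start -1.5 -> x = {lowest:.3f}, energy = {energy(lowest):.3f}")
```

Both runs stop where the gradient is zero, but only one of them finds the lower energy. Which one you get depends entirely on the starting point, which is why minimisation is useful for local improvement only.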
So, energy minimisation can be used for local improvement but doesn't solve the global protein structure prediction problem.

Secondary Structure Prediction

The aim is to identify local structure: α-helix, β-strand, coil and turn. The theory is that, to a large extent, the local sequence determines the local structure. Current methods use multiply aligned sequences to provide extra information: gaps suggest loops, and certain residues are often found in beta-sheets. Currently, nearly every α-helix is identified, as are most β-strands, although the short edge strands are still poorly predicted. The errors tend to be in defining the precise ends. Programs such as PSIPRED do this using neural networks.

Tertiary Structure Prediction

Three major approaches:
- Template-based
  - reliable
  - protein fold space is limited
  - 50% of the proteome is covered
- Template-free
  - sometimes reliable
  - fragment-based assembly
  - multiply aligned sequences can often, but not always, yield quality models
  - recent language models are rapid and powerful
- Hybrid
  - deep learning with templates produces excellent models

Template-Based Modelling – Phyre2.2

Query sequence → match query sequence against sequence library of known structures → matched fold

How the Phyre2.2 Database Was Created:
1. Extract the sequence and the secondary structure from a library of known structures
2. Run the sequence through PSI-BLAST to build an MSA
3. Run the MSA through PSIPRED for predicted secondary structure
4. You now have a hidden Markov model (HMM) that contains the sequence of the known structure, the known secondary structure and the predicted secondary structure

Do this for every sequence of known structure to create a library of HMMs. From ~200,000 known 3D structures we then have ~200,000 hidden Markov models: an HMM database of known structures. You can then search the 250 million known sequences for homologues using PSI-BLAST.

Searching the Database:

Do the same process but for your query sequence.
This captures the mutational propensities at each position in the protein: an evolutionary fingerprint. Then search your HMM against the library of HMMs (HMM-HMM matching). This is a very sensitive approach and can be used to find remote relationships. It can confidently detect and align proteins when sequence similarity is as low as 15%.

From Alignment to Crude Model

After searching your HMM against the library of HMMs, you get an alignment of the query with a sequence of known structure. You can then re-label the known structure according to the mapping from the alignment and change the residues to those of the query. Insertions are handled with loop modelling.

Loop Modelling

1. Fragment the entire PDB
2. Find sequences similar to the insertion/deletion
3. Check end-point distances
4. Check backbone geometry
5. Fit fragment to core structure

Example:
1. 5 residues to fit in (insertion)
2. Search for loops of length 5 in the PDB
3. Graft the loop and regularise the structure

Accuracy

Insertions and deletions relative to the template are modelled by a loop library up to 15aa in length; however, short loops ([...]
- High confidence (>90%) and high sequence identity (>35%) is almost always very accurate (TM score > 0.7, RMSD 1-3 Å).
- High confidence (>90%) and low sequence identity ([...]

Query sequence → Ab initio → Build fold from fragments (match sequence against a library of known fragments)

Assumes that local sequence determines local structure.
1. Predict the structure of a segment by trialling structures for a local sequence taken from a database of segments of known 3D structure.
2. Construct a trial model from segments and find the best model by scoring; adjust the structure by adopting different fragments to give a good fold.

Fragment-based methods sometimes give reasonable predictions but sometimes fail. They can also be integrated with template methods to fill gaps or uncertain regions. I-TASSER and Robetta are widely used, but these have now been superseded by deep learning (AlphaFold).
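The loop modelling procedure listed earlier in this section filters candidate loops by length and by end-point distance (steps 2 and 3). A minimal sketch of that filter is below. The fragment library, the coordinates, the 1.0 Å tolerance and the function names are hypothetical illustrations, not Phyre2.2's actual data structures or thresholds.

```python
import math


def end_to_end(coords):
    """Distance between the first and last C-alpha position of a fragment."""
    return math.dist(coords[0], coords[-1])


def candidate_loops(fragments, loop_length, anchor_start, anchor_end, tolerance=1.0):
    """Return names of fragments of the right length whose end-to-end distance
    matches the gap between the two anchor positions, best match first."""
    target = math.dist(anchor_start, anchor_end)
    hits = []
    for name, coords in fragments.items():
        if len(coords) != loop_length:
            continue  # wrong number of residues for this insertion
        deviation = abs(end_to_end(coords) - target)
        if deviation <= tolerance:
            hits.append((deviation, name))
    return [name for _, name in sorted(hits)]


# Hypothetical toy library of C-alpha traces (coordinates in angstroms).
library = {
    "extended": [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0), (15.2, 0, 0)],
    "compact": [(0, 0, 0), (3, 2, 0), (5, 5, 0), (7, 2, 0), (10, 0, 0)],
    "too_short": [(0, 0, 0), (3, 2, 0), (7, 2, 0), (10, 0, 0)],
}

# A 5-residue insertion must span a 10.5 angstrom gap in the core structure.
matches = candidate_loops(library, 5, (0, 0, 0), (10.5, 0, 0))
print(matches)
```

Only fragments that pass this cheap geometric filter would go on to the more expensive backbone-geometry check and fitting steps.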
Exploiting the Evolutionary Record in Protein Families

You can also use correlated mutations in multiple sequence alignments to predict contacts: if two residues tend to mutate together, you can infer that they are close together in space and in contact in 3D. From these predicted contacts, one can start to build up a structure. This is very powerful.

You can also predict the probable distances between pairs of residues by deep learning, visualised in a distogram. Convolutional neural networks (CNNs) can be used to do this, taking the covariation and secondary structure prediction as input. Covariation data is necessary to achieve high precision when predicting contacts. Secondary structure is needed to train the model to predict inter-residue contacts and recognise specific contact patterns between elements of secondary structure.

Hybrid Prediction – AlphaFold2

An extension using deep learning predicts the probable distance between residues, which can be visualised in a distogram.

Approach:
1. The input is a multiple sequence alignment of the query sequence. In addition, known PDB structures provide structural data known as templates.
2. Two-track learning, called Evoformer and structure:
   - the first stage is the Evoformer, which calculates residue-residue contacts at different distances
   - the second stage of learning is the 'structure' network
   - each residue is an independent unit and they are not linked together
   - the position of each main-chain residue is predicted and then the side chains are fitted
   - the learning is termed end-to-end, so the function optimised is the difference between the final model and the true structure
   - the algorithm also predicts the expected accuracy of each part of the model
3. Finally, the structure is refined by molecular dynamics using Amber. This does not improve the model in terms of RMSD but does correct some local stereochemistry.

pLDDT Accuracy Metric

The per-residue confidence metric pLDDT is colour coded on EBI models on a scale of 0-100.
pLDDT stands for predicted local distance difference test, and measures the local agreement between the two protein structures.

- pLDDT > 90: high accuracy
- pLDDT 70-90: modelled well, with a generally good backbone prediction
- pLDDT 50-70: low confidence; should be treated with caution
- pLDDT < 50: often has a ribbon-like appearance and should not be interpreted (often disordered)

PAE Accuracy Metric

This is the Predicted Alignment Error: how well predicted the distance between two residues is. It can be used to assess confidence in domain packing and is colour coded.

NOTE: Regions with a very low PAE can be totally misplaced relative to one another. An example is the AlphaFold2 prediction of the human growth hormone receptor, where the extracellular and intracellular regions pack together, which is biologically impossible.

Limitations

Although it provides models for 98.5% of human proteins, only ~65% of residues in the human proteome are predicted with high confidence (pLDDT > 70). However, you can only get about 75% coverage anyway, as 25% of the human proteome is intrinsically disordered.

ColabFold

This is a web server for running AlphaFold.

RoseTTAFold from the Baker Group

This is a method related to AlphaFold, with a three-track network that processes sequence, distance and coordinate information simultaneously. It achieved accuracies approaching those of DeepMind. The three-track network involves information at the one-dimensional (1D) sequence level, the 2D distance map level and the 3D coordinate level, which is successively transformed and integrated.

Language Models

AlphaFold and RoseTTAFold require multiple sequence alignments for accurate predictions, but for many sequences one cannot generate an MSA as there is no data (orphan sequences). There are also typically no MSAs in protein design, as there are no natural sequences.
Thus, language models are trained to fill in masked residues in known sequences, and this learning is then linked to generating rules for protein structure prediction.

ESMFold – Mapping the Metagenomic Structural Space

Using a language model, Meta has developed a very rapid approach for protein structure prediction called ESMFold. This is reported as 60x faster than AlphaFold2 at predicting short proteins but is not as accurate. As a test, the researchers unleashed their model on a database of bulk-sequenced 'metagenomic' DNA from environmental sources such as soil, seawater and the human gut and skin. Of the 617 million predictions, the model deemed more than one-third to be high quality, such that researchers can have confidence that the overall protein shape is correct and, in some cases, can discern atomic-level details.

Figure: ESMFold model architecture. Arrows show the information flow in the network from the language model to the folding trunk to the structure module, which outputs 3D coordinates and confidences.

AlphaFold3

In May 2024, DeepMind launched AlphaFold3, mainly designed for docking rather than protein tertiary structure prediction. There is only a marginal improvement over AlphaFold2 on monomers in complexes, so it is not a substantive advance for tertiary prediction.

Model

There are two modules in this model:
1. Trunk Module - similar to that of AF2; feeds into the diffusion module.
2. Diffusion Module (new part)

Diffusion Approach

Starting from the initial 3D coordinates on the left, incremental amounts of Gaussian noise are added to the coordinates of individual atoms. This initially disrupts the local structure of the subdomains and eventually destroys the global structural information, until at the final step the atoms appear randomly distributed.
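The forward (noising) half of the diffusion approach described above can be sketched in a few lines. This is an illustration only: the fixed per-step `sigma`, the three-atom toy coordinates and the function names are assumptions, and AlphaFold3's actual noise schedule is not given in these notes.

```python
import math
import random


def add_noise(coords, sigma, rng):
    """Add independent Gaussian noise to every coordinate of every atom."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in atom) for atom in coords]


def forward_diffusion(coords, steps, sigma, seed=0):
    """Return the list of structures from the clean start to the noisiest step."""
    rng = random.Random(seed)
    trajectory = [coords]
    for _ in range(steps):
        coords = add_noise(coords, sigma, rng)
        trajectory.append(coords)
    return trajectory


def rms_displacement(a, b):
    """Root-mean-square distance between matching atoms of two structures."""
    squared = [
        sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
        for atom_a, atom_b in zip(a, b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical three-atom 'structure' (coordinates in angstroms).
start = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
trajectory = forward_diffusion(start, steps=10, sigma=1.0, seed=0)

for step in (0, 1, 5, 10):
    print(step, round(rms_displacement(start, trajectory[step]), 2))
```

A diffusion model is trained on pairs of structures from such a trajectory so that it can learn to undo one noising step at a time.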
Diffusion models use a deep neural network to learn to successively reverse this process (denoising or reverse diffusion), going from a random distribution to predicting the coordinates of every single atom in the complex.

Developing an Algorithm Based on ML

The model is learned on a training set, then hyperparameters are defined and the accuracy is evaluated on a held-back test dataset. However, there can be data leakage if the training and validation data sets are not distinct. In many studies, particularly protein structure prediction and docking, there must not be homologues between the three sets (training, validation and test). Otherwise the model is simply recognising homology rather than being tested on unseen data (it is not doing a de novo prediction). In any study, you must check whether homologues have been removed; if not, you cannot trust the accuracy figures.

Protein Docking

Need for it:
- The PDB contains ~4,000 hetero-dimers, compared with ~150,000 entries in total.
- Aim: to predict the structure of a complex starting with the unbound components, where the unbound components are experimental or high-quality predicted structures.
- Need to be able to model limited conformational change.

Ab initio Based Docking

This was the first approach: the limited number of known complexes meant a template-based approach couldn't be used. The idea is that you have two starting molecules, try many different ways of packing them together (rigid-body docking) and select the best-scoring one as your final docking answer. Rigid-body docking is necessary because allowing every degree of freedom for both proteins would make the task impossible. So treat the proteins as rigid bodies and look for shape complementarity as well as electrostatic complementarity; then, once you have a list of possible complexes, evaluate how well they pack.

Step 1 – Global search to find a good overlap of the surfaces of the proteins (look for good global complementarity by trialling all possible combinations).
Step 2 – Residue-residue interactions (look at how the residues interact) by scoring with empirical residue pair potentials.

score = log10(observed no. ab / expected no. ab)

[...]

- [...] 30% identity to the templates in the complex, then they are typically acceptable models according to the CAPRI criteria, and can be obtained for 2/3 of predicted complexes.
- For ~450,000 human protein-protein interactions, 8% have an X-ray or NMR structure, and a further 20-40% can be covered by template docking; we still have a long way to go.

Deep Learning

The concept of correlated mutations also extends to homo- and hetero-complexes. We can join the sequences of two chains and then run AlphaFold. This is an active area of research where the results are very encouraging; we will know more about its accuracy after CASP15.

AF2Complex

You can add multiple sequences together and run them all through AlphaFold2 as if they were one protein, to build a model.

ColabFold

It will fold the individual sequences and dock them together (fold and dock). Note that heterodimers work better, as the correlated mutations tend to give a stronger signal. Even though you can get some remarkably good predictions, this is still a work in progress.

CASP15 Results

Huge improvement in score from CASP12 to CASP15.

AlphaFold3

There is a substantive improvement in protein-protein docking, and a very large improvement in docking antibodies to proteins, which is notable given that there is no co-evolution to guide docking. Awaiting blind trials in CASP16 (December 2024).

Best Approaches

Typically, careful identification of the multiple sequence alignment is followed by AlphaFold-Multimer or similar. Manual groups also used templates if AlphaFold was not successful, and some groups also used interface scoring.
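Step 2 of ab initio docking scores residue-residue interactions with empirical pair potentials. A minimal sketch of one common form, a log-odds ratio of observed to expected pair counts, is given below. The counts are invented toy numbers, and the way the expected count is derived from residue frequencies is an assumption for illustration, not necessarily the scheme used in the lecture.

```python
import math


def pair_potential(observed_pairs, residue_counts):
    """Log-odds score for each unordered residue pair (a, b).

    Positive scores mean the pair is seen across interfaces more often than
    expected by chance; negative scores mean it is seen less often.
    """
    total_pairs = sum(observed_pairs.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), observed in observed_pairs.items():
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        # An unordered pair of different residues can arise in two orders.
        multiplicity = 1 if a == b else 2
        expected = multiplicity * freq_a * freq_b * total_pairs
        scores[(a, b)] = math.log10(observed / expected)
    return scores


# Hypothetical interface statistics for two residue types (K = Lys, E = Glu).
residue_counts = {"K": 50, "E": 50}
observed_pairs = {("E", "K"): 60, ("K", "K"): 20, ("E", "E"): 20}

scores = pair_potential(observed_pairs, residue_counts)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy data the oppositely charged E-K pair scores above zero and the like-charged pairs score below zero. A docking pose would be scored by summing such values over all residue pairs in contact across the interface.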
