Protein Structure Prediction Lecture PDF

BCH405/BIOT401: Bioinformatics

- Levels of protein structure
- Protein structure prediction
- Native methods for structure prediction
- Computational methods
- Homology modeling
- Databases and software for modeling
- Know-how of modeling servers
- Know-how of evaluation servers

Protein structure is often described at four different scales:
- primary structure
- secondary structure
- tertiary structure
- quaternary structure

The primary structure is simply the order of the bonded amino acids in a protein. For example, MAGTAK is a protein with the sequence Met-Ala-Gly-Thr-Ala-Lys, where methionine is at the amino terminus and lysine is at the carboxyl terminus.

The secondary structure is the first type of folding the protein undergoes. There are three basic types of secondary structure: alpha helix, beta strand, and coil. Relatively accurate programs have been developed that predict the secondary structure of a protein when its sequence is known.

Once the protein begins to fold back onto itself, it forms a tertiary structure. There are many types of tertiary structure found in proteins, and predicting the tertiary structure from a primary sequence is a challenge.

Many proteins have a quaternary structure, which consists of several polypeptide chains that associate into an oligomeric molecule.

X-ray crystallography                      NMR spectroscopy
Proteins of any size                       Proteins below 50 kDa
Proteins in crystal                        Proteins in solution
Complete data / total map of structure     Incomplete data
Many details, one model                    Fewer details, many models
Resolution, R-values, Ramachandran plot    RMSD, Ramachandran plot

- Structural information from X-ray crystallography or NMR is obtained much more slowly.
  - The techniques involve elaborate technical procedures.
  - Many proteins fail to crystallize at all and/or cannot be obtained or dissolved in large enough quantities for NMR measurements.
  - The size of the protein is also a limiting factor for NMR.
- With a better computational method this can be done extremely fast.

Computational methods for 3D structure include:
- Homology modeling
- Threading
- Ab initio prediction

For 3D structure prediction there are two basic approaches: (1) compare the protein with proteins of known structure, or (2) predict the structure from the sequence alone, using physical laws and empirical knowledge.

Approach (1) can be subdivided into comparative modeling by (1.1) sequence-sequence comparison (alignment) and comparative modeling by (1.2) sequence-structure comparison (threading).

If, after a global sequence alignment, the identity between two proteins is 25-45%, the two structures are similar. When the identity is about 45% or higher, the structures are essentially the same, i.e. they match almost exactly.

For (1.1), alignment methods such as BLAST are used; for (1.2), various threading methods have been introduced. Threading methods place a new sequence onto a known structure and compute how well the sequence fits that structure, e.g. how many hydrophobic amino acids are buried.

Approach (2) includes "ab initio prediction" and molecular modeling or quantum mechanical modeling.

Homology modeling is the extrapolation of a protein structure for a target sequence, using the known 3D structure of a similar sequence as a template.
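The identity rule of thumb quoted above can be made concrete with a minimal sketch. It assumes the two sequences are already globally aligned, with gaps written as "-", and it counts identity only over columns where both sequences have a residue; other denominators are in use, so this is one convention among several. The names `percent_identity` and `structural_expectation` are illustrative, and the 25% and 45% cut-offs are simply the values given in this lecture.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap columns are not counted (an assumption of this sketch)
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def structural_expectation(identity: float) -> str:
    """Map percent identity to the lecture's rule-of-thumb expectation."""
    if identity >= 45.0:
        return "structures expected to be essentially the same"
    if identity >= 25.0:
        return "structures expected to be similar"
    # Below 25% the rule of thumb quoted above makes no statement.
    return "no inference from sequence identity alone"


if __name__ == "__main__":
    # Toy aligned pair: 3 of 4 compared columns are identical.
    identity = percent_identity("ACDE", "ACDF")
    print(identity, "->", structural_expectation(identity))
```

Identity is only meaningful relative to the alignment it was computed from, so the same pair of proteins can land in a different band if the alignment changes.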
Basis: proteins with similar sequences are likely to assume the same fold.
- Certain proteins with as little as 25% similarity have been observed to assume the same 3D structure.
- Publicly available free tools can be used to predict protein structure by comparative modeling.
- Over 60,000 protein structures have been determined, mostly by X-ray crystallography (PDB).
- The 3D structure of ~70% of bacterial and ~50% of human proteins can be predicted (comparative modeling).

GNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPA
QNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNH
FAMKILDKQKVVKLKQIEHTLNEKRILQAVNFPF
LVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIG
RFSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPE
NLLIDQQGYIQVTDFGFAKRVKGRTWTLCGTPEY
LAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPF
FADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNL
LQVDLTKRFGNLKDGVNDIKNHKWFATTDWIAIY
QRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSIN
EKCGKEFSEF

[Figure: Sequence, Assumption (protein A is similar to protein B), Result (protein A is similar to protein B); a well-studied protein is compared with an unknown protein for similarity prediction.]
Well-studied protein: SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
Unknown protein: GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW

Statistical reliability of the prediction
- E-value: the number of hits one can "expect" to see just by chance when searching a database of a particular size (the closer to zero, the better)
- Z-score: the score expressed as a distance from the mean, measured in standard deviations (the bigger, the better)

Phosphoribosyltransferase and viral coat protein: identity 42%, different folds, different functions

 99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173
214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279

Histone H5 and transcription factor E2F4: identity 7%, similar fold, similar function (DNA binding)

PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW

The aim is to build a 3D model for a protein of unknown structure (the target) on the basis of sequence similarity to proteins of known structure (the templates).

Homology modeling is the most accurate structure prediction method. Why?
- 3D structures of proteins in a given family are mostly conserved
- ~1/3 of all sequences are recognizably related to at least one known structure
- the number of unique protein folds is limited

- Identify template(s), initial alignment
- Improve alignment
- Backbone generation
- Loop modelling
- Side chains
- Refinement
- Validation

GenBank          www.ncbi.nlm.nih.gov/GenBank
GeneCensus       bioinfo.mbb.yale.edu/genome
MODBASE          http://modbase.compbio.ucsf.edu
PDB              www.rcsb.org/pdb/
eMOTIF           http://brutlag.stanford.edu/projects.html
UniProt          http://www.uniprot.org/

BLAST            http://www.ncbi.nlm.nih.gov/BLAST/
FastA            http://www.ebi.ac.uk/fasta33/
SSM              http://www.ebi.ac.uk/msd-srv/ssm/
PredictProtein   http://www.predictprotein.org/
123D; SARF2; PDP http://123d.ncifcrf.gov/
GenTHREADER      http://bioinf.cs.ucl.ac.uk/psipred/
UCLA-DOE         http://fold.doe-mbi.ucla.edu/

Alignment is the most crucial step in the process. Homology modeling cannot recover from a bad alignment!
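The Z-score described earlier, a score expressed as its distance from the mean in standard deviations, can be illustrated with a short sketch. It assumes a hypothetical list of background scores, for example scores of the query against unrelated or shuffled sequences; the numbers below are made up for illustration, and the function name `z_score` is not taken from any particular tool.

```python
from statistics import mean, pstdev


def z_score(hit_score: float, background_scores: list[float]) -> float:
    """Distance of hit_score from the background mean, in standard deviations."""
    mu = mean(background_scores)
    # Population standard deviation; using the sample version is an equally valid choice.
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (hit_score - mu) / sigma


if __name__ == "__main__":
    background = [10, 12, 14, 16, 18]  # hypothetical background alignment scores
    print(round(z_score(30, background), 2))  # a hit far above the background mean
    print(z_score(14, background))            # a hit equal to the mean gives 0.0
```

A larger Z-score means the hit stands further above what unrelated sequences produce, which is why "the bigger, the better" applies.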
EMBOSS       http://www.ebi.ac.uk/emboss/align/
Tcoffee      http://www.igs.cnrs-mrs.fr/Tcoffee
ClustalW     http://www.ebi.ac.uk/clustalw/
BCM          http://searchlauncher.bcm.tmc.edu/multi-align/
POA          http://www.bioinformatics.ucla.edu/poa/
STAMP        http://www.ks.uiuc.edu/Research/vmd/
SwissModel   http://www.expasy.org/spdbv/

Sequence A: VATTPDKSWLTV
Sequence B: ASTPERASWLGTA

    V   A   T   T   P   D   K   S   W   L   T   V
A   0   5   0   0   1   0   0   1  -2  -1   0   0
S  -1   1   2   2   0   0   0   5   0  -1   2  -1
T   0   0   5   5   0   0   0   2  -1   0   5   0
P  -1   1   0   0   8   0   0   0  -3  -2   0  -1
E  -2   1   1   1   1   2   1   1  -2  -2   1  -2
R  -1  -1   0   0   0  -2   2   1   0  -1   0  -1
A   0   5   0   0   1   0   0   1  -2  -1   0   0
S  -1   1   2   2   0   0   0   5   0  -1   2  -1
W  -1  -2  -1  -1  -3  -3  -2   0   6   0  -1  -1
L   2  -1   0   0  -2  -2  -1  -1   0   5   0   2
G  -1   0  -1  -1   0   0   0   0  -2  -2  -1  -1
T   0   0   5   5   0   0   0   2  -1   0   5   0
A   0   5   0   0   1   0   0   1  -2  -1   0   0

VATTPDK-SWLTV-        VATTPDK-SWL-TV
 |*||** |||            |*||** ||| |*
-ASTPERASWLGTA        -ASTPERASWLGTA
score 39              score 45

Sequence A: LTLTLTLT
Sequence B: HAHAHAHAH

LTLTLTLT-        -LTLTLTLT
HAHAHAHAH        HAHAHAHAH
score -4         score 0

Sequence C: THTHTHTHT

-LTLTLTLT-
  | | | |
THTHTHTHT-
 | | | |
-HAHAHAHAH

The third sequence, from a homologous protein, allows alignment. It is a very good idea to have more than one template.

           1   2   3   4   5   6   7   8   9   10  11  12  13  14
Template   PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
Model      PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
Model      PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS

Gaps in the model sequence omit residues from the template.

Sequence alignment programs do not always give the best structural alignment.
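The two totals shown for sequences A and B (39 and 45) can be reproduced by summing the matrix entries over the aligned residue pairs. The sketch below transcribes the matrix from the slide and scores gap columns as 0. That zero gap score is an inference from the totals, not something the slide states, and the names `ROWS`, `SCORES`, and `alignment_score` are illustrative.

```python
# Substitution scores transcribed from the lecture's matrix.
# Columns follow sequence A; there is one row per distinct residue of sequence B.
COLUMNS = "VATTPDKSWLTV"
ROWS = {
    "A": [0, 5, 0, 0, 1, 0, 0, 1, -2, -1, 0, 0],
    "S": [-1, 1, 2, 2, 0, 0, 0, 5, 0, -1, 2, -1],
    "T": [0, 0, 5, 5, 0, 0, 0, 2, -1, 0, 5, 0],
    "P": [-1, 1, 0, 0, 8, 0, 0, 0, -3, -2, 0, -1],
    "E": [-2, 1, 1, 1, 1, 2, 1, 1, -2, -2, 1, -2],
    "R": [-1, -1, 0, 0, 0, -2, 2, 1, 0, -1, 0, -1],
    "W": [-1, -2, -1, -1, -3, -3, -2, 0, 6, 0, -1, -1],
    "L": [2, -1, 0, 0, -2, -2, -1, -1, 0, 5, 0, 2],
    "G": [-1, 0, -1, -1, 0, 0, 0, 0, -2, -2, -1, -1],
}

# Lookup keyed by (residue in B, residue in A).
SCORES = {
    (b, a): score
    for b, row in ROWS.items()
    for a, score in zip(COLUMNS, row)
}


def alignment_score(aligned_a: str, aligned_b: str, gap_score: int = 0) -> int:
    """Sum substitution scores over aligned columns; gap columns add gap_score."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            total += gap_score  # assumed 0 to match the slide's totals
        else:
            total += SCORES[(b, a)]
    return total


if __name__ == "__main__":
    print(alignment_score("VATTPDK-SWLTV-", "-ASTPERASWLGTA"))  # 39
    print(alignment_score("VATTPDK-SWL-TV", "-ASTPERASWLGTA"))  # 45
```

The second alignment scores higher only because one extra gap lets T pair with T. Passing a negative `gap_score` shows how quickly the preferred alignment depends on the gap penalty.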
- Break protein folds into conserved core, loops, and side chains
- Overlap template structures and generate the backbone
- Generation of loops (data based)
- Side chain generation based on known preferences
- Ab initio loop building (energy based)
- Overall model optimization (energy minimization)

SwissModel   http://swissmodel.expasy.org/SWISS-MODEL.html
Modeller     http://salilab.org
Geno3D       http://geno3d-pbil.ibcp.fr
ESyPred      http://www.fundp.ac.be/sciences/biologie/urbm/bioinfo/esypred/
3D-jigsaw    http://www.bmm.icnet.uk/servers/3djigsaw/
CPHmodels    http://www.cbs.dtu.dk/services/CPHmodels/

- All have similar accuracy. Most important are a good alignment and good target selection!
- Modeling should allow flexibility and automation
- Multiple templates increase your odds of getting a good model

- Errors in side chain packing
- Template distortions because of crystal packing forces
- Loop generation
- Misalignments
- Incorrect templates

Recognition:     Are there any well-characterized proteins similar to my protein?
Alignment:       What is the position-by-position target/template equivalence?
Modeling:        What is the detailed 3D structure of my protein?
Model analysis:  Is my model any good?

- BLAST, PSI-BLAST or PFAM, FFAS, metaserver (bioinfo)
- Name (PDB code) of the template
- Statistical significance of the match (Z-score, E-value, p-value, points)
- The same tools as in recognition (perhaps with different parameters), editing by hand
- Position-by-position equivalence table

Freeware: available for all OS
- Downloadable:
  - Modeller (Sali, 1998)
  - DeepView (SwissPDB viewer)
  - WHATIF (Krieger et al. 2003)
- Web based:
  - SWISS MODEL server (www.expasy.org/swissmod/SWISS-MODEL.html)
  - CPH model server (http://www.cbs.dtu.dk/services/CPHmodels)
  - SDSC1 server (http://cl.sdsc.edu/hm.html)

- Easy (100-40% sequence identity): strong sequence similarity, strong structure similarity, obvious function analogy
- Difficult (40-25% sequence identity): twilight zone sequence similarity, increasing structure divergence, function diversification
- Fold prediction (below 25% sequence identity): no apparent sequence similarity, extreme function divergence

Swiss Model
SWISS-MODEL is a fully automated protein structure homology-modeling server. The purpose of this server is to make protein modeling accessible to all biochemists and molecular biologists worldwide.

SWISS-MODEL involves:
- Search for suitable templates
- Check sequence identity with target
- Generate models with ProModII
- Energy minimization

COLORADO3D   http://genesilico.pl/
PROCHECK     http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
VERIFY3D     http://fold.doe-mbi.ucla.edu/
PROSAII      http://www.came.sbg.ac.at/
WHATCHECK    http://swift.cmbi.kun.nl/WIWWWI/modcheck.html

Narrated Samura bin Jundab: The Prophet (S.A.W.W) said, in his narration of a dream he saw, "He whose head was being crushed with a stone was one who learnt the Quran but never acted on it and slept ignoring the compulsory prayers." (Sahih al Bukhari: 1143)
