Molecular Biology Database Resources PDF

SRM Institute of Science and Technology

Summary

This document provides an introduction to database resources in molecular biology. It covers various types of biological databases, including primary and secondary sequence databases (e.g., GenBank, EMBL, DDBJ for nucleic acids; Swiss-Prot, TrEMBL for proteins), and structure databases. It also mentions high-throughput sequencing databases, metabolic databases, and others. The document explains database structure and access, clarifying the different data types stored and their uses in bioinformatics.

Full Transcript

UNIT 2 - DATABASE RESOURCES IN MOLECULAR BIOLOGY

• Introduction to databases and biological databases; uses of biological databases
• Primary sequence databases: nucleotide and protein sequence databases
• Primary structure databases; PDB file format; FASTA, GCG, VFF, etc.
• High-throughput sequencing databases
• Secondary databases: secondary sequence databases; secondary structure databases (SCOP, CATH)
• Composite protein databases
• Metabolic databases; SNP databases
• Whole genome databases; Mendelian disease databases; chemical structure databases; bibliographic databases

BIOLOGICAL DATABASES

WHAT IS A DATABASE?
A collection of data that is:
• structured
• searchable (index) -> table of contents
• updated periodically (release) -> new edition
• cross-referenced (hyperlinks) -> links with other databases
It also includes the associated tools (software) needed for database access, updating, and the insertion and deletion of information.

DATABASES
[Figure: a database shown as four layers, each illustrated with everyday and bioinformatics examples]
  Information system: the UBC library, Google, Entrez, SRS
  Query system:       a list you look at, a catalogue, indexed files, SQL, grep
  Storage system:     boxes, bookshelves, Oracle, MySQL, PC binary files, Unix text files
  Data:               title of a book, book, GenBank flat file, PDB file, interaction record

TYPES OF DATABASE
There are many different database types, depending on both:
• the nature of the information being stored (e.g. sequences or structures)
• the manner of data storage (in flat files or in tables)

DATABASES: A SIMPLE EXAMPLE
« Introduction To Database » Teacher Database (ITDTdb) (flat file, 3 entries)

Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
http://expasy4.expasy.ch/people/amos.html
//
Accession number: 2
First Name: Laurent
Last name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
http://expasy4.expasy.ch/people/Marie-Claude.Blatter-Garin.html
//

• Easy to manage: all the entries are visible at the same time!

DATABASES: A SIMPLE EXAMPLE (CONT.)
Relational database (« table file »):

Teacher    Accession number   Education
Amos       1                  Biochemistry
Laurent    2                  Biochemistry
M-Claude   3                  Biochemistry

Course     Date               Involved teachers
DEA        Oct-nov-dec 2000   1,3
EMBnet     Sept 2000          2,3

• Easier to manage; choice of the output

BIOLOGICAL DATABASE
• A biological database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated.
The activity of preparing a database can be divided into:
• Collecting the data in a form that can be easily accessed
• Making it available on a multi-user system (always available to the user)

IMPORTANCE OF DATABASES
• A database is any collection of related data.
• A computerized archive is used to store and organize data in such a way that information can be retrieved easily.
• A database is a collection of interrelated data stored together without harmful or unnecessary redundancy (duplicate data) to serve multiple applications.
• Retrieval is performed by issuing ("firing") a query.
• A database system is an integrated collection of related files, along with details of their definition, interpretation, manipulation and maintenance.
• A database system is based on the data.
• A database system is run using software called a DBMS (database management system).
• A database system protects the data from unauthorized access.
• A database management system (DBMS) is a collection of programs that enables users to create and maintain a database.
• Database management systems provide several functions in addition to simple file management: they allow concurrency, control security, maintain data integrity, provide for backup and recovery, control redundancy, allow data independence, provide a nonprocedural query language, and perform automatic query optimization.
• Relational database: a database that treats all of its data as a collection of relations.

WHY BIOLOGICAL DATABASES?
• Explosive growth in biological data
• Data (sequences, 3D structures, 2D gel analysis, MS analysis, ...) are no longer published in a conventional manner, but submitted directly to databases
• Essential tools for biological research, as classical publications used to be!

TYPE OF DATA
• nucleotide sequences
• protein sequences
• protein sequence patterns or motifs
• macromolecular 3D structure
• gene expression data
• metabolic pathways
• proteomics data

BIOINFORMATICS DATABASES ARE CATEGORIZED ON THE BASIS OF
• Data type
• Maintainer status
• Technical design
• Data source
• Data access
• Any other parameter

Type of data:
• proteomic database, structure database, etc.
Maintainer status:
• NCBI - National Center for Biotechnology Information
• EBI - European Bioinformatics Institute
Technical design:
• Flat file, XML, relational model, object-oriented/object-relational model
Data source:
• Primary database, secondary database
Data access:
• Various kinds of access status: publicly available with no restrictions (NCBI, EBI), or available with copyright
Others:
• Complete or incomplete entries in the database
• Annotation: entries are either not annotated or annotated (they include an analysis of the data)
• Curation: when annotation is established, the database is known as curated

Databases in general can be classified into
primary, secondary and composite databases.

PRIMARY SEQUENCE DATABASES
• A primary database contains information on the sequence or structure alone. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide sequences, and the Protein Data Bank for protein structures.

NUCLEIC ACID   PROTEIN
EMBL           PIR
GenBank        MIPS
DDBJ           SWISS-PROT
               TrEMBL
               NRL-3D

SEQUENCE DATABASES
• Primary DNA: DDBJ/EMBL/GenBank
• Primary protein: GenPept/TrEMBL
• Curated DB: RefSeq (genomic, mRNA and protein); Swiss-Prot & PIR -> UniProt (protein)

NUCLEIC ACID SEQUENCE DATABASES
• The principal DNA sequence databases are DDBJ, EMBL and GenBank, which exchange data on a daily basis to ensure comprehensive coverage at each site.

[Figure: daily exchange of submissions and updates among GenBank (NCBI, NIH; queried through Entrez), EMBL (EBI; queried through SRS) and DDBJ (CIB, NIG; queried through getentry)]

WHAT IS GENBANK?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Benson et al., 2004, Nucleic Acids Res. 32:D23-D26

GENBANK FLAT FILE (GBFF)
Slide callouts mark the parts of the record: header (title, taxonomy, citation), features (including the amino acid sequence) and DNA sequence.

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases. Chihiro
            Tohda, Toyama Medical and Pharmaceutical University, Research
            Institute for Wakan-yaku, Analytical Research Center for
            Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:CHIH [email protected] a-mpu.ac.jp, Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//

PROTEIN SEQUENCE DATABASES
• PIR - Protein Information Resource
• MIPS - Munich Information Center for Protein Sequences
• SWISS-PROT
• TrEMBL - Translated EMBL
• NRL-3D

PIR (THE PROTEIN INFORMATION RESOURCE)
• The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies.
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF).
• Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965 to 1978 under the editorship of Margaret O. Dayhoff.
• In 2002 PIR, along with its international partners EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics), was awarded a grant from NIH to create UniProt.
• UniProt is a single worldwide database of protein sequence and function, created by unifying the PIR-PSD, Swiss-Prot, and TrEMBL databases.
• In its current form, PIR is split into four sections:
  - PIR1: contains fully classified and annotated entries.
  - PIR2: includes preliminary entries, which have not been thoroughly reviewed and may contain redundancy.
  - PIR3: unverified entries, which have not been reviewed.
  - PIR4: entries fall into four categories:
    1. Conceptual translations of artefactual sequences.
    2. Conceptual translations of sequences that are not transcribed or translated.
    3. Protein sequences or conceptual translations that are extensively genetically engineered.
    4. Sequences that are not genetically encoded and not produced on ribosomes.

SWISS-PROT
• SWISS-PROT is a protein sequence database that was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL.
• After 1994 the collaboration moved to EMBL's UK outstation, the EBI.
• In April 1998, it moved to the Swiss Institute of Bioinformatics (SIB).

SWISS-PROT (CONT.)
• SWISS-PROT annotation incorporates:
  - Function of the protein
  - Post-translational modifications
  - Domains and sites
  - Secondary structure
  - Quaternary structure
  - Similarities to other proteins
  - Diseases associated with deficiencies in the protein
  - Sequence conflicts, variants, etc.

Example SWISS-PROT entry. Slide callouts mark the taxonomy (OS/OC lines), citation (RN to RL lines), disclaimer (copyright block in the CC lines) and database cross-reference (DR lines) sections.

ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685. [NCBI, ExPASy, Israel, Japan]
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814. [NCBI, ExPASy, Israel, Japan]
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine gamma-
RT   synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532. [NCBI, ExPASy, Israel, Japan]
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//

UNIPROT
• UniProt is a new protein sequence database resulting from the merger of SWISS-PROT and PIR. It is intended to be the annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding sequence annotations in EMBL/GenBank/DDBJ nucleic acid sequence data.
• UniProt is a flat-file database, just like EMBL and GenBank.
• The flat-file format is Swiss-Prot-like (that is, EMBL-like).

• UniProt comprises four components, each optimised for different uses:
  1) The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. It consists of two sections:
     - UniProtKB/Swiss-Prot, which is manually annotated and reviewed
     - UniProtKB/TrEMBL, which is automatically annotated and not reviewed
  2) The UniProt Reference Clusters (UniRef) databases provide clustered sets of sequences from UniProtKB and selected UniProt Archive records, giving complete coverage of sequence space at several resolutions while hiding redundant sequences.
  3) The UniProt Archive (UniParc) is a comprehensive repository used to keep track of sequences and their identifiers.
  4) The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository developed specifically for metagenomic and environmental data.

TREMBL
• TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated protein sequence database supplementing the SWISS-PROT Protein Sequence Data Bank.
• TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in SWISS-PROT.
• TrEMBL can be considered a preliminary section of SWISS-PROT. SWISS-PROT accession numbers have been assigned to all TrEMBL entries that should eventually be upgraded to standard SWISS-PROT quality.

TREMBL (CONT.)
• TrEMBL has two main sections:
  - SP-TrEMBL: contains entries that will eventually be incorporated into SWISS-PROT but have not yet been manually annotated.
  - REM-TrEMBL: contains sequences that are not destined to be included in SWISS-PROT (e.g.
peptides of fewer than 8 amino acids, and synthetic sequences).

NRL-3D
• The NRL-3D database is produced by PIR from sequences extracted from the Brookhaven Protein Data Bank (PDB).
• Titles, biological sources, bibliographic references and MEDLINE references are included, together with secondary structure, active site, binding site and modified site annotations, details of experimental methods, etc.
• It is a valuable resource, as it makes the sequence information in the PDB available both for keyword interrogation and for similarity searches.
• There are also many specialized protein databases for specific families or groups of proteins. Examples: YPD (yeast proteins), AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), etc.

PDB
• Protein Data Bank
• Protein and nucleic acid 3D structures
• Sequence present

Example PDB entry (1DGC). Slide callouts mark the HEADER, COMPND, SOURCE, AUTHOR, date, JRNL, REMARK, SEQRES and atom coordinate records.

HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC      1DGC   2
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC                   1DGC   3
COMPND   2 ATF/CREB SITE DNA                                            1DGC   4
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC        1DGC   5
AUTHOR    T.J.RICHMOND                                                  1DGC   6
REVDAT   1   22-JUN-94 1DGC    0                                        1DGC   7
JRNL        AUTH   P.KONIG,T.J.RICHMOND                                 1DGC   8
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO        1DGC   9
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA   1DGC  10
JRNL        TITL 3 FLEXIBILITY                                          1DGC  11
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993      1DGC  12
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070  1DGC  13
REMARK   1                                                              1DGC  14
REMARK   2                                                              1DGC  15
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.                                  1DGC  16
REMARK   3                                                              1DGC  17
REMARK   3 REFINEMENT.                                                  1DGC  18
REMARK   3   PROGRAM                    X-PLOR                          1DGC  19
REMARK   3   AUTHORS                    BRUNGER                         1DGC  20
REMARK   3   R VALUE                    0.216                           1DGC  21
REMARK   3   RMSD BOND DISTANCES        0.020  ANGSTROMS                1DGC  22
REMARK   3   RMSD BOND ANGLES           3.86   DEGREES                  1DGC  23
REMARK   3                                                              1DGC  24
REMARK   3   NUMBER OF REFLECTIONS      3296                            1DGC  25
REMARK   3   RESOLUTION RANGE      10.0 - 3.0  ANGSTROMS                1DGC  26
REMARK   3   DATA CUTOFF                3.0    SIGMA(F)                 1DGC  27
REMARK   3   PERCENT COMPLETION         98.2                            1DGC  28
REMARK   3                                                              1DGC  29
REMARK   3   NUMBER OF PROTEIN ATOMS                       456          1DGC  30
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS                  386          1DGC  31
REMARK   4                                                              1DGC  32
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO  1DGC  33
REMARK   4 ACID BIOSYNTHETIC ENZYMES.                                   1DGC  34
REMARK   5                                                              1DGC  35
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE    1DGC  36
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.                              1DGC  37
REMARK   6                                                              1DGC  38
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.            1DGC  39
REMARK   7                                                              1DGC  40
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -     1DGC  41
REMARK   7 226 ARE NOT WELL ORDERED.                                    1DGC  42
REMARK   8                                                              1DGC  43
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:                            1DGC  44
REMARK   8  5'  T  G  G  A  G  A  T  G  A  C  G  T  C  A  T  C  T  C  C 1DGC  45
REMARK   8    -10 -9 -8 -7 -6 -5 -4 -3 -2 -1  1  2  3  4  5  6  7  8  9 1DGC  46
REMARK   9                                                              1DGC  47
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA         1DGC  48
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.                                 1DGC  49
REMARK  10                                                              1DGC  50
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF    1DGC  51
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD    1DGC  52
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY        1DGC  53
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND             1DGC  54
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:                 1DGC  55
REMARK  10                                                              1DGC  56
REMARK  10     0 -1  0     X     117.32     X SYMM                      1DGC  57
REMARK  10    -1  0  0     Y  +  117.32  =  Y SYMM                      1DGC  58
REMARK  10     0  0 -1     Z      43.33     Z SYMM                      1DGC  59
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG  1DGC  60
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG  1DGC  61
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU  1DGC  62
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL  1DGC  63
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG              1DGC  64
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C  1DGC  65
SEQRES   2 B   19    A   T   C   T   C   C                              1DGC  66
HELIX    1   A ALA A  228  LYS A  276  1                                1DGC  67
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8  1DGC  68
ORIGX1      1.000000  0.000000  0.000000        0.00000                 1DGC  69
ORIGX2      0.000000  1.000000  0.000000        0.00000                 1DGC  70
ORIGX3      0.000000  0.000000  1.000000        0.00000                 1DGC  71
SCALE1      0.017047  0.000000  0.000000        0.00000                 1DGC  72
SCALE2      0.000000  0.017047  0.000000        0.00000                 1DGC  73
SCALE3      0.000000  0.000000  0.011539        0.00000                 1DGC  74
ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94      1DGC  75
ATOM      2  CA  PRO A 227      34.172 107.658  15.972  1.00 39.82      1DGC  76
...
ATOM    842  C5    C B   9      57.692 100.286  22.744  1.00 29.82      1DGC 916
ATOM    843  C6    C B   9      58.128 100.193  21.465  1.00 30.63      1DGC 917
TER     844        C B   9                                              1DGC 918
MASTER       46    0    0    1    0    0    0    6  842    2    0    7  1DGC 919
END                                                                     1DGC 920

[Figure: the structure displayed in RasMol]

SEQUENCE FILE FORMATS
• Plain sequence format
• EMBL format
• FASTA file formats
• FASTQ file formats
• GCG format (Genetics Computer Group)
• GCG-RSF (rich sequence format)
• IG format

PLAIN SEQUENCE FORMAT
• A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).
• Note: a file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file.
• An example sequence in plain format is:

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTC
CCATCCGTGTCTATTGTACCGTTGCTTCGGCGGGCCCGCCGCTTGTCGG
CCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAG
ACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATG
CAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGC

EMBL FORMAT
• A sequence file in EMBL format can contain several sequences. Each sequence entry starts with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ", and the end of the sequence is marked by two slashes ("//"). An example sequence in EMBL format is:

ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//

FASTA FILE FORMATS
• A sequence file in FASTA format can contain several sequences.
• One sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
• The description line must begin with a greater-than (">") symbol in the first column. An example sequence in FASTA format is:
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
• These file types, denoted by the .fas extension, are used by most large curated databases. Specific extensions exist for nucleic acids (.fna), nucleotide coding regions (.ffn), amino acids (.faa), and non-coding RNAs (.frn).
• A FASTA file can contain one or many sequences.
• Tools like ClustalW can take FASTA files with multiple sequences to generate an alignment.
• Converting between FASTA and any of the other formats discussed below can be done with programs like Seqret and MView.
• Other simple sequence file formats that you may encounter include GCG and IG.

FASTQ FILE FORMATS
• The FASTQ format was developed for, and is used with, next-generation sequencing instruments; it builds on the simplicity of the FASTA format.
• Information about the quality of the sequencing reads and base calls (the "Q" in "FASTQ" stands for quality) is a defining component of the FASTQ file format.
• Each FASTQ record consists of four lines. The first line is a sequence identifier and description. It begins with an "@" symbol followed by information about the sequence. Illumina sequencers use a standardized format that includes the unique instrument name, the flowcell lane, etc.
• The second line contains the raw sequence data, as in a FASTA file.
• The third line contains the "+" symbol, optionally followed by the repeated identifier.
• The fourth line holds the quality score for each base in the sequence on the second line and must be the same length as that sequence.
• The quality score on the fourth line is the Phred score (Q), a measure of the quality of the nucleobase identifications generated by automated DNA sequencing. Each score is encoded as a single ASCII character.
• Q is calculated in different ways and ranges, depending on the sequencing platform, and is derived from the probability that a specific base call in a raw sequence is incorrect.
• In its most straightforward calculation, which is used for Sanger sequencing: Q = -10 × log10(p), where p is the probability that the base call is incorrect.
• The larger Q is, the higher the base call accuracy.
• For example:
• A Q of 20 means that a base call is incorrect about once in every 100 bases (p = 0.01).
• A Q of 30 means that a base call is incorrect about once in every 1000 bases (p = 0.001).
• FASTQ files typically have the file extension .fastq, .sanfastq, or .fq, though there is no standard.

GCG FORMAT (GENETICS COMPUTER GROUP)
• A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package. An example sequence in GCG format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518 Length: 237 Check: 4514 ..
  1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
 61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc

GCG-RSF (RICH SEQUENCE FORMAT)
• The newer GCG-RSF format can contain several sequences in one file.
• This format should only be used if the file was created with the GCG package.

IG FORMAT
• A sequence file in IG format can contain several sequences. Each entry consists of several comment lines that must begin with a semicolon (";"), a line with the sequence name (it may not contain spaces!) and the sequence itself, terminated with the character '1' for linear or '2' for circular sequences. An example sequence in IG format is:
; comment
; comment
U03518
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC1

HIGH THROUGHPUT SEQUENCING DATABASES
• Sequence Read Archive
The SRA is NIH's archive of high-throughput sequencing data and is part of the International Nucleotide Sequence Database Collaboration (INSDC) that includes the NCBI Sequence Read Archive (SRA), the European Bioinformatics Institute (EBI), and the DNA Database of Japan (DDBJ). Data submitted to any of the three organizations are shared among them.
• SRA Mission
The SRA is a publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys.
SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.
The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads. BAM files (with the .bam file extension) are closely related to SAM files, which are tab-delimited text files used for storing sequence alignment data. The advantage of BAM over SAM is that it is a compressed binary version that is smaller in size and can be indexed, which makes BAM files ideal for storing sequence alignment information and the preferred input for the Integrative Genomics Viewer.
• Like most file formats used in bioinformatics, BAM files contain a header and a body.
• The header stores information about the sequences; each header line begins with an "@" symbol.
• The body contains information about how each sequence aligns with a specific reference sequence.
• Each alignment line includes 11 mandatory data fields, including Phred-scaled quality scores, a string called CIGAR that describes the alignment, and other metadata.

SECONDARY DATABASES
• Secondary databases contain the results of analyses of the sequences in the primary sources.
• They are either manually curated (e.g. PROSITE, Pfam) or automatically generated (e.g. ProDom, DOMO).
• They differ in the method used to detect whether a protein belongs to a particular domain or family (patterns, profiles, HMMs).

SECONDARY DATABASE
Secondary db    Primary source        Information
PROSITE         SWISS-PROT            Patterns (regular expressions)
PROSITE         SWISS-PROT            Profiles (weighted matrices)
PRINTS          OWL and SWISS-PROT    Aligned motifs (fingerprints)
Pfam            SWISS-PROT            HMMs (hidden Markov models)
BLOCKS          PROSITE/PRINTS        Aligned motifs
IDENTIFY        BLOCKS/PRINTS         Fuzzy regular expressions

PROSITE
▪ Created in 1988 (SIB). This was the first secondary database.
▪ Contains fully annotated functional domains, based on two methods: patterns and profiles.
▪ Helps to determine to which family of proteins a new sequence might belong, or which domain(s) or functional site(s) it may contain.

PATTERNS
• Motifs are encoded as regular expressions, often simply referred to as patterns.
CTDEGGIS
CYEDGGIS
CYEEGGIT
CYHGDGGS
CYRGDGNT
C-Y-X2-[DG]-G-X-[ST] (regular expression)
▪ Entries are deposited in PROSITE in two distinct files:
▪ Patterns/profiles, with the lists of all matches in the parent version of SWISS-PROT
▪ Documentation
▪ The process used to derive patterns involves the construction of a multiple alignment and manual inspection to identify the conserved region.
▪ As of 30 August 2012, release 20.85 has 1,650 documentation entries, 1,308 patterns, 1,039 profiles, and 1,041 ProRules.

STRUCTURE OF PROSITE ENTRIES
• Entries are deposited in PROSITE in two distinct files.
• Patterns: the pattern, with a list of all the matches in the parent version of SWISS-PROT.
• Documentation: details of the characterized family and, where known, a description of the biological role of the chosen motif, with a supporting bibliography.

DETERMINING SIGNIFICANCE OF DATABASE MATCHES
• True positive: a related sequence that is correctly matched.
• True negative: an unrelated sequence that is correctly not matched.
• False positive: an unrelated sequence that is matched.
• False negative: a related sequence that the pattern fails to match.

PROSITE (PATTERN): EXAMPLE
ID EPO_TPO; PATTERN.
AC PS00817;
DT OCT-1993 (CREATED); NOV-1995 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE Erythropoietin / thrombopoietin signature.
PA P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.
NR /RELEASE=38,80000;
NR /TOTAL=14(14); /POSITIVE=14(14); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=0; /PARTIAL=1;
CC /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC /SITE=3,disulfide; /SITE=11,disulfide;
DR P48617, EPO_BOVIN , T; P33707, EPO_CANFA , T; P33708, EPO_FELCA , T;
DR P01588, EPO_HUMAN , T; P07865, EPO_MACFA , T; Q28513, EPO_MACMU , T;
DR P07321, EPO_MOUSE , T; P49157, EPO_PIG , T; P29676, EPO_RAT , T;
DR P33709, EPO_SHEEP , T; P42705, TPO_CANFA , T; P40225, TPO_HUMAN , T;
DR P40226, TPO_MOUSE , T; P49745, TPO_RAT , T;
DR P42706, TPO_PIG , P;
DO PDOC00644;
//
(Slide annotations: the NR lines report diagnostic performance; the DR lines give the list of matches.)

PROFILE
• Variable regions between the conserved motifs also contain information.
• A discriminator, termed a profile, is used to indicate where insertions and deletions (INDELs) are allowed, what types of residue are allowed at what positions, and where the more conserved regions are.
• Profiles (weight matrices) provide a sensitive means of detecting:
• distant sequence relationships
• regions where only a few residues are conserved

PROSITE (PROFILE): EXAMPLE
PROSITE: PS50097
ID BTB; MATRIX.
AC PS50097;
DT DEC-1999 (CREATED); DEC-1999 (DATA UPDATE); DEC-1999 (INFO UPDATE).
DE BTB domain profile.
MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=67;
MA /DISJOINT: DEFINITION=PROTECT; N1=6; N2=62;
MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=.9751; R2=.02068202; TEXT='-LogE';
MA /CUT_OFF: LEVEL=0; SCORE=363; N_SCORE=8.5; MODE=1; TEXT='!';
MA /CUT_OFF: LEVEL=-1; SCORE=267; N_SCORE=6.5; MODE=1; TEXT='?';
MA /DEFAULT: D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-2;
MA /I: B1=0; BI=-105; BD=-105;
MA /M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12;
MA /M: SY='D'; M=-16,41,-28,53,15,-34,-11,-1,-33,0,-27,-25,21,-11,0,-8,2,-6,-26,-38,-19,7;
MA /M: SY='V'; M=2,-23,-8,-28,-24,-1,-24,-25,16,-20,7,6,-20,-25,-23,-20,-10,-4,24,-23,-9,-24;
MA /M: SY='T'; M=-2,-13,-18,-19,-13,-7,-24,-19,6,-8,-2,1,-11,-17,-11,-10,-1,10,10,-24,-6,-13;
MA /M: SY='L'; M=-11,-30,-22,-33,-24,15,-32,-23,25,-29,35,17,-26,-27,-23,-22,-24,-9,16,-17,3,-24;
MA /M: SY='V'; M=0,-11,-18,-13,-10,-12,-20,-13,1,-6,-4,2,-10,-19,-6,-7,-4,-2,8,-25,-9,-9;
MA /M: SY='V'; M=1,-25,-3,-29,-25,-2,-26,-26,17,-22,10,7,-23,-25,-23,-22,-11,-3,24,-27,-10,-25;
MA /M: SY='D'; M=-6,7,-26,8,7,-25,6,-7,-27,0,-23,-17,8,-13,0,-3,3,-6,-23,-27,-17,3;
MA /I: I=-5; MI=0; IM=0; DM=-15; MD=-15;
MA /M: SY='G'; M=-6,8,-27,8,-3,-27,22,-7,-30,-8,-26,-19,10,-14,-8,-9,2,-9,-24,-28,-21,-6;
…

PROSITE (PROFILE): EXAMPLE (CONT.)
…
MA /M: SY='T'; M=-3,3,-16,1,-3,-18,-12,-9,-20,-6,-19,-15,2,-7,-6,-6,10,15,-13,-27,-12,-5;
MA /M: SY='G'; M=-1,1,-25,2,-9,-26,31,-12,-32,-10,-26,-18,4,-17,-12,-10,1,-12,-24,-25,-22,-11;
MA /M: SY='E'; M=-9,3,-24,4,13,-25,-16,-1,-24,13,-21,-13,3,-9,6,13,-3,-6,-20,-27,-13,8;
MA /M: SY='I'; M=-6,-21,-18,-25,-21,-2,-29,-21,21,-21,14,10,-19,-24,-17,-19,-13,-3,19,-23,-3,-20;
MA /M: SY='E'; M=-4,3,-23,3,4,-18,-11,-7,-17,-1,-18,-13,3,-9,-1,-5,1,-4,-14,-25,-11,1;
MA /M: SY='I'; M=-8,-25,-23,-27,-20,1,-30,-21,21,-20,18,12,-22,-18,-18,-18,-18,-7,16,-21,-1,-20;
MA /M: SY='P'; M=-6,0,-24,2,1,-22,-13,-8,-21,-2,-23,-15,1,14,-4,-7,3,2,-19,-31,-18,-3;
MA /M: SY='E'; M=-7,1,-27,4,11,-24,-15,-4,-19,2,-18,-11,0,-1,6,-1,-2,-6,-19,-25,-14,7;
MA /I: E1=0; IE=-105; DE=-105;
NR /RELEASE=39,87397;
NR /TOTAL=46(44); /POSITIVE=45(43); /UNKNOWN=1(1); /FALSE_POS=0(0);
NR /FALSE_NEG=0; /PARTIAL=0;
CC /TAXO-RANGE=??E?V; /MAX-REPEAT=2;
DR O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T;
DR P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; Q01295, BRC1_DROME, T;
DR Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T; Q28068, CALI_BOVIN, T;
DR Q13939, CALI_HUMAN, T; Q08605, GAGA_DROME, T; Q01820, GCL1_DROME, T;
DR P10074, HKR3_HUMAN, T; Q04652, KELC_DROME, T; P42283, LOLL_DROME, T;
DR P42284, LOLS_DROME, T; O14682, PI10_HUMAN, T; Q05516, PLZF_HUMAN, T;
DR O43791, SPOP_HUMAN, T; P42282, TTKA_DROME, T; P17789, TTKB_DROME, T;
DR P21073, VA55_VACCC, T; P24768, VA55_VACCV, T; P21037, VC02_VACCC, T;
DR P17371, VC02_VACCV, T; P32228, VC04_SPVKA, T; P32206, VC13_SPVKA, T;
DR P21013, VF03_VACCC, T; P24357, VF03_VACCV, T; P22611, VMT8_MYXVL, T;
DR P08073, VMT9_MYXVL, T; O43167, Y441_HUMAN, T; Q10225, YAZ4_SCHPO, T;
DR P40560, YIA1_YEAST, T; P34324, YKV2_CAEEL, T; P34371, YLJ8_CAEEL, T;
DR P34568, YNV5_CAEEL, T; P41886, YPT9_CAEEL, T; Q09563, YR47_CAEEL, T;
DR Q10017, YSW1_CAEEL, T; Q13105, Z151_HUMAN, T; Q60821, Z151_MOUSE, T;

PRINTS
• Compendium of protein motif fingerprints.
• Most protein families are characterized by several conserved motifs.
• Fingerprint: set of motif(s) (simple or composite, such as multidomains) = signature of family membership.
• True family members exhibit all elements of the fingerprint, while subfamily members may possess only a part.
• PRINTS fingerprints are currently derived from the OWL database.
• This is done by an iterative process that is repeated until no further complete fingerprint matches can be identified.
• The results are then annotated for inclusion in the database.

BLOCKS
• The Blocks Database contains multiple alignments of conserved regions in protein families.
• The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.
• Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
• Block Searcher, Get Blocks and Block Maker are aids to the detection and verification of protein sequence homology.
• They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.
• The BLOCKS Database is based on InterPro entries with sequences from SWISS-PROT and TrEMBL and with cross-references to PROSITE and/or PRINTS and/or SMART and/or PFAM and/or ProDom entries.

BLOCKS DATABASE
• The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
• The blocks created by Block Maker are created in the same manner as the blocks in the Blocks Database but with sequences provided by the user.
• Results are reported in a multiple sequence alignment format without calibration and in the standard Block format for searching.

FORMAT OF A BLOCK
ID short_identifier; BLOCK
AC block_number; distance from previous block = (min,max)
DE description
BL xxx motif; width=w; seqs=s; 99.5%=n1; strength=n2
sequence_id (offset) sequence_segment sequence_weight
...
//
• The ID line starts a block entry and contains a short identifier for the group of sequences from which the block was made. If the block was taken from InterPro, it will be the InterPro group ID.
• The identifier is terminated by a semicolon, and the word "BLOCK" indicates the entry type.
• The AC line contains the block number, a seven-character group number for the sequences from which the block was made.
• The DE line contains a description of the group of sequences from which the block was made.
• The BL line contains the following fields:
xxx = the amino acids in the spaced triplet found by MOTIF upon which the block is based.
w = width of the sequence segments (columns) in the block.
s = number of sequence segments (rows) in the block.
n1 = raw calibration score; the 99.5th percentile score of true negative sequences. Raw search scores are normalized by dividing by this score and multiplying by 1000.
n2 = median normalized score of known true positive sequences as documented in InterPro.
• Following the BL line are lines for each sequence with a segment in the block. The segments may be clustered, with clusters separated by blank lines. Each segment line contains a sequence identifier, the offset from the beginning of the sequence to the block in parentheses, the sequence segment, and a weight for the segment. The weights are normalized so that the most distant segment has a weight of 100.
• The // line terminates a block entry.

PFAM
http://www.sanger.ac.uk/Software/Pfam/

PFAM
• The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
• Proteins are generally composed of one or more functional regions, commonly termed domains.
• Different combinations of domains give rise to the diverse range of proteins found in nature.
• The identification of domains that occur within proteins can therefore provide insights into their function.
• The construction and use of Pfam is tightly tied to the HMMER software package.
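Returning briefly to the BL line of a block entry: the normalization rule stated there (divide a raw search score by the 99.5th percentile calibration score n1, then multiply by 1000) can be sketched in a few lines of Python. This is a minimal illustration only. The function names and the sample BL line with its numbers are hypothetical, and the parser assumes the field layout of the template shown in the FORMAT OF A BLOCK slide; it is not code from the Blocks software.

```python
def parse_bl_line(line):
    """Parse a BL line laid out like the template:
    BL xxx motif; width=w; seqs=s; 99.5%=n1; strength=n2
    Returns a dict with hypothetical key names chosen for this sketch."""
    if not line.startswith("BL"):
        raise ValueError("not a BL line")
    # Drop the 'BL' tag, then split the remaining text on semicolons.
    parts = [p.strip() for p in line[2:].split(";") if p.strip()]
    motif = parts[0].split()[0]          # 'xxx' from 'xxx motif'
    values = {}
    for field in parts[1:]:
        key, _, value = field.partition("=")
        values[key.strip()] = int(value)
    return {
        "motif": motif,
        "width": values["width"],
        "seqs": values["seqs"],
        "calibration": values["99.5%"],  # n1 in the template
        "strength": values["strength"],  # n2 in the template
    }


def normalize_score(raw_score, calibration):
    """Normalize a raw search score as described for n1:
    divide by the calibration score and multiply by 1000."""
    return raw_score / calibration * 1000


# Hypothetical BL line; the numbers are made up for illustration.
example = "BL ADG motif; width=21; seqs=8; 99.5%=800; strength=1200"
block = parse_bl_line(example)
print(block["width"], block["calibration"])
print(normalize_score(1200, block["calibration"]))
```

Under this rule a raw score equal to n1 normalizes to exactly 1000, so normalized scores above 1000 exceed what 99.5% of true negative sequences achieve.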
PFAM
Composed of two sets of families:
– Pfam-A: manually curated part containing over 8296 protein families
– Pfam-B: automatically generated supplement containing a large number of small families taken from the PRODOM database that do not overlap with Pfam-A (lower quality)
• Pfam also generates higher-level groupings of related families, known as clans.
• A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

PFAM
Each family has the following data:
• A seed alignment, which is a hand-edited multiple alignment representing the family.
• Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is an ls mode (global) model, the other an fs mode (local) model.
• A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
• Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data that records how the family was constructed.

PFAM SEARCHES

PFAM RESULTS
• The data and additional features are accessible via the four websites:
• http://www.sanger.ac.uk/Software/Pfam/
• http://pfam.wustl.edu
• http://pfam.jouy.inra.fr/
• http://Pfam.cgb.ki.se/

EXERCISE 1 – TEXT SEARCH
1. Go to EXPASY. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)" and then search for human cochlin. Notice that there is a wealth of information about this protein. Furthermore, there are many links to sequence analysis tools (some of which you will learn later) and some other nice features. Note that this is merely a graphical display of the original UniProtKB/SwissProt database entry (which is in text).
2. Try to answer all of the questions below.
   1. Which year was the NMR structure of the LCCL domain determined?
   2. Where is the protein expressed?
   3. Which diseases are associated with the protein?

EXERCISE 2 – BLAST SEARCH
1. Go to EXPASY. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)" and then "BLAST".
2. Copy the following human amino acid sequence.
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDK
RSLPALTNIIKILRHDIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRH
GQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDF
LGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGD
SIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQRIEVL
DNTQQLKILADSINSEIGILCSALQKIK
3. Paste the sequence into the query sequence window and adjust the options as necessary. You won't need to specify advanced options, but you should choose a program and database. For simplicity, use e.g. the UniProtKB database.
4. Run the search and identify the protein. Use the link provided to see the UniProtKB/SWISS-PROT report.
5. Now, try to answer all of the questions below.
   1. What is the SWISS-PROT primary accession number?
   2. What is the common name of the protein?
   3. What is the gene called?
   4. Which year was the crystal structure of the catalytic domain determined? Name the first author.
   5. Does the enzyme require a co-factor to function? If so, what?
   6. Name the most common disease that arises as a result of deficiency of this enzyme.
   7. How many amino acid residues are there in the protein?
   8. What is the molecular weight of the protein?

EXERCISE 3 – DOMAIN SEARCH
1. Go to the PROSITE site.
2. Under "Tools for PROSITE" choose ScanProsite.
3. Paste the sequence below into the box and tick the option "Exclude patterns with a high probability of occurrence" (finding very common patterns will not tell you much about your protein).
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE
ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV
HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL
KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP
FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG
ASSARRVRKLREVMHKKTCDVLKEFLGLH
4. Start the scan. Which motifs are found?

EXERCISE 4 – DOMAIN SEARCH
1. Go to the Pfam site.
2. Click "Search by protein name or sequence".
3. Paste the sequence below into the box and choose "Both Global and Fragment Pfam search".
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE
ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV
HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL
KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP
FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG
ASSARRVRKLREVMHKKTCDVLKEFLGLH
4. Search Pfam.
   1. Which domains are found?
   2. What may be the function of this protein?

STRUCTURE CLASSIFICATION DATABASE

INTRODUCTION
• Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin.
• A knowledge of these relationships is crucial to our understanding of the evolution of proteins and of development.
• It will also play an important role in the analysis of the sequence data that is being produced by worldwide genome projects.
STRUCTURE CLASSIFICATION DATABASE
• A database in which protein structures are classified according to their geometrical and evolutionary similarities. Structures tend to be classified at the level of their individual domains, but multidomain structures can also be classified into evolutionary families and superfamilies.
• Examples of such databases:
• SCOP (Structural Classification of Proteins)
• CATH (Class, Architecture, Topology, Homology)

SCOP (STRUCTURAL CLASSIFICATION OF PROTEINS)
• The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences.
• The SCOP database is maintained at the MRC Laboratory of Molecular Biology and Centre for Protein Engineering.
• It aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB).
• In addition, the hypertext pages offer a panoply of representations of proteins, including links to PDB entries, sequences, references, images and interactive display systems.
• Existing automatic sequence and structure comparison tools cannot identify all structural and evolutionary relationships between proteins.
• The SCOP classification of proteins has been constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality.
• The job is made more challenging, and theoretically daunting, by the fact that the entities being organized are not homogeneous: sometimes it makes more sense to organize by individual domains, and other times by whole multi-domain proteins.

CLASSIFICATION
• Proteins are classified in a hierarchical manner to reflect both structural and evolutionary relatedness.
• Within the hierarchy there are many levels, but principally these describe:
• family
• superfamily
• fold
The levels of SCOP are as follows.
• Class: types of folds, e.g., beta sheets.
• Fold: the different shapes of domains within a class.
• Superfamily: the domains in a fold are grouped into superfamilies, which have at least a distant common ancestor.
• Family: the domains in a superfamily are grouped into families, which have a more recent common ancestor.
• Protein domain: the domains in families are grouped into protein domains, which are essentially the same protein.
• Species: the domains in "protein domains" are grouped according to species.
• Domain: part of a protein. For simple proteins, it can be the entire protein.
• The exact positions of the boundaries between these levels are to some degree subjective, but the higher levels generally reflect the clearest structural similarities.

FAMILY (CLEAR EVOLUTIONARY RELATIONSHIP)
• Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater.
• However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity;
• for example, many globins form a family though some members have sequence identities of only 15%.

SUPERFAMILY: PROBABLE COMMON EVOLUTIONARY ORIGIN
• Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
• For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
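The 30% rule of thumb for family membership can be illustrated with a short Python sketch that computes pairwise residue identity from two already-aligned sequences. This is an assumption-laden illustration and not SCOP's actual procedure: the function names are hypothetical, identity is taken here over aligned columns where neither sequence has a gap, and the sequences are made up. As the globin example shows, identity below the threshold does not rule out family membership, which is why SCOP relies on manual inspection of structure and function as well.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the aligned columns
    in which neither sequence has a gap (an assumed definition)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def meets_family_rule_of_thumb(identity, threshold=30.0):
    """True if identity alone satisfies the '30% and greater' guideline.
    A False result is not conclusive: structure and function may still
    place two proteins in the same family."""
    return identity >= threshold


# Made-up aligned sequences for illustration only.
seq_a = "ACDEFGHIKL"
seq_b = "ACDQFGHVKL"
identity = percent_identity(seq_a, seq_b)
print(identity, meets_family_rule_of_thumb(identity))
```

Keeping the threshold as a parameter makes it easy to see how the classification of a borderline pair changes when the cutoff is moved, which mirrors the text's point that the level boundaries are to some degree subjective.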
FOLD: MAJOR STRUCTURAL SIMILARITY
• Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections, whether or not they have a common evolutionary origin.
• In some cases the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.

CATH
• The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank, maintained at UCL.
• Only crystal structures solved to a resolution better than 4.0 angstroms are considered, together with NMR structures.
• All non-proteins, models, and structures with greater than 30% "C-alpha only" are excluded from CATH.
• This filtering of the PDB is performed using the SIFT protocol.
• Protein structures are classified using a combination of automated and manual procedures.

FIVE LEVELS IN THE HIERARCHY
Class, C-level
• Class is determined according to the secondary structure composition and packing within the structure.
• Three major classes are recognised:
• mainly-alpha
• mainly-beta
• alpha-beta, which includes both alternating alpha/beta structures and alpha+beta structures
• A fourth class is also identified, which contains protein domains that have low secondary structure content.

Architecture, A-level
• This describes the overall shape of the domain structure as determined by the orientations of the secondary structures, but ignores the connectivity between the secondary structures.
• It is currently assigned manually using a simple description of the secondary structure arrangement, e.g. barrel or 3-layer sandwich.

Topology (Fold family), T-level
• It gives a description that encompasses both the overall shape and the connectivity of secondary structures.
• This is achieved by means of structure comparison algorithms that use empirically derived parameters to cluster the domains.
• Structures in which at least 60% of the larger protein matches the smaller are assigned to the same topology.

HOMOLOGOUS SUPERFAMILY, H-LEVEL
• This level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous. Similarities are identified either by high sequence identity or by structure comparison using SSAP (Sequential Structure Alignment Program).
• Structures are clustered into the same homologous superfamily if they satisfy one of the following criteria:
• Sequence identity >= 35%, overlap >= 60% of larger structure equivalent to smaller.
• SSAP score >= 80.0, sequence identity >= 20%, 60% of larger structure equivalent to smaller.
• SSAP score >= 70.0, 60% of larger structure equivalent to smaller, and domains which have related functions, as informed by the literature and the Pfam protein family database (Bateman et al., 2004).
• Significant similarity from HMM-sequence searches and HMM-HMM comparisons using SAM (Hughey & Krogh, 1996), HMMER (http://hmmer.wustl.edu) and PRC (http://supfam.org/PRC).

SEQUENCE FAMILY LEVELS
• At this level domains have sequence identities >35% (60% of larger structure equivalent to smaller), which indicates similar structures and functions.
• Domains within each H-level are subclustered into sequence families using multi-linkage clustering at the following levels:
• D (distinct structures)

PDBSUM
• A major resource providing key information on each macromolecular structure deposited at the Protein Data Bank.
• It began as a web-based compendium maintained at UCL.
• It has since moved to EMBL.
• It includes:
• images of the structure
• annotated plots of each protein chain's secondary structure
• detailed structural analyses generated by the PROMOTIF program
http://www.ebi.ac.uk/pdbsum/

PROMOTIF
• PROMOTIF provides details of the location and types of structural motifs in proteins of known structure by analysis of Brookhaven format coordinate files. The current version of the program analyses the following structural features:
• Secondary structure
• Beta strands
• Disulphide bridges
• Beta bulges
• Beta turns
• Beta hairpins
• Gamma turns
• Beta alpha beta units
• Helical geometry
• Psi loops
• Helical interactions
• Beta sheet topology
• Main chain hydrogen bonding patterns

COMPOSITE PROTEIN SEQUENCE DATABASE
• A composite database is one that amalgamates a variety of different primary sources.
• Searching a single composite database is easier and much more efficient than searching each source separately.
• Examples: NRDB, OWL, MIPSX, SWISS-PROT+TrEMBL.

COMPOSITE PROTEIN SEQUENCE DB
Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures.

NRDB             OWL          MIPSX                 SPTrEMBL *
PDB              SWISS-PROT   PIR                   SWISS-PROT
SWISS-PROT       PIR          MIPS                  SPTrEMBL
PIR              GenBank      NRL-3D                TrEMBLnew
GenPept          NRL-3D       SWISS-PROT
SP update                     EMBL translation
GenPept update                GenBank translation
                              Kabat (immuno)
                              PseqIP

Sources in each column are listed in order of priority under that database's redundancy criteria.
* Also called SWall at EBI
SWIR: SPTrEMBL + Wormpep

METABOLIC DATABASES
• Metabolic databases play a crucial role in biochemistry by providing comprehensive resources that facilitate the study and analysis of metabolic pathways.
• They are essential tools that give researchers access to information about metabolic pathways, gene products, and interactions.
• These databases help scientists understand complex biochemical processes and can aid in the identification of targets for drug development.
• The growth of metabolic databases has been driven by advancements in genomic and proteomic technologies, which have generated vast amounts of data.
 There is an ongoing challenge in integrating data from different sources to create more comprehensive biochemical models.
EXAMPLES OF METABOLIC DATABASES
 Metabolic databases typically describe collections of enzymes, reactions, and biochemical pathways.
 They are also often coupled with software for querying and visualizing metabolic information.
ECOCYC DATABASE
 EcoCyc is a comprehensive database focused on the metabolism and molecular biology of the model organism Escherichia coli (E. coli).
 The EcoCyc database describes the enzymes that carry out each bioreaction, including their cofactors, activators and inhibitors, and the subunit structures of the enzymes.
 Most entries in EcoCyc contain several links to the primary biomedical literature, using online Medline entries when possible.
 EcoCyc is also linked to databases such as GenBank, SWISS-PROT and PDB.
 EcoCyc can be queried through the WWW, and can be downloaded as a binary program for Sun workstations.
 The EcoCyc data can be downloaded as structured files via Internet FTP.
 It integrates information on metabolic pathways, enzymes, genes, and their interactions.
 The database is designed to provide a detailed view of the E. coli genome, including annotations and functional data, and so serves as a resource for research on metabolic functions, regulatory networks, and gene expression.
 EcoCyc facilitates the understanding of metabolic processes in E. coli and can be instrumental in studies involving biochemistry, genetics, and systems biology.
THE KEGG DATABASE
 KEGG contains 85 pathways derived from the Boehringer Mannheim wall chart and from a collection of the Japanese Biochemical Society.
 These pathway collections are consensus views of biochemistry that are not specific to particular organisms, but KEGG has been used to predict the pathways of several organisms from their genomes.
 The database describes the reactions within each pathway, and the chemical compounds within each reaction, including compound structures.
 KEGG reactions are linked to enzymes in SWISS-PROT, PIR, and PDB.
 The KEGG graphical interface consists of manually drawn diagrams of pathways and of the complete metabolic network.
 KEGG can be queried via the WWW, and is available on CD-ROM for Macintosh.
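The KEGG organisation described above (pathways contain reactions, reactions contain compounds, and reactions link out to enzyme entries in other databases) can be pictured as nested records. The following is only a minimal sketch under stated assumptions: the class names `Pathway` and `Reaction` and the helper `reactions_using` are hypothetical and are not part of KEGG's actual schema or any KEGG interface, and the accession strings are placeholders rather than real identifiers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Reaction:
    """One reaction inside a pathway (illustrative model, not KEGG's schema)."""
    name: str
    substrates: List[str]
    products: List[str]
    # Cross-references to enzyme entries, keyed by database name.
    # The values used below are placeholders, not real accession numbers.
    enzyme_links: Dict[str, str] = field(default_factory=dict)


@dataclass
class Pathway:
    """A pathway is a named collection of reactions."""
    name: str
    reactions: List[Reaction] = field(default_factory=list)

    def compounds(self) -> Set[str]:
        """Return every compound that appears in any reaction of the pathway."""
        found: Set[str] = set()
        for reaction in self.reactions:
            found.update(reaction.substrates)
            found.update(reaction.products)
        return found


def reactions_using(pathway: Pathway, compound: str) -> List[str]:
    """Names of the reactions in which the compound is a substrate or product."""
    return [
        r.name
        for r in pathway.reactions
        if compound in r.substrates or compound in r.products
    ]


# First step of glycolysis: glucose is phosphorylated by hexokinase using ATP.
hexokinase_step = Reaction(
    name="hexokinase",
    substrates=["D-glucose", "ATP"],
    products=["D-glucose 6-phosphate", "ADP"],
    enzyme_links={
        "SWISS-PROT": "<accession>",  # placeholder
        "PIR": "<accession>",         # placeholder
        "PDB": "<entry id>",          # placeholder
    },
)
glycolysis = Pathway(name="glycolysis", reactions=[hexokinase_step])

print(sorted(glycolysis.compounds()))
print(reactions_using(glycolysis, "ATP"))
```

Keeping the cross-references as a small dictionary per reaction mirrors the idea that one reaction can point to the same enzyme in several databases at once.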

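Looking back at the CATH homologous superfamily (H-level) criteria listed earlier in this unit, the four alternative conditions can be written as a simple decision rule. This is a rough sketch, not CATH's own software: the names `DomainComparison` and `same_homologous_superfamily` are hypothetical, the "60% of larger structure equivalent to smaller" condition is assumed to mean "at least 60%", and the functional-relatedness and HMM results are assumed to be supplied as yes/no inputs by a curator or an external search.

```python
from dataclasses import dataclass


@dataclass
class DomainComparison:
    """Pairwise comparison of two domains (illustrative fields only)."""
    sequence_identity: float        # percent sequence identity
    overlap: float                  # percent of larger structure equivalent to smaller
    ssap_score: float               # SSAP structural similarity score
    related_function: bool = False  # judged from literature / Pfam (supplied by hand)
    hmm_significant: bool = False   # significant HMM-sequence or HMM-HMM hit (supplied)


def same_homologous_superfamily(c: DomainComparison) -> bool:
    """Return True if any one of the four H-level criteria is satisfied."""
    enough_overlap = c.overlap >= 60.0  # assumption: "60%" read as "at least 60%"

    # Criterion 1: high sequence identity with sufficient overlap.
    if c.sequence_identity >= 35.0 and enough_overlap:
        return True
    # Criterion 2: strong structural similarity plus moderate sequence identity.
    if c.ssap_score >= 80.0 and c.sequence_identity >= 20.0 and enough_overlap:
        return True
    # Criterion 3: moderate structural similarity plus related function.
    if c.ssap_score >= 70.0 and enough_overlap and c.related_function:
        return True
    # Criterion 4: significant similarity from HMM-based searches.
    return c.hmm_significant


# A pair with low sequence identity but strong structural similarity.
pair = DomainComparison(sequence_identity=25.0, overlap=65.0, ssap_score=85.0)
print(same_homologous_superfamily(pair))
```

Because the criteria are alternatives, the function returns as soon as one of them holds; a pair with an SSAP score of 75 and 65% overlap, for example, qualifies only if related function is also recorded.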