Unit-2 PDF - Database Resources in Molecular Biology

UNIT 2 - DATABASE RESOURCES IN MOLECULAR BIOLOGY

Introduction to databases and biological databases - Uses of biological databases - Primary sequence databases - Nucleotide - Protein sequence database - Primary structure databases - PDB file format - FASTA, GCG, VFF etc. - High-throughput sequencing databases - Secondary databases - Secondary sequence databases - Secondary structure databases - SCOP - CATH - Composite protein databases - Metabolic databases - SNP databases - Whole genome - Mendelian disease databases - Chemical structure databases - Bibliographic databases

BIOLOGICAL DATABASES

WHAT IS A DATABASE?

A database is a collection of data that is:
⚫ structured
⚫ searchable (indexed), like a table of contents
⚫ updated periodically (releases), like a new edition
⚫ cross-referenced (hyperlinks), with links to other databases

A database also includes the associated tools (software) needed to access it, update it, and insert or delete information.

DATABASES

A database can be viewed as four layers, listed here from the bottom up with everyday and bioinformatics examples:
⚫ Data: a book or the title of a book; a GenBank flat file, a PDB file, an interaction record
⚫ Storage system: boxes, bookshelves; Oracle, MySQL, PC binary files, Unix text files
⚫ Query system: a list you look at, a catalogue; indexed files, SQL, grep
⚫ Information system: the UBC library; Google, Entrez, SRS

TYPES OF DATABASE

There are many different database types, depending on both:
⚫ the nature of the information being stored (e.g. sequences or structures)
⚫ the manner of data storage (flat files or tables)

DATABASES: A SIMPLE EXAMPLE

"Introduction To Database" Teacher Database (ITDTdb) (flat file, 3 entries)

Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
http://expasy4.expasy.ch/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
http://expasy4.expasy.ch/people/Marie-Claude.Blatter-Garin.html
//

Easy to manage: all the entries are visible at the same time.

DATABASES: A SIMPLE EXAMPLE (CONT.)

Relational database ("table file"):

Teacher     Accession number   Education
Amos        1                  Biochemistry
Laurent     2                  Biochemistry
M-Claude    3                  Biochemistry

Course      Date               Involved teachers
DEA         Oct-nov-dec 2000   1,3
EMBnet      Sept 2000          2,3

Easier to manage; the output can be chosen.

BIOLOGICAL DATABASE

A biological database is a collection of data organized so that its contents can easily be accessed, managed, and updated. Preparing a database involves two activities:
⚫ collecting the data in a form that can be easily accessed
⚫ making it available to a multi-user system (always available to the user)

IMPORTANCE OF DATABASES

A database is any collection of related data. A computerized archive stores and organizes data in such a way that information can be retrieved easily. More precisely, a database is a collection of interrelated data stored together, without harmful or unnecessary redundancy (duplicate data), to serve multiple applications. Data are retrieved by running ("firing") a query.

A database system is an integrated collection of related files together with the details of their definition, interpretation, manipulation and maintenance. A database system is built around its data and is run by software called a DBMS (Database Management System).
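The relational teacher/course example given earlier in this section can be sketched with Python's built-in sqlite3 module, which here plays the role of the DBMS. This is only an illustration under stated assumptions: the table and column names (teacher, course, teaches) and the helper functions are hypothetical, and the "Involved teachers" column ("1,3", "2,3") is split into one row per course/teacher pair, which is one common relational design and not something the slides prescribe. The data follow the relational tables as printed.

```python
import sqlite3


def build_teacher_db():
    """Create an in-memory copy of the teacher/course tables (names are illustrative)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE teacher (
            accession INTEGER PRIMARY KEY,
            name      TEXT,
            education TEXT
        );
        CREATE TABLE course (
            name TEXT PRIMARY KEY,
            date TEXT
        );
        -- Assumption: "Involved teachers" is stored as one row per pair.
        CREATE TABLE teaches (
            course    TEXT    REFERENCES course(name),
            accession INTEGER REFERENCES teacher(accession)
        );
    """)
    con.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
        (1, "Amos", "Biochemistry"),
        (2, "Laurent", "Biochemistry"),
        (3, "M-Claude", "Biochemistry"),
    ])
    con.executemany("INSERT INTO course VALUES (?, ?)", [
        ("DEA", "Oct-nov-dec 2000"),
        ("EMBnet", "Sept 2000"),
    ])
    con.executemany("INSERT INTO teaches VALUES (?, ?)", [
        ("DEA", 1), ("DEA", 3), ("EMBnet", 2), ("EMBnet", 3),
    ])
    con.commit()
    return con


def teachers_for(con, course):
    """'Fire a query': return the names of the teachers involved in a course."""
    rows = con.execute(
        "SELECT t.name FROM teacher AS t "
        "JOIN teaches AS x ON x.accession = t.accession "
        "WHERE x.course = ? ORDER BY t.accession",
        (course,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    con = build_teacher_db()
    print(teachers_for(con, "DEA"))
    print(teachers_for(con, "EMBnet"))
```

The query chooses which columns to return, which is the "choice of the output" that the flat file cannot offer without custom parsing.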
A database system protects the data from unauthorized access. A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. Beyond simple file management, a DBMS provides several functions:
⚫ allows concurrency
⚫ controls security
⚫ maintains data integrity
⚫ provides for backup and recovery
⚫ controls redundancy
⚫ allows data independence
⚫ provides a nonprocedural query language
⚫ performs automatic query optimization

Relational database: a database that treats all of its data as a collection of relations.

WHY BIOLOGICAL DATABASES?

⚫ Explosive growth in biological data.
⚫ Data (sequences, 3D structures, 2D gel analysis, MS analysis...) are no longer published in a conventional manner, but submitted directly to databases.
⚫ Databases are essential tools for biological research, as classical publications used to be.

TYPE OF DATA

⚫ nucleotide sequences
⚫ protein sequences
⚫ protein sequence patterns or motifs
⚫ macromolecular 3D structures
⚫ gene expression data
⚫ metabolic pathways
⚫ proteomics data

BIOINFORMATICS DATABASES ARE CATEGORIZED ON THE BASIS OF

⚫ Data type
⚫ Maintainer status
⚫ Technical design
⚫ Data source
⚫ Data access
⚫ Any other parameter

Type of data: proteomic database, structure database, etc.
Maintainer status: NCBI (National Center for Biotechnology Information), EBI (European Bioinformatics Institute)
Technical design: flat file, XML, relational model, object-oriented/object-relational model
Data source: primary database, secondary database
Data access: various kinds of access status
- publicly available with no restrictions (NCBI, EBI)
- available with copyright
Others:
- Completeness: complete or incomplete entries in the database
- Annotation: not annotated, or annotated (entries include an analysis of the data)
- Curation: when the annotation is established, the database is known as curated

Databases in general can be classified into:
⚫ primary,
⚫ secondary and
⚫ composite databases.

PRIMARY SEQUENCE DATABASES
A primary database contains the sequence or structure information alone. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide (genome) sequences, and the Protein Data Bank for protein structures.

NUCLEIC ACID    PROTEIN
EMBL            PIR
GenBank         MIPS
DDBJ            SWISS-PROT
                TrEMBL
                NRL-3D

SEQUENCE DATABASES

Primary DNA
⚫ DDBJ/EMBL/GenBank
Primary protein
⚫ GenPept/TrEMBL
Curated DB
⚫ RefSeq (genomic, mRNA and protein)
⚫ Swiss-Prot & PIR -> UniProt (protein)

NUCLEIC ACID SEQUENCE DATABASES

The principal DNA sequence databases are DDBJ, EMBL and GenBank, which exchange data on a daily basis to ensure comprehensive coverage at each site.

[Figure: the three collaborating databases. GenBank (NCBI, NIH; queried through Entrez), EMBL (EBI; queried through SRS) and DDBJ (CIB, NIG; queried through getentry) each receive submissions and updates and exchange them with one another.]

WHAT IS GENBANK?

GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Benson et al., 2004, Nucleic Acids Res. 32:D23-D26

GENBANK FLAT FILE (GBFF)

A GenBank record has three parts: the header (title, taxonomy, citation), the features (including the amino acid sequence), and the DNA sequence.

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//

PROTEIN SEQUENCE DATABASES

⚫ PIR - Protein Information Resource
⚫ MIPS - Munich Information Center for Protein Sequences
⚫ SWISS-PROT
⚫ TrEMBL - Translated EMBL
⚫ NRL-3D

PIR (THE PROTEIN INFORMATION RESOURCE)

The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies.
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF). Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965 to 1978 under the editorship of Margaret O. Dayhoff. In 2002 PIR, along with its international partners EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics), was awarded a grant from NIH to create UniProt, a single worldwide database of protein sequence and function, by unifying the PIR-PSD, Swiss-Prot, and TrEMBL databases.

In its current form PIR is split into four sections:
⚫ PIR1 contains fully classified and annotated entries.
⚫ PIR2 includes preliminary entries, which have not been thoroughly reviewed and may contain redundancy.
⚫ PIR3 contains unverified entries, which have not been thoroughly reviewed.
⚫ PIR4 contains entries that fall into four categories:
1. Conceptual translations of artefactual sequences.
2. Conceptual translations of sequences that are not transcribed or translated.
3. Protein sequences or conceptual translations that are extensively genetically engineered.
4. Sequences that are not genetically encoded and not produced on ribosomes.

SWISS-PROT

SWISS-PROT is a protein sequence database that was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL. After 1994 the collaboration moved to EMBL's UK outstation, the EBI. In April 1998 it moved to the Swiss Institute of Bioinformatics (SIB).

SWISS-PROT annotation covers:
⚫ Function of the protein
⚫ Post-translational modifications
⚫ Domains and sites
⚫ Secondary structure
⚫ Quaternary structure
⚫ Similarities to other proteins
⚫ Diseases associated with deficiencies in the protein
⚫ Sequence conflicts, variants, etc.

An example SWISS-PROT entry follows. Its main parts are the taxonomy (OS, OC), the citations (RN to RL), the comments and disclaimer (CC), the database cross-references (DR), and the sequence (SQ).

ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430.
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685.
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814.
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine gamma-
RT   synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532.
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817.
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to [email protected]).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -.
DR   EMBL; L04459; AAA85217.1; -.
DR   EMBL; D14135; BAA03190.1; -.
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3.
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//

UNIPROT

UniProt is a protein sequence database that results from merging SWISS-PROT and PIR. It is intended to be the annotated, curated protein sequence database. Data in UniProt are primarily derived from coding sequence annotations in EMBL/GenBank/DDBJ nucleic acid sequence data. UniProt is a flat-file database, like EMBL and GenBank; its flat-file format is Swiss-Prot-like (EMBL-like).

UniProt comprises four components, each optimised for different uses:
1) The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. It consists of two sections: UniProtKB/Swiss-Prot, which is manually annotated and reviewed, and UniProtKB/TrEMBL, which is automatically annotated and not reviewed.
2) The UniProt Reference Clusters (UniRef) databases provide clustered sets of sequences from UniProtKB and selected UniProt Archive records, to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences.
3) The UniProt Archive (UniParc) is a comprehensive repository used to keep track of sequences and their identifiers.
4) The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.

TREMBL

TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated protein sequence database supplementing the SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in SWISS-PROT. TrEMBL can be considered a preliminary section of SWISS-PROT. SWISS-PROT accession numbers have already been assigned to all TrEMBL entries that should eventually be upgraded to standard SWISS-PROT quality.

TrEMBL has two main sections:
⚫ SP-TrEMBL contains entries that will eventually be incorporated into SWISS-PROT but have not yet been manually annotated.
⚫ REM-TrEMBL contains sequences that are not destined to be included in SWISS-PROT (e.g. peptides shorter than 8 aa, and synthetic sequences).

NRL-3D

The NRL-3D database is produced by PIR from sequences extracted from the Brookhaven Protein Data Bank (PDB). Titles, biological sources, bibliographic references and MEDLINE references are included, together with secondary structure, active site, binding site and modified site annotations and details of experimental methods. It is a valuable resource because it makes the sequence information in the PDB available for both keyword interrogation and similarity searches.

There are also many specialized protein databases for specific families or groups of proteins. Examples: YPD (yeast proteins), AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), etc.
PDB

Protein Data Bank
⚫ Protein and nucleic acid 3D structures
⚫ Sequence present

The main parts of a PDB file are HEADER, COMPND, SOURCE, AUTHOR, DATE, JRNL, REMARK, SEQRES and the ATOM coordinates. The start of entry 1DGC is shown below; each line ends with the entry code and a line number.

HEADER    LEUCINE ZIPPER   15-JUL-93   1DGC   1DGC   2
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC   1DGC   3
COMPND   2 ATF/CREB SITE DNA   1DGC   4
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC   1DGC   5
AUTHOR    T.J.RICHMOND   1DGC   6
REVDAT   1   22-JUN-94 1DGC    0   1DGC   7
JRNL        AUTH   P.KONIG,T.J.RICHMOND   1DGC   8
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO   1DGC   9
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA   1DGC  10
JRNL        TITL 3 FLEXIBILITY   1DGC  11
JRNL        REF    J.MOL.BIOL.   V. 233   139 1993   1DGC  12
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836   0070   1DGC  13
REMARK   1   1DGC  14
REMARK   2   1DGC  15
REMARK   2 RESOLUTION. 3.0 ANGSTROMS.   1DGC  16
REMARK   3   1DGC  17
REMARK   3 REFINEMENT.   1DGC  18
REMARK   3   PROGRAM   X-PLOR   1DGC  19
REMARK   3   AUTHORS   BRUNGER   1DGC  20
REMARK   3   R VALUE   0.216   1DGC  21
REMARK   3   RMSD BOND DISTANCES   0.020 ANGSTROMS   1DGC  22
REMARK   3   RMSD BOND ANGLES   3.86 DEGREES   1DGC  23
REMARK   3   1DGC  24
REMARK   3   NUMBER OF REFLECTIONS   3296   1DGC  25
REMARK   3   RESOLUTION RANGE   10.0 - 3.0 ANGSTROMS   1DGC  26
REMARK   3   DATA CUTOFF   3.0 SIGMA(F)   1DGC  27
REMARK   3   PERCENT COMPLETION   98.2   1DGC  28
REMARK   3   1DGC  29
REMARK   3   NUMBER OF PROTEIN ATOMS   456   1DGC  30
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS   386   1DGC  31
REMARK   4   1DGC  32
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO   1DGC  33
REMARK   4 ACID BIOSYNTHETIC ENZYMES.   1DGC  34
REMARK   5   1DGC  35
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE   1DGC  36
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.   1DGC  37
REMARK   6   1DGC  38
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.   1DGC  39
REMARK   7   1DGC  40
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -   1DGC  41
REMARK   7 226 ARE NOT WELL ORDERED.   1DGC  42
REMARK   8   1DGC  43
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:   1DGC  44
REMARK   8  5'   T  G  G  A  G  A  T  G  A  C  G  T  C  A  T  C  T  C  C   1DGC  45
REMARK   8     -10 -9 -8 -7 -6 -5 -4 -3 -2 -1  1  2  3  4  5  6  7  8  9   1DGC  46
REMARK   9   1DGC  47
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA   1DGC  48
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.   1DGC  49
REMARK  10   1DGC  50
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF   1DGC  51
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD   1DGC  52
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY   1DGC  53
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND   1DGC  54
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:   1DGC  55
REMARK  10   1DGC  56
REMARK  10       0 -1  0     X     117.32     X SYMM   1DGC  57
REMARK  10      -1  0  0     Y  +  117.32  =  Y SYMM   1DGC  58
REMARK  10       0  0 -1     Z      43.33     Z SYMM   1DGC  59
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG   1DGC  60
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG   1DGC  61
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU   1DGC  62
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL   1DGC  63
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG   1DGC  64
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C   1DGC  65
SEQRES   2 B   19    A   T   C   T   C   C   1DGC  66

[Figure: the 1DGC structure displayed in RasMol]

SEQUENCE FILE FORMATS

⚫ Plain sequence format
⚫ EMBL format
⚫ FASTA file format
⚫ FASTQ file format
⚫ GCG format (Genetics Computer Group)
⚫ GCG-RSF (rich sequence format)
⚫ IG format

PLAIN SEQUENCE FORMAT

A sequence in plain format may contain only IUPAC characters and spaces (no numbers!). Note: a file in plain sequence format may contain only one sequence, while most other formats accept several sequences in one file. An example sequence in plain format is:

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCC
AACCTCCCATCCGTGTCTATTGTACCGTTGCTTCGGCGGGCCC
GCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGG
CCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGC
GTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACA
ATGGATCTCTTGGTTCCGGC

EMBL FORMAT

A sequence file in EMBL format can contain several sequences. Each sequence entry starts with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ", and the end of the sequence is marked by two slashes ("//"). An example sequence in EMBL format is:

ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//

FASTA FILE FORMAT

A sequence file in FASTA format can contain several sequences. Each sequence begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column. An example sequence in FASTA format is:

>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC

FASTA files, denoted by the .fas extension, are used by most large curated databases. Specific extensions exist for nucleic acids (.fna), nucleotide coding regions (.ffn), amino acids (.faa), and non-coding RNAs (.frn). A FASTA file can contain one or many sequences. Tools like ClustalW can take a FASTA file with multiple sequences and generate an alignment. Converting between FASTA and any of the other formats discussed here can be done with programs like Seqret and MView. Other simple sequence file formats that you may encounter include GCG and IG.

FASTQ FILE FORMAT

The FASTQ format is used with next-generation sequencing instruments and builds on the simplicity of the FASTA format. Information about the quality of the sequencing reads and base calls is a defining component of the format (the "Q" in "FASTQ" stands for quality). Each FASTQ record consists of four lines. The first line is a sequence identifier and description; it begins with an "@" symbol followed by information about the sequence.
There is a standardized format with Illumina sequencers that includes the unique instrument name, the flowcell lane, etc. The second line contains the raw sequence data as with a FASTA file The third line includes the “+” symbol along with a repeated identifier The fourth line is the quality score for each base in the sequence on the second line and must be the same length The quality score on the fourth line is the Phred score (Q), which is the measure of the quality of the identification of the nucleobases generated by automated DNA sequencing and it is formatted as a single ASCII character. Q is calculated in different ways and ranges, depending on the platform used for sequencing, and is the probability that a specific base call in a raw sequence is incorrect. In its most straightforward calculation, which is used for Sanger sequencing: Q= -log10p; where p is the probability that the base call is incorrect. The larger the Q is, the higher the base call accuracy is. For example, ⚫ Q of 20 means that base call is incorrectly identified every 100 base pairs. ⚫ Q of 30 means that a base call is incorrectly identified every 1000 base pairs. FASTQ file formats typically have the file extension.fastq,.sanfastq, or.fq, though there is no standard. GROUP A sequence file in GCG format contains exactly one sequence, begins with annotation lines and the start of the sequence is marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package. An example sequence in GCG format is: ID AA03518 standard; DNA; FUN; 237 BP. XX AC U03518; XX DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S DE rRNA and 5.8S rRNA genes, partial sequence. XX SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other; AA03518 Length: 237 Check: 4514.. 
  1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
 61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc

GCG-RSF (RICH SEQUENCE FORMAT)
The new GCG-RSF can contain several sequences in one file. This format should only be used if the file was created with the GCG package.

IG FORMAT
A sequence file in IG format can contain several sequences. Each entry consists of several comment lines that must begin with a semicolon (";"), a line with the sequence name (which may not contain spaces), and the sequence itself, terminated with the character '1' for linear or '2' for circular sequences. An example sequence in IG format is:

; comment
; comment
U03518
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC1

HIGH THROUGHPUT SEQUENCING DATABASES
Sequence Read Archive
The SRA is NIH's archive of high-throughput sequencing data and is part of the International Nucleotide Sequence Database Collaboration (INSDC), which includes the NCBI Sequence Read Archive (SRA), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). Data submitted to any of the three organizations are shared among them.

SRA Mission
The SRA is a publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis. The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads.
BAM files (with the .bam file extension) are closely related to SAM files, which are tab-delimited text files used for storing sequence alignment data. The advantage of the BAM format over the SAM format is that it is a compressed binary version that is smaller in size and can be indexed. This makes BAM files ideal for the storage of sequence alignment information and the preferred input for the Integrative Genomics Viewer. Like most file formats used in bioinformatics, BAM files contain a header and a body. The header stores information about the sequences, with each header line preceded by an “@” symbol. The body contains information about how each sequence aligns with a specific reference sequence. Each alignment line includes 11 data fields, including Phred-scaled quality scores, a string called CIGAR that describes the alignment, and other metadata.

SECONDARY DATABASES
Secondary databases contain the results of analyses of the sequences in the primary sources.
⚫ They are either manually curated (e.g. PROSITE, Pfam, etc.) or automatically generated (e.g. ProDom, DOMO).
⚫ Some depend on the method used to detect whether a protein belongs to a particular domain/family (patterns, profiles, HMM).

SECONDARY DATABASE
Secondary db   Primary source       Information
PROSITE        SWISS-PROT           Patterns (regular expressions)
PROSITE        SWISS-PROT           Profiles (weighted matrices)
PRINTS         OWL and SWISS-PROT   Aligned motifs (fingerprints)
Pfam           SWISS-PROT           HMM (Hidden Markov Models)
BLOCKS         PROSITE/PRINTS       Aligned motifs
IDENTIFY       BLOCKS/PRINTS        Fuzzy regular expressions

PROSITE
▪ Created in 1988 (SIB). This is the first secondary db.
▪ Contains fully annotated functional domains, based on two methods: patterns and profiles.
▪ Helps to determine to which family of proteins a new sequence might belong, or which domain(s) or functional site(s) it may contain.

PATTERNS
Motifs are encoded as regular expressions, often simply referred to as patterns.
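To make the link between PROSITE patterns and regular expressions concrete, the following is a minimal sketch of a converter from PROSITE pattern syntax to a Python regular expression. The function name `prosite_to_regex` is hypothetical and is not part of any PROSITE tool, and only a subset of the syntax is assumed: single residues, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` / `>` as sequence-end anchors.

```python
import re

# One pattern element: a residue, x, [ABC] or {ABC}, with an optional (n) or (n,m) repeat.
_ELEMENT = re.compile(r"([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex (illustrative subset only)."""
    pattern = pattern.strip().rstrip(".")  # PROSITE patterns end with a period
    pieces = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue in ("x", "X"):
            core = "."  # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"  # any residue except these
        else:
            core = residue  # literal residue or [allowed set]
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(prefix + core + suffix)
    return "".join(pieces)


# The slide pattern C-Y-X2-[DG]-G-X-[ST], written with PROSITE's x(2) repeat notation.
slide_regex = prosite_to_regex("C-Y-x(2)-[DG]-G-x-[ST]")
# The signature pattern from the EPO_TPO entry (PS00817).
epo_regex = prosite_to_regex("P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.")
```

A pattern converted this way can be passed to `re.search` to scan a protein sequence. For instance, `slide_regex` matches the peptide CYEDGGIS, whereas a peptide with T instead of Y at the second position does not match, because the pattern requires Y there.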
CTDEGGIS
CYEDGGIS
CYEEGGIT
CYHGDGGS
CYRGDGNT
C-Y-X2-[DG]-G-X-[ST] (Regular expression)

▪ Entries are deposited in PROSITE in two distinct files:
▪ Patterns/profiles, with the lists of all matches in the parent version of SWISS-PROT
▪ Documentation
▪ The process used to derive patterns involves the construction of a multiple alignment and manual inspection to identify the conserved region.
▪ As of 30 August 2012, release 20.85 has 1,650 documentation entries, 1,308 patterns, 1,039 profiles, and 1,041 ProRules.

STRUCTURE OF PROSITE ENTRIES
Entries are deposited in PROSITE in two distinct files.
⚫ Patterns: the patterns, with lists of all the matches in the parent version of SWISS-PROT.
⚫ Documentation: provides details of the characterized family and, where known, a description of the biological role of the chosen motif and a supporting bibliography.

DETERMINING SIGNIFICANCE OF DATABASE MATCHES
⚫ True positive: a match to a sequence that is related.
⚫ True negative: an unrelated sequence that is correctly not matched.
⚫ False positive: a match to an unrelated sequence.
⚫ False negative: a related sequence that fails completely to be diagnosed.

PROSITE (PATTERN): EXAMPLE
ID   EPO_TPO; PATTERN.
AC   PS00817;
DT   OCT-1993 (CREATED); NOV-1995 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Erythropoietin / thrombopoeitin signature.
PA   P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=14(14); /POSITIVE=14(14); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=1;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,disulfide; /SITE=11,disulfide;
DR   P48617, EPO_BOVIN , T; P33707, EPO_CANFA , T; P33708, EPO_FELCA , T;
DR   P01588, EPO_HUMAN , T; P07865, EPO_MACFA , T; Q28513, EPO_MACMU , T;
DR   P07321, EPO_MOUSE , T; P49157, EPO_PIG , T; P29676, EPO_RAT , T;
DR   P33709, EPO_SHEEP , T; P42705, TPO_CANFA , T; P40225, TPO_HUMAN , T;
DR   P40226, TPO_MOUSE , T; P49745, TPO_RAT , T;
DR   P42706, TPO_PIG , P;
DO   PDOC00644;
//
In this entry the NR lines report the diagnostic performance of the pattern and the DR lines give the list of matches.

PROFILE
Variable regions between the conserved motifs also contain information.
A discriminator, termed a profile, is used to indicate where insertions and deletions (INDELs) are allowed, which types of residue are allowed at which positions, and where the more conserved regions are. Profiles (weight matrices) provide a sensitive means of detecting
⚫ distant sequence relationships
⚫ places where few residues are conserved

PROSITE (PROFILE): EXAMPLE
PROSITE: PS50097
ID   BTB; MATRIX.
AC   PS50097;
DT   DEC-1999 (CREATED); DEC-1999 (DATA UPDATE); DEC-1999 (INFO UPDATE).
DE   BTB domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=67;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=62;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=.9751; R2=.02068202; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=363; N_SCORE=8.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=267; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-2;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12;
MA   /M: SY='D'; M=-16,41,-28,53,15,-34,-11,-1,-33,0,-27,-25,21,-11,0,-8,2,-6,-26,-38,-19,7;
MA   /M: SY='V'; M=2,-23,-8,-28,-24,-1,-24,-25,16,-20,7,6,-20,-25,-23,-20,-10,-4,24,-23,-9,-24;
MA   /M: SY='T'; M=-2,-13,-18,-19,-13,-7,-24,-19,6,-8,-2,1,-11,-17,-11,-10,-1,10,10,-24,-6,-13;
MA   /M: SY='L'; M=-11,-30,-22,-33,-24,15,-32,-23,25,-29,35,17,-26,-27,-23,-22,-24,-9,16,-17,3,-24;
MA   /M: SY='V'; M=0,-11,-18,-13,-10,-12,-20,-13,1,-6,-4,2,-10,-19,-6,-7,-4,-2,8,-25,-9,-9;
MA   /M: SY='V'; M=1,-25,-3,-29,-25,-2,-26,-26,17,-22,10,7,-23,-25,-23,-22,-11,-3,24,-27,-10,-25;
MA   /M: SY='D'; M=-6,7,-26,8,7,-25,6,-7,-27,0,-23,-17,8,-13,0,-3,3,-6,-23,-27,-17,3;
MA   /I: I=-5; MI=0; IM=0; DM=-15; MD=-15;
MA   /M: SY='G'; M=-6,8,-27,8,-3,-27,22,-7,-30,-8,-26,-19,10,-14,-8,-9,2,-9,-24,-28,-21,-6;
….

PROSITE (PROFILE): EXAMPLE (CONT.)
…… MA /M: SY='T'; M=-3,3,-16,1,-3,-18,-12,-9,-20,-6,-19,-15,2,-7,-6,-6,10,15,-13,-27,-12,-5; MA /M: SY='G'; M=-1,1,-25,2,-9,-26,31,-12,-32,-10,-26,-18,4,-17,-12,-10,1,-12,-24,-25,-22,-11; MA /M: SY='E'; M=-9,3,-24,4,13,-25,-16,-1,-24,13,-21,-13,3,-9,6,13,-3,-6,-20,-27,-13,8; MA /M: SY='I'; M=-6,-21,-18,-25,-21,-2,-29,-21,21,-21,14,10,-19,-24,-17,-19,-13,-3,19,-23,-3,-20; MA /M: SY='E'; M=-4,3,-23,3,4,-18,-11,-7,-17,-1,-18,-13,3,-9,-1,-5,1,-4,-14,-25,-11,1; MA /M: SY='I'; M=-8,-25,-23,-27,-20,1,-30,-21,21,-20,18,12,-22,-18,-18,-18,-18,-7,16,-21,-1,-20; MA /M: SY='P'; M=-6,0,-24,2,1,-22,-13,-8,-21,-2,-23,-15,1,14,-4,-7,3,2,-19,-31,-18,-3; MA /M: SY='E'; M=-7,1,-27,4,11,-24,-15,-4,-19,2,-18,-11,0,-1,6,-1,-2,-6,-19,-25,-14,7; MA /I: E1=0; IE=-105; DE=-105; NR /RELEASE=39,87397; NR /TOTAL=46(44); /POSITIVE=45(43); /UNKNOWN=1(1); /FALSE_POS=0(0); NR /FALSE_NEG=0; /PARTIAL=0; CC /TAXO-RANGE=??E?V; /MAX-REPEAT=2; DR O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T; DR P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; Q01295, BRC1_DROME, T; DR Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T; Q28068, CALI_BOVIN, T; DR Q13939, CALI_HUMAN, T; Q08605, GAGA_DROME, T; Q01820, GCL1_DROME, T; DR P10074, HKR3_HUMAN, T; Q04652, KELC_DROME, T; P42283, LOLL_DROME, T; DR P42284, LOLS_DROME, T; O14682, PI10_HUMAN, T; Q05516, PLZF_HUMAN, T; DR O43791, SPOP_HUMAN, T; P42282, TTKA_DROME, T; P17789, TTKB_DROME, T; DR P21073, VA55_VACCC, T; P24768, VA55_VACCV, T; P21037, VC02_VACCC, T; DR P17371, VC02_VACCV, T; P32228, VC04_SPVKA, T; P32206, VC13_SPVKA, T; DR P21013, VF03_VACCC, T; P24357, VF03_VACCV, T; P22611, VMT8_MYXVL, T; DR P08073, VMT9_MYXVL, T; O43167, Y441_HUMAN, T; Q10225, YAZ4_SCHPO, T; DR P40560, YIA1_YEAST, T; P34324, YKV2_CAEEL, T; P34371, YLJ8_CAEEL, T; DR P34568, YNV5_CAEEL, T; P41886, YPT9_CAEEL, T; Q09563, YR47_CAEEL, T; DR Q10017, YSW1_CAEEL, T; Q13105, Z151_HUMAN, T; Q60821, Z151_MOUSE, T; PRINTS Compendium of protein motif fingerprints Most protein families 
are characterized by several conserved motifs.
⚫ Fingerprint: a set of motif(s) (simple or composite, such as multidomains) that acts as a signature of family membership.
⚫ True family members exhibit all elements of the fingerprint, while subfamily members may possess only a part.
PRINTS fingerprints are currently derived from the OWL database by an iterative process, which is repeated until no further complete fingerprint matches can be identified. The results are then annotated for inclusion in the database.

BLOCKS
The Blocks Database contains multiple alignments of conserved regions in protein families. The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks and Block Maker are aids to the detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively. The BLOCKS Database is based on InterPro entries with sequences from SWISS-PROT and TrEMBL and with cross-references to PROSITE and/or PRINTS and/or SMART and/or PFAM and/or ProDom entries.

BLOCKS DATABASE
The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro. The blocks created by Block Maker are made in the same manner as the blocks in the Blocks Database, but with sequences provided by the user. Results are reported in a multiple sequence alignment format without calibration and in the standard Block format for searching.

FORMAT OF A BLOCK
ID   short_identifier; BLOCK
AC   block_number; distance from previous block = (min,max)
DE   description
BL   xxx motif; width=w; seqs=s; 99.5%=n1; strength=n2
sequence_id (offset) sequence_segment sequence_weight
...
//

⚫ The ID line starts a block entry and contains a short identifier for the group of sequences from which the block was made. If the block was taken from InterPro, it will be the InterPro group ID. The identifier is terminated by a semi-colon, and the word "BLOCK" indicates the entry type.
⚫ The AC line contains the block number, a seven-character group number for the sequences from which the block was made.
⚫ The DE line contains a description of the group of sequences from which the block was made.
⚫ xxx = the amino acids in the spaced triplet found by MOTIF upon which the block is based.
⚫ w = width of the sequence segments (columns) in the block.
⚫ s = number of sequence segments (rows) in the block.
⚫ n1 = raw calibration score; the 99.5th percentile score of true negative sequences. Raw search scores are normalized by dividing by this score and multiplying by 1000.
⚫ n2 = median normalized score of known true positive sequences as documented in InterPro.
⚫ Following the BL line are lines for each sequence with a segment in the block. The segments may be clustered, with clusters separated by blank lines. Each segment line contains a sequence identifier, the offset from the beginning of the sequence to the block in parentheses, the sequence segment, and a weight for the segment. The weights are normalized so that the most distant segment has a weight of 100.
⚫ The // line terminates a block entry.

PFAM (http://www.sanger.ac.uk/Software/Pfam/)
The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. The construction and use of Pfam is tightly tied to the HMMER software package.
PFAM
Pfam is composed of two sets of families:
⚫ Pfam-A: the manually curated part, containing over 8296 protein families.
⚫ Pfam-B: an automatically generated supplement containing a large number of small families taken from the PRODOM database that do not overlap with Pfam-A (lower quality).
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

PFAM
Each family has the following data:
⚫ A seed alignment, which is a hand-edited multiple alignment representing the family.
⚫ Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is an ls mode (global) model; the other is an fs mode (local) model.
⚫ A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
⚫ Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data that record how the family was constructed.

PFAM SEARCHES AND RESULTS
The data and additional features are accessible via four websites:
http://www.sanger.ac.uk/Software/Pfam/
http://pfam.wustl.edu
http://pfam.jouy.inra.fr/
http://Pfam.cgb.ki.se/

EXERCISE 1 – TEXT SEARCH
1. Go to EXPASY. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)" and then search for human cochlin. Notice that there is a wealth of information about this protein. Furthermore, there are many links to sequence analysis tools (some of which you will learn later) and some other nice features. Note that this is merely a graphical display of the original UniProtKB/SwissProt database entry (which is in text).
2. Try to answer all of the questions below.
   1. In which year was the NMR structure of the LCCL domain determined?
   2. Where is the protein expressed?
   3. Which diseases are associated with the protein?
EXERCISE 2 – BLAST SEARCH 1. Go to EXPASY. Click "UniProt Knowledgebase (Swiss-Prot and TrEMBL)” and then „BLAST”. 2. Copy the following human amino acid sequence. MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDK RSLPALTNIIKILRHDIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRH GQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRD FLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQG DSIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQRIE VLDNTQQLKILADSINSEIGILCSALQKIK 3. Paste the sequence into the query sequence window and adjust the options as necessary. You won't need to specify advanced options, but you should choose a program and database. For simplicity, use e.g. the UniProtKB database. 4. Run the search and identify the protein. Use the link provided to see the UniProtKB/SWISS-PROT report. EXERCISE 2 – BLAST SEARCH 5. Now, try to answer all of the questions below. 1. What is the SWISS-PROT primary accession number? 2. What is the common name of the protein? 3. What is the gene called? 4. Which year was the crystal structure of the catalytic domain determined? Name the first author. 5. Does the enzyme require a co-factor to function? If so, what? 6. Name the most common disease that arises as a result of deficiency of this enzyme. 7. How many amino acid residues are there in the protein? 8. What is the molecular weight of the protein? EXERCISE 3 – DOMAIN SEARCH 1. Go to the PROSITE site. 2. Under "Tools for PROSITE" choose ScanProsite. 3. Paste the sequence below into the box and tick the Option "Exclude patterns with a high probability of occurrence" (to find very common patterns will not tell you much about your protein). 
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG ASSARRVRKLREVMHKKTCDVLKEFLGLH 4. Start the scan. Which are the motifs that are found? EXERCISE 4– DOMAIN SEARCH 1. Go to the Pfam site. 2. Click „Search by protein name or sequence„. 3. Paste the sequence below into the box and choose „Both Global and Fragment Pfam search”. MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKE ATCDHFMCLQQGSECDIWDGQPVCKCKDRCEKEPSFTCASDGLTYYNRCYMDAEACSKGITLAVVTCRYHFTWPNTSPPPPETTMHPTTASPETPELDMAAPALLNNPV HQSVTMGETVSFCDVVGRPRPEITWEKQLEDRENVVMRPNHVRGNVVVTNIAQLVIYNAQLQDAGIYTCTARNVAGVLRADFPLSVVRGHQAAATSESSPNGTAFPAAEL KPPDSEDCGEEQTRWHFDAQANNCLTFTFGHCHRNLNHFETYEACMLACMSGPLAACSLPALQGPCKAYAPRWAYNSQTGQCQSFVYGGCEGNGNNFESREACEESP FPRGNQRCRACKPRQKLVTSFCRSDFVILGRVSELTEEPDSGRALVTVDEVLKDEKMGLKFLGQEPLEVTLLHVDWACPCPNVTVSEMPLIIMGEVDGGMAMLRPDSFVG ASSARRVRKLREVMHKKTCDVLKEFLGLH 4. Search Pfam. 1. Which domains are found? 2, What may be the function of this protein? STRUCTURE CLASSIFICATION DATABASE INTRODUCTION Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. A knowledge of these relationships is crucial to our understanding of the evolution of proteins and of development. It will also play an important role in the analysis of the sequence data that is being produced by worldwide genome projects. 
STRUCTURE CLASSIFICATION DATABASE
A database in which protein structures are classified according to their geometrical and evolutionary similarities. Structures tend to be classified at the level of their individual domains, but multidomain structures can also be classified into evolutionary families and superfamilies. Examples of such databases:
⚫ SCOP (Structural Classification of Proteins)
⚫ CATH (Class, Architecture, Topology, Homology)

SCOP (STRUCTURAL CLASSIFICATION OF PROTEINS)
The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. The SCOP database is maintained at the MRC Laboratory of Molecular Biology and Centre for Protein Engineering. It aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB). In addition, the hypertext pages offer a panoply of representations of proteins, including links to PDB entries, sequences, references, images and interactive display systems. Existing automatic sequence and structure comparison tools cannot identify all structural and evolutionary relationships between proteins. The SCOP classification of proteins has therefore been constructed manually by visual inspection and comparison of structures, with the assistance of tools to make the task manageable and help provide generality. The job is made more challenging, and theoretically daunting, by the fact that the entities being organized are not homogeneous: sometimes it makes more sense to organize by individual domains, and other times by whole multi-domain proteins.

CLASSIFICATION
Proteins are classified in a hierarchy that reflects both structural and evolutionary relatedness. Within the hierarchy there are many levels, but principally these describe
⚫ family
⚫ superfamily and
⚫ fold.
The levels of SCOP are as follows.
⚫ Class: Types of folds, e.g., beta sheets.
⚫ Fold: The different shapes of domains within a class.
⚫ Superfamily: The domains in a fold are grouped into superfamilies, which have at least a distant common ancestor.
⚫ Family: The domains in a superfamily are grouped into families, which have a more recent common ancestor.
⚫ Protein domain: The domains in families are grouped into protein domains, which are essentially the same protein.
⚫ Species: The domains in "protein domains" are grouped according to species.
⚫ Domain: Part of a protein. For simple proteins, it can be the entire protein.
The exact position of the boundaries between these levels is to some degree subjective, but the higher levels generally reflect the clearest structural similarities.

FAMILY (CLEAR EVOLUTIONARY RELATIONSHIP)
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity;
⚫ for example, many globins form a family though some members have sequence identities of only 15%.

SUPERFAMILY: PROBABLE COMMON EVOLUTIONARY ORIGIN
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
⚫ For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

FOLD: MAJOR STRUCTURAL SIMILARITY
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections, whether or not they have a common evolutionary origin. In some cases the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
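The 30% pairwise identity figure quoted for SCOP families can be illustrated with a small calculation. The sketch below assumes two sequences that have already been aligned (equal length, "-" for gaps) and uses one common definition of percent identity, namely identical residues divided by the number of columns where both sequences have a residue; other definitions exist. The names `percent_identity`, `FAMILY_IDENTITY_THRESHOLD` and `meets_family_rule_of_thumb` are hypothetical and are not part of SCOP.

```python
# Rule-of-thumb threshold taken from the text above; SCOP itself also weighs
# structural and functional evidence, so this is not the SCOP procedure.
FAMILY_IDENTITY_THRESHOLD = 30.0


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip gap columns
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def meets_family_rule_of_thumb(aligned_a: str, aligned_b: str) -> bool:
    """True if the pair reaches the 30% identity rule of thumb."""
    return percent_identity(aligned_a, aligned_b) >= FAMILY_IDENTITY_THRESHOLD


# Two short peptides used purely as toy input: 6 of 8 positions agree.
identity = percent_identity("CYEDGGIS", "CYEEGGIT")
```

As the globin example shows, a pair that falls below the threshold may still belong to the same family, so a check like this can only serve as a first screen.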
CATH
⚫ The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank, maintained at UCL.
⚫ Only crystal structures solved to a resolution better than 4.0 angstroms are considered, together with NMR structures.
⚫ All non-proteins, models, and structures with greater than 30% “C-alpha only” are excluded from CATH.
⚫ This filtering of the PDB is performed using the SIFT protocol.
⚫ Protein structures are classified using a combination of automated and manual procedures.

FIVE LEVELS IN THE HIERARCHY
Class, C-level
Class is determined according to the secondary structure composition and packing within the structure. Three major classes are recognised:
⚫ mainly-alpha,
⚫ mainly-beta and
⚫ alpha-beta. This class includes both alternating alpha/beta structures and alpha+beta structures.
⚫ A fourth class is also identified, which contains protein domains with low secondary structure content.

Architecture, A-level
This describes the overall shape of the domain structure as determined by the orientations of the secondary structures, but ignores the connectivity between the secondary structures. It is currently assigned manually using a simple description of the secondary structure arrangement, e.g. barrel or 3-layer sandwich.

Topology (Fold family), T-level
⚫ It gives a description that encompasses both the overall shape and the connectivity of secondary structures.
⚫ This is achieved by means of structure comparison algorithms that use empirically derived parameters to cluster the domains.
⚫ Structures in which at least 60% of the larger protein matches the smaller are assigned to the same topology.

Homologous superfamily, H-level
⚫ This level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous. Similarities are identified either by high sequence identity or by structure comparison using SSAP (Sequential Structure Alignment Program).
⚫ Structures are clustered into the same homologous superfamily if they satisfy one of the following criteria:
⚫ Sequence identity >= 35%, overlap >= 60% of larger structure equivalent to smaller.
⚫ SSAP score >= 80.0, sequence identity >= 20%, 60% of larger structure equivalent to smaller.
⚫ SSAP score >= 70.0, 60% of larger structure equivalent to smaller, and domains which have related functions, as informed by the literature and the Pfam protein family database (Bateman et al., 2004).
⚫ Significant similarity from HMM-sequence searches and HMM-HMM comparisons using SAM (Hughey & Krogh, 1996), HMMER (http://hmmer.wustl.edu) and PRC (http://supfam.org/PRC).

SEQUENCE FAMILY LEVELS
At this level domains have sequence identities >35% (with 60% of the larger structure equivalent to the smaller), which indicates similar structures and functions. Domains within each H-level are subclustered into sequence families using multi-linkage clustering at the following levels: D (distinct structures)

PDBSUM
A major resource providing key information on each macromolecular structure deposited at the Protein Data Bank. It was a web-based compendium maintained at UCL and has now moved to EMBL (http://www.ebi.ac.uk/pdbsum/). It includes
⚫ images of the structure
⚫ annotated plots of each protein chain’s secondary structure
⚫ detailed structural analyses generated by the PROMOTIF program

PROMOTIF
PROMOTIF provides details of the location and types of structural motifs in proteins of known structure by analysis of Brookhaven format coordinate files. The current version of the program analyses the following structural features:
⚫ Secondary structure
⚫ Beta strands
⚫ Disulphide bridges
⚫ Beta bulges
⚫ Beta turns
⚫ Beta hairpins
⚫ Gamma turns
⚫ Beta alpha beta units
⚫ Helical geometry
⚫ Psi loops
⚫ Helical interactions
⚫ Beta sheet topology
⚫ Main chain hydrogen bonding patterns

COMPOSITE PROTEIN SEQUENCE DATABASE
Compile a composite,
i.e. a database that amalgamates a variety of different primary sources. This makes searching easier and much more efficient. Examples: NRDB, OWL, MIPSX, SWISS-PROT+TrEMBL.

COMPOSITE PROTEIN SEQUENCE DB
Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures.

NRDB             OWL          MIPSX                 SPTrEMBL *
PDB              SWISS-PROT   PIR                   SWISS-PROT
SWISS-PROT       PIR          MIPS                  SPTrEMBL
PIR              GenBank      NRL-3D                TrEMBLnew
GenPept          NRL-3D       SWISS-PROT
SP update                     EMBL translation
GenPept update                GenBank translation
                              Kabat (immuno)
                              PseqIP

Redundancy priority criteria: within each column, sources are listed in priority order.
* Also called SWall at EBI. SWIR: SPTrEMBL + Wormpep

METABOLIC DATABASES
Metabolic databases play a crucial role in biochemistry by providing comprehensive resources that facilitate the study and analysis of metabolic pathways. They are essential tools for researchers who need information about metabolic pathways, gene products, and interactions. These databases help scientists understand complex biochemical processes and can aid in the identification of targets for drug development. The growth of metabolic databases has been driven by advances in genomic and proteomic technologies, which have generated vast amounts of data. Integrating data from different sources to create more comprehensive biochemical models remains an ongoing challenge.

SOME EXAMPLES OF METABOLIC DATABASES
Metabolic databases typically describe collections of enzymes, reactions, and biochemical pathways. They are also often coupled with software for querying and visualizing metabolic information.

ECOCYC DATABASE
EcoCyc is a comprehensive database focused on the metabolism and molecular biology of the model organism Escherichia coli (E. coli). The EcoCyc database describes the enzymes that carry out each bioreaction, including their cofactors, activators, inhibitors and the subunit structures of the enzymes.
Most entries in EcoCyc contain several links to the primary biomedical literature, using online Medline entries when possible. EcoCyc is also linked to databases such as GenBank, SWISS-PROT and PDB. EcoCyc can be queried through the WWW, and can be downloaded as a binary program for Sun workstations. The EcoCyc data can be downloaded as structured files via Internet FTP. EcoCyc integrates information on metabolic pathways, enzymes, genes, and their interactions. The database is designed to provide a detailed view of the E. coli genome, including annotations and functional data, and so serves as a resource for research on metabolic functions, regulatory networks, and gene expression. EcoCyc facilitates the understanding of metabolic processes in E. coli and can be instrumental in studies involving biochemistry, genetics, and systems biology.

THE KEGG DATABASE
KEGG contains 85 pathways derived from the Boehringer Mannheim wall chart and from a collection of the Japanese Biochemical Society. These pathway collections are consensus views of biochemistry that are not specific to particular organisms, but KEGG has been used to predict the pathways of several organisms from their genomes. The database describes the reactions within each pathway, and the chemical compounds within each reaction, including compound structures. KEGG reactions are linked to enzymes in SWISS-PROT, PIR, and PDB. The KEGG graphical interface consists of manually drawn diagrams of pathways and of the complete metabolic network. KEGG can be queried via the WWW, and is available on CD-ROM for Macintosh.

THANK YOU
