Protein Database PDF
Summary
This document details a protein database, specifically Swiss-Prot. It describes the annotation process, data structure, and integration with other databases. It also highlights some of the key features of Swiss-Prot such as minimal redundancies and its role in biomolecular data interconnectivity.
Swiss-Prot

SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation, the European Bioinformatics Institute). The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes, the format of SWISS-PROT follows as closely as possible that of the EMBL nucleotide sequence database.

SWISS-PROT distinguishes itself from other protein sequence databases by three criteria: annotation, minimal redundancy, and integration with other databases.

In SWISS-PROT, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation. For each sequence entry, the core data consists of the sequence data, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein), while the annotation consists of a description of the following items:

(i) function(s) of the protein;
(ii) post-translational modification(s), for example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.;
(iii) domains and sites, for example calcium binding regions, ATP binding sites, zinc fingers, homeobox, kringle, etc.;
(iv) secondary structure;
(v) quaternary structure;
(vi) similarities to other proteins;
(vii) disease(s) associated with deficiency of the protein;
(viii) sequence conflicts, variants, etc.

In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by ‘topics’, an approach which permits easy retrieval of specific categories of data from the database.

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports.
In SWISS-PROT we try as much as possible to merge all these data, so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures), as well as with specialized data collections. SWISS-PROT is currently cross-referenced with 24 different databases. Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT.

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

(i) be as complete as possible (all sequences available at a given time should be immediately included in SWISS-PROT, including sequence corrections and updates);
(ii) provide a higher level of annotation;
(iii) cross-reference to specialized databases that contain, among other data, some genetic information about the genes that code for these proteins;
(iv) provide specific indices or documents.

The organisms currently selected are: Arabidopsis thaliana (mouse-ear cress); Bacillus subtilis; Caenorhabditis elegans (worm); Dictyostelium discoideum (slime mold); Drosophila melanogaster (fruit fly); Escherichia coli; Haemophilus influenzae; Homo sapiens (human); Saccharomyces cerevisiae (baker’s yeast); Salmonella typhimurium; Schizosaccharomyces pombe (fission yeast); Sulfolobus solfataricus.

With links to 24 different databases, SWISS-PROT has consolidated its role as the major focal point of biomolecular database interconnectivity.
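
The line-type structure described in this section can be illustrated with a minimal Python sketch that groups the lines of one entry by their two-letter code and then pulls out the comment topics (CC), keywords (KW) and feature table (FT). The sample entry is hypothetical and written only for illustration; the line codes other than CC, FT and KW (ID, AC, DE, OS, DR, SQ) are not named in the text above and are included as assumptions about the flat-file layout. This is a sketch, not a complete parser for the format.

```python
from collections import defaultdict

# Hypothetical, heavily abbreviated entry written for illustration only;
# it is not copied from a real SWISS-PROT release.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   12 AA.
AC   X00001;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
OS   Homo sapiens (Human).
CC   -!- FUNCTION: Illustrative comment text.
CC   -!- SUBUNIT: Homodimer.
DR   EMBL; Z00001; -.
KW   Zinc-finger; Phosphorylation.
FT   CONFLICT      5      5       A -> T (in Ref. 2).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQRQI
//
"""


def parse_entry(text):
    """Group the lines of one entry by line code and collect the sequence."""
    lines_by_code = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, content = line[:2], line[5:].rstrip()
        if code == "SQ":
            in_sequence = True
            lines_by_code[code].append(content)
        elif in_sequence and code == "  ":
            # Sequence data lines carry no line code, only leading blanks.
            sequence_parts.append(content.replace(" ", ""))
        else:
            lines_by_code[code].append(content)
    return dict(lines_by_code), "".join(sequence_parts)


def comment_topics(cc_lines):
    """Map each comment topic (the word after '-!-') to its text."""
    topics = {}
    current = None
    for content in cc_lines:
        if content.startswith("-!- "):
            topic, _, text = content[4:].partition(":")
            current = topic.strip()
            topics[current] = text.strip()
        elif current is not None:
            # Continuation line of the previous topic.
            topics[current] += " " + content.strip()
    return topics


def keywords(kw_lines):
    """Split the keyword lines into a flat list of keywords."""
    joined = " ".join(kw_lines).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


fields, sequence = parse_entry(SAMPLE_ENTRY)
topics = comment_topics(fields.get("CC", []))
print(sorted(topics))                # comment topics found in the entry
print(keywords(fields.get("KW", [])))
print(fields.get("FT", []))          # sequence conflicts are reported here
print(len(sequence))
```

Retrieving a specific category of data then amounts to a dictionary lookup such as `topics["FUNCTION"]`, which is the kind of topic-based retrieval the comment classification is meant to support.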
Protein Information Resource

PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers and customers in the identification and interpretation of protein sequence information. Since 1988, the Protein Sequence Database has been maintained by PIR-International, an association of macromolecular sequence data collection centres: the consortium includes the Protein Information Resource at NBRF, the Japan International Protein Information Database (JIPID) and the Martinsried Institute for Protein Sequences (MIPS).

In its current form, the database is split into four distinct sections, designated PIR1 to PIR4:

PIR1 contains fully classified and annotated entries.
PIR2 contains preliminary entries which have not been thoroughly reviewed and may contain redundancy.
PIR3 contains unverified entries.
PIR4 contains data which fall into one of the following categories: conceptual translations of artefactual sequences; conceptual translations of sequences that are not transcribed or translated; protein sequences or conceptual translations that are extensively genetically engineered; or sequences that are not genetically encoded and are not produced on ribosomes.

The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), a public-domain protein sequence database that currently contains over 283 000 annotated and classified entries covering the entire taxonomic range. Recent development and annotation efforts have focused on two areas: superfamily classification and curation, and bibliography mapping and attribution.
A unique characteristic of the PIR-PSD is the superfamily classification that provides comprehensive, non-overlapping, and hierarchical clustering of sequences to reflect their evolutionary relationships. To further improve the quality of automated classification, systematic superfamily curation is conducted that: (i) defines the signature domain architecture (number, order, and types of domains) characteristic of the superfamily, (ii) categorizes regular and associate members to distinguish sequence entries sharing the signature features from outliers (such as fragments), and (iii) designates representative and seed members amongst regular members. Several thousand superfamilies have been manually curated. The seed members provide a basis for automatic placement of new sequences into existing superfamilies and for automatic generation of multiple sequence alignments and phylogenetic trees. Currently, over 99% of PSD sequences are classified into families of closely related sequences (at least 45% identical), and over two-thirds of sequences are classified into more than 36 000 superfamilies.

To improve the quality of protein annotation by increasing the amount of experimentally verified data with source attribution, the PIR has developed a bibliography information system and conducted retrospective attribution of literature data. The bibliography system allows browsing and searching of extensive literature collected for all protein entries from PubMed and other curated molecular databases, together with an interface for scientists to categorize and submit literature information for mapped proteins. In PIR-PSD, protein features such as binding sites, structural motifs, and post-translational modifications are tagged with ‘experimental’ status for experimentally determined features to distinguish them from those that are computationally predicted; however, they had not been associated with literature citations.
A systematic manual attribution of experimental features is being carried out with computer-assisted mapping to existing protein bibliographic information. So far, a few thousand experimental features have been associated with publications.

PIR-NREF (Non-redundant REFerence) Database

The PIR-NREF provides a timely and comprehensive collection of protein sequence data, keeping pace with the genome sequencing projects and containing source attribution and minimal redundancy. The database contains all sequences in PIR-PSD, SWISS-PROT, TrEMBL, RefSeq, GenPept, and PDB, currently totaling more than 1 000 000 entries. Identical sequences from the same source organism (species) reported in different databases are presented as a single NREF entry with protein IDs, accession numbers, and protein names from each underlying database, as well as amino acid sequence, taxonomy, and composite bibliographic data. Also listed are related sequences identified by all-against-all FASTA search, including identical sequences from different organisms, identical subsequences, and highly similar sequences (95% identity).

NREF can be used for sequence searching and protein identification against the entire sequence collection or a subset of one or more genomes. The collective protein names, including synonyms, and the bibliographic information can be used to develop a protein name ontology. The different protein names assigned by different databases may help detect annotation errors, especially those resulting from large-scale genomic annotation.

Martinsried Institute for Protein Sequences

The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database.
The database is distributed with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. MIPS provides internet access to sequence databases, homology data and yeast genome information.

TREMBL

TREMBL is an unannotated supplement to SWISS-PROT. Ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we will introduce with SWISS-PROT release 33 an unannotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT. The supplement is named TREMBL (TRanslation from EMBL); the translation tools used to create the translations of the CDS were written by Thure Etzold at the EMBL in Heidelberg.

Translation of all CDS in the EMBL nucleotide sequence database release 44 resulted in the creation of 145 000 TREMBL pre-entries. Around 65 000 of these pre-entries were already present as sequence reports in SWISS-PROT and were excluded from TREMBL. The remaining ∼80 000 sequence entries have been automatically merged whenever possible, to reduce redundancy in TREMBL. This step led to ∼70 000 TREMBL entries, which supplement SWISS-PROT.

TREMBL entries have been divided into two main sections: SP-TREMBL and REM-TREMBL. SP-TREMBL (SWISS-PROT TREMBL) contains entries (∼55 000) which should be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned to these entries.
SP-TREMBL is partially redundant against SWISS-PROT, since ∼30 000 of these SP-TREMBL entries are only additional sequence reports of proteins already in SWISS-PROT. These sequence reports were merged as fast as possible with the already existing SWISS-PROT entries for these proteins, so as to make SWISS-PROT and TREMBL completely non-redundant.

REM-TREMBL (REMaining TREMBL) contains those entries (∼15 000) that were not included in SWISS-PROT. This section is organized into four subsections. Most REM-TREMBL entries are immunoglobulins and T-cell receptors; at the moment there are more than 10 000 immunoglobulins and T-cell receptors in TREMBL. SWISS-PROT has stopped entering immunoglobulins and T-cell receptors, since the aim is to keep only germ line gene-derived translations of these proteins in SWISS-PROT and not all known somatic recombinant variations of these proteins. The plan is to create a specialized database dealing with these sequences as a further supplement to SWISS-PROT and to keep only a representative cross-section of these proteins in SWISS-PROT. Another category of data which will not be included in SWISS-PROT is synthetic sequences. A third subsection consists of fragments with fewer than seven amino acids. The last subsection consists of CDS translations for which there is strong evidence that the CDS does not code for a real protein.
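
The split between SP-TREMBL and the four REM-TREMBL subsections can be sketched as a simple routing function in Python. Everything in this sketch beyond the four categories and the seven-residue threshold is an assumption made for illustration: the `Translation` record, its boolean flags, the section labels and the order in which the rules are checked are all hypothetical, and the text does not say how the real pipeline makes these decisions.

```python
from dataclasses import dataclass

# The text places fragments with fewer than seven amino acids in REM-TREMBL.
MIN_RESIDUES = 7


@dataclass
class Translation:
    """Hypothetical record for one translated CDS (illustrative fields only)."""
    sequence: str
    is_immunoglobulin_or_tcr: bool = False
    is_synthetic: bool = False
    doubtful_cds: bool = False  # strong evidence the CDS is not a real protein


def assign_section(t: Translation) -> str:
    """Return the TREMBL section for a translation (rule order is assumed)."""
    if t.is_immunoglobulin_or_tcr:
        return "REM-TREMBL: immunoglobulins and T-cell receptors"
    if t.is_synthetic:
        return "REM-TREMBL: synthetic sequences"
    if len(t.sequence) < MIN_RESIDUES:
        return "REM-TREMBL: short fragments"
    if t.doubtful_cds:
        return "REM-TREMBL: doubtful CDS translations"
    # Everything else is a candidate for incorporation into SWISS-PROT.
    return "SP-TREMBL"


examples = [
    Translation("MKTAYIAKQRQISFVK"),
    Translation("MKTAYI"),
    Translation("MKTAYIAKQRQISFVK", is_synthetic=True),
    Translation("MKTAYIAKQRQISFVK", is_immunoglobulin_or_tcr=True),
    Translation("MKTAYIAKQRQISFVK", doubtful_cds=True),
]
for t in examples:
    print(len(t.sequence), assign_section(t))
```

A sequence of exactly seven residues is routed to SP-TREMBL in this sketch, since the threshold in the text is "fewer than seven".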