Biological Databases PDF
Uploaded by AppealingChiasmus2500
MIT World Peace University
Summary
This document provides an overview of biological databases, covering their structure, classification, and primary databases. It explains the purpose of these databases and how they facilitate efficient storage and retrieval of biological information. The document also highlights the importance of primary databases for biological research.
Full Transcript
Biological Databases

Introduction to Database

A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management. The chief objective of developing a database is to organize data in a set of structured records to enable easy retrieval of information. Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, and dates. To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record. This process is called making a query.

Although data retrieval is the main purpose of all databases, biological databases often have a higher level of requirement, known as knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered. For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs. These features facilitate the discovery of new biological insights from raw data.

Structure of Databases

Biological databases are organized in a structured manner to facilitate efficient storage and retrieval of information. The most common structure is a relational database, where data is stored in tables with rows and columns. Each row represents a record, and each column represents an attribute. Databases employ a schema that defines the relationships between different tables and ensures data consistency.
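The ideas of record, field, value, and query described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name entry, its three fields, and the DEMO accession numbers are hypothetical and do not come from any real biological database.

```python
import sqlite3

# Hypothetical fields of one record (entry); not a real database schema.
FIELDS = ("accession", "organism", "sequence")

def build_demo_db():
    """Create an in-memory relational table holding three made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry ("
        "accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?)",
        [
            ("DEMO001", "Homo sapiens", "ATGGCC"),
            ("DEMO002", "Mus musculus", "ATGTTT"),
            ("DEMO003", "Homo sapiens", "ATGAAA"),
        ],
    )
    return conn

def query_by_field(conn, field, value):
    """Return every whole record whose given field holds the given value."""
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field}")
    # The field name is checked against FIELDS above; the value is bound
    # as a parameter rather than pasted into the SQL text.
    cursor = conn.execute(
        f"SELECT accession, organism, sequence FROM entry "
        f"WHERE {field} = ? ORDER BY accession",
        (value,),
    )
    return cursor.fetchall()

if __name__ == "__main__":
    db = build_demo_db()
    for record in query_by_field(db, "organism", "Homo sapiens"):
        print(record)
```

Specifying one value in one field returns the complete matching rows, which is the query behaviour described in the introduction.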
Classification of Biological Database (Based on type of data)

Biological Database
  Structure: Protein Data Bank (PDB), SCOP, CATH
  Sequence
    Nucleotide: GenBank, DDBJ, EMBL
    Protein: Swiss-PROT, PIR, TrEMBL, ProSite, PFam
  Other: Gene Expression Omnibus (GEO), PubMed, ArrayExpress, BioCyc, BRENDA, PubChem

Primary Databases

There are three major public sequence databases that store raw nucleic acid sequence data produced and submitted by researchers worldwide: GenBank, the European Molecular Biology Laboratory (EMBL) database, and the DNA Data Bank of Japan (DDBJ), which are all freely available on the Internet. These three public databases closely collaborate and exchange new data daily. Together they constitute the International Nucleotide Sequence Database Collaboration (INSDC). This means that by connecting to any one of the three databases, one should have access to the same nucleotide sequence data. Although the three databases all contain the same sets of raw data, each of them uses a slightly different format to represent the data. Fortunately, for the three-dimensional structures of biological macromolecules, there is only one centralized database, the Protein Data Bank (PDB).

The International Nucleotide Sequence Database Collaboration (INSDC) is a global collaboration of independent governmental or non-profit organisations that manage nucleotide sequence databases. These databases capture and preserve nucleotide sequence information and annotations to create a comprehensive collection that preserves the scientific record and enables broad sharing of such data. INSDC Members provide data resources that include raw sequence reads and alignments, structured metadata describing investigated samples (such as taxonomic information and experimental and project design), assembled nucleotide sequence data with functional annotation, and sequence-derived analyses.
INSDC Members exchange data and make exchanged data freely accessible without restrictive licensing as part of the scientific record, including all corrections and updates.

GenBank
https://www.ncbi.nlm.nih.gov/genbank/

GenBank is a comprehensive database of publicly available nucleotide sequences. It is maintained by the National Center for Biotechnology Information (NCBI) in the United States. The database was started in 1982 by Walter Goad at Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown in recent years at an exponential rate, doubling roughly every 18 months.

Data Storage: GenBank stores information on DNA and RNA sequences, including their annotations, such as gene names, protein products, and other relevant details.
Public Access: GenBank data is freely accessible to researchers worldwide, enabling scientists to collaborate and share their findings.
Regular Updates: GenBank is regularly updated with new sequences and annotations, ensuring that the database remains current and comprehensive.

EMBL
https://www.ebi.ac.uk/

EMBL is a nucleotide sequence database maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in the United Kingdom. The European Molecular Biology Laboratory was created in 1974 and is funded by public research money from its member states. The Laboratory operates from six sites: the main laboratory in Heidelberg (Germany), and sites in Barcelona (Spain), Grenoble (France), Hamburg (Germany), Hinxton (the European Bioinformatics Institute (EBI), in England), and Rome (Italy).

1 Global Collaboration: EMBL collaborates with other international databases, such as GenBank and DDBJ, to ensure data consistency and facilitate data exchange.
2 Data Variety: EMBL stores a wide range of sequence data, including genomic DNA, mRNA, and protein sequences.
3 Specialized Resources: EMBL offers specialized resources and tools for sequence analysis, including BLAST and ClustalW.
4 Data Sharing: EMBL is committed to open data sharing and provides access to its data through its website and various data exchange platforms.

DDBJ
https://www.ddbj.nig.ac.jp/index-e.html

The DNA Data Bank of Japan (DDBJ) is a nucleotide sequence database located at the National Institute of Genetics (NIG) in Shizuoka Prefecture, Japan. DDBJ began data bank activities in 1987 at NIG and remains the only nucleotide sequence data bank in Asia.

Japanese Focus: DDBJ has a strong focus on sequencing data from Japanese researchers, particularly in the fields of genomics and transcriptomics.
International Collaboration: DDBJ collaborates closely with GenBank and EMBL, ensuring that the three databases maintain a high level of consistency.
Specialized Resources: DDBJ offers specialized resources for analyzing sequence data, such as the DDBJ Sequence Read Archive (DRA).

Similarities and Differences between the Databases

GenBank, EMBL, and DDBJ share the common goal of storing and disseminating nucleotide sequence data.

Feature         GenBank                 EMBL
Location        United States           United Kingdom
Data Type       Nucleotide sequences    Nucleotide sequences
Data Size       Largest                 Medium
Collaboration   Yes                     Yes

Data Formats and File Types

Nucleotide sequence databases utilize specific data formats and file types to store and exchange sequence data and associated metadata.

FASTA Format: A simple text-based format that represents a sequence with a header line providing information about the sequence, followed by the sequence string itself. This format is widely used for exchanging and storing sequence data due to its simplicity and ease of use.

XML Format: A flexible format that allows for the representation of complex data structures and relationships using tags and hierarchical organization. XML is used by some databases to encode sequence data and associated metadata in a machine-readable format that facilitates data exchange and integration.

GenBank Flat File: A more structured format used by the GenBank database that includes extensive annotations and metadata related to the sequence, such as source organism, gene features, references, and other contextual information. This format provides a rich representation of the sequence data.

These data formats and file types play a crucial role in the efficient storage, exchange, and analysis of nucleotide sequence data within and across the major sequence databases.

[Figure: Sample GenBank Record]

FASTA format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. Blank lines are not allowed in the middle of FASTA input. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEK
MKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMA
LGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPT
NTIVYFGRYWSP

https://www.uniprot.org/

The Universal Protein Resource KnowledgeBase (UniProtKB) is the central hub for the collection of functional information on proteins. It has two sections: Swiss-Prot and TrEMBL.

TrEMBL

Founder: Rolf Apweiler
Secondary database
TrEMBL is a computer-annotated protein sequence database.
UniProtKB/TrEMBL contains the translations of all coding sequences (CDS) present in the INSDC, together with protein sequences extracted from the literature or submitted to UniProtKB/Swiss-Prot.
Entries are enriched with automated classification and annotation.
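Returning to the FASTA rules stated in this section (a defline starting with ">", followed by sequence lines, with no blank lines in the middle), a minimal parser might look like the sketch below. The function name parse_fasta and the two toy records are assumptions made for illustration; established libraries offer more complete FASTA readers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (defline, sequence) pairs.

    Follows the rules described in the text: each record starts with a
    line beginning with '>', the following lines hold sequence data, and
    blank lines in the middle of the input are rejected.
    """
    records = []
    header = None
    chunks = []
    # Leading and trailing whitespace around the whole input is tolerated.
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            raise ValueError("blank lines are not allowed inside FASTA input")
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first defline")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

if __name__ == "__main__":
    # Toy input: two hypothetical records, the first split over two lines.
    demo = ">seq1 hypothetical example\nACGT\nTTGA\n>seq2\nGGCC\n"
    for defline, sequence in parse_fasta(demo):
        print(defline, len(sequence))
```

Sequence lines belonging to one record are concatenated, so the line length chosen when the file was written does not affect the parsed sequence.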
UniProtKB

Swiss-Prot (Reviewed): Manually annotated. Records with information extracted from literature and curator-evaluated computational analysis.
TrEMBL (Unreviewed): Computationally annotated. Records that await full manual annotation.
SP-TrEMBL contains sequences which will eventually be added into Swiss-Prot.
REM-TrEMBL contains sequences which will not be incorporated in Swiss-Prot, for example synthetic sequences, fragments of less than 8 amino acids, and CDS where there is strong experimental evidence that the sequence does not code for a real protein.

Swiss-Prot
https://www.expasy.org/resources/uniprotkb-swiss-prot

UniProtKB/Swiss-Prot is the expertly curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications, and functionally characterized variants. It is developed by the Swiss-Prot group and UniProt partners at EMBL-EBI and PIR, and supported by the SIB Swiss Institute of Bioinformatics.

Protein Data Bank (PDB)
https://www.rcsb.org/

The PDB holds 3D structural data of large biological molecules, such as proteins and nucleic acids.
Data are obtained by X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy.
The archive is overseen by the wwPDB.

Why?
Growing crystallographic data.
Development of the Brookhaven Raster Display (BRAD) in 1968.

How did it start?

The PDB was established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton and originally contained 7 structures. After Hamilton's untimely death, Tom Koetzle began to lead the PDB in 1973, followed by Joel Sussman in 1994. In 2003, the Worldwide Protein Data Bank (wwPDB) was formed to maintain a single PDB archive of macromolecular structural data that is freely and publicly available to the global community.

PDB file format

Protein Data Bank (PDB) format is a standard for files containing atomic coordinates. A PDB-format file is a text file made up of lines of information, and each line is called a record.
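Because every line of a PDB-format file is one record, such a file can be read line by line. The sketch below counts records by type and extracts coordinates from ATOM records. The fixed column positions are taken from the published wwPDB format description and are not stated in the text above; the sample lines are made up for illustration and do not come from a real PDB entry.

```python
from collections import Counter

# Illustrative lines only; not taken from a real PDB entry.
SAMPLE = "\n".join([
    "HEADER    HYPOTHETICAL EXAMPLE ENTRY",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "END",
])

def record_type(line):
    """The record name occupies the first six columns of each line."""
    return line[:6].strip()

def count_record_types(text):
    """Count how many records of each type the file contains."""
    return Counter(record_type(line) for line in text.splitlines() if line.strip())

def parse_atom(line):
    """Read selected fixed-width fields from one ATOM record.

    Slices are 0-based, so columns 31-38 of the format become line[30:38].
    """
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

def atoms(text):
    """Parse every ATOM record in the text."""
    return [parse_atom(l) for l in text.splitlines() if record_type(l) == "ATOM"]

if __name__ == "__main__":
    print(count_record_types(SAMPLE))
    for atom in atoms(SAMPLE):
        print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Fields are read by position rather than by splitting on whitespace, since adjacent fixed-width fields in this format are not guaranteed to be separated by spaces.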
SGD
https://www.yeastgenome.org/

The Saccharomyces Genome Database (SGD) project provides the highest-quality manually curated information from peer-reviewed literature. The completion of the S. cerevisiae genomic DNA sequence in 1996 provided the sequence of each of its genes. The acquisition, integration, and retrieval of these data allow SGD to facilitate experimental design and analysis by providing an encyclopedia of the yeast genome, its chromosomal features, and their functions and interactions. SGD has information about the DNA sequence and its individual components, RNAs, encoded proteins, and the structures and biological functions of any known gene products. A goal of SGD has been to create tools which allow the user to easily retrieve and display these types of information.

Ensembl
https://ensemblgenomes.org/

Ensembl provides a genome browser that acts as a single point of access to annotated genomes, mainly for vertebrate species. Information about genes, transcripts, and further annotation can be retrieved at the genome, gene, and protein level. This includes information on protein domains, genetic variation, homology, syntenic regions, and regulatory elements. The project began in 1999 as a joint project between the EMBL European Bioinformatics Institute and the Wellcome Trust Sanger Institute. As of Ensembl release 109 (February 2023), over 300 species are supported, along with 15 mouse strains, 5 dog breeds, and 13 pig breeds.

BRENDA
https://www.brenda-enzymes.org/

BRENDA (BRaunschweig ENzyme DAtabase) was created in 1987 at the German National Research Center for Biotechnology in Braunschweig (GBF) and is now continued at the University of Cologne, Institute of Biochemistry.