Lecture 2: Biological Databases - BIOC 3265
UWI Cave Hill
Dr. A. T Alleyne
Summary
Lecture notes on biological databases, covering learning outcomes, definitions, functions, types of data, database properties, and in-silico biology. Includes a discussion of different database formats and considerations for database management.
Full Transcript
Lecture 2: Biological Databases
BIOC 3265: Principles of Bioinformatics
Dr. A. T Alleyne, UWI Cave Hill

Learning Outcomes
At the end of this lecture, you should be able to:
1. Explain the functions of a biological database
2. Distinguish between a primary database, a secondary database, and a derived or composite database
3. Distinguish between data and metadata
4. Critically use the basic functions in a bioinformatics database for information retrieval
5. Describe important parts of a flat file gene record
6. Explain the meaning of in-silico biology

Definition
A biological database is a system that consists of organized, biologically determined data, linked to software for querying, retrieving and curating the data stored within. It may consist of a single file containing many records, each of which includes the same set of information.

Two main functions of a biological database
1. Makes biological data available to scientists in a single place
2. Makes biological data available in a computer-readable and accessible form
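
The idea of a single file holding many records, each with the same set of information, linked to software for querying, can be illustrated with a minimal Python sketch. The record fields, the accession values and the `query` helper are all hypothetical and chosen only for illustration; real databases use far richer schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """One record; every record carries the same set of fields."""
    accession: str   # hypothetical identifiers, not real database entries
    organism: str
    molecule: str
    sequence: str


# A toy "database": many records with an identical structure.
RECORDS = [
    SequenceRecord("ACC0001", "Drosophila melanogaster", "DNA", "ATGGCCAAGT"),
    SequenceRecord("ACC0002", "Homo sapiens", "DNA", "ATGCTTGACC"),
    SequenceRecord("ACC0003", "Drosophila melanogaster", "mRNA", "ATGAAACCCG"),
]


def query(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    hits = query(RECORDS, organism="Drosophila melanogaster")
    print([record.accession for record in hits])
```

Because every record exposes the same fields, one generic query function can serve any combination of search criteria.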

Types of Data
o Nucleic acid sequences and whole genomes
o Amino acid sequences
o Protein and nucleic acid structure
o Expression data for DNA: ESTs (expressed sequence tags)
o Protein function
o Networks and metabolic pathways

History
o The Protein Information Resource (PIR) is an integrated public bioinformatics resource to support genomic and proteomic research and scientific studies (Wu et al, 2003)
o The original collection contained all the known protein sequences determined up to 1965
o It was published by Margaret Dayhoff and colleagues
o It forms the basis of the current PIR database
http://pir.georgetown.edu/

New types of biological data
[Figure. Source: On the Nature of Biological Data, in Catalyzing Inquiry at the Interface of Computing and Biology, NCBI Bookshelf (nih.gov)]

Database Properties
o Archive of information
o Archiving: data entry and quality control
o Scientists (teams) deposit data directly
o Schema: the logical organization of information
o Tools to access the archive
o Curators add and update data types
o Curators check errors
o Updates ensure consistency, remove redundancy and resolve conflicts

Biological database classification based on data types
[Figure. Source: https://doi.org/10.1016/B978-0-323-89775-4.00021-3]

Primary or Archival Databases
o Continuously updated
o Contain experimentally derived results; usually publicly funded
o Not representative of all the sequences for an entire species
o Primary (1°) databases exchange data

Region   Host institute    Database   URL
USA      NCBI              GenBank    www.ncbi.nlm.nih.gov/Genbank/
Europe   EBI (Cambridge)   ENA        www.ebi.ac.uk/embl/
Japan    NIG               DDBJ       www.ddbj.nig.ac.jp

Together these form the INSDC (International Nucleotide Sequence Database Collaboration).
Source: http://nar.oxfordjournals.org/content/44/D1/D1.abstract

Secondary Databases
o Contain results of analysis of primary databases
o Consolidate several databases; the analysis is highly curated
o Do not exchange data regularly
Example: Influenza virus database - NCBI (nih.gov)

Some secondary databases
o Biosystems: biochemical pathways
o PubChem (nih.gov)
o Genome Biology (www.ncbi.nlm.nih.gov/Genomes/): the Genome Biology site at NCBI contains information about the available complete genomes
o Prosite (https://prosite.expasy.org/): consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles

Composite database
o These consist of data collected from more than two primary database resources
o Composite databases contain non-redundant data
o The data entered in these types of databases are first compared and then filtered based on desired criteria
o They can reduce search time
Example: About OMIM - OMIM

Some Database Organismal Codes (Division)
Abbr.  Name             DDBJ  EMBL  GenBank
BCT    Bacterial        x           x
FUN    Fungal                 x
HUM    Human            x     x
INV    Invertebrate     x     x     x
MAM    Other mammalian  x     x     x
ORG    Organelle              x
PHG    Phage            x     x     x
PLN    Plant            x     x     x
PRI    Primate          x     x     x
PRO    Prokaryotic            x
ROD    Rodent           x     x     x
VRL    Viral            x     x     x

Functional Categories
Division  Name                             Function
EST       Expressed Sequence Tags          Short reads (300-500 bp) from cDNA, produced in large numbers
STS       Sequence Tagged Sites            Short (200-500 bp), unique sequences used in PCR assays; they generally map to a single position in a genome
GSS       Genome Survey Sequences          Similar to ESTs, but these represent genomic sequences
HTG       High Throughput Genome Sequences Unfinished DNA sequences generated by high-throughput sequencing centers
WGS       Whole Genome Shotgun sequences   Sequences for large DNA sequence projects covering the whole organism; may contain unfinished sequences
CON       Contigs                          Constructed records of chromosomes, genomes and other long DNA sequences

RefSeq
RefSeq: NCBI Reference Sequence Database (nih.gov)
o A non-redundant database managed by NCBI
o Provides a comprehensive, integrated sequence set at the genomic, transcript and protein levels for select organisms
o Entries are curated, so they provide a stable reference for analyses

Growth of the RefSeq database
o RefSeq records are derived from publicly available sequence data
o Manual curation is applied to RefSeq records
Source: About RefSeq (nih.gov)

Protein Sequence Databases
o Some protein databases are secondary databases because the information has been derived from translation of nucleotide sequences in DDBJ, EMBL or GenBank
o A universal protein database covers all species
o A specialized protein database may concentrate on particular protein families or groups of proteins

UniProt (2002)
o EMBL provides protein database searches among many other services
o Source databases: Swiss-Prot (Geneva), TrEMBL (EBI) and PIR-PSD (USA)
o Components: UniProt archive (UniParc), UniProt knowledgebase and UniProt non-redundant database
http://www.ebi.ac.uk/

Big Data: Petabytes of Information

Data and Metadata
o Data: recorded observations of the biological objects or models studied
o Metadata: provides a more detailed, unique description of the corresponding data. It is primarily used for identification, classification, retrieval and validation of data sets
o Metadata can describe the resources used to generate the data (experimental protocols and bioinformatics programs, for example)
o In a biological context, metadata often provide additional information on samples, such as sex, disease and tissue source site
o Metadata often include information about resources such as cell lines and antibodies

Standardisation of databases, datasets and data entry
Core principles: Findability, Accessibility, Interoperability, and Reuse of digital assets.
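
The distinction between data and metadata, and the use of metadata for retrieval and validation, can be sketched in a few lines of Python. This is only an illustrative sketch: the sequences, the metadata values, the list of required fields and both helper functions are hypothetical and do not come from any real submission standard.

```python
# Metadata fields that this toy example treats as mandatory (an assumption).
REQUIRED_METADATA = {"sex", "disease", "tissue_source_site"}

# Each entry keeps the observation (data) apart from its description (metadata).
DATASETS = [
    {
        "data": "ATGGCCAAGTTC",
        "metadata": {"sex": "female", "disease": "melanoma",
                     "tissue_source_site": "site A", "protocol": "RNA-seq"},
    },
    {
        "data": "ATGCTTGACCGA",
        "metadata": {"sex": "male", "disease": "melanoma",
                     "tissue_source_site": "site B"},
    },
    {
        "data": "ATGAAACCCGGG",
        "metadata": {"sex": "female", "disease": "healthy"},
    },
]


def missing_metadata(entry):
    """Validation: report which required metadata fields are absent."""
    return REQUIRED_METADATA - set(entry["metadata"])


def retrieve(entries, **wanted):
    """Retrieval: return the data of entries whose metadata match all criteria."""
    return [
        entry["data"]
        for entry in entries
        if all(entry["metadata"].get(key) == value for key, value in wanted.items())
    ]


if __name__ == "__main__":
    print(retrieve(DATASETS, disease="melanoma"))
    print(missing_metadata(DATASETS[2]))
```

Note that the search never inspects the sequences themselves; findability here depends entirely on how consistently the metadata were recorded.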
FAIR Principles - GO FAIR (go-fair.org)
For example: About BankIt Submission (nih.gov)
Data can be submitted to GenBank either by BankIt or by tbl2asn, a command-line program that automates the creation of sequence records for submission to GenBank. tbl2asn is used primarily for the submission of complete genomes and large batches of sequences.

Some Database Management Problems
o Quality of data
o Need to curate data: errors in entry or submission
o Need to update new data quickly and accurately
o Publish/subscribe mechanisms are limited
o Specifying constraints in data integrity
o Data is commonly imported to many sites
o Inter-operability
o Lack of design methodology
o Need for common nomenclatures/vocabularies
o The discovery problem

Database management
o Flat file: text based. Implementation is based on only one record, which incorporates the complete data, i.e. all of the attributes for each variable. The consistency in flat-file formats enables a one-to-one mapping of fields from one format to another
o Relational: tabular or relationship-based organisation of contents in a database. Each row contains a record, and each column specifies an attribute of the record
Source: The Data Format In Nucleotide Sequence Databases (bioinformaticshome.com)

Database Formats
o The basic format of the three main primary databases is a flat file: a text file with small amounts of data, easily read by the three primary databases
o The consistent format enables a one-to-one mapping of fields from one database to another
o DDBJ and GenBank flat file formats are almost identical
o EMBL flat files use a slightly different format

Detailed examination of a database record at GenBank

Parts of the flat file
Each record describes a specific sequence, encompassing details such as a sequence description, the originating organism, and relevant bibliographic references. The sequence can represent an individual gene, a gene segment, a whole genome shotgun (WGS) sequence, or various RNA forms (mRNA, tRNA, ncRNA, or rRNA) presented as cDNA.
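
The one-to-one mapping of fields between flat-file formats can be sketched as a simple lookup table. The sketch below assumes a small subset of GenBank keywords and their EMBL two-letter line codes, chosen for illustration; the example header lines and the `to_embl_tags` helper are hypothetical. It renames field labels only; the content of each line is laid out differently in the two formats and is not converted here.

```python
# Subset of GenBank keywords mapped to EMBL line codes (illustrative, not complete).
GENBANK_TO_EMBL = {
    "LOCUS": "ID",
    "DEFINITION": "DE",
    "ACCESSION": "AC",
    "KEYWORDS": "KW",
}


def to_embl_tags(genbank_lines):
    """Relabel GenBank-style header lines with the matching EMBL line code.

    Lines whose keyword is not in the mapping are skipped.
    """
    relabelled = []
    for line in genbank_lines:
        keyword, _, value = line.partition(" ")
        tag = GENBANK_TO_EMBL.get(keyword)
        if tag is not None:
            relabelled.append(f"{tag}   {value.strip()}")
    return relabelled


if __name__ == "__main__":
    # Hypothetical header lines, not taken from a real record.
    header = [
        "LOCUS       EXAMPLE1",
        "DEFINITION  Hypothetical example gene.",
        "ACCESSION   X00000",
        "SOURCE      unknown",
    ]
    for line in to_embl_tags(header):
        print(line)
```

A dictionary works here precisely because each field in one format has exactly one counterpart in the other; a many-to-one relationship would need a richer structure.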
Additionally, the database may include metadata: additional data about the origin of the sample providing the sequence.

Flat Files
o Header: contains a description of the entire record
o Feature table: contains the annotation on the record
o Sequence: the nucleotide sequence itself
o All flat files end with //

The Header
o This contains the database information on the record
o The first line is called the LOCUS line in DDBJ and GenBank, or the ID line in EMBL
DDBJ/GenBank:  LOCUS  DMU54469  2881 bp  DNA  linear  INV  22-FEB-1998
EMBL:          ID  DM54469  standard; genomic DNA; INV; 2881 BP

The Locus Name
o The locus name helps group entries with similar sequences: the first three characters represent the organism; the fourth and fifth indicate any other group designations, e.g. a gene product; the last characters are a series of sequential integers
o The only rule now applied in assigning a locus name is that it must be unique
o The locus name is now the accession number

Accession Number or Code
o The accession number (AN) is usually cited in publications for a record
o The AN is the only way to absolutely verify the identity of a sequence or database entry
o Formats may vary with the database. In GenBank there are two common formats:
1. Six characters, 1 + 5 format (one letter followed by five digits; GenBank, DDBJ, EMBL), e.g. U54469
2. Eight characters, 2 + 6 format (two letters followed by six digits), e.g. AF002534. This is the newer format, and here the locus name and the AN are the same

Accession Number Formats
Unit        Configuration
Nucleotide  1 letter + 5 numbers, or 2 letters + 6 numbers
Protein     3 letters + 5 numbers
WGS         4 letters + 2 numerals for the WGS assembly version + 6-8 numerals
RefSeq      Two letters, underscore and six or more digits, e.g. NP_123456
Source: AccReference.doc (wustl.edu)

Stability of the Accession Code
o The main difference from the GenInfo identifier (gi) is that it is stable: any given accession code always refers to that entry, or its ancestors
o It is often called the primary key for the entry
o Once issued, it must always point to its entry, even after large changes have been made to the entry
o Accession numbers do not change, even if information in the record is changed at the author's request
o Where two entries are merged into one, the new entry will have both accession codes: one will be the primary and the other will be the secondary accession code

Features
o The features table (FT) represents the specific regions conveying biological information in the record
o It describes in detail, using single words or abbreviations, where the information in the record is located and any additional features that may be important
o Examples: source, CDS, and gene

Some important features
o Source: summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID
o Taxon: unique identification number for the taxon of the source organism
o CDS: region of nucleotides that corresponds with the sequence of amino acids in a protein; includes start and stop codons
o Protein ID: begins with three letters followed by five digits, a dot, and a version number

Each record ends with the double slash (//)

In Silico Biology
o Online simulation of laboratory experiments, for example:
o In silico PCR: sequences can be uploaded and amplified online with specific PCR primers
o In silico restriction digests
o Virtual labs
o Many in silico experiments are conducted via scientific workflows or pipelines

Workflows and Pipelines
o Scientific workflows are designed to execute a set of computational tasks in bioinformatics, such as in silico PCR, in silico restriction digests and virtual labs
o Scientific workflows can automate repetitive tasks
[Figure: example of a workflow in Bioserver (Step 1, Step 2, Step 3). Taken from Loughbough et al 2011]

References
Baxevanis, A. D. and Ouellette, B. F. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. 2nd, 3rd or 4th ed. Wiley.
Pevsner, J. Bioinformatics and Functional Genomics. 2nd ed. Wiley, 2009.
Lesk, A. M. Introduction to Bioinformatics. 5th ed. Oxford, 2019.
Using Genomic Databases for Sequence-Based Biological Discovery (nih.gov)
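
As a closing illustration, the accession number formats tabulated in this lecture can be turned into a small classifier. This is a sketch under stated assumptions, not an official validator: the regular expressions encode only the letter and digit counts given in the lecture, the order in which the patterns are tried is a design choice, and the `classify_accession` helper is hypothetical.

```python
import re

# Patterns built only from the letter/digit counts listed in the lecture.
# More specific formats are tried first so they are not shadowed.
ACCESSION_PATTERNS = [
    ("RefSeq", re.compile(r"[A-Z]{2}_\d{6,}")),           # e.g. NP_123456
    ("WGS", re.compile(r"[A-Z]{4}\d{2}\d{6,8}")),         # 4 letters + version + 6-8 digits
    ("protein", re.compile(r"[A-Z]{3}\d{5}")),            # 3 letters + 5 digits
    ("nucleotide", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),  # 1+5 or 2+6
]


def classify_accession(accession):
    """Return the format label for an accession, or 'unknown' if none fits.

    A trailing '.version' suffix is ignored before matching.
    """
    core = accession.strip().upper().split(".")[0]
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(core):
            return label
    return "unknown"


if __name__ == "__main__":
    for example in ["U54469", "AF002534", "NP_123456"]:
        print(example, classify_accession(example))
```

Matching the whole string with `fullmatch` keeps a longer identifier from being mistaken for a shorter format that happens to be its prefix.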