FBDA - S3 Slideshow PDF

# Lecture 03

## Primary Biological Databases

Dr. Abdul Qader Abbady
FBDA

## Outlines

- Biological Knowledge is Stored in Global Databases
- Primary Databases
  - Nucleotide Sequence Databases
    - GenBank Database
    - EMBL and DDBJ
  - Protein Sequence Databases
    - UniProt
    - SwissProt
    - NCBI Protein Database
- References

## Biological Knowledge is Stored in Global Databases

- The most important basis for applied bioinformatics is the collection of sequence data and its associated biological information.
- Genome sequencing projects, for example, generate such data daily in very large quantities worldwide.
- To use these data appropriately, a structured filing system is necessary, yet the data should also be accessible to those interested.

## Annually, the journal Nucleic Acids Research

- dedicates an entire issue (the first issue in January) to all available biological databases
- which are recorded in tabular form with their respective URLs

**Figure:** A screenshot of Nucleic Acids Research, showing a table listing databases and their URL.

## Furthermore, for a number of databases, original articles describe their functions

- This database issue, which is also freely accessible on the Web, is a good starting point for working with biological databases.

**Figure:** A diagram depicting biological databases as a machine, with raw data as input and results as output. A table with examples of databases and their functions.

## Biological Knowledge is Stored in Global Databases

Depending on the kind of data included, different categories of biological databases can be distinguished.

### Primary Databases

- contain primary sequence information (nucleotide or protein)
- and accompanying annotation information regarding function, bibliographies, and cross-references to other databases.

### Secondary Biological Databases

- summarize the results from analyses of primary protein sequence databases.
- The aim of these analyses is to derive common features for sequence classes, which in turn can be used for the classification of unknown sequences (annotation).

In addition, all other databases that store biological or medical information (for example, literature databases) are frequently classified as secondary databases.

## Biological Database Systems

- Relational database systems (e.g., Oracle, MS Access, Informix, DB2) can manage large data sets, which would seem to make them ideal for the structured filing of data
- yet these systems have not gained acceptance so far in the field of biological databases.
- Rather, sequence data and their accompanying information are usually filed in the form of flat file databases
  - structured ASCII text files
  - ASCII: American Standard Code for Information Interchange

**Figure:** A screenshot of Oracle database, showing a list of tables and their details.

## ASCII Text Files

- Flat files are used for historical reasons and because ASCII text files allow data to be manipulated without an expensive and complicated database system
- ASCII text files also make data exchange between scientists relatively simple
- One drawback, however, is that searching for certain keywords within a data set is both laborious and time-consuming
- To minimize this disadvantage, various systems have been developed that can index flat file-based databases
  - they come with an index register similar to that of a book
  - thus accelerating keyword-based searches

**Figure:** A text file with highlighted keywords.
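
The indexing idea described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular indexing system is implemented: the two toy records, the `//` record terminator (borrowed from the GenBank flat file layout), and the function names `build_index` and `fetch_record` are all assumptions made for the example.

```python
import io
import re
from collections import defaultdict

# Two toy records in a GenBank-like flat file layout (hypothetical data).
# A line starting with "//" ends a record.
FLAT_FILE = b"""LOCUS       TOY0001
DEFINITION  yeast kinase gene
//
LOCUS       TOY0002
DEFINITION  human kinase mRNA
//
"""

def build_index(handle):
    """Map each lowercase word to the byte offsets of the records containing it."""
    index = defaultdict(list)
    record_start = handle.tell()
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"//"):
            # The next record begins right after the terminator line.
            record_start = handle.tell()
            continue
        for word in re.findall(rb"[A-Za-z0-9]+", line):
            offsets = index[word.lower().decode("ascii")]
            if not offsets or offsets[-1] != record_start:
                offsets.append(record_start)
    return index

def fetch_record(handle, offset):
    """Jump straight to a record by offset and return its lines."""
    handle.seek(offset)
    lines = []
    for line in iter(handle.readline, b""):
        if line.startswith(b"//"):
            break
        lines.append(line.decode("ascii").rstrip("\n"))
    return lines

handle = io.BytesIO(FLAT_FILE)
index = build_index(handle)

# A keyword lookup is now a dictionary access instead of a scan of the whole file.
for offset in index["kinase"]:
    print(fetch_record(handle, offset)[0])
```

The index plays the role of the register at the back of a book: the file is scanned once to build it, and every later keyword search only reads the records the index points to.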
## Primary Databases

### Nucleotide Sequence Databases

- DDBJ (DNA Data Bank of Japan)
  - DDBJ Sequence Read Archive
  - DDBJ Trace Archive
  - BioProject
- INSDC (International Nucleotide Sequence Database Collaboration)
  - IAC (Institute for Agricultural Research)
  - ICM (Institute of Cytology and Molecular Genetics)
- NCBI (National Center for Biotechnology Information)
  - GenBank
  - Sequence Read Archive
  - Trace Archive
  - BioProject
- ENA/EBI (European Nucleotide Archive/European Bioinformatics Institute)
  - EMBL-Bank
  - Sequence Read Archive
  - Trace Archive
  - BioProject

**Figure:** A DNA double helix with the word "BioProject" in the middle and "DDBJ", "INSDC" and "NCBI" at the ends.

## GenBank Database

- GenBank, available at the U.S. National Center for Biotechnology Information (NCBI), is perhaps the best-known nucleotide sequence database.
- GenBank is a public sequence database that, in release 217.0 (December 2016), contained roughly 199 million sequence entries.
- Sequences can be entered into GenBank by anyone
  - via a Web page (BankIt)
  - or by e-mail (Sequin) when working with larger sequence sets
- For the publication of new sequences in a scientific journal, one should submit the sequence data to either GenBank or one of its associated databases
  - for example, the European Nucleotide Archive (ENA)
  - or the DNA Data Bank of Japan (DDBJ)

**Figure:** A screenshot of GenBank webpage, showing a search area and a link to the NCBI website.

## Accession Number (AN)

- Each single database entry is provided with a unique identification tag: the accession number (AN)
- The AN is a permanent identifier that remains unchanged even if changes are subsequently made to the database record.
- In some cases, a new AN is assigned, for example when an author adds a new database record that combines existing sequences.
- Even then the old AN is retained as a secondary number.
- The AN is the only way to absolutely verify the identity of a sequence or database entry.

**Figure:** A GenBank record with highlighted Accession Number.

## GenBank Entry

- The required structuring of the database record is performed via defined keywords.
- Each entry starts with the keyword LOCUS, followed by a locus name.
- Like the AN, the locus name is unique; unlike the AN, however, it may change after revisions of the database.
- The locus name consists of eight characters, including the first letter of the genus and species names in addition to a six-digit AN.
- Newer entries have an eight-digit AN.

**Figure:** A GenBank record with the LOCUS keyword highlighted.

## Nucleotide Sequence Databases

- On the same line, following the locus name, the length of the sequence is given.
- A sequence must have at least 50 base pairs to be entered into GenBank.
- This requirement was introduced only relatively recently, so some older entries do not fulfill this criterion.

**Figure:** A GenBank record with the sequence length highlighted.

- Column 3 denotes the type of molecule of the sequence entry.
- Every GenBank entry must contain coherent sequence information of a single molecule type. An entry cannot contain sequence information of both genomic DNA and RNA.

**Figure:** A GenBank record with the molecule type highlighted.

- The last column in the LOCUS line gives the date of the last entry modification.
- The final section of the database record starts with the keyword ORIGIN.
- In newer entries, this field remains empty.
- The actual sequence information begins on the following line and may span many lines.
- A detailed description of all keywords is found on the GenBank sample page.

**Figure:** A GenBank record with the ORIGIN keyword and sequence information.
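
The LOCUS line layout described above (locus name, sequence length, molecule type, modification date) can be read with a short Python sketch. Two assumptions are made here: the sample line is only modeled on the GenBank sample record, so its exact values should be treated as illustrative, and the line is split on whitespace rather than on the fixed column positions of the official format. The function name `parse_locus_line` is hypothetical.

```python
# Illustrative LOCUS line (assumed values, modeled on the GenBank sample record).
SAMPLE_LOCUS = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"

def parse_locus_line(line):
    """Split a LOCUS line into the fields discussed in the lecture.

    Simplification: fields are found by whitespace splitting, which works
    for lines shaped like the sample but ignores the fixed column layout.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "locus_name": tokens[1],       # name directly after the keyword
        "length": int(tokens[2]),      # sequence length
        "unit": tokens[3],             # "bp" for nucleotide entries
        "molecule_type": tokens[4],    # the molecule type column
        "modified": tokens[-1],        # last column: date of last modification
    }

record = parse_locus_line(SAMPLE_LOCUS)
print(record["locus_name"], record["length"], record["molecule_type"], record["modified"])
```

In this sample the locus name combines the first letters of the genus and species names (S, C) with the accession number U49845, matching the naming scheme described above.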
## Entrez

- The GenBank database is queried via the NCBI Entrez system
  - which is used to query all NCBI-associated databases
  - (NCBI Resource Coordinators 2016)
- Search terms can be combined by means of logical operators (AND, OR, NOT)
  - and single search terms can be restricted to certain database fields.
- Entrez is an important and effective tool for the execution of both simple and complicated searches.

**Figure:** A screenshot of the NCBI Entrez webpage, highlighting the search area and the list of available databases.

## Entrez

- The restriction of search terms to single database fields is generally performed by a field ID placed after the term: search term [field-id].
- For example, the search for a sequence from Saccharomyces cerevisiae with a length of between 3260 and 3270 base pairs would require the following search syntax:
  - (Saccharomyces cerevisiae [ORGN]) AND 3260:3270[SLEN]
- Representative field IDs for performing searches in GenBank are listed in the table.
- Complete instructions for the use of Entrez are found on the Entrez help page.

**Figure:** A table listing field ID examples for NCBI Entrez.

## Entrez Advanced Search

- To simplify the construction of complex queries, the advanced search was introduced.
- To use this search, follow the link beneath the Entrez search field.
- Field IDs and logical operators can be selected from list boxes; the respective query is constructed automatically and entered into the search text field.
- For better readability, the field IDs are entered with their full names in this case.
- Full names also work in the generic search, so it is no longer necessary to remember the abbreviated field IDs.
- Use the builder below to create your search

**Figure:** A screenshot of the NCBI Entrez Advanced Search webpage, showing the field ID selection area, the logical operators selection area and the search query area.
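
The Entrez query syntax from the slides above (a field ID placed after each term, terms joined by logical operators) can also be assembled programmatically. The following Python sketch only builds the query string and a request URL; it does not contact NCBI. The helper names `restrict` and `combine` are hypothetical, and the use of the E-utilities `esearch` endpoint with the `nuccore` database is an assumption that goes beyond the lecture, which only covers the web interface.

```python
from urllib.parse import urlencode

def restrict(term, field_id):
    """Attach a field ID to a search term: term[field-id]."""
    return f"{term}[{field_id}]"

def combine(operator, *terms):
    """Join terms with AND, OR, or NOT, bracketing each term."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(f"({term})" for term in terms)

# The lecture example: S. cerevisiae sequences of 3260 to 3270 base pairs.
query = combine(
    "AND",
    restrict("Saccharomyces cerevisiae", "ORGN"),
    restrict("3260:3270", "SLEN"),
)
print(query)

# Assumption: the NCBI E-utilities esearch endpoint and the "nuccore" database
# are used here only to show how the query would be URL-encoded.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = f"{ESEARCH}?{urlencode({'db': 'nuccore', 'term': query})}"
print(url)
```

Bracketing every term is slightly more verbose than the hand-written query, but it keeps operator precedence explicit when queries are nested.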
## EMBL and DDBJ

- The European counterpart to GenBank is the European Nucleotide Archive (ENA), also known as EMBL-Bank, which is maintained by the European Molecular Biology Laboratory (EMBL) at the European Bioinformatics Institute (EBI).
- Another primary nucleotide sequence database, the DNA Data Bank of Japan (DDBJ), is operated by the National Institute of Genetics (NIG) in Japan and is the primary nucleotide sequence database for Asia.
- The three database operators, NCBI, EBI, and NIG, make up the International Nucleotide Sequence Database Collaboration (INSDC) and synchronize their databases every 24 h.
- It is therefore not necessary to query all three databases individually, nor to enter a new nucleotide sequence into all three.

**Figure:** A diagram depicting the relationships between ENA, DDBJ and NCBI and INSDC, along with their respective logos.

## Differences Between Nucleotide Sequence Databases

- While the database format of the DDBJ is identical to that of the NCBI, that of the EMBL database of ENA differs somewhat.
- The figure shows an entry in the EMBL database.
- The most obvious difference is the use of two-letter codes instead of full keywords.
- Furthermore, there are small changes in the organization of the individual data fields.
- For example, the date of the last modification is not listed in the field ID (corresponding to the LOCUS field in GenBank) but appears in the field DT (date).
- A complete description of the EMBL format can be found on the ENA manual page.
- ID U49845; SV 1; linear; genomic DNA; STD; FUN; 5028 BP.
- AC U49845;
- DT 07-MAY-1996 (Rel. 47, Created)
- DT 25-MAR-2010 (Rel. 104, Last updated, Version 5)
- DE Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
- OS Saccharomyces cerevisiae (baker's yeast)
- OC Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
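
Because every line of an EMBL entry begins with a two-letter code, a record can be grouped by code with very little Python. The sketch below uses a few lines of the record shown above; the function name `group_by_line_code` is hypothetical, and the parsing is a simplification that ignores details of the full EMBL format such as spacer and feature table lines.

```python
from collections import defaultdict

# A few lines of the EMBL entry discussed above.
EMBL_LINES = """\
ID   U49845; SV 1; linear; genomic DNA; STD; FUN; 5028 BP.
AC   U49845;
DT   07-MAY-1996 (Rel. 47, Created)
DT   25-MAR-2010 (Rel. 104, Last updated, Version 5)
"""

def group_by_line_code(text):
    """Collect the content of each line under its two-letter code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return fields

fields = group_by_line_code(EMBL_LINES)

# The modification date lives in a DT line, not in the ID line.
last_updated = next(v.split()[0] for v in fields["DT"] if "Last updated" in v)

# The sequence length is the last item of the ID line.
seq_length = int(fields["ID"][0].split(";")[-1].split()[0])

print(last_updated, seq_length)
```

The same information sits on the single LOCUS line of a GenBank entry, which is the organizational difference the slide points out.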
**Figure:** A screenshot of an EMBL record with highlighted keywords.

## ENA Online Retrieval

- The EMBL database is the predecessor of the ENA.
- The ENA offers several search forms.
- First is a simple search:
  - text searches
    - it is possible to search for accession numbers and for simple free text
  - sequence retrieval
- The simple search is not limited to certain database fields, and unlike the Entrez system it cannot be restricted to certain text fields.
- Instead, all database entries that contain the search term anywhere are retrieved.

**Figure:** A screenshot of ENA online retrieval page, highlighting the search area and the options for simple text search and advanced search.

## ENA Online Retrieval

- To restrict a search by such parameters, for instance to find a sequence from S. cerevisiae with a sequence length of 3270 base pairs, the advanced search must be used.
- It can be reached by following the corresponding link beneath the simple search text field.
- The advanced search form starts with several rather coarse-grained categories of the database fields.
- Once one of these categories is selected, additional text fields and option boxes are displayed that make it possible to restrict the search to individual database fields or groups.

**Figure:** A screenshot of ENA online retrieval advanced search page, showing the categories for data type, query and fields.

## Nucleotide Sequence Databases

- To retrieve S. cerevisiae sequences, we must select the category Sequence and enter the search term Saccharomyces cerevisiae into the field Taxon.
- The comparison operator is set to equal. Use of the other two operators does, of course, make sense only if we compare numerical values.
- In the field Base count, 3270 is entered and the comparison operator is set to less than or equal to (<=).
- As they are entered, all inputs are simultaneously translated into a query, which is displayed in the gray text field at the head of the page.
**Figure:** A screenshot of ENA advanced search page, showing the field Taxon, Base count and the operators AND, OR, NOT.

- The retrieval is started by hitting the Search button.

**Figure:** A screenshot of ENA advanced search page, with the highlighted search button and a search query example.

## Nucleotide Sequence Databases

- Unfortunately, this search form does not allow one to search for a range, as we did for the sequence length in the NCBI Entrez example.
- However, it is possible to build the query in the query builder without a range first and then edit the resulting query manually.
- To do so, we click on the hyperlink Edit Query on the right of the text search field.
- Now we can modify the preconstructed query and add an additional restriction for the field ID base_count with a logical AND.

**Figure:** A screenshot of ENA advanced search page, showing the Edit Query hyperlink. The query builder area is also highlighted.

- The resulting query now is tax_eq(4932) AND (base_count >= 3260 AND base_count <= 3270)

**Figure:** A screenshot of ENA advanced search page with the highlighted query builder and the query written within.

## Nucleotide Sequence Databases

- Sometimes it is necessary to use brackets to influence the precedence of the logical operators.
- If we had been interested in an S. cerevisiae sequence that is either shorter than 3260 base pairs or longer than 3270 base pairs
  - we would have had to use brackets to override the logical operator precedence.
- The query would have resulted in tax_eq(4932) AND (base_count <= 3260 OR base_count >= 3270)

**Figure:** A screenshot of ENA advanced search page showing the bracket functionality for the query.

## Nucleotide Sequence Databases

- The ENA also allows for sequence searches using sequence comparisons.
- Basically, this is a BLAST search, which can be carried out either with standard BLAST parameters or with parameters tweaked on the advanced search page.
**Figure:** A screenshot of ENA online retrieval page showing the button for sequence similarity search.

- BLAST searches will be discussed in detail in the following lectures.

**Figure:** A screenshot of ENA sequence similarity search page, showing the different options for parameters.

## Protein Sequence Databases

- The information available for proteins continues to grow rapidly.
- Besides sequence information, expression profiles can be examined, secondary structures predicted, and biological/biochemical function(s) analyzed.
- All these data are stored in databases, some of which are quite specialized.
- Therefore, it can be time-consuming to collect all the relevant information regarding any given protein.
- For this reason, EBI, the Swiss Institute of Bioinformatics (SIB), and Georgetown University have built a consortium with the aim of developing a central catalog for protein information.

**Figure:** A diagram depicting protein sequence databases, with the UniProt sequence database as a central node and linked to several other databases.

- The result is the Universal Protein Resource (UniProt) (UniProt Consortium 2016).

**Figure:** The UniProt logo.

## UniProt

- UniProt unites the information in the three protein databases SwissProt, TrEMBL, and Protein Information Resource (PIR).

**Figure:** A diagram showing the relationship between UniProt and its three components: UniProtKB, UniRef and UniParc. The diagram also displays the relationship between the two realms of UniProtKB, Swiss-Prot and TrEMBL.

- UniProt consists of three parts:
  - the UniProt Knowledgebase (UniProtKB)
  - the UniProt Reference Clusters Database (UniRef)
  - the UniProt Archive (UniParc), a collection of protein sequences and their history.

**Figure:** A description of UniProtKB, UniRef and UniParc with their respective functions.

## Protein Sequence Databases

- Protein sequences and their annotations are stored in the UniProt Knowledgebase (UniProtKB), which is divided into two realms.
- The TrEMBL realm contains automatically annotated sequences and currently (2021) holds approx. 200 million entries.
- The SwissProt realm, where manually curated and annotated sequences are stored, contains approx. 564,000 entries.

**Figure:** A diagram depicting the relationship between UniProtKB and its two realms: Swiss-Prot and TrEMBL. The diagram also displays the relationship between the two realms and UniParc.

## SwissProt

- The SwissProt realm is regarded as one of the most important protein databases.
- Because of the manual curation, it is quite often referred to as the gold standard of protein annotation.

**Figure:** A screenshot of the UniProtKB search page, highlighting the options for Reviewed and Unreviewed.

## Protein Sequence Databases

- The SwissProt database existed long before the UniProt database was founded
  - and was located at the SIB.
- Because the team of specialists at the SIB was overwhelmed by the flood of new sequences being entered into the databases, a supplement to the SwissProt database, the TrEMBL database, was introduced.
- TrEMBL stands for translated EMBL; it contained all protein translations of the EMBL database that had not yet been manually curated.

**Figure:** A diagram showing the relationship between SwissProt and TrEMBL, along with the UniProt logo and "Celebrating 30 years" text.

- All entries in TrEMBL (today UniProtKB/TrEMBL) are annotated automatically; the quality of the annotations is not comparable to that of SwissProt annotations.

**Figure:** A diagram depicting the relationship between SwissProt and TrEMBL and how they are connected to UniParc and external sources.

## Protein Sequence Databases

- The figure shows an entry in the UniProtKB/SwissProt database.
- At first glance, the entry is similar to an EMBL entry.
- Indeed, the two database formats are related.
- Both database schemes use two-letter identifiers.
- Most identifiers are identical for the two databases.
- Some identifiers, however, are modified for the UniProtKB and some are added.
- The raw database entry as shown is rarely encountered.
- In most cases, UniProtKB presents a graphical version.

**Figure:** A screenshot of a UniProtKB record with many details, including a list of related publications.

## Protein Sequence Databases

- The UniProtKB can be queried using a simple full-text search or using complex queries with logical operators; for the latter, an advanced search form is used.
- The advanced search is opened by clicking on the hyperlink Advanced on the right of the text field.
- In the advanced search form, the field IDs and the corresponding logical operators can be selected from drop-down menus.
- When the search is started, the query is displayed in the text field and can be tweaked manually if necessary.

**Figure:** A screenshot showing UniProtKB's advanced search interface, highlighting the search field, the advanced search button and the query area.

## UniRef Protein Sequence Databases

- UniRef is a nonredundant sequence database
  - that allows for fast similarity searches.
- The database exists in three versions:
  - UniRef100
  - UniRef90
  - UniRef50
- Each database allows for the searching of sequences
  - that are 100%, ≥ 90%, or ≥ 50% identical.
- The size of the database changes accordingly, making similarity searches, for example with BLAST, much faster.

**Figure:** A diagram showing the relationship between UniRef and the UniProtKB. It highlights the three versions of UniRef, UniRef100, UniRef90 and UniRef50.

## NCBI Protein Database

- Another well-known protein sequence database is maintained at the NCBI.
- This database, however, is not a single database, but a compilation of entries found in other protein sequence databases.

**Figure:** A screenshot of the Entrez Protein Database webpage.
- For example:
  - SwissProt
  - the Protein Information Resource (PIR) database
  - the Protein Data Bank (PDB) database
  - protein translations of the GenBank database
  - and several other sequence databases
- Its format corresponds to that of GenBank, and queries are carried out analogously to those in GenBank via the Entrez system of NCBI.

**Figure:** A description of several databases that are included in the NCBI database.

## References

- Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (2014) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42(Database issue):D310-D314
- Attwood TK, Bradley P, Flower DR, Gaulton A et al (2003) PRINTS and its automatic supplement, pre-PRINTS. Nucleic Acids Res 31:400-402
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235-242
- Finn RD, Coggill P, Eberhardt RY, Eddy SR et al (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279-D285
- Greene LH, Lewis TE, Addou S, Cuff A et al (2007) The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res 35:D291-D297
