Lecture 2: Introduction to Biological Databases
New Mansoura University
Dr. Rami Elshazli
Summary
This lecture introduces biological databases and their categories, including primary, secondary, and composite databases. It also details the NCBI database system and its various resources, emphasizing its role in storing and organizing genomic data.
Full Transcript
Bioinformatics BIO417, Lecture 2: Introduction to Biological Databases
Prepared by Dr. Rami Elshazli, Associate Professor of Biochemistry and Molecular Genetics

Bioinformatics: introduction to databases

A database is a computerized resource in which data is structured so that it is easy to add, access, and update. The main purpose of a database is to enable easy handling and retrieval of information through multiple search features. A database is organized so that each data entry represents a record. A record contains multiple items of data and therefore consists of several fields. Biological databases also provide tools for higher-level information processing: their objective is not only to store information but also to support discovery.

Bioinformatics: Biological databases

According to the information added to them, biological databases are classified into three main categories:
- Primary databases.
- Secondary databases.
- Composite or specialized databases.

Primary databases serve as computational archives containing only raw data, e.g., nucleic acid and protein sequences. Examples:
- GenBank at NCBI (nucleotide sequences).
- Gene Expression Omnibus (GEO) database (functional genomics data).
- Protein Data Bank (PDB; coordinates of three-dimensional structures).
NCBI stands for the National Center for Biotechnology Information.

Secondary databases use the original data in primary databases to derive new data sets by using specialized software programs. Examples:
- UniProt database (sequence and functional information on proteins).
- Ensembl database (variation, function, regulation, and more, layered onto whole genome sequences).
- VarSome (a variant knowledge community, data aggregator, and variant data discovery tool).
A composite database may combine more than one primary database so that, instead of searching each one separately, the user can search related databases together for quick results. The NCBI resources are the best example of this type.

Bioinformatics: The NCBI database

Genomic databases store and organize genomic data in a way that can be searched and retrieved over the Internet. A bioinformatics database is formed of well-organized records that can be updated, searched, and retrieved. The National Center for Biotechnology Information (NCBI) databases are central databases that integrate with several major databases, including:
- European Molecular Biology Laboratory (EMBL).
- DNA Data Bank of Japan (DDBJ).
- Protein Data Bank.

The NCBI provides a variety of online resources for biological information and data. It delivers its database services through NCBI Entrez, a molecular biology database system that provides integrated access to nucleotide and protein sequence data, genomic mapping information, protein structure data, and life science literature. The NCBI databases offer scientists access to a wide variety of biologically relevant data, including:
- Genomic sequences.
- Related information for multiple organisms.
The NCBI’s databases are accessed through the Entrez retrieval system, which is available at www.ncbi.nlm.nih.gov/.

The NCBI page contains three important sections:
- The resource list (A–Z).
- Search bar.
- List of popular resources.

The resource list categorizes the databases into several groups.
Clicking a group name in the list opens a page with several tabs. For example, the DNA & RNA group has six tabs: “All”, “Databases”, “Downloads”, “Submission”, “Tools”, and “How To”.

The NCBI: (1) The resource list

- The “All” tab includes the names and descriptions of all resources in that group.
- The “Databases” tab includes only the database names and descriptions.
- The “Downloads” tab contains links for downloading data from the databases in that group.
- The “Submission” tab contains links to pages that explain how to submit data to one of the databases in that group.
- The “Tools” tab includes the bioinformatics tools that can be used with the databases in the group.
- The “How To” tab includes links that help users carry out the major tasks related to the databases in the group.

The NCBI: (2) Entrez search bar

The search bar, at the top of the page, consists of a dropdown list, a search box, and a search button. The dropdown list contains all Entrez databases that can be accessed through the web page. The Entrez databases are searched by selecting the target database from the dropdown list, entering search terms into the search box, and then clicking the “Search” button.
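The three parts of the search bar (database, search term, search action) have a rough programmatic counterpart in the ESearch endpoint of NCBI's E-utilities, which the lecture itself does not cover. The sketch below only builds the request URL and does not contact NCBI; the function name `build_esearch_url`, the database name `nucleotide`, and the default of 20 records are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base URL of the ESearch E-utility (NCBI's programmatic search endpoint).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, max_records: int = 20) -> str:
    """Return an ESearch URL for `term` in the given Entrez database.

    Hypothetical helper for illustration; it only formats the URL.
    """
    params = {
        "db": database,         # plays the role of the dropdown list
        "term": term,           # plays the role of the search box
        "retmax": max_records,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("nucleotide", "TERT")
print(url)
```

Opening the printed URL in a browser corresponds to clicking the “Search” button: the response lists the identifiers of matching records rather than a formatted results page.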
The NCBI: (3) Popular resources

The list of popular resources, on the right side of the page, includes the commonly used NCBI databases. Before using the NCBI Entrez web page for searching, it is recommended to create your own My NCBI account using the link in the top right corner of the main Entrez web page. Creating an account and logging in when using the NCBI Entrez databases allows you to save searches and manage filters.

The NCBI: Searching Entrez Databases

NCBI databases contain thousands of entries for many organisms. Any Entrez database is searched by selecting the database name from the database dropdown list, entering a search term into the search box, and then clicking the “Search” button. If you select “All Databases” from the database dropdown menu, the term you enter in the search box is searched across all Entrez databases. This feature is called global query searching, and you can use it to check the availability of records matching a specific search term in all NCBI databases. Try selecting “All Databases” from the dropdown menu, entering any search term (e.g., TERT) into the search box, and clicking the “Search” button. The results will be the counts of records matching the search query in each database.

The results of a database search may include thousands of irrelevant records; therefore, planning ahead helps you formulate a search query that returns more specific results. The keywords entered in the search box are called a search query. A search query can be a simple query made up of a single term or an advanced query including multiple search terms linked with Boolean operators (AND, OR, and NOT).
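The idea behind global query searching, one term checked against every database with a count reported for each, can be sketched on toy data. Everything below is an assumption made for illustration: the record titles are invented, and plain substring matching stands in for Entrez's real indexing, which works differently.

```python
# Hypothetical, made-up record titles standing in for three Entrez databases.
TOY_DATABASES = {
    "Gene": [
        "TERT telomerase reverse transcriptase",
        "TP53 tumor protein p53",
    ],
    "Nucleotide": [
        "Homo sapiens TERT mRNA",
        "Mus musculus Tert mRNA",
        "Homo sapiens BRCA1 mRNA",
    ],
    "Protein": [
        "telomerase reverse transcriptase [Homo sapiens]",
    ],
}


def global_query(term, databases):
    """Count, per database, the records whose text contains `term`.

    Matching is a case-insensitive substring test (a simplification).
    """
    needle = term.lower()
    return {
        name: sum(needle in record.lower() for record in records)
        for name, records in databases.items()
    }


counts = global_query("TERT", TOY_DATABASES)
for name, count in counts.items():
    print(f"{name}: {count}")
```

As on the real results page, a database can report zero matches: here the toy Protein record describes the same protein but never contains the symbol TERT.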
The NCBI: Advanced Search Queries

An advanced search query is a query that consists of several search terms. An advanced query makes the search results more specific. It is built from a combination of free text and tagged indexed fields linked by Boolean operators (AND, OR, and NOT). The Boolean operators must be in uppercase.

Boolean Operators

Boolean Operator   Example            Condition
AND                term1 AND term2    Both term1 and term2 must be found
OR                 term1 OR term2     At least one of the two terms must be found
NOT                term1 NOT term2    term1 must be found and term2 must not be found

When AND is used between two terms, the search returns only the records that include both search terms (term1 AND term2). When OR is used, records that contain either of the two terms are reported (term1 OR term2). When NOT is used, the search finds the records that contain the first term and then removes any record that contains the second term from the results (term1 NOT term2).

For example, assume that we intend to find the Nucleotide records for the human TERT gene. The pieces of information that you have are the database name (Nucleotide), organism (human), and gene symbol (TERT). The search is performed by selecting Nucleotide from the database dropdown list and entering the following advanced query into the search box: TERT[GENE] AND human[ORGN]. Using indexed fields and Boolean operators makes the search more specific. Assume that you intend to search for TERT nucleotide records for all mammals except human. You can build your search query as follows: TERT[GENE] AND mammals[ORGN] NOT human[ORGN].
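One way to read the three operators is as set operations on the records each term matches: AND is an intersection, OR is a union, and NOT is a difference. The sketch below applies that reading to the two TERT queries above. The record IDs and the contents of `INDEX` are hypothetical, and the function names are illustrative only.

```python
# Hypothetical toy index: each tagged term maps to the set of record IDs
# it matches. Record 4 stands for a TERT record from a non-mammal.
INDEX = {
    "TERT[GENE]": {1, 2, 3, 4},
    "mammals[ORGN]": {1, 2, 3, 5, 6},
    "human[ORGN]": {1, 5},
}


def and_(a, b):
    """term1 AND term2: records found by both terms."""
    return a & b


def or_(a, b):
    """term1 OR term2: records found by at least one term."""
    return a | b


def not_(a, b):
    """term1 NOT term2: records found by term1 but not by term2."""
    return a - b


# TERT[GENE] AND human[ORGN]
human_tert = and_(INDEX["TERT[GENE]"], INDEX["human[ORGN]"])

# TERT[GENE] AND mammals[ORGN] NOT human[ORGN]
non_human_mammal_tert = not_(
    and_(INDEX["TERT[GENE]"], INDEX["mammals[ORGN]"]),
    INDEX["human[ORGN]"],
)

print(sorted(human_tert))             # [1]
print(sorted(non_human_mammal_tert))  # [2, 3]
```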
The database searching system usually processes a query from left to right. The order of processing is changed by using parentheses “()”. Search terms in parentheses are processed first, from left to right.

Assume that you wish to find the mammalian nucleotide TERT records, but you want to exclude human and mouse records from the results. If you enter the following query, you may obtain millions of irrelevant records, which is not what you intend to find:

TERT[GENE] AND mammals[ORGN] NOT human[ORGN] OR mouse[ORGN]

Because the query is processed from left to right, the final OR adds every mouse record to the results, whether or not it relates to TERT. Grouping the two excluded organisms in parentheses gives the intended query:

TERT[GENE] AND mammals[ORGN] NOT (human[ORGN] OR mouse[ORGN])

Commonly Used Database Indexed Fields

Indexed Field                  Description
[Author] or [AUTH]             For searching by author
[Filter] or [FILT]             For filtering the results
[Gene Name] or [GENE]          For searching by gene symbol
[Organism] or [ORGN]           To limit the search to the records of a certain organism
[Properties] or [PROP]         To limit the search to a certain property
[Publication Date] or [PDAT]   To limit the search to a date or a range of dates
[Title] or [TITL]              For searching a term in the title

The NCBI: NCBI Search Results

Searching a database displays the records matching the search query on the search results page. The layout of the results page is similar for almost all NCBI databases.

Filter List: A filter list may be found on the left side of the results pages of most databases. The number of records matched by each filter is shown in parentheses. Clicking a filter activates it, and the filtered records are displayed on the results page. When a filter is activated, a warning is displayed at the top of the results. An activated filter can be cleared using the “Clear All” link.
The filters can be controlled by clicking the “Show additional filters” link at the bottom of the filter list.

Menu Items: The menu items include the “Format”, “Items per page”, “Sort by”, and “Send To” dropdown menus.
- The “Format” dropdown menu lists format options for displaying the results.
- The “Items per page” dropdown menu is used to limit the number of records displayed on the results page.
- The “Sort by” dropdown menu lists the options for sorting the results on the page.
- The “Send To” dropdown menu is used to send the results to a destination, which can be a file, collections, or the clipboard.

Managed Filters: The filters at the top right of the search results page can be managed only when the user is logged in to My NCBI. Click the “Manage Filters” link to open the Filters window. The Filters window allows you to choose the preferred filters or to create custom filters.

The NCBI: NCBI databases

NCBI Gene: Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome, phenotype, and locus-specific resources.

NCBI Nucleotide: The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB.

NCBI Protein: The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank and RefSeq, as well as records from SwissProt and PDB.

NCBI SNP: dbSNP contains human single nucleotide variations and small-scale insertions and deletions, along with publication and population frequency information, for both common variations and clinical mutations.
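Returning to the left-to-right processing rule from the Advanced Search Queries section, the effect of parentheses can be checked on a small toy index by treating AND, NOT, and OR as set intersection, difference, and union. The record IDs below are hypothetical and chosen only so that the two readings of the mammalian TERT query give different answers.

```python
# Hypothetical record IDs matched by each tagged term.
TERT = {1, 2, 3, 4}            # TERT[GENE]
MAMMALS = {1, 2, 3, 5, 6, 7}   # mammals[ORGN]
HUMAN = {1, 5}                 # human[ORGN]
MOUSE = {2, 6, 7}              # mouse[ORGN]

# TERT[GENE] AND mammals[ORGN] NOT human[ORGN] OR mouse[ORGN]
# Processed strictly left to right:
# ((TERT AND mammals) NOT human) OR mouse
left_to_right = ((TERT & MAMMALS) - HUMAN) | MOUSE

# TERT[GENE] AND mammals[ORGN] NOT (human[ORGN] OR mouse[ORGN])
# The parenthesized group is resolved first, then removed from the rest.
grouped = (TERT & MAMMALS) - (HUMAN | MOUSE)

print(sorted(left_to_right))  # [2, 3, 6, 7]
print(sorted(grouped))        # [3]
```

In the ungrouped version, records 6 and 7 are mouse records unrelated to TERT, and record 2 is the mouse TERT record that was meant to be excluded. On a real database the same effect would bring in every mouse record.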