Bioinformatic Lecture 2 - PDF
Oman College of Health Sciences
Nabras Al-Mahrami
Summary
This document provides a lecture on biological databases, including information on NCBI and PDB. It explains how these databases are used to store and manage biological data and knowledge. The lecture also introduces various queries and uses, including Boolean operators.
Full Transcript
Lecture 3: Biological database

“For a biologist, bioinformatics expertise is no longer an ‘optional extra’ but a core skill”
Janet M Thornton, 1998

Mr. Nabras Al-Mahrami

Lecture outline
▪ Biological databases
▪ Introduction to the National Center for Biotechnology Information (NCBI)
▪ Introduction to the Protein Data Bank (PDB)

Introduction to Biological Databases
▪ Biological data, information, and knowledge are stored in databases.
▪ A database is a systematically organized collection of data and information.
▪ A database can be a single file or a collection of files that stand alone or are organized by a specialized computer program called a database management system (DBMS).
▪ Many biological databases are available to the public, generally through the internet or by download.

General Uses of Databases
A database is used to manage data by:
▪ Storing
▪ Maintaining
▪ Entering data
▪ Searching
▪ Sorting
▪ Retrieving
▪ Presenting or displaying

Examples of databases based on the nature of the data
▪ Sequence database: National Center for Biotechnology Information (NCBI)
▪ Structure database: Protein Data Bank (PDB)

National Center for Biotechnology Information
What is NCBI?
The National Center for Biotechnology Information (NCBI) develops and maintains molecular and bibliographic databases as part of the National Library of Medicine (NLM).
NCBI does not generate its own data, but it does:
▪ Receive data submissions from researchers
▪ Develop software for searching and analysing these data
▪ Provide a web access point for the data and software

Common sub-databases at NCBI
NCBI hosts several sub-databases, the most important of which are:
▪ Nucleotide: contains DNA and RNA sequences
▪ Protein: contains protein sequences
▪ EST: contains expressed sequence tags (ESTs), which are short sequences derived from mRNAs
▪ Genome: contains DNA sequences for whole genomes
▪ PubMed: contains data on scientific publications

NCBI homepage

Using the Entrez System
Entrez is an integrated database search and retrieval system that extracts information from 39 molecular and literature databases. For example, searching for cancer will pull up all records in each database that have the text "cancer" anywhere in the record. You can also retrieve information about proteins or genes by searching for the symbol. The Entrez results page for the query "cancer" shows the matching records found in each of the databases.
For more information: https://www.ncbi.nlm.nih.gov/books/NBK3837/

Boolean operators
Combine search terms using Boolean operators.
▪ AND: finds documents that contain both terms.
▪ OR: finds documents that contain either term.
▪ NOT: finds documents that contain the term on the left but not the term on the right.
Make sure the operators are upper case! Some databases won't recognize them as Boolean operators if they are lowercase. You can also use parentheses to group terms.
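The upper-case rule and the use of parentheses for grouping can be sketched in a few lines of Python. This is only an illustration and not part of any NCBI tool: the helper names normalize_operators, any_of, and all_of are hypothetical, and the regular expression naively upper-cases every standalone and/or/not, including any that appear inside quoted phrases.

```python
import re

# Matches the words and / or / not in any letter case, as whole words only.
_OPERATOR = re.compile(r"\b(and|or|not)\b", flags=re.IGNORECASE)


def normalize_operators(query: str) -> str:
    """Upper-case Boolean operators so a search engine recognizes them."""
    return _OPERATOR.sub(lambda match: match.group(1).upper(), query)


def any_of(*terms: str) -> str:
    """Group alternative terms in parentheses, joined by OR."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms: str) -> str:
    """Require every term, joined by AND."""
    return " AND ".join(terms)


if __name__ == "__main__":
    # A query typed with lowercase operators is rewritten with upper-case ones.
    print(normalize_operators("(brca1 or p53) and cancer"))
    # The same kind of query assembled from parts.
    print(all_of(any_of("BRCA1", "P53"), "cancer"))
```

Building the grouped part with any_of before combining it with all_of keeps the parentheses explicit, so the OR is evaluated before the AND.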
For example, suppose you're looking for information about the BRCA1 gene's role in cancer:
BRCA1 AND cancer
Or maybe you want to know about cancers that involve BRCA1 or P53:
(BRCA1 OR P53) AND cancer
Or maybe you want to find information about P53 that isn't related to cancer:
P53 NOT cancer

Field Tags
Individual databases have field tags, which allow you to search for your query only in specific fields. For example, searching PubMed for Smith[au] would search for "Smith" only in the author field, and not in any other field.

Common field tag    Use                                              Example
[ta]                Limit the search to the journal title            gene therapy[ta], scanning[ta]
[dp]                Date of publication                              cancer AND 2020/06/01[dp]
[sb]                Subset, e.g. to search for systematic reviews    thalassemia AND systematic[sb]

You can use other filters to narrow your search results by article type, text availability, publication date, species, article language, sex, age, and others.

Exploring NCBI Databases
In the diagnostic genetic lab, five scientists gathered around a workstation. "Our task is to analyze the APOE gene." Help the scientists by using the following NCBI databases: PubMed, Gene, Nucleotide and Protein.

PubMed database searching
PubMed: a free database of citations and abstracts for biomedical and life sciences literature. PubMed facilitates searching across several literature resources.
Strategic search in PubMed:
1. Define the topic
2. Break the topic into keywords
3. Search each keyword individually
4. Combine your searches using Boolean operators
5. Display and evaluate the outcome
6. Refine your search using filters
For more information: https://pubmed.ncbi.nlm.nih.gov/help/

PubMed database searching: results page
Callouts on the results page: use the search box; the search found over 33 591 citations; refine your search with the filters; each record has a PMID (PubMed ID).

Nucleotide database
Nucleotide database: a collection of DNA and RNA sequences from several sources.
Key point: integration with other databases. The Nucleotide database is integrated with other NCBI resources such as GenBank, RefSeq, and Entrez, allowing users to navigate easily between related records across databases.
GenBank (NCBI) is part of an international collaboration with DDBJ and EMBL; each of the three partners receives submissions.

GenBank
Callouts on the search page: type of database, type of organisms, other filters.

GenBank result format
A result shows the title, the GenBank ID, and links to the FASTA and Graphics views. The record itself has three parts: header, feature table, and sequence.
GenBank key points:
▪ Flat file format: sequences in GenBank are stored in a standardized flat file format that provides both sequence data and the corresponding annotations.
▪ Daily updates: the database is updated daily with newly submitted sequences.

GenBank Format: Header Section
The key components of the header of a GenBank record are:
1. Locus: indicates the name of the sequence entry and some basic statistics. Includes the length of the sequence, molecule type (e.g., DNA, mRNA), topology (usually linear), and division (e.g., PRI for primates).
2. Definition: provides a brief description of the sequence. May include the gene name, organism, and a short description of the function or relevance.
3. Accession number: a unique identifier assigned to each sequence entry. It remains constant between versions, ensuring traceability.
4. Version: indicates whether the sequence has been updated or edited since the original submission. Presented as the accession number followed by a dot and a version number.
5. Organism: indicates the species or organism from which the sequence originates.
6. Reference: citations related to the sequence. Includes authors, title, and journal information.

GenBank Format: Feature and Sequence Section
Feature table: describes specific regions or landmarks within the sequence. Examples of features include coding sequences (CDS), regulatory regions, and genes.
Source: information about the biological source of the sequence, including details like organism, strain, or tissue type. The CDS feature records the location of the coding sequence.

GenBank Format: Sequence Section
Sequence: the actual nucleotide (or amino acid) sequence. It is typically presented in a multi-line format with numbers indicating the position in the sequence. The record ends with a termination line (//).
An interactive example: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord

Unexpected information

Gene database
Callout on the search page: type of database, Gene.

Gene results

Gene database: Summary section
The summary section gives basic information about the gene, such as what it is called, links to other resources, and its function.
▪ Standard gene name: assigned according to the HUGO Gene Nomenclature Committee (HGNC).
▪ RefSeq status: the curation status of the record in RefSeq, a non-redundant set of sequences that is updated as new experimental evidence is found.
▪ Aliases: alternative names for the gene.

RefSeq status codes:
▪ Provisional: submitted, but not reviewed.
▪ Predicted: submitted but not reviewed, and some aspect of the RefSeq record is predicted.
▪ Inferred: predicted by genome sequence analysis, possibly from homology, without experimental evidence.
▪ Validated: additional manual curation, such as checks for sequencing errors and misassociation with a locus.
▪ Reviewed: additional annotation, a summary description, and other functional information as available.

Gene database: Genomic Context
The Genomic Context section gives information about where the gene is positioned in the genome, including the identifiers (RefSeq identifiers) of the sequences and the exact base pair locations of the gene on these sequences. This section also includes a cartoon of the surrounding genes on the GenBank record, with clickable links that take you to the gene records for those genes.
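RefSeq identifiers such as those listed in the Genomic Context section follow the accession.version pattern described for the GenBank header, and the accession prefix indicates the type of molecule. The sketch below splits an identifier and looks up its prefix. The function name describe_refseq is hypothetical, the prefix table is a short subset rather than NCBI's full list, and the identifier in the example is made up for illustration and is not a real record.

```python
# A few common RefSeq accession prefixes (subset only, for illustration).
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule (e.g. a chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XR_": "predicted non-coding RNA",
    "XP_": "predicted protein",
}


def describe_refseq(identifier: str) -> dict:
    """Split 'ACCESSION.VERSION' and report the molecule type from the prefix."""
    accession, _, version = identifier.strip().partition(".")
    return {
        "accession": accession,
        # The version is absent when only the bare accession is given.
        "version": int(version) if version.isdigit() else None,
        "molecule": REFSEQ_PREFIXES.get(accession[:3], "unknown prefix"),
    }


if __name__ == "__main__":
    # Made-up identifier: accession NM_123456, version 2.
    print(describe_refseq("NM_123456.2"))
```

Because the accession stays constant while the version number increases with each update, comparing only the accession part tells you whether two identifiers refer to the same record.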
Gene database: Genomic Context
Most common RefSeq identifiers

Gene database: Genomic regions, transcripts and products
Callouts on the page: FASTA, region information, FASTA format, tools such as BLAST and primer design, links to other information.

What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a set of algorithms available at NCBI that allow you to use a DNA, RNA or protein sequence to find similar sequences in the NCBI databases.
An alignment score is a measure of how well the query and a given search result (subject) are aligned.
An E-value, also known as the "expect value", is the number of matches you would expect to get by random chance for a given database and query combination (false positives). The lower the E-value, the more significant the match.

Protein database
To research: what kind of information can be retrieved from the Protein database?

Structure database
Protein Data Bank (PDB)
PDB file format documentation: http://www.wwpdb.org/docs.html
To read about: Protein Data Bank (PDB)

Glossary
Search tags
Division in LOCUS
In the context of GenBank, the "division" within the LOCUS field of a sequence record denotes the specific taxonomic or thematic category to which the sequence belongs. The division helps categorize sequences within GenBank, making it easier to manage and retrieve specific sets of data.
1. BCT: Bacteria
2. INV: Invertebrates
3. MAM: Mammals
4. PHG: Phages (viruses that infect bacteria)
5. PLN: Plants and fungi
6. PRI: Primates
7. ROD: Rodents
8. SYN: Synthetic sequences
9. UNA: Unannotated sequences (historically used for early entries, now largely obsolete)
10. VRL: Viruses
11. VRT: Other vertebrates

More unexpected information
Callout: research title.
More unexpected database entries: http://www.biostars.org/p/10846/

GenBank vs. RefSeq

Further reading
✓ Bioinformatics
Thornton JM. The future of bioinformatics. Trends in Biotechnology, 16, 30–31, 1998.
Ouzounis CA & Valencia A. Early bioinformatics: the birth of a discipline - a personal view. Bioinformatics, 19(17), 2176–2190, 2003.
Ouzounis CA. Rise and demise of bioinformatics? Promise and progress. PLoS Computational Biology, 8(4), e1002487, 2012.
✓ Main resources
National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/
European Bioinformatics Institute (EMBL-EBI): http://www.ebi.ac.uk/