Bioinformatics Lecture Notes PDF

SCI 20: Bioinformatics
Lecture #2: Introduction to Biological Databases and Data Handling
Chuckcris Tenebro, M.Sc., Instructor
Iloilo Science and Technology University

What Are Biological Databases?
A biological database is an organized collection of data that stores information about biomolecules such as DNA, RNA, and proteins, and about metabolic pathways. It is used for:
▪ Enabling efficient data storage, sharing, and retrieval
▪ Facilitating large-scale data analysis for research
▪ Providing a foundation for discoveries in genomics, proteomics, and systems biology
Examples of data types:
▪ Nucleotide and protein sequences
▪ Genomic annotations
▪ 3D protein structures
▪ Gene expression profiles

Types of Biological Databases
1. Primary Databases
- Archive raw data submissions
- Enable direct submissions from researchers
- Publicly accessible
Examples:
▪ GenBank: Nucleotide sequences with annotations
▪ EMBL-EBI: European repository for nucleotide data
▪ DDBJ: Japanese nucleotide sequence database

GenBank
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. It is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI.
Several ways to search and retrieve data from GenBank:
▪ Search GenBank for sequence identifiers and annotations with Entrez Nucleotide
▪ Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool)
▪ Search, link, and download sequences programmatically using NCBI E-utilities

The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. The most important source of new data for GenBank is direct submissions from researchers and other individuals, made with one of the submission tools.
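The programmatic route mentioned above (NCBI E-utilities) can be sketched roughly as follows. This is a minimal illustration, not an official client: the function names `build_efetch_url` and `fetch_record` are made up for this example, and the choice of the `nuccore` database with `rettype=fasta` is an assumption about what a typical GenBank lookup needs. The accession used in the usage note is the one that appears later in these notes.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession: str, rettype: str = "fasta") -> str:
    """Build an EFetch URL for one nucleotide record (hypothetical helper).

    rettype="fasta" asks for FASTA text; rettype="gb" asks for the
    GenBank flat file. Both are returned as plain text (retmode="text").
    """
    params = {
        "db": "nuccore",      # Entrez Nucleotide database
        "id": accession,      # accession, optionally with a version suffix
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "fasta", timeout: float = 30.0) -> str:
    """Download one record as text (hypothetical helper; needs network access)."""
    with urlopen(build_efetch_url(accession, rettype), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_record("NM_000059.2")` would request that record as FASTA text, and passing `rettype="gb"` would request the GenBank flat file instead. For anything beyond occasional requests, NCBI's usage guidelines (request rate limits, API keys) should be checked first.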
If submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source.

EMBL-EBI
The European Bioinformatics Institute (EBI) maintains and distributes the EMBL Nucleotide Sequence database, Europe's primary nucleotide sequence data resource. The EBI also maintains and distributes the SWISS-PROT Protein Sequence database, in collaboration with Amos Bairoch of the University of Geneva.
EMBL-EBI is one of the six sites of the European Molecular Biology Laboratory (EMBL), an intergovernmental research organisation funded by over 20 member states, prospect member states, and associate member states.

DDBJ
The DNA Data Bank of Japan (DDBJ) is a public database of nucleotide sequences established at the National Institute of Genetics (NIG). The DDBJ Center also operates the Japanese Genotype-phenotype Archive (JGA) with the National Bioscience Database Center to collect human subject data from Japanese researchers.
The current focuses of the DDBJ Center are:
(i) improved network security and data management for JGA;
(ii) virtualization of computing infrastructure for better development and analysis in the high-performance computing (HPC) environment; and
(iii) restructuring of data processes for updating the International Nucleotide Sequence Database Collaboration (INSDC) databases.

Types of Biological Databases
2. Secondary Databases
- Provide curated, derived information
- Value-added annotations
- Integration with computational tools
Examples:
▪ Swiss-Prot: Annotated protein sequences
▪ Pfam: Protein families and functional domains

Swiss-Prot
SWISS-PROT is a protein sequence database containing detailed annotations. It differs from other protein sequence databases in three respects:
▪ Annotation.
SWISS-PROT contains the protein sequences in the EMBL nucleic acid sequence database that have been carefully examined and accurately annotated.
▪ Minimum redundancy. Redundant sequences are minimized in SWISS-PROT.
▪ Integration with other databases. SWISS-PROT is cross-referenced with more than 30 other databases, including nucleic acid sequence, protein sequence, and protein structure databases.
https://www.cusabio.com/c-20905.html#a07

Swiss-Prot
Swiss-Prot is now part of the UniProt database. The Universal Protein Resource (UniProt) is maintained by the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and the Protein Information Resource (PIR).
https://www.cusabio.com/c-20905.html#a07

Types of Biological Databases
3. Specialized Databases
- Address specific research areas or organisms
Examples:
▪ OMIM: Genetic disorders database
▪ KEGG: Pathway maps for metabolic networks

OMIM
Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 16,000 genes. OMIM focuses on the relationship between phenotype and genotype.

KEGG
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a knowledge base for systematic analysis of gene functions, linking genomic information with higher-order functional information. The genomic information is stored in the GENES database, which is a collection of gene catalogs for all the completely sequenced genomes and some partial genomes. The higher-order functional information is stored in the PATHWAY database, which contains graphical representations of cellular processes, such as metabolism, membrane transport, signal transduction, and the cell cycle.
A third database in KEGG is LIGAND, which holds information about chemical compounds, enzyme molecules, and enzymatic reactions.

Data Formats in Bioinformatics
Standardized formats ensure interoperability between tools and databases and reduce errors during data exchange and analysis.
Common Formats:
1. FASTA
- A simple format for representing nucleotide or protein sequences.
- Consists of a single-line header (starting with ">"), followed by sequence data.
2. GenBank Flat File
- A rich format containing nucleotide or protein sequences along with detailed annotations.
- Divided into sections such as LOCUS, DEFINITION, SOURCE, FEATURES, and ORIGIN (the sequence itself).
- Used for comprehensive data sharing, storing metadata, and detailed analysis.

Comparison of Data Formats in Bioinformatics
Aspect            FASTA                                      GenBank
1. Complexity     Simple, minimal annotation                 Rich, detailed metadata and annotations
2. File Size      Small due to lack of metadata              Larger due to extensive information
3. Ease of Use    Easy to parse, suitable for quick tasks    Requires specialized tools for parsing
4. Applications   Sequence alignment, BLAST input            Comprehensive genomic studies
5. Annotation     None or minimal                            Extensive, includes features like CDS

Tools for Format Conversion
1. ReadSeq
- Converts between sequence formats.
2. EMBOSS seqret
- Flexible conversion utility.

SCI 20: Bioinformatics
Lecture #3: Database Queries and Sequence Retrieval
Chuckcris Tenebro, M.Sc., Instructor
Iloilo Science and Technology University

Data Retrieval Systems
1.
Entrez (NCBI)
- Integrated platform for querying multiple databases
- Searches across PubMed, GenBank, and GEO
  ▪ PubMed: A search engine for biomedical literature; it processes the Boolean operators in a query from left to right
  ▪ GenBank: A database of annotated, publicly available nucleotide sequences
  ▪ GEO (Gene Expression Omnibus): A public repository that stores high-throughput functional genomics data, such as gene expression data
- Has advanced search with filters (e.g., organism, sequence length) and cross-database linking
2. DBGET/LinkDB
- Can access KEGG and other molecular biology databases
- Useful for pathway and enzyme data
3. Sequence Retrieval System (SRS)
- Facilitates complex searches across databases using keywords or identifiers
- The SRS service was decommissioned on Thursday, 19 December 2013

Genome Browsers for Visualization
- Provide a graphical interface for exploring genomic data
- Used for mapping disease-associated genes, studying gene expression patterns, and exploring chromosomal rearrangements
1. UCSC Genome Browser
▪ Custom tracks for comparative genomic studies
▪ Interactive tools for alignment and annotation
2. Ensembl
▪ Focus on automated annotation of vertebrate genomes
▪ Integrated comparative genomics
3. NCBI Genome Data Viewer
▪ Visualization of gene models, regulatory elements, and SNPs

Challenges in Biological Databases
1. Data Redundancy
▪ Repeated entries due to multiple submissions
▪ This redundancy can inflate database size and complicate downstream analyses
▪ Example: Multiple researchers submit sequence data for the same gene but from different organisms or experimental conditions.
2.
Versioning
▪ Keeping track of updated annotations
▪ A gene sequence initially submitted as hypothetical may later be annotated as a protein-coding sequence after further studies
▪ Example: The version suffix of a sequence record's accession might update from version 1 to version 2 (e.g., NM_000059.1 → NM_000059.2) after the record is revised.
▪ Tracking updates ensures researchers are using the latest, most accurate data
3. Scalability
▪ Coping with exponential data growth
▪ Example: Next-generation sequencing (NGS) technologies produce massive datasets, such as whole-genome sequencing (WGS) of thousands of individuals in population genomics studies.
▪ Resources such as NCBI and Ensembl must handle exponential growth in sequence submissions, requiring robust infrastructure and algorithms to store, retrieve, and analyze data efficiently.
4. Data Integrity and Validation
▪ Ensuring accurate submissions
▪ A submission claiming a novel gene sequence must undergo checks for accuracy (e.g., no contamination, proper annotation, and completeness)
▪ Example: Automated tools like BLAST and manual curation are used to compare new sequences against existing databases and catch errors or wrong annotations.
▪ Role of Curation Teams:
  ✓ Manual Review: Ensuring that gene names conform to standard nomenclature and that annotations are biologically relevant.
  ✓ Validation: Verifying that experimental metadata and supporting publications align with the submitted sequences.
  ✓ Example: Curators ensure a "hypothetical protein" is not erroneously labeled as a known gene product without supporting evidence.

Summary
Biological databases store and manage critical biomolecular data. Primary databases archive raw data, while secondary databases add value through curation. Tools like Entrez and genome browsers make data retrieval and visualization efficient.
Addressing these challenges sustains the utility and growth of bioinformatics resources.

Decoding Your Questions…
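As a follow-up to the FASTA format and the versioning challenge covered in these notes, the sketch below reads FASTA text (a ">" header line followed by sequence lines) and keeps only the highest version of each accession. It is a minimal illustration under stated assumptions: the function names are invented for this example, the `DEMO_` records are made-up toy data rather than real entries, and the identifier is assumed to be the first whitespace-delimited token of the header in `ACCESSION.VERSION` form.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Toy input: hypothetical records, not real database entries.
EXAMPLE = """>DEMO_0001.1 hypothetical record, first version
ACGTACGT
ACGT
>DEMO_0001.2 hypothetical record, revised
ACGTACGTACGA
>DEMO_0002.1 another hypothetical record
TTGACA
"""


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # emit the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence line of the current record
    if header is not None:                # emit the final record
        yield header, "".join(chunks)


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'ACCESSION.VERSION' into (accession, version); 0 if no version."""
    accession, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, 0


def latest_records(text: str) -> Dict[str, Tuple[int, str]]:
    """Map each accession to (highest version seen, its sequence)."""
    latest: Dict[str, Tuple[int, str]] = {}
    for header, sequence in parse_fasta(text):
        fields = header.split()
        if not fields:
            continue                      # header with no identifier
        accession, version = split_version(fields[0])
        if accession not in latest or version > latest[accession][0]:
            latest[accession] = (version, sequence)
    return latest
```

Running `latest_records(EXAMPLE)` keeps version 2 of `DEMO_0001` and version 1 of `DEMO_0002`. Real pipelines would more likely rely on an established parser, and on the rich GenBank flat file when annotations matter, since this sketch ignores everything in the header after the identifier.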
