INTRODUCTION TO BIOINFORMATICS AND DATABASES
Session 1

BIOINFORMATICS
An interdisciplinary field that involves biology, computer science, mathematics, and statistics. These disciplines are used together to analyze biological sequence data and genome content, and to predict the structure and function of cellular macromolecules.

Bioinformatics applies computational techniques to solve data problems and biological problems.
✓ Data problems: representation (graphics), storage, retrieval (databases), and analysis (statistics, artificial intelligence).
✓ Biology problems: sequence analysis, structure or function prediction, and data mining.

DATA AND INFORMATION
Data is raw, unorganized facts that need to be processed. When data is processed, organized, structured, or presented in a context that makes it useful, it is called information (processed data).

DATABASE
A collection of data structured so that the data is easy to reach (searchable). Databases are computerized storehouses that provide a way to locate, add, remove, and change data, and they should be updated frequently.

Margaret Oakley Dayhoff (1925 - 1983)
Margaret Dayhoff is considered the founder of bioinformatics. She was the first to apply mathematics and computational techniques to protein and nucleic acid sequence data. In the 1960s, Professor Dayhoff and her collaborators at the National Biomedical Research Foundation (NBRF) established the first protein sequence atlas (database), which eventually became the Protein Information Resource (PIR), est. 1984.

CLASSIFICATION OF THE BIOLOGICAL DATABASES
Based on:
1. Type of the data
2. Scope and coverage
3. Level of biocuration

According to the type of data, databases can be categorized into:
1) Sequence (nucleotide, protein)
2) Expression
3) Disease
4) Pathway
5) Literature

Nucleotide databases
There are three main repositories for nucleotide sequences:
✓ ENA (European Nucleotide Archive), developed and operated by the EMBL-European Bioinformatics Institute (EMBL-EBI)
✓ GenBank, maintained by the National Center for Biotechnology Information (NCBI)
✓ DDBJ (DNA Data Bank of Japan)
The three repositories should contain all published DNA and RNA sequences. They also collaborate and share data with each other, so data submitted to one of them is shared with the others.

THE INTERNATIONAL NUCLEOTIDE SEQUENCE DATABASE COLLABORATION (INSDC)
[Figure: INSDC diagram, labeled "Submitted sequences"]

Protein databases
Swiss-Prot knowledge base
✓ Created by Amos Bairoch in 1986.
✓ A collaboration between the SIB (Swiss Institute of Bioinformatics) and the EBI (European Bioinformatics Institute).
✓ Data is manually annotated.
❖ Manual annotation can't cope with the huge flow of data.
TrEMBL (Translated EMBL)
✓ Created in 1996.
✓ Data is annotated by software tools (computer-annotated and maintained at the EBI).
✓ Contains data that is not yet in Swiss-Prot, though its quality does not reach that of Swiss-Prot.
❖ Once data enters Swiss-Prot, it stays saved in the EMBL archive but is no longer in TrEMBL.
UniProt (Universal Protein Resource)
✓ Created by combining Swiss-Prot, TrEMBL, and PIR.
[Figure labels: "Produced at PIR", "Produced at EBI"]
Expasy (Expert Protein Analysis System)
✓ An online resource that provides genomic, proteomic, phylogenetic, and medical chemistry data through over 160 databases and software tools.
PDB (Protein Data Bank)
✓ Database for the 3-D structural data of proteins.
✓ Experimental data supplied by scientists (X-ray crystallography and NMR).

Motif/Domain Protein Databases
PROSITE
✓ Provides a detailed description of annotated protein domains/motifs.
Classification of protein families
InterPro
✓ Classifies proteins into families and predicts protein features such as domains.

Genomic Databases
Genomic databases focus on organizing all the information for an organism and are not as diverse as sequence databases. They are mainly useful for genome sequencing projects.
Examples: NCBI (genome database), Ensembl, UCSC Genome Browser.

Genome annotation
The prediction of genes in a genome, including the location of protein-encoding genes, any significant matches to other proteins of known function, and the location of RNA-encoding genes.

According to the scope and data coverage, databases can be categorized into:
Comprehensive databases
✓ Cover different types of data from several species and fields.
✓ Examples:
  EMBL (European Molecular Biology Laboratory)
  NCBI (National Center for Biotechnology Information)
  DDBJ (DNA Data Bank of Japan)
Specialized databases
✓ Categorize the data based on field or organism.
✓ Examples (organism-specific or field-specific databases):
  TAIR (Arabidopsis)
  WormBase (nematodes)
  PubMed (bibliography)
  BacMap (all bacteria)
  MetaCyc (metabolic pathways)

According to the level of biocuration, databases can be categorized into:
Primary databases
✓ Data originally submitted by the experimentalists (not curated).
✓ Also known as archival databases.
✓ Examples include ENA, GenBank, and SRA (Sequence Read Archive).
Secondary databases
✓ Built up from the primary data retrieved from the primary databases.
✓ Include interpretation of the primary data.
✓ Repositories of "curated" data.
✓ Content is controlled by human reviewers, in addition to some experimental verification of the biological sequence data.
✓ Examples include UniProt, RefSeq, and OMIM.

Biocuration is the integration and interpretation of biological information into databases. It also involves organizing and representing the information and making it accessible to both humans and computers.

1) NCBI (GenBank)
Accession number: a unique identifier that can only be issued by the three main repositories (ENA, DDBJ, and GenBank) and is used permanently to identify a single DNA or protein sequence.

Flat-files
In biological databases, a flat file is a text file that usually contains one record (sequence). It is the indivisible unit of all sequence databases, and its data can be displayed in a variety of formats.

FASTA format
[Figure: a FASTA record with its header and sequence labeled]

Format
The way information is arranged. Different programs require the information to be specified to them in a formal manner, using particular keywords in a particular order. The FASTA format is one of the most common formats for sequence records.

2) EMBL-EBI (ENA)

3) Protein database (UniProt)

4) TAIR (The Arabidopsis Information Resource)
[Figures: screenshots with numbered steps 1 to 8]

5) Bibliographic Databases
✓ PubMed / Medline: https://pubmed.ncbi.nlm.nih.gov/
✓ Scopus: https://www.scopus.com/search/form.uri?display=basic#basic

DATABASE QUERYING
Why querying?
✓ General web searches are not sufficient to give accurate results.
✓ General web searches give thousands of hits.
What to do?
✓ Keywords
✓ Boolean operators (AND, OR, and NOT): used to combine terms. Parentheses can be used to nest terms (giving them priority, so they are searched first), and quotation marks indicate an exact phrase.

EXAMPLES
Searching for:
  horse liver alcohol dehydrogenase
is the same as:
  horse AND liver AND alcohol AND dehydrogenase
To expand your search (get more hits), use OR:
  horse OR liver OR alcohol OR dehydrogenase
❖ Best strategy: start with OR, then reduce the number of hits with AND (i.e., do a wide search, then narrow it down).

SEARCH FIELDS IN NCBI
✓ Molecular weight (protein database): 2002[MOLWT]
  2002:2009[MOLWT] AND human[ORGN]
✓ Accession number(s): AF114696:AF114714[ACCN]
✓ Sequence length: 3000:4000[SLEN]
✓ Date: 1998/02:2000/01/25[PDAT]

Thank you

Practice
✓ Using the accession AF114714, find all the necessary information for the sequence, such as the source organism, length, and FASTA sequence, using NCBI and EMBL-EBI.
✓ Using UniProt, find the function, GO annotations, and related publications of the Arabidopsis thaliana histone H3 with the entry ID P59226.
✓ Using the TAIR database, find the function of the SLAC1 gene with the locus ID AT1G12480, and find its orthologues.
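
The search strings from the EXAMPLES and SEARCH FIELDS IN NCBI slides can also be assembled programmatically. The following is only a minimal sketch in Python, not an official NCBI tool: the helper names (field, value_range, phrase, combine) are hypothetical and chosen for illustration, and the field tags (MOLWT, ORGN, ACCN) are the ones listed on the slide. The code only builds text; it does not contact any database.

```python
# Hypothetical helpers for building Boolean, field-tagged query strings.
# They only produce text to paste into a database search box.

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def field(term, tag):
    """Attach a field tag to a term, e.g. human[ORGN]."""
    return f"{term}[{tag}]"


def value_range(start, end, tag):
    """Build a start:end range restricted to one field, e.g. 3000:4000[SLEN]."""
    return f"{start}:{end}[{tag}]"


def phrase(text):
    """Quote a term so that it is searched as an exact phrase."""
    return f'"{text}"'


def combine(operator, *terms):
    """Join terms with one Boolean operator and nest them in parentheses."""
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unknown Boolean operator: {operator}")
    if not terms:
        raise ValueError("at least one term is required")
    if len(terms) == 1:
        return terms[0]
    return "(" + f" {operator} ".join(terms) + ")"


keywords = ["horse", "liver", "alcohol", "dehydrogenase"]

# Wide search first (OR), then the narrower search (AND).
wide_query = combine("OR", *keywords)
narrow_query = combine("AND", *keywords)

# Field-restricted searches taken from the slide.
protein_query = combine("AND", value_range(2002, 2009, "MOLWT"), field("human", "ORGN"))
accession_query = value_range("AF114696", "AF114714", "ACCN")

# Exact phrase nested inside a larger query.
phrase_query = combine("AND", phrase("alcohol dehydrogenase"), "horse")

for query in (wide_query, narrow_query, protein_query, accession_query, phrase_query):
    print(query)
```

Wrapping every combination in parentheses keeps the intended priority explicit when queries are nested, which follows the advice above on using parentheses for nesting terms.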