Bioinformatics_Lessons 1-5.pdf

Course Code: BTEC 101
Course Title: Bioinformatics
Course Credit: 3
Nominal Duration: 54 hrs
Prerequisite: Living in the IT Era (GE EL 103)

Course Description
Students analyze individual DNA and protein sequences using bioinformatics techniques, gain practical familiarity with common web-based bioinformatics tools, and use these tools to answer biological questions. This is not a programming course.

Grading System
Major Final Output (40%): major output corresponding to the terminal course outcome.
Class Standing (60%): graded modular/online activities and outputs corresponding to the enabling course outcomes.

CLASS RULES in F2F
* No cellphones/gadgets unless authorized.
* Writing notes is advised.
* Practice values.

1. Arrive on time for class.
2. Raise your hand to speak or volunteer.
3. Follow the dress code of the school.
4. Do not cheat or copy other people’s work.
5. Complete all assignments.
6. Listen to the teacher when spoken to and answer the question asked.
7. Respect everyone in the class.
8. Keep your hands, feet, and objects to yourself.
9. Respect the school property.
10. Keep your language clean and appropriate for the classroom setting.
11. Do not leave your seat without permission.
12. Do not eat or drink in class (except for water).
13. Learn at least one thing you did not know before coming to class.
14. Ask for help if you do not understand something the teacher just said, and be respectful while asking for it.
15. Be on time for every assignment or test (except for medical or other emergencies).
16. Do your best work each day, regardless of how much time is left in class.
17. Never give up on yourself or your goals.
18. Be open to new ideas and change with an open mind!
19. Treat others the way you want to be treated, with kindness and respect.
21. Be a friend to everyone in the classroom and keep your friendships strong.
22. Listen to what the teacher says and follow directions carefully.
23. Apologize if you make a mistake or accidentally hurt someone else.
24. Tell the truth!
25. Raise your hand if you have a question and wait to be called on.
26. No one should ever be made to feel bad about who they are.
27. Respect each other’s ideas and opinions even if you disagree with them.
28. Take pride in your work and hand it in on time.
29. Do not let anyone influence you to do anything you know is wrong.
30. Always try your best. Never give up!

Google Classroom Etiquette
* Be on time.
* Dress appropriately.
* Sit in a chair.
* Raise your hand.
* Turn on video.
* Mute yourself.
* Be respectful.
* Use chat box appropriately.
* No parents, siblings, or pets on camera.

GOALS of today
* Define the term bioinformatics.
* Explain the scope of bioinformatics.
* Understand the links between modern biology, genomics, and bioinformatics.
* Determine which biological questions bioinformatics can help you answer quickly.

What is BIOINFORMATICS?
INFORMATION TECHNOLOGY
BIOINFORMATICS may be defined as a scientific discipline encompassing the acquisition, storage, processing, analysis, interpretation, and visualization of BIOLOGICAL INFORMATION.
[Slide labels: WinsCANOPY; Geometric Morphometrics]

(MOLECULAR) BIOINFORMATICS
Bioinformatics is a management information system for molecular biology and has many practical applications.

BIOINFORMATICS (NIH Biomedical Information Science and Technology Initiative Consortium)
Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

BIOINFORMATICS is the field of science in which biology, computer science, and information technology merge into a single discipline.

Bioinformatics (Luscombe et al.)
A union of biology and informatics.

Bioinformatics (Jin Xiong)
The discipline of quantitative analysis of information relating to biological macromolecules with the aid of computers.

How is Bioinformatics different from Computational Biology?
Bioinformatics is also known as: Computational Molecular Biology

Sub-disciplines
* the development of new algorithms and statistics
* the analysis and interpretation of various types of data
* the development and implementation of tools

BIOINFORMATICS VS. GENOMICS
GENOMICS / BIOINFORMATICS
* Explains how to access biological sequence data
* Compare sequences
* Compare multiple sequences (primarily by the Basic Local Alignment Search Tool, BLAST)
* Perform multiple sequence alignment
* Show how multiply aligned proteins or nucleotides can be visualized in phylogenetic trees

Perspectives of Bioinformatics
* The Cell
* The Organism
* The Tree of Life

Bioinformatics: analyzing DNA, RNA, and protein

APPROACHES TO BIOINFORMATICS
Reproducible Research in Bioinformatics

Bioinformatics and Other Informatics Disciplines
* medical informatics
* health care informatics
* nursing informatics
* library informatics
Bioinformatics has an emphasis on DNA and other biomolecules. We may also distinguish tool users (e.g., biologists using bioinformatics software to study gene function, or medical informaticists using electronic health records) from tool makers (e.g., those who build databases, create information technology infrastructure, or write computer software).

GOALS of BIOINFORMATICS
(1) to manage data
(2) to develop technological tools
(3) to use these tools

Scope of Bioinformatics
Applications of Bioinformatics
Limitations

Most common errors in bioinformatics databases
1. Sequencing error
2. Cloning vector contamination
3. Redundancy of data
4. Human error

BIOLOGICAL DATABASES
assumethataswereadthissentencetherearenospacesbetweenwordsnorpunctuationmarksdelimitingtheboundariesofwrittenthought

SEAT WORK
Decode using the GENETIC CODE:
Template: GAT AAT GCT TAG ATC TCG TAA CTT ATC TTT GAA
cDNA:
tRNA:
AA:
Letter:

DATABASE
A collection of information that is organized so that it can easily be accessed, managed, and updated.
Types: BIBLIOGRAPHIC, FULL-TEXT, NUMERIC, IMAGES

DATA
Data is raw, unorganized facts that need to be processed.
Example: each student's test score is one piece of data.

INFORMATION
When data is processed, organized, structured, or presented in a given context so as to make it useful, it is called information.
Example: the average score of a class or of the entire school is information that can be derived from the given data.

Bibliographic Database
Full-text Database
Numeric Database
Image Database
JSTOR

BIOLOGICAL DATABASE
* Stores of biological information
* Technology of databases
* Public and private repositories
* RNA, DNA, and protein

Different classifications of biological databases
Type of data:
* nucleotide sequences
* protein sequences
* protein sequence patterns or motifs
* macromolecular 3D structure
* gene expression data
* metabolic pathways

NUCLEOTIDE DATABASE
Protein Database
Protein Structure Database
Macromolecular Structure
Gene Expression Database
Metabolic Pathway Database

How much DNA sequence is stored in public databases?
Where are the data stored?
Centralized Databases Store DNA Sequences
GenBank
Taxa represented in GenBank
Ten most sequenced organisms in GenBank
Codes in GenBank
Types of Data in GenBank/EMBL-Bank/DDBJ

Bioinformatics Databases
* Genomic Database
* DNA Database
* RNA Data
* Protein Database
Central Bioinformatics Resources: NCBI and EBI

Fundamentals of Genes and Genomes
Eva Joie G. Amestoso, M.Sc.
Natural Sciences Department, College of Arts and Sciences, Bukidnon State University

BIOLOGICAL MACROMOLECULES, GENOMICS, AND BIOINFORMATICS
Genetic information is stored in the cell in the form of biological macromolecules, such as nucleic acids and proteins.

THE UNIVERSAL GENETIC MATERIAL
Criteria of a Good Genetic Material
* Information
* Transmission
* Replication
* Variation

DNA as the Genetic Material
* DNA
* Chromosome
* Genes

The Structure of DNA
* Primary structure
* Secondary structure
* Tertiary structure

Alternative Forms of DNA
The Chromosome
The Gene
How is the information in a gene encoded?
Morse code is a method used in telecommunication to encode text characters as standardized sequences of two different signal durations, called dots and dashes, or dits and dahs.

Genetic Code
The genetic code consists of the sequence of nitrogen bases (A, C, G, U) in an mRNA chain. The four bases make up the “letters” of the genetic code. The letters are combined in groups of three to form code “words,” called codons. Each codon stands for (encodes) one amino acid, unless it codes for a start or stop signal.

Reading the Genetic Code

Characteristics of the Genetic Code
* The genetic code is universal.
* The genetic code is unambiguous.
* The genetic code is redundant.

Properties of the Genetic Code
1. The genetic code consists of a sequence of nucleotides in DNA or RNA.
2. The genetic code is a triplet code.
3. The genetic code is degenerate.
4. Isoaccepting tRNAs are tRNAs with different anticodons that accept the same amino acid; wobble allows the anticodon on one type of tRNA to pair with more than one type of codon on mRNA.
5. The code is generally nonoverlapping.
6. The reading frame is set by an initiation codon, which is usually AUG.
7. When a reading frame has been set, codons are read as successive groups of three nucleotides.
8. Any one of three termination codons (UAA, UAG, and UGA) can signal the end of a protein; no amino acids are encoded by the termination codons.
9. The code is almost universal.
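The properties listed above (triplet codons, a reading frame set by AUG, three termination codons, degeneracy) can be sketched in a few lines of Python. This is a minimal illustration and not part of the course tools: the names `CODON_TABLE`, `STOP_CODONS`, and `translate` are hypothetical, the table is the standard genetic code written in U, C, A, G order, and the input is assumed to contain only the letters A, C, G, U (or T).

```python
from itertools import product

# Standard genetic code; codons are enumerated with bases in the order U, C, A, G
# (first base slowest, third base fastest). '*' marks a termination codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first stop codon (one-letter codes)."""
    # Assumption: T is treated as U so a DNA coding strand can also be passed in.
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")          # the initiation codon sets the reading frame
    if start == -1:
        return ""                     # no start codon, nothing to translate
    protein = []
    # Read successive, nonoverlapping groups of three nucleotides.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":         # termination codons encode no amino acid
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCUUGGUAAAC"))  # AUG GCU UGG UAA -> MAW
print(sorted(STOP_CODONS))            # ['UAA', 'UAG', 'UGA']
```

Degeneracy shows up directly in the table: six different codons map to leucine (`L`), while methionine (`M`) and tryptophan (`W`) have one codon each.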
Bioinformatics Databases

Data in Bioinformatics
* DNA: sequences of nucleotides (A, T, G, C)
* RNA sequences (A, U, G, C): mRNA, tRNA, hnRNA
* Protein sequences: strings of amino acids
* Structure data (protein structure: primary, secondary, tertiary, quaternary, 3D views)

Curation Process
This process consists of 6 major mandatory steps:
(1) sequence curation
(2) sequence analysis
(3) literature curation
(4) family-based curation
(5) evidence attribution
(6) quality assurance and integration of completed entries

Bioinformatics Databases
Databases are a convenient system to properly store, search, and retrieve any type of data. Databases are of different types, based on the nature of the information and the manner (complexity) of data storage.

Types of Bioinformatics databases
Based on the nature of information, db are divided into:
1. Generalized db: DNA, protein (e.g., NCBI)
   a. Sequence db: nucleotides or amino acids
   b. Structure db: structure of macromolecules
2. Specialized db: Expressed Sequence Tags (EST), Single Nucleotide Polymorphisms (SNP)

Based on the manner of data storage, db are divided into:
1. Primary (archival) db: data in original form, taken as such from the source.
   Examples: GenBank, Swiss-Prot, ENA, DDBJ, PDB, GEO (Gene Expression Omnibus), ArrayExpress
2. Secondary db: value-added db with information derived from primary db.
   Examples: UniProt (Universal Protein Resource), InterPro, Ensembl, KEGG (Kyoto Encyclopedia of Genes and Genomes), GO (Gene Ontology)
3. Composite db: combined primary db.
   * Redundant and non-redundant db: a redundant db holds more than one copy of each sequence; a non-redundant db keeps only one
   * Boutique db: species-specific sequence data
   Examples: nrdb (Non-redundant database), INSD (International Nucleotide Sequence Database), PlnTFDB (Plant Transcription Factor Database), Reactome, MGI (Mouse Genome Informatics)

Db entries are composed of:
* Core data: original sequence
* Supplementary data or annotation (source, author, date, method used, etc.)

Sequence formats (with the character that opens a record):
* PIR (Protein Information Resource)/NBRF (National Biomedical Research Foundation): >P, >N
* FASTA (FAST-All): >
* GDE (Genetic Data Environment): %

Primary Databases
* Data in original form, taken from the source
* Original submission by the researcher
* Contents controlled by the submitter
* The data explosion in the 1980s led to the start of many repositories
1. Nucleic acid/nucleotide sequence db
2. Protein sequence db
3. Metabolite db

Secondary databases
* Derivative db
* Result of analyses of sequences in the primary db
* Secondary db are built up from primary db
* Secondary db are analyzed in a variety of ways and contain different information in different formats
* Contents of secondary db are controlled by a third party
* E.g.: Prosite, Prints, Blocks

Nucleic acid sequence databases
* Collection of nucleotide sequences
* Organize and distribute nucleotide sequences from all available sources
* In the form of a text file
* Can be read by humans and computers
* Many dbs are assembled from several publications, so they contain overlapping fragments of complete sequences
* First sequence: yeast tRNA with 77 bases, in 1964

NCBI
National Center for Biotechnology Information
* Established on November 4, 1988 as part of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), USA
* Headquarters in Bethesda, Maryland
* Legislation sponsored by Senator Claude Pepper
* Services: PubMed, GenBank, BLAST, Entrez

GenBank ® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.

What is in it?
* Annotated nucleotide sequences, including mRNA sequences with coding regions, segments of genomic DNA with a single gene or multiple genes, and ribosomal RNA gene clusters
* More than 100,000 organisms
* Amino acid translations (CDS)

EMBL
The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 25 member states, four prospect member states, and two associate member states. EMBL was constituted in 1974 and is an intergovernmental organization funded by public research money from its member states.

Stations of EMBL
The Laboratory operates from six sites: the main laboratory in Heidelberg, and outstations in:
* Hinxton (the European Bioinformatics Institute, EBI)
* Grenoble (France)
* Hamburg (Germany)
* Monterotondo (near Rome)
* Barcelona (Spain)

European Molecular Biology Laboratory (EMBL) nucleotide database
* From the European Bioinformatics Institute (EBI), UK
* Collects and assembles data from direct author submission, genome sequencing groups, patent applications, and literature
* Goal: integrate nucleotide sequence data and annotation into the wealth of bioinformatics resources
* By cross-reference and the Sequence Retrieval System (SRS), data can be viewed in 200 local stations
* 2494 completed genomes

The EMBL-EBI hosts a number of publicly open, free-to-use life science resources, including biomedical databases, analysis tools, and bio-ontologies. These include:
* ArrayExpress: archive of gene expression experiments
* BioModels Database: a database of computational models relevant to the life sciences
* BioStudies: a database that serves as a generic data archive at EMBL-EBI for biomolecular datasets
* Chemical Entities of Biological Interest (ChEBI): database and ontology of molecular entities
* European Nucleotide Archive (ENA): resource of nucleotide sequencing information
* Ensembl project: genome databases for vertebrates and other eukaryotic species (joint with the Wellcome Trust Sanger Institute)
* Europe PubMed Central: database offering free access to a collection of biomedical research literature

DNA Data Bank of Japan
* Started in 1986.
* Currently, the DDBJ Center is in operation at the National Institute of Genetics (NIG) in Mishima, Japan, with the endorsement of MEXT, the Japanese Ministry of Education, Culture, Sports, Science and Technology.
* The DDBJ Center is reviewed and advised by its own advisory board, the DNA Database Advisory Committee (an outside committee of NIG), and also by the advisory board to INSDC, the International Advisory Committee.

PROTEIN SEQUENCE DATABASES
SWISSPROT, PIR

UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB).

Swiss-Prot
* Established in 1986 by the Dept. of Biochemistry, University of Geneva
* Maintained by the Swiss Institute of Bioinformatics (SIB) and EMBL
* The database is composed of 2 parts:
  1. Core data: sequence, reference, and taxonomic details
  2. Annotation: sequence variants, functions, secondary and tertiary structures
* Provides high-level annotation, including functions of the protein
* Maintains high quality and structure; first choice for most research purposes
* Swiss-Prot has been supplemented since 1996 by TrEMBL (translated EMBL)
* TrEMBL has 2 sections:
  1. SP-TrEMBL: entries that will eventually be incorporated into Swiss-Prot
  2. REM-TrEMBL: entries that will not be included in Swiss-Prot

PIR - Protein Information Resource
* Established in 1984 by the National Biomedical Research Foundation (NBRF), Washington DC
* Aim: identification and interpretation of protein sequence information
* Investigating evolutionary relationships among proteins
* Helps with search and similarity analysis
* Provides an integrated environment for sequence analysis between 3 units
* PIR is composed of 3 databases:
  1. PSD: protein sequence database
  2. NREF: non-redundant reference database
  3. iProClass: provides structural and functional features of proteins
* The PIR database is split into 4 sections that differ in the quality of data and the level of annotation provided:
  1. fully classified and annotated entries
  2. preliminary entries, not thoroughly reviewed
  3. unverified entries, not reviewed
  4. genetically engineered sequences

Structure databases

PDB
The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy, are freely accessible on the Internet via the websites of its member organisations (PDBe, PDBj, and RCSB); see www.rcsb.org/. The PDB is overseen by an organization called the Worldwide Protein Data Bank (wwPDB). The PDB is a key resource in areas of structural biology, such as structural genomics.
A molecular graphics display, the Brookhaven RAster Display (BRAD), was used to visualize protein structures in 3-D. The file format initially used by the PDB was called the PDB file format. The PDB grew out of X-ray crystallographic studies of proteins and the BRAD visualization system, which became available in 1968; the archive itself was established in 1971. In October 1998, the PDB was transferred to the Research Collaboratory for Structural Bioinformatics (RCSB).
[Slide image: INSULIN]

NDB
The Nucleic Acid Database (NDB; Berman et al., 1992) was established in 1991 as a resource for specialists in the field of nucleic acid structure. The core of the NDB has been its relational database of nucleic acid-containing crystal structures. It allows researchers to perform comparative analyses of nucleic acid-containing structures selected from the NDB.

OMIM
Online Mendelian Inheritance in Man
* A comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes
* The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes
* Initiated in the early 1960s by Dr. Victor A. McKusick as a catalogue of Mendelian traits and disorders, entitled Mendelian Inheritance in Man
* Internet version: 1995

BIOINFORMATICS
EVA JOIE G. AMESTOSO, M.Sc.
Bukidnon State University, Natural Sciences Department

SEQUENCE ALIGNMENT

Eye of the tiger
* In 1994 Walter Gehring et al. (Univ. of Basel) turned the gene “eyeless” on in various places in Drosophila melanogaster
* Result: eyes were formed in multiple places
* ‘eyeless’ is a master regulatory gene that controls approximately 2000 other genes
* Turning ‘eyeless’ on induces formation of an eye

Eyeless Drosophila
Mutant Drosophila melanogaster: gene ‘EYELESS’ turned on

HOMEO BOX
A homeobox is a DNA sequence found within genes that are involved in the regulation of development (morphogenesis) of animals, fungi, and plants.

Drosophila melanogaster: HOX homeoboxes
Drosophila melanogaster: PAX homeoboxes
Homeoboxes and Master regulatory genes

3.2 On sequence alignment
Sequence alignment is the most important task in bioinformatics!
3.2 On sequence alignment
Sequence alignment is important for:
* prediction of function
* homology detection
* database searching
* structural biology
* gene finding
* genome annotation
* sequence divergence
* phylogenetics
* sequence assembly
* functional annotation

3.3 On sequence similarity
* Homology: genes that derive from a common ancestor gene are called homologs.
* Orthologous genes are homologous genes in different organisms.
* Paralogous genes are homologous genes in one organism that derive from gene duplication.
* Gene duplication: one gene is duplicated into multiple copies that are therefore free to evolve and assume new functions.

HOMOLOGOUS and PARALOGOUS
Homologous vs. Analogous
HOMOLOGOUS and PARALOGOUS versus ANALOGOUS

Sequence similarity
Causes for sequence (dis)similarity:
* mutation (substitution): a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
* insertion: at a certain location one new nucleotide is inserted between two existing nucleotides (e.g.: AA → AGA)
* deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
* indel: an insertion or a deletion

3.4 Sequence alignment: global and local
Find the similarity between two (or more) DNA sequences by finding a good alignment between them.

The biological problem of sequence alignment

DNA-sequence-1   tcctctgcctctgccatcat---caaccccaaagt
Alignment        |||| ||| ||||| |||||   ||||||||||||
DNA-sequence-2   tcctgtgcatctgcaatcatgggcaaccccaaagt

Sequence alignment - definition
Sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are padded with gaps (dashes) so that, wherever possible, columns contain identical characters from the sequences involved, as in the alignment shown above.

Algorithms
* Needleman-Wunsch: pairwise global alignment only.
* Smith-Waterman: pairwise local (or global) alignment.
* BLAST: pairwise heuristic local alignment.

Pairwise alignment
Pairwise sequence alignment methods are concerned with finding the best-matching piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid) sequences. Typically, the purpose of this is to find homologues (relatives) of a gene or gene product in a database of known examples. This information is useful for answering a variety of biological questions:
1. The identification of sequences of unknown structure or function.
2. The study of molecular evolution.

Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely related sequences. As these sequences are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique. Further, there are several complications to molecular evolution (such as domain shuffling) which prevent these methods from being useful.

Global Alignment
Find the global best fit between two sequences.
Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:

           V I V A L A S V E G A S
A(s,t) =   | | | |   |   |       |
           V I V A D A - V - - I S

The dashes in the second row are indels.

The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm (1970, J Mol Biol. 48(3):443-53) performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences. The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score.
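To make the dynamic-programming idea concrete, here is a minimal Python sketch of global alignment scoring. It is an illustration under stated assumptions, not the published implementation: the scoring scheme (+1 per match, -1 per mismatch, -1 per gap position) is assumed for simplicity, and the function names `score_alignment` and `needleman_wunsch_score` are hypothetical. The first function scores an alignment that is already written out with dashes; the second computes the best achievable global score without enumerating alignments.

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score two already-aligned rows of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap          # indel column
        elif x == y:
            total += match        # identical characters
        else:
            total += mismatch     # substitution
    return total


def needleman_wunsch_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of s and t, filled in row by row."""
    # prev[j] = best score for aligning the current prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]   # empty prefix of s: only gaps
    for i in range(1, len(s) + 1):
        curr = [i * gap]                          # empty prefix of t: only gaps
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = prev[j] + gap                    # s[i-1] aligned to a gap
            left = curr[j - 1] + gap              # t[j-1] aligned to a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(score_alignment("VIVALASVEGAS", "VIVADA-V--IS"))     # 7 matches, 2 mismatches, 3 gaps -> 2
print(needleman_wunsch_score("VIVALASVEGAS", "VIVADAVIS"))  # 2
```

Under this assumed scheme the hand-made alignment of VIVALASVEGAS and VIVADAVIS already reaches the optimal score. The sketch keeps only two rows of the table, which is enough for the score; recovering the alignment itself would require storing the full table for a traceback.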
The Needleman-Wunsch algorithm
This works for DNA sequences as well as for protein sequences.

The substitution matrix
A more realistic scoring function is given by the biologically inspired substitution matrix:

       A     G     C     T
A     10    -1    -3    -4
G     -1     7    -5    -3
C     -3    -5     9     0
T     -4    -3     0     8

Examples:
* PAM (Point Accepted Mutation) (Margaret Dayhoff)
* BLOSUM (BLOck SUbstitution Matrix) (Henikoff and Henikoff)

Scoring function
The score for aligning the two sequences s = VIVALASVEGAS and t = VIVADAVIS:

           V I V A L A S V E G A S
A(s,t) =   | | | |   |   |       |
           V I V A D A - V - - I S

with +1 for each match and -1 for each mismatch or gap, is:
M(A) = 7 matches + 2 mismatches + 3 gaps = 7 - 2 - 3 = 2

Optimal global alignment
The optimal global alignment A* between two sequences s and t is the alignment A(s,t) that maximizes the total alignment score M(A) over all possible alignments:
A* = argmax M(A)
Finding the optimal alignment A* looks like a combinatorial optimization problem:
i. generate all possible alignments
ii. compute the score M of each
iii. select the alignment A* with the maximum score M*

Local alignment
Local alignment methods find related regions within sequences; they can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.

The Smith-Waterman algorithm
The Smith-Waterman algorithm (1981) determines similar regions between two nucleotide or protein sequences. Smith-Waterman is also a dynamic programming algorithm and builds on Needleman-Wunsch. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme). However, the Smith-Waterman algorithm is demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn) time and space are required. As a result, it has largely been replaced in practical use by the BLAST algorithm; although not guaranteed to find optimal alignments, BLAST is much more efficient.

Optimal local alignment
The optimal local alignment A* between two sequences s and t is the optimal global alignment A(s(i1:i2), t(j1:j2)) of the sub-sequences s(i1:i2) and t(j1:j2) for some optimal choice of i1, i2, j1, and j2.

Sequence alignment - meaning
Sequence alignment is used to study the evolution of sequences, such as protein sequences or DNA sequences, from a common ancestor. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. Sequence alignment also refers to the process of constructing significant alignments in a database of potentially unrelated sequences.
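The local alignment idea described in this lesson can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `smith_waterman` is hypothetical, and the scoring scheme (+1 match, -1 mismatch, -1 gap) is an assumption chosen for simplicity instead of a substitution matrix such as PAM or BLOSUM. The key difference from global alignment is that a cell score is never allowed to drop below zero, so an alignment can start and end anywhere in either sequence.

```python
def smith_waterman(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (best score, aligned part of s, aligned part of t) for a local alignment."""
    rows, cols = len(s) + 1, len(t) + 1
    # Full (m+1) x (n+1) table: this is the O(mn) time and space cost.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                        # start a fresh local alignment here
                H[i - 1][j - 1] + sub,    # align s[i-1] with t[j-1]
                H[i - 1][j] + gap,        # gap in t
                H[i][j - 1] + gap,        # gap in s
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AAAGGGTTT", "CCGGGCC"))  # (3, 'GGG', 'GGG')
```

In the example, the two sequences are unrelated apart from the shared GGG region, which is exactly what the local alignment reports. When several alignments tie for the best score, this sketch returns the first one it encounters.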
