Introduction to Bioinformatics PDF
Winona State University
Chi-Cheng Lin
Summary
This document provides an introduction to bioinformatics. It explains what bioinformatics is, its fundamental concepts, and its applications. The document also touches on the field's relationship to molecular biology and the Human Genome Project. The paper further underscores the role of computing in understanding biological data.
Full Transcript
Introduction to Bioinformatics
Overview
Chi-Cheng Lin, Ph.D.
Department of Computer Science
Winona State University
[email protected]

Outline
What is Bioinformatics
Molecular Biology Primer
The Human Genome Project
Problems Bioinformatics Solves
The -Omics World – Fields in Bioinformatics
Bioinformatics Applications
Conclusion

What is Bioinformatics
No one single definition!

What is Bioinformatics
Living things can store and pass on information
The term “bioinformatics” was first coined by Paulien Hogeweg and Ben Hesper in 1970 as
  – the study of informatic processes in biotic systems
The use of computers to solve biological problems predates the coining of the term, however
“The use of computational tools to organize and analyze genetic and protein sequence data”

Classic Definition of Bioinformatics by NCBI
NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/)
  – “Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline.”

Other Definitions of Bioinformatics
“A scientific subdiscipline that involves using computer technology to collect, store, analyze and disseminate biological data and information, such as DNA and amino acid sequences or annotations about those sequences” (National Human Genome Research Institute)
“The collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics” (Merriam-Webster Dictionary)
“A field of science that uses computers, databases, math, and statistics to collect, store, organize, and analyze large amounts of biological, medical, and health information” (National Cancer Institute)
“The application of computer technology to the understanding and effective use of biological and biomedical data” (Swiss Institute of Bioinformatics)
“an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex” (Wikipedia)
What do they have in common?

Word Cloud
Word cloud generated by TagCrowd (https://tagcrowd.com)

Lin’s Understanding
Bioinformatics = Biology + Informatics
[Diagram: Bioinformatics at the overlap of Life Science, Computer Science, and Statistics]

Why Informatics
Complexity in problems and tons of data …
Discussions:
  – Why bioinformatics; why not biocomputing etc.?

GenBank and WGS Statistics
GenBank: NIH’s genetic sequence database (https://www.ncbi.nlm.nih.gov/genbank/)
WGS (Whole Genome Sequencing): determining “the order of the bases in the genome in one process” (https://www.cdc.gov/pulsenet/pathogens/wgs.html)
[Figure: GenBank and WGS statistics, plotted on a log scale (https://www.ncbi.nlm.nih.gov/genbank/statistics/)]

“Booming” point?
[Figure with the annotation “Rough draft” (Source: https://www.nlm.nih.gov/about/image/2020CJ_fig_5.png)]

What Bioinformatics Does
Development of new algorithms and statistics
  – Assess relationships among members of large data sets
Analysis and interpretation of various types of data
  – Such as nucleotide and amino acid sequences, protein structures, and gene expression
Development and implementation of tools
  – Enable efficient access and management of different types of information
Sources: NCBI

Central Dogma of Molecular Biology
[Figure: central dogma of molecular biology]

Central Dogma of Molecular Biology – Eukaryote
It’s more complex than what’s shown here!
[Figure labels: Transcription, Translation (Protein). Source: https://www.genome.gov/genetics-glossary/Exon]

DNA, Protein, Gene, & Genome
DNA (deoxyribonucleic acid)
  – Genetic material
  – Information storage
  – Information stored in DNA: the basis of inheritance
Protein
  – Functional unit, such as an enzyme
Gene
  – Instructions needed to make a protein
Genome
  – Full DNA sequence in an organism

Nucleotides
Genes themselves contain their information as a specific sequence of nucleotides found in DNA molecules
Only four different bases, i.e., letters, in a DNA sequence
  – Adenine (A)
  – Cytosine (C)
  – Guanine (G)
  – Thymine (T)

DNA Structure and Base Pairing
Structure of DNA
  – Double helix
  – Seminal paper by James Watson and Francis Crick in 1953
  – Rosalind Franklin's contribution
Information content on one strand is essentially redundant with the information on the other
  – Not exactly the same: it is complementary
Base pair
  – G paired with C (G ≡ C)
  – A paired with T (A = T)

[Figure: DNA structure]

Food for Thought …
Biology is a complex system.
  – What skill/technique (we are good at) is applied to studying a complex system?
  – Answer: ____________
The problem domain is biology
  – Problems need to be formulated to be solvable by computers
  – Solutions must be biologically meaningful
  – Avoid garbage-in-garbage-out (GIGO)

Human Genome Project (HGP)
International effort
  – 1990-2003
A driving force of bioinformatics
Goals include
  – Identify genes in human DNA
  – Determine sequence making up human DNA
  – Store this information in databases
  – Improve tools for data analysis
  – Etc.
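
Looking back at the base-pairing rule from the DNA structure slide (G pairs with C, A pairs with T), the claim that one strand is "redundant" with the other can be made concrete with a minimal Python sketch. The function names and the six-letter example sequence are hypothetical, chosen only for illustration. Reading the second strand in reverse reflects the fact that the two strands run in opposite directions, a detail the slides do not spell out.

```python
# Watson-Crick pairing as stated on the base-pairing slide: A-T and G-C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    example = "ATGCGT"  # hypothetical sequence, not taken from the slides
    print(complement(example))
    print(reverse_complement(example))
```

Applying `complement` twice returns the original strand, which is one way to see that either strand carries enough information to rebuild the other.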

Human Genome Project Milestone
  – April 2003: HGP sequencing is completed and the project is declared finished two years ahead of schedule!

Interesting Numbers Characterizing the Human Genome
3 billion:
  – The number of chemical nucleotide base pairs (or bases) contained in the haploid human genome
3 million:
  – The number of locations where single-base DNA differences occur in the human genome
2.4 million:
  – The number of bases comprising the largest known human gene (the average gene comprises 3000 bases)
30,000:
  – The total number of genes estimated (much lower than previous estimates of 80,000 to 140,000)

Interesting Numbers Characterizing the Human Genome (cont’d)
99.9%
  – Fraction of nucleotide bases that are exactly the same in all people
  – You and I are just 0.1% different in our genomes!
50%
  – Fraction of discovered genes for which function is unknown
2%
  – Fraction of genome that codes for proteins (the rest: “junk”(?) DNA)
9%, 11%, 26%, 28%, 45%, 83%, 89%, and 95%
  – The percentage of genes that E. coli, rice, roundworm, yeast, fruit fly, zebrafish, mouse, and chimpanzee share with humans, respectively.
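
Several of the numbers above are linked by simple arithmetic. As a rough check, the Python sketch below converts the quoted percentages into base counts using the 3 billion figure; the variable and function names are illustrative assumptions, and the inputs are the slides' rounded values rather than exact measurements.

```python
GENOME_BASE_PAIRS = 3_000_000_000  # haploid genome size quoted above


def bases_for_fraction(numerator: int, denominator: int) -> int:
    """Number of bases making up numerator/denominator of the genome."""
    return GENOME_BASE_PAIRS * numerator // denominator


# 0.1% of the genome differs between two people (100% - 99.9%).
differing_bases = bases_for_fraction(1, 1000)
# 2% of the genome codes for proteins.
coding_bases = bases_for_fraction(2, 100)

print(f"{differing_bases:,}")  # 3,000,000
print(f"{coding_bases:,}")     # 60,000,000
```

The first result agrees with the "3 million" single-base differences listed above, so the 99.9% and 3 million figures are two views of the same quantity.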

Anticipated Benefits of Genome Research
Molecular medicine
Microbial genomics
Bioarchaeology
Anthropology
Evolution
Human Migration
DNA identification (forensics)
Agriculture, livestock breeding, and bioprocessing

Problems Bioinformatics Solves
Sequence alignment
  – Comparing DNA/RNA/protein sequences
Biological database searches
Gene prediction
Phylogenetics
  – Identifying and understanding evolutionary relationships among organisms
Genome assembly
Protein structure prediction
And many others

Sequence Alignment
Given a gene: TAGCCGTACATCGTGTATAG
  – What does it do?
One approach: Is there a similar gene in another species?
  – Align sequences with known genes
  – Find the gene with the “best” match
Given two sequences, how do we align them to determine how similar they are?
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGATTCGCATCGTCTATCT

Sequence Alignment (cont’d)
Finding the optimal alignment is computationally hard with a brute-force approach:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGAT---TCG-CATCGTC--T-ATCT
There are ~888,000 possibilities to align the two sequences given above.
A good algorithmic technique is needed to reduce the amount of work

Phylogenetics
What are the evolutionary relationships among humans, chimpanzees, baboons, orangutans, and gorillas?
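
Returning to the sequence alignment slides above: one standard way to avoid brute-force enumeration is dynamic programming. The sketch below uses the Needleman-Wunsch global alignment algorithm, which the slides do not name. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, and a different scoring scheme would generally produce a different "best" alignment.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two letters
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # The two sequences from the slide.
    s, top, bottom = needleman_wunsch("ACGTCTGATACGCCGTATAGTCTATCT",
                                      "CTGATTCGCATCGTCTATCT")
    print(s)
    print(top)
    print(bottom)
```

For the slide's sequences of lengths 27 and 20, the table has only 28 × 21 cells, so the work grows with the product of the two lengths instead of with the number of possible alignments.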
[Figure: phylogenetic tree]

Phylogenetics (cont’d)
There are 105 possible trees given 5 data sets
For more data sets …
  Number of Data Sets    Number of Trees
  5                      105
  10                     34,459,425
  15                     213,458,046,676,875
  20                     8,200,794,532,637,891,559,375
These counts are the numbers of rooted binary trees, (2n − 3)!! for n data sets.
Computer algorithms are needed to find the “most-likely” tree, and statistics are used to assess the result

Pancake Flipping Problem
A sloppy chef prepares a stack of pancakes of different sizes out of order. The waiter wants to rearrange them into order (the smallest on top, and so on, and the largest at the bottom) by flipping over several from the top, repeating this as many times as necessary.
[Figure: two stacks of pancakes, numbered from top to bottom 1, 4, 2, 5, 3 (unordered) and 1, 2, 3, 4, 5 (ordered)]

Pancake Flipping Problem (cont’d)
Given a stack of n pancakes, what is the minimum number of flips to rearrange them into an ordered stack?
An upper bound on the number of reversals was given by W. Gates and C. Papadimitriou in the mid-1970s.
  – Who is W. Gates?
Biological significance?
  – The number of reversals required to rearrange the order of genes in one genome into that of another can be used to measure the distance between two genomes (organisms)
  – Genome rearrangement

Genome Rearrangement - Of Mice and Men
The full complement of human chromosomes can be cut into about 150 pieces, then reassembled into a reasonable approximation of the mouse genome. The colors of the mouse chromosomes and the numbers alongside indicate the human chromosomes containing homologous segments. This piecewise similarity between the mouse and human genomes means that insights into mouse genetics are likely to illuminate human genetics as well.
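
The pancake flipping problem above can be tried out with a few lines of Python. The sketch below is a simple greedy strategy (bring the largest unsorted pancake to the top, then flip it into place); it is not the Gates and Papadimitriou method and does not promise the minimum number of flips, only that it never needs more than 2n − 3 of them. The function names and the example stack are assumptions for illustration.

```python
def flip(stack: list, k: int) -> None:
    """Reverse the top k pancakes in place (index 0 is the top of the stack)."""
    stack[:k] = stack[:k][::-1]


def pancake_sort(stack: list) -> tuple:
    """Sort a copy of the stack, smallest on top; return (sorted_stack, flips).

    flips lists how many pancakes were flipped at each step.
    """
    work = list(stack)
    flips = []
    # Place the largest remaining pancake at the bottom of the unsorted part.
    for size in range(len(work), 1, -1):
        largest_at = work.index(max(work[:size]))
        if largest_at == size - 1:
            continue  # already in place
        if largest_at != 0:
            flip(work, largest_at + 1)  # bring the largest to the top
            flips.append(largest_at + 1)
        flip(work, size)                # flip it down into position
        flips.append(size)
    return work, flips


if __name__ == "__main__":
    example = [1, 4, 2, 5, 3]  # hypothetical stack, top to bottom
    ordered, flips = pancake_sort(example)
    print(ordered)
    print(flips, len(flips))
```

Each flip is a reversal of a prefix of the sequence, which is the same kind of operation used to compare gene orders in genome rearrangement, where the number of reversals serves as a distance.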
Source: http://www.ornl.gov/sci/techresources/Human_Genome/publicat/tko/06_img.html

Genomics and the Companies – the “-omics” World
Genomics
  – Study of the whole genome
  – Sequencing and annotation of genomes
Comparative genomics
  – Comparison and characterization of genomes from different species to identify genes and their functions and to investigate evolutionary history
Functional genomics
  – Understanding the function of genes and other parts of the genome
Structural genomics
  – Determining the 3D structure of all proteins
Pharmacogenomics
  – Study of how an individual's genetic inheritance affects the body's response to drugs

The ‘-omics’ World
Proteomics
  – Study of the complete set of proteins in an organism, tissue, or cell
Transcriptomics
  – Study of the complete set of mRNA (transcripts)
  – Examination of expression level of mRNA
  – DNA microarrays used as a tool
Spliceomics
  – Study of the set of all possible alternatively spliced mRNA and proteins in an organism
Interactomics
  – Study of whole set of molecular interactions in cells
  – E.g., Protein network in the context of proteomics
Metabolomics, phenomics, systems biology, microbiomics, metagenomics, etc.

Bioinformatics Applications
Genome sequencing
Building tree of life
CSI
Medical applications
Human migration
… etc.

Genome Sequencing
[Cartoon: Drew Sheneman, New Jersey -- The Newark Star Ledger. Source: http://cagle.msnbc.com/news/gene/gene14.asp]

Building Tree of Life
[Figure. Source: http://oceanexplorer.noaa.gov/explorations/06fire/background/microbiology/media/universal_tree2hotbugs.html]

Today, biological data are generated at an unprecedented rate
Human Genome Sequencing
  – Human Genome Project vs.
  – Next-generation sequencing (NGS), i.e., massively parallel sequencing
  1990                     2009
  Human Genome Project     Commercial Personal Genome Sequencing Service
  International Efforts    High-Throughput Sequencing Machine
  13 Years                 4 Weeks*
  $2.7 billion             $48,000

Today, biological data are generated at an unprecedentedly low cost
[Figure. Source: https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost]

Tackle Future Challenges: Bioinformatics
High volume of data to store, compute, and analyze
Huge amount of information to retrieve, interpret, and visualize
Complex system to study, model, and simulate
THAT’S WHERE BIOINFORMATICS THRIVES!!

What Computer Science Contributes
Algorithms and Data Structures
Programming Languages
Software Design/Development
Internet
World Wide Web
Databases
Data Management
Workflow Design and Development
Data Visualization
Image Processing
Data Mining
Natural Language Processing
AI
Machine Learning
Cloud Computing
and so on

What is Bioinformatics (Revisited)
Bioinformatics = Bio + Informatics
Bioinformatics is the “+” in the equation
  – Bio side
      Understand what computers CAN and CANNOT do
      Understand the rationales behind algorithms used to develop tools
  – CS side
      Understand the complexity and uncertainty in biological systems
      Solutions must be biologically meaningful
      Avoid GIGO

Future – 50 or 500 Years?
“Biology has at least 50 more interesting years.”
  James D. Watson, Nobel laureate, December 31, 1984
“Biology easily has 500 years of exciting problems to work on, …”
  Donald Knuth, a world-renowned computer scientist, in Computer Literacy Interview by Dan Doernberg, December 7th, 1993

References
NCBI (National Center for Biotechnology Information) homepage: http://www.ncbi.nlm.nih.gov/
National Human Genome Research Institute: http://www.genome.gov
Human Genome Project Information (esp. link to the Education module): http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
Genomics and Its Impact on Science and Society - The Human Genome Project and Beyond: http://web.ornl.gov/sci/techresources/Human_Genome/publicat/primer2001/primer11.pdf