Computational Molecular Microbiology (MBIO 4700) Lecture Notes PDF

Summary

These notes provide an introduction to the concepts of bioinformatics, data structures, algorithms, and biological databases. The lecture notes cover fundamental topics, relevant to the study of computational biology and molecular microbiology. Topics include Markov chains and Hidden Markov Models.

Full Transcript

Computational Molecular Microbiology (MBIO 4700) ABDULLAH ZUBAER UNIVERSITY OF MANITOBA Concepts of bioinformatics • Computational molecular microbiology and its relation to other disciplines • Data (types and structures) • Database • Algorithm • Bayesian statistics • Markov and Hidden Markov Mode...

Computational Molecular Microbiology (MBIO 4700) ABDULLAH ZUBAER UNIVERSITY OF MANITOBA Concepts of bioinformatics • Computational molecular microbiology and its relation to other disciplines • Data (types and structures) • Database • Algorithm • Bayesian statistics • Markov and Hidden Markov Models • Monte Carlo simulations 2 What is bioinformatics There are many different definitions of bioinformatics. Here are some examples: The retrieval and analysis of biochemical and biological data using mathematics and computer science, as in the study of genomes. (Dictionary.com) The collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics. (Merriam-webster.com) 3 What is bioinformatics There are many different definitions of bioinformatics. Here are some examples: Bioinformatics, as related to genetics and genomics, is a scientific subdiscipline that involves using computer technology to collect, store, analyze and disseminate biological data and information, such as DNA and amino acid sequences or annotations about those sequences. (NIH; https://www.genome.gov/genetics-glossary/Bioinformatics) 4 What is bioinformatics There are many different definitions of bioinformatics. Here are some examples: Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. (Wikipedia; https://en.wikipedia.org/wiki/Bioinformatics) 5 What is bioinformatics ❖ Biological data and Information (=processed data) ❖ Computer ▪ Genetics, genomics, data storage and analysis, interdisciplinary 6 Biology Bioinformatics Computer Science Statistics Mathematics What is data ▪ Any value or any observation is data (singular datum) ▪ Latin word “datum” means something “given” 8 Types of data ▪ Character ▪ String of characters ▪ Integer ▪ Floating point numbers (or float) ▪ Boolean etc. 9 Data Structure ▪ The logical or mathematical model of a particular organization of data is called a data structure. ▪ Linear – arranged in linear or sequential fashion ▪Linked List, Array, Stack (LIFO), Queue (FIFO) etc. ▪ Non-linear ▪Tree, graph etc. 10 Database ▪ An organized collection of structured data ▪ A set of data tables connected to identifiers ▪ Commonly known as relational database ▪ Managed by a software called Database Management System (DBMS) 11 https://www.codecademy.com/learn/how-do-i-make-and-populate-my-owndatabase/modules/designing-a-database-schema/cheatsheet 12 Biological data ▪ Sequence of DNA, RNA, Protein ▪ Genome sequence ▪ Coordinates of structure ▪ Biological pathway ▪ Gene expression data 13 Biological Data ribosomes and RNAs RNA rRNA, tRNA, mRNA, noncoding RNAs etc. Metabolites Enzymes Pathways Regulatory & Structural Proteins and metabolites => phenotypes Adapted from https://doegenomestolife.org/ 14 Biological databases ▪ NCBI, EBI, DDJB ▪ Gene database: GenBank ▪ Protein database: NCBI Protein, UniProt, PDB ▪ Genome database: NCBI Genome, Ensembl, UCSC ▪ Pathway database: KEGG 15 Bioinformatics tools ❑ Web based ❑ Desktop based 16 Algorithm • Invented by Muḥammad ibn Mūsā alKhwārizmī • An algorithm is a well-defined list of steps for solving a particular problem. • An algorithm usually involves a particular data structure 17 Algorithm • Add numbers from 1 to 100 • Guess the number (range 1 to 100) 18 Flow chart symbols https://www.smartdraw.com/flowchart/flowchart-symbols.htm 19 Algorithm G U ES S T HE N U M BE R ( R AN GE 1 TO 1 0 0 ) https://www.chegg.com/homework-help/questions-and-answers/programming-challenge-random-numberguessing-game-attached-files-random-number-game-flowch-q70899923 20 Bioinformatics Algorithms • Exhaustive search (or brute force) • Greedy algorithms • Dynamic programming algorithms • Divide and Conquer algorithms • Hidden Markov Model (HMM) etc. Jones and Pevzner. 2004. An introduction to bioinformatics algorithms. 21 Bioinformatics Algorithms • Exhaustive search (or brute force) • Greedy algorithms • Dynamic programming algorithms • Divide and Conquer algorithms • Hidden Markov Model (HMM) etc. Jones and Pevzner. 2004. An introduction to bioinformatics algorithms. 22 Bioinformatics Algorithms • Exhaustive search (or brute force) • Greedy algorithms • Dynamic programming algorithms • Divide and Conquer algorithms • Hidden Markov Model (HMM) etc. Jones and Pevzner. 2004. An introduction to bioinformatics algorithms. 23 Hidden Markov Model • Let’s start with Markov chain • Invented by Andrey Markov 24 Markov chain • A Markov chain is a stochastic (random) process with the Markov property • Markov property: “memorylessness” • The probability distribution of the next state depends only on the current state and not on the sequence of events that occurred before that 25 Hidden Markov Model (HMM) • An HMM is a Markov process (more general term for stochastic models with the Markov property, like Markov chains) that has hidden states • What is the meaning of hidden states? 26 Hidden Markov Model (HMM) 0.3 Transition probabilities 0.7 Conditions that may Low affect outcome High 0.2 0.6 Outcomes (visible): 0.4 Rain www.cedar.buffalo.edu/~govind/CS661/Lec12.ppt 0.8 0.4 0.6 Dry 27 Hidden Markov Model (HMM) 0.3 States/conditions that affect outcome Transition probabilities 0.7 Low High 0.2 0.6 Outcomes (visible): 0.4 Rain www.cedar.buffalo.edu/~govind/CS661/Lec12.ppt 0.8 0.4 0.6 Emission probabilities Dry 28 HMM example • One can describe any sequence (including genes) by HMM • Including differentiating between different features within the sequence • Example: Eddy S. R. (2004). What is a hidden Markov model?. Nature biotechnology, 22(10), 1315–1316. https://doi.org/10.1038/nbt1004-1315 29 HMM example Gene – composed of nucleotides (A,C,G,T) – VISIBLE Gene – composed of introns and exons (not VISIBLE) Start and Stop etc. many features are not visible but maybe can be “predicted”. STATE: a nucleotide can be either part of an intron or exon etc. https://www.magazinescience.com/en/biology/eukaryotic-gene-structure/ 30 Simple HMM toy: “looking for features” GT…………introns AT rich………..AG Emission probabilities State to state: Transition probabilities CHAIN- stochastic (based on probabilities) Visible Hidden HMM chain (27 transition P and 26 emission P) Multiply all 53 P (log) = -41.22 (use natural log) Above sequence – scanned for a “G” followed by an AT rich region, i.e. possible 5’ splice site. States: exon (E), splice site (5’), intron (I) (Eddy 2004) 31 Emission probabilities “Training” Sequence is the visible outcome! Each nucleotide can be assigned to a “state” What is the “probability” for a nucleotide to be part of an exon or intron? Moving from nucleotide to the next nucleotide -> transition probabilities Emission is “outcome” – probabilities of nucleotide composition observed in sequence(s) for exons and introns and 5’ (and 3’ splice sites). (Eddy 2004) 32 Homework: GTRAGT --introns AT rich………..AG 3’ 1.0 E END TTAATTAAGCCCGGG State path: logP for your HMM chain? 3’ = 3’ splice site – for this example emission probability For 3’ splice site = 0.95 for G (and 0.05 for C) END = 1.0 (like start = 1.0) Generate a more complete “toy” HMM for the gene above: Define the “states”, state path, emission and transition probabilities. Calculate logP - show all your work! (Eddy 2004) 33 Project idea Think PROJECT (Think – I am Scientist!): Start with an alignment, use sequence logos etc. to find “conserved features”; does NCBI graphical display offer any information? Try HMMER? Use a Phylogeny to learn about the evolution of the gene/protein sequence (evidence of paralogs, horizontal gene transfers etc. ?) ~ gene tree – does it follow the expected “species tree”? Use protein (or RNA) folding tools to learn about the product of the gene Use other tools to explore the gene or protein – evidence of intrinsic disorder, membrane spanning regions, type of domains etc. Project idea Read about your gene/protein – what is known about your gene/protein? Can you apply tools to evaluate “selection”, observe co-evolution of residues within your sequence? How do your various analyses integrate with each other – what questions can you now address? (Take a look at the “inspiration papers” or the RNA lecture and the ITS rDNA example). Think of “in silico” analyses/investigations as performing a series of experiments. Each “experiment” may reveal some interesting findings about the protein / gene you are studying.

Use Quizgecko on...
Browser
Browser