Lecture 1: Introduction to Computational Genomics and Systems Biology

Summary

This document is a lecture for an undergraduate course on computational genomics and systems biology. It covers models and methods for single-cell RNA-seq analysis, efficient sequence comparison, somatic genomics, and an introduction to AI in biology. The lecture also discusses course policies, logistics, and a grading rubric.

Full Transcript

Introduction to Computational Genomics and Systems Biology
Vanessa D Jonsson
University of California, Santa Cruz
Lecture 1, BME 230A, Winter 2025

Instructors
○ Vanessa Jonsson
○ Benedict Paten
○ Josh Stuart

Teaching assistants
○ Lydia Mok (Jonsson)
○ Gabriel Penunuri (Paten/Stuart)

Class overview
○ Models and methods for single cell RNA seq analysis (Jonsson/5 weeks)
○ Efficient sequence comparison (Paten/2.5 weeks)
○ Somatic Genomics (Stuart/1 week)
○ Intro to AI in biology (Stuart/1.5 weeks)

Course policies and logistics
○ Lectures are Tu/Thu, 3:20 - 4:55pm, PhysSciences 110. Lectures will be posted on Canvas.
○ There will be 3 problem sets; they will be posted on Canvas and submitted on Canvas.
○ Jupyter notebooks will be posted on Canvas and submitted on Canvas.
○ You may collaborate on all of the homework sets except for two (the midterm and final sets). Please collaborate! Remember to record who you worked with and their contributions.
○ Questions and discussion of homework problems will take place on Discord.
○ Some problems will require rudimentary programming. Such problems will be assigned along with Google Colab or Jupyter notebooks that you will modify.

Grading rubric
○ 15% project proposal presentation (3 slides / 3 mins each: background/problem/proposed work)
○ 25% in class Jupyter notebooks (submitted at end of lecture day)
○ 10% Homework 1 (Jonsson/Mok theoretical + practice, due Jan 21)
○ 10% Homework 2 (Jonsson/Mok theoretical + practice, due Feb 11)
○ 10% Homework 3 (Paten, due Feb 11)
○ 30% Final Project (written report 15%, oral presentation 15%)

Models and methods in single cell RNA seq overview
○ Introduction
○ Dimensionality Reduction/Clustering
○ Modeling Counts (statistical distributions)
○ Generalized Linear Models
○ Variance Stabilization
○ Differential Testing
○ Multiple Testing, Part 1
○ Multiple Testing, Part 2
○ Biophysical Models, Part 1 (AM 115 SP24), subject to change
○ Biophysical Models, Part 2 (AM 115 SP24), subject to change

Acknowledgements
Several slides from and inspired by Lior Pachter, Bren Professor, Caltech.

Learning objectives
○ Understand the scope of the class
○ Learn topics that will be covered in first 5 weeks
○ Understand what a gene expression matrix is
○ Learn that a matrix is a linear map
○ Learn about combining data and Simpson's paradox
○ Learn about the purpose of single cell RNA seq
○ Learn about the single cell RNA seq experimental and analytical pipeline

What the course is about
Computational biology: we will study models and methods that are used in the study of biology and for interpreting biological data.
Some warnings:
○ There will be a lot of jargon (underlined so you notice).
○ Outcomes will depend on the degree to which prerequisites have been mastered (we’ll provide reading suggestions to help fill in gaps, and have also organized the course to maximize accessibility in light of variable mastery of prerequisites).
○ Applying computational biology requires domain specific knowledge (we’ll focus the course on one area for pedagogical purposes).

Be careful with jargon
Computational biology draws on concepts, ideas, methods, and notation from multiple areas, including biology, bioengineering, computer science, electrical engineering, mathematics, and statistics. Words and phrases can vary in meaning within and between these fields. This can be confusing.
Example from the titles of two different papers (not the same!):
○ “Porcine MYF6 Gene: Sequence, Homology Analysis, and Variation in the Promoter Region”
○ “Persistent Homology Analysis of Brain Artery Trees”

(Ideal) Prerequisites
○ Mathematics: (single variable and multivariable) calculus, linear algebra, measure theory (real analysis), discrete mathematics (combinatorics).
○ Probability and Statistics: probability theory, applied statistics, theoretical statistics.
○ Computer Science: programming, algorithms and data structures, principles of software engineering.
○ Biology: survey of main areas (molecular and cell biology, immunology, neuroscience, evolution, biophysics, etc.), logic and methodology, domain specific knowledge.
Each of these prerequisites may require several years of study. Thus...

Focus for the single cell course
Rather than try to cover every tool and method in computational biology, we will focus on ideas and concepts that are common to many applications. In order to facilitate a coherent presentation, we will focus on only one application: single-cell RNA-seq. More on this in Lecture 2. The main object of study will be the gene expression matrix. Background reading and materials for further study will be listed in a “Reference” slide at the end of each slide deck for each lecture.

Single-cell RNA-seq
Single-cell RNA-seq is neither single, cell, nor RNA (Lecture 2). So what is it? Single-cell RNA-seq refers to a group of (constantly improving) technologies and analysis tools that
○ start with an INPUT of cells,
○ OUTPUT a (proxy for a) gene expression matrix.
[Figure: gene expression matrix with axes labeled genes and cells]

What is a gene expression matrix
○ Expression: A term referring to the process by which information in a gene is used to generate protein or non-coding RNA product.
○ Gene: A term coined by the Danish botanist Wilhelm Johannsen (1857-1927) in 1909. It has no precise definition, and its meaning has been evolving since it was introduced.
○ Matrix: A rectangular array of numbers used to represent a linear map.

The meaning of “gene expression”
[Figure by Crick from 1956; see also Crick, 1958]

Watson’s Confusion
[Figure: James Watson; Watson, 1965]
This is not the central dogma of biological information; it is a reaction network for biopolymers.

A reaction network of biopolymers
Differential Equations (formerly AM115)

Deciphering the “genetic code”
○ 1953: Franklin, Crick and Watson discover the double helix structure for DNA, and Crick and Watson publish it.
○ 1954: George Gamow founds the RNA tie club. The club consists of 20 regular members and 4 honorary members. Their motto: “Do or die; or don’t try”
○ 1955: Francis Crick suggests the “adaptor hypothesis”.
○ 1956: Gamow considers how many nucleotides are necessary to code for one amino acid.
[Photos: Rosalind Franklin, George Gamow]

$4^2 < 20 < 4^3$
“It is assumed in one of the more popular theories of protein synthesis that amino acids are ordered on a nucleic acid strand (see, for example, Dounce) and that the order of the amino acids is determined by the order of the nucleotides of the nucleic acid. There are some twenty naturally occurring amino acids commonly found in proteins, but (usually) only four different nucleotides. The problem of how a sequence of four things (nucleotides) can determine a sequence of twenty things (amino acids) is known as the ‘coding’ problem.” - Crick, Griffith and Orgel, PNAS 1957.

A sense and nonsense proposal
There must be a mechanism for determining “frame”: Crick, Griffith and Orgel guessed that this may be accomplished via a “comma free code”.
That is, an assignment of a subset of nucleotide triplets to sense codons such that any sequence of successive sense codons only has nonsense codons in the shifted positions.

Two simple observations, a question and a theorem
○ Repeats cannot be sense triplets. For example, AAA is nonsense because the frame cannot be determined in the sequence AAAAAA.
○ Shifts of sense triplets must be nonsense. For example, if AGC is a sense triplet, then GCA and CAG cannot be. This is because the frame must be recognizable in the sequence AGCAGC.
Question: what is the maximum size of a triplet comma free code on a four letter alphabet?
Theorem (Crick, Griffith, Orgel, 1957): The maximum size of a triplet comma-free code on a four letter alphabet is 20.

The genetic code
Determined in a series of experiments using ideas and techniques developed by Marianne Grunberg-Manago, Robert Holley, Har Khorana, Marshall Nirenberg, and Maxine Singer.
○ Marianne Grunberg-Manago: polynucleotide phosphorylase; synthesize RNA without DNA template
○ Robert Holley: structure of tRNA; adapter molecule decoding codons during translation
○ Har Khorana: synthesis of RNA repeated sequences; determined codons; confirmed triplet nature
○ Marshall Nirenberg: first codon; developed in vitro translation system
○ Maxine Singer: additional codons; synthetic nucleic acids
[Figure: the genetic code]

There’s plenty of room at the bottom
Missed opportunities (what the RNA tie club men didn’t end up doing):
○ Understanding of triplet code by Crick
○ Group didn't crack the genetic code
1959, Feynman: there’s plenty of room at the bottom.
○ Lecture at Caltech discussing potential of manipulating matter at atomic and molecular level
○ Vision: Build machines on scale of molecules (synbio, CRISPR)
○ Storage of information in molecular systems (genetics)
Let’s head straight to the bottom…
[Photo: Richard Feynman]

What is a gene?
○ Gene expression is influenced by epigenetic modifications; genes are part of larger regulatory networks.
○ A single gene can produce multiple RNA and protein isoforms through alternative splicing.
○ Non-coding genes can produce functional RNA molecules (tRNA, rRNA, miRNA).
○ Not contiguous stretches of DNA
○ Regulatory regions
From M.B. Gerstein et al., What is a gene, post-ENCODE? History and updated definition, 2007.

Why a gene expression matrix (and not table)
A table is an array representing a set of points:
$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

A matrix is code for a (linear) function:
$M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$

How a matrix describes (is code for) a function
$\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$
$\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$
$\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = \begin{pmatrix} 6 \\ 2 \end{pmatrix} = M(x)$

The rank of a matrix is...
… the dimension of the image.

Singular Value Decomposition
Any matrix can be decomposed into its singular value decomposition:
$M = U \Sigma V^{T}$
○ U, V are orthogonal matrices of left/right singular vectors
○ Sigma is a diagonal matrix of singular values
Where is SVD widely used?

Recall PCA (BME 205)
Find principal components describing directions of maximal variance, then approximate new data by "best" lower rank matrix.
[Figure labels: lower rank projected data; 2 dimensions; 3 dimensions]

Principal component analysis can be derived by SVD
[Figure labels: lower rank approximation of original data matrix; 2 dimensions]

Application to dimension reduction

A (statistics) convention for tables
The convention for representing data tables in statistics is to use the rows for observations, and the columns for features. Moreover, n is used to represent the number of observations and p the number of features, so that a data table has size n x p.
○ One reason for this convention is the form of regression models, which describe observations as linear combinations of explanatory variables with some added noise using the form: $y = X\beta + \varepsilon$
○ With this matrix notation, X, which is also known as the design matrix, has dimensions n x p.
○ Regression will be discussed in detail in Lecture 3.
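
The matrix slides above can be checked numerically. The following is a minimal sketch in Python with NumPy (the choice of language and library is an assumption, not something the slides specify). It applies the 2 x 2 matrix from the slides to the three vectors shown, computes its rank, rebuilds it from its singular value decomposition, and then runs PCA by SVD on a small random data matrix. The random data and all variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

# The 2 x 2 matrix used in the slides.
M = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# A matrix is code for a linear function: x -> M @ x.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
x = np.array([4.0, 2.0])

Me1 = M @ e1  # first column of M
Me2 = M @ e2  # second column of M
Mx = M @ x    # by linearity, equals 4 * Me1 + 2 * Me2

# The rank is the dimension of the image (the span of the columns).
rank_M = np.linalg.matrix_rank(M)

# Singular value decomposition: M = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(M)
M_rebuilt = U @ np.diag(s) @ Vt

# Keeping only the largest singular value gives a rank-1 approximation.
M_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
rank1_error = np.linalg.norm(M - M_rank1)  # Frobenius norm of the residual

# PCA by SVD on hypothetical data: n = 6 observations (rows), p = 3 features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Xc = X - X.mean(axis=0)  # center each feature
Uc, sc, Vct = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vct[:2].T  # coordinates on the first two principal components
X_rank2 = scores @ Vct[:2]  # rank-2 approximation of the centered data

print("M @ x =", Mx)
print("rank(M) =", rank_M)
print("singular values of M:", s)
print("PCA scores shape:", scores.shape)
```

The data matrix X in this sketch follows the statistics convention described above: rows are observations and columns are features.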
This is the transpose of the conventional approach in mathematics.

Confusion between the mathematics and statistics conventions
The two most popular software packages for single-cell RNA-seq analysis:
○ Seurat (PI studied biology then did a D.Phil in statistics): rows = genes, columns = observations
○ Scanpy (PI studied mathematics/physics then did a Ph.D. in physics/computer science): rows = observations, columns = genes

Summary
○ Expression: A term referring to the process by which information in a gene is used to generate protein or non-coding RNA product.
○ Gene: A term coined by the Danish botanist Wilhelm Johannsen (1857-1927) in 1909. It has no precise definition, and its meaning has been evolving since it was introduced.
○ Matrix: A rectangular array of numbers that is used to represent a linear map.
A gene expression matrix is more than a table, does not quite represent expression, and without context it may not be possible to know exactly what it measures.

Why single-cell RNA-seq?
A bulk RNA-seq experiment produces a gene expression vector.
○ Prior to bulk RNA-seq, bulk measurements of gene expression could be performed using DNA microarrays. (While DNA microarrays are not used much anymore, methods for analysis of DNA microarray data are frequently re-discovered (and republished) as single-cell RNA-seq analysis methods.)
[Figure, Lachmann et al., 2018: Publicly available RNA-seq samples currently available at GEO/SRA for human and mouse compared to available samples collected with the popular Affymetrix HG U133 Plus 2 platform.]

Why single-cell RNA-seq?
While bulk RNA-seq has been popular, there is a problem with analysis of gene expression in bulk: averaging.
○ Simpson's paradox: Aggregating data can lead to contradictory conclusions.
○ Confounding variables
Trapnell et al., 2015

Why single-cell RNA-seq?
While bulk RNA-seq has been popular, there is a problem with analysis of gene expression in bulk: averaging.
○ Robinson's paradox: Increasing resolution paradoxically introduces uncertainties.
Trapnell et al., 2015

Getting to the bottom is about getting to the relevant story
Lack of resolution may mask the importance and relevance of key confounding variables.
In terms of gene expression, “getting to the bottom” (Feynman) means
○ Isoform resolution and cell-type resolution.
[Figure labels: gene-level single-cell RNA-seq; gene-level bulk RNA-seq; cell; transcript salad; isoform-level bulk RNA-seq]

The purpose of single-cell RNA-seq
○ Decompose tissue or organ expression into its constituent parts
○ Identify the different types of cells that comprise tissues and organs
○ Examine the molecular biology of cells via their expression signatures
○ Determine differentiation trajectories of cells
○ Develop expression biomarkers for disease states

Overview of a single-cell RNA-seq experiment and analysis
○ Clustering and the EM algorithm
○ Read alignment
○ Differential analysis and hypothesis testing
○ Variance stabilization and normalization
○ Singular value decomposition
○ Linear and logistic regression
○ Dimensionality reduction
Luecken and Theis, 2019

Single-cell RNA-seq as a theme for computational biology
○ Data processing requires efficient data structures and scalable algorithms; many of the tools used are based on ideas and concepts that are applicable to a broad swath of (DNA sequence based) genomics.
○ The statistical challenges that arise in single-cell RNA-seq data analysis are common throughout the biological sciences.
○ Mathematical models for the molecular biology of the cell are key to interpretation of single-cell RNA-seq experiments. Elements of the models that are used are present in many other applications of mathematical biology.
○ The methods that will be taught were chosen for appearing in at least three different (computational) biology applications.
○ Single-cell RNA-seq data is plentiful and for the most part publicly available.
○ Single-cell RNA-seq technologies are becoming omnipresent in biology labs.

Single-cell RNA-seq as a theme: limitations
○ There are many interesting areas in computational biology that will not be covered in this part of the course.
○ Single-cell RNA-seq data is almost exclusively count data. There are many other data types that are important in the biological sciences and we will not have the time to discuss the computational biology that pertains to their analysis.
○ This part of the course is a survey of a handful of topics that, while important, are incomplete even for single-cell RNA-seq analysis.

Discord
Please post questions on Discord. We (the TAs and I) will read the Discord posts regularly and try to respond promptly. Note that Discord will serve as a general discussion board not just between instructors and students but also among students.

Additional References
Three examples of Simpson’s paradox in computational biology:
○ Singer et al., Controlling for conservation in genome-wide DNA methylation studies, 2015.
○ Franks et al., Post-transcriptional regulation across human tissues, 2017.
○ Freitas, Investigating the role of Simpson’s paradox in the analysis of top-ranked features in high-dimensional bioinformatics datasets, 2020.
Visual matrix multiplication.
Videos of Gilbert Strang’s course on linear algebra, 2010.
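
As a small numerical companion to the Simpson's paradox references, the sketch below uses made-up numbers (not data from any of the cited papers) for the expression of two hypothetical genes in cells of two hypothetical cell types. Within each cell type the two genes are positively correlated, but pooling the cells and ignoring the cell-type label, as an aggregated analysis effectively does, reverses the sign of the correlation. Python with NumPy is an assumed choice of tooling.

```python
import numpy as np

# Hypothetical toy data: expression of gene x and gene y in three cells per cell type.
x_type1 = np.array([1.0, 2.0, 3.0])
y_type1 = np.array([10.0, 11.0, 12.0])
x_type2 = np.array([7.0, 8.0, 9.0])
y_type2 = np.array([2.0, 3.0, 4.0])


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Correlation within each cell type (the confounding variable is held fixed).
r_type1 = pearson(x_type1, y_type1)
r_type2 = pearson(x_type2, y_type2)

# Correlation after aggregating all cells and discarding the cell-type label.
x_all = np.concatenate([x_type1, x_type2])
y_all = np.concatenate([y_type1, y_type2])
r_pooled = pearson(x_all, y_all)

print("within cell type 1:", r_type1)
print("within cell type 2:", r_type2)
print("pooled:", r_pooled)
```

Here cell type plays the role of the confounding variable: it shifts both genes at once, so the pooled trend reflects the difference between cell types and not the relationship within either one.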
