Genomics & Databases: Lecture 2 (Measuring DNA in Modern Biology) PDF

Summary

This lecture covers modern methods for measuring and analyzing DNA, including sequencing, PCR, and restriction enzymes, and their applications in genomics and other fields. The lecture also discusses the history of DNA measurement techniques and the role of genomics databases.

Full Transcript

Genomics & Databases: Lecture 2
Measuring and Analyzing DNA: The Modern Approach
https://upload.wikimedia.org/wikipedia/commons/1/18/DNA_sequence.svg
Assoc. Prof. Evan WILLIAMS
Williams Lab: wwwen.uni.lu/lcsb/research/gene_expression_metabolism
ISB101: Genomics & Databases
23 September 2024
Luxembourg Centre for Systems Biomedicine (LCSB)

Overview of Lectures, Week 1
Genomics & Databases: Week 1 Lectures
1. History & Basics of DNA & Genomes: A Recap (Hopefully)
2. Measuring & Analyzing DNA in Modern Biology
3. Genomics & Databases: Reference Genomes & Assembly
4. DNA in Practice: Personalized / Precision Medicine
5. Genetic Causality: GWAS, QTLs, and Beyond
6. Understanding Traits: Mendelian, Complex, & Heritability
7. Intro to Commonly-Used Genetics Approaches & Analyses
8. Complementary Applications of Genomics (Epigenomics, Metagenomics, …)
9. Genomics in Common Model Organisms: From E. coli to Yeast to Mice
10. Genetic Modifications: Mutagenesis, Cre-Lox, CRISPR, & More
11. Looking Forward: Beyond Genomics

Lecture Overview
Measuring & Analyzing DNA in Modern Biology
1. History of the measurement & experimental analysis of DNA
2. Modern methods for sequencing DNA
3. Basic structure & analysis of modern sequence data
4. Intro to genomics databases & linking DNA to trait outcomes

History of Measuring DNA in Biology & Medicine
1953: Structure of DNA determined (awarded 1962 Nobel prize)
1965: First nucleotides sequenced (awarded 1968 Nobel prize)
1972: First gene sequenced (bacteriophage MS2 coat protein)
1975: Southern blotting developed (gel electrophoresis for DNA)
1977: Sanger sequencing developed (awarded 1980 Nobel prize)
1981: DNA microarray developed
1995: First bacterial genome sequenced (H. influenzae)
1998: First multicellular organism sequenced (C. elegans)
2001: First “full” human genome published (Nature, PMID 11237011)
2005: Next Generation Sequencing (NGS: modern DNA-seq)
2010s: Much better and exponentially cheaper sequencing
2015: Development of single cell sequencing

Isolating DNA From Cells
This is not a biochem course, but if we want to study DNA, how do we isolate it?
Step 1: Homogenize (cell sample + buffer + glass beads; homogenize)
Step 2: DNA isolation by lysis & phase separation (chloroform, vortex, centrifuge, phase transfer, isopropanol, vortex, centrifuge)
Step 3: DNA cleanup (discard supernatant, ethanol, vortex, centrifuge, dry, resuspend)
The DNA is then ready to be used & is very stable.
It can get more complicated if you need to use small amounts of tissue or degraded material (e.g. from mammoths or mummies).
https://pubmed.ncbi.nlm.nih.gov/25606001/

Measuring & Quantifying DNA: Gel Electrophoresis
Separates multiple samples’ DNA, RNA, or protein in lanes according to size; bigger molecules migrate slower & remain at the top (near the loading area). Small molecules migrate further down.
[Figure: sample, extraction (DNA, RNA, protein), electrophoresis]
https://www.labmanager.com/insights/southern-vs-northern-vs-western-blotting-techniques-854
https://www.123rf.com/photo_87490334_table-with-equipment-for-gel-electrophoresis-at-biochemical-laboratory.html

Gels vs. Membranes
You run the gel first, then if you need to label something, you transfer to a membrane.
After the membrane transfer, you can “stain” the membrane with particular markers of interest (e.g. antibodies, radio/chemical-labeled probes).
[Figure: a gel and the corresponding labeled membrane, each with a ladder and lanes 1-9 of genotype CC, CT, or TT]
Which band contains my gene of interest?

Detecting Specific DNA Variants: Disease Detection
Used for genotyping, e.g. detecting if a variant is in an individual.
Let’s run a Southern blot comparing designed DNA molecules of sequence:
Reference TACCACGTAGACCGAGGACTCCTCTTCAGA
Variant   TACCACGTAGACTGAGGACTCCTCTTCAGA
[Figure: Southern blot membrane with a ladder and lanes of genotype CC, CT, or TT; one band is flagged as non-specific]
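To make concrete what the blot above is distinguishing: the reference and variant sequences on this slide differ at a single nucleotide. The following is a minimal Python sketch that locates it. The function name `find_mismatches` is hypothetical, and the comparison assumes two equal-length sequences that are already aligned position by position (no insertions or deletions).

```python
def find_mismatches(reference: str, variant: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, variant base) for each mismatch.

    Assumption for this sketch: both sequences have the same length and are
    already aligned, so position i in one corresponds to position i in the other.
    """
    if len(reference) != len(variant):
        raise ValueError("this sketch only compares equal-length sequences")
    return [
        (position, ref_base, var_base)
        for position, (ref_base, var_base) in enumerate(zip(reference, variant), start=1)
        if ref_base != var_base
    ]


# Sequences copied from the slide.
REFERENCE = "TACCACGTAGACCGAGGACTCCTCTTCAGA"
VARIANT = "TACCACGTAGACTGAGGACTCCTCTTCAGA"

print(find_mismatches(REFERENCE, VARIANT))  # [(13, 'C', 'T')]
```

The single C/T difference at position 13 is what the lane labels express: an individual carries two copies of the locus and can therefore be CC, CT, or TT.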
Gels & membranes are only semi-quantitative.
They are typically only used in DNA analysis today for quick QC/prototyping.

Measuring & Amplifying DNA by Polymerase Chain Reactions (PCR)
DNA measurement technologies typically require more than 1 DNA molecule; it is thus necessary to amplify sequences.
If you want to amplify a specific signal, then you add a labeled primer (complementary single-stranded DNA).
Variant TACCACGTAGACTGAGGACTCCTCTTCAGA
PCR uses special “Taq” DNA polymerases from bacteria that are heat-stable and can tolerate up to 97°C!
Amplification is extremely important for improving the signal-to-noise ratio (including for single-cell sequencing!)
The Cell: A Molecular Approach, 8th Ed, p139

Amplifying DNA: PCR
PCR goes through 30+ cycles of heating (denaturation) to split DNA, annealing to bind primers to DNA, then extension to copy the sections connected to primers; each cycle doubles the target DNA sequence.
https://www.genome.gov/genetics-glossary/Polymerase-Chain-Reaction

Amplifying DNA: Primers
A primer is a ~20-30 nucleotide chain that was synthesized to be complementary to a DNA fragment which the scientist wishes to measure.
You always have two primers: forward & reverse, one for each strand of DNA.
https://www.khanacademy.org/science/ap-biology/gene-expression-and-regulation/biotechnology/a/polymerase-chain-reaction-pcr
The amplicon is the complete section between the two primers (and including them), typically ~100-10000 nucleotides long.

Quantitative PCR (qPCR)
Very common method for quantifying targeted DNA (or RNA via cDNA).
A mature & fairly static technology: little change since the mid 2000s, but still used today!
[Figure: amplification curves from a Roche LightCycler 480; x-axis: cycles (0-40); y-axis: fluorescence (483 nm)]
https://www.gene-quantification.de/lc480-brochure.pdf

Applications of qPCR: COVID Testing
PCR testing of infectious diseases (e.g. COVID) uses qPCR.
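Because each PCR cycle ideally doubles the target, the arithmetic behind amplification and behind comparing qPCR samples is just powers of two. The sketch below illustrates this in Python. The function names and the `efficiency` parameter are assumptions introduced for illustration (real reactions rarely double perfectly), and "Ct" here means the cycle threshold: the cycle at which a sample's fluorescence crosses the detection threshold.

```python
def fold_amplification(cycles: int, efficiency: float = 1.0) -> float:
    """Copies produced per starting molecule after `cycles` PCR cycles.

    `efficiency` is a hypothetical knob: 1.0 means perfect doubling every
    cycle (the idealized case on the slide), 0.9 means 1.9x per cycle.
    """
    return (1.0 + efficiency) ** cycles


def fold_difference(ct_a: float, ct_b: float) -> float:
    """How many times more starting template sample A had than sample B.

    Assumes perfect doubling: every extra cycle needed to reach the
    threshold means half as much starting material.
    """
    return 2.0 ** (ct_b - ct_a)


print(fold_amplification(30))   # 1073741824.0, i.e. about a billion-fold
print(fold_difference(25, 32))  # 128.0
```

A sample reaching the threshold at cycle 25 versus one reaching it at cycle 32 therefore started with 2^7 = 128 times more target, under the perfect-doubling assumption.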
COVID GACATATTAGACATAGGATACATACCCTT
The “Ct” value (cycle threshold value) is used to estimate your starting DNA amount.
[Figure: amplification curves; x-axis: cycles (0-45); y-axis: fluorescence (483 nm); Person A Ct = 25, Person B Ct = 32, Person C Ct = ~39-41]
Person A has 2^7 (128x) more target than person B!
1 cycle = doubling of the target, so a higher Ct = less target initially!
Quantification gets less precise after ~30-35 cycles.
2^40 is a ~1 trillion-fold amplification!
https://lifescience.roche.com/en_us/articles/lightcycler-480-system-performance-data.html

Depleting DNA: Restriction Enzymes
Instead of amplifying DNA, you can also cleave specific DNA sequences.
Restriction enzymes are proteins that target short DNA sequences (~4-40 bp) and then induce double strand breaks (DSB), i.e. they break both complementary DNA strands.
They generally target short (4-8 bp) DNA sequences, & thus cut many parts of a genome.
This is the bacterial “immune system”, especially against viruses.
Most restriction enzymes are naturally-derived (though this is changing now).
Cas9 (of CRISPR fame) is a natural bacterial nuclease with a similar defensive role, though it is RNA-guided rather than a classical restriction enzyme.
[Table: restriction enzymes listed by Name, Source, Target, Cut]
https://en.wikipedia.org/wiki/List_of_restriction_enzyme_cutting_sites
Sequencing long DNA molecules was difficult; restriction enzymes were necessary!

Mapping the Genome: Sanger Sequencing
Normal DNA is long; trying to assemble an entire genome linearly is slow & error-prone, ergo: fragmentation (or “shearing”).
Step 1: Many methods to shear: random breaks with sonication, specific breaks with restriction enzymes, or either with chemical mixtures.
Result: thousands of short DNA chains, one copy of each.
Step 2: Historically, amplify DNA by yeast or bacterial artificial chromosomes (YAC or BAC), 10 to 2000 kbp at a time.
More recently, amplify DNA using Whole Genome Amplification (WGA) (first in 1992), e.g. with degenerate primers (less-specific sequence amplification).
Vogel & Motulsky’s Human Genetics 4th Ed, p143
Result: thousands of short DNA chains, many copies of each.

Mapping the Genome: Sanger Sequencing
Step 3: Add primers, DNA polymerase, nucleotides (dNTP), and fluorescent “dideoxy” nucleotides (ddNTP), which stop the copying of the unknown template.
Normal nucleotides (dNTP): 90% to 99.95%
Labeled capping nucleotides (ddNTP): 10% to 0.05%
Vogel & Motulsky’s Human Genetics 4th Ed, p143
[Figure: copies of the template GACTAGATACCGTACGCTATGCTCG terminated at different lengths: GACTAG, GACTAGATACCGTAC, GACTAGATA, GACTAGA, GACTAGA, GACTAGATACCGTACGCTATGCT]

Mapping the Genome: Sanger Sequencing
Fragments: GACTAG, GACTAGATA, GACTAGATACCGTACGCTATGCT, GACTAGA, GACTAGATACCGTAC, GACTAGA
Step 4: The assembled fragments are separated by capillary electrophoresis, so you know their length.
A laser excites the ddNTP fluorophore, so you know the final nucleotide of each fragment.
Amplified, unknown sequence: GACTAGATACCGTACGCTATGCTCG...
What you learned with the 6 assembled fragments: ?????GA?A?????C???????T
https://www.thermofisher.com/lu/en/home/life-science/sequencing/sequencing-learning-center/capillary-electrophoresis-information/what-is-sanger-sequencing.html
Can measure up to ~800 nucleotides (nt) at a time.
Is incredibly slow & wasteful (compared to NGS).

Mapping the Genome: Sanger Sequencing
First, isolate a certain genome region (e.g. a chromosome) & fragment it.
Next, amplify the DNA you want to sequence so that you have multiple copies of each fragment (e.g. WGA, BACs).
Then, copy that fragment at different lengths with labeled nucleotides, run the copies through a capillary, measure the last nucleotide & fragment length, & reassemble.
This takes forever and it is horribly inefficient; that’s why it took decades to assemble the human genome!
Also it doesn’t work well in e.g. chromosome regions with many repeating DNA sequences (since max fragment length is ~800 nt), or at chromosome ends.
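The Sanger read-out described above (capillary electrophoresis gives each fragment's length, the laser gives its final nucleotide) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sanger_readout` is hypothetical, and the (length, final base) pairs are derived here from the six example fragments on the slide, whereas a real instrument measures them directly without knowing the fragment sequences.

```python
def sanger_readout(observations: list[tuple[int, str]]) -> str:
    """Place each observed final base at the position given by fragment length.

    observations: (fragment_length, final_base) pairs, as a capillary run
    plus fluorescence detection would provide. Positions never observed
    remain '?'.
    """
    longest = max(length for length, _ in observations)
    calls = ["?"] * longest
    for length, base in observations:
        calls[length - 1] = base  # a fragment of length n ends at position n
    return "".join(calls)


# The six terminated fragments from the slide (one length occurs twice).
fragments = [
    "GACTAG",
    "GACTAGATA",
    "GACTAGATACCGTACGCTATGCT",
    "GACTAGA",
    "GACTAGATACCGTAC",
    "GACTAGA",
]
observations = [(len(fragment), fragment[-1]) for fragment in fragments]

print(sanger_readout(observations))  # ?????GA?A?????C???????T
```

With only six fragments most positions stay unknown, which is why the reaction needs many copies of the template: enough terminated fragments must accumulate to cover every length.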
Consequently, no one uses Sanger sequencing anymore for unknown genomic regions, and rarely (if ever) even for known genomic regions (i.e. with both primers).

Outdated But Foundational Technologies
(Almost) no one uses Sanger sequencing or Southern blots anymore. Why do we learn about them?
[Figure: images of older technologies]
https://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
https://en.wikipedia.org/wiki/Abacus#/media/File:Kugleramme.jpg
https://classic-sailing.com/wp-content/uploads/2018/05/sextcaptain1.jpg
You never know when an old technology will lead to a new revolution! E.g. restriction enzymes to CRISPR/Cas9!

Next Generation Sequencing (NGS)
So if we don’t use Sanger sequencing today, how do we sequence today?
Step 1: Fragmentation: still the same, typically sonication.
Vogel & Motulsky’s Human Genetics 4th Ed, p143
Step 2: Library preparation: add adaptors to the DNA fragments. Adaptors are sequences that let all fragments be amplified (no need for primers for all DNA sequences).
Adaptors can also include barcodes (labels) & facilitate the final reading of the DNA sequence.
Sample amplification is similar, by modified PCR suitable for small volumes (“ePCR”).

Next Generation Sequencing (NGS)
Step 3: There are several different methods for sequencing the amplified products (3a, 3b, 3c).
https://www.technologynetworks.com/genomics/articles/an-overview-of-next-generation-sequencing-346532
Pyrosequencing was the first common and modern NGS technique.
Ligation sequencing uses fluorescence & is most common now, but there are other methods like proton detection (“ion torrent”) to measure pH, …

Next-Generation Sequencing (NGS)
Each NGS method is still used and has some specific benefits and drawbacks (e.g. maximum read length, accuracy, throughput, …).
Fortunately, the output of all of these machines is more or less equivalent for most end-user biologists: you get a big sequence file.
It is a ton of data, e.g. ~200 gigabases (Gb)/day (one human genome is ~3 Gbp). Consider that it took ~15 years to assemble the first human genome!
You can measure multiple samples simultaneously.
Reads are 150 bp long, and paired-end (“2x”). Paired-end means the fragment is read in both directions.
[Figure: a fragment with Read 1 and Read 2 starting from opposite ends]
https://www.illumina.com/systems/sequencing-platforms.html

The Near-Future in Sequencing
The sequencing field is still rapidly evolving: mostly changes affect price and reliability, but sometimes improvements open up new experimental possibilities:
Single cell sequencing (scDNA-Seq)
Much longer reads (“third generation sequencing”)
DNA methylation sequencing (bisulfite sequencing)
https://en.wikipedia.org/wiki/Single_cell_sequencing
New developments in complementary technologies are necessary too! E.g. multiple displacement amplification (MDA) instead of ePCR for scSeq.

Sequencing Data Today: Assembling “Shotgun” Reads
Amplified DNA fragments are read at random, so you get hundreds of millions of sequences of ~150 nucleotides (150 nt; sometimes called bases).
[Figure: amplified DNA fragments, measured reads, assembly of reads, assembled genome]
Campbell Biology 11th Ed, p441
Figuring out which reads overlap by looking at 150 nt out of 3,000,000,000 bp is a tremendous computational challenge!

De Novo Sequence Assembly
With 21,000,000 reads * 150 nt/read for a human DNA sample, you will measure ~3.2 bn bp (the length of a full human genome).
This will not have the entire genetic code! Some sections will be read multiple times, others not at all.
https://mmg-233-2014-genetics-genomics.fandom.com/wiki/Shotgun_Sequencing
If we want to discover the full genome sequence, we will need to sequence far more basepairs than the pure genome length, since our reads are (semi) random!

De Novo Sequence Assembly: How Much To Do?
3,200,000,000 bp (3.2 Gbp) of DNA-seq data in a human would be a “1x” genome, since humans have 3,200,000,000 bp of DNA.
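The coverage arithmetic on this slide (bases measured divided by genome length) can be sketched as follows. This is a hedged illustration, not a standard tool: the function names are hypothetical, the genome size is the 3.2 Gbp figure from the slide, and the uncovered-fraction estimate assumes idealized, uniformly random reads (a Poisson model), which real sequencing only approximates.

```python
import math

HUMAN_GENOME_BP = 3_200_000_000  # genome length used on the slide


def coverage(n_reads: int, read_length: int,
             genome_bp: int = HUMAN_GENOME_BP, paired: bool = False) -> float:
    """Fold coverage ("Nx"): total bases measured divided by genome length.

    With paired=True, n_reads counts read pairs, each contributing two reads.
    """
    bases = n_reads * read_length * (2 if paired else 1)
    return bases / genome_bp


def reads_needed(target_coverage: int, read_length: int,
                 genome_bp: int = HUMAN_GENOME_BP, paired: bool = False) -> int:
    """Number of reads (or read pairs, if paired) needed for a target coverage."""
    bases_per_read = read_length * (2 if paired else 1)
    return math.ceil(target_coverage * genome_bp / bases_per_read)


def expected_uncovered_fraction(fold_coverage: float) -> float:
    """Fraction of the genome expected to be hit by no read at all.

    Assumption: reads land uniformly at random, so the per-base depth is
    roughly Poisson-distributed and P(depth = 0) = exp(-coverage).
    """
    return math.exp(-fold_coverage)


print(coverage(21_000_000, 150))            # 0.984375, i.e. roughly "1x"
print(reads_needed(10, 150))                # 213333334 single-end reads
print(reads_needed(10, 150, paired=True))   # 106666667 read pairs
print(round(expected_uncovered_fraction(1.0), 3))  # 0.368
```

Under the random-read assumption, a "1x" run would leave roughly a third of the genome unread, which is the intuition behind sequencing far more bases than the genome length.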
A “10x” sequence of a human is ~32,000,000,000 bp measured.
A “10x” sequence of yeast is ~120,000,000 bp measured.
Sources sometimes say “20 million reads”; this can be a variable number of nucleotides sequenced, since reads can be different lengths! (e.g. 100x1, 150x2, …)
For a human sample, measured on a sequencer using 150x1 reads, these terms are equal: 10x =~ 213 million reads =~ 32 Gbp.
If measured as 150x2 reads, then 10x =~ 107 million reads =~ 32 Gbp!
Rule of thumb #1: ~75x is the minimum to assemble a new genome w/o reference (“de novo”)
Rule of thumb #2: ~10x is the minimum to get any variant data w/ reference
Rule of thumb #3: ~30x is more reasonable to get decent variant data w/ reference

Why Is Sequencing A Recent Development?
Cost of a 10x human whole genome sequence (WGS), in US$:
2001: $95,260,000
2011: $7,743
2021: $454
https://ourworldindata.org/grapher/cost-of-sequencing-a-full-human-genome

Reference-Based Sequence Assembly: Databases
Most common organisms (e.g. 25,000 eukaryotes!) have been fully sequenced and the reference genome can be found in online databases.
[Figure: UCSC Genome Browser]
If you want to do genetic analyses on a studied organism, you can usually align to the reference genome.
~10x is minimally OK to comprehensively scan a specific subject for sequence variations compared to a reference.
Aligning is less computationally intensive than assembly.
Some challenges remain!
If a read does not align to the reference, is it a DNA variant, or a read error?
How to sequence samples with DNA from many species together (e.g. intestinal microbiome)?

Genome Assembly: De Novo vs. Reference Alignment
Genome assembly is solving a 150,000,000 piece puzzle with only four colors. Also you are missing some pieces. And you have duplicates of others.
[Figure: De Novo Assembly vs. Reference Alignment]
https://www.youtube.com/watch?v=DMA95nAfIHw&ab_channel=WordofAdviceTV

First Thing to Do With Measured DNA? More Databases!
If we want to understand how variation in DNA leads to variation in traits and diseases, we must first measure & organize the DNA of many people / organisms!
[Figure: 1000 Genomes Project]
So first, one reference human (in 2001).
Next, sequence many more people, align all of these genomes & identify differences.
To connect DNA variants to physical variants, we must use relational databases of physical characteristics, diseases, etc!

Next Thing to Do With Measured DNA? Linked Databases!
Projects eventually have enough data to link specific DNA variations with specific trait outcomes, whether physical, disease, …
Human Gene Mutation Database (HGMD):
Sickle Cell Anemia (rs334, aka E6V)
Cystic Fibrosis (ΔF508)
Huntington’s disease
We will see how this is done in future lectures.
Sep 2023: 265k mutations in 11,194 genes
Sep 2024: 291k mutations in 11,772 genes
“Small” = ≤ 20 bp
https://www.hgmd.cf.ac.uk/ac/stats.php - July 2022

Identifying Our DNA Variants: Sequencing or SNP Arrays
[Figure: isolated DNA is either sequenced (raw reads such as AATAGATACATACGAGACATAGGATA, CCCATACAGATACATACAGACATAAAT, …) or run on a SNP array (“Human Genome 1.0 MTA”), which returns genotype calls: SNP1: A/A, SNP2: A/T, SNP3: A/A, SNP4: G/A, SNP5: T/T, SNP6: C/A, SNP7: C/C, SNP8: A/A, SNP8: G/G, SNP8: A/G]
https://www.cnbc.com/2018/06/15/dante-labs-dna-testing-company-sent-used-kits-to-some-customers.html

How to Get Our DNA Variants: Microarrays
Microarrays can have millions of pre-designed probes (basically, like primers) that look for specific DNA sequences.
Probe: GTCTATTAATAGACACATACATAGGGACCT
[Figure: “Human Genome 1.0 MTA” array chip]
This probe was designed for a known SNP at this position causing disease; if a sample has a “C” here, DNA binds (“hybridizes”) & produces light. If “no C”, no binding, no light.
It is difficult to identify novel variants, as probes are pre-designed. But, you can put ~3 million probes on a chip!

How to Get Our DNA Variants: Array or Sequence?
Genotyping arrays are still cheaper, and their data easier to process, but targets must be determined in advance.
Array data benefits less over time as annotation improves.
Regardless of the method, you will get a list of DNA variants compared to the “reference”.
What can we actually find out, now that we have these variants identified?

Analyzing DNA: Heritage
[Figure: 23andme screenshots]

Analyzing DNA: Heritage
[Figure: 23andme comparison of Son, Mother, and Father]
You get EXACTLY 50% from each of your mother & your father.
But only *roughly* 25% like each grandparent! And roughly 12.5% like each great-grandparent, etc.

Analyzing DNA: Heritage
[Figure: 23andme comparison of Grandmother, Father, Mother, Son, and Son #2]

Meiosis and Recombination
Recombination: chromosomal re-arrangement during meiosis.
One chromosome each from mother & father, with crossover from grandparents!

Medical Utility of Knowing Heritage
[Figure: prevalence of lactose intolerance in Europe; prevalence of sickle cell trait in Africa & ME (darker is more frequent)]
Vogel & Motulsky’s Human Genetics 4th Ed, p543
https://www.ajpmonline.org/article/S0749-3797(11)00626-X/fulltext

Practical Utility of Sequencing DNA

Practical Utility of Sequencing DNA: Current Limitations
[Figure: predictions annotated “Correct”, “Wrong”, “(Very) Wrong”, or “Probably correct (as a kid anyway)”]
https://www.collectorsweekly.com/uploads/2014/06/pam-roses-lilies-1024x884.jpg
https://www.amazon.com/Mattel-Games-Magic-8-Ball/dp/B00001ZWV7

Practical Utility of Sequencing DNA: Current Limitations

Analyzing DNA: Current Limitations
Only 80% of people get useful information from their DNA?! (23andme advertisement, July 2022)
How do we improve this to 100%?
How do we improve the reliability of the meaning of the genetic variant?
How do we increase the number of variants with useful information?
Once we can solve the above problems, how do we deal with healthcare consequences & ethical issues?

Measuring and Sequencing DNA: Summary
Sequencing DNA is fundamental to the study of genetics!
Consequently, there have been many key developments since the beginning of DNA sequencing in the early 1970s: gels, PCR, Sanger sequencing, NGS, and databases.
Many once-key technologies have been rendered mostly obsolete in the past ~10 years due to improvements in next-generation sequencing (NGS).
However, the final sequence data is equivalent regardless of the measurement technology.
The main challenge today with DNA is no longer the sequencing, but databasing and analyzing it to understand which DNA has which function.
DNA can predict our health and physical characteristics: but what help is it if the predictions are only educated guesses, and not deterministic?
