Bioinformatics Lecture 2 PDF
McMaster University
Summary
This document provides a lecture on DNA sequencing technologies. It covers Sanger sequencing, fluorescent dye-termination methods, and capillary electrophoresis.
Full Transcript
BIO4BI3 - Bioinformatics
Lecture 2 – DNA Sequencing Technologies and Applications

Where are we going?
[Figure: course roadmap with boxes for DNA Sequencing, DNA Sequencing Quality Control, DNA Assembly, Read Mapping, Genome Annotation, Expression Analysis, Marker-Trait Associations, Population Analysis, Polymorphism Discovery and Genotyping]

Learning Outcomes
- History of DNA sequencing technology
- Understand the positives and negatives of current sequencing technology
- Learn about innovations in sequencing library preparations
- Use a case study to understand how technology choices influence the outcomes of genome sequencing projects
- Understand the considerations when choosing sequencing technology

Sanger Sequencing
- Referred to as dideoxy sequencing or chain termination sequencing
- DNA extension occurs through the polymerization of dATP, dCTP, dTTP and dGTP using an existing DNA strand as a template
- A radiolabelled primer, dNTPs and nucleotides unable to continue the polymerization (dideoxy-NTPs, or ddNTPs) are mixed in four separate sequencing reactions, one for each ddNTP
- The ddNTPs are randomly incorporated into the new strand, terminating extension and producing radioactive fragments of varying length.
- If enough template is used, chain termination will happen at every nucleotide position
- The fragments are run on a gel to separate them by size and the sequence is read from an autoradiogram

Autoradiogram
- Four separate reactions resolved on an acrylamide gel
- Smaller fragments run towards the ‘bottom’

Fluorescent Dye-Termination
- Instead of radioactivity, a fluorescent dye is attached to each ddNTP
- Each nucleotide is a different colour
- Allows all four reactions to be run in a single lane

Capillary Electrophoresis

Sanger cont’
- The standard against which other types of sequencing are measured
- First sequenced human genome was generated from Sanger sequences
- Read lengths of 800-1000 bp today
- Still widely used today for sequencing small regions or fragments of DNA
- Used in combination with next-gen sequencing technology because of its length, high fidelity and well understood error model

Quality Scores
- A measure of confidence in a base-call
- Originated from the Phred software used to make base-calls from dye-terminator fluorescent sequencing (Ewing and Green, 1998)
- q = -10 log10(p), where p is the probability that the base has been incorrectly called
- At q=20 there is a 99% probability that the base has been correctly called (p = 0.01).
- Useful in removing low quality ends of sequence reads
- Useful in generating consensus sequence from aligned reads
- Useful in polymorphism discovery
- Next-gen platforms are trying to develop measures of quality that are compatible with the Phred q-value.
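The Phred relationship can be checked with a few lines of Python. This is a minimal sketch, not the Phred software itself: it takes p as the probability that a base call is wrong (the Phred convention), and the function names and the simple end-trimming rule are illustrative assumptions rather than anything defined in the lecture.

```python
import math


def phred_from_error_prob(p_error):
    """q = -10 * log10(p), with p the probability that the base call is wrong."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q):
    """Invert the Phred formula: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


def trim_low_quality_ends(quals, min_q=20):
    """Hypothetical trimming rule: drop leading and trailing bases below min_q.

    Returns (start, end) so that quals[start:end] is the retained region.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


if __name__ == "__main__":
    print(phred_from_error_prob(0.01))   # a 1% error probability corresponds to q = 20
    print(error_prob_from_phred(30))     # q = 30 corresponds to a 0.1% error probability
    print(trim_low_quality_ends([5, 12, 30, 35, 40, 18, 9]))
```

Real trimming tools generally use windowed or running-sum rules rather than this simple end scan; the sketch only illustrates how a q-value threshold translates into removing low quality read ends.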

Next-gen Sequencing Technology
- Next-generation sequencing refers to platforms which sequence DNA in a highly parallel manner
  - Sanger -> 1 run = 1 sequence
  - Next-gen -> 1 run = billions of sequences
- No need for bacterial colonies to enable single clone selection
- Input library construction is fast and not complex
- Orders of magnitude faster and cheaper than Sanger sequencing
- 4 main technologies in the market: Illumina, MGI, Oxford Nanopore and Pacific Biosciences, each with different approaches

Next-gen Sequencing Technology
- NGS can be broadly grouped into two types – short read and long read
- Characteristics of each type to keep in mind
  - Short read technologies typically generate much more data per dollar
  - Short read technologies use a strategy of signal amplification for detecting base incorporation
  - Short reads are good at SNP detection while long reads excel at identifying structural variations among individuals
  - Long read tech is primarily used in de novo genome sequencing, full-length transcript sequencing and genome structure analysis
  - Short reads are used mainly in re-sequencing, genotyping, genome structure analysis and transcript abundance

Illumina Sequencing by synthesis
- Currently the absolute market leader at > 80% market share
- Nucleotides are added in a step-wise fashion to the single strand template.
- Nucleotide determination is based on the wavelength of the fluorescence
- Polymerization is interrupted after each nucleotide is added.
- Therefore fluorescence always represents a single base

Bridge Amplification
http://www.pasteur.fr/ip/easysite/go/03b-00002o-00a/recherche/plates-formes-technologiques/technopole-de-l-institut-pasteur/genopole/pf/pf1-genomique/sequencage-tres-haut-debit-gaii-illumina-solexa

Sequencing by Synthesis
http://www.pasteur.fr/ip/easysite/go/03b-00002o-00a/recherche/plates-formes-technologiques/technopole-de-l-institut-pasteur/genopole/pf/pf1-genomique/sequencage-tres-haut-debit-gaii-illumina-solexa

Illumina Sequencing
- Applications include whole genome re-sequencing, small RNA identification, RNA-seq, digital gene expression, ChIP-Seq, ATAC-Seq, Hi-C, genotyping, Skim-Seq, TILLING
- Benefits include very large quantities of data generated (latest offering generates multiple TBs of sequence per run) and no issues with sequencing homopolymers
- A major limitation of this technology was the short length of the reads it generates. Read lengths of 150 bp are common but 300 bp are possible.
- Probability of errors increases at greater read positions
- It has a non-random error model

Overview of library construction. Hirst M, and Marra M A. Briefings in Functional Genomics 2010;9:455-465. © The Author 2011. Published by Oxford University Press. All rights reserved.
http://nextgen.mgh.harvard.edu/CustomPrimer.html

Paired-end Sequencing
https://galaxyproject.github.io/training-material/topics/introduction/tutorials/galaxy-intro-ngs-data-managment/tutorial.html

Illumina Sequencing Libraries
- A sequencing library is a collection of DNA fragments which have been modified to enable sequencing
- The range of Illumina applications is due to the different types of libraries which are created
- Third party companies develop novel library preps to extend how Illumina sequencing can be applied
- The innovation is usually associated with processing DNA fragments such that they receive a unique combination of barcodes
- The barcodes allow reads to be associated with individual samples or fragments of DNA

10X Genomics
- An Illumina library construction technique
- Allows the assignment of individual sequencing reads to a particular starting template molecule(s)
- The aim is to isolate a single cell (or large DNA fragment) together with the components for a library preparation
- Conceptually it is millions of independent Illumina library preparations, each with its own unique barcode (Gel Bead in Emulsion – GEM)
- Benefit is that during sequence assembly you know which reads came from the same starting piece of DNA
- Has extensive use in single-cell RNA sequencing and single-cell ATAC-Seq
- No longer offered for whole genome sequencing due to patent disputes. MGI’s single-tube long fragment read (stLFR) is an equivalent offering

10X Genomics – Single Cell
Zheng, G., Terry, J., Belgrader, P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049 (2017)

10X Genomics Linked Reads
https://wheaton5.github.io/projects/tenx

Hi-C

What is Hi-C
- Hi-C is an extension of the Chromosome Conformation Capture (3C) method.
- It captures the three-dimensional organization of genomes by identifying the physical proximity of chromosomal regions in the nucleus.
- Purpose of Hi-C:
  - To understand how chromosomes fold and interact within the nucleus.
  - To identify topologically associating domains (TADs), enhancer-promoter interactions, and other chromatin interactions.
- Key Reference: First described in Lieberman-Aiden et al., Science, 2009

Hi-C Workflow
- Step 1: Crosslinking and Digestion
  - Cells are crosslinked with formaldehyde to preserve DNA-protein interactions.
  - The DNA is then digested with a restriction enzyme, cutting it at specific sites.
- Step 2: Ligation
  - The sticky ends of the fragmented DNA are filled in with biotin-labeled nucleotides and ligated.
  - This step joins DNA fragments that are physically close in 3D space but may be far apart in linear genomic distance.
- Step 3: DNA Purification and Sequencing
  - The crosslinks are reversed, and the ligated DNA is purified.
  - High-throughput sequencing is performed to identify ligated DNA fragment pairs.
- Step 4: Data Analysis
  - Mapping of reads to the reference genome.
  - Construction of a contact matrix that shows interaction frequencies between different regions.

Hi-C Overview
Lieberman-Aiden et al., Science, 2009

Applications of Hi-C
- Genome Assembly: Hi-C data helps in scaffolding contigs during de novo genome assembly by providing long-range contact information.
- Identification of Topologically Associating Domains (TADs): TADs are contiguous regions of the genome that preferentially interact with themselves. These domains are fundamental units of chromosome folding and gene regulation.
- Enhancer-Promoter Interactions: Hi-C can identify interactions between enhancers and promoters that are critical for understanding gene regulation.
- Cancer Genomics: Helps in understanding chromosomal rearrangements and aberrations that lead to oncogene activation.
- Epigenetics and Gene Regulation: Provides insights into how the 3D structure of the genome impacts gene expression and epigenetic regulation.
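The contact matrix built in the data analysis step of the Hi-C workflow can be illustrated with a toy example. The sketch below assumes a single chromosome and read pairs that have already been mapped to coordinates; the function name, the bin size and the example coordinates are all hypothetical and are not taken from any Hi-C software package.

```python
def build_contact_matrix(read_pairs, chrom_length, bin_size):
    """Count ligated read pairs per pair of genomic bins.

    read_pairs   : iterable of (pos_a, pos_b) mapped positions on one chromosome
    chrom_length : chromosome length in bases
    bin_size     : width of each bin in bases
    Returns a symmetric list-of-lists matrix of contact counts.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


if __name__ == "__main__":
    # Made-up read pairs on a 3 kb toy chromosome, binned at 1 kb.
    pairs = [(100, 900), (150, 2500), (2600, 120), (1200, 1300)]
    for row in build_contact_matrix(pairs, chrom_length=3000, bin_size=1000):
        print(row)
```

In this toy data the first and last bins share two contacts even though they are the furthest apart in linear distance, which is the kind of long-range signal that scaffolding and TAD detection rely on. Real pipelines also normalise the counts and handle multiple chromosomes, which this sketch leaves out.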

Illumina Complete Long Reads
- A method from Illumina that allows pseudo-long reads to be created from overlapping short reads
- Released in 2023, so it is uncertain whether this technology will have any significant uptake
- The overall goal of this library prep is to identify series of short reads which come from the same DNA template fragment, 5 – 7 kb in length
- These long fragments have higher base call accuracy than true long read sequencing
- There is criticism that this technology took too long to reach the market and is inferior to long read sequencing technologies, whose read accuracy has improved significantly in the last number of years

Illumina Complete Long Reads
- It first randomly inserts primer binding sites into the genome using a transposable element based system they call ‘tagmentation’
- These sites will be where PCR primers bind to amplify the genome. The distance between the primer binding sites is the size of the long read fragments.
- Keep in mind that there are millions of identical copies of the genome being ‘tagged’, so you will end up with PCR fragments that will overlap in an assembly
- In addition to the primer sites, the genome is enzymatically modified with ‘landmarks’. Illumina hasn’t been clear on the details of this, but the purpose of the landmarks is to allow us to identify Illumina short reads which come from the same fragment
- Following PCR amplification, the long read fragments with landmarks are prepared as standard Illumina sequencing libraries
- In addition to the landmarked sequences, a library preparation is performed on an unmodified genomic DNA sample.
- These reads are used to remove the landmarks which were introduced into the long reads

Illumina Complete Long Reads
https://www.illumina.com/science/technology/next-generation-sequencing/long-read-sequencing.html

MGI DNA Nanoball Sequencing
- A newer sequencing platform based on technology from Complete Genomics
- It has an innovative library preparation that uses rolling circle amplification to create a long oligo with tandem repeats of the fragment to be sequenced (300-500 copies)
- This single molecule, called a ‘DNA Nanoball’, is loaded onto a flowcell through interactions between the negatively charged phosphate backbone of the DNA and positively charged spots on the flowcell
- The method of determining the DNA bases is called combinatorial probe-anchor synthesis (cPAS)

MGI DNA Nanoball Sequencing
- The details of cPAS haven’t been disclosed, but based on a 2018 patent it is similar to sequencing-by-synthesis (SBS)
- With cPAS the nucleotide bases are modified with a molecule (sugar?). Each nucleotide has its own moiety attached
- Fluorescently labelled monoclonal antibodies against the moieties are used to detect the base incorporated.
- Like SBS, the moiety is cleaved from the DNA, leaving a 3’OH, and sequencing can continue
- Instruments create 150 GB – 6 TB of sequence per run

MGI – DNA Nanoball Sequencing
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5569-5

Illumina vs MGI
- MGI and Illumina technologies were found to perform very closely
- Either is appropriate for whole genome variant analysis

PacBio Sequencing by Binding
- New method of short read sequencing released in late 2022 from a company leading the long-read sequencing market. It is called Sequencing by Binding (SBB)
- They claim that the accuracy of their technology is 15X greater than Illumina (an error rate improving from 1 in 1,000 to 1 in over 10,000)
- The main innovation seems to be in the way they detect and incorporate bases in the sequencing process.
- They still have a cluster generation step, like other technologies, to amplify the signal for base detection.
- They break their sequencing process into 4 steps: initiate, interrogate, activate, incorporate
- Like Illumina, it adds one DNA base to the sequenced read in each cycle
- As with Illumina's reversible terminators, the incorporated bases are blocked at the 3’ end to prevent chain extension; unlike Illumina, the labelled bases used for detection are not the bases that get incorporated
- Fluorescently labelled nucleotides flood the flow cell and are allowed to base pair with the template strand but not to form a phosphodiester bond.
- The labelled nucleotides are then washed away, the 3’ hydroxyl group is unblocked, and new unlabelled nucleotides are flooded onto the flow cell and allowed to be incorporated
- The incorporated base is blocked at the 3’ end, which allows only one base to be added at a time
- Incorporating unlabelled bases reduces something they call ‘scarring’, which is the short linker that is left behind during Illumina sequencing. This leads to increased accuracy

PacBio - SBB
https://www.pacb.com/blog/sbb-

Why do companies create new short read sequencing technologies?
- It is clear that long read sequencing is the future for whole genome assembly
- Why are companies continuing to invest in new short read sequencing approaches?
  - Short reads are good for re-sequencing, that is, comparing short DNA sequences against a high-quality reference genome. This is how we find SNPs and short indels
  - Short reads are good for counting assays like RNA expression analysis or structural analysis like Hi-C
  - DNA extraction for long-read sequencing can be quite difficult (expensive), while short read technologies have an easier time with various types of samples
  - As long as long-read sequencing is more expensive, short read sequencing will remain in demand

PacBio – RT-Sequencing
- PacBio (Pacific Biosciences) sequencing uses Single Molecule Real-Time (SMRT) technology to read long DNA fragments with high accuracy.
- Key Features:
  - Single-molecule sequencing: Reads individual DNA molecules without amplification.
  - Long reads: Capable of reading tens of thousands of bases, up to 100 kb.
  - High accuracy with HiFi reads: Combines the length of long reads with accuracy comparable to short-read sequencing.
- Applications: Useful for
  - genome assembly
  - variant detection
  - epigenomics
  - transcriptomics

PacBio – RT Sequencing
http://decodingdna.yolasite.com/single-molecule-real-time-sequencing.php

PacBio - HiFi
- HiFi (High-Fidelity) sequencing combines the benefits of long reads and high accuracy (>99.9%).
- Circular Consensus Sequencing (CCS): HiFi reads are generated by sequencing the same DNA molecule multiple times to build a consensus sequence.
- Key Features:
  - Long read lengths (10-25 kb) with high accuracy (Q30 or above)
  - Reduced error rates compared to older long-read technologies
- Applications:
  - de novo genome assembly
  - detecting structural variants
  - haplotype phasing
  - resolving complex genomic regions

PacBio – HiFi Sequencing
https://www.pacb.com/technology/hifi-sequencing/

Pacific Biosciences – HiFi
https://www.pacb.com/technology/hifi-sequencing/how-it-works/

PacBio Error Rate
https://www.pacb.com/blog/understanding-accuracy-in-dna-sequencing/

PacBio Unique Characteristics
- Poisson distribution of read lengths
- Random error model

PacBio – Pros and Cons
- Advantages:
  - High-accuracy long reads (HiFi)
  - Ability to capture large structural variants and complex regions
  - Comprehensive epigenetic information
- Limitations:
  - Higher cost per sample compared to some short-read platforms
  - Longer library preparation time and more extensive computational requirements
  - Requires very high-quality DNA to achieve acceptable read lengths

Nanopore Sequencing
- Oxford Nanopore Technologies (ONT) devices read DNA and RNA sequences by measuring changes in electrical current as nucleic acids pass through a biological nanopore.
- Unique Features:
  - Real-time sequencing: Data is generated in real time, allowing immediate analysis.
  - Long reads: Capable of sequencing very long fragments (up to hundreds of kilobases).
  - Portable devices: Platforms range from handheld (e.g., MinION) to high-throughput (e.g., PromethION).
- Applications:
  - whole-genome sequencing
  - targeted sequencing
  - epigenetics
  - metagenomics
  - transcriptomics

Oxford Nanopore Sequencing
https://www.nextgenerationsequencing.info/ngs-products/ngs-technologies/oxford-nanopore-technologies

ONT Sequencing
https://nanoporetech.com/uploads/Technology_New/Nanopore_Sensing/filemanager-1.jpg?mw=72

ONT – Raw Data

ONT Sequencing
https://nanoporetech.com/blog/news-blog-kilobases-whales-short-history-ultra-long-reads-and-high-throughput-genome

ONT Applications
- Whole Genome Sequencing
  - De novo assembly of genomes from high quality ONT reads
  - Genome scaffolding using long ONT reads to position contigs
- Real-Time Pathogen Detection:
  - Used in field diagnostics (e.g., Ebola, COVID-19).
  - Provides rapid sequencing of pathogens to guide public health interventions.
- Metagenomics:
  - Sequencing of complex microbial communities without the need for culturing.
  - Used in environmental microbiology, human microbiome studies, etc.
- Structural Variant Detection: Long reads enable the detection of large structural variants, repeat expansions, and phasing.
- Epigenetics: Direct detection of base modifications such as methylation without additional library preparation.
- Full-Length RNA Sequencing: Enables sequencing of entire RNA molecules for isoform discovery and gene expression studies.

ONT Pros and Cons
- Advantages:
  - Portable and scalable devices.
  - Real-time data generation.
  - Ability to produce ultra-long reads (>100 kb).
  - Direct detection of nucleotide modifications such as methylation
- Limitations:
  - Lower raw read accuracy compared to some short-read technologies (but improving with new chemistries and software).
  - High error rate in homopolymeric regions.
  - Higher cost per base for some applications compared to short-read platforms.

Case Study - Cherry Tree Assembly
- Purpose – Solidify IP protection of a commercialized cherry variety. A genome sequence was requested to unambiguously assign chromosome locations to existing gene-based genetic markers
- Parameters
  - Limited funds
  - Limited DNA
  - Short time-frame

Assembly Considerations
- The goal was to assign chromosome locations to gene-based markers, which are unique sequences in the genome
- Complete genome characterization of repetitive regions not necessary
- High quality contig assembly needed for unique chromosome regions
- Had to achieve long scaffolds for anchoring on an existing genetic map of ~2200 markers

Technology Choices

| Assembly Stage | Technology | Rationale |
|---|---|---|
| Contig Construction | 10X Genomics | 10X Genomics gives high quality contig sequence in non-repetitive regions; PacBio cost was too high; Nanopore cost was too high |
| Scaffold Generation | 10X Genomics, Proximo Hi-C | 10X Genomics achieves excellent initial scaffold lengths; Hi-C is low cost with in-house library construction; optical mapping was not readily available |
| Pseudochromosome Building | Genetic Map | Use an existing genetic map to place scaffolds into linkage groups |

Cherry 10X Genomics Input

| From [bp] | To [bp] | Average Size [bp] | Conc. [ng/µl] | Region Molarity [nmol/l] | % of Total | Region Comment | Color |
|---|---|---|---|---|---|---|---|
| 20000 | >60000 | - | 13.5 | 0.322 | 82.83 | | |
| 48500 | >60000 | - | 12.1 | 0.241 | 74.17 | | |

Kmer Frequency Analysis
- Genome Length: 338 Mbp
- Unique Content: 53.8%
- Heterozygosity: 0.37%

10X Genomics Assembly Summary

| Metric | Value |
|---|---|
| Number of Reads | 150,000,000 |
| Estimated Coverage | 57X |
| Median Insert Size | 393 bases |
| Repetitive Fraction | 28% |
| Assembled GC Content | 38% |
| Mean Molecule Size | 66,400 bases |
| Mean Distance Between Heterozygous SNPs | 624 bases |
| Contig N50 | 46,660 bases |
| Scaffold N50 (scaffolds > 10,000 bp) | 2,760,000 |

Hi-C Scaffolding 10X Assembly

Evaluation of the Hi-C Scaffolds

Hi-C Scaffolding Results

| | 10X Assembly Before Hi-C | 10X Assembly after Hi-C |
|---|---|---|
| Number of Contigs | 16,584 | 7576 |
| N50 | 1,894,700 | 23,810,154 |
| Count >= N50 | 31 | 5 |
| Longest Scaffold Length | 10,855,963 | 44,294,128 |
| Total Scaffold Length | 292,931,652 | 288,746,240 |

Cherry Genetic Map

Pseudomolecule Construction

| Pseudochromosome | Linkage Map Scaffolding (bases) |
|---|---|
| chr1 | 44,294,128 |
| chr2 | 34,408,754 |
| chr3 | 26,824,391 |
| chr4 | 22,649,137 |
| chr5 | 18,400,590 |
| chr6 | 27,262,132 |
| chr7 | 23,810,154 |
| chr8 | 22,114,774 |
| Total | 219,764,060 |

Assembly Evaluation

| BUSCO | Number of BUSCO Genes | Percent of BUSCO Genes |
|---|---|---|
| Complete Single-Copy | 1337 | 92.8 |
| Complete Duplicated | 34 | 2.4 |
| Fragmented | 32 | 2.2 |
| Missing | 37 | 2.6 |
| Total | 1440 | 100.0 |

Choosing a Technology
- Application dependent
- Considerations:
  - Total cost
  - Cost per sample
  - Number of reads generated per $
  - Number of bases generated per $
  - Length of sequence
  - Quality of the sequence
  - Availability
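
The N50 values reported in the case study can be reproduced from the numbers on the slides. The sketch below is an illustration, not the tool used in the project: the function name `n50` and its `total_length` parameter are hypothetical, and it assumes that the eight pseudochromosomes are the eight longest scaffolds in the assembly after Hi-C, so that every scaffold not listed is shorter than the reported N50.

```python
def n50(lengths, total_length=None):
    """Return (N50, number of sequences with length >= N50).

    N50 is the length of the sequence at which the running total of lengths,
    taken from longest to shortest, first reaches half of the assembly size.
    total_length lets the caller supply the full assembly size when only the
    longest sequences are listed (assumption: all omitted ones are shorter).
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered) if total_length is None else total_length
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("listed lengths do not reach half of total_length")


# Pseudochromosome lengths from the Pseudomolecule Construction table.
CHERRY_CHROMOSOMES = [
    44_294_128, 34_408_754, 26_824_391, 22_649_137,
    18_400_590, 27_262_132, 23_810_154, 22_114_774,
]

if __name__ == "__main__":
    # Total scaffold length after Hi-C, from the Hi-C Scaffolding Results table.
    print(n50(CHERRY_CHROMOSOMES, total_length=288_746_240))
    # N50 of the eight pseudochromosomes on their own.
    print(n50(CHERRY_CHROMOSOMES))
```

Under that assumption the first call gives an N50 of 23,810,154 with 5 scaffolds at or above it, which agrees with the values in the Hi-C Scaffolding Results table. The second call gives a larger value because it ignores the roughly 69 Mb of sequence left outside the pseudochromosomes.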