ZB2101 Introductory Bioinformatics - Genome-Wide Experiments
Document Details
National University of Singapore
2024
Greg Tucker-Kellogg
Summary
These lecture notes provide an introduction to genome-wide studies, covering DNA microarrays, DNA sequencing technologies, and the FASTQ file format. The presentation details different types of genomic studies and the mapping process. The notes also touch upon quality scores.
Full Transcript
ZB2101 Introductory Bioinformatics
Introduction to genome-wide studies
Greg Tucker-Kellogg
Department of Biological Sciences, National University of Singapore
30 October 2024

Outline
Prequel
DNA microarray technology
DNA sequencing technologies
Some types of genomic studies
The FASTQ file type
Mapping reads to a genome: FASTQ → SAM

Topic: Prequel

What do we mean by a genome-wide study?
Any study that provides information on many (≫ 1000) genes or genomic positions at once.
Several different technology categories, many different applications.
Sometimes experimental, sometimes observational.

Example: Genome-wide association
Using single-nucleotide polymorphisms, what genetic location is associated with a disease?

Example: transcriptome profiling
Which genes in the central nervous system undergo mistaken splicing in a mouse model of ALS?

Topic: DNA microarray technology

What is a DNA microarray?
A solid surface to which DNA is attached.
Typically the surface is arranged as a grid of spots.
Distinct DNA sequences are attached to the surface at different spots (each sequence has an address).
A typical DNA microarray is between the size of a thumbnail and the size of a microscope slide.
A single DNA microarray may represent over a million distinct DNA sequences.
Most microarrays are used for two types of assays:
  ◦ Large scale genotyping
  ◦ Large scale gene expression measurements
DNA methylation assays are a variant of the genotyping assay.

Common features of all DNA microarray technologies
Generally, 1 microarray is used per sample.
The measurement signal is a result of DNA hybridisation. More hybridisation, more signal.
The resulting raw data are represented as a tall, skinny matrix of numbers:
  ◦ m columns, each representing a sample (typically ≪ 1000)
  ◦ n rows (> 100,000), each representing the signal (hybridisation) at a spot.
  ◦ The amount of signal needs to be converted to an estimate of a biological measurement, e.g., transcript expression level, SNP variant call, etc.

Common features of expression microarray experiments
Samples vary in known, intended ways (e.g., sample treatment).
Samples may vary in known, unintended ways (e.g., known batch effects).
Samples may vary in unknown, unintended ways.
Our scientific goal is to model expression changes based on known (and possibly unknown) factors, and to infer changes in expression that are due to intended covariates (e.g., sample treatment).
The number of covariates may be larger than the number of samples.
Experimental design is often represented as a short, wide table:
  ◦ m rows, one for each sample
  ◦ p columns, one for each covariate

Target layout of a microarray study

Differences between DNA microarray technologies and applications
What is measured:
  ◦ Transcript (from 3′ end)
  ◦ Transcript (without 3′ end bias)
  ◦ Exons
  ◦ Splice junctions
  ◦ SNP variation
  ◦ Copy number variation
  ◦ Methylation
For this module, we will focus on microarrays used to measure RNA expression levels.

Affymetrix GeneChips
Short probes synthesised on the chip surface
Light-directed oligonucleotide synthesis.
Photolithographic masks are used to direct synthesis.
Light-directed synthesis is slightly less efficient than conventional oligonucleotide synthesis!
Oligonucleotide length is practically limited (by efficiency and cost) to 25 nucleotides.
Millions of distinct short sequences at defined positions.

Illumina BeadArrays
Longer oligos synthesised on beads
Conventional solid-phase phosphoramidite DNA synthesis.
Much longer sequences possible (typically 50 nucleotides).
3 µm beads randomly spread out on a glass slide.
~40,000 - 50,000 distinct sequences at undefined positions.

How are Illumina BeadArray probes mapped to bead positions on the slide?
Each bead also includes a barcode sequence.
Barcodes share subsequences.
Barcodes can be distinguished by repeated hybridisation:
  ◦ Two colours require 2n hybridisations to map n distinct probes

Chip technology influences analysis strategy
Affymetrix GeneChips:
  ◦ Short oligonucleotide sequences mean fewer unique sequences in a large genome.
  ◦ Many arrays have “perfect” and “mismatch” pairs to distinguish hybridisation.
  ◦ Typically a transcript is measured by combining data from 11-20 “probe pairs” designed against different regions of the target.
  ◦ Any sequence hybridisation measurement may be biased; multiple different sequences are intended to compensate for bias.
Illumina BeadArrays:
  ◦ Long sequences allow individual oligos to identify transcripts.
  ◦ Conventional synthesis is cheaper than light-directed synthesis, which permits more flexible redesign of specialised arrays.
  ◦ Individual sequence bias may contribute more to assay measurements.
  ◦ For a long while, two colour strategies were used with Illumina BeadArray.

What do the data actually look like?
Data are non-negative values from a signal measurement.
For gene expression microarrays, higher values of signal result in higher estimates of expression.
Many readings occur near, but not exactly at, zero (think about why).
For genotyping or methylation microarrays, data are often scaled between 0 and 1 to make a categorical estimate (e.g., is this position methylated or not).

Topic: DNA sequencing technologies

Sanger Sequencing (1 of 2)
Invented in 1977; it won Fred Sanger his second Nobel Prize (Chemistry, 1980), which he shared with Paul Berg and Walter Gilbert. Allan Maxam and Walter Gilbert published a competing chemical sequencing method the same year.
Relies on dideoxynucleotide triphosphates (ddNTPs), in which neither the 2′ nor the 3′ position carries an OH group.
The Human Genome Project was carried out using Sanger sequencing.

Sanger Sequencing (2 of 2)

A modern Sanger sequencer
One “sequence read” per sample
Up to 800 base pairs per read
96 reads per run

Next-generation sequencing
A general term for sequencing technologies that began to emerge around 2004.
Often shortened to NGS.
Typically 1 million – 100 million “reads” per sample.
This technology completely changes what is possible in biology.
Nearly all genomic sequencing projects since 2008 have used NGS.

Comparison of NGS technologies in 2013
A more up-to-date, although less clear, table is available at https://en.wikipedia.org/wiki/DNA_sequencer

Illumina flow cells
Why has Illumina dominated NGS?
It is the most widely used NGS technology (>90% of all DNA sequence data ever generated).
A flow cell is the size of a microscope slide.
Hundreds of millions of short reads are generated in a single run.
How does it work?

Long reads and single-molecule sequencing are next

Illumina: from DNA to prepared slides
DNA is fragmented and ligated to adapters. This process is called “library preparation” and varies widely depending on the application.
The fragments are randomly bound to the surface of a flow cell.
PCR amplification occurs in a lawn of adapter sequences.
The lawn creates “bridges”, so a cluster of identical sequences appears where each molecule of DNA started.
The clusters are often called “polonies” (PCR colonies).

Illumina: from slide to sequence read
Each cluster produces a sequence read from one or both ends of the fragment.
Every time a new base is added, it is colour-coded.
A single image of the slide gives one base addition.
A single run of a single machine can yield hundreds of millions of sequence reads.
Illumina Sequencing by Synthesis

The technology provided by NGS is an unprecedented trend
Data from US National Human Genome Research Institute

Topic: Some types of genomic studies

Some types of genomic studies

The barcode of life
boldsystems.org

SG10K health
npm.sg

Topic: The FASTQ file type

FASTQ: FASTA + quality
Capillary electrophoresis sequencing provides quality scores (Phred scores) for each base sequenced in every lane.
Can a modification of the FASTA file format provide quality information for NGS data?
Quality scores are numeric, applied to every base of every read.

How can we use quality scores with NGS data?
NGS technology yields tens of millions of sequence reads per run.
Reads are assessed for quality during analysis, not one at a time.
FASTA is the most common readable format for multi-sequence files.

The FASTQ data file
@FCC2BEJACXX:3:1101:4961:1995#0/1
NGAAGGAACCTTGGCACTAGAATCTCGTATGCCGTCTTCTGCTTGAAAA
+
BP\ccceegggggiiiiihihiiiiiiiihffhhhhiifffhhib_ceg
@FCC2BEJACXX:3:1101:6583:1992#0/1
NGGAGATTGACTTGGCTATGTTCCTCGTATGCCGTCTTCTGCTTGAAAA
+
BP\cceeefggggiiiihiihiiiiiihhihhhhfhiihhhhiihhiii
@FCC2BEJACXX:3:1101:7345:1990#0/1
NGGGAGGATAGTATGTACGCAGACTCGTATGCCGTCTTCTGCTTGAAAA
+
BS\cccecgggegiihihiiiiiiiihfh_cfgh_fhihiiihihdfhi

FASTQ format: four lines per entry
Line 1: Like a FASTA header line, but begins with "@" instead of ">".
The header line includes information about the machine, flow cell, spot position, etc. The details depend on the platform.
Line 2: The sequence.
Line 3: Begins with "+", and then sometimes repeats the information in line 1. This line is generally not used.
Line 4: Numeric quality scores for the sequence in Line 2, one per base. These are encoded in ASCII, which allows an integer value up to 127 to be written as a single character.

The ASCII character table

What is Quality?
Assume a probability p that the called base is incorrect, and map that probability to a positive integer.
The traditional quality mapping is the Phred score:
$Q_{\text{Sanger}} = -10 \log_{10} p$
Phred scores are encoded in FASTQ files as ASCII = Phred + 33.
Illumina used to use a different quality score, based on the odds of an error rather than its probability:
$Q_{\text{old Illumina/Solexa}} = -10 \log_{10} \frac{p}{1-p}$
They then updated it to use the Phred score, but encoded with an ASCII value of Phred + 64. Even more recently, they went back to conventional Phred + 33.
In general, the quality encoding can be recognised by the programs that process the data.

Topic: Mapping reads to a genome: FASTQ → SAM

The mapping problem
Reads tend to be short, but the genome is long (non-repetitive human genome ~ 2.7 Gbp).
Sequencing reads are very often paired, reading both ends of an insert.
Brute force (compare all k-mers) is impractical.
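
Looking back at the FASTQ quality encoding, the four-line record structure and the Phred formula can be made concrete with a minimal Python sketch. This is an illustration, not part of the original slides: the function names (parse_fastq, decode_quality, phred_to_error_prob, error_prob_to_phred) are hypothetical, and the offset of 64 used for the example read is an assumption. That assumption is based on the quality characters in the FASTQ example reaching "i" (ASCII 105), which would give implausibly high scores under Phred + 33.

```python
import math


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per entry")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ entry at record {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        yield header[1:], seq, qual


def decode_quality(qual, offset=33):
    """Convert an ASCII quality string to integer scores (ASCII code - offset)."""
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Invert Q = -10 log10(p) to recover the error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q = -10 log10(p) for an error probability p."""
    return -10 * math.log10(p)


# First record from the FASTQ example in these notes.
# The raw string keeps the backslash in the quality line intact.
record = [
    "@FCC2BEJACXX:3:1101:4961:1995#0/1",
    "NGAAGGAACCTTGGCACTAGAATCTCGTATGCCGTCTTCTGCTTGAAAA",
    "+",
    r"BP\ccceegggggiiiiihihiiiiiiiihffhhhhiifffhhib_ceg",
]

header, seq, qual = next(parse_fastq(record))
scores = decode_quality(qual, offset=64)  # assumption: Phred + 64 encoding
print(header, len(seq), scores[:4])
print(phred_to_error_prob(scores[0]))
```

Under the assumed offset, the first base (the uncalled "N") gets a very low score, and the scores rise over the first few bases of the read. In practice, the offset should be confirmed from the sequencing platform or pipeline version rather than guessed.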

The solution: index the genome
Several methods of indexing the genome:
  ◦ Suffix arrays
  ◦ Search a binary tree
Index and compress in clever ways:
  ◦ Burrows-Wheeler Transform
  ◦ FM-index
We don’t need to go into the details of these, but they work.

Many short read aligners
BWA
bowtie (or bowtie 1)
bowtie2 (longer reads, allows indels)

bowtie and bowtie2
A set of C++ programs for efficient short read alignment.
Easy to use.
The workhorse of a larger suite of programs named for men’s formalwear (cufflinks, tophat).
Step 1: index the genome
Step 2: align the reads
Depending on the application, you may want to use the program bowtie or bowtie2.
Can output SAM file format.
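
To give a feel for why indexing the genome helps, here is a toy Python sketch of the suffix array idea mentioned in these notes. It is an illustration only and not how bowtie or BWA are implemented (those rely on the compressed Burrows-Wheeler/FM-index approach). The function names build_suffix_array and find_read and the example "genome" string are hypothetical, and the construction is deliberately naive: it would not scale to a real genome.

```python
def build_suffix_array(genome):
    """Return suffix start positions sorted by the text of each suffix.

    Naive construction for illustration; it materialises every suffix.
    """
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def find_read(genome, suffix_array, read):
    """Return all positions where `read` occurs exactly, via binary search."""
    k = len(read)

    def prefix(rank):
        # First k characters of the suffix with this rank in the array.
        start = suffix_array[rank]
        return genome[start:start + k]

    # Lower bound: first rank whose prefix is >= read.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < read:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > read.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= read:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix between the two bounds starts with the read.
    return sorted(suffix_array[first:lo])


genome = "GATTACAGATTACA"   # hypothetical toy genome
sa = build_suffix_array(genome)
print(find_read(genome, sa, "ATTA"))
print(find_read(genome, sa, "GGG"))
```

Once the index is built, each lookup needs a number of string comparisons proportional to the logarithm of the genome length, instead of a scan over every position for every read. Real aligners also have to tolerate mismatches and indels, which this exact-match sketch does not attempt.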