Summary

This document is a lecture on RNA-Seq. It walks through the RNA sequencing workflow (sample preparation, library preparation, sequencing, alignment, and quantification) and covers some of the bioinformatic methods used in RNA-Seq analysis, including normalization.

Full Transcript

Lecture 8 – RNA-Seq
BIOTECH 4BI3 - Bioinformatics

Where are we going?

[Course roadmap diagram. Boxes: DNA Sequencing; DNA Sequencing Quality Control; DNA Assembly; Read Mapping; Genome Annotation; Expression Analysis; Marker-Trait Associations; Population Analysis; Polymorphism Discovery; Genotyping]

What is RNA-Seq

RNA-Seq (RNA sequencing) is a high-throughput method used to sequence and quantify the RNA in a sample. It enables comprehensive analysis of the transcriptome, the complete set of RNA transcripts produced by the genome under specific circumstances.

Why it's important: It helps to quantify gene expression, discover novel transcripts, identify splicing events, and understand regulatory networks.

Applications: Used in research on cancer, neuroscience, developmental biology, and plant biology.

[Figure from Wang et al., 2009, “RNA-Seq: a revolutionary tool for transcriptomics”]

Why Use RNA-Seq

RNA-Seq offers a deeper, less biased view of the transcriptome than traditional techniques such as microarrays.

Advantages over microarrays:
Higher sensitivity: Can detect low-abundance transcripts.
Not a fixed set: RNA-Seq can discover novel transcripts without needing pre-designed probes.
Greater dynamic range: More accurate quantification of both highly and lowly expressed genes.

Quantitative and qualitative benefits: RNA-Seq not only measures transcript levels but also helps discover alternative splicing, non-coding RNAs, and novel isoforms.

Applications of RNA-Seq

Quantifying gene expression: Allows researchers to measure gene expression levels across different conditions or treatments.
Transcript discovery: Identification of novel coding and non-coding RNAs.
Alternative splicing: Discovery of isoforms and splice variants in a single experiment.
Comparative transcriptomics: Comparison of expression levels between species, tissues, or conditions.
Network analysis: Reveals gene-gene interactions and regulatory pathways through co-expression networks.
RNA-Seq Workflow

The RNA-Seq process involves the following steps:
1. RNA extraction from the biological sample (e.g., tissue, cells).
2. Library preparation, where RNA is converted into cDNA.
3. Sequencing, typically done using Illumina or PacBio technologies.
4. Read alignment to a reference genome or transcriptome.
5. Quantification of transcript abundance and further downstream analysis (e.g., differential gene expression).

Sample Preparation

The starting point for any RNA-Seq experiment is obtaining high-quality RNA from biological samples.

Types of RNA:
mRNA: Typically the focus of RNA-Seq experiments, representing ~1-2% of total RNA.
Other RNAs: Include ribosomal RNA (rRNA), transfer RNA (tRNA), and non-coding RNAs, which can be filtered out.

Challenges in sample prep:
RNA is fragile and prone to degradation, making it crucial to handle samples carefully.
RNA extraction methods include TRIzol and column-based kits.

Important note: Sample source matters. Differences in tissue type or condition (e.g., stress or disease) affect RNA composition.

Library Creation

Ribosomal RNA (rRNA) Depletion

Ribosomal RNA constitutes ~80-90% of total RNA in a cell, and since we’re usually interested in mRNA, it is necessary to deplete rRNA.

Methods for rRNA depletion:
1. Poly-A selection: Enriches for mRNA by binding polyadenylated tails, but misses non-polyadenylated RNAs.
2. Ribodepletion: Removes rRNA directly rather than selecting for poly-A tails, allowing both polyadenylated and non-polyadenylated transcripts to be captured.

The right method depends on the experimental goals (e.g., total RNA vs mRNA).

RNA-Seq Experimental Design

Properly designed experiments ensure the generation of reproducible, meaningful data. Poor design leads to biased results, incorrect biological conclusions, and wasted resources.

What is the question you are trying to answer?
Key goals of experimental design:
Maximize biological signal detection
Minimize technical noise and bias

Biological and Technical Replication

Biological replicates: Biological replication ensures that differences in gene expression are not due to random variation. It usually involves multiple independent samples from each condition being tested.

Technical replicates: Technical replicates address variability introduced by library prep, sequencing, etc. They are generally less critical than biological replicates but still useful.

Advice on experimental design

More biological replicates are better
In the trade-off between replicates and sequencing depth, always favor replicates
Published data suggest that 10-20 million reads per transcriptome should capture most of the differentially expressed genes with a 2X (two-fold) change

How much sequencing

What is sequencing depth? The number of reads generated per sample. Higher depth increases sensitivity for detecting lowly expressed genes.

How much depends on the goal:
5 million reads to quantify highly expressed genes.
>50 million reads for lowly expressed genes.

[...] (>80%) of reads should map to the genome or transcriptome.

Coverage uniformity: Uniform coverage across the transcriptome is desired, as uneven coverage can lead to biases in downstream analyses.

Mapping Reads

Alternative Splicing

[Figure from Cartegni L et al. (2002) Nature Reviews Genetics 3:285–298]

Transcript Quantification

Objective: After reads are mapped, the next step is to quantify the expression levels of genes or transcripts.

Quantification methods: Different pipelines can affect the accuracy and precision of RNA-Seq data.
HTSeq: Strong performance in counting reads with the union and intersection methods.
RSEM: Effective in transcript quantification by summing transcript-level estimates.
Salmon and Kallisto: These pseudoaligners offer speed and good precision but may sacrifice some accuracy compared with traditional methods like HTSeq.
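As a rough illustration of what read counting involves, the sketch below assigns reads to genes with an overlap rule modelled on the "union" idea mentioned above: a read counts toward a gene only if exactly one gene overlaps it. The gene names, coordinates, reads, and helper functions are all hypothetical, and this is not HTSeq's actual implementation, which works on real alignment and annotation files.

```python
from collections import Counter

# Hypothetical toy annotation: gene name -> exon intervals (start, end),
# half-open coordinates on a single chromosome.
GENES = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(350, 500)],
}


def genes_overlapping(read_start, read_end, genes=GENES):
    """Return the set of genes with an exon overlapping any base of the read."""
    hits = set()
    for name, exons in genes.items():
        for exon_start, exon_end in exons:
            # Two half-open intervals overlap if each starts before the other ends.
            if read_start < exon_end and exon_start < read_end:
                hits.add(name)
                break
    return hits


def count_reads(reads, genes=GENES):
    """Count reads per gene; reads hitting zero or several genes are set aside."""
    counts = Counter()
    for read_start, read_end in reads:
        hits = genes_overlapping(read_start, read_end, genes)
        if len(hits) == 1:
            counts[next(iter(hits))] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical aligned reads given as (start, end) positions.
reads = [(120, 170), (310, 340), (360, 410), (450, 490), (600, 650)]
counts = count_reads(reads)
for label, n in sorted(counts.items()):
    print(label, n)
```

Setting ambiguous reads aside rather than splitting them between genes is one possible design choice; it keeps the per-gene counts as whole numbers of reads, at the cost of discarding some data where genes overlap.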
RNA-Seq Normalization

What is normalization: RNA-Seq libraries can vary in total read counts, and this variation must be corrected to compare gene expression between samples. This correction is called normalization and is part of the quantification process.

Common normalization methods:
1. RPKM (Reads Per Kilobase per Million): Normalizes for gene length and sequencing depth.
2. FPKM (Fragments Per Kilobase per Million): Similar to RPKM, but for paired-end reads.
3. TPM (Transcripts Per Million): Normalizes within each sample and is more consistent for comparing across samples.
4. TMM (Trimmed Mean of M-values): Calculates normalization factors by comparing the M-values (log fold changes) between samples, trimming extreme values, and using the mean of the remaining M-values to scale counts.

RPKM

Reads per kilobase of transcript per million reads mapped
Proposed to allow the comparison of transcripts within and between samples
Tries to account for differences among transcripts and libraries by correcting for the size of the library (number of reads) and the length of the gene

Step 1: Normalize for read depth per replicate. Sum the read counts in each replicate; this gives the scaling factor for that replicate. Divide the read counts in each replicate by this value and multiply by 10^9.

Gene          Length   Replicate 1   Replicate 2   Replicate 3
A             2000     25000         27000         75000
B             4000     60000         58000         150000
C             1000     12500         20000         37500
Total Reads            97500         105000        262500

After dividing by total reads and multiplying by 10^9:

Gene          Length   Replicate 1   Replicate 2   Replicate 3
A             2000     256410000     257143000     285714000
B             4000     615385000     552381000     571429000
C             1000     128205000     190476000     142857000

Step 2: Normalize for the length in base pairs of each gene.

Gene          Length   Replicate 1   Replicate 2   Replicate 3
A             2 kb     128205        128571        142857
B             4 kb     153846        138095        142857
C             1 kb     128205        190476        142857

FPKM

Developed to accommodate paired-end read data
Paired reads represent only a single DNA fragment that was sequenced
It keeps track of the fragments so PE reads aren’t counted twice

TPM

It was observed that RPKM values vary from sample to sample and aren’t a true measure of the ‘concentration’ of an expressed gene
TPM (transcripts per million) was put forward as an alternative
“For a given RNA sample, if you were to sequence one million full-length transcripts, a TPM value represents the number of transcripts you would have seen for a given gene or isoform.” (NCBI)
TPM is proportional to the ‘average concentration’ of a transcript and is preferred for expression studies

Step 1: Divide the number of reads for a transcript in a replicate by the gene length (in kb).

Gene          Length   Replicate 1   Replicate 2   Replicate 3
A             2000     25000         27000         75000
B             4000     60000         58000         150000
C             1000     12500         20000         37500
Total Reads            97500         105000        262500

After dividing by gene length:

Gene          Length   Replicate 1   Replicate 2   Replicate 3
A             2 kb     12500         13500         37500
B             4 kb     15000         14500         37500
C             1 kb     12500         20000         37500

Step 2: After adjusting for the length of the genes, normalize as if each replicate library had 1,000,000 reads total. That is, divide each value by its replicate's column total (40000, 48000, and 112500) and multiply by 1,000,000.

Gene          Length   Replicate 1   Replicate 2   Replicate 3
A             2 kb     312500        281250        333333
B             4 kb     375000        302083        333333
C             1 kb     312500        416667        333333

Problems with Library Normalization

1. Differences in library size: Sequenced libraries have different sizes. RPKM/FPKM and TPM account for this through two-step normalization.
2. Difference in library content: Different tissues or conditions can result in very different subsets of genes being expressed. A leaf RNA-seq library will have many genes expressed that don’t show up in a root RNA-seq library. A gene that is highly expressed in only one sample takes up a large share of that library's reads, so other genes can appear differentially expressed with RPKM/FPKM and TPM. In the example below, Rubisco accounts for 400 of the 500 leaf reads, so AtCul1, AD, and SFT appear five-fold higher in root even if their true expression is unchanged.

Gene          Leaf Tissue   Root Tissue
AtCul1        50            250
Rubisco       400           0
AD            25            125
SFT           25            125
Total Reads   500           500

TMM

The between-sample normalization method used by edgeR
Use this on raw count data, not on intra-sample normalized values
Works on the assumption that most genes are not differentially expressed
It computes per-gene log fold changes (M-values) between samples, takes a trimmed mean of them, and uses the resulting scaling factor to create an ‘effective library size’ for the whole library
If your samples are similar to each other you may not need to do this

Comparing Replicates

As a quality control we often want to compare replicates
If replicates are less alike than different treatments it could indicate a problem (Spearman coeff[...]
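The RPKM and TPM worked examples from the normalization slides can be checked with a short sketch. It uses the read counts and gene lengths from the lecture's tables; the function and variable names are illustrative only, and the library total is taken as the sum of the three genes' counts, as in the slides.

```python
# Read counts from the lecture's worked example:
# gene -> (length in bp, [counts in replicates 1, 2, 3]).
COUNTS = {
    "A": (2000, [25000, 27000, 75000]),
    "B": (4000, [60000, 58000, 150000]),
    "C": (1000, [12500, 20000, 37500]),
}


def rpkm(counts):
    """Reads per kilobase per million: depth first, then gene length."""
    n_reps = len(next(iter(counts.values()))[1])
    # Total reads per replicate (the depth scaling factor).
    totals = [sum(reads[i] for _, reads in counts.values()) for i in range(n_reps)]
    return {
        gene: [reads[i] * 1e9 / (totals[i] * length) for i in range(n_reps)]
        for gene, (length, reads) in counts.items()
    }


def tpm(counts):
    """Transcripts per million: gene length first, then rescale to 1,000,000."""
    n_reps = len(next(iter(counts.values()))[1])
    # Reads per kilobase for each gene and replicate.
    rates = {
        gene: [reads[i] / (length / 1000) for i in range(n_reps)]
        for gene, (length, reads) in counts.items()
    }
    # Per-replicate sum of the length-adjusted values.
    rate_totals = [sum(r[i] for r in rates.values()) for i in range(n_reps)]
    return {
        gene: [r[i] * 1e6 / rate_totals[i] for i in range(n_reps)]
        for gene, r in rates.items()
    }


rpkm_values = rpkm(COUNTS)
tpm_values = tpm(COUNTS)

for gene in COUNTS:
    print(gene, [round(v) for v in rpkm_values[gene]], [round(v) for v in tpm_values[gene]])
```

One thing the sketch makes visible: the TPM values in every replicate sum to 1,000,000, whereas the RPKM column sums differ between replicates (roughly 410,000, 457,000, and 429,000 here). That is the sample-to-sample variation the TPM slide refers to.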
