Week 12 Gene expression, microarrays, RNAseq, and GO-11-8-23.pptx
Gene Expression, Microarrays, RNAseq, and GO
Dr. Jan E. Janecka
[email protected]
236 Mellon Hall

OUTLINE FOR TODAY
1. The goals of differential expression
2. Experimental approach
3. Microarrays
4. RNAseq
5. Interpretation of data
6. Gene ontology

Lesson 1
• The dynamic nature of gene expression
• The pitfalls of gene expression analysis
• Experimental approaches needed to understand gene expression

How can we understand what determines a phenotype?
• Genetic variation
• Associations (GWAS)
• Functional analysis: gene expression and protein expression
  …after viral infection
  …after drug treatment
  …relative to a knockout
  …at a later developmental time
  …in samples from patients
  …in a different body region

The Central Dogma

Differential Expression
• Gets you 1 step closer to the phenotype
• Focuses on genes that are transcribed and eventually translated in particular cells
• Lets you begin to understand the link between genotype, environment, and phenotype

Gene expression is context-dependent, and is regulated in several basic ways
• by region (e.g. brain versus kidney)
• in development (e.g. fetal versus adult tissue)
• in the dynamic response to environmental signals (e.g. immediate-early response genes)
• in disease states

How Can We Measure Gene Expression?

Gene Expression Analysis
• Samples of interest: Condition 1 (normal colon) versus Condition 2 (colon tumor)
• Isolate RNAs and convert to cDNA
• Microarrays or RNAseq
• Statistical analysis
• Differences that affect phenotype

Gene Expression Analysis
General approach to analysis of data
1. Preprocessing: normalization, scatter plots
2. Inferential statistics: t-test, ANOVA
3. Exploratory (descriptive) statistics: distances, clustering
Microarrays: Measurement of mRNA from select genes
RNAseq: Genome-wide measurement of all RNA transcripts

Lesson 2
• How microarrays are designed and analyzed
• The 7 main steps to differential gene expression
• Microarrays and RNAseq analysis

Microarrays: tools for gene expression
• Short oligonucleotides (oligos) deposited in a grid-like array
• Oligos are “probes” that represent short parts of exons
• Prepare cDNA from a sample and label it with fluorescence
• Hybridize the labeled cDNA to the microarray
• The fluorescence signal is a measure of gene expression

The 7 stages
Stage 1: Experimental design
Stage 2: RNA extraction and cDNA preparation
Stage 3: Hybridization to DNA array
Stage 4: Acquire image of fluorescent signal
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Deposit in microarray databases

Stage 1: Experimental design
• Need biological replicates (typically n≥3 per group)
• A balanced, randomized experimental design is critical

Stage 2: RNA and cDNA preparation
• RNA extraction, conversion to cDNA, labeling with fluorescent dye; evaluate and avoid systematic artifacts
• Extraction quality and cDNA conversion efficiency are important to consider, or you may miss target genes

Stage 3: Hybridization to DNA arrays
• The cDNA with fluorescent tags is hybridized to microarrays consisting of oligonucleotide probes
• There are substantial technical artifacts: temperature, humidity, the person doing the procedure, etc. have all been shown to affect results!
Stage 4: Image analysis
• RNA transcript levels are quantitated based on fluorescence intensity measured with a scanner
• Signal is proportional to the relative number of transcripts
[Figure: microarray spots comparing two samples. Each spot has a probe for one exon. Annotated spots: exon not expressed in either sample; much greater expression of exon in sample 1.]

Differential Gene Expression on a cDNA Microarray
• α B Crystallin is over-expressed in Rett Syndrome
[Figure: array images for Control and Rett samples. Source: Pevsner 2009]

Stage 5: Microarray data analysis
Statistical Analysis
• Compare results from different microarrays
• Determine which RNA transcripts are differentially expressed
• What are the criteria for statistical significance?
Clustering
• Meaningful biological patterns in the data
Classification
• Determine whether the expression pattern in RNA transcripts predicts groups
• e.g., specific types of lymphoma, biomarkers for disease

Stage 6: Biological Confirmation
• Microarray experiments are in a way “hypothesis-generating”
• The differential up- or down-regulation of specific RNA transcripts needs to be independently confirmed using other methods
  o Northern blots
  o Western blots
  o RT-PCR
  o RNAseq
  o in situ hybridization

Stage 7: Microarray databases
There are two main repositories
• Gene Expression Omnibus (GEO) at NCBI
• ArrayExpress at the European Bioinformatics Institute (EBI)
Minimum Information About a Microarray Experiment (MIAME):
► experimental design
► microarray design
► sample preparation
► hybridization procedures
► image analysis
► controls for normalization

NCBI provides access to GEO Datasets and GEO Profiles

GEO Datasets: How is an RNA transcript expressed across hundreds of experiments?
• Search “globin fetal and adult reticulocytes”

GEO Datasets: You can perform analysis on experimental data!
• NOTE: These online analysis tools are limited to microarray data
• Statistical test to identify genes that are differentially expressed
• You can make a heat map of gene expression
[Figure: heat map from the fetal versus adult blood experiment, genes by samples; you can zoom in on the heat map]

GEO Datasets: You Can Cluster Genes Based on Expression Patterns
• You can zoom in on the clusters
[Figure: gene clusters labeled "genes downregulated in …", "genes upregulated in fetal", and "genes that have no …"]

Activity 8 – DE analysis PART A

RNA Sequencing Example
• Samples of interest: Condition 1 (normal colon) versus Condition 2 (colon tumor)
• Isolate RNAs
• Generate cDNA, fragment, size select, add linkers
• Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence
• Map to genome, transcriptome, and predicted exon junctions
• Downstream analysis
• Differences in gene expression that affect phenotype

RNA Expression Varies in Time and Space
RNA studies require a more complex study design than genome sequencing
1. Technical replicates
• Repeat experiment, library prep, and sequencing
2. Biological replicates
• Multiple RNA isolations from the same sample (variation within cells)
• Multiple individuals of the same stage/condition (variation within individuals)
Need to take into account variation that could hide the experimental response
• Environmental factors (temperature, season)
• Age (juvenile vs adult)
GOAL: High correlation coefficient between replicates

Example of an RNAseq Pipeline
• Sequencing: RNA-seq reads (2 x 100 bp)
• Read alignment: Bowtie/TopHat alignment (genome)
• Transcript compilation: Cufflinks
• Gene identification: Cufflinks (cuffmerge)
• Differential expression: Cuffdiff (A:B comparison)
• Visualization: CummeRbund
Inputs
• Raw sequence data (.fastq files)
• Reference genome (.fa file)
• Gene annotation (.gtf file)

Align Reads to an Annotated Reference
GOAL: Need to determine which reads correspond to which genes
• Three general strategies depending on the resources available
• If you do not have an assembly, you need to make one first

Comparing Expression of Transcripts
• The estimated expression of genes and transcripts needs to be normalized
• FPKM = Fragments Per Kilobase of Transcript per Million mapped reads
• In other words, FPKM = fragments mapped to the transcript / (transcript length in kb x millions of mapped reads)
• The number of reads is proportional to the amount of cDNA
However:
• Total number of fragments is biased by gene length
• Total number of fragments is biased by library depth

Visualization of Spliced Alignment of RNA-seq Data
[Figure: IGV screenshot with tracks for Normal WGS, Tumor WGS, and Tumor RNA-seq, showing an acceptor site mutation]

Comparing Differential Expression Between Samples
• Model variability in fragment count for each gene across replicates
• Estimate fragment count and uncertainty for each isoform
• Estimate count variability for each transcript in each library
• Test for statistically significant differences in expression between isoforms by incorporating the …
Source: Malachi Griffith 2013, bioinformatics.ca

Lesson 3
• Main approaches to data analysis of gene expression
• Preprocessing to clean up data
• Inferential (t-tests, ANOVA) and descriptive statistics (scatter plots, volcano plot)
• Interpreting the results for biological significance
• Clustering analysis and heatmaps
• Understanding function with Gene Ontology

Gene Expression Data Analysis
The actual data analysis starts once the raw data is converted to expression values for each gene in a simple table (genes by samples)
• In most studies there are many more genes (> 20,000) than samples (< 100)
RNA transcript levels
• Signal intensity for microarray
• FPKM for RNAseq

Gene Expression Analysis: Preprocessing
• The main goal of data preprocessing is to remove the systematic bias in the data, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription
• The basic assumption is that the expression levels of most genes do not change in an experiment

Preprocessing: Global Normalization
• Histograms of raw intensity values for 14 arrays (plotted in R) before and after normalization was applied
[Figure: two panels, log signal intensity for each array, before and after normalization]

Descriptive Statistics - Scatter plots
[Figure: scatter plot of expression level of sample 1 versus expression level of sample 2, ranging from low to high expression, with upregulated and downregulated genes marked]

Volcano Plot
• Displays both p values and fold change
[Figure: p value (treated versus control) plotted against log fold change (treated/untreated). Annotated regions: significant and large difference, downregulation; significant and large difference, upregulation]

Interpreting
Note the results can have…
• a small p value (<0.05) with a big ratio difference
• a small p value (<0.05) with a trivial ratio difference
• a large p value (>0.05) with a big ratio difference
• a large p value (>0.05) with a trivial ratio difference
Which group is worth reporting?
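The four outcomes on the Interpreting slide can be sorted with a few lines of code. The following is a minimal sketch in Python, not the method used in the course activity: the gene names, p values, and log2 fold changes are made up, and the fold-change cutoff (|log2 fold change| ≥ 1, i.e. at least two-fold) is an assumption chosen only for illustration.

```python
# Hypothetical cutoffs for illustration; the lecture gives p < 0.05 but no fold-change threshold.
P_CUTOFF = 0.05
LOG2FC_CUTOFF = 1.0  # |log2 fold change| >= 1 means at least a two-fold change


def classify(p_value, log2_fc):
    """Place one gene into one of the p value / fold change groups."""
    significant = p_value < P_CUTOFF
    big_change = abs(log2_fc) >= LOG2FC_CUTOFF
    if significant and big_change:
        # The two outer corners of a volcano plot
        return "upregulated" if log2_fc > 0 else "downregulated"
    if significant:
        return "significant, trivial change"
    if big_change:
        return "big change, not significant"
    return "no evidence of change"


# Made-up values: (gene, p value, log2 fold change treated/untreated)
genes = [
    ("geneA", 0.001, 2.5),
    ("geneB", 0.002, -3.0),
    ("geneC", 0.010, 0.1),
    ("geneD", 0.400, 2.0),
    ("geneE", 0.700, 0.05),
]

results = {name: classify(p, fc) for name, p, fc in genes}
for name, label in results.items():
    print(name, label)
```

Only genes in the first group (small p value and a big ratio difference) land in the upper corners of a volcano plot, which is why that group is usually the one reported.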
Interpreting Analysis of Expression Data
• P-value and the problem with multiple testing
• For p = 0.05, 5% of ALL TESTS on genes with no real difference will result in a false positive (i.e., significant but really no difference)
• In a 20,000 gene analysis at p = 0.05, you would expect up to 20,000 x 0.05 = 1,000 of the ’significant’ observations to be false positives
• False Discovery Rate (FDR) compensates for multiple testing
• For FDR = 0.05, 5% of SIGNIFICANT TESTS result in a false positive
• In a 20,000 gene analysis at FDR = 0.05, you would expect that among all significant results, only 5% are false positives
For DE analysis, FDR is the standard approach. It is less likely to miss true positives than the Bonferroni correction.

Clustering Heat maps
• Clustering of genes on the y-axis and samples (cell lines) on the x-axis (Alizadeh et al. 2000)

Independent Validation Very Important
• 33 of 192 assays shown. Overall validation rate = 85%
Griffith et al. Alternative expression analysis by RNA sequencing. Nature Methods. 2010 Oct;7(10):843-847.

Successful Gene Expression Analysis ...
• Differences between two groups
• List of genes that are differentially expressed
Now What???

Understanding Phenotypic Effects
DNA, RNA, protein
[1] Protein families
[2] Physical properties
[3] Protein localization
[4] Protein function
Gene ontology (GO):
--cellular component
--biological process
--molecular function
GO terms are assigned to NCBI Gene entries

GO Biological Processes Enriched in Gene List

Activity 8 – DE analysis PART B

GO Applications - Enrichment Analysis
One of the main uses of Gene Ontology is Enrichment Analysis
• Given a set of genes that are up-regulated under certain conditions, which GO terms are over-represented (or under-represented)?
• Provides insight into the pathways/processes that are affecting the phenotype

The Main Repercussions of the Dynamic Nature of Gene Expression ….
What you do affects what genes are expressed, which then affects your physiology, metabolism, structure, mood, feelings….
Do things that change your gene expression in a positive way
• Read/Learn new things
• Exercise
• Balanced diet
• Intermittent fasting
• Wim Hof breathing and cold exposure
https://www.youtube.com/watch?v=q6XKcsm3dKs

The Wim Hof Method has many benefits ….
- Lowers heart rate
- Improves cardiovascular system
- Reduces pain
- Reduces inflammation
- Increases immune system
- Decreases sensitivity to cold
- Develops mental strength
- Reduces/eliminates depression
- Improves mood, outlook, and wellbeing

Over the next 2 weeks, I challenge you to do Wim Hof Breathing Exercises (10-15 min each day) and Cold Exposure at least 2-3x/week (at the end of a shower turn the cold water on, start with just 15 sec, then gradually work your way up to 2 min). I am confident this will have a positive effect on your health and mindset!
Guided Wim Hof Breathing Exercise (11 min) https://www.youtube.com/watch?v=tybOi4hjZFQ&t=3s
Wim on Cold Exposure & Breathing https://www.youtube.com/watch?v=nLHdG_zEue0

Review – Gene expression, microarrays, RNAseq, and GO
Main Concepts
Lesson 1
• The dynamic nature of gene expression
• The pitfalls of gene expression analysis
• Experimental approaches needed to understand gene expression
Lesson 2
• How microarrays are designed and analyzed
• The 7 main steps to differential gene expression
• Microarray and RNAseq DE analysis
Lesson 3
• Main approaches to data analysis of gene expression
• Preprocessing to clean up data
• Inferential (t-tests, FDR) and descriptive statistics (scatter plots, volcano plot)
• Interpreting the results for biological significance
• Clustering analysis and heatmaps
• Understanding function with Gene Ontology

Review – Gene expression, microarrays, RNAseq, and GO
Main Terms
Lesson 1
• Functional analysis, gene expression differences
• Microarrays, RNAseq
• Inferential statistics, exploratory statistics
Lesson 2
• Oligos, probes, cDNA, hybridization, fluorescent tags
• Rett syndrome, α B crystallin
• Clustering, classification, northern blots, western blots, RT-PCR, in situ hybridization
• Technical replicates, biological replicates, RNAseq pipeline
• Gene Expression Omnibus (GEO) databases, metadata
• Annotated reference, FPKM, fragment count, isoforms
Lesson 3
• Preprocessing, systematic bias, normalization, FDR, scatter plot, volcano plot, heat map, validation
• Gene Ontology, cellular component, biological process, molecular function
• Enrichment analysis, pathways
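
As a recap of the FDR term from Lesson 3, the multiple-testing arithmetic can be worked through in code. The following is a minimal sketch in Python under stated assumptions: the lecture does not say which FDR procedure is used, so the Benjamini-Hochberg step-up procedure is assumed here as one common choice, and the ten p values are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where a test is significant at the given FDR.

    Assumption: Benjamini-Hochberg step-up procedure (not named in the lecture).
    """
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p value satisfies p_(k) <= (k / m) * fdr
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is called significant
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# The lecture's arithmetic: 20,000 genes tested at p = 0.05
expected_false_positives = 20_000 * 0.05

# Made-up p values for ten hypothetical genes
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive_calls = [p < 0.05 for p in p_values]
fdr_calls = benjamini_hochberg(p_values, fdr=0.05)

print("expected false positives at p = 0.05:", expected_false_positives)
print("significant at p < 0.05:", sum(naive_calls))
print("significant at FDR = 0.05:", sum(fdr_calls))
```

With these made-up values, fewer genes pass the FDR cutoff than the raw p < 0.05 cutoff, which is the trade-off described in Lesson 3: fewer false positives, while missing fewer true positives than a Bonferroni correction would.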