Document Details

University of Ottawa

Yan Burelle

Tags

pathway analysis, bioinformatics, microarray analysis, gene expression

Summary

This document is about pathway analysis, specifically focusing on techniques like over-representation analysis (ORA) and gene set enrichment analysis (GSEA). It explains how these methods can be used to interpret large gene expression datasets, like those from microarray or RNA sequencing experiments, to understand the functional context of gene expression changes. The document describes the steps in performing these analyses and how to use the tools GEO2R and WebGestalt.

Full Transcript

HSS 4324 - Research Approaches in Health Biosciences

Yan Burelle, Ph.D.
Professor
Interdisciplinary School of Health Sciences, Faculty of Health Sciences & Department of Cellular and Molecular Medicine, Faculty of Medicine
University Research Chair in Integrative Mitochondrial Biology
University of Ottawa
Pavillon Roger Guindon, Room 2117
451 Smyth Road, Ottawa, Ontario K1N 8M5
Lab website: www.burellelab.com
Phone (office): 613-562-5800 ext 8130

Pathway analysis

Learning objectives
• Understand the purpose of pathway analysis
• Get familiar with the various steps involved in a pathway analysis
• Understand how over-representation analysis works
• Understand how gene set enrichment analysis works
• Use GEO2R and WebGestalt to perform GSEA

Pathway analysis
You’ve done a microarray/RNA sequencing experiment and have obtained a huge list of genes.

Questions (experimental vs. control group)
• Want to put these in a functional context
• Want to know what biological processes are perturbed
• Want to know what pathways are being dysregulated

What you need to do is
• Perform data reduction: tens of thousands of genes --> tens of pathways
• Identify modified functions/pathways/biological processes = more explanatory power
• Ex. when muscle cells are injured, how do muscle stem cells remodel themselves to create reparative tissue?

You can achieve this through pathway analysis
• 1st generation: Over-Representation Analysis (e.g. Gene Ontology Over-Representation Analysis)
• 2nd generation: Functional Class Scoring (e.g. Gene Set Enrichment Analysis (GSEA))

Over-Representation Analysis
Principle: Statistically evaluate the fraction of genes in a particular pathway that shows changes in expression

[Figure: % of genes belonging to the same biological pathway (ex: genes for the glycolysis pathway) among the total number of analyzed genes vs. among the significantly changed genes]
• Pathway enrichment: 1% of the total analyzed genes vs. 20% of the significantly changed genes. Big enrichment: glycolysis pathway genes are over-represented among the significantly affected genes compared to what you would predict from random chance
• No pathway enrichment: 1% of the total analyzed genes vs. 1% of the significantly changed genes. Poor enrichment and no over-representation
• Within the huge list, we are looking for whether or not there is an enrichment in specific biological pathways
• Among the genes in the list that are significantly affected (significant p-values, or significantly upregulated in the experimental group), are genes belonging to a function or subfunction excessively represented, i.e. more than expected from "random chance"?

Your gene array: extract a transcriptomics dataset from the GEO database
1 Create an input list of genes (e.g. significantly different at p<0.05) between the two groups that you are comparing
• Ex. control vs. muscle injury
2 For each gene set (ex. genes involved in glycolysis, genes involved in the immune response, etc.): a) count the number of genes affected, b) count the number of “background” genes (e.g. all genes detected), including the ones that are not significant
3 Test each pathway for over-representation of input genes
• This is how we determine which genes are downregulated/upregulated and which pathways are affected or not

Databases classify genes and their associated proteins according to function, cellular localisation and participation in processes
• Integrate the databases with the dataset to see if there is enrichment
• Compare the list of genes with what is in the database to see the affected genes

Gene Ontology ORA
• Ontology = formal representation of a knowledge domain
• Gene Ontology = cell biology
• GO is represented by a directed acyclic graph (DAG)
• Terms are nodes, relationships are edges
• Parent terms are more general than their child terms
• Unlike a simple tree, terms can have multiple parents
• Categories on the website: 1) biological process, 2) cellular component, 3) molecular function
• Ex. relating to 2): the same gene can be found in multiple GO terms, e.g. under plasma membrane because it is a receptor, and also under macrophage response
• GO terms group together genes that are associated with specific functions
• Categories are all linked together, from more general (parent) to more specific (child)
• One parent can have multiple child terms
• First one created, most used

Gene Ontology ORA: example of GO graph
• The graph runs from general parent terms to more specific child terms
• GO categories have classification numbers (GO terms); each node has one
• Every GO term has a list of genes related to the process
• Color coding allows you to track GO categories that show enrichment in your analysis
• Red = negative enrichment vs. control condition (downregulation of genes in this category); immunity was downregulated in this ex.
• Blue = positive enrichment (upregulated vs. control, positive fold change)
• White = not affected in the analysis

Over-Representation Analysis: limitations
• Some categories are so general that they are meaningless (ex: biological process)
• Ex. if it tells you that there is an over-representation of genes in "biological process", that is very vague and covers too many genes
• ORA uses genes above a cut-off (e.g. p<0.05) and discards everything else (including genes that are borderline significant…)
• This is the biggest limitation: a discarded gene may still be significant biologically
• ORA uses the number of genes and ignores their measured changes
• The algorithm does not consider magnitude (how much / how significant)
Functional Class Scoring aims to overcome these limitations

Functional Class Scoring
Principle: While large changes in individual genes can have a significant effect on pathways, weaker but coordinated changes in sets of functionally related genes can also have significant effects
• Fold change = magnitude of the difference in mRNA expression
• ORA will often not allow this to be seen precisely
Your gene array
1 Compute a gene-level statistic (e.g. fold change, Student’s t-test): a statistical value for each gene
2 Aggregate the gene-level statistics for all genes in a pathway into a single pathway-level statistic
3 Assess significance with a permutation test

Gene Set Enrichment Analysis (GSEA)
1 - GSEA calculates an Enrichment Score (ES)
• An ES is calculated for each pathway (ex. genes for leukocyte migration, glycolysis, etc.)
• It shows you how much a pathway is affected compared to others

A) Rank genes by their expression difference (Group A = control, Group B = experimental)
• Rank by fold change, from up in B vs A (upregulated) to down in B vs A (downregulated); some genes have not changed significantly
• Generate a heat map: a visual representation of changes in gene expression between groups and samples, used to look at very large datasets and visualize patterns
• Each row represents a gene and each column represents a sample; each line corresponds to the expression value of a gene, color coded by fold change
• The algorithm will go down the list, mark the genes involved in leukocyte migration and generate a graph
• The map shows the ranking of the gene set in the ranked list of all genes
• Most leukocyte migration genes are found to be upregulated in B vs A; very few leukocyte migration genes are downregulated in B vs A

B) For each gene set (ex: leukocyte migration gene set):
• Compute a cumulative sum over the ranked genes
• Increase the sum when a gene is in the set, decrease it otherwise
• The magnitude of the increment depends on the gene-phenotype correlation
• The maximum deviation from zero is recorded as the Enrichment Score (ES)
• The bigger the ES, the more genes related to this process are upregulated, and vice versa

[Figure: running enrichment score over the ranked fold change of individual genes (fold change vs A), from upregulated (+) through not affected to downregulated (-)]
• The heatmap is represented as a list of fold changes
• As more genes related to leukocyte migration are detected, the ES increases: more of these genes are in the upregulated category and very few are in the downregulated category, where fewer genes are encountered
• The max cumulative enrichment (max deviation from 0) provides the enrichment score
• Leading edge: the genes responsible for the enrichment, i.e. the genes contributing to the statistical significance of the pathway
• Ex. of a very statistically enriched and upregulated gene set
• Ex. of genes picked up randomly throughout the dataset: no enrichment, as the sum doesn’t deviate from zero much

Gene Set Enrichment Analysis (GSEA)
3 - Visualize complex changes in functions, processes and pathways
• Results can be further organized into visually appealing maps and graphs to provide further reduction and a big overview of what is going on
• Every node is a family of gene sets; the algorithm groups related genes together
• Blue nodes are upregulated
• Red nodes are downregulated in exp. vs. control

Pathway enrichment analysis with GEO2R and WebGestalt
Analysis of dataset GSE78929

WORKFLOW
• GEO database: select gene microarray to analyze (https://www.ncbi.nlm.nih.gov/gds/)
• GEO2R: generate and download list ranked by fold change (https://www.ncbi.nlm.nih.gov/geo/geo2r/)
• Excel spreadsheet: prepare file for GSEA analysis
• WebGestalt: perform GSEA analysis (http://www.webgestalt.org/option.php)

GEO Datasets
• Enter dataset accession number or keyword for search

GEO2R
• Copy the dataset accession number directly here
• Define the names of the groups to compare
• To obtain the ranked gene list
• P value: adjusted according to the settings selected (see options for default setting)
• Log fold change vs control
• To obtain the complete ranked gene list
• Scroll-down arrow to obtain the expression value per sample
• Complete ranked gene list: will save as a tab-delimited .txt file
1. Copy ID and log FC (without column header) and paste in a new file that you will save as a tab-separated .txt file, with ID in column A and log FC in column B
2. In your Finder, switch the file extension from .txt to .rnk

WebGestalt
• This info is found on the GEO website under the name Platforms
• Upload your .rnk file
• Nothing to add here
• Information on the number of unambiguously mapped genes
• Parameters of the enrichment analysis. These can be modified in the advanced settings
• Indicates the number of pathways that were significantly enriched in the analysis, either negatively or positively
• Provides a graph view of the number of genes identified in each of the annotation categories
• Pathway statistics
• Pathways that are enriched in the analysis (red = positively regulated vs control; blue = negatively regulated vs control)
• Detailed information for each pathway
• Clicking here will provide you with a graphical view, which will vary according to the method used for analysis
• Example of graphical view obtained using a GSEA analysis with the WikiPathways database; red genes are the ones identified as changed in the analysis
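
Worked example (over-representation test): the three ORA steps described earlier in the transcript (input list, counts per gene set, test for over-representation) can be sketched with a one-sided hypergeometric test. This is only an illustration under stated assumptions: the slides do not name the statistical test that a given tool uses, the function name `ora_p_value` is hypothetical, and all gene counts are made up to mirror the 1% vs. 20% figure.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_input, n_overlap):
    """Probability of seeing at least n_overlap pathway genes in the input
    list if the input genes were drawn at random from the background."""
    max_overlap = min(n_pathway, n_input)
    total_draws = comb(n_background, n_input)
    # Sum the hypergeometric probabilities for every overlap >= n_overlap.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_input - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total_draws


# Hypothetical counts (not taken from the dataset in the slides):
# 10,000 background genes, 100 of them (1%) in the glycolysis gene set,
# 200 significantly changed genes in the input list.
p_enriched = ora_p_value(10_000, 100, 200, 40)     # 40/200 = 20% of input
p_not_enriched = ora_p_value(10_000, 100, 200, 2)  # 2/200 = 1% of input

print(f"20% of input genes in pathway: p = {p_enriched:.3g}")
print(f" 1% of input genes in pathway: p = {p_not_enriched:.3g}")
```

In practice the test is repeated for every gene set, so the resulting p-values would normally be adjusted for multiple testing before interpretation.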
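
Worked example (enrichment score): the cumulative-sum procedure described earlier in the transcript for GSEA (increase the sum when a gene is in the set, decrease it otherwise, record the maximum deviation from zero) can be sketched as follows. The weighting of each increment by the absolute value of the gene-level score is an assumption modelled on the commonly described weighted running sum, the function name `enrichment_score` and the toy gene names are hypothetical, and the permutation test used to assess significance is not shown.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score.

    ranked_genes: gene IDs sorted from most upregulated to most downregulated.
    scores: gene-level statistics (e.g. log fold change) in the same order.
    gene_set: set of gene IDs belonging to the pathway.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    # Increment for a hit is proportional to |score| ** weight (assumption).
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, es = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        # Go up on a gene in the set, down by a fixed step otherwise.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running  # keep the maximum deviation from zero
    return es


# Toy ranked list (hypothetical genes and fold changes).
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
log_fc = [3.0, 2.5, 2.0, 1.0, -0.5, -1.0, -2.0, -3.0]

es_up = enrichment_score(genes, log_fc, {"g1", "g2", "g3"})    # set at the top
es_down = enrichment_score(genes, log_fc, {"g6", "g7", "g8"})  # set at the bottom
print(es_up, es_down)
```

A gene set concentrated at the top of the ranked list gives a positive score and one concentrated at the bottom gives a negative score, which matches the "bigger ES, more upregulated genes, and vice versa" reading in the notes.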
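
Worked example (preparing the .rnk file): the manual spreadsheet step described earlier in the transcript (copy ID and log FC without the header into a tab-separated file, then rename it to .rnk) could also be scripted. This is a sketch, not part of the lecture workflow: the column names `ID` and `logFC` are assumed to match the header of the GEO2R export and should be checked against the downloaded file, the function name `geo2r_to_rnk` is hypothetical, and the small table below is invented for illustration.

```python
import csv
import io


def geo2r_to_rnk(table_text, id_col="ID", fc_col="logFC"):
    """Convert a tab-delimited table (with header) into .rnk text:
    two tab-separated columns, ID and log fold change, no header."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    rows = []
    for row in reader:
        try:
            rows.append((row[id_col], float(row[fc_col])))
        except (ValueError, TypeError):
            continue  # skip rows whose fold change is missing or not numeric
    # Rank from most upregulated to most downregulated.
    rows.sort(key=lambda r: r[1], reverse=True)
    return "".join(f"{gene_id}\t{fc}\n" for gene_id, fc in rows)


# Invented example table; a real export has many more rows and columns.
example = (
    "ID\tadj.P.Val\tlogFC\n"
    "probe_a\t0.01\t-1.5\n"
    "probe_b\t0.20\t2.25\n"
    "probe_c\t0.03\tNA\n"
)
rnk_text = geo2r_to_rnk(example)
print(rnk_text)

# To write a file for upload (the file name is a placeholder):
# with open("ranked_genes.rnk", "w", newline="") as handle:
#     handle.write(rnk_text)
```

Writing the result directly with a .rnk extension replaces the rename step in the Finder.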
