Materials e Methods (Part 6) - Computational Analysis Methods Employed [Corrected 2].docx
Document Details
Uploaded by DiligentPolynomial
Open University
Tags
Full Transcript
**MATERIALS AND METHODS** **[6 Computational Analysis Methods Employed]** This chapter will introduce all the computational approaches and methodologies used to study the effectiveness of ATRA treatment and its genomic implications. The first methodology covered is inherent to calculating the ATRA...
**MATERIALS AND METHODS** **[6 Computational Analysis Methods Employed]** This chapter will introduce all the computational approaches and methodologies used to study the effectiveness of ATRA treatment and its genomic implications. The first methodology covered is inherent to calculating the ATRA-Score, derived from experimental data in order to quantify the cellular response to treatment with Retinoic Acid. This calculation is carried out through a series of bioinformatic analyses investigating the gene expression resulting from ATRA treatment. There will be additional paragraphs discussing transcriptomics techniques (RNA-Sequencing), which are widely known for being useful in acquiring the input data required to carry out further analysis and in determining which genes exhibit differential expression after ATRA treatment. Subsequently, the method concerning the in-silico calculation of the ATRA-score fingerprint will be evaluated in depth. The ATRA-Score predicts the response to treatment with Retinoic Acid through computational models. In particular, this is performed by using gene expression patterns indicative of sensitivity to ATRA. Finally, the methods used to evaluate ATRA sensitivity in a set of patients affected by gastric cancer and allowing the validation of signatures/models previously obtained will be discussed. The final aim is to develop predictive models that are representative of the current clinical situation, meeting the needs of personalized medicine. As a final step, the sample clustering methods, which allowed the genomic differences/similarities relating to gastric cancer to be categorized to identify any molecular subgroups that may respond in an alternative or exclusive manner to ATRA treatment, will be explored. **[6.1 Calculation of the experimentally determined ATRA-score:]** In order to determine an experimental parameter that allows measuring the sensitivity of treatment with retinoic acid, 27 gastric carcinoma cell lines were utilized as a model. The characteristics and annotations of such cell lines are listed in **Tab. 1**: These cell lines come from different anatomical origins and represent various types and subtypes of gastric cancer, which allows a comprehensive evaluation of treatment efficacy on different genetic and biological backgrounds. The colourimetric MTS-assay kit (Abcam, ab197010) was used to measure the cell proliferation and viability. The MTS-assay produces a colorimetric result that varies in intensity depending on cell viability. When MTS is reduced by living cells, a purple formazan is formed. The intensity of the purple colour is directly proportional to the metabolic activity of the cells. In fact, an intense purple colour indicates high metabolic activity, i.e. a high number of viable and proliferating cells. On the contrary, a pale or absent purple colour indicates reduced metabolic activity, suggesting a low number of viable cells. This latter case may be due to a high cytotoxicity of the treatment administered to the cells, meaning that the treatment is effective in inhibiting cell proliferation or causing cell death. To proceed with in-silico evaluation, the cell lines were exposed to either vehicle (DMSO) or five increasing concentrations of ATRA (1x10^-9^ M, 1x10^-8^ M, 1x10^-7^ M, 1x10^-6^ M, 1x10^-5^ M) for six days. To generate considerable statistical significance, we exposed 9 experimental replicates to vehicle or each concentration of ATRA. To normalize data, each ATRA value was divided by the control (vehicle) value (**Fig. 8**). ![](media/image2.png) By averaging the different treatment replicates of the 5 distinct ATRA concentrations, the AUC score was calculated for each cell line. Specifically, to measure the sensitivity of each cell-line to ATRA anti-proliferative action, an experimental *ATRA-score* was calculated based on the *Area-Under*-*the*-*Curve* (*AUC*) determined for each cell-line exposed to ATRA for 6 days (**Fig. 9**). The AUC calculation was performed using the trapezoidal method, i.e. by summing the areas of the trapezoids formed by adjacent points on the graph: \ [\$\$AUC = \\sum\_{i = 1}\^{n - 1}{(\\frac{x\_{i + 1} - x\_{i}}{2}\*\\left( {y\_{i} + y}\_{i + 1} \\right))}\$\$]{.math.display}\ Where [*x*~*i*~]{.math.inline} are the log concentrations and [*y*~*i*~]{.math.inline} are the normalized OD values. AUC stands for \"Area Under the Curve\" and measures the entire two-dimensional area under the entire curve^1^. To calculate the AUC, the mathematical formula of numerical integration of data normalized to ATRA concentrations was used. Data analysis was conducted using the R software, generating figures with the *ggplot2*^2^ package. The specific function used to calculate the AUC was \"MESS::auc\". The ATRA-score based on the AUC allows evaluating the antiproliferative response of different cell lines to ATRA treatment. A lower AUC value indicates a greater inhibition of cell proliferation, suggesting a higher sensitivity of the cell line to ATRA\'s antiproliferative action. Conversely, a higher AUC value indicates a lower inhibition of cell proliferation, hence a lower sensitivity to treatment. Finally, the different AUC values were re-scaled between the "1.00" and "0.00" reference-values, in order to obtain a final descending ranking of the various cell lines, where the highest score indicates a better response to treatment (**Tab. 2**). ![](media/image4.png) **[6.2 RNA-Sequencing Settings and Details:]** *RNA-Seq* was carried out using paired-end, 121-base-pair-long reads on an Illumina NextSeq500 instrument. The *FastQC* (v.0.11.9) tool was used to assess the sequencing read quality. Two-pass mode sequence alignment of the total-RNA (stranded) to the reference human genome (GRCh38) was carried out utilizing *STAR* (v.2.7.9a)^3^. Furthermore, gene expression was measured using Gencode\'s extensive annotations (v38 GTF File). Samples were adjusted and normalized for library size using the *cpm* method in the R-statistical-environment. CPM (Counts Per Million) normalization transforms gene counts into values per million reads, allowing a fair comparison between samples with different library sizes. \ [\$\$CPM = \\left( \\frac{\\text{Gene\\ Count}}{\\text{Total\\ Number\\ of\\ Reads\\ Sequenced}} \\right)\*10\^{6}\$\$]{.math.display}\ - **Gene Count**: The number of reads that map to a given gene in a sample. - **Total Number of Reads Sequenced**: The sum of all reads mapped to all genes in the sample. - **Multiplication by 10\^6**: This multiplication factor scales the counts to make them \"per million\", making it easier to compare different samples. Differential analysis was conducted with the *DESeq2* (v1.28.1) pipeline^4^. *GSEA* (*Gene-Set-Enrichment-Analysis*) was performed using the *Limma* (v. 3.52.2) package and *GSEABase* (v 3.19). Gene-set collections were retrieved from the *Molecular-Signature-Database* (*MSigDB*)^5^. The p-values were corrected for multiple testing using the False-Discovery-Rate (FDR) procedure, with the significance threshold set at 0.1. The raw data of the analysis are available in the *EMBL-EBI* *Annotare* database^6^ (https://www.ebi.ac.uk/fg/annotare/) under the accession numbers: **E-MTAB-12387** (Cell-lines) and **E-MTAB-12385** (Patients). *6.2.1 Transcriptomic Clustering:* The processed/normalized RNA-Seq data were used to perform transcriptomic clustering of the cell-lines with the R-programming language. The pairwise distances between samples were calculated using the Euclidean Distance Metric. The Euclidean distance metric is a method for calculating the distance between two points in an n-dimensional space. This metric is often used in mathematics, statistics, and machine learning to measure the similarity or difference between pairs of objects. The Euclidean distance between two points [**p**]{.math.inline} and [**q**]{.math.inline} in n-dimensional space is the length of the line segment connecting them. It is a generalization of the concept of distance between two points in 2-dimensional space (Cartesian plane). For two points [**p** = (*p*~1~,*p*~2~,...,*p*~*n*~)]{.math.inline} and [**q** = (*q*~1~,*q*~2~,...,*q*~*n*~)]{.math.inline}, the formula to calculate the Euclidean distance [**d**]{.math.inline} is: \ [\$\$\\mathbf{d}\\left( \\mathbf{p,q} \\right) = \\sqrt{\\left( q\_{1} - p\_{1} \\right)\^{2} + \\left( q\_{2} - p\_{2} \\right)\^{2} + \\ldots + \\left( q\_{n} - p\_{n} \\right)\^{2}\\ }\$\$]{.math.display}\ The final goal is to perform a hierarchical clustering, which is a clustering technique that builds a hierarchy of clusters. This can be done in two main ways: agglomerative (bottom-up) or divisive (top-down). The agglomerative approach starts by treating each point as a single cluster and then merges the closest clusters until a single cluster containing all the points is obtained. To carry out hierarchical clustering, Euclidean distance was selected as the distance metric to measure the similarity or dissimilarity between points. In particular, the Euclidean distance permits the calculation of the distances between all pairs of points in the data-set, obtaining a distance matrix. The above values were used as inputs for the *hclust* function to measure the distance between clusters with the *Ward\'s minimum variance* method^7^. The *hclust* function in R performs agglomerative hierarchical clustering using a distance matrix as input. This function can use several methods to determine which clusters to merge at each step. One such method is the Ward\'s method. The Ward\'s minimum variance method is a criterion used to merge clusters that minimizes the increase in total variance within the cluster. More precisely, at each step, the Ward\'s method merges the two clusters, leading to the minimum change in the sum of squared distances within the clusters. The logical steps for generating clustering were as follows: 1. **Calculation of Distances**: The Euclidean distances between the samples were calculated using the Euclidean distance formula. 2. **Creation of the Distance Matrix**: These distance values were organized into a distance matrix. 3. **Hierarchical Clustering**: This distance matrix was used as input for the *hclust* function in R. 4. **Ward\'s Method**: The *hclust* function applied the Ward\'s method to merge the clusters in order to minimize the increase in total variance at each step. Data visualization of the hierarchical clustering results was performed using a *dendrogram*, which shows the tree-like structure of the clusters. **[6.3 In-silico Computation of the ATRA-score Fingerprint:]** To obtain a signature predicting the response to ATRA treatment in gastric cancer, the basal gene-expression levels were correlated to the *ATRA-scores* of each cell-line using the *Spearman*\'s approach. The transcriptomics data of cell lines used for the correlation analysis were retrieved from the *CCLE* (*Cancer-Cell-Line-Encyclopedia*) database. The reason for using the basal gene-expression transcriptomic data from the CCLE was the following: despite the treatment of 27 cell lines with retinoic acid for which an experimental sensitivity score was calculated, RNA-Sequencing-based transcriptomic profiling was performed for only 15 of the 27 treated gastric-lines. Thus, it was decided to consider the data from all the cell lines treated with retinoic acid to balance the numbers and to obtain a more substantial amount of supporting data. The correlation analysis resulted in an initial list of 358 genes having a *p-value* \< 0.01 and a *rho correlation coefficient* \> 0.4. In particular, 146 of these genes showed a positive correlation, while the remaining 211 genes showed a negative correlation. To obtain a reduced cluster of genes predictive of the response to retinoic acid treatment, a connectivity analysis was performed in *Cytoscape*^8^. Cytoscape is an open-source software designed for the visualization and analysis of complex molecular networks and pathways integrated with data of any type. Originally developed for systems biology, Cytoscape is widely used to represent protein-protein interaction networks, metabolic networks, gene regulatory networks, and other biological networks. Functional protein interconnections were computed using the *STRING* (*Search-Tool-for-the-Retrieval-of-Interacting-Genes/Proteins*) database^9^. STRING is an online database and bioinformatics resource designed to collect, integrate, and interpret information on all known and predicted functional associations between genes and proteins. In particular, the application of STRING results in a better understanding of the molecular interactions and interaction networks within cells. Lastly, a final signature of 42 genes was obtained by selecting only the genes showing a degree of interconnectedness \> 2 (connection to 2 or more genes in the represented cluster), which eliminates isolated genes or genes with only one interconnection. Specifically, the final genes resulting from the correlation analysis are the following: EGF, HBE1, OR52D1, OR51B5, KCNJ8, ERBB4, WNT2, OR51I2, SPX, FLNC, COL6A1, OR51J1, OR51I1, PITX2, SIRT3, RTP2, EHD2, HBG1, TLL1, LOXL1, PAX9, NUP98, TPM1, MEOX1, CPS1, OR51B4, ALX3, CHRM2, GRK2, OR1N2, SRRM1, CLTB, BHMT, SIX6, UPF3A, ING1, OR4F21, FIP1L1, TLE3, OR52M1, HBM, NRAP. **[6.4 ATRA sensitivity predictions in gastric-cancer patients:]** The predictiveness of the response to retinoic acid treatment, calculated using an ATRA-score, in patient samples from the *TCGA* (*The-Cancer-Genome-Atlas*) and our 13 gastric-cancer cases (*characterized in the next chapter*), was determined using a *Similarity-score* strategy (*GSVA*: *Gene-Set-Variation-Analysis*) that relies on the gene-expression signature previously identified. The *GSVA* algorithm permitted the calculation of a *Similarity-score* value reflecting the enrichment of the gene-set under study, using the *GSVA*-*R*-package^10^. The Gene Set Variation Analysis (GSVA) algorithm is a non-parametric, unsupervised method used to evaluate the variation of gene set enrichment across different samples of an expression dataset. The GSVA technique quantifies the degree of enrichment of gene sets in different samples by assigning a score that reflects the extent to which a gene set is collectively upregulated or downregulated within a sample. This technique enables the comparison of biological pathways and processes across various conditions or phenotypes without requiring pre-established groups. As a result, it represents a precious tool to investigate the fundamental biology of intricate datasets. GSVA is often used in genomics research, particularly in cancer studies, to detect variations in the activation of pathways and biological processes across various samples^10^. The workflow adopted to perform the analysis is as follows: 1. The signature of the 42 identified genes was divided into two groups: *Gene-Set Pos* and *Gene-Set Neg*. The first group includes the genes positively correlated with the response to retinoic acid treatment, while the second group includes the genes negatively correlated with the response to retinoic acid treatment: - **[POSITIVE CLUSTER SIGNATURE:]** - **[NEGATIVE CLUSTER SIGNATURE:]** 2. The GSVA analysis was performed on the 373 patients (TCGA samples) and 13 patients (our lab samples), with the 2 clusters defined. This provided a 2 x 373 (TCGA samples) and 2 x 13 (our lab samples) matrix as output, where the 2 rows correspond to the two *Positive* and *Negative* clusters, while the 373 and 13 columns correspond to the number of patients. 3. Subsequently, the values found for the 2 clusters (*Positive* and *Negative*) were rescaled to homogenize the data and determine their average value to obtain a final \"S-Score\" (Similarity Score): - [GSVA~UP~ = *rescale*(0,1);]{.math.inline} - [GSVA~DOWN~ = *rescale*(1,0);]{.math.inline} - [*S*~Score~ = *mean*(GSVA~UP~,GSVA~DOWN~).]{.math.inline} 4. The final score was represented using two distinct visualizations: a *BoxPlot* and a *ScatterPlot*. **[6.5 Clustering of the samples into the G-DIFF and G-INT sub-groups:]** The *NTP (Nearest Template Prediction)* classification algorithm^11^ was used to cluster the 373 TCGA samples and our patients' collection based on a signature template of 171 genes that classify gastric tumors into the G-DIFF and G-INT sub-groups^12^. The signature of the 171 genes is structured as follows: - **[G-INT CLASS:]** TSPAN8, GPX2, LYZ, PLS1, LGALS4, FUT2, C5orf32, ATAD4, DEGS2, NOSTRIN, MUC13, ALDH3A1, MYO1A, ABCC3, AGR3, VILL, SH3RF1, TRAK1, EGLN3, CDH17, BCL2L14, CEACAM1, LIPH, RSPH1, KALRN, CAPN8, CLCN3, PLEK2, TMC5, CYP3A5, EPS8L3, FA2H, TOX3, BAIAP2L2, PIP5K1B, AGPAT2, BCL2L15, TNFRSF11A, PLCH1, GPR35, ATP10B, TC2N, MMP28, LLGL2, CAPN10, TRNP1, SDCBP2, MYB, ACSM3, REG4, CYP2C18, PRR15, SGK493, HNF4G, TMEM45B, KLF5, UGT8, RNF128, KCNE3, LOC100133019, DNAJC22, ST6GALNAC1, CLRN3, GDF15, RNF43, KIAA0746, USH1C, CLDN2, EHF, FOXA3, POF1B, LOC286208, C9orf152, GMDS, SLC22A18AS, C11orf9, LOC100131701, TMPRSS4, SLC37A1, PTK6, CEACAM5, SULT2B1, LOC120376, MST1R, ELF3, SLC26A9, SLC40A1, PTPRB, AGR2, GALNT12, HEPH - **[G-DIFF CLASS:]** RDX, TBCEL, FERMT2, MYO5A, SOAT1, FADS1, MYH10, FNBP1, ELOVL5, ABL2, PGBD1, SELM, LOXL2, N-PAC, FZD2, KIAA1586, RASSF8, NUAK1, TMEFF1, SCHIP1, TMEM136, ZCCHC11, FAM101B, FAM127A, SIX4, DENND5A, TTC7B, ZNF512B, KIRREL, GNB4, FN1, GJC1, GLIPR2, FJX1, DSE, ENAH, DNAH14, CALD1, GPRASP2, HEG, DLX1, TIMP3, GLT8D4, LPHN2, PTPRS, FRMD6, SNAP47, WHAMML1, GATA2, APH1B, MLLT11, PPM1F, SNX21, ANXA6, PKIG, ANTXR1, ATP8B2, CSRP2, DEGS1, KLHDC8B, DEPDC1, CSE1L, WDR35, SAMD4A, TRIM23, FAM92A1, S1PR3, TUBA1A, LOC644450, PTPN1, HOMER3, IGFBP7, TSR1, AURKB, MSX1, CTSL1, TEAD1, LOC283658, MAP1B The NTP algorithm allows samples to be classified based on their gene expression profiles using a set of predefined gene expression signatures. The pre-processed data were processed using the \"NTP\" R-package. This package calculates the Euclidean distance between the expression profile of each sample and the centroid vector of each template. The sample is then assigned to the template with the shortest distance. The accuracy of the NTP classification method was assessed using a cross-validation procedure, using 100 permutations. This methodology divides the data into a training set and a test set in a random manner. The performance of the model is then evaluated based on the percentage of correctly classified samples in the test set. **[BIBLIOGRAPHY:]** 1\. Flach, P., Hernandez-Orallo, J. & Ferri, C. A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance. Proceedings of the 28th International Conference on Machine Learning, ICML 2011 664 (2011).2. Wickham, H. Ggplot2: Elegant Graphics for Data Analysis. (Springer, New York, NY, 2009). doi:10.1007/978-0-387-98141-3.3. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinforma. Oxf. Engl. **29**, 15--21 (2013).4. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. **15**, 550 (2014).5. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics **27**, 1739--1740 (2011).6. Athar, A. et al. ArrayExpress update -- from bulk to single-cell expression data. Nucleic Acids Res. **47**, D711--D715 (2019).7. Murtagh F & Legendre P. Ward's Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward's Criterion? J Classif **31**, 274--275 (2014).8. Shannon, P. et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. **13**, 2498--2504 (2003).9. Szklarczyk, D. et al. The STRING database in 2021: customizable protein--protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. **49**, D605--D612 (2021).10. Hänzelmann, S., Castelo, R. & Guinney, J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics **14**, 7 (2013).11. Hoshida, Y. Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment. PLoS ONE **5**, e15543 (2010).12. Tan, I. B. et al. Intrinsic Subtypes of Gastric Cancer, Based on Gene Expression Pattern, Predict Survival and Respond Differently to Chemotherapy. Gastroenterology **141**, 476-485.e11 (2011).