PhD Thesis - Luca Guarrera (Open University - Mario Negri Institute)

Doctor of Philosophy (PhD)
Life, Health and Chemical Sciences
Faculty of Science, Technology, Engineering and Mathematics
Bioinformatics for Computational Genomics

BIOINFORMATICS CHARACTERIZATION AND PRE-CLINICAL STUDIES ON THE THERAPEUTIC POTENTIAL OF ALL-TRANS-RETINOIC ACID IN THE PERSONALIZED TREATMENT OF GASTRIC CANCER

LEAD SUPERVISOR: Enrico Garattini
Head of Biochemistry Department, Biochemistry Department
Mario Negri Institute for Pharmacological Research (IRCCS)

INTERNAL SUPERVISOR: Mario Salmona
Head of Biochemistry and Protein Chemistry Laboratory, Biochemistry Department
Mario Negri Institute for Pharmacological Research (IRCCS)

INTERNAL SUPERVISOR: Marco Bolis
Head of Computational Oncology Unit, Oncology Department
Mario Negri Institute for Pharmacological Research (IRCCS)

EXTERNAL SUPERVISOR: Claudia Manzoni
Lecturer in Translational Neuroscience, UCL School of Pharmacy
University College London for the Higher Education Academy (UK)

PHD STUDENT: Luca Guarrera [ID: J8105428]
Computational Oncology Unit, Oncology Department
Mario Negri Institute for Pharmacological Research (IRCCS)

Academic Year 2023/2024

Do or do not. There is no try.
Master Yoda

TABLE OF CONTENTS

ABSTRACT
INTRODUCTION
1 Gastric Cancer Background: Classification and Clinical Approaches
1.1 Introduction to Gastric Carcinoma:
1.1.1 Global Incidence and Epidemiology
1.1.2 Pathogenesis and Risk Factors
1.1.3 Diagnostic and Screening Advances
1.2 Molecular and Pathological Classification:
1.2.1 Lauren’s Classification System
1.2.2 WHO Classification and Subtypes
1.2.3 Molecular Subtypes and Genomic Characterization
1.3 Clinical Management and Treatment Strategies:
1.3.1 Surgical and Non-Surgical Approaches
1.3.2 Chemotherapy and Targeted Treatment Options
1.3.3 Emerging Therapies and Clinical Trials
1.4 Prognostic Factors and Biomarkers:
1.4.1 Role of Genetic Markers in Prognosis
1.4.2 Tumor Markers and Predictive Value
2 All-Trans Retinoic-Acid (ATRA):
2.1 General Overview and Properties of ATRA:
2.2 Metabolism of ATRA and Cellular Uptake:
2.2.1 Synthesis from Vitamin A
2.2.2 Binding to Cellular Retinoic Acid-Binding Proteins (CRABPs)
2.2.3 Oxidative Catabolism to Inactive Metabolites
2.3 Mechanisms of Action in Cellular Processes:
2.3.1 Interaction with Nuclear Retinoid Receptors
2.3.2 Regulation of Gene Expression via RARs
2.3.3 Involvement in Non-Genomic Signaling Pathways
2.4 Role in Cellular Differentiation and Development:
2.4.1 Therapeutic Application in Acute Promyelocytic Leukemia (APL)
2.4.2 Impact on Embryonic Development
2.4.3 Extending Beyond Hematopoietic Cells
2.5 ATRA Role in Solid Tumors:
2.5.1 Overview of ATRA in Solid Tumor Therapy
2.5.2 Mechanistic Insights into the Action of ATRA in Solid Tumors
2.5.3 Clinical Trials and Therapeutic Combinations
2.5.4 ATRA and Resistance Mechanisms in Solid Tumors
2.6 Therapeutic Implications and Limitations of ATRA:
2.6.1 Clinical Use Limitations
2.6.2 Research Efforts and Development of ATRA Analogues
3 Public Data Retrieval and Databases
3.1 TCGA (The Cancer Genome Atlas) Database:
3.1.1 TCGA's Contributions and Resources
3.1.2 Methodology and Selection in TCGA
3.1.3 Gastric Cancer in The Cancer Genome Atlas (TCGA)
3.2 CCLE (Cancer Cell Line Encyclopedia) Database:
3.2.1 Data Collection and Development
3.2.2 CCLE Database Features and Accessibility
MATERIALS & METHODS
4 Outline of Research Objectives:
4.1 Specific Aim 1:
4.1.1 Experimental Design – Aim 1
4.2 Specific Aim 2:
4.2.1 Experimental Design – Aim 2
4.3 Specific Aim 3:
4.3.1 Experimental Design – Aim 3
4.4 Expected Outcomes, Risks and Innovation:
4.4.1 Expected Outcomes
4.4.2 Risk Analysis, possible problems and solutions
4.4.3 Significance and Innovation
5 RNA-Sequencing: Methods & Technique
5.1 Sequencing Workflow and Pre-Analysis:
5.1.1 RNA Extraction and Library Preparation
5.1.2 Illumina NGS Process Workflow
5.2 RNA-Sequencing Pre-Processing phase:
5.2.1 Multiplexing and Demultiplexing
5.2.2 FASTQ file quality control
5.2.3 Alignment of the RNA-Seq Data
5.3 RNA-Sequencing Post-Processing phase:
5.3.1 Differential Expression Analysis with DESeq2
5.3.2 Gene Set Enrichment Analysis (GSEA)
6 Computational Analysis Methods Employed
6.1 Calculation of the experimentally determined ATRA-score:
6.2 RNA-Sequencing Settings and Details:
6.2.1 Transcriptomic Clustering:
6.3 In-silico Computation of the ATRA-score Fingerprint:
6.4 ATRA sensitivity predictions in gastric-cancer patients:
6.5 Clustering of the samples into the G-DIFF and G-INT sub-groups:
RESULTS
7 Extensive Results and Insights
Summary Overview of the Outcomes:
7.1 Sensitivity Profiling of Gastric Cancer Cell Lines to ATRA:
7.2 Identification of the RAR/RXR Isoform Mediating ATRA's Anti-proliferative Effects:
7.2.1 Expression of the RAR/RXR mRNAs in gastric cancer cell lines and stomach tumors
7.2.2 Effects of RAR agonists and antagonists on the growth of selected gastric cancer cell lines
7.3 Identification of a Gene Transcriptomic Network Associated with ATRA Sensitivity:
7.4 Effects of ATRA on gene-expression in gastric-cancer cell-lines:
7.4.1 Transcriptomic Characterization and Clustering of Gastric-Cancer Cell-Lines:
7.4.2 Differential Expression Analysis and Identification of the Effect of Retinoic Acid Treatment:
7.5 Definition of Genes Commonly Modulated by ATRA in G-INT and G-DIFF Cell Lines:
7.5.1 G-INT Gastric-Cancer Cell-Lines:
7.5.2 G-Diffuse Gastric-Cancer Cell-Lines:
7.5.3 ATRA-Modulated Gene Networks in Sensitive G-INT and G-DIFF Gastric-Cancer Cell-Lines:
7.5.4 Epigenetic factors modulated by ATRA in G-INT and G-DIFF gastric-cancer cell-lines:
7.6 Definition of the Effects of ATRA on Antigen-Presentation Processes in GC Cell Lines:
7.7 In-Vivo Validation through G-DIFF and G-INT Gastric Cancer Xenograft Models:
7.8 Validation of the Results using Tissue Cultures of Primary Tissue Specimens:
7.8.1 Gastric Cancer Patients Analysis from Tissue Cultures of Primary Tissue:
7.8.2 Gastric Cancer Patients Analysis from TCGA:
DISCUSSION
8 Comprehensive Discussion of the Study Outcomes
8.1 ATRA-Induced IFN-Dependent Immune Responses and Viral Mimicry in Gastric Cancer
8.2 Role of IRF1 and DHRS3 in the growth inhibitory action of ATRA:
9 Future Perspectives
CONCLUSION
ACKNOWLEDGEMENTS
BIBLIOGRAPHY

ABSTRACT

Background: Gastric cancer is a heterogeneous neoplastic disease that lacks appropriate therapeutic options.
There is an urgent need for the development of innovative pharmacological strategies, particularly in consideration of the potential stratified/personalized treatment of this tumour. All-trans retinoic acid (ATRA) is one of the active metabolites of vitamin A. This natural compound is the first example of a clinically approved cyto-differentiating agent and is used to treat acute promyelocytic leukaemia. ATRA may also have significant therapeutic potential in the context of solid tumors, including gastric cancer. The present study uses high-throughput approaches to provide pre-clinical evidence supporting the use of ATRA in the treatment of gastric cancer.

Methods: The anti-proliferative action of ATRA was evaluated in 27 gastric-cancer cell lines and in tissue-slice cultures from 13 gastric-cancer patients. RNA-sequencing studies were conducted on both cell lines and patient samples exposed to ATRA. These data and the gastric-cancer RNA-sequencing data of the TCGA/CCLE datasets were used to conduct multiple computational analyses.

Results: Profiling of the large panel of gastric-cancer cell lines for their quantitative response to the anti-proliferative effects of ATRA indicates that approximately half of the cell lines are sensitive to the retinoid. The constitutive transcriptomic profiles of these cell lines permitted the construction of a model consisting of 42 genes whose expression correlates with ATRA sensitivity. The model predicts that 45% of the TCGA gastric cancers are sensitive to ATRA. RNA-sequencing studies conducted on retinoid-treated gastric-cancer cell lines provide insights into the gene networks underlying the anti-tumor activity of ATRA.

Conclusions: ATRA is endowed with significant therapeutic potential in the stratified/personalized treatment of gastric cancer. The data represent the foundation for the design of clinical trials focusing on the use of ATRA in the personalized treatment of this heterogeneous tumour.
The gene-expression model will enable the development of a predictive tool for selecting ATRA-sensitive gastric-cancer patients.

As a Bioinformatics Engineer with an informatics background, I conducted only the computational and bioinformatics analyses related to the results presented in this thesis. All wet-lab and experimental work, which is minimal and included only to support the bioinformatics results, was performed by my laboratory colleagues, as indicated in the results and methodologies sections. Therefore, this thesis is primarily focused on bioinformatics-based analyses and findings.

INTRODUCTION

1 Gastric Cancer Background: Classification and Clinical Approaches

This chapter provides a concise description of the classification systems for stomach cancer and of the clinical approaches to its management and treatment.

1.1 Introduction to Gastric Carcinoma:

1.1.1 Global Incidence and Epidemiology

Gastric cancer (GC) is an important health issue, as it causes approximately 700,000 deaths worldwide every year (Sung et al., «Global Cancer Statistics 2020»). This high mortality highlights the need to develop novel and effective options for the prevention as well as the treatment of this disease. The frequency of stomach cancer varies in different parts of the world, being highest in Asia, Africa, South America, and Eastern Europe [2]. These areas of the world are characterized by distinct epidemiological patterns, which are likely to result from specific interactions among genetic, environmental and lifestyle factors. In contrast, Western countries show a persistent decline in GC incidence. It is possible that this decline is the result of better dietary habits, improved food storage, and an improved awareness of the associated risk factors [1]. In these countries, however, there is an increase in tumors affecting the proximal region of the stomach.
The increase in gastric cancers localizing to the proximal part of the organ may be due to a higher prevalence of risk factors, including obesity and gastroesophageal reflux. At the global, regional and local levels, the incidence of stomach cancer varied remarkably in most countries between 1990 and 2017 [3]. Nevertheless, the worldwide number of cases and deaths continues to rise, notwithstanding the declining trends observed in certain areas. This rise in the number of stomach cancer cases may be ascribed to population growth and aging [1].

1.1.2 Pathogenesis and Risk Factors

The pathogenesis of gastric cancer is complex and multifactorial, as it involves a combination of lifestyle, environmental factors and genetic predisposition. Infection with Helicobacter pylori, a bacterium that colonizes the stomach lining and causes chronic inflammation, has been identified as one of the major risk factors for gastric cancer. The strong association between Helicobacter pylori infection and stomach cancer led the World Health Organization to classify this bacterium as a group I carcinogen [4]. Dietary habits also play a fundamental role in the genesis of stomach cancer. High-sodium diets as well as the consumption of smoked and processed meat increase the risk of developing this type of tumor. Indeed, these dietary habits cause chronic irritation and inflammation of the stomach lining, two processes that enhance the growth of cancerous cells [5]. Smoking is another predisposing factor for stomach cancer: tobacco contains carcinogenic chemicals, which may damage the stomach lining and result in the development of cancer. Heredity is another factor controlling the incidence of stomach cancer. Indeed, there are groups of individuals who have a higher risk of stomach cancer, as the disease clusters in relatives, suggesting a hereditary predisposition.
These subgroups are associated with distinct risk factors and have diverse clinical outcomes, which accentuates the significance of genetics in the pathogenesis of certain subtypes of gastric cancer [5].

1.1.3 Diagnostic and Screening Advances

Advances in the diagnosis and screening of stomach cancer have enabled the early detection and control of this tumor type, making a substantial difference in terms of patient outcomes. Endoscopic procedures have revolutionized the detection of gastric cancer. Indeed, current endoscopic methods permit the examination of the stomach lining and the detection of cancerous growths at an early stage. Combined with the development of novel biomarkers, these methods have greatly improved the detection of malignancies in their early stages, allowing for an earlier start of treatment and thus a greater chance of a successful cure [6].

The detailed characterization and molecular profiling of stomach tumors have facilitated advances in diagnostic procedures aimed at targeted therapy. Molecular profiling allows the clinician to identify different molecular subtypes of gastric cancer, favoring the personalized treatment of this tumor. This has been very useful in identifying individuals likely to respond to specific treatments and in avoiding redundant or ineffective medication [7]. Genomic and transcriptomic data provide information regarding the alterations of the genome and transcriptome observed in gastric cancer cells. This provides clues as to the possible causes of the disease and supports the development of prospective therapeutic strategies.

1.2 Molecular and Pathological Classification:

1.2.1 Lauren’s Classification System

Introduced in 1965, the Lauren classification remains an important tool for distinguishing and classifying stomach tumors, and it has been in use for decades. According to this classification, two histological subtypes of gastric cancer are distinguishable, i.e. “Intestinal” and “Diffuse” [8].
Intestinal stomach tumors show a well-differentiated histological phenotype and are associated with environmental risk factors. The incidence of Intestinal-type gastric cancer is particularly high in specific geographical areas, suggesting a link with regional risk factors [9]. In contrast, the Diffuse type of gastric cancer is characterized by a poorly differentiated histological phenotype, is widespread all over the world, and is more likely to be due to genetic causes. Diffuse gastric cancer is less influenced by environmental factors than its Intestinal counterpart and shows a stronger hereditary predisposition [10]. The Lauren classification has been expanded to include molecular markers, such as HER2, which is a key factor in the personalized treatment of gastric cancer [11].

Due to its simplicity and long history in the medical field, the Lauren classification is the system most frequently used in the clinic. In spite of significant limitations, such as the inability to classify mixed-type malignancies, it is still broadly applied in stomach cancer research and therapeutic practice. As it becomes increasingly hinged on molecular traits [12], the Lauren classification is expected to provide crucial insights into the prediction and treatment of gastric cancer, especially in the context of the personalized treatment of this tumor.

1.2.2 WHO Classification and Subtypes

The World Health Organization (WHO) classification includes most of the known histological subtypes of stomach carcinoma, such as the papillary, tubular, signet-ring and mucinous forms of this tumor. This classification is known for its comprehensive definition of the different gastric tumor types, regardless of their incidence, and it gives a valuable account of the different histological aspects of gastric cancer [13].
Therefore, the WHO classification is an effective tool in clinical practice and research studies, as it permits broad comparisons among different studies and assists in identifying patient subgroups with distinct clinical characteristics or outcomes. Classifying gastric tumors according to their histological phenotype permits a better understanding of the biological behavior and prognosis that accompany each histologic subtype. This accurate classification of stomach tumors helps in directing therapy choices and in predicting patient outcomes [14], and it supports both therapeutic and diagnostic planning. The WHO classification not only represents the framework to classify tumors of the stomach on the basis of histological markers, but it also represents a useful platform for the integration of molecular and genomic data with diagnostic and therapeutic procedures. With personalized medicine becoming increasingly popular in cancer care, where treatments are tailored to the individual genetic properties of each tumor [14], this integration is bound to increase in importance.

1.2.3 Molecular Subtypes and Genomic Characterization

Recent advances in molecular biology have led to the identification of several subtypes of gastric cancer characterized by peculiar genomic and transcriptomic patterns. These subtypes provide more information on tumor biology and are likely to modify the prognosis and the therapeutic strategies of gastric cancer [15]. The Cancer Genome Atlas (TCGA) project was decisive in the classification of gastric tumors into one of four major molecular subtypes: Epstein-Barr virus (EBV)-positive, microsatellite instability (MSI), genomically stable (GS), and chromosomal instability (CIN). EBV-positive tumors are characterized by the presence of the Epstein-Barr virus in cancer cells.
Conversely, the MSI subtype is characterized by a high degree of genetic alterations, which is due to abnormalities in the DNA repair pathways. The GS subtype is common in younger patients and exhibits fewer genetic abnormalities relative to the other subtypes. Finally, the CIN subtype is characterized by chromosomal abnormalities and is associated with a poor prognosis [16]. These molecular subgroups are defined by specific genetic alterations and potentially actionable targets, which has important implications in terms of prognosis and treatment. MSI tumors may be more responsive to immunotherapy, since they display a high mutational burden. Conversely, EBV-positive tumors may be more susceptible to targeted therapies [17].

1.3 Clinical Management and Treatment Strategies:

1.3.1 Surgical and Non-Surgical Approaches

The personalized treatment of patients suffering from a heterogeneous type of tumor, like gastric cancer, requires combinations of surgical and non-surgical procedures. In the early stages of this disease, surgical removal of the tumor is the sole therapeutic option available. The extent and location of the tumor determine the surgical approach to implement. This may involve endoscopic mucosal resection, distal esophagectomy, partial/total gastrectomy, or combinations of these methods [18]. In the United States [19], laparoscopic-assisted gastrectomy, a minimally invasive and fast procedure, is another surgical approach, which has led to a marked decline in treatment-related complications in elderly patients. Depending on the patient's health status, the presence of specific biomarkers and the stage of the gastric cancer, non-surgical methods, including radiotherapy, chemotherapy and targeted therapy, may be used too. These types of treatment acquire further importance if tumor shrinkage is the main therapeutic objective or if tumor progression prevents surgical removal [20].
The cancer staging system of the eighth edition of the American Joint Committee on Cancer (AJCC; 2017) provides much-needed guidance in the diagnosis and treatment of stomach cancer. The use of this staging approach enables a more accurate and comprehensive assessment of tumor progression, thus facilitating the implementation of a precise treatment strategy [21].

1.3.2 Chemotherapy and Targeted Treatment Options

As further detailed in section 1.4.1, TP53, ARID1A and HER2 mutations are associated with gastric cancer prognosis. Because of their correlation with treatment efficacy, these genetic variants play a crucial role in both tumor behavior and therapy [22]. The use of targeted pharmacological agents in the treatment of stomach cancer is also on the rise. In this context, HER2 positivity [23] is one of the molecular markers which correlate with increased overall survival in patients treated with trastuzumab, ramucirumab, and pembrolizumab. Since targeted therapies are not always successful and are often provided in conjunction with conventional chemotherapy regimens, stomach cancer treatment requires further and novel therapeutic approaches. In locally advanced disease, the National Comprehensive Cancer Network (NCCN) guidelines support chemo-radiation or perioperative chemotherapy before surgery. Patients with advanced gastric cancer show increased survival rates following loco-regional treatments, which reduce tumor size and improve surgical outcomes [24].

1.3.3 Emerging Therapies and Clinical Trials

With the introduction of novel treatment strategies, gastric cancer therapy is advancing and focusing on patient evaluation. The molecular characterization of stomach cancers has uncovered novel markers and potential therapeutic targets that may improve the prognosis of this deadly illness [25].
The major goal of these molecular studies is to develop novel and effective therapeutic agents, such as immune-therapeutics, cell-structure remodeling compounds and receptor tyrosine-kinase inhibitors. By focusing on specific pathways involved in the growth and metastatic behavior of the neoplastic cell, the development of these novel therapeutics is likely to result in effective and tailored treatments of gastric cancer [26]. The design of specific clinical trials is required to support the effectiveness of these new medications in treating stomach cancer and overcoming resistance to chemotherapeutics. In fact, new treatment guidelines and the incorporation of novel therapeutic approaches into clinical practice cannot rely solely on the results obtained in pre-clinical studies. Indeed, the results of clinical trials are sometimes not in line with what is observed at the pre-clinical level. For instance, both HER2-positive and HER2-negative gastric cancer patients have been shown to respond to immune-therapies based on pembrolizumab and nivolumab [27]. In conclusion, pre-clinical research and clinical trials have played and will continue to play a key role in increasing our knowledge of stomach cancer and in developing improved approaches to the personalized treatment of this neoplastic disease. Overall, there is a general recognition that the new therapeutic approaches under development are likely to revolutionize the field of stomach cancer treatment.

1.4 Prognostic Factors and Biomarkers:

1.4.1 Role of Genetic Markers in Prognosis

The prognosis of gastric cancer is influenced by the high variability of specific genetic markers. Some of these markers are associated with a more aggressive disease trajectory, while others correlate with better therapeutic responses. For instance, mutations of the TP53, ARID1A and HER2 genes are among the genetic alterations linked to gastric cancer prognosis.
In this context, it is of particular importance to understand which genetic alterations affect tumor progression and influence prognosis as well as sensitivity/resistance to therapeutic agents [6]. Additional predictive factors include the DNA methylation pattern and the MSI status of cancer cells. The importance of genomic profiling in driving treatment decisions is underscored by the fact that patients with high MSI often have a better prognosis and respond better to immunotherapy [28]. Furthermore, new studies indicate that the expression profiles of microRNAs and long non-coding RNAs represent novel and accurate prognostic indicators. Indeed, these non-coding RNAs are not only markers of tumor development and patient outcomes, but they also regulate gene expression [29].

1.4.2 Tumor Markers and Predictive Value

In clinical practice, tumor markers such as carbohydrate antigen 19-9 (CA 19-9) and carcinoembryonic antigen (CEA) are commonly employed to monitor the progression of gastric cancer and the efficacy of anti-tumor treatments. In addition to stomach cancer, these markers are used to monitor pancreatic and colorectal carcinomas. These indicators increase in predictive power when combined with emerging molecular biomarkers [30]. Recently, new serum markers for the early detection and monitoring of stomach cancer have been identified, including specific microRNAs. Because they regulate gene expression and are detectable in the initial stages of tumor development, these microRNAs show promise for the detection of early-stage cancer [31]. In addition, HER2 and PD-L1, which emerged from genomic profiling studies, are two promising indicators for guiding targeted therapy. These indicators inform the choice of specific treatments, such as trastuzumab for HER2-positive malignancies and immunotherapies for PD-L1-positive tumors [7].
2 All-Trans Retinoic Acid (ATRA):

The second chapter of the introduction focuses on the diverse significance of All-Trans Retinoic Acid (ATRA) in the field of cellular biology and therapeutic applications.

2.1 General Overview and Properties of ATRA:

All-trans Retinoic Acid (ATRA) is a major metabolite of vitamin A and it plays an essential role in many biological processes. Vitamin A, a fat-soluble vitamin, is important for human health since it supports vision, immune system function, and cellular homeostasis. As a vitamin A derivative with a distinct chemical structure, ATRA mediates and extends several of these functions [32]. The activity of ATRA is determined by its molecular structure, which includes a β-ionone ring and a polyunsaturated side chain terminating in a carboxylic acid group. Due to this structure, ATRA binds specific nuclear receptors in the organism. These receptors are members of a large family of transcription factors that regulate gene expression. When ATRA interacts with these receptors, it regulates the transcription of genes involved in various physiological processes [33]. Vitamin A derivatives are also vital for vision: they are essential for retinal health, especially for the production of rhodopsin, a pigment necessary for low-light vision. ATRA participates in cellular metabolic processes, including the regulation of lipid metabolism and energy balance [34]. Furthermore, ATRA is involved in cell proliferation and differentiation. This is especially evident in the context of skin health and epithelial tissue growth. ATRA contributes to the clearance of damaged or undesirable cells via apoptosis (programmed cell death), hence maintaining cell health and homeostasis [35]. The effects of ATRA also extend to embryonic development, where it is critical to organ and tissue formation. Its regulatory function in gene expression throughout development is still being studied [36].
2.2 Metabolism of ATRA and Cellular Uptake:

2.2.1 Synthesis from Vitamin A

All-trans retinoic acid (ATRA) is formed from vitamin A through a sequence of metabolic steps. Retinol dehydrogenase catalyzes the conversion of retinol (vitamin A alcohol) into retinal (vitamin A aldehyde). Retinal is then oxidized to ATRA, a step that is essential in converting vitamin A into a bioactive form that the body can use. The conversion of vitamin A into ATRA is necessary for the action of molecules that regulate gene expression and it influences a number of physiological processes [32].

2.2.2 Binding to Cellular Retinoic Acid-Binding Proteins (CRABPs)

After its generation, ATRA binds cellular proteins called cellular retinoic acid-binding proteins (CRABPs). These proteins modulate the intracellular levels and action of ATRA. CRABPs control the level of ATRA in cells and its access to nuclear receptors, and hence its ability to regulate the expression of certain genes. The interaction of ATRA with CRABPs thus governs two aspects of the biological activity of ATRA, its intracellular level and its delivery to nuclear receptors, and it is relevant to cell differentiation and development, as well as to other physiological activities [37].

2.2.3 Oxidative Catabolism to Inactive Metabolites

ATRA metabolism is complex and involves oxidative catabolism to inactive metabolites. This catabolic mechanism is critical for maintaining physiologically adequate amounts of ATRA in the body. The inactivation of ATRA ensures that its activity is closely controlled, limiting excessive accumulation and potential toxicity. The modulation of ATRA levels is critical for the proper control of a variety of biological processes, such as cell proliferation, differentiation, and apoptosis [38].

2.3 Mechanisms of Action in Cellular Processes:

2.3.1 Interaction with Nuclear Retinoid Receptors

All-trans retinoic acid (ATRA) acts mainly through the interaction with nuclear retinoid receptors known as retinoic acid receptors (RARs) and retinoid X receptors (RXRs).
Three kinds of RARs (RARα, RARβ, and RARγ) and RXRs (RXRα, RXRβ, and RXRγ) are known, and they are products of distinct genes. ATRA relies on these receptors to control gene expression. They play essential roles in the regulation of genes involved in cell development and differentiation, as well as apoptosis. The binding of ATRA to these two types of receptors triggers a sequence of events that modulates the transcriptional activity of the selected target genes. This modulation controls the action of ATRA on cellular growth and on programmed cell death (apoptosis). For ATRA to exert its function in cells, the compound must interact with RARs and RXRs. Additionally, RAR/RXR binding is crucial for the therapeutic applications of ATRA in cancer therapy and developmental biology [33].

2.3.2 Regulation of Gene Expression via RARs

The interaction of ATRA with RARs is central to the control of gene activity. When activated by ATRA, these receptors bind specifically to DNA sequences known as retinoic acid response elements (RAREs). Such binding initiates changes in gene transcription, a vital ATRA-mediated biological regulatory mechanism. The ability of ATRA to regulate gene expression via RARs underscores its fundamental roles in cell differentiation and death. Such processes are required for normal developmental physiology and exert profound effects on pathologies such as cancer, where the regulation of cell proliferation and death is of vital importance [39].

2.3.3 Involvement in Non-Genomic Signaling Pathways

In addition to its genomic activities, ATRA acts via non-genomic signaling pathways. Several specific signaling pathways, such as the MAPK (Mitogen-Activated Protein Kinase) and PKA (Protein Kinase A) pathways, have been implicated. This connection shows the range of ATRA activities, which goes beyond direct genomic pathways and involves a wider spectrum of cellular processes.
These non-genomic pathways are important for the effect of ATRA on cell signaling, which involves activities such as cell proliferation and survival. The involvement of ATRA in these molecular pathways supports the broad function of this compound in cellular biology, and it provides evidence of its role in several therapeutic contexts [40].

2.4 Role in Cellular Differentiation and Development:

2.4.1 Therapeutic Application in Acute Promyelocytic Leukemia (APL)

ATRA has been used in the treatment of a subtype of acute myeloid leukemia known as acute promyelocytic leukemia (APL). In APL, the administration of ATRA induces differentiation of the leukemic cell. This differentiation is followed by disease remission; hence, differentiation induction represents a landmark in the treatment of the disease [41]. In fact, the therapeutic action underlying this differentiating activity is the ability of ATRA to regulate the expression of genes involved in the development of hematopoietic cells. By affecting these genes, ATRA induces the differentiation and maturation of leukemic cells, limiting neoplastic cell replication, which results in clinical remission. This discovery has proved to be a significant development in cancer treatment, providing the first example of targeted therapy in the oncology field [42].

2.4.2 Impact on Embryonic Development

Besides its use in cancer therapy, ATRA plays an important role in the process of embryonic development. In this respect, ATRA controls and activates genes that are responsible for organ formation and tissue development. Indeed, ATRA-dependent regulation of cell differentiation contributes to embryonic development by supporting normal organ and tissue growth/maturation. This regulatory function is of great importance in normal development as well as in preventing the occurrence of developmental disorders. The latter function is entirely expected given the role that ATRA plays in embryonic development. This point emphasizes the importance of ATRA in developmental biology and the possible contribution of the retinoid to the design of therapeutic approaches to developmental disorders [43].

2.4.3 Extending Beyond Hematopoietic Cells

It is known that the regulatory action of ATRA extends well beyond the differentiation of hematopoietic cells. The regulatory effects exerted by the retinoid have been observed in many cell types and tissues, suggesting a broad action in cellular differentiation and development. The broad impact of ATRA on diverse cell types demonstrates its adaptability as a regulatory molecule in biological processes. Given its potential to target many pathways and different cell types, ATRA is an appealing compound not only in the treatment of leukemia but also in other applications such as tissue engineering, regenerative and developmental medicine [44].

2.5 ATRA Role in Solid Tumors:

2.5.1 Overview of ATRA in Solid Tumor Therapy

ATRA has been widely studied because of its importance in cell differentiation and death, as well as its therapeutic potential in solid tumors. Though ATRA is efficacious in acute promyelocytic leukemia (APL), its actions on solid tumors remain intricate and unclear. ATRA, either alone or combined with other therapeutic agents, has been studied in the treatment of solid cancers. However, ATRA-based therapies are not necessarily associated with strong anti-tumor responses. Hence, a larger number of studies is necessary for a better application of ATRA in the treatment of solid tumors [36].

2.5.2 Mechanistic Insights into the Action of ATRA in Solid Tumors

ATRA is known to exert tumor-suppressive effects in epithelial tumor cells, where regulation of RARβ2 expression through RARα takes place. Regulation of RARβ2 expression by ATRA is of extreme importance to direct malignant cells towards differentiation or self-destruction. Nevertheless, loss or repression of RARβ2 is a common event in a wide range of solid tumors, including head and neck, breast, lung, pancreatic, prostate, and cervical malignancies.
In these types of cancers, RARβ2 inactivation is the major mechanism of ATRA resistance. Increased corepressor activity and decreased coactivator activity, due to inadequate ATRA signaling and epigenetic modifications of the RARβ2 gene, are known to cause resistance to the retinoid [36].

2.5.3 Clinical Trials and Therapeutic Combinations

ATRA clinical trials in solid tumors, such as lung, breast, and cervical cancer, have shown mixed results. Though in vitro and in vivo studies support an anti-tumor action of ATRA, definite therapeutic benefits do not emerge from the clinical trials performed so far [41]. This gap underlines the need for more research aimed at improving the therapeutic efficacy of ATRA-based treatments through the design of innovative studies in the field of solid tumors. In some instances, especially in advanced non-small cell lung cancer, combined treatment with ATRA and other drugs such as paclitaxel and cisplatin resulted in better response rates as well as progression-free survival [36].

2.5.4 ATRA and Resistance Mechanisms in Solid Tumors

In solid tumors, ATRA resistance represents a major problem. In fact, changes that alter the balance between ligand-induced co-repressor release and co-activator recruitment modify the anti-tumor efficacy of ATRA. In addition, overexpression of some genes, genetic mutations, as well as epigenetic modifications, play a role in ATRA resistance. Understanding these pathways is critical for creating more effective ATRA-based treatments of solid tumors [36].

2.6 Therapeutic Implications and Limitations of ATRA:

2.6.1 Clinical Use Limitations

One of the major limitations to a broader therapeutic use of ATRA is the short half-life of the compound in the human body. In fact, the short half-life and the rapid degradation of ATRA may be the basis for the inefficacy of the retinoid in the treatment of solid tumors.
This is the result of the rapid metabolism and elimination of ATRA from the body, which makes frequent dosing or high dosages of the retinoid necessary and, in turn, increases the risk of side effects. Addressing these limitations would be essential in broadening the medical applications of ATRA [1,33].

2.6.2 Research Efforts and Development of ATRA Analogues

The latest research efforts are aimed at increasing the half-life of ATRA and at designing combinations with other therapeutic drugs to augment the anti-cancer efficacy of the compound. A higher efficacy of ATRA in cancer treatment may be obtained by increasing its bioavailability and anti-tumor activity through various approaches, including improved formulations and delivery methods as well as combined therapies. In addition, the use of ATRA analogs is reported to give positive outcomes, as it reduces ATRA-associated toxicities. The pharmacokinetic advantages of these analogs increase drug efficiency and reduce adverse effects, enhancing the therapeutic potential of retinoids in cancer therapy [45].

3 Public Data Retrieval and Databases

In recent years, the use of publicly accessible databases has become of fundamental importance in the field of biological research. A concrete example is the study of complex pathologies, such as gastric cancer. "The Cancer Cell Line Encyclopedia" (CCLE) and "The Cancer Genome Atlas" (TCGA) are two of the most widely used public databases in the field of cancer research. Both platforms provide the same type of data in different settings: TCGA gives integrated genomic, transcriptomic, and epigenomic information from patient samples, while CCLE provides the same for cancer cell lines. With the molecular characterization of more than 20,000 primary tumors and matched normal samples spanning 33 distinct cancer types, the TCGA database stands as a significant tool in cancer genomics.
Researchers can benefit from the abundance of comparable data offered by this resource to investigate the molecular basis of cancer across patient populations and to find novel therapy targets. Because TCGA is focused on patient samples, it describes in detail the types of genetic alterations and molecular subtypes found in stomach cancer. Conversely, the CCLE database includes data derived from cancer cell lines. Among these are pharmacological profiles, gene expression profiles, and genomic data from hundreds of cancer cell lines. For instance, studies using gastric cancer cell lines from the CCLE provide a more controlled setting for assessing the molecular pathways of the disease and testing new drugs. Combining the information from both sources allows a better comprehension of the stomach cancer landscape. By approaching the pathology from every point of view, from the patient level to the cellular and molecular causes, it is possible to develop improved and more effective therapies.

3.1 TCGA (The Cancer Genome Atlas) Database:

Currently, "The Cancer Genome Atlas" (TCGA) is one of the most significant advancements in the field of cancer genomics [46]. The National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) jointly launched the program in 2006. The project molecularly characterized over 20,000 primary cancer samples and matched normal samples spanning 33 distinct cancer types. Gaining a thorough understanding of cancer biology is central to TCGA's efforts to improve cancer diagnosis, therapy, and prevention. Over the following years, TCGA collected 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. This huge data collection is a turning point in cancer research, since it has advanced our knowledge of the different pathologies and it has made the data widely accessible to researchers anywhere in the world.
3.1.1 TCGA's Contributions and Resources

Following the launch of TCGA, both the therapeutic treatment of cancer patients and the field's understanding of the biology of cancer underwent a significant transformation. The Pan-Cancer Atlas, a cross-cancer resource published in 2018, is one of its major achievements [47]. It addresses broad, general themes in cancer research, including signaling pathways, oncogenic processes, and cell-of-origin patterns. In addition, TCGA has developed a suite of computational tools for data processing and visualization, enabling scientists to investigate extensive and detailed datasets from different perspectives. The Genomic Data Commons Data Portal is an informative resource that offers web-based exploration and visualization capabilities in addition to TCGA data access, improving the value of the dataset for a large variety of academic purposes.

3.1.2 Methodology and Selection in TCGA

In order to represent the biological heterogeneity of cancer, 33 cancer types were selected for molecular characterization in the TCGA database [48]. The selection criteria and procedures were chosen to emphasize a global and comprehensive approach to cancer genomics. In this regard, the data stored on the TCGA website were obtained using multi-omics platforms and technologies (genomics, transcriptomics, proteomics, etc.), in order to produce a complete and exhaustive characterization, which is documented together with the resources and methods used. The TCGA timeline and milestones, from the start of the program to its most promising results, demonstrate how the use of public and accessible data for the study of the molecular biology of cancer has become increasingly dominant.
3.1.3 Gastric Cancer in The Cancer Genome Atlas (TCGA)

The Cancer Genome Atlas (TCGA) has played a key role in expanding the general knowledge of gastric cancer, an important and complex form of cancer. Gastric cancer is the subject of extensive research in the TCGA dataset, providing insights into its molecular and genetic origins. In this regard, a large number of gastric cancer samples have been characterized as part of TCGA's comprehensive approach to the disease, providing a precise map of the genomic changes and molecular subtypes linked to gastric cancer. One of the crucial advances achieved thanks to the classification conducted by TCGA was the discovery of distinct molecular subtypes of gastric cancer [16]. This classification made it possible to understand the variability of gastric cancer, which also has implications for individualized treatment plans. TCGA recognizes 4 main subtypes of stomach cancer:

1. Epstein-Barr Virus (EBV)-Positive: This tumor subtype is identified by the presence of EBV and it is characterized by DNA hypermethylation and amplification of the PD-L1 and PD-L2 genes.

2. Microsatellite Instable (MSI): This tumor subtype presents a high rate of mutations due to defects in the DNA mismatch repair machinery. The high mutation rate results in the production of neo-antigens, which the immune system does not recognize as self, increasing sensitivity to immunotherapy.

3. Genomically Stable (GS): This subtype is enriched for the "Diffuse" histological type of gastric cancer. It is characterized by a smaller number of mutations and it presents specific alterations in genes such as RHOA and genes involved in cell adhesion pathways.

4. Chromosomal Instability (CIN): This subtype represents the predominant form of gastric carcinoma and is characterized by a high frequency of chromosomal amplifications and deletions involving a large number of genes linked to cell cycle regulation.

The development of targeted therapies and personalized medicine is significantly influenced by the molecular characterization of gastric cancer obtained by TCGA. Prognosis and response to treatment can be predicted by understanding the numerous molecular subtypes. For example, patients in the EBV-positive subtype may benefit from targeted treatments, while those in the MSI subtype may benefit from immunotherapies. Additionally, the TCGA dataset is a valuable tool for further studies, allowing researchers to better understand the molecular causes of stomach cancer and to create more powerful treatment plans and strategies.

3.2 CCLE (Cancer Cell Line Encyclopedia) Database:

In the field of cancer research, the Cancer Cell Line Encyclopedia (CCLE) is a precious resource that provides a large collection of genomic and pharmacological information. This section covers in detail the organization, development and extensive collection of data available in the CCLE dataset. Along with other collaborators, the Broad Institute and the Novartis Institutes for Biomedical Research led the development of the CCLE database.

3.2.1 Data Collection and Development

The objective of the CCLE project was to perform genomic and molecular characterizations of a broad range of cancer cell lines, given the need to develop targeted therapies and understand the molecular heterogeneity of cancer. This extensive database was developed mainly through a collaboration between Novartis and the Broad Institute, with contributions from other smaller partners.
With its all-encompassing methodology, the CCLE project seeks to provide a comprehensive genetic and molecular map of a wide range of tumor cell lines, thereby promoting a better understanding of cancer biology and supporting the creation of more potent therapeutic approaches [49]. To achieve this result, it was necessary to combine the efforts of experts in genomics, pharmacology, genetics and bioinformatics, whose experience and joint work contributed to broadening the area of oncology research. A broad range of biological data is included in the CCLE database, including transcriptomic analyses, pharmacological profiles of more than 1,000 tumor cell lines, and genomic data [50]. The latter set of data includes the complete profile of mutations, copy number variations and gene expression. In particular, the transcriptomic data in the CCLE, which describe the gene expression patterns of the different cell lines, have contributed most to improving our knowledge of cancer biology. Furthermore, CCLE has proven essential in identifying new oncogenic factors and possible therapeutic targets for distinct types of carcinomas. Indeed, as part of the genomic characterization of CCLE, over 1,650 genes have been sequenced, providing detailed knowledge of the genomic alterations observed in tumor cell lines. The discovery of new anti-cancer therapies has benefited greatly from this large data set, which has been essential in understanding the genomic basis of cancer. A further contribution, the addition of DNA methylation data from all CCLE cell lines to the database, has provided a more concrete knowledge of the epigenetic changes of tumors [50]. The discovery of biomarkers and the advancement of personalized medicine techniques in the field of cancer depend on this level of in-depth analysis.
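The data layers listed above can be pictured as one record per cell line. The following is a minimal sketch only, assuming a simplified in-memory representation: the class `CellLineProfile`, the helper `select_lineage`, the field names and the toy values are all hypothetical and do not reflect the actual CCLE file formats or any portal API.

```python
from dataclasses import dataclass, field


@dataclass
class CellLineProfile:
    """Hypothetical container for the data layers available for one cell line."""
    name: str
    lineage: str  # tissue of origin, e.g. "stomach"
    expression: dict = field(default_factory=dict)     # gene symbol -> expression value
    mutations: set = field(default_factory=set)        # mutated gene symbols
    copy_number: dict = field(default_factory=dict)    # gene symbol -> copy-number value
    methylation: dict = field(default_factory=dict)    # locus -> methylation level
    drug_response: dict = field(default_factory=dict)  # compound -> sensitivity measure


def select_lineage(profiles, lineage, required_layers=("expression",)):
    """Keep the profiles of one lineage that have every required layer populated."""
    return [
        p for p in profiles
        if p.lineage.lower() == lineage.lower()
        and all(getattr(p, layer) for layer in required_layers)
    ]


# Toy records with made-up names and values, for illustration only.
panel = [
    CellLineProfile("LINE_A", "stomach", expression={"RARA": 5.2}, mutations={"TP53"}),
    CellLineProfile("LINE_B", "stomach"),  # no expression data available
    CellLineProfile("LINE_C", "lung", expression={"RARA": 3.1}),
]
gastric = select_lineage(panel, "stomach")
print([p.name for p in gastric])
```

Filtering on `required_layers` mirrors a common practical constraint: an analysis that combines several data types can only use the cell lines for which all of those types are available.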
3.2.2 CCLE Database Features and Accessibility

The Broad Institute's DepMap portal and the CCLE website provide researchers with access to the CCLE database through an intuitive user interface that makes navigation and data retrieval very easy. These platforms are not only gateways to data, but they also contain advanced tools for visualization, analysis and customized downloads. As such, these platforms meet the diverse needs of the scientific community. The power of CCLE lies in its ability to unify and integrate huge volumes of data, offering a single resource that incorporates comprehensive transcriptomic, pharmacological and genomic profiles. For scientists hoping to gain in-depth knowledge of cancer biology, this integration is critical, especially for therapeutic discovery and development and for understanding complex genomic landscapes [50]. The multifaceted perspective of the CCLE on cancer biology, encompassing genetic, molecular and pharmacological components, is further strengthened by its integration with platforms such as Expression Atlas, cBioPortal and REACTOME [50]. For this reason, the CCLE is considered a vital resource for cancer research, which, thanks to its high degree of integration and access to cutting-edge analytical techniques, enables breakthrough discoveries and innovations in the discipline.

MATERIALS & METHODS

4 Outline of Research Objectives:

The primary purpose of the PhD thesis is to provide pre-clinical data on the potential use of ATRA in the personalized treatment of Gastric Cancer. In addition, the project aims to define the molecular mechanisms and gene-networks underlying the expected anti-tumor activity exerted by ATRA in specific subgroups of Gastric Cancer using non-oriented gene-expression approaches based on RNA-sequencing.
A further goal is to develop a novel diagnostic tool to be used for the selection of Gastric Cancer patients who may benefit from ATRA-based therapies. A final and long-term goal of the project is to develop rational therapeutic combinations between ATRA and compounds targeting specific components of the gene-networks identified in the previous points.

4.1 Specific Aim 1:

The ATRA sensitivity of Gastric Cancer cell lines and the associated genomic profiles reflect the fact that Gastric Cancer is a relatively heterogeneous disease that can be classified into different groups [51]. Gastric Cancer heterogeneity is partially recapitulated by immortalized cell lines. The first goal of Aim 1 is to carry out RNA-Sequencing analyses to evaluate the transcriptomic profiles of our panel of Gastric Cancer cell lines and the effects of ATRA on gene transcription. A second goal is to classify our panel of Gastric Cancer cell lines according to their molecular profiles, in order to identify different sub-groups and individually assess their ATRA sensitivity.

4.1.1 Experimental Design – Aim 1

The mRNA extracted from 15 Gastric Cancer cell lines available in our laboratory is used to perform RNA-Sequencing studies on an Illumina NextSeq 500 instrument. The samples are treated with vehicle or ATRA, in order to evaluate, by differential analysis, the transcriptional profile resulting from treatment with ATRA. To cluster the cell lines and identify the subgroups according to the molecular profile, the methods described in the review by Wang et al. [17] are taken into consideration. Among these, the most suitable is the one available in Tan et al. [52], as it uses different unsupervised and unbiased clustering techniques.

4.2 Specific Aim 2:

Using bioinformatics methodologies and our panel of Gastric Cancer cell lines, the goal is to develop a new gene-expression model capable of predicting ATRA sensitivity in stomach tumors.
The long-term goal is to identify a minimal gene-expression signature to be used as a diagnostic tool for the selection of Gastric Cancer patients who may benefit from ATRA-based treatments.

4.2.1 Experimental Design – Aim 2

Our full panel of 27 Gastric Cancer cell lines is larger than the one used for the transcriptomic analyses (15 Gastric Cancer cell lines). Each of the 27 cell lines is evaluated for its response to the anti-proliferative action of ATRA, using a score reflecting the in vitro sensitivity to ATRA (AUC/ATRA-score). The new gene-expression model is developed using the methods explained in the next chapter, which are based on bioinformatics procedures and computational techniques. To pursue this goal, the sensitivity profiles of the cell lines are combined with the transcriptomic expression data of the corresponding 27 Gastric Cancer cell lines retrieved from CCLE (Cancer Cell Line Encyclopedia). The new model predicting ATRA-sensitivity in a tumor-independent fashion is validated and optimized in our panel of Gastric Cancer cell lines using the associated basal gene-expression profiles and bio-computational approaches. The generation of a predictive model is necessary to guarantee specificity and optimal results in Gastric Cancer.

4.3 Specific Aim 3:

In vitro, short-term tissue cultures of immortalized Gastric Cancer cell lines are useful tools to study the anti-tumor activity of ATRA, as they often recapitulate the major biological characteristics of the tumors they derive from. However, the in vitro data obtained in cell lines must be confirmed in other models that more closely reflect real life. We developed/implemented a model based on short-term tissue-slice cultures [53], which permits the evaluation of the activity of pharmacological agents ex vivo on primary-tumor specimens. This model has been used to test the response of primary breast tumors to ATRA.
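As an illustration of the AUC-based sensitivity score mentioned in Section 4.2.1, the sketch below computes a normalized area under a dose-response curve with the trapezoid rule on log10-transformed concentrations. This is a generic example under stated assumptions, not the definition of the ATRA-score used in this thesis: the function name `normalized_dose_response_auc`, the concentrations and the viability values are all hypothetical.

```python
import math


def normalized_dose_response_auc(concentrations_um, viability):
    """Normalized area under a dose-response curve.

    concentrations_um: drug concentrations (micromolar, all > 0).
    viability: fraction of surviving cells relative to vehicle (1.0 = no effect).
    The area is computed with the trapezoid rule on log10(concentration) and
    divided by the width of the log-concentration range, so that a curve with
    no growth inhibition at any dose gives 1.0.
    """
    if len(concentrations_um) != len(viability) or len(concentrations_um) < 2:
        raise ValueError("need at least two matched (concentration, viability) points")
    points = sorted(zip(concentrations_um, viability))
    x = [math.log10(c) for c, _ in points]
    y = [v for _, v in points]
    if x[-1] == x[0]:
        raise ValueError("concentrations must span a non-zero range")
    area = sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(x) - 1))
    return area / (x[-1] - x[0])


# Made-up example: five doses from 0.001 to 10 micromolar.
doses = [0.001, 0.01, 0.1, 1.0, 10.0]
resistant = [1.0, 1.0, 1.0, 1.0, 1.0]  # no growth inhibition at any dose
sensitive = [1.0, 1.0, 0.5, 0.0, 0.0]  # strong growth inhibition at higher doses
print(normalized_dose_response_auc(doses, resistant))
print(normalized_dose_response_auc(doses, sensitive))
```

In this sketch a lower value corresponds to stronger growth inhibition. Working on the log scale keeps the evenly spaced dilution steps of a typical dose series equally weighted, which is a design choice of the example and not a statement about the thesis pipeline.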
The plan is to apply short-term tissue cultures to establish the anti-tumor activity of ATRA on samples derived from Gastric Cancer patients in terms of cell growth inhibition and apoptosis. With this model, it is possible to establish a correlation between the in vitro responses to ATRA and the anti-proliferative and apoptotic effects determined in tissue-slice cultures. For this purpose, RNA-sequencing experiments in tissue slices exposed to vehicle and ATRA are performed. Hence, this aim includes validation of the genomic profiles associated with ATRA sensitivity and determination of the genomic effects exerted by the retinoid in Gastric Cancer tissue slices. This will be done by applying the novel gene-expression model developed in Aim 2, which allows validation of the gene-networks modulated by ATRA. The data deriving from both Aim 2 and Aim 3 are likely to result in the identification of novel targets of pharmacological intervention in view of the development of ATRA-based therapeutic combinations.

4.3.1 Experimental Design – Aim 3

The plan is to use freshly isolated surgical samples obtained from several Gastric Cancer patients over the course of the project. Surgical samples must be processed within 24 hours of collection. The core samples are cut into tissue slices (200 µm thick) with a Krumdieck tissue slicer. Slices are incubated for 48-72 hours in an optimized culture medium containing the vehicle (DMSO) or ATRA (0.1-1.0 micromolar). At the end of the treatment, slices are fixed in formalin, embedded in paraffin and evaluated for:

a) Growth inhibition. To define this parameter, the quantitative expression of Ki67 (percentage of cells positive by immunohistochemistry) is determined by a pathologist. The goal is to look for significant reductions in Ki67 levels after treatment with ATRA. This analysis is blinded with respect to treatment. In addition, a number of RNAs coding for proliferation-associated genes are measured using PCR technologies.
b) Apoptosis. The potential pro-apoptotic action of ATRA is evaluated as described under point a) for Ki67, using an antibody targeting activated caspase-3, a biomarker associated with the early phases of apoptosis.

After this step, slices are used for RNA extraction. The goal is to perform oriented RNA-sequencing experiments in these tissue slices exposed to vehicle and ATRA. A major goal is to confirm the predictive power of the novel gene-expression model (Aim 2) in primary Gastric Cancer samples. In particular, the priority is to establish correlations between the computational model and the in vitro sensitivity of each tumor to the anti-proliferative and/or apoptotic actions of ATRA. As for the second goal of Aim 3, the focus is to determine ATRA-dependent perturbations of the gene expression profiles of Gastric Cancer and to identify genes differentially regulated by the retinoid, using standard computational analyses of the RNA-sequencing data.

4.4 Expected Outcomes, Risks and Innovation:

4.4.1 Expected Outcomes

The study will provide insights into the therapeutic potential of ATRA in specific groups of Gastric Cancers. In addition, a diagnostic tool to be used in the clinic for the selection of patients who may benefit from ATRA-based therapeutic strategies will be developed and validated. Finally, we will identify therapeutic targets for the design of innovative ATRA-based drug combinations.

4.4.2 Risk Analysis, possible problems and solutions

The aims of the proposed study are feasible, and no technical problems are foreseen regarding the high-throughput genomic studies required. However, no systematic data on the anti-tumor action of ATRA in Gastric Cancer are available in the literature. The data obtained on the cell lines may support the idea that ATRA favours rather than blocks the proliferation and survival of Gastric Cancer cells. If this is the case, the observation may redirect the research project towards inverse-agonists of the retinoid receptors.
4.4.3 Significance and Innovation

ATRA is a non-conventional anticancer agent differing from classic chemotherapeutics. Its anti-tumor activity has been established in pre-clinical models of different tumor types, although very few studies have evaluated the therapeutic potential of ATRA in Gastric Cancer. This project explores the sensitivity of Gastric Cancer to ATRA with a systematic approach. The study is expected to provide insights into the possible clinical development of a novel type of anti-cancer agent that could act in synergy with other therapies. The emerging theme of personalized medicine calls for the development of accurate diagnostic tools capable of predicting the sensitivity of individual patients to a given therapeutic agent. It is foreseen that completion of the project will provide the rationale for the design of phase I/II trials based on ATRA or derived retinoids in gastric cancer.

As a Bioinformatics Engineer with an informatics background, I conducted only the computational and bioinformatics analyses related to the results presented in this thesis. All wet-lab and experimental work, which is minimal and included only to support the bioinformatics results, was performed by my laboratory colleagues, as indicated in the results and methodologies sections. Therefore, this thesis is primarily focused on bioinformatics-based analyses and findings.

5 RNA-Sequencing: Methods & Technique

RNA-sequencing (RNA-Seq) is a technique widely used in genomics and molecular biology to obtain a complete profile of the entire transcriptome. The methodology is based on the identification and quantification of the various types of existing RNAs, such as mRNA, non-coding RNA and microRNA, as well as on the detection of the mechanisms of activation and inhibition of genes and of how their expression is regulated.
In addition, the methodology allows differential analyses between conditions, comparing the transcripts in order to identify the pathways and biological processes involved.

5.1 Sequencing Workflow and Pre-Analysis:

The first step of the RNA-Sequencing workflow consists of extracting the RNA from samples and converting it into a library of cDNA fragments. Subsequently, the fragments are sequenced to generate millions of short reads corresponding only to the selected RNA sequences. This first step prepares the data for subsequent analyses, such as the measurement of the gene expression levels and the identification of transcripts.

5.1.1 RNA Extraction and Library Preparation

The RNA extraction and library preparation procedure can be divided into different phases. The protocol used in our laboratory is the "TruSeq Stranded Total RNA (Low Sample)" [54] (Fig. 1):

1) RNA Extraction: This phase consists of extracting total RNA from the biological samples. Depending on the objective of the analysis, it is possible to identify messenger RNAs, non-coding RNAs and other types of RNAs. After the extraction process, the quality of the RNA is evaluated using a Bioanalyzer with a specific RNA chip.

2) Selection and Enrichment of RNA: Depending on the research activity, it is necessary to select specific subpopulations of RNA, such as mRNA, which is the typical choice for the study of protein-coding genes. The mRNA selection process requires the use of magnetic beads coated with poly-T oligonucleotides, i.e. short sequences of thymine (T) bases. These sequences are complementary to the adenine tail (poly-A) present at the 3' end of the mRNA, and the beads are mixed with the sample to allow binding by complementarity.
The separation then occurs through the application of a magnetic field, which attracts the magnetic beads linked to the mRNA to be extracted, followed by an elution step in which a solution is applied to recover the purified mRNA. This step is necessary as it removes the cytoplasmic ribosomal RNA.

3) Conversion to cDNA: In general, RNA is relatively unstable and cannot be sequenced directly with this technology. Therefore, it must first be converted into complementary DNA (cDNA) using various enzymes. The first enzyme is Reverse Transcriptase, which converts messenger RNA (mRNA) into a single strand of complementary DNA (cDNA). This enzyme uses random primers (short sequences of random nucleotides) to initiate DNA synthesis. Random primers bind to random sites along the RNA strand, providing a starting point for the action of reverse transcriptase and ensuring broad coverage of the RNA regions, which increases the representativeness of the cDNA produced. The second step involves the Ribonuclease H enzyme, which degrades the original RNA strand of the RNA-cDNA hybrids, leaving only the cDNA strand. The third and final step involves DNA polymerase, which synthesizes the second, complementary cDNA strand.

4) Preparation of the Library: The library is prepared by processing the cDNA, which is fragmented into smaller parts. A single adenine base is added to the 3' end of each fragment. This is necessary to allow the ligation of the sequence-specific adapters, which permit binding of the cDNA to the sequencing platform and amplification.

5) Purification and Quality Assessment: The adapter-ligated fragments are purified and amplified by PCR to create the final cDNA library. PCR amplifies only fragments with adapters on both ends, using a primer cocktail that anneals exactly to these ends, to increase the amount of genetic material.
To obtain high-quality data, an optimal cluster density is necessary, which requires accurate quantification of the DNA library template.

6) Sequencing: The final step is the loading of the library onto the sequencing platform, which, in our case, is the Illumina NextSeq 500 system. The cDNA fragments are sequenced to produce reads that represent the transcripts present in the original sample.

Figure 1. Overview of Sample Preparation for RNA Sequencing².

5.1.2 Illumina NGS Process Workflow

The main steps of Illumina NGS sequencing are the same for both DNA and RNA, and are shown below [55]:

1) Library Preparation
2) Cluster Generation
3) Sequencing
4) Data Analysis

A. Library Preparation. The library is prepared by random fragmentation of the DNA or cDNA sample, followed by adapter ligation at the 5' and 3' ends of the fragments (Fig. 2A), as described in the previous paragraph.

B. Cluster Generation. The cluster generation phase takes place inside the flow cell, a small cartridge containing several channels called "lanes" through which the DNA flows. The presence of the lanes allows several samples to be loaded and sequenced simultaneously, guaranteeing the parallelization of the sequencing process. After the DNA library is loaded into the flow cell, the DNA fragments bind, through their adapters, to the complementary oligonucleotides fixed on the flow-cell surface. This initial binding step is followed by Bridge Amplification, an amplification that takes place in situ, directly on the surface of the flow cell, in a manner similar to PCR (Polymerase Chain Reaction). In this phase, each DNA fragment folds over, creating a "bridge" that can bind to a further complementary oligonucleotide. This process is repeated multiple times in order to create several thousand copies of each fragment within a single cluster, and the clusters are physically separated from each other.
A high number of identical copies of the original fragment in each cluster is essential to obtain a precise and clear sequencing signal, especially in the read phase (Fig. 2B).

C. Sequencing. Sequencing occurs inside the flow cell through the Sequencing by Synthesis (SBS) process specific to Illumina technology. In particular, each DNA cluster is sequenced by adding chemically modified nucleotides, each type labelled with a different fluorochrome. Each added base is identified by the light of a specific wavelength emitted by its fluorochrome. Each nucleotide carries a "terminator" that temporarily prevents the addition of further nucleotides and allows sequential base-by-base reading. Once the emitted fluorescence signal has been detected and recorded by an imaging system, the terminator is removed so that DNA synthesis can proceed to the next incorporation cycle. These signals are then translated into base sequences (A, T, C, G) representative of the sequence of the sample (Fig. 2C).

D. Data Analysis. For the purposes of data analysis, the reads are aligned to a reference genome (Fig. 2D). After alignment, the data are imported into software tools in order to implement an analysis pipeline.

Figure 2. Illumina NGS Workflows³.

5.2 RNA-Sequencing Pre-Processing phase:

The pre-analysis steps involve quality control checks to ensure the integrity and usability of the data, including the removal of low-quality reads and the alignment of sequences to a reference genome.

5.2.1 Multiplexing and Demultiplexing

Over the years, the amount of data has increased, particularly regarding NGS experiments. This increase requires sequencing of a larger number of samples and a larger number of libraries in the shortest possible time.
In particular, Multiplexing is a technique that allows different DNA or RNA libraries to be pooled and sequenced in a single run (Fig. 3). With multiplexed libraries, unique index sequences are added to each DNA fragment during library preparation, so that each read can be identified and sorted before the final data analysis. The principal advantage is a considerable reduction in analysis times. On the other hand, this time gain adds a level of complexity to the sequenced reads, which must be identified and sorted computationally through the Demultiplexing process before proceeding with the final data analysis [55].

Figure 3¹. (A) During the preparation of the libraries, unique index sequences are added to two different libraries. (B) Libraries are pooled and loaded into the same lane of the flow cell. (C) Libraries are sequenced together in a single run. All sequences are exported to a single output file. (D) A demultiplexing algorithm sorts the reads into several files based on their indexes. (E) Each set of reads is aligned to the appropriate reference sequence³.

Demultiplexing is a crucial step in Next-Generation Sequencing (NGS) analysis and belongs to the data pre-processing phase. Because multiple samples are pooled and processed in parallel in a single run (Multiplexing), the sequencing reads must afterwards be sorted back into their respective samples. This is possible thanks to the unique index sequences added to the samples during library preparation. The complexity of Demultiplexing derives from the high number of samples and sequences involved, as millions of reads must be precisely identified and assigned to the correct samples by computational methods.

For this purpose, the "bcl2fastq" tool, developed by Illumina, is often used. bcl2fastq converts Binary Base Call (BCL) files, which are the raw output of Illumina sequencing devices, to the more accessible FASTQ format [56].
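To make the index-based sorting concrete, the following is a minimal Python sketch of demultiplexing. It is an illustration only and not the bcl2fastq implementation: the sample names, index sequences and reads are hypothetical, and the rule of assigning a read only when exactly one sample index lies within the allowed number of mismatches is an assumption made for this example.

```python
from collections import defaultdict


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("index sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_indexes, max_mismatches=1):
    """Sort (index_sequence, read_sequence) pairs into per-sample bins.

    A read is assigned to a sample only if exactly one sample index is
    within `max_mismatches` of the observed index; otherwise it is placed
    in an "Undetermined" bin (assumption of this sketch).
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        hits = [
            name
            for name, expected in sample_indexes.items()
            if hamming(index_seq, expected) <= max_mismatches
        ]
        target = hits[0] if len(hits) == 1 else "Undetermined"
        bins[target].append(read_seq)
    return dict(bins)


# Hypothetical sample sheet: sample name -> index sequence.
SAMPLE_INDEXES = {"slice_vehicle": "ATCACG", "slice_ATRA": "CGATGT"}

# Hypothetical pooled reads: (observed index, read sequence).
READS = [
    ("ATCACG", "ACGTACGT"),  # exact match to slice_vehicle
    ("CGATGA", "TTGACCAA"),  # one mismatch from slice_ATRA
    ("GGGGGG", "CCCCAAAA"),  # matches no sample index
    ("ATCACC", "GGTTAACC"),  # one mismatch from slice_vehicle
]

result = demultiplex(READS, SAMPLE_INDEXES, max_mismatches=1)
for sample, sequences in result.items():
    print(sample, sequences)
```

In this toy example, the second and fourth reads are still assigned because their indexes differ from a known index by a single base, whereas the third read is set aside as undetermined.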
FASTQ files, which include both nucleotide sequences and quality scores, permit downstream bioinformatics analyses. During the conversion, bcl2fastq also demultiplexes, using the index information to assign each read to the correct sample. bcl2fastq is typically run from the command line in a Bash terminal [57]. The program needs the input folder holding the BCL files as well as the sample sheet, which contains information on the index sequences used for each sample. In addition, users provide an output path for the demultiplexed FASTQ files. The application offers numerous customization options for the conversion process, such as adjusting the permissible number of mismatches in index sequences or handling compressed BCL files, making it a versatile tool in the NGS data processing pipeline.

5.2.2 FASTQ file quality control

After the Demultiplexing phase, evaluating the quality of the generated FASTQ files is an essential step. This is possible through FastQC, a tool widely used in genomics and bioinformatics to evaluate the quality of NGS experiments [58]. Several input formats are accepted, including BAM, SAM and FASTQ files, which may come from different experiments and sequencing platforms. The advantage of this program lies in the possibility of quickly and accurately identifying problems in the sequence data and of carrying out a preliminary evaluation before applying a more detailed analysis. A further positive aspect is the modular structure, in which each module reports the analysis of a different parameter. The modules include, for example, evaluations of sequence quality scores, GC content, sequence duplication levels and overrepresented sequences. FastQC illustrates the results with graphs and summary tables, which offer a concise overview of the data and allow the user to easily identify files and sections characterized by poor quality. In addition, these results are provided as reports in HTML format. FastQC is flexible and can be run either interactively or offline, which allows automation of the processing procedures. FastQC is implemented in Java, an aspect that should not be underestimated, as it makes the tool compatible with most operating systems.

5.2.3 Alignment of the RNA-Seq Data

During this stage, we aligned the paired-end reads obtained from the RNA-Seq experiment to the reference genome. We utilized the reference genome hg38 (GRCh38.p12), which is a comprehensive digital repository of nucleic acid sequences. hg38 was methodically compiled by researchers to serve as a representative model of the whole set of genes found in the human species (Homo sapiens). The data can be accessed through the dedicated servers of the UCSC Genome Browser and Ensembl.

Aligning the large datasets of reads generated by high-throughput sequencing (HTS) to a reference genome is a crucial step in the processing of RNA-Seq data [59]. The sequenced reads consist of short fragments of 150 base pairs, which is far less than the typical size of human genes (24 kilobase pairs). Due to factors such as the potential existence of deletions, insertions, mismatches and sequencing errors, the alignment of these reads to genomic regions may be ambiguous. Therefore, we performed the alignment using STAR (Spliced Transcript Alignment to a Reference), a specialized sequence aligner that is tailored to align non-contiguous sequences directly to the reference genome [59]. STAR has been reported to outperform other aligners in terms of mapping speed, sensitivity and alignment precision.
Two types of approaches can be distinguished in Illumina Next-Generation Sequencing (Fig. 4):

- Single-Read Sequencing, also known as single-end sequencing, is a method in which the DNA is sequenced from just one end of each DNA fragment.
- Paired-End Sequencing (PE) allows sequencing of both ends of the DNA fragment. Typically, PE sequencing results in a greater quantity of SNV calls after read-pair alignment.

Although some techniques, including short RNA sequencing, are more suitable for single-read sequencing, the majority of researchers currently choose the paired-end strategy. Alignment algorithms can efficiently map reads across repetitive regions using the known distance between the two reads of each pair. This results in better alignment of reads, especially in repetitive, difficult-to-sequence regions of the genome.

Figure 4². Diagram illustrating sequencing both ends of the DNA fragment for alignment to the reference genome³.

The STAR method comprises two distinct phases: the seed search phase and the clustering/stitching/scoring phase.

1. Seed search: The primary concept behind the STAR seed search phase is the systematic search for a Maximal Mappable Prefix (MMP). The MMP of a read sequence $R$ at position $i$, with respect to a reference genome sequence $G$, is defined as the longest substring $(R_i, R_{i+1}, \ldots, R_{i+\mathrm{MML}-1})$ that exactly matches one or more substrings of $G$, where MML is the maximum mappable length. Initially, the method identifies the MMP beginning from the first base of the read. When a splice junction is present, the read cannot be mapped to the genome contiguously. As a result, the first seed is mapped up to a donor splice site. Subsequently, the MMP search is repeated for the unaligned segment of the read, which, in this particular scenario, will be aligned starting from an acceptor splice site.
Splice junctions are identified in a single alignment process without any prior knowledge of their locations or characteristics, and without the preliminary alignment pass required by junction database methods. The MMP search is conducted in both the forward and backward directions of the read sequence [59] (Fig. 5).

Figure 5. Schematic representation of the Maximal Mappable Prefix search in the STAR algorithm for detecting (a) splice junctions, (b) mismatches and (c) tails⁶.

2. Clustering, Stitching and Scoring: During the second step of the algorithm, STAR constructs alignments of the complete read sequence by connecting all the seeds aligned to the genome in the initial phase. Initially, the seeds are clustered based on their proximity to a selected set of "anchor" seeds. The size of the genomic windows dictates the upper limit for the intron size in the spliced alignments. The stitching allows for any number of mismatches, but permits only one insertion or deletion (gap). It is important to highlight that the seeds from the mates of paired-end RNA-seq reads are clustered and stitched together at the same time. Each paired-end read is treated as a single sequence, which allows for the possibility of a genomic gap or overlap between the inner ends of the mates. This is a principled way of using the paired-end information, as it reflects the nature of paired-end reads, namely the fact that the mates are fragments (ends) of the same sequence. This strategy enhances the sensitivity of the algorithm, as a single correct anchor from either mate is sufficient to precisely align the whole read [59]. In the stitching phase, a local alignment scoring scheme is used to guide the process. This scheme incorporates user-defined scores, or penalties, for matches, mismatches, insertions, deletions and splice junction gaps.
This permits a quantitative evaluation and ranking of the alignment quality. The stitched combination with the highest score is selected as the optimal alignment of a read. For multi-mapping reads, all alignments with scores within a user-defined range below the highest score are reported [59].

The STAR command aligns sequencing reads to a reference genome in the context of RNA-Seq data processing. Usually, in a Bash environment, the input files holding the RNA-Seq reads, the reference genome directory and the path to the STAR program must be provided. A basic STAR command may read like this:

    STAR --genomeDir /path/to/genomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --runThreadN NumberOfThreads

The indexed reference genome directory is specified by --genomeDir; the paths to the paired-end read files follow --readFilesIn (in the case of single-end reads, only one file is specified); and the alignment process can be accelerated by setting the number of threads for parallel computing with --runThreadN. In addition, the --quantMode GeneCounts option of the STAR program yields gene-level expression quantification by counting the reads that map to the exons of each annotated gene (Fig. 6).

Figure 6. Overview of RNA-Seq data analysis⁷.

Although a STAR run for RNA-Seq data produces several files, the aligned reads are contained in the file Aligned.out.sam or Aligned.out.bam, depending on the output format requested. Subsequent investigations, including the determination of splice variants or the measurement of gene expression levels, depend on these files. The Log.final.out file is especially significant, as it provides summary statistics of the alignment process, including the proportion of uniquely mapped reads, and thus sheds light on the quality of the alignment and of the sequencing data.
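As an illustration of the seed search phase of STAR described earlier in this section, the following Python sketch finds Maximal Mappable Prefixes by brute force. It is a didactic simplification under stated assumptions, not the STAR implementation: the function names, the toy genome and the toy read are hypothetical, only exact matches in the forward direction are considered, and a plain string search replaces the suffix-array index on which STAR relies for speed.

```python
def find_all(text, pattern):
    """Return the start positions of every (possibly overlapping) match."""
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


def maximal_mappable_prefix(read, start, genome):
    """Return (length, genome_positions) of the longest prefix of
    read[start:] that occurs exactly at least once in the genome."""
    length = 0
    positions = []
    # Extend the prefix one base at a time while it still occurs in the genome.
    while start + length < len(read):
        candidate = read[start:start + length + 1]
        hits = find_all(genome, candidate)
        if not hits:
            break
        length += 1
        positions = hits
    return length, positions


def seed_search(read, genome):
    """Split a read into consecutive seeds (read_start, length, genome_positions)."""
    seeds = []
    i = 0
    while i < len(read):
        length, positions = maximal_mappable_prefix(read, i, genome)
        if length == 0:
            # The base at this position does not occur in the genome; skip it.
            i += 1
            continue
        seeds.append((i, length, positions))
        i += length
    return seeds


# Hypothetical toy genome: two exons separated by a 16-base intron.
EXON_1 = "ACGTTGCAAGGCTA"
INTRON = "GTAAGTCCCTTTTCAG"
EXON_2 = "TCGGATCCATGCAA"
GENOME = EXON_1 + INTRON + EXON_2

# Hypothetical spliced read: last 8 bases of exon 1 joined to first 8 of exon 2.
READ = EXON_1[-8:] + EXON_2[:8]

seeds = seed_search(READ, GENOME)
for read_start, length, positions in seeds:
    print(read_start, length, positions)
```

With this toy input the read is split into two seeds: the first ends at the last base of the upstream exon and the second starts at the first base of the downstream exon, so the gap between them on the genome corresponds to the 16-base intron.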
5.3 RNA-Sequencing Post-Processing phase:

Following initial processing, the emphasis switches to "Differential Expression Analysis" using DESeq2 and "Gene Set Enrichment Analysis" (GSEA) to explore pathways of biological significance.

5.3.1 Differential Expression Analysis with DESeq2

In bioinformatics, DESeq2 is one of the most widely used tools for detecting variations in gene expression levels across samples under different conditions. It is distributed as a Bioconductor software package and is specifically intended for the analysis of data obtained from sequencing studies, including RNA-seq. DESeq2 works by statistically comparing gene expression between experimental groups. The input of the process consists of raw counts, which indicate the number of reads mapped to every gene in every sample. These data must be specifically prepared to guarantee the correct representation of the gene expression levels. The DESeq2 algorithm takes into account the heterogeneity of gene expression data by applying a negative binomial distribution to model the read counts before carrying out subsequent operations, such as differential expression or enrichment analyses. This method is advantageous and effective because the input data are discrete and their variance depends strongly on the mean, an innate characteristic of this type of data. Differential analysis is carried out using the variables representative of the experimental conditions examined, adjusting for any variations in library size between the samples to provide unbiased comparisons [60].

One distinctive aspect of the analysis using DESeq2 concerns the estimation of size factors, which, as mentioned above, makes it possible to compensate for the variations caused by the sequencing depth or the size of the library between samples.
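The idea of a size factor can be illustrated with the median-of-ratios approach, which is the default estimator in DESeq2. The following Python sketch re-implements the calculation on a toy count table for explanatory purposes only; it is not DESeq2 code, and the function name, gene names and counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` maps each gene to a list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero. At least one gene must
    have non-zero counts in every sample, otherwise the median is undefined.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c == 0 for c in gene_counts):
            continue
        # Logarithm of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor of a sample is the median ratio over all retained genes.
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Hypothetical raw counts: the second sample was sequenced twice as deeply.
COUNTS = {
    "GENE_A": [10, 20],
    "GENE_B": [100, 200],
    "GENE_C": [50, 100],
    "GENE_D": [0, 3],
}

factors = size_factors(COUNTS)
normalized = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in COUNTS.items()
}
print(factors)
print(normalized["GENE_A"])
```

Dividing each raw count by the size factor of its sample places the two toy samples on a common scale, so that a gene whose count merely doubled because of sequencing depth no longer appears to differ between them.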
The main aim of this normalization is to guarantee that variations in gene expression are due exclusively (where possible) to biological differences rather than to technical artefacts or other discrepancies. In addition, DESeq2 allows the testing hypotheses to be set according to the changes in gene expression of interest. The most commonly used tests, suitable for most queries and experimental designs, are the Wald test and the Likelihood Ratio Test (LRT). Usually, pairwise comparisons are performed with the Wald test, while complex experimental designs with many components may benefit from the LRT. After statistical testing, the DESeq2 output provides a p-value for every gene, which indicates the probability of observing an expression difference at least as large as the measured one if there were no true difference between the conditions. False positives become less likely when these p-values are adjusted for multiple testing, often using the Benjamini-Hochberg method. The DESeq2 results include lists of differentially expressed genes together with their log2 fold changes, which quantify the size and direction of the changes in gene expression. Visualization techniques like MA plots, volcano plots and heatmaps may be used to gain an overview of the overall structure of the data and of specific patterns of gene expression [61].

5.3.2 Gene Set Enrichment Analysis (GSEA)

Measuring the amounts of DNA, RNA and proteins in biological samples has become a standard procedure. This results in a substantial volume of data, allowing researchers to investigate new biological functions, correlations between genotypes and phenotypes, and disease processes [62,63]. The current difficulty is interpreting the outcomes in order to acquire knowledge of biological systems. To address these analytical difficulties, we may use pathway enrichment analysis, which typically consists of three main phases [64]. Initially, a selection of genes of particular interest is established using omics data. For example, RNA-seq data might provide a list of genes that are differentially expressed between conditions [65].
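The first phase, deriving a gene list from differential expression results, can be sketched as follows. The example applies the Benjamini-Hochberg adjustment to raw p-values and then keeps the genes that pass two cut-offs. It is a minimal Python illustration and not the procedure used in this thesis: the gene names, fold changes, p-values and both cut-off values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, pvalues[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


# Hypothetical results table: gene -> (log2 fold change, raw p-value).
RESULTS = {
    "GENE_A": (2.1, 0.01),
    "GENE_B": (-1.5, 0.04),
    "GENE_C": (0.2, 0.03),
    "GENE_D": (3.0, 0.50),
}

# Hypothetical cut-offs chosen only for this example.
PADJ_CUTOFF = 0.1
LOG2FC_CUTOFF = 1.0

genes = list(RESULTS)
padj = benjamini_hochberg([RESULTS[g][1] for g in genes])

# Keep genes that are significant after adjustment and change strongly enough.
gene_list = [
    g
    for g, q in zip(genes, padj)
    if q < PADJ_CUTOFF and abs(RESULTS[g][0]) >= LOG2FC_CUTOFF
]
print(gene_list)
```

The resulting list is the kind of input that the subsequent enrichment step evaluates against a pathway database; stricter or looser cut-offs change the length of the list and therefore the pathways that can be detected.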
Next, a statistical method is used to identify the pathways that are enriched in the gene list to a greater extent than would be expected by chance. The gene list is evaluated for enrichment in all pathways included in a given database. There are several pathway enrichment analysis methods that may be used, and the selection relies on the nature
