Final Exam Bioinformatics part 2 for quiz.docx
Uploaded by ReliableMookaite1890
What actually happens to get a phenotype for a particular cell?

Genetic variation
Usually, certain alleles in individuals or cells are associated with some kind of phenotype: morphology, physiology, traits, characters, and so forth. These types of studies allow us to home in on which genes are potentially influencing those characteristics. We are not certain whether a gene is merely associated with the phenotype or is actually driving it. To understand that, we have to know when and where the gene is expressed, what it is doing, and which pathways it affects. For example, if we use a pharmaceutical drug to decrease the expression of a particular gene, is that actually going to stop tumor growth or cause the tumor to shrink? Or is it just a gene that the tumor expresses at a high level but that is not driving the development of that tumor?

Genetic variation, GWAS + gene and protein expression = functional analysis
In functional analysis, we are trying to see what functional roles these different gene products or genetic factors actually play.

Differential expression
- Important to look at coding genes. Protein-coding genes are very well described.
- Gets you one step closer to the phenotype.
If we look specifically at which genes are getting transcribed and eventually translated in different cells, we can start to better understand the link between the genotype, the environment and how it influences expression, the developmental stage along with other biological factors, and the phenotype of that cell.

Pretty much every single cell in your body has the exact same genomic content, but a few cell types have slightly different genomes. For example, mature red blood cells (RBCs) do not have a nucleus, although they started as immature RBCs that had a nucleus at one point. More or less, every single cell starts off with the same genome.
The only difference between these cells is which genes are actually expressed and which pathways were activated to cause that cell to differentiate into a particular type. Those expression patterns and pathways are what create the phenotype.

What genes are being expressed at what point? Because expression is so context-dependent, it is regulated at several levels:
- By region (brain vs. kidney)
- In development (fetal vs. adult tissue)
- In the dynamic response to environmental signals (immediate-early response genes)
- In disease states

Gene expression analysis
For your samples of interest, you want samples where most of the gene expression is similar and there are not a lot of differences. Then the differences that you do observe are potentially the ones driving the phenotype. You want the samples to be similar enough that the differences you see are informative:
- Comparing gene expression between a neuron and a muscle cell would show so many differences that there would be no way to make sense of what is actually causing the difference between those 2 phenotypes.
- Samples with much more similar expression profiles are a better comparison: the differences we see between them are going to be the ones causing the tumor phenotype or resulting from it.

Then, extract RNA from the samples; this RNA represents all the transcribed genes that are actually being used in the cell. You need to convert that RNA to cDNA because RNA by itself is very unstable and there are a lot of RNases constantly chewing it apart. cDNA is much more stable and can be made double-stranded. The cDNA then gets processed and prepared so it can be analyzed on microarrays. Probes are spotted on the array, and they represent different exons of different genes. You fluorescently label the cDNA, and when it matches the probe for a particular gene, the spot where that probe sits will light up.
From the intensity of the different colors that you see, you can determine which transcripts from which genes are expressed.

RNAseq analysis is very similar as far as selecting the samples and converting the RNA to cDNA. After you extract the RNA, instead of probing to see which genes are present by hybridizing to oligonucleotide probes, you simply sequence the whole thing on an Illumina flow cell. Then you map those reads back to an annotated genome to identify where the sequences originate. When a read maps somewhere, it means that the cell transcribed that section of the genome into RNA, which you then converted to the cDNA you sequenced. So there is a gene that is actually transcribed from that particular location. Then you perform a statistical analysis to determine which of the differences are actually statistically and biologically significant. If you can identify these, you can start to understand which genes are potentially functioning to produce the phenotypes you're interested in, and how those networks are being activated or used to modify that particular process in the cell.

General approach to analysis of data
- Preprocessing: normalization, scatter plots
- Inferential statistics: t-test, ANOVA
- Exploratory (descriptive) statistics: distances, clustering
This whole process leads us to a better understanding of the functional aspects of the particular cell.

Microarray: measurement of mRNA from select genes
With microarrays, you are typically only measuring mRNA, because most microarrays are focused on protein-coding genes. You are measuring the mRNA transcript levels of the 2 samples relative to each other. It has to be a select set of genes because those genes have to have been described well enough to design the probes that go on the microarray slide.

RNAseq: genome-wide measurement of all RNA transcripts
Gives you the sequence of all transcripts that are generated in the cell. Depending on how you make the library, it can include basically any RNA transcript; it does not have to be coding and it does not have to be a previously described gene, which is why RNAseq has mainly taken over from microarray studies. Much more comprehensive.

Microarrays: tools for gene expression
Microarrays have been used for a long time in different applications. Sometimes they are used to assay DNA and the genome; other times, if you put probes from different exons of genes on them, they are used to measure the expression of genes.
- Short oligonucleotides are deposited in a grid-like array.
- Each probe is a sequence that is complementary to a known sequence within an exon or a UTR.
- A limitation is that it has to be an organism where the genome is sequenced and annotated really well, and you need the resources to actually design and manufacture the microarray slide.
When designing a DNA microarray study, you need specific samples from 2 different groups that are similar but vary by some type of phenotype.

Experimental design
- Need biological replicates, which are multiple individuals in each of the groups you are analyzing. Usually you want 3 or more per group.
- It is critical that there is a balanced, randomized experimental design. You must make sure that your samples are balanced with respect to other things that may be influencing your results.
- For example, there are differences in expression between males and females. When studying expression in cells in some sort of disorder, make sure you are matching male vs. male or female vs. female, or have an equal number of each sex in each group.
- For age differences, make sure the samples are in the same developmental category, because that can influence expression.

RNA extraction and cDNA preparation
It is critical to have a fresh sample. RNA has to be sampled within minutes, otherwise you will get degradation.
Steps: RNA extraction, conversion to cDNA, labeling with fluorescent dyes; evaluate and avoid systematic artifacts.

Hybridization to DNA array
The cDNA with fluorescent tags is hybridized to a microarray consisting of oligonucleotide probes. There are substantial technical artifacts: temperature, humidity, the person doing the procedure. This can affect results!

Acquire image of fluorescent signal
- Relative expression against a control
- RNA transcript levels are quantitated based on fluorescence intensity measured with a scanner.
- Signal is proportional to the relative number of transcripts.

Microarray data analysis
- Structural analysis: compare results from different microarrays. Determine which RNA transcripts are differentially expressed.
- Clustering: find meaningful biological patterns in the data.
- Classification: determine whether the expression pattern of RNA transcripts predicts groups.

Biological confirmation
Microarray experiments are in a way hypothesis-generating; they are not the definitive study telling you what causes the specific phenotype. They get you closer, and they generate specific hypotheses about how the pathways are working, which lead to additional experiments. The differential up- or down-regulation of specific RNA transcripts needs to be independently confirmed using other methods:
- Northern blots
- Western blots
- RT-PCR
- RNAseq
- In situ hybridization

Microarray databases
There are 2 main repositories:
- Gene Expression Omnibus (GEO) at NCBI
- ArrayExpress at the European Bioinformatics Institute
These repositories developed much more stringent criteria for researchers depositing their expression data, specifying what has to be reported with the data.
Minimum Information About a Microarray Experiment (MIAME):
- Experimental design
- Microarray design
- Sample preparation
- Hybridization procedures
- Image analysis
- Controls for normalization

RNA sequencing platform
Similar to the workflow for DNA microarray slides:
- Still have 2 samples that are very similar
- Isolate RNA
- Still convert to cDNA
- Add adapters, linkers, and barcodes so the Illumina library can be sequenced on a flow cell.
You will get many reads, and you then map those reads back to an annotated genome. They align where the transcript came from, so you can identify the genes that were expressed in that sample. The end result will be nearly identical whether you have RNAseq data or microarray data: a table listing the expression levels of the different genes.

RNA expression varies in time and space
RNA studies require a more complex study design than genome sequencing.
- Technical replicates: repeat experiments, library prep, and sequencing
- Biological replicates: multiple RNA isolates from the same samples; multiple individuals of the same stage/condition
- You need to consider variation that could hide the experimental response: environmental factors (temperature, season), age (juvenile vs. adult), growth conditions (diet, resources).
- You get finer-resolution data.
- You want high correlation between replicates to ensure there is not some sort of technical artifact.

Align reads to an annotated reference
The goal is to determine which reads correspond to which genes. There are 3 general strategies, depending on the resources available:
- De novo assembly: generate your own assembly rather than aligning to an already assembled reference genome. Assemble the reads into a transcriptome assembly (contigs of all transcripts), then map the reads back to that transcriptome assembly.
- Align to a reference genome: gives you much more information on the genes themselves. This is the preferred option.
- Align to a transcriptome

Comparing expression of transcripts
With RNAseq, the most common measure used to compare expression is Fragments Per Kilobase of transcript per Million mapped reads (FPKM).
- The number of reads mapping to a particular gene is the data you look at for levels of expression.
- Long genes will have more reads.
- The greater the sequencing depth, the more reads you will have mapping.
- FPKM accounts for how long the gene is and how much sequencing you did, so you can better compare among genes.

Gene expression data analysis and preprocessing
The actual data analysis starts once the raw data is converted to expression values for each gene in a simple table. The genes are in one column, and another column holds the RNA transcript levels: signal intensity for microarrays and FPKM for RNAseq.

Gene preprocessing: the main goal is to remove the systematic bias in the data while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. The basic assumption is that the expression levels of most genes DO NOT CHANGE in an experiment.

Global normalization: histograms of raw intensity values before and after normalization. Allows us to differentiate between different genes more accurately.

Descriptive statistics: scatter plots
- Expression levels of sample A (x-axis) vs. sample B (y-axis)
- You do not know anything about statistical significance here.

Volcano plot
- Displays both p-values and fold change
- Forms a volcano shape: log fold change (x-axis) and p-value (y-axis, usually plotted as -log10 of the p-value)
- Lets you pick apart and identify genes that have both high significance and a substantial magnitude of difference

Interpreting gene expression analysis
There is not a set of rules for how to do this. The possible outcomes are:
- A small p-value (<0.05) with a big ratio difference: the magnitude of the expression difference is really high.
- A small p-value (<0.05) with a trivial ratio difference
- A large p-value (>0.05) with a big ratio difference
- A large p-value (>0.05) with a trivial ratio difference
A large p-value with a big ratio difference is not significant. The difference may be big, but there is a lot of variation, so it may not be a real effect.

P-value and the problem with multiple testing
At P = 0.05, 5% of ALL TESTS will result in a false positive (i.e., significant by chance when there is really no difference). In a 20,000-gene analysis at P = 0.05, you would expect up to 20,000 x 0.05 = 1,000 false positives among the "significant" observations.

False discovery rate (FDR)
- Compensates for multiple testing in gene expression studies
- Used instead of the p-value
- Gives you the probability that a result among your significant differences is a false positive.
- For FDR = 0.05, about 5% of your significant tests are expected to be false positives. In a 20,000-gene analysis with FDR = 0.05, you would expect only 5% of the significant results to be false positives.
- A more stringent criterion than a raw p-value cutoff; eliminates the majority of false positives.
- Less likely to miss true positives than the Bonferroni correction. With Bonferroni, you often miss a lot of true positives because of how stringent it is.

Clustering
Heat maps
The colors in the cells are based on the magnitude of expression. The heat map groups genes based on the similarity of their expression profiles across samples, and it groups samples based on the similarity of their gene expression.
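The multiple-testing arithmetic and the contrast between a raw p-value cutoff, Bonferroni, and FDR described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular analysis package: the FDR control shown is the Benjamini-Hochberg step-up procedure (an assumption, since the notes do not name a specific FDR method), and the function names and the ten example p-values are made up for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if no gene truly differs."""
    return n_tests * alpha


def bonferroni(p_values, alpha=0.05):
    """Call a test significant only if p <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure (assumed FDR method).

    Sort the p-values, find the largest rank k whose p-value satisfies
    p_(k) <= (k / m) * fdr, and call the k smallest p-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values for ten genes (illustrative only).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]

if __name__ == "__main__":
    print("Expected false positives, 20,000 tests at 0.05:",
          expected_false_positives(20000, 0.05))
    print("Raw p < 0.05:", sum(p < 0.05 for p in example_p))
    print("Benjamini-Hochberg:", sum(benjamini_hochberg(example_p)))
    print("Bonferroni:", sum(bonferroni(example_p)))
```

With these made-up p-values, the raw cutoff calls five genes significant, Benjamini-Hochberg keeps two, and Bonferroni keeps only one, which mirrors the point that Bonferroni is the most stringent and can miss true positives.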
It groups all the genes that have a similar profile across samples together, usually on the vertical tree, and it groups the different cells and samples that you analyzed on the horizontal tree at the top. This lets you visualize genes that show a similar pattern and might be involved in, say, the same regulatory network or pathway. It provides a nice overview of the overall patterns that are influencing the main differences between groups, such as a control group and a disease group.

Successful gene expression analysis
- Difference between 2 groups
- List of genes that are differentially expressed

Understanding phenotypic effects
- Protein families
- Physical properties
- Protein localization
- Protein function
- Gene ontology (basically a dictionary that categorizes the different genes by the following): cellular component, biological process, molecular function

The main repercussion of the dynamic nature of gene expression is that what you do affects which genes are expressed, which then affects your physiology, metabolism, structure, mood, feelings, etc.

Do things that change your gene expression in a positive way:
- Read/learn new things
- Exercise
- Balanced diet
- Intermittent fasting
- Wim Hof breathing and cold exposure

Benefits attributed to the Wim Hof Method: lowers heart rate, improves the cardiovascular system, reduces pain, reduces inflammation, boosts the immune system, decreases sensitivity to cold, develops mental strength, reduces and/or eliminates depression, and improves mood and outlook.

Gene ontology
- Framework for modeling biological systems and pathways
- Defines the concepts/classes used to describe gene function and the relationships between these concepts.
Functions are classified along 3 aspects of gene products:
- Molecular function: the activities they perform
- Cellular component: where they are active
- Biological process: pathways/processes made up of the activities of multiple gene products

Enrichment analysis
- Provides insight into the pathways/processes that are affecting the phenotype.
- One of the main uses of gene ontology is enrichment analysis.

Protein structure and human disease
Once a specific gene is identified, you need to study its product to understand how it causes disease. In some cases, a single amino acid change can induce a dramatic change in the protein structure. For example, the ΔF508 mutation of CFTR (deletion of phenylalanine 508) can alter the alpha-helical content of the protein and disrupt intracellular trafficking, which causes cystic fibrosis. On the other hand, some changes are subtle and induce only a small change to part of the protein. For example, the E6V mutation in hemoglobin beta introduces a hydrophobic patch on the protein surface, leading to the clumping of hemoglobin molecules, which is what we call sickle cell disease. The side chains affect the way that different proteins are structured.

Many websites are available for the analysis of individual proteins. The two listed below are excellent resources:
- ExPASy
- ISREC

Protein secondary structure
- Determined by the amino acid side chains
- Secondary structure prediction: Chou and Fasman developed an algorithm based on the frequencies of amino acids found in alpha helices, beta sheets, and turns.

Tertiary protein structure: protein folding
Main approaches:
- Experimental determination (X-ray crystallography, NMR)
- Prediction: comparative modeling (based on homology), threading, ab initio (de novo) prediction

The Protein Data Bank
Principal repository for protein structures

The CATH hierarchy of structure

Vector Alignment Search Tool (VAST)
Compares different proteins based on their structure.
It offers a variety of data on protein structures, including:
- PDB identifiers
- Root-mean-square deviation (RMSD) values to describe structural similarities
- NRES: the number of equivalent pairs of alpha carbon atoms superimposed
- Percent identity

The Spi-1 transcription factor interacts with another transcription factor called C/EBP. These two factors interact and bind together, which allows them to recognize the enhancers of IL1B, a cytokine that is responsible for the inflammatory response and the differentiation of macrophages and is critical for the immune system. The way these transcription factors interact is critical for understanding how these cytokines are recruited and upregulated.

IL1B is the important cytokine that initiates an inflammatory response, and several transcription factors are responsible for its upregulation. Spi-1 has to recognize the promoter, and C/EBP has to bind to the enhancer that is far upstream. The DNA ends up folding over so that C/EBP contacts Spi-1 at the promoter, forming a loop, and the loop is responsible for recruiting the whole complex and regulating IL1B. It was found that the whole interaction between Spi-1 and C/EBP is controlled to some extent by a single amino acid in the complex. These types of interactions are difficult to model but are important for understanding how everything fits together.

Steps in genetic approach to drug discovery
1. Find genomic regions with variants associated with a disease phenotype.
2. Identify genes and functional elements in that region.
3. Determine the causal mutation.
4. Determine the effect this has on the structure and function of proteins, functional RNAs, and/or complexes.
5. Identify other genes/RNA molecules that are affected.
6. Find drugs that will target the functional site or mediate the downstream effect of the disease factor.

A single amino acid is critical for Spi-1 interaction with C/EBP.

PCSK9 is a protein that binds to LDL cholesterol receptors (LDLR).
When PCSK9 is blocked, more LDLR is presented on the cell surface, thus removing more cholesterol. Alirocumab is a monoclonal antibody that blocks PCSK9; once PCSK9 cannot bind to its targets, cholesterol is reduced.

Bioinformatics software can model the interactions between molecules to predict those that may bind active sites.

There are >30 genes that increase the risk of coronary disease.

IL1B and Spi-1
- IL1B is a cytokine that causes inflammation. It initiates fever, promotes bone resorption and cartilage breakdown, and induces acute-phase reactants in the liver.
- Spi-1 is part of the transcription factor complex necessary for upregulation of IL1B.
- IL1B is involved in numerous auto-inflammatory diseases and inflammation-associated disorders (arthritis and septic shock).
- There are 3 drugs that target IL1B: anakinra, rilonacept, and canakinumab.
- L-arginine inhibits both transcription of the IL1B gene and recruitment of C/EBP to Spi-1 at the IL1B promoter in stimulated macrophages.