Bioinformatics - CAP5 PDF
Document Details
Uploaded by UpbeatQuantum
University of Coimbra
Joel P. Arrais
Tags
Summary
This document discusses various types of biological networks, including metabolic networks, transcriptional regulation networks, and protein-protein interaction networks. It highlights the importance of understanding these networks in computational biology and systems biology, and mentions methods for their analysis.
Full Transcript
Bioinformatics Mestrado em Biologia Computacional Mestrado em Engenharia Biomédica Mestrado em Engenharia e Ciência de Dados Joel P. Arrais – [email protected] Associate Professor DEI - University of Coimbra 1 Goals for this Module 5. Systems Bi...
Bioinformatics Mestrado em Biologia Computacional Mestrado em Engenharia Biomédica Mestrado em Engenharia e Ciência de Dados Joel P. Arrais – [email protected] Associate Professor DEI - University of Coimbra 1 Goals for this Module 5. Systems Biology a. Theoretical properties of Biological Networks b. Discovery of patterns and signatures c. Forecasting and simulation of biological networks d. Time series reconstruction 2 Challenges in Computational Biology 4 Genome Assembly 5 Regulatory motif discovery 1 Gene Finding DNA 2 Sequence alignment 6 Comparative Genomics TCATGCTAT TCGTGATAA 3 Database lookup 7 Evolutionary Theory TGAGGATAT TTATCATAT TTATGATTT 8 Gene expression analysis RNA transcript 9 Cluster discovery 10 Protein 11 Protein network analysis structure prediction 12 Regulatory network inference 13 Emerging network properties 3 Biological Networks Central dogma (revisited) The standard pathway of information flow: DNA->RNA->Protein. Two kinds of proteins (enzymatic and regulatory proteins) are shown as well as two types of gene regulation (via regulatory and external signal). 5 Systems Biology Systems-level understanding of biological systems Analyze not only individual components, but their interactions as well as emergent behavior In the rest of the module: Learn new biology from the topology/wiring/structure of such interaction networks 6 Networks model many real-world phenomena E.g., WWW 8 Networks model many real-world phenomena E.g., Internet 9 Networks model many real-world phenomena E.g., Airline routes 10 11 Biological networks Intra-cellular networks a) Transcriptional regulation networks b) Protein structure networks c) Metabolic networks d) Protein-protein interaction (PPI) networks e) Cell signalling networks Other biological networks Neuronal synaptic connection networks Brain functional networks Ecological food webs Phylogenetic networks Correlation networks (e.g., gene co-expression) Disease – “disease gene” association networks Drug – “drug target” networks 12 Biological Networks. Abstraction. 13 Metabolic networks Used for studying and modeling metabolism Biochemical reactions in cells that allow an organism to: Respond to the environment Grow Reproduce Maintain its structure … i.e., the main biochemical reactions needed to keep an organism in homeostasis An internal regulation that maintains a stable, constant condition of a living system 14 Metabolic networks Metabolites Small molecules such as glucose and amino acids Also, macromolecules such as polysaccharides and glycans (carbohydrates) Metabolic pathways Series of successive biochemical reactions for a specific metabolic function, e.g., glycolysis, or penicillin synthesis, that convert one metabolite into another Enzymes: proteins that catalyze (accelerate) chem. reactions Bipartite graph Thus, in a metabolic pathway: Nodes correspond to metabolites and enzymes In an alternate order à bipartite graphs Directed edges correspond to metabolic reactions Simpler approaches: nodes are metabolites, directed edges are reactions that convert one metabolite into the other; or nodes are enzymes and metabolites as edges 15 Metabolic networks Example: part of glycolysis pathway Metabolite-centric Reactions + metabolites representation 16 Metabolic networks All metabolic pathways of a cell form a metabolic network Complete view of cellular metabolism and material/mass flow through the cell Cell relies on this network to digest substrates from the environment, generate energy, and synthesize components needed for its growth and survival Insights from analyzing them used to, for example: Cure human metabolic diseases through better understanding of the metabolic mechanisms Control infections of pathogens by understanding the metabolic differences between human and pathogens 17 Metabolic networks Constructed: Partially experimental Partially from genetic sequence (homology) Available for many organisms, from bacteria to human Available on-line: KEGG (Kyoto Encyclopedia of Genes and Genomes) Info on genes, proteins, reactions, pathways Both for eukaryotes and prokaryotes GeneDB–contains similar info BioCyc, EcoCyc, MetaCyc More specialized info on particular species WIT, renamed to ERGO 18 19 20 From: https://www.genome.jp/pathway/map00052 21 From: http://www.kegg.jp/kegg/atlas/?01100 22 Transcriptional regulation networks Model regulation of gene expression Recall: gene à mRNA à protein Gene regulation Gives a cell control over its structure and function, e.g.: Cellular differentiation – a process by which a cell turns into a more specialized cell type Morphogenesis (a process by which an organism develops its shape)... 23 Transcriptional regulation networks Nodes correspond to genes DNA sequences which are transcribed into mRNAs that translate into proteins Directed edges correspond to interactions through which the products of one gene affect those of another Protein-protein, protein-DNA and protein-mRNA interactions Transcription factor X (protein product of gene X) binds regulatory DNA regions of gene Y to regulate the production rate (i.e., stimulate or repress transcription) of protein Y Note: proteins are products of gene expression that play a key role in regulation of gene expression 24 26 Cell signaling networks Cell signaling Complex communication system that governs basic cellular activities, e.g., development, repair, immunity Errors in signaling cause diseases E.g., cancer, autoimmune diseases, diabetes… E.g.: Transforming growth factor beta (TGF-β) is a protein that controls proliferation, cellular differentiation, and other functions in most cells. 27 Cell signaling networks Signaling pathways Ordered sequences of signal transduction reactions in a cell, as shown in the previous figure Cascade of reversible chemical modifications of proteins E.g., phosphorylation catalyzed by protein kinases: enzymes that modify other proteins by adding phosphate groups to them (process called phosphorylation) Signaling pathways in the cell form the cell signaling network Nodes are proteins and edges are directed 28 Protein-Protein Interaction (PPI) Networks 30 Protein-Protein Interaction (PPI) Networks A protein-protein interaction (PPI) usually refers to a physical interaction, i.e., binding between proteins PPIs are very important for structure and function of a cell: Participate in signal transduction (transient interactions) Play a role in many diseases (e.g., cancer) Can be stable interactions forming a protein complex (a form of a quaternary protein structure, set of proteins which bind to do a particular function, e.g., ribosome, hemoglobin – illustrated below) 31 Protein-Protein Interaction (PPI) Networks PPIs are very important for structure and function of a cell: Can be transient interactions Brief interactions that modify a protein that can further change PPIs e.g., protein kineases (add a phosphate group to a target protein) A protein can carry another protein, e.g., nuclear pore importins (proteins that carry other proteins from cytoplasm to nucleus and vice versa) Transient interactions form the dynamic part of PPI networks Some estimates state that about 70% of interactions are stable and 30% are dynamic (transient) PPIs are essential to almost every process in a cell Thus, understanding PPIs is crucial for understanding life, disease, development of new drugs (most drugs affect PPIs) 32 Protein-Protein Interaction (PPI) Networks Methods to detect PPIs: Biological and computational approaches None are perfect High rates of false positives Interactions present in the data sets that are not present in reality High rates of false negatives Missing true interactions 33 Protein-Protein Interaction (PPI) Networks Methods to detect PPIs: PPIs initially studied individually by small-scale biochemical techniques (SS) However, large-scale (high-throughput) interaction detection methods (HT) are needed for high discovery rates of new protein interactions SS of better “quality,” i.e., less noisy than HT However, HT are more standardized, while SS are performed differently each time SS are biased – the focus is on the subsets of proteins interesting to particular researchers HT – view of the entire proteome 34 Protein-Protein Interaction (PPI) Networks Methods to detect PPIs: Physical binding Yeast 2-hybrid (Y2H) screening Mass spectrometry of purified complexes Functional associations Correlated mRNA expression profiles Genetic interactions In silico (computational) methods In many cases, functional associations do take the form of physical binding 35 Protein-Protein Interaction (PPI) Networks Biases within PPI networks: The following is lost: Spatial information Temporal information Information about experimental conditions Strength of interactions Number of experiments confirming interactions PPI network: proteome + interactome Proteome: a set of all unique proteins in an organism; How does protein concentration affect the topology: More instances of a protein in the cell à more interacting partners in the network? 36 Protein-Protein Interaction (PPI) Networks Quality and completeness of PPI data Data sets produced by different methods are often complementary Even data sets obtained by the same technique complement each other to some (large) extent Completeness of data sets: Yeast: ~50% (~6K proteins, ~30K-60K interactions) Human: ~10% (~25K proteins, ~260K interactions; ~300 million pairs to test) Fly Worm Recently, herpes viruses (genome-wide coverage) 37 Different Network Types: Summary m2 Proteins C Metabolites Metabolism A B Gene regulation Cell signaling m1 m3 D PPIs E F 38 Network Properties Definition: Graph G is an ordered triple G:=(V, E, f) V is a set of nodes, points, or vertices. E is a set, whose elements are known as edges or lines. f is a function maps each element of E to an unordered pair of vertices in V. Vertex V:={1,2,3,4,5,6} Basic element E:={{1,2},{1,5},{2,3},{2,5},{3,4},{4,5},{4,6}} Drawn as a node or a dot. Vertex set of G is usually denoted by V(G), or V Edge A set of two elements Drawn as a line connecting two vertices, called end vertices, or endpoints. The edge set of G is usually denoted by E(G), or E. 39 Network Comparisons: Properties of Large Networks Large network comparison is computationally hard due to NP- completeness Thus, network comparisons rely on easily computable heuristics (approximate solutions), called “network properties” Network properties can be divided in two categories: Global network properties: give an overall view of the network, but might not be detailed enough to capture complex topological characteristics of large networks. Local network properties: more detailed network descriptors which usually encompass larger number of constraints, thus reducing “degrees of freedom” in which the networks being compared can vary. 41 41 Network Comparisons: Properties of Large Networks 1. Global Network Properties: a) Degree distribution b) Average clustering coefficient c) Clustering spectrum d) Average diameter e) Spectrum of shortest path lengths f) Centralities Readings: Chapter 3 of “Analysis of biological networks” by Junker and Björn 42 a) Degree Distribution Definitions: degree of a node is the number of edges incident to the node. Average degree of a network: average of the degrees over all nodes in the network. However, average degree might not be representative, since the distribution of degrees might be skewed. x deg(x)=5 43 a) Degree Distribution Let P(k) be the percentage of nodes of degree k in the network. The degree distribution is the distribution of P(k) over all k. P(k) can be understood as the probability that a node has degree k. (log-log plot) Here P(k) ~ k-γ , where often 2 ≤ γ < 3. This is a power-law, heavy-tailed distribution. Networks with power-law degree distributions are called scale-free networks. In them, most of the nodes are of low degree, but there is a small number of highly-linked nodes (nodes of high degree) called “hubs.” 44 a) Degree Distribution Example where average degree Example where average degree is meaningful is NOT meaningful However: degree distribution (and global properties in general) are weak predictors of network structure. Illustration: P(k) is a Poisson distribution. G and H are of the same size (i.e.,|G|=|H| -- they have the same number of nodes and edges) and they have the same degree distribution, but G and H have very different topologies (i.e., graph structure). 45 Research Debates… Assortative (favored connections between similar nodes) vs. disassortative (between dissimilar nodes) mixing of degrees: Do high-degree nodes interact with high-degree nodes? Done by: Pearson corr. coefficient between degrees of adjacent vertices Average neighbor degree; then average over all nodes of degree k Structural robustness and attack tolerance: “Robust, yet fragile” Scale-free degree distribution: “Party” vs. “date” hubs J.D. Han et al., Nature, 430:88-93, 2004 Bias in the data collection – sampling? M. Stumpf et al., PNAS, 102:4221-4224, 2005 J. Han et al., Nature Biotechnology, 23:839-844, 2005 High degree nodes: Essential genes H. Jeong at al., Nature 411, 2001. Disease/cancer genes Jonsson and Bates, Bioinformatics, 22(18), 2006 Goh et al., PNAS, 104(21), 2007 47 47 b) Average Clustering Coefficient Definition: clustering coefficient Cv of a node v: Cv = |E(N(v))|/(max possible number of edges in N(v)) where N(v) is the neighborhood of v, i.e., all nodes adjacent to v Cv can be viewed as the probability that two neighbors of v are connected. Thus 0 ≤ Cv ≤ 1. Definition: For vertex v of degree 0 or 1, by definition Cv=0. Definition: average clustering coefficient, C, of a network is the average Cv over all nodes v∈ V. 48 b) Average Clustering Coefficient Example: |N(v)|= 4, since there are 4 nodes in N(v), i.e., N(v)= {1, 2, 3, 4} |E(N(v))|= 3, since there are 3 edges between nodes in N(v) Max possible number of edges between nodes in N(v) is: choose(4,2) = 6. Therefore Cv= 3/6 = 1/2 49 c) Clustering Spectrum Definition: clustering spectrum, C(k), is the distribution of average clustering coefficients of all nodes of degree k in the network, over all k. Example: 50 d) Average Diameter Definition: the distance between two nodes is the smallest number of links that have to be traversed to get from one node to the other. Definition: the shortest path is the path that achieves that distance. Definition: the average network diameter is the average of shortest path lengths over all pairs of nodes in a network. 52 e) Spectrum of Shortest Path Lengths Definition: Let S(d) be the percentage of node pairs that are at distance d. The spectrum of shortest path lengths is the distribution of S(d) over d. Example: 53 d) and e) Average Diameter and Spectrum of Shortest Path Lengths u Distance between a pair of nodes u and v: G Du,v = min {length of all paths between u and v} = min {3,4,3,2} = 2 = dist(u,v) v Average diameter of the whole network: D = avg {Du,v for all pairs of nodes {u,v} in G} Spectrum of the shortest path lengths E.g. (not for G) 54 Node Centralities Rank nodes according to their “topological importance” Definition: Centrality quantifies the topological importance of a node (edge) in a network. There are many different types of centralities. Different types of centralities: Degree centrality Closeness centrality Eccentricity centrality Betweenness centrality Subgraph centrality Eigenvector centrality Software tools: Visone (social nets) and CentiBiN (biological nets) 55 Node Centralities 1. Degree centrality, Cd(v): nodes with a large number of neighbors (i.e., edges) have high centrality. Therefore, we have Cd(v)=deg(v). Example of a use of degree centrality: In PPI networks, nodes with high degree centrality are considered to be “biologically important.” (E.g., essential/lethal, mutated in cancer…) 2. Closeness centrality, Cc(v): nodes with short paths to all other nodes in the network have high closeness centrality 1 Cc(v)= å dist(u,v) uÎV 56 Node Centralities 3. Betweenness centrality, Cb(v): Nodes (or edges) which occur in many of the shortest paths have high betweenness centrality. sst(v) å sst Cb(v)= s¹t s¹v v ¹t Above: The above summation means that there is a sum on the top and on the bottom of the fraction. σst(v) = the number of shortest paths from s to t that pass through v σst = the number of all shortest paths from s to t (they may or not pass through node v) 57 Node Centralities 4. Eccentricity centrality, Ce(u): nodes with short paths to any other node have high eccentricity centrality Eccentricity of a node u is defined as ecc(u) = max dist(u,v) vÎV So it is the maximum shortest path length from node u to all other nodes v in V. Eccentricity centrality of a node u: Thus, central nodes have higher Ce since they have lower ecc. 58 g) Node Centralities Example: 2 5 4 4 3 2 1 4 5 H-2 j -4 2 f-3 Degree Closeness Betweenness From highest F,G A,B,D to H A,B C,E,I C,E lowest J I J 59 g) Node Centralities Example: Degree Closeness Betweenness From highest F, G F, G H A, B, D D, H F, G to H A, B I C, E, I C, E D lowest I A, B J J C, E, J 60 Reconstruction of Time Series Motivation Problem: Most methods for reconstructing signaling and regulatory response networks from high throughput data generate static models which cannot distinguish between early and late stages in the response Hypothesis: Given time series gene expression data and host proteins that a virus or drug interacts with, it is possible to infer the signaling pathways that are responsible for the gene expression change over time. A cell is in a stationary state A perturbation is induced (external condition; Virus; bacteria) What are the response mechanisms to this event? BETTER PUT: What are the chain of response pathways to this event? 10 2 Temporal constraint Hypothesis — differential expression at later time points is affected by differential expression at earlier time points. Implementation in model — if a signaling pathway ends at a differentially expressed gene at time point T, then there must be a differentially expressed gene from a previous time point. 10 4 State of the art: autocorrelation Autocorrelation between successive points. Can identify complete set of acting genes. Allows to infer causality. Pearson//other similarity measures Problems: Few samples; lot of variables Very high noisy dependent 10 5 State of the art: graphical models Efficient way to represent and reason about joint distributions Graphs in which nodes represent random variables and edges correspond to dependency relationships Two major types: Directed and undirected 10 6 Markov random fields – Physical networks 10 7 Maximum Edge Orientation (Gitter2011) 10 8 Input – Output Hidden Markov Model (Bengio1995) 10 9 Major limitations Both previous solution were already explored DREM uses Input-Output Hidden Markov Models (previous CMU model) One major limitation of the previous methods is the lack of scalability ß difficulty to explain the whole model Size of the problem: 20k human proteins The total number of possible paths with length between 3 and 10 is ~1043 11 0 Proposed solution Use heuristics and domain knowledge to filter the search space Apply Integer Programing (IP) to search for the best paths Select the paths that better explain the observed expression Select the the best early stage proteins that affect the most the infection mechanism. 11 1 Material and Methods Material: Expression profile for the HIV infection mRNA expression profiling for 12 time points over 24 hours Seed nodes for the HIV mechanism Transcription Factors; Protein-DNA interactions Human Protein-Protein interaction network Methods: Path Filtering Problem formulation and IP model implementation Objective function Maximize the number of targets explained Maximize the weight of the selected paths Minimize the number of nodes 11 9 Finding Candidate Pathways 1. We divide the time series into k phases each consisting of T/k time points where T is the total number of points. We use k = 3 for this paper. 2. We extract the top N1 DE genes for each phase (we use N1 = 200). 3. We then search for the highest scoring N2 acyclic paths from the source proteins (host proteins interacting with the virus of drug) to the targets (DE genes) for each phase (we use N2 = 10 million here). We use the edge weights to compute a score for each path. We also guarantee the following constraints: a. The last edge in the path has to be a protein-dna interaction (i.e. we need a TF to activate / repress the gene). b. A path to a phase i>1 target has to contain a node that is a target for phase i−1. 12 0 Integer Programming 101 Mathematical optimization or feasibility program in which all variables are restricted to integers Feasible integer points are shown in red Red dashed lines indicate their convex hull The blue lines together with the coordinate axes define the polyhedron of the LP relaxation The optimal solutions of the integer problem are the points (1,2) and (2,2) 12 1 IP formulation For scalability Proteins (and not paths) are considered as variables. Variables are Boolean. Global objectives: Maximize path weight, λ1 Maximize the number of targets explained, λ2 Minimize the number of intermediate nodes, λ3 12 2 Search for optimal solutions Greedy algorithm: 1. At time point 0 all proteins all active 2. For each protein calculate the Δ for the objective function 3. Select the protein with higher Δ and change their state 4. IF Δ>0 goto step 2 5. Return optimal solution MetaHeuristic used: Local Search with random initial states TABU search: keeps a buffer with the previous paths that led to local maxima Simulated annealing: with a given probability accepts changing to lower energy states in order to avoid local maxima 12 3 Nuts and bolts Independent Dependent variables: variables: Search buffer #Paths Output buffer #Proteins Phase size #Sources Max path length #Targets Filter base network Screen hits (top 100; Depth-first search top 1000) Breath-first search Early targets Separate Targets Reactome Pathways Time restrictions 12 4 In silico validation siRNA screenhits siRNA binds to mRNA disabling its translation to proteins (gene silencing) A library with siRNAs for all human genes is inserted in the cell (~20k) For the cells that survived screen the active siRNAs. A meta-analysis of the previous three available studies determined that only common 3 proteins. Studies from (Hela/TZM-bl and 293T). We have used Sup-T1 cells. A total of ~300 sequences were selected Overlap with Reactome HIV pathways Reactome is a database of metabolic pathways It has 21 HIV infection related pathways The overlap will give a good measure of confidence 12 5 Results – Path filter #Sources and #Proteins are ~constant #ReactomePaths and #Targets are dependent of Phase size Screen hits peek at Phase size 200 #Top100 screen #Reactome #Phase size #Paths #Proteins #Sources #Targets hits Paths 100 173483 3125 163 153 21 238 200 218959 3270 163 278 25 403 400 238876 3240 163 455 23 823 800 290152 3529 163 712 21 1392 1600 358411 3785 163 978 23 2525 3000 407948 3842 163 1134 19 4288 12 6 Results – IP #Top100 screen #Reactome λ2 λ3 #Paths #Proteins #Sources #Targets hits Paths 0 0 218959 3270 163 278 25 403 1 1 213583 2280 162 278 25 403 5 1 213583 2280 162 278 25 403 9 1 213583 2280 162 278 25 403 13 1 213583 2280 162 278 25 403 17 1 213583 2280 162 278 25 403 1 81 124002 470 138 121 25 377 5 81 200081 1451 162 250 25 399 9 81 202840 1572 162 257 25 400 13 81 202840 1572 162 257 25 400 17 81 209608 1869 162 271 25 402 1 161 28794 134 57 32 24 184 5 161 200081 1451 162 250 25 399 9 161 199282 1447 162 249 25 397 13 161 199527 1451 162 253 26 400 17 161 208646 1836 162 269 25 402 1 241 0 0 0 0 0 0 5 241 191695 1189 159 216 25 394 9 241 199282 1447 162 249 25 397 13 241 198982 1447 162 250 25 397 17 241 208646 1836 162 269 25 402 1 321 0 0 0 0 0 0 5 321 191695 1189 159 216 25 394 9 321 199282 1447 162 249 25 397 13 321 198982 1447 162 250 25 397 12 17 321 208130 1832 162 266 25 402 7 #Proteins Results - IP 4000 3000 2000 1000 13 0 5 0 1 0 λ2 81 161 241 321 λ3 #Reactome Paths 600 #Pathways 400 300000 200 13 200000 0 5 0 λ2 100000 13 1 81 0 161 0 5 241 λ3 321 0 1 81 0 λ2 161 241 λ3 321 #Top100 screen #Reactome λ2 λ3 #Paths #Proteins #Sources #Targets hits Paths 0 0 218959 3270 163 278 25 403 17 81 209608 1869 162 271 25 402 1 81 124002 470 138 121 25 377 1 161 28794 134 57 32 24 184 12 8 Comparison Overlap between RNAi screen hits and top 100 genes for the different dynamic network reconstruction methods Edge list from Reactome and the edges extracted by the different methods. Comparison with a baseline ranking of the differentially expression (DE) genes is also presented. 12 9 TimePath (http://sb.cs.cmu.edu/timepath) 3 time phases x 3 iterations 13 0"8$hours 8"16$hours 16"24$hours 0 Visual representation 13 1 Visual representation https://www.youtube.com/watch?v=Y6RfKjU1kM0 13 2 Experimental validation: Results 13 3 Experimental validation: Results *p < 0.05; N=3 160 2hrs prior to Infection 4hrs post Infection 14hrs post Infection 120 Infection (%) 80 * * 40 * * * * *** * 0 * * ** * * REGORAFENIB VELIPARIB OLAPARIB DASATINIB 5AZA SAHA WP1066 SNS-032 CARFILZOMIB AZD 6244 SP600125 AZT (+) DINACICLIB IκK2-V MEDIA Vehicle 2 2 2 3 1 1 1 1 1 1 1 1 1 1 1 3 3 13 4 Final remarks TimePath is fully based on the molecular data, thus could be applied to much shorter time scale. Therefore it enables to obtain a more fine resolution of the disease stage. Blocking proteins, at any phase, leads to reduction in viral load Blocking specific proteins has a positive impact only if they are targeted at the phases predicted by TimePath Reinforces the importance of temporal models for preventing and treating infections. In silico results are consistently good both for HIV and flu H1N1 Lab validation was conducted (Pittsburgh University collaboration) 13 5 Acknowledgments Slides adapted from: Computational Biology: Genomes, Networks, Evolution, MIT Course Number 6.047 / 6.878 Computational Genomics, CMU Course GHC 8006 Methods for Computational Gene Prediction, by W.H. Majoros Analysis of Gene Expression Data, JSM CE Course Protein Threading, BMI/CS 776, by Colin Dewey Introduction to Bioinformatics, Spring 2015, by Natasa Przulj 13 6