Part 1. System Modeling in Cellular Biology: From Concepts to Nuts and Bolts
2006
Douglas B. Kell and Joshua D. Knowles
Summary
This document is Part 1 of the 2006 book "System Modeling in Cellular Biology: From Concepts to Nuts and Bolts". The text introduces the concepts and philosophy of modeling in systems biology.
Full Transcript
I GENERAL CONCEPTS 1 The Role of Modeling in Systems Biology Douglas B. Kell and Joshua D. Knowles The use of models in biology is at once both familiar and arcane. It is familiar because, as we shall argue, biologists presently and regularly use models as abstractions of reality: diagrams, laws, graphs, plots, relationships, chemical formulae and so on are all essentially models of some external reality that we are trying to describe and understand (fig. 1.1). In the same way we use and speak of “model organisms” such as baker’s yeast or Arabidopsis thaliana, whose role lies in being similar to many organisms without being the same as any other one. Indeed, our theories and hypotheses about biological objects and systems are in one sense also just models (Vayttaden et al., 2004). Yet the use of models is for most biologists arcane because familiarity with a subset of model types, especially quantitative mathematical models, has lain outside the mainstream during the last 50 years of the purposely reductionist and qualitative era of molecular biology. It is largely these types of model that are an integral part of the “new” (and not-so-new) systems biology and on which much of the rest of this book concentrates. Since all such models are developed for some kind of a purpose, our role in part is to explain why this type of mathematical model is both useful and important, and will likely become part of the standard armory of successful biologists. 1.1 Philosophical Overview When one admits that nothing is certain one must, I think, also admit that some things are much more nearly certain than others. Bertrand Russell, Am I an Atheist or an Agnostic? It is conventional to discriminate (as in fig. 1.2) (a) the world of ideas, thoughts, or other mental constructs and (b) the world of observations or data, and most scientists would recognize that they are linked in an iterative cycle, as drawn: we improve our mental picture of the world by carrying out experiments that produce data, and such data are used to inform the cogitations that feed into the next part 4 The Role of Modeling in Systems Biology Figure 1.1 Models in biology. Although we shall be concentrating here on a subset of mathematical models, we would stress that the use of all sorts of models is entirely commonplace in biology—examples include (a) diagrams (here a sequence of DNA bases and the “central dogma”), (b) laws (the flux-control summation theorem of metabolic control analysis), (c) graphs—in the mathematical sense of elements with nodes and edges (a biochemical pathway), (d) plots (covariation of 2 metabolites in a series of experiments), (e) relationships (a rule describing the use of the concentration of a metabolite in disease diagnosis), (f) chemical formulae (tryptophan), and (g) images (of mammalian cells). of the right-hand arc, that designs and performs the next set of experiments as part of an experimental program. Such a cycle may be seen as a “chicken and egg” cycle, but for any individual turn of the cycle there is a clear distinction between the two essential starting points (ideas or data). This also occurs in scientific funding circles—is the activity in question ideas- (that is, hypothesis-)driven or is it data-driven? (Until recently, the latter, hypothesis-generating approach was usually treated rather scornfully.) 
From a philosophical point of view, then, the hypothetico-deductive analysis, in which an idea is the starting point (however muddled or wrongheaded that idea may be), has been seen as much more secure, since deductive reasoning is sound in the sense that if an axiom is true (as it is supposed to be by definition) and the observation is true, we can conclude that the facts are at least consistent with the idea. If the hypothesis is “all swans are white” then the prediction is that a measurement of the whiteness of known swans will give a positive response. By contrast, it has been known since the time of Hume that inductive reasoning, by which we seek to generalize from examples (“swan A is white, swan B is white, swan C is white . . . so I predict that all swans are white”) is insecure—and a 1.1 Philosophical Overview 5 Figure 1.2 The iterative relationship between the world of ideas/hypotheses/thoughts and the world of data/observations. Note that these are linked in a cycle, in which one arc is not simply the reverse of the other (Kell, 2002, 2005; Kell and Welch, 1991). single black swan shows it. Nothing will ever change that, and the “problem of induction” probably lies at the heart of Popper’s insistence (see Popper (1992) and more readable commentators such as Medawar (1982)) that theories can only be disproved. Note of course that it is equally true for the hypothetico-deductive mode of reasoning that a single black swan will disprove the hypothesis. This said, the ability of scientists to ignore any number of ugly facts that would otherwise slay a beautiful hypothesis is well known (Gilbert and Mulkay, 1984), and in this sense—given that there are no genuinely secure axioms (Hofstadter, 1979; Nagel and Newman, 2002)—the deductive mode of reasoning is not truly much more secure than is induction. Happily, there is emerging a more balanced view of the world. This recognizes that for working scientists the reductionist and ostensibly solely hypothesis-driven agenda has not been as fruitful as had been expected. In large measure in biology this realization has been driven by the recognition, following the systematic genome sequencing programs, that the existence, let alone the function, of many or most genes—even in well-worked model organisms—had not been recorded. This could be seen in part as a failure of the reductionist agenda. In addition there are many areas of scientific activity that have nothing to do with testing hypotheses but which are exceptionally important (Kell and Oliver, 2004); perhaps chief among these is the development of novel methods. In particular there are fields—functional genomics not least among them (Kell and King, 2000), although this is very true for many areas of medicine as well—that are data-rich but hypothesis-poor, and are best attacked using methods that are data-driven and thus essentially inductive (Kell and King, 2000). A second feature that has emerged from a Popperian view of the world (or at least from his attempt to find a means that would allow one to discriminate “science‘” from “pseudo-science” (Medawar, 1982; Popper, 1992)) is the intellectual significance of prediction: if your hypothesis makes an experimentally testable (and 6 The Role of Modeling in Systems Biology thus falsifiable) prediction it counts as “science,” and if the experimental prediction is consistent with the prediction then (confidence in) the “correctness” of your hypothesis or worldview is bolstered (see also Lipton (2005)). 
1.2 Historical Context The history of science demonstrates that both inductive and deductive reasoning occur at different stages in the development of ideas. In some cases, such as in the history of chemistry, a period of almost purely inductive reasoning (stamp-collecting and classification) is followed by the development of more powerful theories that seek to explain and predict many phenomena from more general principles. Often these theories are reductionist, that is to say, complicated phenomena that seem to elude coherent explanation are understood by some form of breaking down into constituent parts, the consideration of which yields the required explanation of the more complicated system. A prime example of the reductionist mode is the explanation of the macroscopic properties of solids, liquids, and gases—such as their temperature, pressure, and heat— by considering the average effect of a large number of microscopic interactions between particles, governed by Newtonian mechanics. For the first time, accurate, quantitative predictions with accompanying, plausible explanations were possible, and unified much of our basic understanding of the physical properties of matter. The success of early reductionist models in physics, and later those in chemistry, led in 1847 to a program to analyze (biological) processes, such as urine secretion or nerve conduction, in physico-chemical terms proposed by Ludwig, Helmholtz, Brucke, and du Bois-Reymond (Bynum et al., 1981). However, although reductionism has been successful in large part in the development of physics and chemistry, and to a great extent in acquiring the parts list for modern biology—consider the gene—the properties of many systems resist a reductionist explanation (Solé and Goodwin, 2000). This ultimate failure of reductionism in biology, as in other disciplines, is due to a number of factors, principal among them being the fact that biological systems are inherently complex. Although complexity is a phenomenon about which little agreement has been reached, and certainly for which no all-encompassing measure has been established, the concept is understood to pertain to systems of interacting parts. Having many parts is not necessary: it is sufficient that they are coupled in some way, so that the state of one of them affects the state of one or more others. Often the interactions are nonlinear so, unlike systems which can be modeled by considering averaged effects, it is not possible to reduce the system’s behavior to the sum of its parts (Davey and Kell, 1996). Common interactions in these systems are feedback loops, in which, as the name suggests, information from the output of a system transformation is sent back to the input of the system. If the new input facilitates and accelerates the transformation in the same direction as the preceding output, they are positive feedback —their effects are cumulative. If the new data produce an output in the 1.2 Historical Context 7 opposite direction to previous outputs, they are negative feedback—their effects stabilize the system. In the first case there is exponential growth or decline; in the second there is maintenance of the equilibrium. These loops have been studied in a variety of fields, including control engineering, cybernetics, and economics. An understanding of them and their effects is central to building and understanding models of complex systems (Kell, 2004, 2005; Milo et al., 2002). 
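The contrast between the two loop types can be made concrete with a pair of toy rate equations; the following is a minimal sketch in Python with arbitrary illustrative rate constants and set point, in which one variable reinforces its own production while the other corrects deviations from a set point.

```python
import numpy as np
from scipy.integrate import odeint

def positive_feedback(x, t, k=0.5):
    # Output reinforces its own production: dx/dt = +k*x
    return k * x

def negative_feedback(x, t, k=0.5, setpoint=10.0):
    # Deviations from a set point are corrected: dx/dt = k*(setpoint - x)
    return k * (setpoint - x)

t = np.linspace(0.0, 10.0, 200)
x_pos = odeint(positive_feedback, 1.0, t)   # exponential growth
x_neg = odeint(negative_feedback, 1.0, t)   # relaxes to the set point (homeostasis)

print(f"positive feedback, final value: {x_pos[-1, 0]:.1f}")   # grows without bound
print(f"negative feedback, final value: {x_neg[-1, 0]:.1f}")   # ~10, the set point
```

Running the sketch shows the first variable growing without bound and the second settling at its set point, the exponential growth and maintenance of equilibrium just described.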
Negative feedback loops are typically responsible for regulation, and they are obviously central to homeostasis in biological systems. In control engineering, such systems are conveniently described using Laplace transforms—a means of simplifying the combination and manipulation of ordinary differential equations (ODEs), and closely related to the Fourier transform (Ogata, 2001); Laplace transforms for a large variety of different standard feedback loops are known and well-understood, though analysis and understanding of non-linear feedback remains difficult (see chapter 12 for details). Classical negative feedback loops are considered to provide stability (as indeed they do when in simple systems in which the feedback is fast and effective), though we note that negative feedback systems incorporating delays can generate oscillations (for example (Nelson et al., 2004)). Positive feedback is a rather less appreciated concept for most people and, until recently, it could be all but passed over in even a control engineer’s education. This is perhaps because it is often equated with undesired instability in a system, so it is just seen as a nuisance; something which should be reduced as much as possible. However, positive feedback should not really be viewed in this way, particularly from a modeling perspective, because it is an important factor in the dynamics of many complex systems and does lead to very familiar behavior. One very simple model system of positive feedback is the Polya urn (Arthur, 1963; Barabási and Albert, 1999; Johnson and Kotz, 1977). In this, one begins with a large urn containing two balls, one red and one black. One of these is removed. It is then replaced in the urn, together with another ball of the same color. This process is repeated until the urn is filled up. The system exhibits a number of important characteristics with respect to the distribution of the two colors of balls in the full urn: early, essentially random events can have a very large effect on the outcome; there is a lock-in effect where later in the process, it becomes increasingly unlikely that the path of choices will shift from one to another (notice that this is in contrast to the “positive feedback causes instability” view); and accidental events early on do not cancel each other out. The Polya urn is a model for such things as genetic drift in evolution, preferential attachment in explaining the growth of scale-free networks (Barabási and Albert, 1999), and the phenomenon whereby one of a variety of competing technologies (all but) takes over in a market where there is a tendency for purchasers to prefer the leading technology, despite equal, or even inferior, quality compared with the others (for example QWERTY keyboards and Betamax versus VHS video). (See also Goldberg (2002) and Kauffman et al. (2000) for the adoption of technologies as an evolutionary process.) Positive feedback in a resource-limited environment also leads to familiar behavior. The fluctuations seen in stock prices, the variety of sizes of sandpiles, and 8 The Role of Modeling in Systems Biology cycles of population growth and collapse in food-chains all result from this kind of feedback. There is a tendency to reinforce the growth of a variable until it reaches a value that cannot be sustained. This leads to a crash which “corrects” the value again, making way for another rise. 
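The Polya urn process described above is simple enough to simulate directly; the following minimal sketch (with an arbitrary number of draws) illustrates the lock-in effect, with different random seeds ending at very different final proportions of red balls.

```python
import random

def polya_urn(n_draws, seed=None):
    """Start with one red and one black ball; repeatedly draw a ball at
    random and return it to the urn together with a new ball of the same
    colour, as described in the text."""
    rng = random.Random(seed)
    urn = {"red": 1, "black": 1}
    for _ in range(n_draws):
        total = urn["red"] + urn["black"]
        colour = "red" if rng.random() < urn["red"] / total else "black"
        urn[colour] += 1
    return urn["red"] / (urn["red"] + urn["black"])

# Early, essentially random draws lock in very different final outcomes.
for seed in range(5):
    print(f"run {seed}: final fraction of red balls = {polya_urn(10_000, seed):.3f}")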
Such cyclic behavior can be predictably periodic but in many cases the period of the cycle is chaotic—that is, deterministic but essentially unpredictable. All chaotic systems involve nonlinearity, and this is most frequently the result of some form of positive feedback, usually mixed with negative feedback (Glendinning, 1994; Tufillaro et al., 1992; Strogatz, 2000). Behavior involving oscillatory patterns may also be important in biological signaling (Lahav et al., 2004; Nelson et al., 2004), where the downstream detection may be in the frequency rather than the amplitude (that is, simply concentration) domain (Kell, 2005). All of this said, despite encouraging progress (for example (Tyson et al., 2003; Wolf and Arkin, 2003; Yeger-Lotem et al., 2004)), we are far from having a full understanding of the behavior of concatenations of these simple motifs and loops. Thus, the Elowitz and Leibler oscillator (Elowitz and Leibler, 2000) is based solely on negative feedback loops but is unstable. However, this system could be made comparatively stable and robust by incorporating positive feedback loops, which led to some interesting work by Ferrell on the cell cycle (Angeli et al., 2004; Pomerening et al., 2003). It is now believed that most systems involving interacting elements have both chaotic and stable regions or phases, with islands of chaos existing within stable regions, and vice versa (for a biological example, see (Davey and Kell, 1996)). Chaotic behavior has now been observed even in the archetypal, clockwork system of planetary motion, whereas the eye at the heart of a storm is an example of stability occurring within a wildly unpredictable whole. Closely related to the vocabulary of complexity and of chaos theory is the slippery new (or not so new?) concept of emergence (Davies, 2004; Holland, 1998; Johnson, 2001; Kauffman, 2000; Morowitz, 2002). Emergence is generally taken to mean simply that the whole is more than (and maybe qualitatively different from) the sum of its parts, or that system-level characteristics are not easily derivable from the “local” properties of their constituents. The label of emergent phenomenon is being applied more and more in biological processes at many different levels, from how proteins can fold to how whole ecosystems evolve over time. A central question that the use of the term emergence forces us to consider is whether it is only a convenient way of saying that the behavior of the whole system is difficult to understand in terms of basic laws and the initial conditions of the system elements (weak emergence), or whether, in contrast, the whole cannot be understood by the analysis of the parts, and current laws of physics, even in principle (strong emergence). The latter view would imply that high level phenomena are not reducible to physical laws (but may be consistent with them) (Davies, 2004). If this were true, then the modeling of (at least) some biological processes should not follow solely a bottomup approach, hoping to go from simple laws to the desired phenomenon, but might eventually need us to posit high-level organizing principles and even downward 1.3 The Purposes and Implications of Modeling 9 causality. Such a worldview is completely antithetical to materialism and remains as yet on the fringes of scientific thought. In summary, reductionism has been highly successful in explaining some macroscopic phenomena, purely in terms of the behavior of constituent parts. 
However, this was predicated (implicitly) on the assumption that there were few parts (for example, the planets) and that their interactions were simple, or that there were many parts but their interactions could be neglected (for example, molecules in a gas). However, the scope of a reductionist approach is limited because these assumptions are not true in many systems of interest (Kell and Welch, 1991; Solé and Goodwin, 2000). The advent of computers and computer simulations led to the insight that even relatively small systems of interacting parts (such as the Lorenz model) could exhibit very complex (even chaotic) behavior. Although the behavior may be deterministic, complex systems are hard to analyze using traditional mathematical and analytical methods. Prediction, control, and understanding arise mainly from modeling these systems using iterated computer simulations. Biological systems, which are inherently complex, must be modeled and studied in this way if we are to continue to make strides in our understanding of these phenomena. 1.3 The Purposes and Implications of Modeling We take it as essentially axiomatic that the purposes of academic biological research are to allow us to understand more than we presently do about the behavior and workings of biological systems (see also Klipp et al. (2005)) (and in due time to exploit that knowledge for agricultural, medical, commercial, or other purposes). We consider that there are several main reasons why one would wish to make models of biological systems and processes, and we consider each in turn. In summary, they can all be characterized as variations of simulation and prediction. By simulation we mean the production of a mathematical or computational model of a system or subsystem that seeks to represent or reproduce some properties that that system displays. Although often portrayed as substantially different (though we consider that it is not), prediction involves the production of a similar type of mathematical model that simulates (and then predicts) the behavior of a system related to the starting system described above. Clearly simulation and prediction are thus related to each other, and the important concept of generalization describes the ability of a model derived for one purpose to predict the properties of a related system under a separate set of conditions. Thus some of the broad reasons—indeed probably the main reasons—why one would wish to model a (biological) system include: Testing whether the model is accurate, in the sense that it reflects—or can be made to reflect—known experimental facts Analyzing the model to understand which parts of the system contribute most to some desired properties of interest 10 The Role of Modeling in Systems Biology Hypothesis generation and testing, allowing one rapidly to analyze the effects of manipulating experimental conditions in the model without having to perform complex and costly experiments (or to restrict the number that are performed) Testing what changes in the model would improve the consistency of its behavior with experimental observations Our view of the basic bottom-up systems biology agenda is given in fig. 1.3. 1.3.1 Testing Whether the Model Is Accurate A significant milestone in a modeling program is the successful representation of the behavior of the “real” system by a model. This does not, of course, mean that the model is accurate, but it does mean that it might be. 
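As an aside on the point above that even small deterministic systems such as the Lorenz model behave chaotically and are best studied by iterated simulation, the following sketch (using the standard textbook parameter values) integrates the Lorenz equations from two initial conditions differing by one part in 10^8 and tracks how quickly the trajectories diverge.

```python
import numpy as np
from scipy.integrate import odeint

def lorenz(state, t, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Three coupled nonlinear ODEs: small and deterministic, yet chaotic.
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t = np.linspace(0.0, 40.0, 8000)
a = odeint(lorenz, [1.0, 1.0, 1.0], t)
b = odeint(lorenz, [1.0, 1.0, 1.0 + 1e-8], t)   # perturb one coordinate by 1e-8

# The tiny initial difference grows until the trajectories are unrelated:
# deterministic, but effectively unpredictable.
separation = np.linalg.norm(a - b, axis=1)
print("separation at t ~ 5 : %.2e" % separation[1000])
print("separation at t = 40: %.2e" % separation[-1])
```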
Thus the dynamical behavior of variables such as concentrations and fluxes is governed by the parameters of the systems such as the equations describing the local properties and the parameters of those equations. This of itself is not sufficient, since generalized equations (for example, power laws, polynomials, perceptrons with nonlinear properties) with no mechanistic or biological meaning can sometimes reproduce well the kinetic behavior of complex systems without giving the desired insight into the true constitution of the system. Such models may also be used when one has no experimental data, with a view to establishing whether a particular design is sensible or whether a particular experiment is worth doing. In the former case, of engineering design, it is nowadays commonplace to design complex structures such as electronic circuits and chips, buildings, cars, or aeroplanes entirely inside a computer before committing them to reality. Famously, the Boeing 777 was designed entirely in silico before being tested first in a wind tunnel and then with a human pilot. It is especially this kind of attitude and experience in the various fields of engineering that differs from the current status of work in biology that is leading many to wish to bring numerical modeling into the biological mainstream. Another example is the development of “virtual” screening, in which the ability of drugs to bind to proteins is tested in silico using structural models and appropriate force fields to calculate the free energy of binding to the target protein of interest of ligands in different conformations (Böhm and Schneider, 2000; Klebe, 2000; Langer and Hoffmann, 2001; Shen et al., 2003; Zanders et al., 2002), the most promising of which may then be synthesized and tested. The attraction, of course, is the enormous speed and favorable economics (and scalability) of the virtual over the actual “wet” screen. 1.3.2 Analyzing Subsystem Contributions Having a model allows one to analyze it in a variety of ways, but a chief one is to establish those parts of the model that are most important for determining the behavior in which one is particularly interested. This is because simple inspection of models with complex (or even simple) feedback loops just does not allow one 1.3 The Purposes and Implications of Modeling 11 (a) (b) (c) Figure 1.3 The role of modeling in the basic systems biology agenda, (a) stressing the bottom-up element while showing the iterative and complementary top-down analyses. (b) The development of a model from qualitative (structural) to quantitative, and (c) its integration with (“wet”) experimentation. to understand them (Westerhoff and Kell, 1987). Techniques such as sensitivity analysis (see below) are designed for this, and thus indicate to the experimenter which parameters must be known with the highest precision and should be the focus of experimental endeavor. This is often the focus of so-called top-down analyses in which we seek to analyze systems in comparatively general or high-level terms, lumping together subsystems in order to make the systems easier to understand. The 12 The Role of Modeling in Systems Biology equivalent in pharmacophore screening is the QSAR (quantitative structure-activity relationship) type of analysis, from which one seeks to analyze those features of a candidate binding molecule that best account for successful binding, with a view to developing yet more selective binding agents. 
1.3.3 Hypothesis Generation and Testing Related to the above is the ability to vary, for example, parameters of the model, and thereby establish combinations or areas of the model’s space that show particular properties in which one might be interested (Pritchard and Kell, 2002), and then to perform that small subset of possible experiments that it is predicted will show such interesting behavior. An example here might be the analysis of which multiple modulations of enzymatic properties are best performed for the purposes of metabolic engineering (Cascante et al., 2002; Cornish-Bowden, 1999; Fell, 1998). We note also that when modeling can be applied effectively it is far cheaper than wet biology and, as well as its use in metabolic engineering, can reduce the reliance on in vivo animal/human experimentation (a factor of significant importance in the pharmaceutical industry). 1.3.4 Improving Model Consistency In a similar vein, we may have existing experimental data with which the model is inconsistent, and it is desirable to explore different models to see which changes to them might best reproduce the experimental data. In biology this might, for example, allow the experimenter to test for the presence of an interaction or kinetic property that might be proposed. In a more general or high-level sense, we may use such models to seek evidence that existing hypotheses are wrong, that the model is inadequate, that hidden variables need to be invoked (as in the Higgs Boson in particle physics, or the invocation of the existence of Pluto following the registration of anomalies in the orbit of Neptune), that existing data are inadequate, or that new theories are needed (such as the invention of the quantum theory to explain or at least get round the so-called “ultraviolet catastrophe”). In kinetic modeling this is often the case with “inverse problems” in which one is seeking to find a (“forward”) model that best explains a time series of experimental data (see below). 1.4 Different Kinds of Models Most of the kinds of systems that are likely to be of interest to readers of this book involve entities (metabolites, signaling molecules, etc.) that can be cast as “nodes” interacting with each other via “edges” representing reactions that may be catalyzed via other substances such as enzymes. These will also typically involve feedback loops in which some of the nodes interact directly with the edges. We refer to the basic constitution of this kind of representation as a structural model (not, 1.4 Different Kinds of Models 13 of course, to be confused with a similar term used in the bioinformatic modeling of protein molecular structures). A typical example of a structural model is shown in fig. 1.4. Figure 1.4 A structural model of a simple network involving nine enzymes (E1 to E9), four external metabolites (A,J,K,L—whose concentration must be assumed to be fixed if a steady state is to be attained), and eight internal metabolites (B,C,D,E,F,G,H,I). D and E are effectively cofactors and are part of a ‘moiety-conserved cycle’ (Hofmeyr et al., 1986) in that their sum is fixed and they cannot vary their concentrations independently of each other. The classical modeling strategy in biology (and in engineering), the ordinary differential equation (ODE) approach (discussed in chapter 6) contains three initial phases, and starts with this kind of structural model, in which the reactions and effectors are known. 
The next level refers to the kinetic rate equations describing the “local” properties of each edge (enzyme), for instance that relate the rate of the reaction catalyzed by, say, E1 to the concentrations of its substrates; a typical such equation (which assumes that the reaction is irreversible) is the Henri-Michaelis-Menten equation v = Vmax .[S]/([S] + Km ). The third level involves the parameterization of the model, in terms of providing values for the parameters (in this case Vmax and Km . Armed with such knowledge, any number of software packages can predict the time evolution of the variables (the concentrations and fluxes of the metabolites) until they may reach a steady state. This is done (internally) by recasting the system as a series of coupled ordinary differential equations which are then solved numerically. We refer to this type of operation as forward modeling, and provided that the structural model, equations, and values 14 The Role of Modeling in Systems Biology of the parameters are known, it is comparatively easy to produce such models and compare them with an experimental reality. We have been involved with the simulator Gepasi, written by Pedro Mendes (Mendes, 1997; Mendes and Kell, 1998, 2001), which allows one to do all of the above, and that in addition permits automated variation of the parameters with which to satisfy an objective function such as the attainment of a particular flux in the steady state (Mendes and Kell, 1998). In such cases, however, the experimental data that are most readily available do not include the parameters at all, and are simply measurements of the (timedependent) variables, of which fluxes and concentrations are the most common (see chapter 10). Comparison of the data with the forward model is much more difficult, as we have to solve an inverse modeling, reverse engineering or system identification (Ljung, 1999b) problem (discussed in chapter 11). Direct solution of such problems is essentially impossible, as they are normally hugely underdetermined and do not have an analytical solution. The normal approach is thus an iterative one in which a candidate set of parameters is proposed, the system run in the forward direction, and on the basis of some metric of closeness to the desired output a new set of parameters is tested. Eventually (assuming that the structural model and the equations are adequate), a satisfactory set of parameters, and hence solutions, will be found (see table 1.1). These methods are much more computer-intensive than those required for simple forward modeling, as potentially many thousands or even millions of candidate models must be tested. Modern approaches to inverse modeling use approaches from heuristic optimization (Corne et al., 1999) to search the model space efficiently. Recent advances in multiobjective optimization (Fonseca and Fleming, 1996) are particularly promising in this regard, since the quality of a model can usually be evaluated only by considering several, often conflicting criteria. Evolutionary computation approaches (Deb, 2001) allow exploration of the Pareto front, that is the different trade-offs (for example, between model simplicity and accuracy) that can be achieved, enabling the modeler to make more informed choices about preferred solutions. We note, however, that there are a number of other modeling strategies and issues that may lead one to wish to choose different types of model from that described. 
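Before turning to those alternative strategies, the forward/inverse distinction can be made concrete with a minimal sketch in plain Python rather than a dedicated simulator such as Gepasi; the two-step pathway A -> B -> C, its Henri-Michaelis-Menten parameter values, and the noisy synthetic "measurements" below are all invented for illustration, and the inverse step simply wraps the forward simulation in a least-squares optimizer.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import least_squares

# Structural model (invented for illustration): A -> B -> C, each step
# catalysed by an enzyme with irreversible Henri-Michaelis-Menten kinetics,
# v = Vmax*[S]/([S] + Km). The external metabolite A is held fixed.
A_FIXED = 5.0

def rates(y, t, vmax1, km1, vmax2, km2):
    B, C = y
    v1 = vmax1 * A_FIXED / (A_FIXED + km1)   # A -> B
    v2 = vmax2 * B / (B + km2)               # B -> C
    return [v1 - v2, v2]

# Forward modeling: structure + rate equations + parameter values -> time courses.
true_params = (1.0, 2.0, 1.5, 0.5)            # assumed Vmax1, Km1, Vmax2, Km2
t = np.linspace(0.0, 20.0, 50)
y_true = odeint(rates, [0.0, 0.0], t, args=true_params)

# Pretend these noisy concentrations of B and C are the measured variables.
rng = np.random.default_rng(0)
data = y_true + rng.normal(scale=0.02, size=y_true.shape)

# Inverse modeling: propose parameters, run the model forward, compare with
# the data, and let an optimizer iterate on the proposal.
def residuals(params):
    y_model = odeint(rates, [0.0, 0.0], t, args=tuple(params))
    return (y_model - data).ravel()

fit = least_squares(residuals, x0=[0.5, 1.0, 1.0, 1.0], bounds=(1e-6, 10.0))
print("true parameters     :", true_params)
print("estimated parameters:", np.round(fit.x, 2))
```

Note that Vmax1 and Km1 enter this model only through the fixed input flux Vmax1[A]/([A] + Km1), so the fit can recover that combination but not the two parameters separately: even this toy inverse problem is underdetermined, in line with the remarks above.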
First, the ODE model assumes that compartments are well stirred and that the concentrations of the participants are sufficiently great as to permit fluctuations to be ignored. If this is not the case then stochastic simulations (SS) are required (Andrews and Bray, 2004) (which are topics of chapter 8 and chapter 16). If flow of substances between many contiguous compartments is involved, and knowledge of the spatial dynamics is required (as is common in computational fluid dynamics), partial differential equations (PDEs) are necessary. SS and PDE models are again much more computationally intensive, although in the latter case the designation of a smaller subset of representative compartments may be effective (Mendes and Kell, 2001). If the equations and parameters are absent, it may prove fruitful to use qualitative models (Hunt et al., 1993), in which only the direction of change (and maybe rate of change) is recorded, in an attempt to constrain the otherwise huge search space of possible structural models (see chapter 7).

Table 1.1 10 Steps in (Inverse) Modeling
1. Get acquainted with the target system to be modeled.
2. Identify important variable(s) that change over time.
3. Identify other key variables and their interconnections.
4. Decide what to measure and collect data.
5. Decide on the form of model and its architecture.
6. Construct a model by specifying all parameters.
7. Run the model forward and measure behavior.
8. Compare model with measurements. If model is improving return to 6. If model is not improving and not satisfactory, return to 3, 4, and 5.
9. Perform sensitivity analysis. Return to 6 and 7 if necessary.
10. Test the impact of control policies, initial conditions, etc. Use multicriteria decision-making (MCDM) to analyze policy tradeoffs.

Similarly, models may invoke discrete or continuous time, they may be macro or micro, and they may be at a single level (such as metabolism, signaling) or at multiple levels (in which the concentrations of metabolites affect gene expression and vice versa (ter Kuile and Westerhoff, 2001)). Models may be top-down (involving large "blocks") or bottom-up (based on elementary reactions), and analyses beneficially use both strategies (fig. 1.3). Thus a "middle-out" strategy is preferred by some authors (Noble, 2003a) (see chapter 14). Table 1.2 sets out some of the issues in terms of choices which the modeler may face in deciding which type of model may be best for particular purposes and on the basis of the available amount of knowledge of the system.

Table 1.2 Different types of model, presented as choices facing the experimenter when deciding which strategy or strategies may be most appropriate for a given problem.

Stochastic or deterministic
  Possible choices: Stochastic: Monte Carlo methods or statistical distributions. Deterministic: equations such as ODEs.
  Comments: Phenomena are not of themselves either stochastic or deterministic; large-scale, linear systems can be modeled deterministically, while a stochastic model is often more appropriate when nonlinearity is present.

Discrete versus continuous (in time)
  Possible choices: Discrete: discrete event simulation, for example, Markov chains, cellular automata, Boolean networks. Continuous: rate equations.
  Comments: Discrete time is favored when variables only change when specific events occur (modeling queues). Continuous time is favored when variables are in constant flux.

Macroscopic versus microscopic
  Possible choices: Microscopic: model individual particles in a system and compute averaged effects as necessary. Macroscopic: model averaged effects themselves, for example, concentrations, temperatures, etc.
  Comments: Are the individual particles or subsystems important to the evolution of the system, or is it enough to approximate them by statistical moments or ensemble averages?

Hierarchical versus multi-level
  Possible choices: Hierarchical: fully modular networks. Multi-level: loosely connected components.
  Comments: Can some processes/variables in the system be hidden inside modules or objects that interact with other modules, or do all the variables interact, potentially? This relates to reductionism versus holism.

Fully quantitative versus partially quantitative versus qualitative
  Possible choices: Qualitative: direction of change modeled only, or on/off states (Boolean network). Partially quantitative: fuzzy models. Fully quantitative: ODEs, PDEs, microscopic particle models.
  Comments: Reducing the quantitative accuracy of the model can reduce complexity greatly and many phenomena may still be modeled adequately.

Predictive versus exploratory/explanatory
  Possible choices: Predictive: specify every variable that could affect outcome. Exploratory: only consider some variables of interest.
  Comments: If a model is being used for precise prediction or forecasting of a future event, all variables need to be considered. The exploratory approach can be less precise but should be more flexible, for example, allowing different control policies to be tested.

Estimating rare events versus typical behavior
  Possible choices: Rare events: use importance sampling. Typical behavior: importance sampling not needed.
  Comments: Estimation of rare events, such as apoptosis times in cells, is time-consuming if standard Monte Carlo simulation is used. Importance sampling can be used to speed up the simulation.

Lumped or spatially segregated
  Possible choices: Lumped: treat cells or other components/compartments as spatially homogeneous. Spatially segregated: treat the components as differentiated or spatially heterogeneous.
  Comments: If heterogeneous it may be necessary to use the computationally intensive partial differential equations, though other solutions are possible (Mendes and Kell, 2001).

1.5 Sensitivity Analysis

"Sensitivity analysis for modelers? Would you go to an orthopaedist who didn't use X-ray?" (Jean-Marie Furbringer)

Sensitivity analysis (Saltelli et al., 2000) represents a cornerstone in our analysis of complex systems. It asks the generalized question "what is the effect of changing something (a parameter P) in the model on the behavior of some variable element M of the model?" To avoid the magnitude of the answer depending on the units used we use fractional changes ΔP and observe their effects via fractional changes (ΔM) in M. Thus the generalized sensitivity is (ΔM/M)/(ΔP/P) and in the limit of small changes (where the sensitivity is then independent of the size of ΔP) the sensitivity is (dM/M)/(dP/P) = d(ln M)/d(ln P). The sensitivities are thus conceptually and numerically the same as the control coefficients of metabolic control analysis (MCA) (see Fell (1996); Heinrich and Schuster (1996); and Kell and Westerhoff (1986)). Reasons for doing sensitivity analysis include the ability to determine: 1.
If a model resembles the system or process under study 2. Factors that may contribute to output variability and so need the most consideration 3. The model parameters that can be eliminated if one wishes to simplify the model without altering its behavior grossly 4. The region in the space of input variables for which model variation is maximum 5. The optimal region for use in a calibration study 6. If and which groups of factors interact with each other. A basic prescription for performing sensitivity analysis (adapted from (Saltelli et al., 2000)) is: 1. Identify the purpose of the model and determine which variables should concern the analysis. 2. Assign ranges of variation to each input variable. 3. Generate an input vector matrix through an appropriate design (DoE). 4. Evaluate the model, thus creating an output distribution or response. 5. Assess the influence of each variable or group of variables using correlation/regression, Bayesian inference (chapter 4), machine learning, or other methods. Two examples from our recent work illustrate some of these issues. In the first, (Nelson et al., 2004; Ihekwaba et al., 2004), we studied a refined version of a model (Hoffmann et al., 2002) of the NF-κB pathway. This contained 64 reactions with their attendant parameters, but sensitivity analysis showed that only 8–9 of them exerted significant influence on the dynamics of the nuclear concentration of NFκB in this system, and that each of these reactions involved free IκBα and free IKK. An entirely different study (White and Kell, 2004) asked whether comparative genomics and experimental data could be used to rank candidate gene products in terms of their utility as antimicrobial drug targets. The contribution of each of the submetrics (such as essentiality, or existence only in pathogens and not hosts or commensals) to the overall metric was analyzed by sensitivity analysis using 3 different weighting functions, with the top 3 targets— which were quite different from those of traditional antibiotics—being similar in all cases. This gave much confidence in the robustness of the conclusions drawn. 18 1.6 The Role of Modeling in Systems Biology Concluding Remarks The purpose of this chapter was to give an overview of some of the reasons for seeking to model complex cellular biological systems, and this we trust that we have done. We have also given a very brief overview of some of the methods, but we have not dwelt in detail on: their differences, the question of which modeling strategies to exploit in particular cases, the problems of overdetermination (where many models can fit the same data) and of model choice (which model one might then prefer and why), nor on available models (for example, at http://www.biomodels.net/) and model exchange using, for example, the systems biology markup language (SBML) (http://www.sbml.org) (Finney and Hucka, 2003; Hucka et al., 2003; Shapiro et al., 2004) or others (Lloyd et al., 2004). These issues are all covered well in the other chapters of this book. Finally, we note here that despite the many positive advantages of the modeling approach, biologists are generally less comfortable with, and confident in, models (and even theories) than are practitioners in some other fields where this is more of a core activity, such as physics or engineering. 
Indeed, when Einstein was once informed that an experimental result disagreed with his theory of relativity, he famously and correctly remarked “Well, then, the experiment is wrong!” It is our hope that trust will grow, not only from a growing number of successful modeling endeavors, but also from a greater and clearer communication of models enabled by new technologies such as Web services and the SBML. Acknowledgments We thank the BBSRC and EPSRC for financial support, and Dr. Neil Benson, Professor Igor Goryanin, Dr. Edda Klipp and Dr. Jörg Stelling for useful discussions. 2 Complexity and Robustness of Cellular Systems Jörg Stelling, Uwe Sauer, Francis J. Doyle III, and John Doyle The daunting complexity of cellular systems appears as a major hurdle for largescale modeling efforts. This complexity resides not only in the sheer number of components and interactions, but also in the operations on multiple levels and time-scales. Guidelines for meaningful modeling such as underlying organizational and design principles are thus required. A key to derive guidelines could be the high internal organization and the selection for function that distinguish cellular systems from complex physical systems; both factors considerably shrink the space of possible designs. One prominent aspect of cellular functions is their robustness, that is, their insensitivity to a wide range of perturbations. Here, we focus on connections between cellular complexity and robustness—with robustness requirements being the driving forces for complexity. Since only a rather limited set of mechanisms establishes robustness in biological circuits, understanding robustness can provide a key for understanding cellular organization. Practical implications for the modeling task are, for instance, the emphasis on network structures over exact values of kinetic parameters. Thus, we advocate that qualitative or structural modeling approaches may already yield deep insights by identifying important versus less important parts of a system for the purpose of more detailed modeling. 2.1 Introduction Complexity is a hallmark of cellular systems, with great challenges for the development and analysis of cellular networks at the system level. Without appropriate conceptional frameworks for dealing with that complexity, the vision of ultimately going from the description of entire cells to organs and organisms will not be achievable. Hence, it is important to think about rather high-level abstractions of cellular properties that could help in system modeling and analysis. In general, complex sys- 20 Complexity and Robustness of Cellular Systems tems may either show a behavior or a design that is difficult to understand (Weng et al., 1999). While the behavior of biological systems is, in most cases, relatively simple, the numbers of metabolic and regulatory genes shows that complexity in biology arises mainly from abundant control circuits, that is, from the system’s design. For maintaining simple behavior under real-life conditions, biological systems have to cope with a constantly varying environment, be it changing physicochemical conditions or noisy external signals that have to be processed. Moreover, their internal properties are also subject to uncertainty, since they can, for instance, be changed by mutations, and because stochastic noise is an important source of cellular variability. 
Therefore, evolution must have strongly favored robustness, that is, a system’s ability to maintain (key) functional characteristics despite potentially harmful external or internal perturbations. A now widely accepted notion is that many (or most) cellular sub-systems are robust (Kitano, 2002a; Stelling et al., 2004b; Kitano, 2004b). Examples for this capacity can already be found in simple organisms such as the bacterium Escherichia coli, which displays robust perfect adaptation in its search for nutrients (see chapter 12) and also a high resistance to gene deletions (see section 2.4). Robustness has long been recognized as an important property of biological systems, for instance described as “canalization” (towards a specific outcome despite uncertain starting conditions) in developmental biology. However, the understanding of how robustness is accomplished at the cellular or molecular level is still limited (Hartman et al., 2001), mainly because robustness is intimately linked to the apparent complexity of cellular systems. For instance, the main purpose of cellular control systems seems to be to guarantee reliable performance of vital functions under conditions of uncertainty (Lauffenburger, 2000; Csete and Doyle, 2002). Hence, elucidating high-level cellular design principles that could be exploited in systems modeling will require the simultaneous consideration of complexity and robustness in cellular networks—which is the topic of the present chapter. We will start with describing the sources and types of cellular complexity in more biological detail, before attempting to distinguish the type of complexity that is present in biological and physical systems by focusing on functional and organizational principles that underly this complexity at a more abstract level (section 2.2). Robustness as a concept for understanding biological function and behavior will require a more in-depth exposition of the theoretical concept (section 2.3), before we discuss two biological example systems, namely central metabolism and circadian clocks (section 2.4). These examples are intended to explain how and why robustness can help in modeling cellular complexity (section 2.5). 2.2 2.2 Complexity of Cellular Networks 21 Complexity of Cellular Networks 2.2.1 Sources of Complexity Biological complexity arises at several levels. At the molecular level, heterogeneous regulation networks control individual cell responses to environmental changes. The basic biological information flow from DNA to biochemical activities—with interconnected control mechanisms—is illustrated here for metabolic networks (figs. 2.1 and 2.2). About a quarter of the around 4,000 genes in a typical microbe encode the enzymes that catalyze approximately 1,000 biochemical reactions. While all cells share essentially the same DNA, the rate of transcription (synthesis of mRNA from DNA) varies greatly for each gene. Dynamically controlled by overlapping networks of repressors and activators, transcription is further affected by the hard-wired location of the gene in an operon (or on the genome), promoter or initiation site quality, or more general mechanisms like DNA topology and epigenesis. Typically, regulatory proteins themselves are subject to negative and positive feedback regulation through interaction with other proteins or metabolites. 
Next, mRNA is translated into protein, which again is regulated at multiple levels by different mechanisms that include mRNA stability, active degradation, attenuation (premature termination as a function of the initial rate of translation), rare tRNAs, anti-sense RNA, quality of the ribosome binding site, etc. Figure 2.1 Complexity in cellular networks. Flow of information (left) and example interaction network (right). Cellular components are, for example, regulatory proteins (ellipses, R), enzymes (ellipses, E), and metabolites (capital letters). Bold arrows indicate regulatory influences (activation or inhibition), while normal arrows denote chemical reactions. Essentially each step of protein synthesis is affected by multiple and overlapping regulation loops that operate both at the global cellular and a pathway/reaction 22 Complexity and Robustness of Cellular Systems Figure 2.2 Complexity in cellular networks for a typical microbe such as E. coli. Regulatory interactions are indicated by dashed lines. Transcript interactions are based on operon structures and ribosomal RNA interactions. Proteome interactions include an average of 6–7 protein-protein interactions as well as protein-DNA, protein-RNA, and protein-membrane interactions (see chapter 10 for details). Metabolic interactions include biochemical transformations and regulatory interactions between metabolites, RNA, and protein. Protein numbers encompass differences in folding, size, and covalent modifications. Note that not all proteins are necessarily present at the same time. specific level. Activity and stability of the synthesized proteins may then be modulated by posttranslational modification (for example, phosphorylation), aggregation to multimers, or complex formation with other proteins. Beyond such geneticically determined regulation, enzyme activity is often regulated by feedback inhibition. This is a common regulatory principle in biosynthetic pathways, where endproducts inhibit the first enzyme in the pathway. In the multipurpose central metabolic pathways, several key enzymes are subject to feedback and feedforward inhibition and activation through multiple metabolites. Temporal coordination of control is achieved by combining rapid and sensitive regulation through feedback loops (seconds) with somewhat slower protein modification (seconds to minutes) and transcriptional/translational regulation (minutes). Almost no individual mechanism achieves on/off effects but rather modulates processing rates in a 2–20 fold range. Thus, much of the complexity is based on multi-level combination of heterogeneous control systems that tune strength and speed of cellular responses to stimuli. Unlike most technical systems, individual biological processes are extremely sensitive to the exact physico-chemical conditions because slight changes in, for example, temperature, pH, or the concentration and nature of the surrounding protein/membrane matrix influence the availability of substrates, products, and the kinetic properties of the enzymes themselves. Rarely are all physico-chemical parameters identical in independent experiments, but enzymes are also exposed to different micro-environments within a single cell that cannot be determined 2.2 Complexity of Cellular Networks 23 exactly. An extreme, but not exclusive case is spatial separation into several distinct intracellular compartments—a distinguishing feature between eukaryotes and simpler prokaryotes. 
An additional level of complexity is the organization of different cell types into tissues and organs and finally of multiple tissues and organs into higher organisms (for instance, humans, plants). Not even in steady state cultures of single-celled microbes, however, are all cells necessarily in identical states. Driven by a not overly stringent control design, often subpopulations enter a resting state or simply exhibit different phenotypes, which increases chances to propagate the genetic offspring in an ever-changing environment. On longer timescales (days to years), the enormous potential of biological systems for evolutionary adaptation adds yet a different level of complexity. Random imprecisions in copying the genetic source code during cell duplication continuously increase the genetic diversity within a population. While the overall precision of the duplication process is extraordinary high—about 0.003 point mutations occur per microbial genome (2–8 million base pairs) and round of replication—short generation times (minutes to hours) rapidly lead to recognizable genetic differences (Sauer, 2001). While most random differences have no apparent effect or are harmful, some variants bear the potential for improved survival upon drastic environmental changes. In contrast to most technical systems, biological systems thus continuously adapt by “redesigning” their makeup through the evolutionary process of mutation and selection. 2.2.2 “Organized” versus “Emergent” Complexity The staggering complexity of cellular networks makes appropriate abstractions mandatory for meaningful mathematical modeling. An obvious pragmatic approach consists of decomposing the networks into smaller units that allow for the development of models of limited complexity. Likewise, models for cellular networks are not built at atomic resolution of individual biochemical species. More generally, however, with an ultimate goal of modeling entire cells and organs, we will need a deeper understanding of the specific type of complexity prevalent in biology to develop rigorous analysis methods. Here, we aim at outlining such a characterization by contrasting biological (and engineered) systems with complex physical systems. Complexity has become a field of intensive research in physics through the notion that systems with many components and interactions can show complicated collective (“emergent”) behavior. For instance, when adding sand to apparently stable sand piles, we cannot predict at which point the system reaches its “margin of stability” and avalanches are generated. This does not mean that the behavior is not deterministic; we simply do not have complete knowledge of the initial conditions when starting such an experiment. As the system is extremely sensitive to changes in those conditions, the apparent behavior is chaotic. Similarly, simple sets of interacting particles can generate complicated spatial structures. Rationalizing these emergent properties often abstracts from the real systems by assuming homogeneous components that interact randomly; analysis methods for characterizing the 24 Complexity and Robustness of Cellular Systems collective behavior are often rooted in statistics (Goldenfeld and Kadanoff, 1999). Such approaches were, for instance, used in revealing rich and complex dynamic behaviors that could be generated by simplified models of cellular signaling networks (Amaral et al., 2004). 
A different issue is whether this type of abstraction is useful for a deeper understanding of biological complexity. At the first glance, biological systems differ in several aspects from the type of physical systems mentioned above. One hallmark is their heterogeneity of components and interactions. They are highly structured, which encompasses, among other things, sophisticated spatial organization and layering of different types of control mechanisms. Finally, their complexity resides in these two features as well as in the sheer numbers of components and interactions. From a dynamic point of view, real biological systems are rather boring in that homeostasis and simple switching of states prevail, while complex behavior such as chaos mainly occurs under conditions when the systems are not working properly. Hence, today’s biological systems could perhaps best be understood as rare, extremely improbable outcomes of emergent processes leading to primitive forms of life, and their subsequent shaping through evolution. Functional requirements constitute the main differences between complex physics and biology/engineering. In physics, they do not exist. Biological and engineered systems, in contrast, are evolved or designed to fulfill functions, and are constantly evaluated with respect to how well they perform. In both cases, insufficient performance will lead to extinction of a specific species, irrespective of whether this occurs through evolutionary or human design processes. The immediate consequence of a purpose is a considerably smaller design space, in which network structures that could be effective and reliable implementations are likely to be rare. Hence, we will face a more structured (instead of randomly connected) system. A hope for understanding complexity in biology then is to uncover operational principles through a “calculus of purpose” (Lander, 2004)—by asking teleological questions such as why cellular networks are organized as observed, given their known or assumed function. 2.2.3 Function and Organization Principles The purpose or function as one hallmark of cellular networks itself is a rather complicated concept. Attributing a particular function to a subnetwork may not be easy because it is in many cases context dependent. For instance, a particular signaling pathway may have roles in counter-acting biological processes such as the regulation of cell proliferation and apoptosis. Owing to the multiscale organization in biology, we need a precise notion of function at the different scales. For the example above, at the organismic level the pathway may coherently serve to achieve homeostasis of cells in an organ. Hence, we will need a hierarchical description of functions and corresponding organization principles at different levels, from coarsegrained overall architectures to detailed insight into individual network motifs (Shen-Orr et al., 2002). This corresponds to the modularization of explanations as a final aim of dealing with biological complexity. Here, we consider the global 2.2 Complexity of Cellular Networks 25 architecture of metabolism as an example. In metabolism, analyses based on the networks’ stoichiometry alone (neglecting unknown kinetics and regulation) have revealed a close relation between network structure, function, and regulation at least for bacteria (see chapter 5 for details), which makes it suitable for high-level abstractions of organization principles. 
One possible principle has been proposed recently, focusing on “bow tie” structures as shown in fig. 2.3 (Csete and Doyle, 2004). Figure 2.3 Bow tie abstraction of cellular organization. Open arrows denote cellular regulation and control. Involvement of carriers such as ATP and NAD(P)H in individual processes is indicated by •. In the bow tie view, the basic network organization is a combination of fans of possible inputs (such as nutrients that can be processed) and possible outputs (for example, the variety of biomass components) that are linked through the core of central metabolism. Fans and core have rather different structural properties: while the former show many specialized, mostly linear pathways for catabolism and anabolism, the highly interconnected network of central metabolism generates and distributes