Gene regulatory network inference using out of equilibrium statistical ...

2 downloads 13045 Views 277KB Size Report
Jul 23, 2008 - applicability of his out of equilibrium statistical mechanics approach to the gene network ... regulatory programs therefore belong to the “hard” problems that very ..... Both the successful inference and thus more efficient car-.
C O M M E N TA RY

HFSP Journal

Gene regulatory network inference using out of equilibrium statistical mechanics Arndt Benecke1,2 1

Institut des Hautes Études Scientifiques, Bures-sur-Yvette, France Institut de Recherche Interdisciplinaire–CNRS USR3078–Université Lille I, France

2

共Received 24 June 2008; published online 23 July 2008)

Spatiotemporal control of gene expression is fundamental to multicellular life. Despite prodigious efforts, the encoding of gene expression regulation in eukaryotes is not understood. Gene expression analyses nourish the hope to reverse engineer effector-target gene networks using inference techniques. Inference from noisy and circumstantial data relies on using robust models with few parameters for the underlying mechanisms. However, a systematic path to gene regulatory network reverse engineering from functional genomics data is still impeded by fundamental problems. Recently, Johannes Berg from the Theoretical Physics Institute of Cologne University has made two remarkable contributions that significantly advance the gene regulatory network inference problem. Berg, who uses gene expression data from yeast, has demonstrated a nonequilibrium regime for mRNA concentration dynamics and was able to map the gene regulatory process upon simple stochastic systems driven out of equilibrium. The impact of his demonstration is twofold, affecting both the understanding of the operational constraints under which transcription occurs and the capacity to extract relevant information from highly time-resolved expression data. Berg has used his observation to predict target genes of selected transcription factors, and thereby, in principle, demonstrated applicability of his out of equilibrium statistical mechanics approach to the gene network inference problem. [DOI: 10.2976/1.2957743]

CORRESPONDENCE Arndt Benecke: [email protected]

The identification of gene regulatory networks is paramount to the understanding of regulatory coding as much as to targeted interference in physiopathology. Given the combinatorial nature of possible gene networks a purely experimental cataloging effort will require large amounts of resources and is uncertain to succeed in view of the number of biological conditions that would have to be explored (Lesne and Benecke, 2008). Recent advances in microarray and high-throughput sequencing for transcriptomics (Cloonan et al., 2008) as well as proteomics will definitively help to increase the number of observations on the different states of activity of a given genome in many biological conditions, but still, the state space will be very hard to sample sufficiently in order to obtain a representative coverage all the more as it is itself dynamic in time. This is especially true for the larger and in

HFSP Journal © HFSP Publishing $25.00 Vol. 2, No. 4, August 2008, 183–188 http://hfspj.aip.org

terms of gene regulatory degrees of freedom more complex eukaryotic genomes. Notably, the sequencing of genomes was accompanied by the idea that using statistical methods on the genome sequence itself would help to unravel the functional gene regulatory elements and thus allow inferring gene regulatory networks from the sequence alone. Again especially in eukaryotes, different features of the gene regulatory code (as opposed to the genetic code) prevent statistical inference. Functional regulatory sequence elements are: (i) not necessarily located in the close vicinity of transcription start sites, (ii) often combinations of shorter sequence words with variable spatial organization, and (iii) bifold degenerate in that not only transcription factors interact with sets of related sequence elements, but also very different transcription factors can bind to identical elements (Benecke, 2003). The fact that tran183

HFSP Journal scription cofactors, which themselves do not bind DNA, add another combinatorial component to the overall transcription activity of a target gene, as well as the structural dynamics of chromatin on different time and space scales, further complicate regulatory inference. The deciphering of the gene regulatory code of eukaryotic cells and the inference of gene regulatory programs therefore belong to the “hard” problems that very likely are unsolvable without using a combination of very large collections of experimental genome activity recordings under many different biological conditions in conjunction with sophisticated genome sequence statistic analyses (Benecke, 2006). The main question today is precisely how to achieve the most coherent and functional coupling between the experimental and statistical methodologies. Three fundamentally different concepts can be distinguished. One approach consists in analyzing the genomic sequence of an organism for statistical sequence irregularities such as overrepresentation of sequence words (Siggia, 2005). This analysis is often restricted to sequences surrounding the transcription start sites in one dimension (“promoter regions”—a rather ill-defined subset of the genome sequence) and then correlated with functional genomics data such as transcriptome measurements. Apart from the difficulties associated with the restriction to promoter regions, functional sequence elements are not necessarily statistically over- or underrepresented. Furthermore, many, not yet well understood, sequence constraints (Audit et al., 2007) as well as repetitive elements blur such predictions. Finally, the small size of functional gene regulatory words (typically six to eight nucleotides) and their degeneracy lead to large numbers of false positive results (Maston et al., 2006). The inverse approach is also being followed. The promoter regions of genes showing statistically significant changes in their expression following a given signal are analyzed for the presence of known sequence elements or statistically overrepresented sequence motifs that could explain the signal-induced changes in gene expression (Rajewsky et al., 2002). This approach suffers from similar limitations as above. Recently, more integrated techniques are being thought of, and the two articles published by Berg are prominent examples thereof (Berg, 2008a,b). Essentially, information pertaining to the dynamics of gene expression in time-course measurements can be exploited to unravel interdependencies between different genes such as between the effector (transcription factor) and its targets. There are several reasons besides the fact that time-resolved gene expression measurements are hard to come by—given experimental and financial challenges, why such methodology is still in its infancy. First and foremost, since gene products have highly nontrivial autoregulatory, combinatorial, and timedelayed interdependencies, an effective model for the process of gene transcription and translation is a central requirement. Pure statistical correlation analysis in the absence of a gene expression model is ineffective because correlations are 184

intransitive. Probabilistic, Bayesian inference, or linear regression are prohibitive for larger networks as the computational cost scales exponentially with the number of genes considered. Second, the time resolution of the experimental data needs to echo the time scales at which regulatory events take place, which is typically seconds to minutes. However, those rapid regulatory events might occur to be considerably delayed with respect to the arrival of the inducing signal requiring either prior insight into the molecular signaling for an educated choice of the timing of the experiment or very long series of highly resolved data [Fig. 1(A)]. Finally, in eukaryotes, the chromatin structure over and around sequence regulatory elements and transcription initiators introduces further heterogeneities in the response times (Lesne, 2006; Barrera and Ren, 2006). These heterogeneities can lead to the fact that a direct target gene of the transcription signal under study changes expression only after changes for an indirect target gene can already be observed [Fig. 1(B)]. In all cases, as gene expression, due to technical limits, is mostly studied on populations, those populations need to be synchronized. Constructing effective models for gene expression is challenging. Many different models have so far been developed, and rightly so, as different models will capture different aspects of the process to a different extent. Simply speaking, three classes of models can be distinguished. Logic models simply establish operator relationships between objects (for example, IF gene product ␣ AND gene product ␤ THEN gene product ␥ is induced). Neither time nor concentrations and their dynamics or mechanistic information is taken into account, unless many instances of the system are simulated in parallel, in which case the modeling approach turns into a stochastic approach (see below). Molecular models aim to describe, in more detail, empirical knowledge about the objects and their functional relationships. Often reaction-diffusion equations are combined to more or less complex systems with many parameters, which trade precision for generality. Such models can hardly be investigated analytically and are thus simulated numerically. As a variant, so-called stochastic simulation can be used, reducing the number of kinetic parameters, which have to be measured beforehand. Finally, minimal effective models, based on defined, well-understood physical systems, are used to describe and analyze general statistical properties of the observables. Such models generally are much reduced and can often be analyzed analytically. They can be regarded as “simple” if referring to the number of variables and the structure of the equations, neither of which precludes complex dynamics of the modeled system. A system of coupled oscillators, for instance, can show chaotic behavior however is described by two simple time differential equations. Minimal models for obvious reasons are in general not well suited to describe either molecular details of a process or subtle variations of a process with many parallel executed slightly different inGene regulatory network inference using . . . | Arndt Benecke

C O M M E N TA RY

Figure 1. Time delays and resulting Inference limits. 共A兲 Many different effective delay times might be observed in the coupling of cellular processes leading to transcription and mRNA stability regulation. Signal diffusion, transcription activation of a target gene in euchromatin or heterochromatin, mRNA elongation, maturing, export, specific and generic degradation, translation, and feedback all have different, often sequence or molecular species specific time constants. While most of these processes seem to occur on sufficient close time scales in order to permit modeling using minimal models without having to resort to multiscale approaches 共Castiglione et al., 2008兲, documented cases such the one schematized in 共B兲 pose a limit on inference in the absence of any additional information.

stances. Furthermore, and contrary to their origin—simple, idealized physical processes—universality classes of behavior cannot be established in the biological system in a straightforward manner. Concretely, modeling gene expression dynamics over time for a large set of genes using a description for the motion of an overdamped particle subject to a dynamic external force, as Berg recently did, a priori does not seem sufficiently detailed in order to allow for the identification of gene regulatory sequences within the promoter regions of the studied genes. It is exactly in the seemingly small but very significant detail of how Berg used this model where the beauty of his works unfolds (Berg, 2008a,b). By being able to establish a direct link between an internal, central measure on the mesoscopic scale of the model and the associated free-energy changes, thus on the thermodynamical scale, with a simple time-dependent DNA occupancy model for the transcription factors, Berg is able to efficiently extract direct target genes of selected transcription factors from the pool of analyzed dynamic expression data. Consider the evolution over time of the quantity of a given messenger RNA as the sum of three distinct entities: (i) production with some average rate, (ii) degradation with some mRNA-specific constant, and (iii) some stochastic term that captures any true or perceived random variation (Mc Adams and Arkin, 1997; Kaern et al., 2005) and all HFSP Journal Vol. 2, August 2008

other processes, which one is, by default of knowledge or for reasons of generalization, unable to detail. The resulting partial time-dependent stochastic differential equation is, due to the last term, a complete description of any change over time in the quantity of the mRNA. If the first two terms capture only negligible (in terms of amplitude compared to the contribution of the third term) effects, the evolution of the mRNA quantity over time appears to be random, and hence the information gain becomes minimal. In other words, our ignorance of the processes affecting mRNA quantity is much larger than the effects of the two processes we describe explicitly. By estimating the free parameters of the model (production rate, degradation constant, and a scaling factor for the stochastic term) gene-by-gene on experimental timekinetic microarray data, one notably obtains an estimate of how significant the contributions by the two first terms are relative to the third term. By doing so, Berg demonstrates an expected, but beforehand, in terms of experimental analysis not truly generalized, phenomenon. Indeed, correlations between the mRNA quantities of known target genes of a transcription factor and the concentration (as approximated through its mRNA quantity) of the transcription factor itself are observed, illustrating that the time scales at which transcription factor concentrations change are the same as for their targets. This observation is the necessary condition for 185

HFSP Journal the possibility of inference using the time-kinetic measurements and hence allows/warrants reducing our ignorance of the system in that one should consider transcription factor (TF) concentration dependency of the production rate of the mRNAs. The production rate of a transcript hence becomes a function of the TF concentration, which is itself a function of time f共x共t兲兲, which leads to a better appreciation of the observed empirical statistics of the gene expression measurements by the model. It should be envisioned that mRNA concentrations are de facto kept out of equilibrium. In a certain sense the relative contribution of the stochastic term of the equation was reduced by adding more detail and thus more constraint to the first term. In Berg’s own words, the equation, which can be generalized to also account for the effects of several transcription factors with additive or synergistic activity “serves as a first starting point toward an increasingly deterministic description of mRNA dynamics.” Interestingly, similar correlations between specific mRNA degrading factors such as microRNAs and mRNA levels might exist (in the experiments that Berg analyzed, miRNAs were not yet quantified) and might lead to similar extensions of the model for the degradation term. Importantly, the equation used by Berg and sketched here, the so-called driven Langevin equation, describes the motion of an overdamped harmonic oscillator subject to a time-dependent external force. In the case of the gene expression application, the time-dependent external force is provided by the transcription factor-regulated production rate. The importance of this fact becomes more obvious when considering the following. By modeling the mRNA quantity as an oscillator, which is kept away from equilibrium through an external transcription regulation force that counteracts a default decay force in a dynamic manner, the movement of the oscillator in its force field reflects work performed by the transcription factor on the system. The total work performed by a transcription factor can be quantified using the sum of the individual motions induced—a quantity that can be indirectly estimated from the experimental data and sizes the coupling of changes in the respective concentrations of the effector and its target. When analyzing the distribution of work performed by a single transcription factor—Swi4—on its known target genes, Berg realizes that the distribution is compatible with the so-called detailed fluctuation theorem (Kurchan, 1998). A special instance of this theorem, the so-called Jarzynski equality, thereby directly relates the average work performed over many instances of the system (here the ensemble of target mRNA changes of a transcription factor) with the concomitant change in free energy. Again there are several implications of this observation of which an interesting one in the context of gene regulation is that a direct link to the free energy of transcription factor-DNA interactions can be established. Using both observations together with the experimental data and a minimal Michaelis–Menten model for the rela186

tionship between the TF concentration and the free energy, Berg then demonstrates that inversely target genes for TFs can be inferred at the example of several Swi TF family members. The latter family of factors was probably chosen as many confirmed target genes, and the binding sites were already experimentally established. Using the DNA binding model, Berg can also infer binding parameters. Moreover, he demonstrates recovery of the correlation between the numbers of sequence elements in a regulatory region and the strength of the transcription response, which generally holds. In his own words Berg has thus “. . . used simple stochastic models of mRNA dynamics based on the universal properties to infer some specific properties of gene expression. . . .” Another of the properties he recovers pertains directly to the lower-bound limits of time resolution in kinetic experiments. The noise term used in the driven Langevin equation employed is modeled as uncorrelated. However, when datasets with higher resolution were to become available, Berg predicts that correlations between the noise terms for successive intervals will become apparent, which will lead to possible modifications to the model he has described and applied (see also, Jaynes, 1984). The elegant works by Berg establish a proof of principle. Using highly resolved transcriptome measurements and a stochastic model for mRNA production and decay, it is possible to reverse engineer some cause-and-effect relationships between transcription factors and their target genes. Two obvious questions arise. Is the derived inference method more effective than other methods, or does it at least have more potential effectiveness? Does the method itself provide additional insights into the studied process of gene expression regulation? The positive answer to the latter question is defended more easily. The model provided by Berg notably allows assessing quantitatively relative overall contributions of stochasticity (without distinguishing between different intrinsic and external sources, see also, Mc Adams and Arkin, 1997; Kaern et al., 2005) in the process of gene expression regulation. The fact that the detailed fluctuation theorem holds under the assumptions made on concrete data thereby serves as an indicator for the (approximate) correctness of the assumptions. A key assumption thereby is the dependency of the mRNA production rate on the transcription factor dynamics. This dependency is at the heart of the inference strategy proposed and hence of importance. Moreover, as shown in the two works, the model can effectively be used to determine a number of relevant parameters on a gene-bygene basis that directly pertain to TF function and mRNA stability. As mentioned above, and relating in equal part to the positive answer to the first question, it is reasonable to assume that time-resolved coexpression data for miRNAs and mRNAs would allow, once available, further refining the decay term of the equation and in ultimo allowing reverse engineering of such data as well. However, the inference problem in terms of production rate control is significantly Gene regulatory network inference using . . . | Arndt Benecke

C O M M E N TA RY

Figure 2. Time-scale heterogeneity in chromatin dynamics. The author is convinced that due to a computer glitch, the “ultimate answer” was only mistakenly indicated as “42” and rather should have read “chromatin” 共Adams, 1979兲. Chromatin structural and functional dynamics are known to happen at very different time and space scales. Chromatin “breathing,” the rapid local oscillations between a primed and an active state 共Metivier, 2008兲 is likely as fast as transcription initiation “green arrows” and hence is probably sufficiently well captured by the noise term in the driven Langevin equation, whereas the establishment of, for instance, gene differentiation programs involves time scales at or in excess of cellular generation times “red arrows” and hence should have a significant contribution to the long-term behavior of the system. This aspect does not yet seem to be treated appropriately in the different modeling approaches, and it can be expected that faithful gene regulatory network inference will profit significantly from including chromatin dynamics in the models.

advanced in that it is more inclusive than other methodology. The strategy described by Berg makes explicit use of the stochasticity as well as the out-of-equilibrium nature of the gene expression process at the relevant time scales. Furthermore, the model implicitly uses the previously established Jarzynski equality for the link between the microscopic and the thermodynamic (hence averaged over large populations, see also, Ritort, 2004) behavior, which then is effectively used in the inference process. These examples alone suffice as testimony to the potential effectiveness of the methodology. The actual performance of the current method is quite hard to benchmark as only limited data in terms of effector-target mappings are presented. The choice of Swi family members, as major cell-cycle regulators, for demonstrating the inference of target genes form a cell cycle; synchronized cell population seems like a safe choice, and it would have been informative to also provide information on less obvious systems for comparison. On the other hand, these two studies are, to the authors best knowledge, the first time that target genes have successfully been inferred from very short interval time-kinetic data using a systematic first-principle derived method without using any prior information such as promoter sequences, etc. The perspectives for modification and sophistication of the described methodology clearly make this line of research very promising, not only with respect to the gene regulatory inference problem, but also as an HFSP Journal Vol. 2, August 2008

additional angle to derive a first-principle-based description of the main molecular mechanisms of gene expression control. In this respect, it seems important to also mention that from a theoretical physics perspective, the works discussed here might serve as a starting point for further investigation on other possible applications of the detailed fluctuation theorem in similar contexts, and investigations on modified Langevin equations with other than white noise might be very rewarding. Note that an active theoretical physics and computer science community experiments with colored noise, it will be hence of utmost interest to see whether “biocolored” noise terms also emerge and find their application in such problems as discussed here. Moreover, and importantly, we note that the current model is unlikely to capture chromatin structural and functional dynamics, which at least in part occur on different time and space scales (Fig. 2). Without the inclusion of a certain amount of these dynamics, it seems likely that gene regulatory network inference will remain restricted to single time-scale systems (Benecke, 2006). The formalism derived by Berg seems amendable to such an inclusion. Both the successful inference and thus more efficient cartography of the state space of the gene expression machinery conjunct with a first-principle-derived description of the molecular processes underlying gene expression will greatly improve the perspectives for biomedical target definition and 187

HFSP Journal in ultimo be important cornerstones of a formal understanding of the gene regulatory code. Statistical mechanics inspired approaches have previously proven effectiveness for problems of similar complexity (see, for instance, Braunstein and Zecchina, 2006). ACKNOWLEDGMENTS

I thank Annick Lesne, Brendan Bell, and Marc Hütt for critical comments on the manuscript and the members of my group for exciting discussions around the gene network inference problem. REFERENCES Adams, D (1979). The Hitchhiker’s guide to the galaxy, Pan Books, London. Audit, B, Nicolay, S, Huvet, M, Touchon, M, D’aubenton-carafa, Y, Thermes, C, and Arneodo, A (2007). “DNA replication timing data corroborate in silico human replication origin predictions.” Phys. Rev. Lett. 99, 248102. Barrera, LO, and Ren, B, (2006). “The transcriptional regulatory code of eukaryotic cells—insights from genome-wide analysis of chromatin organization and transcription factor binding.” Curr. Opin. Cell Biol. 18, 291–2980. Benecke, A (2003). “Genomic plasticity and information processing by transcription coregulators.” ComPlexUs 1, 65–76. Benecke, A (2006). “Chromatin code, local non-equilibrium dynamics, and the emergence of transcription regulatory programs.” Eur. Phys. J. E 19, 379–384. Berg, J (2008a). “Dynamics of gene expression and the regulatory inference problem.” Europhys. Lett. 82, 28010. Berg, J (2008b). “Out-of-equilibrium dynamics of gene expression and the Jarzynski equality.” Phys. Rev. Lett. 100, 188101.

188

Braunstein, A, and Zecchina, R (2006). “Learning by message passing in networks of discrete synapses.” Phys. Rev. Lett. 96, 030201. Castiglione, P, Falcioni, M, Lesne, A, and Vulpiani, A (2008). Chaos and coarse graining in statistical mechanics. Cambridge, UK, Cambridge University Press. Cloonan, N et al. (2008). “Stem cell transcriptome profiling via massivescale mRNA sequencing.” Nat. Methods 5, 613–619. Jaynes, ET (1984). “Prior information and ambiguity in inverse problems.” SIAM-AMS Proc. 14, 151–166. Kaern, M, Elston, TC, Blake, WJ, and Collins, JJ (2005). “Stochasticity in gene expression: from theories to phenotypes.” Nat. Rev. Genet. 6, 451–464. Kurchan, J (1998). “Fluctuation theorem for stochastic dynamics.” J. Phys. A 31, 3719–3729. Lesne, A (2006). “The chromatin regulatory code: beyond an histone code.” Eur. Phys. J. E 19, 375–377. Lesne, A, and Benecke, A (2008). “Probability landscapes for integrative genomics.” Theor. Biol. Med. Mod. 5, 9. Maston, GA, Evans, SK, and Green, MR (2006). “Transcriptional regulatory elements in the human genome.” Annu. Rev Genom. Hum. Genet. 7, 29–59. Mc Adams, HH, and Arkin, A (1997). “Stochastic mechanisms in gene expression.” Proc. Natl. Acad. Sci. U.S.A. 94, 814–819. Metivier, R et al. (2008). “Cyclic DNA methylation of a transcriptionally active promoter.” Nature (London) 452, 45–53. Rajewsky, N, Vergassola, M, Gaul, U, and Siggia, ED (2002). “Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo.” BMC Bioinf. 3, 30. Ritort, F (2004). “Work fluctutions and transient violoations of the second law: perspectives in theory and experiment.” Poincaré seminars. Dalibard, J, Duplantier, B, and Rivasseau, V (eds.) pp. 195–229, Birkhäuser, Basel, Switzerland. Siggia, ED (2005). “Computational methods for transcriptional regulation.” Curr. Opin. Genet. Dev. 15, 214–221.

Gene regulatory network inference using . . . | Arndt Benecke