Ninth International Workshop on Computational Systems Biology ...

Antti Larjo, Steffen Schober, Muhammad Farhan, Martin Bossert & Olli Yli-Harja (Eds.)

Ninth International Workshop on Computational Systems Biology, WCSB 2012, June 4-6, 2012, Ulm, Germany

Tampere 2012

Tampere International Center for Signal Processing. TICSP series # 61


ISBN 978-952-15-2853-8 ISSN 1456-2774

PREFACE

The Workshop on Computational Systems Biology (WCSB) has been organized successfully in recent years by the Computational Systems Biology Research Group in the Department of Signal Processing at Tampere University of Technology (TUT). The history of the workshop can be traced back to 2003, when it was organized for the first time as an internal meeting with some invited international collaborators. Since then the meeting has expanded each year, reflecting the rapid development in experimental biosciences and the growth in research on computational methods for systems biology.

The first four events were organized and hosted by TUT, but in 2008 the program committee set the target of making the event more international. Therefore, the workshop was organized at the Institute for Medical Informatics, Statistics and Epidemiology (IMISE), Leipzig, Germany, together with collaborators from the University of Leipzig and Technische Universität (TU) Dortmund. Following the successful event in 2008, WCSB 2009 was organized in conjunction with the Bioinformatics Research Centre (BiRC) at Aarhus University (AU), Denmark. In 2010 the workshop was hosted by the Luxembourg Centre for Systems Biomedicine (LCSB) at the University of Luxembourg, and in 2011 the workshop was co-organized by ETH Zürich, Switzerland, Tampere University of Technology, Finland, and the Institute for Systems Biology, USA. This year the workshop is organized jointly by Ulm University, Germany, and TUT.

The scientific program includes five invited talks from internationally acknowledged experts in their respective fields of systems biology research. This volume collects the research papers and short abstracts submitted to WCSB 2012. We would like to thank the authors and the reviewers for their contributions to the workshop and the proceedings. We are also grateful to the organizers in Finland and Germany for their efforts. We would also like to thank the Academy of Finland, Tampere Graduate School in Information Science and Engineering (TISE), Tampere International Center for Signal Processing (TICSP), and the Institute of Communications Engineering at Ulm University for their support.

On behalf of the WCSB 2012 Scientific Committee,
Olli Yli-Harja and Martin Bossert

WCSB 2012 ORGANIZATION

Scientific Committee

Miika Ahdesmäki (Almac Diagnostics)
Tommi Aho (Tampere University of Technology)
Martin Bossert (Ulm University, workshop co-chair)
Antti Honkela (Aalto University)
Juha Kesseli (Tampere University of Technology)
Heinz Koeppl (ETH Zürich)
Antti Larjo (Tampere University of Technology)
Harri Lähdesmäki (Aalto University)
Matti Nykter (Tampere University of Technology)
Andre S. Ribeiro (Tampere University of Technology)
Pekka Ruusuvuori (Tampere University of Technology)
Steffen Schober (Ulm University)
Ilya Shmulevich (Institute for Systems Biology)
Korbinian Strimmer (University of Leipzig)
Ralf Takors (University of Stuttgart)
Carsten Wiuf (Aarhus University)
Olli Yli-Harja (Tampere University of Technology, workshop co-chair)

Workshop Organization

Tommi Aho (Tampere University of Technology)
Martin Bossert (Ulm University)
Johannes Georg Klotz (Ulm University)
Antti Larjo (Tampere University of Technology)
Steffen Schober (Ulm University)

Acknowledgements

We greatly appreciate the help of Ms. Marjo Elojoki and Ms. Virve Larmila, from the Department of Signal Processing, Tampere University of Technology, for their devoted work in the organization of the workshop. Special thanks are also due to Ulrike Stier, Heike Schewe and Gero Viertel for their diligent work in the organization of the workshop. In addition, we thank NT group members Katharina Mir and David Kracht from Ulm University, and CSB group members from the Department of Signal Processing, Tampere University of Technology, for their valuable help with the organization of the workshop.

TABLE OF CONTENTS

Invited talks

Systems Biology of the Circadian Clock Hanspeter Herzel, Charité and Humboldt University Berlin, Germany

Scientific Knowledge is Possible with Small-Sample Biomarker Classification Lori A. Dalton and Edward R. Dougherty, Texas A&M University, USA


New Insights Into Regulatory Transcriptomics From Next-Generation Sequencing Christina Leslie, Sloan-Kettering Institute, USA


Bayesian Modelling in Systems Biology Dirk Husmeier, University of Glasgow, UK


Full papers

Inferring the Genes Underlying Flavonoid Production in Tomato L. Astola, V. Gomez Roldan, J. Molenaar, Wageningen University and Research Centre, The Netherlands


A Novel Censored Sampling Paradigm for Genomic Data Classification L. Dalton, E. Dougherty, Texas A&M University, USA


Parameter Inference in Mechanistic Models of Cellular Regulation and Signalling Pathways Using Gradient Matching F. Dondelinger, S. Rogers, M. Filippone, R. Cretella, T. Palmer, University of Glasgow, UK, R. Smith, A. Millar, University of Edinburgh, UK, D. Husmeier, University of Glasgow, UK

Dose-dependent Drug Synergism in Flux Balance Analysis G. Facchetti, C. Altafini, SISSA, Italy

Bayesian Inference for the Chemical Master Equation Using Approximate Models C. Gillespie, A. Golightly, Newcastle University, UK


Modelling Regulatory Processes During Morphogenesis in Drosophila Melanogaster with an Improved Version of the BGMD Dynamic Bayesian Network Model M. Grzegorczyk, Dortmund University, Germany


Determinative Power and Tolerance to Perturbations in Boolean Networks R. Heckel, ETH Zürich, Switzerland, S. Schober, M. Bossert, Ulm University, Germany


Algorithm for in Silico Optimization of Production Strains E. Heikkinen, A. Larjo, V. Santala, O. Yli-Harja, T. Aho, Tampere University of Technology, Finland


New Methods for Finding Associations in Large Data Sets: Generalizing the Maximal Information Coefficient (MIC) T. Ignac, University of Luxembourg, N. Sakhanenko, A. Skupin, D. Galas, Institute for Systems Biology, USA


Blind Source Separation Using Latent Gaussian Graphical Models K. Illner, C. Fuchs, F. Theis, Technische Universität München


Warped Gaussian Process Modelling of Transcriptional Regulation R. Ji, D. Husmeier, University of Glasgow, UK


Spatial Stochastic Simulation of Transcription Factor Binding Reveals Mechanisms to Control Gene Activation M. Klann, H. Koeppl, ETH Zürich, Switzerland


Modeling and Analysis of Division-, Age-, and Label-Structured Cell Populations P. Metzger, J. Hasenauer, F. Allgöwer, University of Stuttgart, Germany

Identification of Feedback Circuits That are Connected to Multiple Fixed Points in Biological Networks N. Radde, University of Stuttgart, Germany


In Vivo Kinetics of Asymmetric Disposal of Individual Protein Aggregates in E. Coli, One Aggregate At a Time A. Ribeiro, J. Lloyd-Price, A. Häkkinen, M. Kandhavelu, I. Marques, S. Chowdhury, E. Lihavainen, O. Yli-Harja, Tampere University of Technology, Finland


Image Analysis of Nuclear Envelope Breakdown Events Using KNIME T. Rieß, University of Konstanz, Germany, J. Marino, C. Wandke, ETH Zürich, Switzerland, D. Merhof, O. Deussen, University of Konstanz, Germany, G. Csucs, U. Kutay, P. Horvath, ETH Zürich, Switzerland

Mixture Model Clustering for Peak Filtering in Metabolomics S. Rogers, R. Daly, R. Breitling, University of Glasgow, UK


Quantifying the Relationship Between the Structure and the Dynamics of Random Boolean Networks From Time Series Data S. Sarbu, J. Kesseli, M. Nykter, Tampere University of Technology, Finland

A Model for Proliferating Cell Populations That Accounts for Cell Types D. Schittler, J. Hasenauer, F. Allgöwer, University of Stuttgart, Germany

Pareto-optimal RNA Sequence-Structure Alignments T. Schnattinger, U. Schöning, H. Kestler, Ulm University, Germany


An Advanced Image Processing Approach Based on Parallel Growth and Overlap Handling to Quantify Neurite Growth F. Schönenberger, A. Krug, M. Leist, E. Ferrando-May, D. Merhof, University of Konstanz, Germany


Set Based Uncertainty Analysis and Parameter Estimation of Biological Networks with the BioSDP Toolbox S. Waldherr, J. Hasenauer, F. Allgöwer, University of Stuttgart, Germany

Abstracts

Comparative Study of microRNA Pathway-Analysis Methodologies in Colorectal Cancer T. Creanza, V. Liuzzi, R. Maglietta, R. Anglani, P. Stifanelli, University of Bari, Italy, A. Piepoli, IRCCS, Italy, S. Mukherjee, Duke University, USA, F. Schena, N. Ancona, University of Bari, Italy


Selecting Differentially Expressed Genes with Hidden Subclasses C. Fortes, Higher School of Health Technology of Lisbon, Portugal, A. Turkman, L. Sousa, University of Lisbon, Portugal


Attractor Robustness Does Not Restrict Information Propagation in Critical Random Boolean Networks A. Gupta, J. Lloyd-Price, O. Yli-Harja, A. Ribeiro, Tampere University of Technology, Finland


Derrida Values for Networks Based on Nested Canalizing Functions C. Kadelka, D. Murrugarra, R. Laubenbacher, Virginia Tech, USA


Prediction of Regulatory Impact Under Defined Environmental Conditions Based on a Combination of Metabolic and Transcriptional Networks J. Klotz, G. Viertel, Ulm University, Germany, R. Feuer, K. Gottlieb, O. Sawodny, G. Sprenger, University of Stuttgart, Germany, M. Bossert, Ulm University, Germany, M. Ederer, University of Stuttgart, Germany, S. Schober, Ulm University, Germany


Using Physiology-based Pharmacokinetic Modeling for Pharmaceutical Research and Development L. Kuepfer, Bayer Technology Services GmbH, Germany


Modeling Metabolism of an Artificially Constructed System of Two Bacterial Strains A. Larjo, S. Santala, M. Karp, V. Santala, Tampere University of Technology, Finland

Using Kernel Density Estimators to Detect Bistable States in Stochastic Simulations of Genetic Circuits E. Monzon, A. Tejeda and C. Winstead, Utah State University, USA, C.J. Myers, C. Madsen, University of Utah, USA


Single-Molecule Dynamics of the Bidirectional Arabinose Promoter S. Oliveira, J. Mäkelä, M. Kandhavelu, O. Yli-Harja, A. Ribeiro, Tampere University of Technology, Finland

Towards Confidence Intervals for the Mutual Information Between Two Binary Random Variables A. Stefani, J. Huber, C. Jardin, H. Sticht, FAU Erlangen-Nuremberg, Germany


Adaptive Metrics for RNA-Protein Interaction Studies M. Strickert, M. Mernberger, E. Hüllermeier, University of Marburg, Germany


Effects of Rate Limiting Steps in Transcription Initiation on the Behavior of Small Genetic Motifs H. Tran, A. Häkkinen, O. Yli-Harja, A. Ribeiro, Tampere University of Technology, Finland


Modeling the Cis- and Trans-Effect of DNA Copy Number Aberrations on Gene Expression Levels in a Pathway W. van Wieringen and M. Van de Wiel, VUmc University Medical Center, The Netherlands

INVITED TALKS

SYSTEMS BIOLOGY OF THE CIRCADIAN CLOCK

Hanspeter Herzel

Institute for Theoretical Biology, Charité and Humboldt University Berlin, 10115 Berlin, Germany

The circadian clock allows organisms to adapt to the 24 h rhythm of the solar day. In most organisms an endogenous clock rhythm generator exists, based on gene regulatory feedback loops. This circadian clock, with periods ranging from 23 to 25 h, is typically entrained by external light-dark cycles and temperature rhythms. The circadian clock can be regarded as a network of coupled oscillators. Synchronization of single cell rhythms leads to a remarkable accuracy of the circadian clock even without external time cues. We present a data-driven gene regulatory network model based on delay-differential equations. Moreover, we describe single cell data with stochastic differential equations and study the synchronization of coupled neurons. Finally, we discuss the determination of the entrainment phase using discrete and continuous models. This concept explains the wide range of observed chronotypes (morning larks and night owls).


SCIENTIFIC KNOWLEDGE IS POSSIBLE WITH SMALL-SAMPLE BIOMARKER CLASSIFICATION

Lori A. Dalton and Edward R. Dougherty

Department of Electrical and Computer Engineering, Texas A&M University, USA

The “truth” of a scientific theory depends on its agreement with empirical predictions. The implications for translational science are existential because the worth of any diagnostic, prognostic, or therapeutic decision depends on the probability of a correct decision. For the development of biomarkers, this means that the scientific validity of any classifier depends on the properties of the error estimation procedure. Absent a characterization of accuracy in relation to issues such as distributional complexity, sample size, and classification rule, a computed error estimate is virtually worthless. The essentially complete lack of such characterization in the bioinformatics literature could easily leave one with the impression that worthwhile small-sample classification is impossible. Such a conclusion would be wrong. This talk addresses the fundamental role of the joint distribution between the true and estimated errors in characterizing estimation accuracy, the need (as with any scientific theory) for assumptions, and optimal classifier-error estimation in the presence of distributional assumptions.


NEW INSIGHTS INTO REGULATORY TRANSCRIPTOMICS FROM NEXT-GENERATION SEQUENCING

Christina Leslie

Sloan-Kettering Institute, Computational Biology Program, New York 10056, USA

Many large-scale genomics projects, including the ENCODE projects, are using DNA-sequencing technologies to comprehensively map the epigenome. These technologies – including ChIP-seq for profiling transcription factor binding locations and histone modifications and DNase-seq for identifying chromatin-accessible regulatory regions – tell us mainly about transcriptional regulation and not about subsequent regulatory steps. Now, emerging technologies based on RNA sequencing are beginning to shed light on co-transcriptional and post-transcriptional regulatory processes. We will describe our recent work in this new area of "regulatory transcriptomics" based on two new sequencing technologies: (1) CLIP-seq (cross-linking immunoprecipitation followed by next-generation sequencing) for mapping the transcriptome-wide binding sites of RNA-binding proteins; and (2) 3'-seq for mapping the alternative 3'-ends of all polyadenylated transcripts. We will describe how combining Argonaute CLIP-seq with a genetics approach ("differential CLIP-seq") defines, for the first time, the full set of targets of a single microRNA in a physiological context. This analysis reveals widespread non-canonical targeting, defining a wider range of sequence specificity patterns than previously known. We will also describe a new and quantitative 3'-seq method that we use to decipher tissue- and stimulus-specific alternative cleavage and polyadenylation in human cells.


BAYESIAN MODELLING IN SYSTEMS BIOLOGY

Dirk Husmeier

School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8QW, UK

There are two fundamental approaches to computational systems biology. The first approach is inductive and based on empirical modelling, aiming to infer molecular interaction networks from postgenomic data by the application of abstract models and statistical inference techniques. The second approach is deductive, based on an explicit representation of molecular interaction processes, formulated mathematically via systems of coupled differential equations. In my talk, I describe how recent work on Bayesian modelling can make substantial contributions to both approaches. For the first paradigm, I discuss how the performance of relatively simple models, based on dynamic Bayesian networks, can be substantially improved by the integration of multiple changepoint processes and the application of principles from hierarchical Bayesian modelling. For the second paradigm, I discuss the need and hurdles for statistical inference, and outline prospective solutions based on gradient matching. These discussions are complemented by a contributed companion talk on Gaussian process applications to this effect. The methods described in my talk are the result of joint work with Marco Grzegorczyk and Frank Dondelinger.


FULL PAPERS


INFERRING THE GENES UNDERLYING FLAVONOID PRODUCTION IN TOMATO

Laura Astola (1,2), Victoria Gomez-Roldan (2,3) and Jaap Molenaar (1,2)

(1) Biometris, Wageningen University and Research Centre, P.O. Box 100, 6700 AC Wageningen, The Netherlands
(2) Netherlands Consortium for Systems Biology, Amsterdam, The Netherlands
(3) Bioscience, Plant Research International, Wageningen University and Research Centre, The Netherlands

ABSTRACT

Flavonoids are plant secondary metabolites that are extensively studied for their proposed positive effects on human health. They are the end products of a cascade of enzymatic reactions that convert initially toxic substances to glycosylated forms. Determining precisely which enzymes are responsible for which conversions is far from trivial, since hundreds of candidate genes are in principle capable of performing the transformation of interest. In this paper we propose a method to solve this problem for the glycosylation of flavonoids by coupling gene expression data to the metabolic pathway underlying glycosylation. The core of the method is to estimate time dependent coefficients in a highly efficient way. To show how this approach performs, we apply the method to study the flavonoid glycosylation pathway in tomato (Solanum lycopersicum) seedlings.

INTRODUCTION

In tomato seedlings, over 200 putative glycosyl transferases [1] constitute the set of potential enzymes that catalyze the reactions of interest. The experimental validation of each glycosylation process using purified target proteins is costly and time consuming. Therefore, we want to limit the number of enzyme candidates by mathematical modeling, using both the metabolite concentration data and the gene expression data. In order to simulate and analyze the glycosylation processes, we first need a sufficiently descriptive model system. Whereas gene and signaling networks require the inference of the network architecture as well as the estimation of the network parameters, in metabolic networks one typically has some a priori information on the possible network configurations. This shifts the emphasis from structural inference methods such as Boolean networks [2] and Bayesian and statistical inference [3,4] towards kinetic parameter estimation methods [5,6]. Since we have time series data for metabolites and gene expressions (measured from the same sample material), a reasonable choice is to use ordinary differential equations (ODEs) as a model system [7]. When (as in our case) the kinetic parameters are not known, one may find suitable models in general biochemical systems theory [8]. Typically the identifiability of the parameters is not guaranteed [9]. One approach towards improving the identifiability of the parameters is the so-called dynamic flux estimation (DFE) [10]. The initial setup of our approach is similar to DFE in that the slopes (derivatives) of the measured metabolites are estimated directly from the data, and also in that the kinetic rates are solved for at each time point.

In this paper we discuss first how to estimate the time dependent kinetic rates from a time series of metabolite concentration data, and then how to extract the corresponding potential gene candidates from the time series microarray data. For clarity, we begin by briefly sketching the inference procedure for constant kinetic rates and then generalize this to the time dependent case.

CONSTANT PARAMETER ESTIMATION

We recall that any network can be represented as a graph, where nodes are connected by directed or undirected edges when there is some interaction between these nodes. In a metabolic network a node represents a substrate or a product, and a directed edge from node $i$ to node $j$ means that $i$ can be converted to $j$ by enzymatic activity. To an edge from node $i$ to node $j$ we assign a weight, the kinetic rate $k_{ij} \geq 0$, which indicates the rate of product formation. If an estimation procedure in network reconstruction yields $k_{ij} = 0$, we may conclude that there is no edge from node $i$ to node $j$. A general time-invariant linear ODE model with constant coefficients and nonhomogeneous source terms, satisfying the mass conservation law, can be written as

$$\dot{X}_i(t) = -\sum_{j \neq i} k_{ij} X_i(t) + \sum_{j \neq i} k_{ji} X_j(t) + b_i, \tag{1}$$

for $i = 1, \ldots, n$. The first summation stands for the edges leaving $X_i$, the second for the incoming edges, while $b_i$ represents a possible constant in- or outflow. To simplify the notation, we introduce a matrix $A$ with components given by

$$A_{ij} = k_{ji}, \quad i \neq j, \qquad A_{ii} = -\sum_{j \neq i} k_{ij}. \tag{2}$$

Then, (1) becomes

$$\dot{X}_i(t) = \sum_{j=1}^{n} A_{ij} X_j(t) + b_i, \quad i = 1, \ldots, n. \tag{3}$$

To reconstruct the network from time-series measurements, we have to estimate the reaction rates $k_{ij}$, i.e., the weights of the edges in the network. Due to (2), it is sufficient to estimate the matrix $A$. In [11], we experimented with a fast parameter estimation method, whose efficiency came from avoiding iterative solving of ODEs: the measurements were substituted directly into the ODEs and the derivatives were approximated with finite differences. An alternative and often better approach to obtain approximations for the time derivatives $\dot{X}_i(t_j)$ is to fit splines to the time series data $X_i(t_j)$. For each metabolite, we have 9 replicates of averaged metabolite concentrations measured per given time point. To obtain curves that represent the data faithfully, we require that the distance between the curves and the measurements is minimal and that at the same time the curves are smooth. To achieve this we fit P-splines, which are B-splines with a penalization for nonsmoothness [12]. The coefficient $\lambda$ of the penalty term can be chosen, e.g., using leave-one-out cross validation. From these splines, we evaluate the derivative estimates at time points $t_j$. These estimates are then used as entries in the matrix $\dot{X}$. In this formulation, the problem of network inference comes down to solving the set of equations given by

$$\dot{X} = A X. \tag{4}$$

Solving for the parameters directly would be fast, since it involves only matrix manipulations. However, it often results in over-fitting, since all possible edges are included in the modeled network. Another serious weakness of such a matrix (pseudo-)inversion approach is that we cannot control the positivity of the reaction rates. Although in [13] positive (negative) coefficients were interpreted as activation (inhibition) of the compounds, in many biological pathways negative coefficients are not allowed. Thus we take a more general approach that allows sparse networks, where one can exclude all irrelevant edges that are not contained in any biologically feasible model, and in which one can constrain the reaction rates to be positive, without substantially compromising computation time. To this end, we reformulate the equation as a minimization problem:

$$\arg\min_{A} \, \| \dot{X} - A X \|. \tag{5}$$

This alternative formulation allows inclusion of expert knowledge in a simple way. By (2), we put $A_{ji} = k_{ij} = 0$ when an edge from node $i$ to node $j$ cannot exist.

[Figure 1: graph with nodes X0 to X7 and edge weights k01, k12, k13, k24, k25, k36, k37.]

Figure 1. A putative graph for the quercetin glycosylation pathway, used as the minimum spanning tree for the networks in the simulations. This is an example of a graph with rooted tree structure.

TIME DEPENDENT PARAMETER ESTIMATION

A shortcoming of the model in the previous section is that it cannot capture the trends in enzyme concentrations, which are naturally time varying. To take the enzyme dynamics into account we extend the previous model in a straightforward fashion as follows:

• Scheme 1: First fit natural splines or B-splines to the data and evaluate estimates for the derivatives $\dot{X}_i$. Substitute these estimates and the measurement data $X_i(t_k)$ into the ODE system, obtaining a set of algebraic equations at each separate time point $t_k$. Solve these equations for the parameters $k_{ij}(t_k)$, treated as constants at each time point, to obtain a set of estimates. Finally, fit a function of choice to these estimates over the time range.

For example, a second order polynomial may describe the trend of the enzyme activity sufficiently well over a relatively short time. We remark that if the metabolic network in question has the structure of a rooted tree graph, the estimated parameters at each time point are unique. This is an advantage in terms of identifiability of the parameters. The glycosylation pathways for flavonoids such as quercetin and kaempferol are in fact expected to be of this type. We compare scheme 1 to an alternative standard method:

• Scheme 2: Iteratively solve the ODEs with varying kinetic rates $k_{ij}(t) = \alpha_{ij} t^2 + \beta_{ij} t + \gamma_{ij}$ (or another suitable function of choice), until the solutions $X_i(t)$ are sufficiently close to the measurements at points $t_k$. Typically this means that an objective function such as $\| \sum_k X_i(t_k) - \tilde{X}_i(t_k) \|$, where $\tilde{X}_i$ are the measurement vectors, is minimized.

[Figure 2: two panels showing log inference errors for Scheme 1 and Scheme 2.]

Figure 2. We have compared two different reconstruction schemes with respect to their errors in 100 simulations. By errors we mean here the sum of squared differences between the original kinetic rates used in the simulations and the reconstructed kinetic rates. Left: the (logarithmic) errors in the inferred networks using scheme 1 (proposed method) and scheme 2 (iterative method). Right: as on the left hand side, but with 10% uniformly sampled noise added to the sample data.

[Figure 3: flow diagram with the labels: metabolites y_i(t_k); mRNA data x_i(t_k); fit splines; evaluate y_i and y_i'; network model; ODE model; unknown parameters k_ij; least squares; k_ij(t_k); correlate. Example panels "ENZYME IN REACTION 7" and "GT 21", plotted against sample.]

Figure 3. Schematic view of the gene inference procedure.
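Scheme 1 of the time dependent parameter estimation can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the authors' implementation: the four-node tree in `EDGES`, the rate functions in `true_rates` and the Runge-Kutta helper `simulate` are hypothetical and serve only to produce synthetic data; a polynomial fit stands in for the P-splines, and clipping at zero is a crude stand-in for a properly constrained solver.

```python
import numpy as np

# Hypothetical rooted tree: edge (i, j) means X_i is converted to X_j.
EDGES = [(0, 1), (1, 2), (1, 3)]


def design_matrix(x, edges=EDGES):
    """Return M such that M @ k equals dX/dt for concentrations x."""
    m = np.zeros((len(x), len(edges)))
    for col, (i, j) in enumerate(edges):
        m[i, col] -= x[i]  # substrate i is consumed at rate k_ij * X_i
        m[j, col] += x[i]  # product j is formed at the same rate
    return m


def scheme1(t, x, edges=EDGES, smooth_deg=5, trend_deg=2):
    """Estimate k_ij(t_k) pointwise, then fit a polynomial trend per edge."""
    # Step 1: smooth each metabolite and differentiate the fitted curve.
    # Assumption: a polynomial stands in for the P-splines of the text.
    dx = np.empty_like(x)
    for i in range(x.shape[1]):
        coef = np.polyfit(t, x[:, i], smooth_deg)
        dx[:, i] = np.polyval(np.polyder(coef), t)
    # Step 2: one small linear system per time point.
    k = np.empty((len(t), len(edges)))
    for n in range(len(t)):
        sol = np.linalg.lstsq(design_matrix(x[n], edges), dx[n], rcond=None)[0]
        k[n] = np.clip(sol, 0.0, None)  # crude stand-in for a positivity constraint
    # Step 3: second-order trend for every edge, columns ordered like `edges`.
    trend = np.polyfit(t, k, trend_deg)
    return k, trend


def true_rates(t):
    """Illustrative quadratic rates, used only to generate synthetic data."""
    return np.array([0.5 + 0.1 * t, 0.3 + 0.05 * t**2, 0.2])


def simulate(t, x0, substeps=50):
    """Integrate the ODE model with classical Runge-Kutta steps between samples."""
    def f(s, x):
        return design_matrix(x) @ true_rates(s)

    xs = [np.asarray(x0, dtype=float)]
    for a, b in zip(t[:-1], t[1:]):
        x, h = xs[-1], (b - a) / substeps
        for m in range(substeps):
            s = a + m * h
            k1 = f(s, x)
            k2 = f(s + h / 2, x + h / 2 * k1)
            k3 = f(s + h / 2, x + h / 2 * k2)
            k4 = f(s + h, x + h * k3)
            x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        xs.append(x)
    return np.array(xs)


t = np.linspace(0.0, 1.0, 21)
x = simulate(t, [10.0, 2.0, 1.0, 1.0])
k_est, trend = scheme1(t, x)
k_true = np.array([true_rates(s) for s in t])
print("largest pointwise rate error:", np.max(np.abs(k_est - k_true)))
```

No ODE is solved during the estimation itself: each time point costs one small least squares problem, which is consistent with the speed advantage reported for scheme 1. For networks that are not trees, or for noisy data, the clipping step would presumably be replaced by a nonnegative least squares solver.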

We have compared these two parameter inference schemes using simulated data. In the simulation, to generate artificial data, we assigned pseudo random values to $\alpha_{ij}$, $\beta_{ij}$ and $\gamma_{ij}$ in a range such that the resulting ODE solutions have approximately the same range as the biological data used in the application. The networks used in the simulations were random modifications of the graph in Fig. 1, but with at least one cycle, to make the inference a bit more challenging. In the first set of experiments, we used noiseless data sampled from the simulation results. In the second set, we added ±10% uniformly distributed noise to these same samples. As can be seen from Fig. 2, the proposed method (scheme 1) gives on average the best results, although scheme 2 occasionally succeeds in finding the most accurate estimates when the data is noiseless. In computation time, scheme 1 was on average 700 times faster than the iterative scheme 2. The comparisons in Fig. 2 were done in a setting where neither initial values nor parameter constraints (except for positivity) were given to the solvers, and the parameters were estimated using global search.
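The ingredients of this comparison can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the text: the sampling range `[low, high)` for the coefficients is arbitrary, the ±10% noise is taken to be multiplicative (relative to each sample), and the natural logarithm is used for the error. All function names are hypothetical.

```python
import numpy as np


def random_quadratic_rates(rng, n_edges, t, low=0.0, high=1.0):
    """Draw alpha, beta, gamma per edge and evaluate alpha*t^2 + beta*t + gamma."""
    alpha, beta, gamma = rng.uniform(low, high, size=(3, n_edges))
    tt = np.asarray(t, dtype=float)[:, None]
    return alpha * tt**2 + beta * tt + gamma  # shape (len(t), n_edges)


def add_uniform_noise(rng, samples, level=0.10):
    """Perturb each sample by a relative error drawn uniformly from [-level, level]."""
    return samples * (1.0 + rng.uniform(-level, level, size=samples.shape))


def log_inference_error(k_true, k_est):
    """Logarithm of the sum of squared differences between the two rate sets."""
    return np.log(np.sum((np.asarray(k_true) - np.asarray(k_est)) ** 2))


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
t = np.linspace(0.0, 1.0, 5)
k_true = random_quadratic_rates(rng, n_edges=3, t=t)
# Stand-in for a reconstructed rate set: here the noise is applied directly
# to the rates, only to have something to score.
k_noisy = add_uniform_noise(rng, k_true)
print(log_inference_error(k_true, k_noisy))
```

In an actual experiment the noise would be added to the sampled metabolite concentrations, and `k_est` would come from one of the two reconstruction schemes.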

chronic diseases in humans [14]. In the experiments daily samples were extracted from the seedlings. The same time series sample material from seedlings were analyzed using liquid chromatography mass spectrometer for metabolite concentrations and on a mRNA microarray for the expression levels of glucosyl transferases (GTs). We use the heuristics that in a mRNA microarray obtained from time series samples, the expression levels sufficiently correlate with the actual protein concentration. Correlation of sample vectors captures the similarity of the (finite) derivatives and curvatures, while ignoring the average values. This is indeed what we want to measure, since the mean values of the expression levels and enzyme concentrations are not likely to be similar, because the units are not physically related. The proposed work flow for the GT inference is briefly as follows: 1. Given the time series metabolite concentration data, estimate the time dependent parameters using all biologically relevant networks. Select the network that gives the best fit to measurements with respect to residual or goodness of fit etc. Save the kinetic rates estimated on the best networks.

APPLICATION IN THE INFERENCE OF ACTIVE GENES

2. Compute correlations between the time series of mean expression levels of each GT and the kinetic rates.

As an application, we consider the inference of the genes behind the enzymatic reactions in metabolic pathways. As an example we take the quercetin glycosylation pathway occurring during the development of a tomato seedling. Quercetin glycosides are a subset of flavonoids, which are plant secondary metabolites naturally produced by plants. Flavonoids are actively studied besides for their important role in protecting the growing plants from external stress, also for their proposed beneficial effects on prevention of

3. Select those GTs whose dynamics correlate most with kinetic dynamics. For convenience, we have summarized this as a schematic diagram in Fig. 3. As an example, in Fig. 4, we see the expression levels of the three GTs that correlate most with the estimated

9

Gene 1

Gene 2

[3] D. Husmeier, R. Dybowski, and S. Roberts, Probabilistic modeling in bioinformatics and medical informatics, Springer, 2005.

Gene 3

100

150

50 80 40 100

60 30 40

20

50 20

6

7

8

9

time

10

6

7

8

9

time

6

7

8

9

time

Figure 4. The expression levels (dotted line) of three different glucosyl transferase genes and the estimated kinetic rate (continuous line) for a reaction that converts quercetin to quercetin-3-O-glucoside. The mean expression levels of gene 1 (leftmost frame) correlate best with the predicted enzymatic activity.

kinetic rates for a reaction that glycosylates quercetin to quercetin-3-O-glucoside. Although the units for the predicted enzyme activities and gene expression levels differ, we observe an almost identical shape in the leftmost frame of Fig. 4. To experimentally test whether the inferred genes actually encode the enzymes that glycosylate the flavonols, a set of selected genes is currently being cloned. As a computational validation, we tested whether substituting the data of the selected genes into the model results in a higher likelihood (of observing the measurements) than substituting the other, less correlated genes. In the simulations we ran a Markov chain Monte Carlo algorithm [15] to ensure a rich set of gene combinations and scalings of expression levels. We ordered the genes into a sequence according to their correlation with the predicted enzyme concentration levels. We took two sets of genes according to their order number in the sequence: 1, 2, ..., 10 and 11, 12, ..., 20. We tested whether the residuals corresponding to the simulations using the data of these two sets have equal means and variances. For the mean test we obtained a P-value of less than 0.00001 and for the variance test a P-value of less than 0.006. We may conclude that in the context of a dynamic kinetic reaction model, the gene set with high correlation is significantly more likely to have caused the observations.

ACKNOWLEDGEMENTS

This work results from a collaboration between plant biologists, statisticians and mathematicians, initiated by the Netherlands Consortium for Systems Biology (NCSB) and Centre for Biosystems Genomics (CBSG). Both the NCSB and CBSG are Centres of Excellence under the auspices of the Netherlands Genomics Initiative.

1. REFERENCES

[1] J. Wang, “Glycosyltransferases: key players involved in the modification of plant secondary metabolites,” Front. Biol. China, vol. 4, no. 1, pp. 39–46, 2009.
[2] T. Akutsu, S. Miyano, and S. Kuhara, “Inferring qualitative relations in genetic networks and metabolic pathways,” Bioinformatics, vol. 16, no. 8, pp. 727–734, 2000.
[4] N. Price and I. Shmulevich, “Biochemical and statistical network models for systems biology,” Curr Opin Biotechnol, vol. 18, no. 4, pp. 365–370, 2007.
[5] B. Palsson, Systems Biology: Simulation of Dynamic Network States, Cambridge University Press, 2011.
[6] E. Conrad and J. Tyson, “6. modeling molecular interaction networks with nonlinear ordinary differential equations,” in System Modeling in Cellular Biology, from concepts to nuts and bolts, Z. Szallasi, J. Stelling, and V. Periwal, Eds. 2010, pp. 97–123, The MIT Press.
[7] W. Chen, M. Niepel, and P. Sorger, “Classic and contemporary approaches to modeling biochemical reactions,” Genes & development, vol. 24, no. 17, pp. 1861–1875, 2010.
[8] E. Voit, S. Marino, and R. Lall, “Challenges for the identification of biological systems from in vivo time series data,” In Silico Biol., vol. 5, pp. 83–92, 2005.
[9] G. Craciun and C. Pantea, “Identifiability of chemical reaction networks,” J Math Chem, vol. 44, pp. 244–259, 2008.
[10] G. Goel, A Novel Framework for Metabolic Pathway Analysis, Ph.D. thesis, Wallace H. Coulter Dept. of Biomedical Engineering, Georgia Institute of Technology, Atlanta, December 2009.
[11] L. Astola, M. Groenenboom, V. Gomes Roldan, F. Eeuwijk, R. Hall, A. Bovy, and J. Molenaar, “Metabolic pathway inference from time series data: a non iterative approach,” in Lecture Notes in Bioinformatics. 2011, vol. 7036, pp. 97–108, Springer.
[12] P. Eilers and B. Marx, “Flexible smoothing with B-splines and penalties,” Statistical Science, vol. 11, no. 2, pp. 89–121, 1996.
[13] H. Schmidt, K.-H. Cho, and E. Jacobsen, “Identification of small scale biochemical networks based on general type system perturbations,” The FEBS Journal, vol. 272, pp. 2141–2151, 2005.
[14] A. Bovy, E. Schijlen, and R. Hall, “Metabolic engineering of flavonoids in tomato (Solanum lycopersicum): the potential for metabolomics,” Metabolomics, vol. 3, no. 3, pp. 399–412, 2007.
[15] D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing, vol. 2, Springer, 2007.


A NOVEL CENSORED SAMPLING PARADIGM FOR GENOMIC DATA CLASSIFICATION

Lori A. Dalton1 and Edward R. Dougherty1,2,3

1 Dept. of Electrical and Computer Engineering, Texas A&M University
2 Computational Biology Division, Translational Genomics Research Institute
3 Dept. of Bioinf. and Computational Biology, University of Texas M. D. Anderson Cancer Center

ABSTRACT

and mixed second-order moments for the true error and the resubstitution and leave-one-out estimators are available [1]. For linear discriminant analysis (LDA), exact joint distributions have been found for both resubstitution and leave-one-out in the univariate Gaussian model, and approximate joint distributions are available in the multivariate model with a common known covariance [2]. An approximate RMS has also been found via asymptotically exact analytic expressions [3]. A weakness of the classical approach is that the true underlying distribution is unknown in practice. We may hope to avoid assumptions completely and search for “distribution-free” bounds on performance, but in the few cases where such bounds are known they are too loose to be useful for the kind of small samples seen in biology. The following is an example of a distribution-free RMS bound for the leave-one-out error estimator with the discrete histogram rule and tie-breaking in the direction of class 0 [4]:

\[ \mathrm{RMS}(\hat{\varepsilon}_{\mathrm{loo}} \mid F) \le \sqrt{\frac{1 + 6/e}{n} + \frac{6}{\sqrt{\pi (n - 1)}}}, \tag{1} \]
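The looseness of the bound in (1) is easy to check numerically. The following minimal sketch evaluates the right-hand side for a few sample sizes; the function name `loo_rms_bound` is ours and is not taken from [4].

```python
import math

def loo_rms_bound(n):
    """Right-hand side of the distribution-free bound (1), for n >= 2."""
    first = (1 + 6 / math.e) / n                 # term decaying like 1/n
    second = 6 / math.sqrt(math.pi * (n - 1))    # term decaying like 1/sqrt(n)
    return math.sqrt(first + second)

for n in (20, 200, 2000):
    print(n, round(loo_rms_bound(n), 3))
```

Because the second term under the square root decays only like n^(-1/2), the bound itself shrinks roughly like n^(-1/4), which is why it stays far above the errors of practical interest at small n.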

In recent years, biomedicine has faced a flood of difficult and expensive small-sample phenotype discrimination problems. A host of classifiers have been proposed to discriminate between types of pathology, stages of disease and other diagnoses, which are typically designed from heuristic algorithms on small samples with very little known about performance. To give a concrete mathematical structure to the problem, recent work utilizes a Bayesian modeling framework to both optimize and analyze error estimator performance, facilitating the development of a new sample-conditioned mean-square-error (MSE) performance measure for error estimation, where uncertainty is relative to a family of feature-label distributions conditioned on the observed sample. Herein we discuss an application in censored sampling, where sample points are collected one at a time until the conditional MSE reaches a stopping criterion. Classifiers designed from each censored sample are guaranteed to meet the desired performance requirement within the assumed mathematical framework, making censored sampling an attractive and economical method for genomic applications.

where F is the feature-label distribution and n is the sample size. It is useless for small samples: for n = 200 this bound is 0.506. Accurate distribution-free small-sample error estimation is an illusion [5]. We are forced to proceed with analysis under distributional assumptions. Consider the RMS graphs in Figure 1 for a discrete model with b = 8 bins [6]. Performance is shown for a class of Zipf bin probabilities, which essentially follow a power law, each with different Bayes error. A Bayesian error estimator with flat priors [6], resubstitution and leave-one-out are shown, and in particular leave-one-out performs well below the bound offered by (1); even with n = 20 the worst case performance for the Zipf model is about 0.17. While the computation of an error estimator may be independent of the distribution, its performance is certainly not. For instance, leave-one-out in Figure 1 operates best with low Bayes errors, which is quite typical, so that its use with small sample sizes implicitly assumes low Bayes error, at least if one is assuming some degree of accuracy. Given the necessity of assumptions, why not state them outright and fully integrate them into the analysis? This

1. INTRODUCTION

In applications where sample data are abundant and cheap to acquire, we can partition points into training and testing subsets without degrading the quality of classification significantly. Biomedicine is a different story. The advent of microarrays and other genomic and proteomic technologies, coupled with the inherent difficulty in collecting biological sample points, presents a unique high-throughput small-sample setting where classifier error estimation accuracy becomes a critical issue because it is the primary measure of the scientific validity of a classifier model, and accurate estimation is no longer guaranteed. As a measure of validity, we focus on the MSE or its square root, the root-mean-square (RMS). The classical approach to error estimator analysis conditions on a fixed distribution and averages over the corresponding sampling distribution. Recent work has aimed at finding joint second order moments or even the joint density between the true and estimated errors. For multinomial discrimination, exact representations of both marginal


The Bayesian model admits that we do not know the underlying feature-label distribution perfectly, and instead quantifies our initial uncertainty in the true distribution through a prior. As we observe sample points, this uncertainty should converge to a certainty on the true distribution. More precisely, it has been proven in [7] that under mild regularity conditions, the posteriors converge to delta functions on the true parameters for two important models: discrete distributions with Dirichlet priors and arbitrary classifiers (henceforth referred to as the discrete model) and Gaussian distributions with arbitrary covariances, normal-inverse-Wishart priors and linear classifiers (the Gaussian model). More informative priors may help the posteriors converge faster, but as long as the prior does not exclude the true distribution as impossible, convergence is assured. Bayesian error estimation is defined to be the minimum mean-square error (MMSE) estimate of the true error, or equivalently the first moment of the true error conditioned on the observed sample [6]. It is a training-data-based estimator, with no data held out for testing. Closed-form representations are available in both the discrete and Gaussian models. Classical frequentist consistency also holds for Bayesian error estimators on fixed distributions in the parameterized family owing to the convergence of the posteriors in both the discrete and Gaussian models [7]. Practical considerations for Bayesian error estimation in microarray data analysis have been addressed in [8]. Priors may be calibrated with a method-of-moments approach using features from the microarray dataset that are discarded by feature selection, and Monte Carlo approximation code is also provided for Gaussian models with non-linear classifiers. Performance is often superior to classical error estimation schemes on real biological data.
The sample-conditioned MSE for a Bayesian error estimator is equivalent to the variance of the true error conditioned on the observed sample [7]. Again thanks to the convergence of the posteriors, it has been shown that the sample-conditioned MSE converges to zero almost surely in both the discrete and Gaussian models, where closed form expressions for the MSE are also available. Further, the exact MSE for arbitrary error estimators falls out naturally in the Bayesian model. That is, if $\hat{\varepsilon}_{\mathrm{BEE}}$ is a Bayesian error estimate evaluated from the observed sample, and $\hat{\varepsilon}_{\bullet}$ is a constant representing another error estimate, then the sample-conditioned MSE of $\hat{\varepsilon}_{\bullet}$ can be evaluated directly from that of the Bayesian error estimator:

[Figure 1 plot. Vertical axis: RMS deviation from true error; horizontal axis: Bayes error; curves: resub/plugin, loo, Bayes.]

Figure 1. Classical RMS with respect to Bayes error (discrete model, b = 8, Zipf distributions, c = 0.5, n = 20).

is precisely what is done in a recent Bayesian framework for classification [6]. Rather than conditioning on an unknown distribution, this mathematical framework is used to condition on the actual observed sample with uncertainty now relative to the unknown feature-label distribution. It facilitates both the design of estimators with desirable properties or optimal performance and the analysis of any error estimator. This work focuses on analysis through the recently introduced sample-conditioned RMS for arbitrary error estimators, which is a new and practical measure of error estimation performance [7]. For the first time, it is possible to report performance precisely for the actual observed data, trained classifier and computed error estimate in hand, not just loose bounds or average performance over random samples for a classification and error estimation rule pair. After a brief review of the Bayesian framework, we will cover a few examples of how the conditional MSE may be used in practice. A salient application is a new censored sampling strategy, where sample points are acquired one at a time until the estimated error of the classifier reaches a desired RMS.

2. REVIEW OF THE BAYESIAN FRAMEWORK

Bayesian frameworks parameterize the class-conditional distributions with parameters θ0 for class 0 and θ1 for class 1. Using either a non-informative approach or expert information, we assign “prior” distributions to the a priori probability of class 0, c, and the parameters of the class-conditional distributions, which are all assumed to be independent. An observed labeled sample, e.g. microarray data consisting of normalized log-ratios, is then used to update the priors to “posterior” distributions, which in turn imply a sample-conditioned distribution on the true error of the fixed classifier trained from the same given sample.
Priors quantify the information we have about the distribution before observing the data. We have the option of using non-informative or flat priors, as long as the posterior is a valid density function. Alternatively, informative priors can supplement the classification problem with additional information. This may be done to make the problem tractable or to improve performance with distributional information when the sample size is small.

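\[ \mathrm{MSE}(\hat{\varepsilon}_{\bullet} \mid S_n) = \mathrm{MSE}(\hat{\varepsilon}_{\mathrm{BEE}} \mid S_n) + (\hat{\varepsilon}_{\mathrm{BEE}} - \hat{\varepsilon}_{\bullet})^2. \tag{2} \]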
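As a worked illustration of (2), the sketch below takes the Bayesian error estimate and conditional RMS reported in Table 1 for the initial sample of six points and evaluates the conditional RMS of a second estimate. The value 0.25 for the second estimate is purely hypothetical and is not taken from the paper; the function name is ours.

```python
import math

def conditional_mse(other_estimate, bee, bee_mse):
    """Sample-conditioned MSE of `other_estimate`, computed with equation (2)."""
    return bee_mse + (bee - other_estimate) ** 2

# Values from Table 1 at sample size 6.
bee = 0.360119            # Bayesian error estimate
bee_mse = 0.243913 ** 2   # the table reports the RMS, so square it

# Hypothetical competing estimate (an assumption, for illustration only).
other = 0.25
other_rms = math.sqrt(conditional_mse(other, bee, bee_mse))
print(round(other_rms, 4))
```

The penalty term is a squared distance, so any estimate other than the Bayesian one can only increase the conditional MSE.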

Note that (2) makes the MSE optimality of Bayesian error estimation explicit: the squared term is non-negative and vanishes only when the other estimate coincides with the Bayesian one.

3. THE SAMPLE CONDITIONS UNCERTAINTY

Consider a typical scenario, where we are given a sample to train a classifier, and the same data is used to estimate the error of this classifier. A natural question arises: How close is the estimate to the actual true error? Whereas in a classical approach nothing can be said for the particular problem at hand because nothing is known given a single


[Figure residue. Legend entries: loo, Bayes and Devroye, with values RMS = 0.1103 and RMS = 0.0518; axis label: pmf; annotation: E[n] = 60.3273.]

[...] If $\mathrm{MSE}(\hat{\varepsilon} \mid S_n) > r^2$ or $\hat{\varepsilon} > e$, then we append a new point to the current sample without replacement. We then design a new classifier and check the conditional MSE and expected true error again. This is repeated until both $\mathrm{MSE}(\hat{\varepsilon} \mid S_n) \le r^2$ and $\hat{\varepsilon} \le e$, or until n = 100. The consistency of Bayesian error estimation guarantees that $\mathrm{MSE}(\hat{\varepsilon} \mid S_n)$ will eventually reach the stopping criterion (assuming the true distributions are truly Gaussian), and if $\hat{\varepsilon}$ does not fall below e after a large amount of sampling, then we assume that we cannot achieve an acceptable classification error for the problem at hand.
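The stopping rule described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `estimate` is a hypothetical caller-supplied function that trains a classifier on the current sample and returns the error estimate together with its sample-conditioned MSE, and `toy_estimate` is a stand-in with a conditional MSE that simply shrinks with sample size (it is not a Bayesian error estimator).

```python
import random

def censored_sample(pool, estimate, r, e, n_init=6, n_max=100, seed=0):
    """Grow a sample one point at a time until the stopping rule holds.

    pool     : sequence of available labelled points
    estimate : hypothetical callable, sample -> (error_estimate, conditional_mse)
    r, e     : thresholds on the conditional RMS and on the error estimate
    Returns (sample, error_estimate, conditional_mse, criteria_met).
    """
    rng = random.Random(seed)
    order = list(range(len(pool)))
    rng.shuffle(order)                       # draw without replacement
    n_max = min(n_max, len(pool))
    sample = [pool[i] for i in order[:n_init]]
    while True:
        err, mse = estimate(sample)
        if mse <= r ** 2 and err <= e:
            return sample, err, mse, True    # both criteria satisfied
        if len(sample) >= n_max:
            return sample, err, mse, False   # size cap reached first
        sample.append(pool[order[len(sample)]])

def toy_estimate(sample):
    # Stand-in only: fixed error estimate, MSE decaying like 1/(n + 1).
    return 0.2, 0.25 / (len(sample) + 1)

pool = list(range(200))
sample, err, mse, met = censored_sample(pool, toy_estimate, r=0.11, e=0.25)
print(len(sample), met)
```

Passing the estimator in as a function keeps the loop independent of the model, so the same skeleton applies whether the conditional MSE comes from the discrete or the Gaussian model.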

4. CENSORED SAMPLING

We finally propose an important application in censored sampling, where sample points are collected one at a time until the conditional MSE reaches an acceptable level. Since the conditional MSE converges (a.s.) to zero with increasing sample size, censored sampling may be applied with any desired threshold. Figure 3 shows a probability density of the censored sample size for the Bayesian error estimator with correct priors in a Gaussian model with 2 features and LDA classification [7]. The mean of the distribution is indicated with a vertical dotted line. The simulation generates 1000 distributions drawn from “low-information” priors and 1000 censored samples from each distribution. Each censored sample is initialized with 6 points in each class, and points are added until the conditional RMS is at most about 0.0358, corresponding to the RMS in a parallel experiment with sample size fixed at n = 60.


Table 1. Censored sampling example for real breast cancer data experiments with flat priors.

Sample size   Label     Sample point   $\hat{\varepsilon}$   $\mathrm{RMS}(\hat{\varepsilon} \mid S_n)$
Initial sample:
1             class 0    0.309
2             class 0   -0.127
3             class 0    0.153
4             class 0    0.473
5             class 1   -0.485
6             class 1    0.160         0.360119              0.243913
Appended sample points:
7             class 0   -0.114         0.304501              0.218271
8             class 0    0.357         0.253987              0.196361
9             class 1   -0.399         0.293964              0.219222
10            class 1   -0.718         0.190709              0.120224
11            class 0    0.391         0.162752              0.108650
12            class 1   -0.909         0.141007              0.095877
13            class 0    0.355         0.120434              0.086282
14            class 0    0.385         0.104277              0.078103
15            class 1   -0.273         0.114879              0.076103
...           ...        ...           ...                   ...
36            class 1   -0.250         0.152377              0.050352
37            class 0    0.038         0.149821              0.049262
Approximate true error: 0.197674

Table 1 provides a detailed example of the procedure from a single iteration [7]. We list the actual initial sample, along with the initial Bayesian error estimate and conditional MSE. These are followed by the sample points added in each repetition of the procedure, along with the current Bayesian error estimate and conditional MSE computed as each point is added. We observe that the expected true error of the classifier tends to decrease, while the conditional MSE (variance of the true error) decreases almost monotonically. In this example, the stopping criteria are satisfied at a sample size of 37. The approximate true error of the classifier using (holdout) points remaining in the data set after censored sampling is also shown.

6. CONCLUSION

A major advantage of Bayesian error estimation over classical classifier error estimation schemes is that it articulates a mathematical model to generate an estimator that is theoretically optimal in the mean-square sense. Moreover, other benefits emerge: the Bayesian MMSE error estimator is theoretically unbiased and the priors can be tailored to target certain properties, for example, to obtain best performance in moderately difficult classification problems with Bayes errors in the mid range [6]. In prior work, the RMS of a non-hold-out error estimator has always been considered by averaging over the sampling distribution and nothing could be said about performance for a particular sample. Hence, Bayesian modeling boasts another practical advantage in that it naturally gives rise to a new and practical measure of performance, the RMS of an error estimate conditioned on the actual observed sample, with important applications in censored sampling.

7. REFERENCES

[1] U. Braga-Neto and E. R. Dougherty, “Exact correlation between actual and estimated errors in discrete classification,” Pattern Recogn. Letters, vol. 31, no. 5, pp. 407–412, April 2010.
[2] A. Zollanvari, U. M. Braga-Neto, and E. R. Dougherty, “On the joint sampling distribution between the actual classification error and the resubstitution and leave-one-out error estimators for linear classifiers,” IEEE Trans. Inf. Theory, vol. 56, no. 2, pp. 784–804, 2010.
[3] A. Zollanvari, U. M. Braga-Neto, and E. R. Dougherty, “Analytic study of performance of error estimators for linear discriminant analysis,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4238–4255, Sep. 2011.
[4] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[5] E. R. Dougherty, A. Zollanvari, and U. M. Braga-Neto, “The illusion of distribution-free small-sample classification in genomics,” Current Genomics, vol. 12, no. 5, pp. 333–341, August 2011.
[6] L. A. Dalton and E. R. Dougherty, “Bayesian minimum mean-square error estimation for classification error–Part I: Definition and the Bayesian MMSE error estimator for discrete classification,” IEEE Trans. Signal Process., vol. 59, no. 1, pp. 115–129, Jan. 2011.
[7] L. A. Dalton and E. R. Dougherty, “Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error–Part II: Consistency and performance analysis,” IEEE Trans. Signal Process., vol. 60, no. 5, pp. 2588–2603, May 2012.
[8] L. A. Dalton and E. R. Dougherty, “Application of the Bayesian MMSE estimator for classification error to gene expression microarray data,” Bioinf., vol. 27, no. 13, pp. 1822–1831, 2011.
[9] M. J. van de Vijver, Y. D. He, L. J. van ’t Veer, H. Dai, A. A. M. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards, “A gene-expression signature as a predictor of survival in breast cancer,” New Engl. J. Med., vol. 347, no. 25, pp. 1999–2009, Dec. 2002.


PARAMETER INFERENCE IN MECHANISTIC MODELS OF CELLULAR REGULATION AND SIGNALLING PATHWAYS USING GRADIENT MATCHING

Frank Dondelinger1, Simon Rogers2, Maurizio Filippone2, Roberta Cretella3, Tim M. Palmer3, Robert W. Smith4, Andrew J. Millar4 and Dirk Husmeier1

1 School of Mathematics and Statistics, University of Glasgow
2 School of Computing Science, University of Glasgow
3 Institute of Cardiovascular and Medical Sciences, University of Glasgow
University of Glasgow, Glasgow G12 8QQ, Scotland
4 SynthSys, University of Edinburgh, Edinburgh EH9 3JD

ABSTRACT

A challenging problem in systems biology is parameter inference in mechanistic models of signalling pathways. In the present article, we investigate an approach based on gradient matching and nonparametric Bayesian modelling with Gaussian processes. We evaluate the method on two biological systems, related to the regulation of PIF4/5 in Arabidopsis thaliana, and the JAK/STAT signal transduction pathway.

1. INTRODUCTION

A central problem in computational systems biology is the formulation of a consistent mechanistic model description of signalling pathways and molecular processes of cellular regulation. While there have been several attempts at describing these processes at a qualitative level, proper statistical inference is a much more challenging problem. The usual approach is based on minimising the discrepancy between measured data, e.g. related to the abundance profiles of some molecular components, and their simulated values. This discrepancy is related to some metric, which can be shown to be defined by the assumed noise model in terms of a maximum likelihood approach to inference. For instance, minimising the root mean square deviation between measured and simulated data is equivalent to maximizing the likelihood under the assumption of white additive Gaussian noise. For further details, see [1]. The practical difficulties with this approach are twofold. First, the likelihood landscape is typically rugged and multimodal, which calls for some form of annealing scheme.
Second, each parameter adaptation requires a numerical solution of the differential equations (ODEs), which is computationally expensive and hence limits the number of maximum likelihood (ML) optimization steps or Markov chain Monte Carlo (MCMC) sampling steps that can be carried out at reasonable computational costs. A potential solution to this problem is the approach of gradient matching. The idea is that rather than aiming to explicitly solve the ODEs, we seek to minimise the discrepancy between the gradients inferred from the slope

of the interpolant and those predicted from the coupled system of ODEs. The former is defined by the regression model and depends on some smoothness or regularization parameters. The latter is defined by the system of coupled ODEs and depends on its parameters, whose determination is the ultimate objective of inference. Earlier approaches pursued a two-step approach, in which first the interpolant was inferred, and then the ODE parameters were inferred by minimizing the discrepancy between the time derivatives predicted from the ODEs and those predicted from the slope of the interpolant [2]. The disadvantage of this approach is that the result of parameter inference critically hinges on the quality of the interpolation scheme, which once completed is kept fixed. A better approach, first suggested in [3], is to allow for some feedback mechanism by which the system of ODEs can act back on the interpolation scheme. For instance, the system of ODEs might only match the slopes of the interpolant for implausible or a priori unlikely parameter configurations, in which case the interpolant should be adjusted. Hence, in order to be viable, the mismatch between the time derivative predicted from the slope of the interpolant and the one obtained from the ODEs should be systematically reduced in an iterative loop, whereby both the ODEs and the smoothness parameters of the regression are adapted simultaneously. A variation of this approach was presented in [4], where the authors employed parallel tempering using the data-smoothing hyperparameter, which improves sampling efficiency.

2. METHOD

Our approach is based on non-parametric Bayesian regression with Gaussian processes, following Calderhead et al. in [5]. Their approach draws on the fact that the derivative of a Gaussian process is a Gaussian process again.
This renders Gaussian process regression a natural tool for the double objective of nonlinear data interpolation and gradient matching along the lines outlined in the previous section. The result is a hierarchical Bayesian model which allows for parameter inference. We will have a system


Figure 1. Outline of the approach of gradient matching. Left panel: the slope of the interpolant informs the inference of the ODE parameters θ. Centre panel: the derivative dx/dt predicted based on the updated parameters from the ODEs f(x(t), θ) is fed back to the interpolation scheme. Right panel: by iterating the above two steps, the mismatch between the time derivative predicted from the slope of the interpolant and the one obtained from the ODEs is systematically reduced, thereby adapting both the ODE parameters and the smoothness parameters of the regression simultaneously.

[Figure 2 plots. Two rows of three panels, titled "Noise 0 Gap 1", "Noise 0.1 Gap 1" and "Noise 0.2 Gap 1". Horizontal axis: Timepoints; vertical axis: PIF4/5; legend: True, Simulated, Sampled.]

Figure 2. PIF4/5 expression sampled every hour, with varying observational noise. We show the true (noiseless) expression values, versus expression values simulated from the ODE system using mean sampled parameters. We also show the values of PIF4/5 sampled from the GP model. Top: Calderhead et al. model [5]. Bottom: Improved model. Note that the dashed line in the top panel is out of scale.

of coupled ODEs, which predict the time derivative; the ODE parameters are adapted so as to minimize the deviation from the time derivatives predicted with the Gaussian process. A novel aspect of our approach, which constitutes an important improvement on the method proposed in [5], is the fact that the smoothness hyper-parameters of the Gaussian process are adapted simultaneously along with the parameters of the ODEs. A mathematical description of this approach is beyond the scope and page limit of this paper and will be presented elsewhere. A schematic representation is depicted in Figure 1.
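The basic idea of gradient matching can be illustrated with a deliberately simplified two-step sketch. It is not the hierarchical Gaussian process model of this paper: slopes are taken by central finite differences on dense, noise-free data instead of from a Gaussian process, and only two parameters are estimated. The ODE has the structure of the PIF4/5 model of Section 3.1, dx/dt = s g(t) - d x, with Kd and h held fixed at the values used for the simulated data in Section 4. The TOC1 profile `toc1` is a hypothetical sinusoid, not the output of the Locke model, and all function names are ours.

```python
import math

KD, H = 0.46, 2.0           # held fixed (values from the simulation study)
S_TRUE, D_TRUE = 1.0, 1.0   # parameters to be recovered

def toc1(t):
    # Hypothetical 24 h oscillation; an assumption for illustration only.
    return 0.5 + 0.4 * math.sin(2.0 * math.pi * t / 24.0)

def repression(t):
    # Hill-type repression term Kd^h / (Kd^h + [TOC1]^h).
    return KD ** H / (KD ** H + toc1(t) ** H)

def rhs(t, x, s, d):
    return s * repression(t) - d * x

def simulate(s, d, x0, t_end, dt):
    """Classical fourth-order Runge-Kutta integration of the ODE."""
    ts, xs, x = [0.0], [x0], x0
    for i in range(int(round(t_end / dt))):
        t = i * dt
        k1 = rhs(t, x, s, d)
        k2 = rhs(t + dt / 2, x + dt * k1 / 2, s, d)
        k3 = rhs(t + dt / 2, x + dt * k2 / 2, s, d)
        k4 = rhs(t + dt, x + dt * k3, s, d)
        x += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        ts.append((i + 1) * dt)
        xs.append(x)
    return ts, xs

def gradient_match(ts, xs):
    """Step 1: slopes from the data. Step 2: least squares for (s, d)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for i in range(1, len(ts) - 1):
        slope = (xs[i + 1] - xs[i - 1]) / (ts[i + 1] - ts[i - 1])
        g, x = repression(ts[i]), xs[i]
        # Model: slope = s * g - d * x, i.e. regressors (g, -x).
        a11 += g * g
        a12 += -g * x
        a22 += x * x
        b1 += g * slope
        b2 += -x * slope
    det = a11 * a22 - a12 * a12
    s_hat = (a22 * b1 - a12 * b2) / det
    d_hat = (a11 * b2 - a12 * b1) / det
    return s_hat, d_hat

ts, xs = simulate(S_TRUE, D_TRUE, x0=0.0, t_end=24.0, dt=0.1)
s_hat, d_hat = gradient_match(ts, xs)
print(round(s_hat, 2), round(d_hat, 2))
```

No ODE solve is needed inside the estimation step, which is the computational appeal of gradient matching. With sparse or noisy data the finite-difference slopes degrade quickly, which is the motivation for a probabilistic interpolant whose smoothness is adapted jointly with the ODE parameters.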

3. DATA

We test our approach on two biological systems; gene regulation in the circadian clock of Arabidopsis thaliana, and receptor signal transduction in the JAK/STAT pathway.

3.1. The PIF4/5 model

We apply our GP parameter inference method to a model for gene regulation of genes PIF4 and PIF5 by TOC1 in the circadian clock gene regulatory network of Arabidopsis thaliana. The overall network is represented by the Locke 2-loop model [6], with fixed parameters set following [7]. Only the parameters involved in regulation of PIF4 and PIF5 are inferred. As the expression profiles are very similar, we simplify the model to represent genes PIF4 and PIF5 as a combined gene PIF4/5. We are interested in the promoter strength s, the rate constant Kd and Hill coefficient h of the regulation by TOC1, and the degradation rate d of the PIF4/5 mRNA. The regulation is represented by the following ODE:

\[ \frac{d[\mathrm{PIF4/5}]}{dt} = s \cdot \frac{K_d^h}{K_d^h + [\mathrm{TOC1}]^h} - d \cdot [\mathrm{PIF4/5}] \tag{1} \]

where [PIF4/5] and [TOC1] represent the concentration of PIF4/5 and TOC1, respectively, and t represents time.

3.2. The JAK/STAT pathway

We analyse a model for interleukin-6 signalling (IL-6) in vascular endothelial cells. IL-6 binds to a receptor on the


[Figure 3 plots. Four rows, one per species: R, SOCS3.R*, 2-STAT3*_N, STAT3.R*. First panel of each row: Concentration against Time. Remaining panels: Density against Parameter Value, for k1f and k14 (R), k11 and k10b (SOCS3.R*), k6f and k8 (2-STAT3*_N), k3 and k13 (STAT3.R*). Panel annotations as extracted: "Param. Error: 0.005 0.001 0.138 0.001 0"; "Param. Error: 0 0.002 0 0 0.001 0 0"; "Param. Error: 0.007 0.062 0.004 0"; "Param. Error: 0.001 0.001 0 0.022".]

Figure 3. JAK/STAT pathway results. Row 1: R. Row 2: SOCS3.R*. Row 3: 2STAT3*_N. Row 4: STAT3.R*. See (2) for the equations. True (solid line) and inferred (points with error bars) concentrations for each species. The inferred values were obtained by drawing parameter samples from the posterior and forward-simulating the ODE system with these parameters. The error bars show one standard deviation. Histograms show the distribution of sampled parameters, the dashed line represents the gamma prior on the parameters in the hierarchical Bayesian model, and the horizontal line indicates the true value of the parameter.

plasma membrane, activating the JAK/STAT pathway [8]. The receptor is phosphorylated, creating docking sites for signalling molecules like STAT3. STAT3 binds to the phosphorylated receptor and is phosphorylated itself. Phosphorylated STAT3 molecules are released from the receptor, dimerize and then migrate to the nucleus to trigger mRNA transcription of target proteins like SOCS3. SOCS3 acts as a feedback mechanism for the signalling pathway: it binds to active receptors to prevent STAT3 activation and to provide a signal termination. The model we consider is a complex system comprising 13 species and 19 parameters. The dynamics of the system are described by mass-action kinetics, with non-linear interactions among species. Under the assumption of full observation of all species, we can decompose the system into 13 subsystems, one per species. This simplifies inference, and allows us to investigate the local identifiability of parameters in this model. For space reasons, we only reproduce the subset of equations consisting of the species in Figure 3 below; the full system will be presented in a future paper.

\[
\begin{aligned}
\frac{d[\mathrm{R}]}{dt} &= -k_{1f}[\mathrm{R}] + k_{1b}[\mathrm{R}^*] + k_{11}[\mathrm{SOCS3.R}^*] + k_{14}[\mathrm{SOCS3.STAT3.R}^*] \\
\frac{d[\mathrm{SOCS3.R}^*]}{dt} &= -k_{11}[\mathrm{SOCS3.R}^*] + k_{10f}[\mathrm{SOCS3}][\mathrm{R}^*] - k_{10b}[\mathrm{SOCS3.R}^*] \\
\frac{d[\mathrm{2STAT3}^*_N]}{dt} &= -k_{6f}[\mathrm{2STAT3}^*_N][\mathrm{P300}] + k_{6b}[\mathrm{2STAT3}^*_N.\mathrm{P300}] + k_{5}[\mathrm{2STAT3}^*_C] - k_{7}[\mathrm{2STAT3}^*_N] + k_{8}[\mathrm{2STAT3}^*_N.\mathrm{P300}] \\
\frac{d[\mathrm{STAT3.R}^*]}{dt} &= k_{2f}[\mathrm{STAT3}][\mathrm{R}^*] - k_{2b}[\mathrm{STAT3.R}^*] - k_{13}[\mathrm{SOCS3}][\mathrm{STAT3.R}^*] - k_{3}[\mathrm{STAT3.R}^*]
\end{aligned}
\tag{2}
\]
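The four mass-action equations in (2) translate directly into a right-hand-side function, which is the quantity a gradient matching scheme compares against the interpolant slopes for each species. The sketch below is only a transcription of (2) as printed; the dictionary keys used for species and parameters are our own naming convention, not part of the model definition.

```python
def jakstat_subset_rhs(c, k):
    """Time derivatives of the four species in (2).

    c : dict of concentrations keyed by species name (hypothetical keys)
    k : dict of kinetic parameters keyed by name, e.g. "k1f"
    """
    d_r = (-k["k1f"] * c["R"] + k["k1b"] * c["R*"]
           + k["k11"] * c["SOCS3.R*"] + k["k14"] * c["SOCS3.STAT3.R*"])
    d_socs3_r = (-k["k11"] * c["SOCS3.R*"]
                 + k["k10f"] * c["SOCS3"] * c["R*"]
                 - k["k10b"] * c["SOCS3.R*"])
    d_stat3_n = (-k["k6f"] * c["2STAT3*_N"] * c["P300"]
                 + k["k6b"] * c["2STAT3*_N.P300"]
                 + k["k5"] * c["2STAT3*_C"]
                 - k["k7"] * c["2STAT3*_N"]
                 + k["k8"] * c["2STAT3*_N.P300"])
    d_stat3_r = (k["k2f"] * c["STAT3"] * c["R*"]
                 - k["k2b"] * c["STAT3.R*"]
                 - k["k13"] * c["SOCS3"] * c["STAT3.R*"]
                 - k["k3"] * c["STAT3.R*"])
    return {"R": d_r, "SOCS3.R*": d_socs3_r,
            "2STAT3*_N": d_stat3_n, "STAT3.R*": d_stat3_r}

# Sanity check with every concentration and rate set to one.
species = ["R", "R*", "SOCS3", "SOCS3.R*", "SOCS3.STAT3.R*", "STAT3",
           "STAT3.R*", "P300", "2STAT3*_N", "2STAT3*_C", "2STAT3*_N.P300"]
params = ["k1f", "k1b", "k11", "k14", "k10f", "k10b", "k6f", "k6b",
          "k5", "k7", "k8", "k2f", "k2b", "k13", "k3"]
unit_rates = jakstat_subset_rhs({s: 1.0 for s in species},
                                {p: 1.0 for p in params})
print(unit_rates)
```

Each derivative depends only on the parameters appearing in its own equation, which is what allows the system to be decomposed into per-species subsystems when all species are observed.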

Here the square brackets, [.], indicate concentrations, and the $k_i$ are kinetic parameters.

4. RESULTS

We simulated data from the PIF4/5 model using the parameters (s = 1, Kd = 0.46, h = 2, d = 1), adding observation noise with standard deviation in {0, 0.1, 0.2}. The parameter inference was done by sampling from the posterior using MCMC with the model from [5], as well as using MCMC with the improved model from Section 2. The time period of the measurements is 24 hours, and the interval (gap) between observed points is 1 hour, resulting in 25 datapoints. Figure 2 shows the noiseless concentrations sampled from the GP model, and simulated PIF4/5 concentrations from the true parameters and from the sampled parameters, for MCMC with the Calderhead et al. model in [5], and MCMC with the improved model. For the JAK/STAT pathway, we generated simulation data for 600 timepoints with parameter values that gave realistic behaviour for the different species. The data was sampled at intervals of 60 seconds from time zero, making 11 timepoints in total. Due to space restrictions, we only present results for a subset of species. Figure 3 shows the results for the inactive receptor R, for the SOCS3/R* complex (where R* is the activated receptor), for the activated 2-STAT3 complex in the nucleus and for the STAT3/R* complex.

5. DISCUSSION

Our results demonstrate that gradient matching is a promising approach for parameter estimation in ODE systems. We have demonstrated that our improvement on the method described in [5] produces better results in the PIF4/5 system (Figure 2) in the presence of observation noise. Our application to the JAK/STAT signalling pathway shows that the approach we have taken is promising, but that some challenges remain. It also allows us to draw some inferences about the properties of this system.
For some species, such as R and SOCS3.R* in rows 1 and 2 of Figure 3, we obtain very good predictions for the concentrations of the species, as well as giving a good estimate for some of the parameters, such as k1f and k11. However, we can see that there is a lot of uncertainty about the inferred parameters, even though the species concentrations are predicted quite well. This points to a problem with lack of identifiability in parameter space, related to ridges in the likelihood. For example, if the rate limiting chemical kinetics depend on the ratio of two kinetic constants, then the confidence or credible intervals of the individual parameters may be large without that being reflected by the prediction uncertainty, as long as the posterior distribution of the ratio is peaked. We notice that for species 2-STAT3* N (row 3) in Figure 3, the posterior probability mass for the parameters is in the tail of the prior distribution. This implies two things. First, the data are informative with respect to the inference of some parameters. Second, the prior has not been chosen very well for this example, and should be chosen more appropriately. Finally, the bottom row of Figure 3 shows a mismatch between the true and predicted signal of STAT3.R*. This points to an intrinsic difficulty with

the gradient matching approach. The mismatch is mainly due to a short transient region around time point 100. In the following time segment the gradient is well matched, whereas the signal itself shows a strong deviation. This indicates that in scenarios of this form, the likelihood landscape for the explicit solution of the ODEs differs systematically from the one obtained with gradient matching. In conclusion, our gradient matching approach demonstrates good predictions for realistic biological systems, and promises to become a useful tool for parameter estimation in system biology. However, some challenges related to unidentifiable parameters and non-stationary signals still remain. 6. ACKNOWLEDGEMENTS This work was funded by Bridging-the-Gap EPSRC grant 59229/1. R.W.S. was supported by BBSRC (BB/F59011/1, BB/F005237/1). SynthSys is a Centre for Integrative and Systems Biology partly supported by BBSRC and EPSRC (BB/G019621). 7. REFERENCES [1] N. D. Lawrence, M. Girolami, M. Rattray, and G. Sanguinetti, Eds., Learning and Inference in Computational Systems Biology, MIT Press, 2010. [2] A.A. Poyton, M.S. Varziri, K.B. McAuley, P.J. McLellan, and J.O. Ramsay, “Parameter estimation in continuous-time dynamic models using principal differential analysis,” Computers & chemical engineering, vol. 30, no. 4, pp. 698–708, 2006. [3] J.O. Ramsay, G. Hooker, D. Campbell, and J. Cao, “Parameter estimation for differential equations: a generalized smoothing approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 69, no. 5, pp. 741–796, 2007. [4] D. Campbell and R.J. Steele, “Smooth functional tempering for nonlinear differential equation models,” Statistics and Computing, pp. 1–15, 2012. [5] B. Calderhead, M. Girolami, and N.D. Lawrence, “Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes,” Neural Information Processing Systems (NIPS), vol. 22, 2008. [6] J.C.W. Locke, M.M. Southern, L. 
Kozma-Bognar, V. Hibberd, P.E. Brown, M.S. Turner, and A.J. Millar, “Extension of a genetic network model by iterative experimentation and mathematical analysis,” Molecular Systems Biology, vol. 1, pp. (online), 2005. [7] A. Pokhilko, S.K. Hodge, K. Stratford, K. Knox, K.D. Edwards, A.W. Thomson, T. Mizuno, and A.J. Millar, “Data assimilation constrains new connections and components in a complex, eukaryotic circadian clock model,” Molecular systems biology, vol. 6, no. 1, 2010. [8] P.C. Heinrich, I. Behrmann, G. M¨uller-Newen, F. Schaper, and L. Graeve, “Interleukin-6-type cytokine signalling through the gp130/Jak/STAT pathway.,” Biochemical Journal, vol. 334, no. Pt 2, pp. 297, 1998.


DOSE-DEPENDENT DRUG SYNERGISM IN FLUX BALANCE ANALYSIS

Giuseppe Facchetti, Claudio Altafini
SISSA, Via Bonomea 265, 34136 Trieste, Italy

ABSTRACT

The investigation of the synergistic effects that arise from combining drugs is one of the new frontiers in the discovery of new therapies. However, the intrinsic combinatorial complexity of the problem has so far hampered systematic theoretical and experimental studies. For these reasons, we have developed a novel in silico method, based on Flux Balance Analysis and metabolic networks, that finds the drug combinations whose synergistic effect guarantees the inhibition of a given reaction (efficacy) with minimal impact on the whole metabolism (selectivity). For a more realistic description, partial inhibition (i.e. drug dosage) has also been included in the model.

1. INTRODUCTION

A possible alternative to the challenging discovery of new drugs is to make use of the unexploited properties of already available drugs, since wide knowledge about both their therapeutic and toxic effects has already been acquired during the studies for their approval. In this perspective, a natural approach is to try to combine them in multiple drug therapies [1, 2]. However, experimental investigations of multicomponent drugs have so far been quite limited: the major obstacles to this approach are the high number of possible combinations and our limited understanding of the complex mode of action of a multicomponent treatment.

The aim of this paper is to describe a novel algorithm which, through the widely used Flux Balance Analysis (FBA) formalism, takes advantage of the knowledge provided by genome-scale metabolic networks and exploits the possible synergisms among the already available drugs. Namely, we consider the search for the optimal combination of drugs capable, through a synergistic effect due to the network topology, of inhibiting a given reaction (i.e. a putative target for a disease) while inducing the minimal perturbation on the rest of the network. Indeed, the selectivity of the therapy is one of the most important aspects of any drug discovery project.

In the literature, drugs are often treated as a knockout of the gene which codes for the enzyme they target [3]: indeed, this ON/OFF simplification (i.e. the use of Boolean quantities) allows the application of duality theory inside the FBA framework [4]. However, it is more plausible to assume that a drug acting on an enzyme leads to a partial loss of functionality of the latter, and hence to a partial inhibition of the corresponding reaction. In this perspective, a synergistic effect is often characterized by the so-called drug-drug interaction surface [5], where the inhibitory effect is drawn as a function of the different drug dosages. For these reasons, our method provides a more realistic description of the partial inhibition induced by the drugs while still remaining within the framework of FBA.

2. METHODS

2.1. Flux Balance Analysis

FBA is a linear constraint-based framework for stoichiometric models of metabolic networks; such a network is described by the stoichiometric matrix $S = (s_{i,j})$, where $s_{i,j}$ represents the stoichiometric coefficient of the $i$-th metabolite in the $j$-th reaction (with $i = 1, \dots, N_m$, $j = 1, \dots, N_r$), and by the reaction fluxes denoted by the vector $v \in \mathbb{R}^{N_r}$. Because their dynamics are much faster than those of gene regulation, metabolic processes are assumed to be at steady state, which corresponds to assuming

$$S v = 0. \qquad (1)$$

Thermodynamical constraints and availability of nutrients impose further finite upper bounds on the fluxes:

$$0 \le v_i \le U_i \quad \forall i = 1, \dots, N_r. \qquad (2)$$

The constraints (1) and (2) generate a convex and bounded set $W$. The vector $v \in W$ of the metabolic fluxes is obtained through the optimization of a certain cost functional of $v$. For unperturbed networks, the production of the macromolecular building blocks for the biomass is often maximized (i.e. the cost functional is the growth rate) [6]: we denote by $v^{ut}$ (ut = “untreated”) the reaction fluxes obtained after this optimization.
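As an illustration of constraints (1) and (2), the following minimal sketch solves an FBA problem with a generic LP solver. The four-reaction network, its bounds and the choice of reaction 2 as "biomass" are hypothetical and serve only to show the structure of the linear program; they are not taken from the E. coli model used later.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: metabolites A, B; reactions
# r0: -> A (uptake), r1: A -> B, r2: B -> (biomass), r3: A -> (by-product)
S = np.array([
    [1.0, -1.0,  0.0, -1.0],   # balance of A
    [0.0,  1.0, -1.0,  0.0],   # balance of B
])
U = np.array([10.0, 8.0, 100.0, 100.0])   # upper bounds U_i of Eq. (2)

# Maximising the biomass flux v_2 is the same as minimising -v_2.
cost = np.zeros(4)
cost[2] = -1.0

res = linprog(cost, A_eq=S, b_eq=np.zeros(2),
              bounds=[(0.0, u) for u in U], method="highs")
v_ut = res.x   # one optimal "untreated" flux vector (it need not be unique)
print("growth rate:", v_ut[2])
```

In this toy case the biomass flux is limited by the bound on r1, so the optimal growth rate is 8; the split between r1 and r3 is not unique, which is a small-scale instance of the degeneracy typical of FBA solutions.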

2.2. The optimal drug combination problem

The aim of the study is to find the most selective combination of drugs. In particular, we suppose that we have a metabolic network and a set of $N_d$ drugs which inhibit some reactions of this network (the set of targets of the $k$-th drug is denoted by $T_k$). We want to modulate the flux of a certain reaction (denoted by $v_{mod}$) through a combination of these drugs, inducing the minimal effect on the rest of the network. We assume that a drug induces the same inhibition on all of its targets, according to the dosage used. Therefore, the amount of inhibition by the $k$-th drug on its target reaction $v_i$, $i \in T_k$, can be modelled by the linear constraint

$$v_i \le U_i h_k,$$

where, in order to model a partial inhibition, we have $h_k \in [0, 1]$. These inhibitions reduce the set $W$ to a subset $W(h)$. The reaction fluxes $v^{tr}(h)$ (tr = “treated”) of the drug-treated network are obtained through the MOMA problem (Minimization Of the Metabolic Adjustment [7]), which has been shown to generate reasonable and realistic results for a perturbed metabolism. In order to apply the theory of linear programming, we use the definition of MOMA in terms of the $L_1$ norm. Then

$$v^{tr}(h) = \arg\min_{v \in W(h)} \| v - v^{ut} \|_1. \qquad (3)$$

In the following, the side effect of a drug treatment is estimated in terms of the distance $\| v^{tr}(h) - v^{ut} \|_1$ used in (3): the greater the distance, the bigger the impact of the drugs on the whole network. The problem we want to solve can be stated as follows.

PROBLEM STATEMENT: Given:
• a metabolic network, which means a stoichiometric matrix $S \in \mathbb{R}^{N_m \times N_r}$ and a vector of fluxes $v$ with upper bounds $U$, both in $\mathbb{R}^{N_r}$;
• the unperturbed fluxes $v^{ut}$;
• $N_d$ drugs together with their inhibition targets $T_k$, $k = 1, \dots, N_d$;
• a reaction to be modulated ($v_{mod}$) and the relative threshold $\tau < 1$;
we want to find the inhibitions $h \in [0, 1]^{N_d}$ such that $v^{tr}_{mod}(h) \le \tau v^{ut}_{mod}$, causing the minimal side effect.

According to (3), for a given set of drugs (i.e., an inhibition vector $h$), we can calculate both $v^{tr}(h)$ (and then check whether $v^{tr}_{mod}(h) \le \tau v^{ut}_{mod}$) and the value of the side effect. Then, the final formulation of the problem is the following:

$$\min_{h} \; \| v^{tr}(h) - v^{ut} \|_1 \quad \text{such that} \quad \begin{cases} v^{tr}(h) = \arg\min_{w \in W(h)} \| w - v^{ut} \|_1, \\ v^{tr}_{mod}(h) \le \tau v^{ut}_{mod}. \end{cases} \qquad (4)$$

2.3. The strong duality theorem

The bilevel optimization (4) is a min-min linear program. The inner problem adjusts the fluxes so as to achieve the minimal metabolic adjustment, subject to the drug inhibitions imposed by the outer problem. The outer problem selects the combination of drugs which minimizes the side effect and guarantees a flux lower than the fixed threshold for the reaction which has to be modulated. This bilevel optimization is commonly solved by applying the strong duality theorem of LP [4], which consists in appending a list of constraints corresponding to the dual of the inner problem and setting the primal objective function equal to the dual [8]. This leads to the following single minimization:


Nr X

Minimize

ai

b

i=1

Nr X

8i = 1, . . . , Nm ;

vi  Ui

8i = 1, . . . , Nr ;

ai  +viut ai 

viut

8i = 1, . . . , Nr ;

j

0

8j = 1, . . . , Nr ;

j

1

8j = 1, . . . , Nr ;

vi vi

Nt,j j

+

i=1

X

i

8k = 1, . . . , Nd , i 2 Tk ;

+ ↵j

i=1

Nr X i=1

ai =

Nr X i=1

↵j + 0 @

i

such that

Si,j vj = 0

vi  Ui hk

Si,j µi +

hk

k=1

j=1

N m X

Nd X

+ Ui

i

X

k2Ti

8i = 1, . . . , Nr ;

ut vmod  ⌧ vmod ; 1

hk + (↵i

ut i )vi A .

where $a_i = |v_i - v_i^{ut}|$ and Greek letters refer to dual variables. However, the last equation, which comes from the equality required by the duality theorem, is no longer a linear constraint, since it contains the product between the outer problem variable $h_k$ and a dual variable of the inner problem; hence, the problem can no longer be solved by linear optimization. It is common to overcome this problem by restricting the $h_k$ variables to Boolean values, thereby excluding the possibility of inducing partial inhibition.

It is worth noting the presence of the parameter $b \ll 1$ in the cost function of the outer problem. The use of the $L_1$ norm does not guarantee the uniqueness of the solution. Moreover, it is also possible to have different solutions that induce equivalent inhibitions. Therefore, we introduce the parameter $b$ in order to exclude the combinations containing redundant inhibitions and avoid an “over-selection” of drugs, reducing the number of cases of degeneracy. However, when degenerate solutions remain, the algorithm supplies one of them without any specific preference.
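To make the bilevel structure of (4) concrete, the sketch below solves the inner $L_1$ MOMA problem (3) as a linear program and then solves the outer problem by a naive exhaustive search over a small grid of dosages. This is deliberately not the duality-based reformulation described above: it is the brute-force alternative, shown only because it is easy to verify. The toy network, the untreated fluxes, the single drug and its target are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def moma_l1(S, U, v_ut, h_rxn):
    """Inner problem (3): min ||v - v_ut||_1  s.t.  S v = 0, 0 <= v <= U * h_rxn.

    h_rxn holds one inhibition factor per reaction (1 = not targeted).
    """
    n_m, n_r = S.shape
    # Variables x = [v, a]; minimise sum(a) with a_i >= |v_i - v_ut_i|.
    cost = np.concatenate([np.zeros(n_r), np.ones(n_r)])
    I = np.eye(n_r)
    A_ub = np.block([[I, -I], [-I, -I]])   #  v - a <= v_ut  and  -v - a <= -v_ut
    b_ub = np.concatenate([v_ut, -v_ut])
    A_eq = np.hstack([S, np.zeros((n_m, n_r))])
    bounds = [(0.0, u * h) for u, h in zip(U, h_rxn)] + [(0.0, None)] * n_r
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.zeros(n_m),
                  bounds=bounds, method="highs")
    return res.x[:n_r], res.fun

# Hypothetical toy data: reactions r0..r3, one drug acting on r1.
S = np.array([[1.0, -1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0, 0.0]])
U = np.array([10.0, 8.0, 100.0, 100.0])
v_ut = np.array([8.0, 8.0, 8.0, 0.0])   # assumed untreated fluxes
drug_targets = [1]                       # T_k for the single drug
mod, tau = 2, 0.5                        # reaction to modulate, threshold

best = None                              # (dosage h, side effect)
for h in [0.0, 0.25, 0.5, 0.75, 1.0]:    # outer problem by enumeration
    h_rxn = np.ones(len(U))
    h_rxn[drug_targets] = h
    v_tr, side_effect = moma_l1(S, U, v_ut, h_rxn)
    feasible = v_tr[mod] <= tau * v_ut[mod] + 1e-6
    if feasible and (best is None or side_effect < best[1]):
        best = (h, side_effect)

print("selected inhibition and side effect:", best)
```

With these numbers the mildest dosage that meets the threshold is $h = 0.5$. The number of LPs solved by such an enumeration grows exponentially with the number of drugs and dosage levels, which is the cost the single-level reformulation is meant to avoid.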

2.4. Partial inhibition

In this Section we propose a solution which can still use the duality theorem for solving the bilevel optimization but does not completely ignore the possibility of inducing a partial inhibition of the reactions targeted by the drugs (i.e. the use of different drug dosages). This requires a discretization of the interval [0, 1], replacing the binary action of each drug with $P + 1$ Boolean variables describing this discretization. For the $k$-th drug ($k = 1, \dots, N_d$) we introduce the set of Boolean variables $\{d_{k,n}\}_{n=0,\dots,P}$ and define an inhibition coefficient $h_k$ given by the following convex combination:

$$h_k := \frac{d_{k,0}}{2^P} + \sum_{n=1}^{P} \frac{d_{k,n}}{2^n}. \qquad (5)$$

In (5) the integer $P$ sets the resolution of the discretization of [0, 1]: the factor $h_k$ assumes values between 0 and 1 with precision $2^{-P}$. Notice that for $P = 0$ we have the ON/OFF model of the previous section. When the strong duality theorem is applied, the nonlinear terms are the products $\lambda_i h_k$, where $\lambda_i$ denotes the dual variable that multiplies $h_k$ in the duality equality. Expanding the product according to the definition in Eq. (5),

$$\lambda_i h_k = \frac{\lambda_i d_{k,0}}{2^P} + \sum_{n=1}^{P} \frac{\lambda_i d_{k,n}}{2^n},$$

the nonlinearity is spread over the products $\lambda_i d_{k,n}$, with $d_{k,n}$ a binary variable. In this case, the nonlinear terms $\lambda_i d_{k,n}$ can be exactly linearized as follows:

$$z_{ikn} := \lambda_i d_{k,n}; \qquad 0 \le z_{ikn} \le \lambda_i^{max} d_{k,n}; \qquad \lambda_i - \lambda_i^{max} (1 - d_{k,n}) \le z_{ikn} \le \lambda_i,$$

where $\lambda_i^{max}$ is the upper bound for the dual variable $\lambda_i$.
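The discretization (5) and the linearization of the product between a bounded non-negative dual variable and a Boolean variable can be checked numerically. The sketch below enumerates the dosage levels produced by (5) for a given $P$ and verifies that the linear inequalities leave exactly one admissible value for the auxiliary variable $z$. The names `lam` and `lam_max` are illustrative stand-ins for the dual variable and its upper bound.

```python
from itertools import product

def inhibition(d):
    """Inhibition coefficient h_k of Eq. (5) from Boolean digits d = (d_0, ..., d_P)."""
    P = len(d) - 1
    return d[0] / 2**P + sum(d[n] / 2**n for n in range(1, P + 1))

def levels(P):
    """All distinct values of h_k reachable with P + 1 Boolean variables."""
    return sorted({inhibition(d) for d in product((0, 1), repeat=P + 1)})

def linearized_interval(lam, lam_max, d):
    """Interval left for z by 0 <= z <= lam_max*d and lam - lam_max*(1-d) <= z <= lam."""
    lower = max(0.0, lam - lam_max * (1 - d))
    upper = min(lam_max * d, lam)
    return lower, upper

print(levels(0))   # ON/OFF model
print(levels(2))   # five dosage levels, spacing 2**-2

# For 0 <= lam <= lam_max the interval collapses to the single point lam * d.
for d in (0, 1):
    lo, hi = linearized_interval(0.3, 2.0, d)
    assert lo == hi == 0.3 * d
```

For $P = 2$ the reachable levels are 0, 0.25, 0.5, 0.75 and 1, so each additional Boolean variable halves the dosage step at the price of one more integer variable per drug in the resulting mixed-integer program.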

3. RESULTS

In order to evaluate its performance, the algorithm has been tested on the core metabolism of E. coli [9], which has 95 reactions and 72 metabolites. Eight drugs acting on this network have been found in the DrugBank database (www.drugbank.ca).

3.1. Computational performance

First, the impact of the parameter P on the computational cost is evaluated: we choose ribose-5-phosphate isomerase as the reaction to be modulated and run the algorithm with different values of P from 0 to 5, recording the computational time required to find the solution. In addition, we estimate the time needed to perform an exhaustive search over all possible drug combinations (and dosages). The comparison of the performance of the algorithm with this estimate is plotted in Figure 1 and shows the good performance of our method.

Figure 1. Algorithm performances. Computational time of the algorithm compared with an exhaustive search for different values of P.

3.2. Screening of the reactions to be modulated

A set of tests has been carried out combining different values of $\tau$ and $P$, in particular $\tau \in \{0.0, 0.1, 0.5\}$ and $P \in \{0, 1, 2\}$. For each pair, we perform a screening that considers each metabolic reaction as the process to be modulated and finds the most selective drug combination. The cumulative count of the cardinality of the solutions (number of components of $h$ which are smaller than 1) is reported in Figure 2. One can see that, when a complete stop of the reaction is required ($\tau = 0$), there is no significant advantage in increasing the precision $P$. However, when a more accurate modulation of the flux is needed, higher values of $P$ allow the algorithm to find a larger number of solutions. Indeed, through partial inhibition we can find solutions which are closer to the desired threshold, whereas the simple ON/OFF model is more suitable for bringing the reaction to a complete stop.

Figure 2. Results from the reaction screening. For each of the 3 values of $\tau$, the three histograms report the cumulative number of all solutions we have found for different choices of P.

3.3. Drug-drug interaction surface: a case study

For one case from the set of solutions found through this procedure, we now detail the drug interactions exploited by the algorithm. In particular, we consider the synergisms in the inhibition of transketolase. The solution contains a pair of drugs (Fomepizole and Halofantrine, which target alcohol/acetaldehyde dehydrogenase and ATP synthase respectively). We explore the drug interaction surface by changing the amount of inhibition induced by each compound, which in experiments could correspond to using different drug dosages (the interval [0, 1] has been discretized through (5) with P = 4). The 2D surface is reported in Figure 3. A complete stop of the reaction is achieved only with a high dosage of both drugs, whereas a partial inhibition is clearly achieved already at small/intermediate dosages. Moreover, the plot clearly shows the deviation of the synergistic effect from the superposition of the single drug inhibitions.

Figure 3. Surfaces of drug-drug interaction. Axes x and y report the inhibition coefficients h for the two drugs (0 means a complete stop of the reaction, i.e. high drug dosage, and 1 no inhibition, i.e. no drug usage); the z-axis reports the percentage of the flux through the modulated reaction after drug treatment with respect to the untreated value.

4. CONCLUSIONS

The field of drug combinatorics is largely unexplored experimentally and the potential of combined drug therapies is difficult to assess, mostly for lack of suitable systematic methodologies. To try to fill this gap, we have developed an algorithm which is capable of efficiently exploring the optimal synergisms among all possible drug combinations and of characterizing them in terms of side effect and selectivity. Moreover, the method provides a more realistic description of the inhibition of the target induced by a drug (through a convex combination of several Boolean variables) while still preserving the linear nature of the problem and the consequent efficiency. Indeed, we have shown that with the available optimization software, the performance of the algorithm is still good (see Figure 1). The method has been applied to the central carbon metabolism of E. coli; as expected, we have found that, on increasing the number of Boolean variables used in the convex combination, the modulation of the flux for a given reaction becomes more accurate and a larger number of solutions is found (see Figure 2). Moreover, Figure 3 reveals that nonlinear effects, not explained by superposition of the single effects, take place and can be captured by the method we proposed.

5. REFERENCES

[1] J. Lehar, A. Krueger, G. Zimmermann, and A. Borisy, “High-order combination effect and biological robustness,” Mol Syst Biol, vol. 4, pp. doi:10.1038, 2008.
[2] S. Klamt and E. Gilles, “Minimal cut sets in biochemical reaction networks,” Bioinformatics, vol. 20, no. 2, pp. 226–234, 2004.
[3] D. Segrè, A. De Luna, G. Church, and R. Kishony, “Modular epistasis in yeast metabolism,” Nat Genet, vol. 37 (1), pp. 77–83, 2005.
[4] A. Burgard, P. Pharkya, and C. Maranas, “OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization,” Biotechnol Bioeng, vol. 84 (6), pp. 647–657, 2003.
[5] P. Yeh and R. Kishony, “Networks from drug-drug surface,” Mol Syst Biol, vol. 3, pp. 85–87, 2007.
[6] B. Palsson and A. Varma, “Metabolic capabilities of Escherichia coli II: optimal growth pattern,” J Theor Biol, vol. 165, pp. 503–522, 1993.
[7] D. Segrè, D. Vitkup, and G. Church, “Analysis of optimality in natural and perturbed metabolic networks,” Proc Natl Acad Sci USA, vol. 99 (23), pp. 15112–15117, 2002.
[8] J. Matoušek and B. Gärtner, Understanding and Using Linear Programming, Springer, Berlin, 2000.
[9] J. Orth, R. Fleming, and B. Palsson, “Reconstruction and use of microbial metabolic networks: the core Escherichia coli metabolic model as an educational guide,” EcoSal, doi: 10.1128/ecosal.10.2.1, 2010.

BAYESIAN INFERENCE FOR THE CHEMICAL MASTER EQUATION USING APPROXIMATE MODELS

Colin S. Gillespie and Andrew Golightly
School of Mathematics & Statistics, Newcastle University, NE1 7RU, UK

ABSTRACT

We consider the problem of efficiently inferring the parameters in gene regulatory networks. Whilst it is possible to work with a discrete stochastic model for inference, computational cost can be prohibitive for networks of realistic size and complexity. By treating the numbers of molecules of biochemical species as continuous, it is natural to work with a diffusion approximation, or if intrinsic stochasticity is ignored, an ordinary differential equation (ODE) description of the system dynamics. In this paper, we examine the effect of approximating the true stochastic model for parameter inference. We show that even when the model contains very low copy numbers, inference from the SDE model is still reasonable, but an ODE approach cannot be relied upon for making accurate inferences on stochastic rate constants.

1. INTRODUCTION

A growing realisation of the importance of stochasticity in cell and molecular processes has stimulated the need for efficient methods of inferring rate constants in stochastic kinetic models associated with gene regulatory networks [1]. Such inferences are typically required to allow predictive in silico experiments. Inference methods for the associated discrete stochastic kinetic model have been proposed (see [2, 3]). Whilst such techniques have been shown to perform well for a simple Lotka-Volterra model with two species involved in three reactions, the associated computational cost precludes analysis of networks with many reactions and species. There is therefore a real need for algorithms which provide accurate inferences and will not suffer computationally as the complexity of the system of interest increases.

The aim of this paper is to exploit the computational efficiency of methods such as the diffusion (SDE) and ODE approximations. Rather than directly infer parameters for the chemical master equation, we will perform inference for an SDE and an ODE approximation. Using a model that combines strong stochastic features with low copy numbers, we show that the SDE approximation performs remarkably well, whilst assuming an ODE model underestimates the parameter uncertainty, but captures the mean value.

Although inference is complicated by the analytical intractability of the observed data likelihood, we proceed by using the fact that we can forward simulate from the model, an approach known as likelihood free inference. We use the forward simulator to make independent proposals inside a Markov chain Monte Carlo (MCMC) step, thereby sidestepping the need to evaluate the observed data likelihood. Moreover, to avoid low acceptance rates, we perform the algorithm sequentially, making use of particle filtering methods [4, 5, 6].

2. INFERENCE

Treating molecule numbers as continuous and performing exact parameter inference for the resulting continuous state Markov process appears to be a promising approach. Inferring the kinetic parameters governing the diffusion approximation has been the focus of [7, 8]. However, when molecule numbers are particularly small, working with the diffusion or ODE approximation may have a detrimental effect on the quality of inference that can be made about the rate constants $c$. We investigate this empirically in §3.

2.1. Likelihood free inference

Consider the problem of inferring the rate constants $c$ using time course biochemical data observed at discrete time points. Suppose that the process $Y(t)$ is not observed exactly; rather, we have (without loss of generality) noisy measurements $X_{0:T} = \{X(t) : t = 0, 1, \dots, T\}$ observed on a regular grid. We assume that the true underlying process $Y(t)$ is linked to $X(t)$ via the density $\pi(X(t)|Y(t))$. Note that this setup is flexible and does not limit us to additive error structures.

Rather than perform inference for the exact Markov jump process, we work with the approximate SDE and ODE models and the respective kinetic rate constants $c$ that govern the approximate model. Let $Y_{(0,T]} = \{Y(t) : t \in (0, T]\}$ denote the complete process path on $(0, T]$ and denote the marginal density of $Y_{(0,T]}$, under the structure of the approximate model, by $\pi_a(Y_{(0,T]}|Y(0), c)$, since it depends on the starting value $Y(0)$ and the rate constants $c$. Let $\pi(Y(0))$ and $\pi(c)$ denote the respective prior densities for $Y(0)$ and $c$.


To avoid low acceptance rates associated with sampling the full latent process and parameters (rate constants) jointly in one MCMC move, we update $Y(t)$ and $c$ for each observation in turn. Suppose that observation $X(i)$ becomes available and our goal is to generate a sample from the posterior distribution of $c$ and the latent process at the observation time, $Y(i)$, given all information up to (and including) time $i$. Denote this target density by $\pi(c, Y(i)|X_{0:i})$. We have

$$\pi(c, Y(i)|X_{0:i}) \propto \int \pi(c, Y(i-1)|X_{0:i-1}) \times \pi_a(Y_{(i-1,i]}|Y(i-1), c) \times \pi(X(i)|Y(i)) \, dY_{[i-1,i)} \qquad (1)$$

where $Y_{(i-1,i]} = \{Y(t) : t \in (i-1, i]\}$ is the latent path in $(i-1, i]$. Intuitively, the posterior based on all data up to time $i-1$ becomes the prior at time $i$. This is then combined with the likelihood of the latent process in $(i-1, i]$ and the measurement density at time $i$ to give the posterior density (up to proportionality) of $c$ and the latent process in $[i-1, i]$. Integrating over the latent process in $[i-1, i)$ gives the desired density. If $\pi(c, Y(i-1)|X_{0:i-1})$ can be sampled then the integration in (1) can be performed via Monte Carlo. However, this density typically cannot be obtained analytically. If we assume that we have a sample from $\pi(c, Y(i-1)|X_{0:i-1})$ then a particle approximation can be used: the discrete support generated by the sample of points (or “particles”) can be used as the actual support. Such an approach is referred to in the literature as a particle filter and is the subject of the next section.

Figure 1. Synthetic data set generated via the Gillespie algorithm with the simple auto-regulatory network (axes: Time, Population Level). The original realisation is shown as a solid red line ($Y_1$) and dashed blue line ($Y_2$). The solid points are the twenty-one sampled noisy points used in the case study.

2.2. A particle approach

Particle filtering for dynamic states has been considered by many authors [9, 10, 4]. Suppose that we have a sample of size $N$ from $\pi(c, Y(i-1)|X_{0:i-1})$ and denote this sample by $\{[c^{(j)}, Y^{(j)}(i-1)], j = 1, \dots, N\}$. Note that the kernel density estimate of $\pi(c, Y(i-1)|X_{0:i-1})$ is

$$\hat{\pi}(c, Y(i-1)|X_{0:i-1}) = \frac{1}{N} \sum_{j=1}^{N} \phi\left\{ [c, Y(i-1)]' ; [c^{(j)}, Y^{(j)}(i-1)]', \omega^2 V \right\} \qquad (2)$$

where $\phi(\cdot; \mu, \Sigma)$ denotes the Gaussian density with mean $\mu$ and variance matrix $\Sigma$, and $V$ is the Monte Carlo variance of the sample. Standard rules of thumb can be used to choose the smoothing parameter $\omega$ [11]. We can sample (2) by picking a particle $[c^{(j)}, Y^{(j)}(i-1)]$ at random from the particle set and adding zero mean Gaussian noise with variance $\omega^2 V$. In our test case, we set $\omega^2 = 0.016$ and perturbed the parameter values on the log scale. This process is known in the context of the particle filter as jittering and is used in favour of simply sampling values of $c$ and $Y(i-1)$ from the particle set, to avoid sample impoverishment [12]. Hence the particle filter approximates the target density in (1) by

$$\hat{\pi}(c, Y(i)|X_{0:i}) \propto \int \hat{\pi}(c, Y(i-1)|X_{0:i-1}) \times \pi_a(Y_{(i-1,i]}|Y(i-1), c) \times \pi(X(i)|Y(i)) \, dY_{[i-1,i)} \qquad (3)$$

and we sample the density in (3) via the following MCMC scheme:

1. draw $[c^*, Y^*(i-1)]' \sim \hat{\pi}(\cdot \,|X_{0:i-1})$,
2. draw $Y^*_{(i-1,i]} \sim \pi_a(\cdot \,|Y^*(i-1), c^*)$ using the SDE or ODE model,
3. if the current state of the chain is $[c, Y(i)]'$ then accept and store a move to $[c^*, Y^*(i)]'$ with probability $\min\left\{1, \dfrac{\pi(X(i)|Y^*(i))}{\pi(X(i)|Y(i))}\right\}$, otherwise store the current value of the chain. Return to step 1.

Hence, the particle filter is implemented by initialising with a sample from the prior distribution of $c$ and $Y(0)$ and performing steps 1–3 above for $i = 1, 2, \dots, T$. Note that the Markov chain generated this way has $\hat{\pi}(c, Y(i)|X_{0:i})$ as its invariant distribution. Furthermore, by using the approximate simulator as a proposal mechanism, evaluation of the associated likelihood is not required in the acceptance probability and the only term that needs to be evaluated is the tractable density associated with the measurement error. This setup is flexible and can be used with any forwards
simulator as a proposal process. For example, if computational expense is not an issue then inference for the exact underlying discrete stochastic model can be performed by using the Gillespie algorithm inside the particle filter.

Figure 2. Parameter posterior distributions from the output of the particle filter from the Gillespie (solid histogram), SDE (blue dashed line) and ODE (red solid line) simulators. The five panels show the posterior density of c1, c2, c3, c4 and c5.
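One update of the likelihood free scheme of Section 2.2 (jitter a particle, forward simulate, accept with the ratio of measurement densities) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: only the log-parameters are jittered, the function and argument names are hypothetical, and the toy simulator and Gaussian error density at the bottom merely stand in for the SDE, ODE or Gillespie simulator and the measurement model.

```python
import numpy as np

def filter_step(log_c, y_prev, x_obs, simulate, obs_density, n_iters, omega2, rng):
    """Move particles from time i-1 to time i.

    log_c  : (N, p) log rate constants of the current particles
    y_prev : (N, ...) latent states Y(i-1) of the current particles
    Returns the stored chain of (log c, Y(i)) values.
    """
    n, p = log_c.shape
    V = np.atleast_2d(np.cov(log_c, rowvar=False))           # Monte Carlo variance
    chol = np.linalg.cholesky(omega2 * V + 1e-12 * np.eye(p))
    current = None                                            # (log c, Y(i), density)
    chain_c, chain_y = [], []
    for _ in range(n_iters):
        j = rng.integers(n)                                   # step 1: pick a particle
        lc = log_c[j] + chol @ rng.standard_normal(p)         #         and jitter it
        y_new = simulate(y_prev[j], np.exp(lc), rng)          # step 2: forward simulate
        dens = obs_density(x_obs, y_new)                      # step 3: accept/reject
        if current is None or rng.random() * current[2] < dens:
            current = (lc, y_new, dens)
        chain_c.append(current[0])
        chain_y.append(current[1])
    return np.array(chain_c), np.array(chain_y)

# Toy stand-ins (assumptions): one rate constant, counts grow by Poisson(c).
rng = np.random.default_rng(0)
log_c = rng.uniform(-1.0, 2.0, size=(500, 1))
y_prev = np.full(500, 10)

def toy_simulate(y, c, rng):
    return y + rng.poisson(c[0])

def toy_density(x, y):
    return np.exp(-0.5 * (x - y) ** 2)   # unnormalised; only ratios are used

c_chain, y_chain = filter_step(log_c, y_prev, 13, toy_simulate, toy_density,
                               2000, 0.016, rng)
```

Because the proposal is drawn from the simulator itself, the transition density of the approximate model never has to be evaluated; only `obs_density` appears in the acceptance test, which is the property that makes the scheme usable with any forward simulator.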

3. EXAMPLE

3.1. Autoregulatory network

As an example, we consider a simple autoregulatory network with two species, $Y_1$ and $Y_2$, whose time course behaviour evolves according to the following set of coupled pseudo-reactions,

$$R_1: \emptyset \xrightarrow{c_1} Y_1, \qquad R_2: \emptyset \xrightarrow{c_2} Y_2, \qquad R_3: Y_1 \xrightarrow{c_3} \emptyset, \qquad R_4: Y_2 \xrightarrow{c_4} \emptyset, \qquad R_5: Y_1 + Y_2 \xrightarrow{c_5} 2Y_2.$$

Essentially, the system is a coupled immigration-death process, with $Y_1$ and $Y_2$ interacting through reaction $R_5$. Reactions $R_1$ and $R_2$ represent immigration, reactions $R_3$ and $R_4$ represent death and finally $R_5$ can be thought of as interaction. Note that even for this simple system, the transition density associated with the resulting Markov jump process (under an assumption of mass action kinetics) cannot be found in closed form.

We take initial values of $Y_1(0) = Y_2(0) = 10$ and values of the rate constants of $c_1 = 10$, $c_2 = 0.1$, $c_3 = 0.1$, $c_4 = 0.7$ and $c_5 = 0.008$. A typical stochastic simulation is shown in Figure 1. The data, $X_i(t)$, used when inferring the parameters were obtained by corrupting $Y_i(t)$ in the following manner:

$$X_i(t)|Y_i(t) \sim \begin{cases} \text{Poisson}(Y_i(t)) & \text{if } Y_i(t) > 0, \\ \text{Bernoulli}(0.1) & \text{if } Y_i(t) = 0 \end{cases}$$

for each component $i = 1, 2$, so that $Y(t)$ is not observed exactly anywhere. In the inference procedure described in the following section, any error model can be used.

3.2. Inference

We run the particle filter and make proposals using either the diffusion approximation or the ODE model. To allow comparison with inferences made using the true discrete stochastic model, we also run the particle filter with the Gillespie algorithm as a proposal mechanism. In all cases six million iterations were performed with N = 30,000 particles stored at each time point. Independent proper
Uniform $U(-6, 3)$ priors were taken for each $\log(c_i)$ and discrete Uniform distributions on $\{0, 1, \dots, 20\}$ were taken as priors for $Y_1(0)$ and $Y_2(0)$. Figure 2 provides kernel density estimates of the marginal parameter posteriors for the SDE and ODE systems. The output from the Gillespie simulator is summarised using a histogram. A quick inspection reveals that, not surprisingly, all schemes produce parameter values that are consistent with the values of $c$ that produced the data. Rather than compare the output of each scheme with the true values, however, we view the inferences made under the discrete stochastic model as the gold standard, and compare these with the output under the approximate models. Even though the species $Y_2$ has very low copy numbers for much of the simulation, the SDE approximation works surprisingly well. Working with a diffusion approximation and treating all species numbers as continuous provides accurate inferences for $c_1$, $c_3$ and $c_5$. However, the SDE approximation fails to accurately estimate $c_2$ and $c_4$. These rate constants are linked with the $Y_2$ process and correspond to the birth and death rates of $Y_2$. Hence, treating numbers of $Y_2$ as continuous, when Figure 1 clearly shows that the data are inherently discrete, impacts on the quality of the inferences made on the related rate constants. The ODE approximation in general significantly underestimates the uncertainty of the stochastic rate constants. Overall, the mean estimates are reasonable (when compared to the output obtained from the Gillespie simulator). The one exception is the parameter $c_2$, where the mean value is underestimated. This is not surprising, since this parameter is linked to the species with low copy numbers.
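For readers who wish to reproduce a data set of the kind shown in Figure 1, the following sketch simulates the autoregulatory network of Section 3.1 with Gillespie's direct method under mass action kinetics and then applies the Poisson/Bernoulli corruption. The observation spacing of 5 time units (giving twenty-one points on [0, 100]) and the random seed are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

# Net change in (Y1, Y2) caused by reactions R1..R5.
STOICH = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [-1, 1]])

def hazards(y, c):
    """Mass action hazards of R1..R5 for state y = (Y1, Y2)."""
    y1, y2 = y
    return np.array([c[0], c[1], c[2] * y1, c[3] * y2, c[4] * y1 * y2])

def gillespie(y0, c, t_end, rng):
    """Exact simulation of the Markov jump process on [0, t_end]."""
    t, y = 0.0, np.array(y0, dtype=int)
    times, states = [t], [y.copy()]
    while True:
        h = hazards(y, c)
        h0 = h.sum()                        # > 0 because c1 > 0
        t += rng.exponential(1.0 / h0)      # time to the next reaction
        if t > t_end:
            break
        r = rng.choice(5, p=h / h0)         # which reaction fires
        y = y + STOICH[r]
        times.append(t)
        states.append(y.copy())
    return np.array(times), np.array(states)

def observe(y, rng):
    """Poisson(Y) where Y > 0, Bernoulli(0.1) where Y = 0."""
    y = np.asarray(y)
    return np.where(y > 0, rng.poisson(y), rng.binomial(1, 0.1, size=y.shape))

rng = np.random.default_rng(1)
c_true = (10.0, 0.1, 0.1, 0.7, 0.008)
times, states = gillespie((10, 10), c_true, 100.0, rng)

grid = np.arange(0, 101, 5)                            # assumed observation times
idx = np.searchsorted(times, grid, side="right") - 1   # last event before each time
x_obs = observe(states[idx], rng)                      # noisy data, shape (21, 2)
```

The same `gillespie` function can serve as the forward simulator inside the particle filter when the exact discrete model is used as the reference.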

4. DISCUSSION

There has been considerable work in recent years in developing approximate simulators [13, 3, 14]. The need for such an approach is clear: working with a discrete stochastic inferential model is computationally expensive, and this cost increases as the number of reactions and species increases. Whilst the approach of entirely ignoring discreteness and working with a diffusion approximation can work well in some scenarios [15], it is found here that ignoring the inherent discreteness associated with low copy number species can have a negative impact on the quality of inferences made. Ignoring both discreteness and stochasticity by working with a system of ODEs can give completely misleading inferences about the stochastic rate constants.

5. REFERENCES

[1] H. Kitano, Foundations of Systems Biology, The MIT Press, London, 2001.

[2] R. J. Boys, D. J. Wilkinson, and T. B. L. Kirkwood, “Bayesian inference for a discretely observed stochastic-kinetic model,” Statistics and Computing, vol. 18, pp. 125–135, 2008.

[3] D. J. Wilkinson, Stochastic Modelling for Systems Biology, Chapman and Hall/CRC Press, London, 2006.

[4] A. Doucet, S. Godsill, and C. Andrieu, “On sequential Monte Carlo sampling methods for Bayesian filtering,” Statistics and Computing, vol. 10, pp. 197–208, 2000.

[5] P. Del Moral, J. Jacod, and P. Protter, “The Monte Carlo method for filtering with discrete-time observations,” Probability Theory and Related Fields, vol. 120, pp. 346–368, 2002.

[6] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. H. Stumpf, “Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems,” Journal of the Royal Society Interface, vol. 6, pp. 187–202, 2009.

[7] E. A. Heron, B. Finkenstadt, and D. A. Rand, “Bayesian inference for dynamic transcriptional regulation; the Hes1 system as a case study,” Bioinformatics, vol. 23, pp. 2596–2603, 2007.

[8] A. Golightly and D. J. Wilkinson, “Bayesian inference for nonlinear multivariate diffusion models observed with error,” Computational Statistics & Data Analysis, vol. 52, pp. 1674–1693, 2008.

[9] M. K. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particle filters,” Journal of the American Statistical Association, vol. 94, no. 446, pp. 590–599, 1999.

[10] J. Carpenter, P. Clifford, and P. Fearnhead, “An improved particle filter for nonlinear problems,” IEE Proceedings - Radar, Sonar and Navigation, vol. 146, pp. 2–7, 1999.

[11] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.

[12] J. Liu and M. West, “Combined parameter and state estimation in simulation-based filtering,” in Sequential Monte Carlo Methods in Practice, A. Doucet, N. de Freitas, and N. Gordon, Eds. Springer-Verlag, New York, 2001.

[13] D. T. Gillespie, “Approximate accelerated stochastic simulation of chemically reacting systems,” Journal of Chemical Physics, vol. 115, no. 4, pp. 1716–1732, 2001.

[14] C. S. Gillespie, “Moment closure approximations for mass-action models,” IET Systems Biology, vol. 3, pp. 52–58, 2009.

[15] A. Golightly and D. J. Wilkinson, “Bayesian inference for stochastic kinetic models using a diffusion approximation,” Biometrics, vol. 61, no. 3, pp. 781–788, 2005.


MODELLING REGULATORY PROCESSES DURING MORPHOGENESIS IN DROSOPHILA MELANOGASTER WITH AN IMPROVED VERSION OF THE BGMD DYNAMIC BAYESIAN NETWORK MODEL

Marco Grzegorczyk

Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany
[email protected]

ABSTRACT

At the WCSB workshop in 2009 a novel dynamic Bayesian network model for non-homogeneous regulatory processes was presented. The Bayesian Gaussian mixture dynamic Bayesian network (BGMD) model divides the data into disjoint time segments by a changepoint process, and each individual segment is modelled separately and independently. To improve the performance and efficiency of the BGMD model, we propose a novel Markov chain Monte Carlo (MCMC) scheme. The earlier work pursued inference with reversible jump MCMC (RJMCMC) simulations. We explore the application of a dynamic programming scheme, with which changepoint configurations can be sampled from the proper conditional distribution within a Gibbs sampling scheme.

1. INTRODUCTION

The standard assumption underlying dynamic Bayesian networks (DBNs) is that time series have been generated from a homogeneous Markov process. However, regulatory interactions in the cell usually change in response to external stimuli. The assumption of homogeneity is therefore too restrictive in many circumstances. At the sixth WCSB workshop in 2009, a novel DBN model for non-homogeneous regulatory processes was proposed [1]. The Bayesian Gaussian mixture (BGMD) DBN model employs a multiple changepoint process to divide a time series into disjoint segments, and it infers a graph topology, which is common to all segments, while parameters are allowed to vary between segments. The earlier work [1] pursued inference with reversible jump Markov chain Monte Carlo (RJMCMC), based on birth and death moves for individual changepoints. Here, we follow Grzegorczyk and Husmeier [2] and employ a dynamic programming (DP) scheme of Fearnhead [3] for improving convergence and mixing of the MCMC simulations. In [2] it was found that the DP scheme improves convergence for the cpBGe model. The cpBGe model [2] is similar to the BGMD model, but assumes that the patterns of non-homogeneity are gene-specific so that individual genes are affected differently by changing processes. For the application to gene expression from muscle development and growth in D. melanogaster, presented here, the cpBGe model is over-flexible, since a priori one would expect all genes to be affected by the same morphogenetic transitions (e.g. from larva to pupa). As the cpBGe model would have to infer the same transitions for each gene independently, it is inappropriate for this application. With the DP scheme [3] the changepoint configurations can be sampled from the proper conditional distribution within a Gibbs sampling scheme. Mixing and convergence of the two competing methods, the RJMCMC sampler from [1] and the Gibbs sampler proposed here, are compared on synthetic network data. Afterwards, we use the BGMD model to infer the morphogenetic stages of muscle development and growth in D. melanogaster.

2. METHODOLOGY

2.1. The homogeneous dynamic Bayesian network

Dynamic Bayesian networks (DBNs) are graphical models that describe the probabilistic relationships among variables, X1, ..., XN, that have been measured at time points t = 1, ..., m. Each variable is represented as a node in a directed graph, G, and an edge pointing from Xi to Xj, symbolically G(i, j) = 1, means that the realisation of Xj at time point t, Xj(t), depends on the value of Xi at time point t − 1, Xi(t − 1). πn = πn(G) denotes the parent set of Xn, i.e., the set of nodes from which an edge points to Xn in G. There is a one-to-one mapping between G and the parent sets πn: G(j, n) = 1 ⇔ Xj ∈ πn, and G(j, n) = 0 ⇔ Xj ∉ πn. Each node can be its own parent node, but it depends on the application whether self-loops, such as Xn(t − 1) → Xn(t), are meaningful.¹ Given a data set, D, where Dn,t and Dπn,t are the t-th realisations, Xn(t) and πn(t), of Xn and πn, respectively, DBNs are based on the following homogeneous Markov chain expansion:

$$P(D \mid G, \theta) = \prod_{n=1}^{N} \prod_{t=2}^{m} P\big(X_n(t) = D_{n,t} \mid \pi_n(t-1) = D_{\pi_n,t-1}, \theta_n\big)$$

where θ is the total parameter vector, composed of subvectors θn. The BGe model [4] specifies the distributional form of P(D|G, θ) as a multivariate Gaussian distribution, and assumes a normal-Wishart distribution as prior, P(θ|G), for the expectation vector and the precision matrix. Under fairly weak conditions, the parameters, θ, can be integrated out and the marginal likelihood can also be factorized:

$$P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta = \prod_{n=1}^{N} \Psi\big(D_n^{\pi_n}\big)$$

where each $\Psi(D_n^{\pi_n})$ can be computed in closed form [4].

¹ In the studies presented here, we rule out self-loops for the RAF pathway data, while we allow for them in the D. melanogaster network.

M.G. is financially supported by the German Research Foundation (DFG), research grant: 3853/1-1.

Figure 1. The RAF pathway, as reported in [6].

2.2. The non-homogeneous BGMD DBN model

The BGMD model expands the conventional DBN model by introducing a latent allocation vector, V, which assigns the data points to K mixture components, where K is inferred from the data. Conditional on V a separate score can be computed for each segment. The allocation vector describes the allocation of the data points t = 2, ..., m to the time segments. V(t) = k denotes that data point t is allocated to the k-th compartment (1 ≤ k ≤ K), and $D^{(V,k)}$ denotes the set of all data points that are allocated to compartment k. The posterior probability is proportional to the joint distribution:

$$P(G, V, K \mid D) = \frac{P(G, V, K, D)}{P(D)} \propto P(G, V, K, D)$$

and with independent segment-specific total parameter priors, $P(\theta^{(k)} \mid G)$ (k = 1, ..., K), the joint distribution can be factorised:

$$P(G, V, K, D) = P(K)\, P(V \mid K)\, P(G)\, P(D \mid G, V, K)$$

where $P(D \mid G, V, K) = \prod_{k=1}^{K} \prod_{n=1}^{N} \Psi\big(D_n^{(V,k),\pi_n}\big)$, and $D_n^{(V,k),\pi_n} := \{(D_{n,t}, D_{\pi_n,t-1}) \mid V(t) = k\}$ is the set of realisations of Xn and πn for those time points that have been allocated to component k. The allocation vector V acts as a filter which divides the data, D, into K segments, $D^{(V,k)}$, for which separate independent BGe scores, $\Psi(D_n^{(V,k),\pi_n})$, can be computed in closed form. For P(G) we take a uniform distribution over all graphs subject to a fan-in restriction of |πn| ≤ F = 3, and for P(K) we take a truncated Poisson distribution restricted to $1 \le K \le K_{\max}$ as prior, $P(K) \propto \frac{\lambda^{K}}{K!} e^{-\lambda}$ ($K = 1, \ldots, K_{\max}$), with λ = 1. Unlike in the earlier work [1], we employ the discrete counterpart of the prior of Green [5] and identify K components with K − 1 changepoints, $(b_1, \ldots, b_{K-1})$, on the discrete set {2, ..., m − 1}. With this modification it is possible to employ a DP scheme for sampling changepoints from the posterior distribution, and we have: $V(t) = k \Leftrightarrow b_{k-1} < t \le b_k$, where $b_k$ is the k-th changepoint implied by V with $b_0 := 1$ and $b_K := m$. We assume that the changepoints are distributed as the even-numbered order statistics of L := 2(K − 1) + 1 points $u_1, \ldots, u_L$ uniformly and independently distributed on the set {2, ..., m − 1}.

2.3. Sampling graph structures

The graph G is defined by the parent sets $\{\pi_n\}_{1 \le n \le N}$. Thus, conditional on K and V, a new graph G′ can be sampled from P(G|D, V, K) by independently sampling for each Xn a new parent set $\pi_n'$ from:

$$P(\pi_n' \mid D, V, K) = \frac{\prod_{k=1}^{K} \Psi\big(D_n^{(V,k),\pi_n'}\big)}{\sum_{\pi_n : |\pi_n| \le F} \prod_{k=1}^{K} \Psi\big(D_n^{(V,k),\pi_n}\big)}$$

where the Ψ(.) terms can be computed in closed form. Note that $P(\pi_n' \mid D, V, K)$ depends on V and K and, hence, has to be re-computed after each sampling step for the changepoint configurations.

2.4. Sampling changepoint configurations

In [1] the sample from P(G, V, K|D) is obtained with a RJMCMC sampling scheme, which is based on additions and deletions of edges, and birth, death, and reallocation moves of changepoints. Here, we follow [2] and propose a sampling scheme based on dynamic programming (DP). We adapt the DP scheme proposed by Fearnhead for Bayesian mixture models [3] to the BGMD model. Fearnhead shows how to sample from P(V, K|G, D) directly, with a DP scheme, when the marginal likelihood, P(D|G, V, K), can be computed in closed form, so that no RJMCMC sampling scheme with potential convergence problems is required. Combining the DP scheme [3] with the graph sampling procedure from Subsection 2.3 gives a Gibbs sampler for sampling from P(G, K, V|D) by iteratively sampling from P(G|K, V, D) and P(K, V|G, D).

3. DATA

3.1. Synthetic RAF pathway data

The RAF protein pathway has been widely studied in the literature [6], and we employ its topology, shown in Fig. 1, to generate synthetic network data: Node pip3 has no parent nodes, and its values are sampled from i.i.d. N(0, 0.1) Gaussian distributions for all t. For each other node its t-th realisation is a linear combination of the values of its parent nodes at time point t − 1 plus i.i.d. Gaussian noise φ(t), where the regression coefficients β vary in time. E.g. for pip2 we have for t = 2, ..., m:

$$\mathrm{pip2}(t) = \beta_{\mathrm{pip3}}^{V(t)} \cdot \mathrm{pip3}(t-1) + \beta_{\mathrm{plcg}}^{V(t)} \cdot \mathrm{plcg}(t-1) + \phi_{\mathrm{pip2}}(t)$$

where V(t) is the segment, such that $\beta_{\mathrm{pip3}}^{V(t)}$ and $\beta_{\mathrm{plcg}}^{V(t)}$ are piecewise constant functions, with discontinuities at the changepoints, and pip3(t − 1) and plcg(t − 1) are the values of pip3 and plcg at time point t − 1. The noise variables $\phi_{\mathrm{pip2}}(2), \phi_{\mathrm{pip2}}(3), \ldots$ are i.i.d. N(0, 0.01) Gaussian distributed. The length of the time series is fixed to m = 61, and non-stationarity is obtained by temporal changepoints at which the regression coefficients change. We consider K equally-spaced segments. For K = 1, ..., 5 each compartment is of length (m − 1)/K and the coefficients β change from compartment to compartment. For each compartment we sample the β-coefficients independently from uniform distributions on the interval [0.5, 2] with random signs. For K = 1 there is no changepoint and all coefficients stay constant across time. For K = 2 there are two compartments of length mi = 30 (i = 1, 2) and all coefficients are re-sampled after time point t = 31. For each K ten independent data sets were generated along these lines.
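The data-generating procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study. The function name `simulate_piecewise_linear` is hypothetical. The three-node parent map is also hypothetical: the parents of plcg are not stated in the text, and the full RAF topology of Fig. 1 would have to be supplied. Two further assumptions are made: the second argument of N(·, ·) is read as a variance, and the first time point of non-root nodes is drawn like a root node.

```python
import numpy as np

def simulate_piecewise_linear(parents, m=61, K=2, seed=0):
    """Simulate a time series from a piecewise-linear DBN with K equal segments."""
    assert (m - 1) % K == 0, "segments must have equal length (m - 1) / K"
    rng = np.random.default_rng(seed)
    seg_len = (m - 1) // K
    # Segment label V(t) for t = 2, ..., m; array index t - 1 holds time point t.
    V = np.zeros(m, dtype=int)
    V[1:] = np.arange(m - 1) // seg_len
    # One coefficient per edge and segment: uniform on [0.5, 2] with a random sign.
    beta = {
        (p, n): rng.uniform(0.5, 2.0, size=K) * rng.choice([-1.0, 1.0], size=K)
        for n, ps in parents.items() for p in ps
    }
    data = {n: np.zeros(m) for n in parents}
    for n, ps in parents.items():
        if not ps:   # root node: i.i.d. N(0, 0.1) at every time point
            data[n] = rng.normal(0.0, np.sqrt(0.1), size=m)
        else:        # assumption: first time point drawn like a root node
            data[n][0] = rng.normal(0.0, np.sqrt(0.1))
    for t in range(1, m):
        for n, ps in parents.items():
            if ps:   # linear combination of parent values at t - 1 plus noise
                mean = sum(beta[(p, n)][V[t]] * data[p][t - 1] for p in ps)
                data[n][t] = mean + rng.normal(0.0, np.sqrt(0.01))
    return data, V, beta

# Hypothetical sub-network; only pip3 (root) and pip2 <- {pip3, plcg} are from the text.
parents = {"pip3": [], "plcg": ["pip3"], "pip2": ["pip3", "plcg"]}
data, V, beta = simulate_piecewise_linear(parents, m=61, K=2, seed=1)
```

With m = 61 and K = 2 the segment labels change between t = 31 and t = 32, matching the statement that all coefficients are re-sampled after time point t = 31.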

Figure 2. Upper panels: Scatter plots of the marginal edge posterior probabilities for the RAF pathway data with K true segments (K = 1, ..., 5). We compare the RJMCMC sampler and the Gibbs sampler. For each single data set the marginal edge posterior probabilities have been estimated from two independent simulations with different initializations, and these probabilities have been plotted against each other. In each panel the scatter plots of 10 independent data instantiations have been superimposed. The coordinates of all points were slightly randomly perturbed to visualize clusters of points. Lower panel: Network reconstruction accuracy, in terms of AUROC values. All bars are averages over 10 data instantiations and the standard deviations are indicated by vertical lines. The grey (black) bars refer to the RJMCMC (Gibbs) sampler.

3.2. Drosophila melanogaster

The gene expressions in D. melanogaster were sampled at m = 67 time steps during four morphological stages of life: the embryonic, larval, pupal, and adult stage. Since these phases cover time periods of different lengths, gene expression profiles were collected at non-equidistant time points. The three transitions occur at t = 31 (embryonic to larval), t = 41 (larval to pupal), and t = 59 (pupal to adult) [7]. Like [8] we focus on genes involved in muscle development and growth: eve, gfl, twi, mlc1, sls, mhc, prm, actn, up, myo61f, and msp300.

4. SIMULATIONS AND EVALUATION

The hyperparameters of the normal-Wishart prior are chosen maximally uninformative subject to the regularity conditions discussed in [4]. For the RJMCMC sampler we set the burn-in and sampling phase to 250k iterations each (with k = 1,000). We sample networks and changepoints every 5k iterations during the sampling phase. With the novel Gibbs sampler we perform 20 Gibbs steps for each data set. All runs are initialized with random graphs and K = 1. For the D. melanogaster time series we standardize each gene to zero mean and marginal variance of 1, and we set the maximum number of time segments, $K_{\max}$, to 8, i.e., twice the number of true segments. Since 5 independent Gibbs MCMC simulations starting from different initializations with 100 Gibbs steps each converged sufficiently, we only report the results of the run which has been initialized by a graph without edges and K = 1.

5. RESULTS

The MCMC run lengths ensure that the novel Gibbs sampler requires less computational time than the RJMCMC sampler, and we compare the results in terms of convergence and mixing. Fig. 2 shows superimposed scatter plots of the marginal edge posterior probabilities estimated from two independent MCMC simulations with different initializations on the same synthetic RAF pathway data sets. The RJMCMC sampler does not converge properly; for each K there are edges whose probabilities differ. For the Gibbs sampler all points are located around the diagonal, indicating substantially better convergence. To test whether the improved convergence also yields a higher network reconstruction accuracy, AUROC scores can be computed from the marginal edge posterior probabilities. The mean AUROC values for the reconstruction of the RAF pathway are shown in the bottom row of Fig. 2. For each K the Gibbs sampler yields significantly better results than the RJMCMC sampler.

Fig. 3 shows the inferred changepoint location posterior probabilities for the D. melanogaster data. The BGMD model detects 6 changepoints with a high posterior probability around 1, and a less pronounced seventh changepoint in the embryonic phase. Among the six changepoints that have been clearly detected the three main morphogenetic transitions can be found: (i) embryonic to larval, (ii) larval to pupal, and (iii) pupal to adult. The additional transitions are located in the larval and pupal phase. Since there are various sub-transitions (embryonic phase → 1st instar larval phase → 2nd instar larval phase → 3rd instar larval phase → prepupal phase → pupal phase), which do not coincide with the main transitions (reported in the data), the additional transitions might refer to biologically plausible sub-transitions. Our findings are comparable to those reported in [9] for the HetDBN-SI model. The same time series has also been analysed with discrete non-homogeneous DBNs, but these models fail to detect the last changepoint (pupal to adult) [8]. In [9] it is argued that a failure to detect known transitions is clearly a shortcoming while the occurrence of additional transitions is not implausible, as some pathways might have to undergo changes earlier in preparation for the morphogenetic transition.

The bottom row of Fig. 3 shows the predicted network in D. melanogaster, where only edges whose marginal posterior probabilities exceed the threshold of 0.75 are shown, and self-loops are suppressed.² As there is no gold standard network for the D. melanogaster data, we have to evaluate our prediction with respect to the biological plausibility of the inferred interactions. The predicted network possesses 9 edges, and mlc1 ('myosin alkali light chain1') is the only gene that regulates 3 other genes. The gene mlc1 belongs to the myosin family and the interaction mlc1→twi has already been found in [10]. Moreover, Guo et al. [11] inferred an edge between mlc1 and myo61f for the embryonic phase. Another interaction that has been extracted with the BGMD model is eve→twi, and this interaction has also been reported in other studies [10, 11, 9]. Other interactions have been reported for the majority of developmental phases by other researchers. This is not surprising, as we would expect that a model which keeps the network invariant among segments infers those edges that are present for the majority of phases, e.g., the Het-DBN-SI model [9] extracted the interaction twi→myo61f for all except the larval phase and the interaction actn→gfl for all except the embryonic phase.

² Six self-loops were inferred: eve→eve, sls→sls, mhc→mhc, prm→prm, actn→actn, and msp→msp.

Figure 3. Top: Posterior probabilities of the changepoint locations inferred with the Gibbs sampler for the D. melanogaster data. The transition probabilities (vertical axis) are plotted against the time axis (horizontal axis). The main transitions (embryonic to larval, larval to pupal, and pupal to adult) are indicated by vertical dashed lines. Bottom: Predicted gene network in D. melanogaster.

6. CONCLUSIONS

This work expands and improves an earlier paper on the BGMD DBN model for non-homogeneous regulatory processes [1]. In [1] a RJMCMC sampling scheme for sampling from the posterior distribution was presented. We have proposed a new Gibbs sampling scheme for the BGMD DBN model, and we have demonstrated that the proposed scheme achieves a substantial improvement in terms of convergence and mixing. The application to gene expression time series from D. melanogaster has led to a plausible data segmentation, and the reconstructed network shows features that are consistent with the literature.

7. REFERENCES

[1] M. Grzegorczyk and D. Husmeier, “Modelling nonstationary gene regulatory processes with a nonhomogeneous dynamic Bayesian network and the change point process,” WCSB, vol. 6, 2009.

[2] M. Grzegorczyk and D. Husmeier, “Nonhomogeneous dynamic Bayesian networks for continuous data,” Machine Learning, vol. 83, pp. 355–419, 2011.

[3] P. Fearnhead, “Exact and efficient Bayesian inference for multiple changepoint problems,” Statistics and Computing, vol. 16, pp. 203–213, 2006.

[4] D. Geiger and D. Heckerman, “Learning Gaussian networks,” UAI, vol. 10, pp. 235–243, 1994.

[5] P. Green, “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika, vol. 82, pp. 711–732, 1995.

[6] K. Sachs, O. Perez, D. A. Pe'er, D. A. Lauffenburger, and G. P. Nolan, “Protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, pp. 523–529, 2005.

[7] M. Arbeitman, E. Furlong, F. Imam, E. Johnson, B. Null, B. Baker, M. Krasnow, M. Scott, R. Davis, and K. White, “Gene expression during the life cycle of Drosophila melanogaster,” Science, vol. 297, no. 5590, pp. 2270–2275, 2002.

[8] J. Robinson and A. Hartemink, “Non-stationary dynamic Bayesian networks,” NIPS, vol. 21, pp. 1369–1376, 2009.

[9] F. Dondelinger, S. Lèbre, and D. Husmeier, “Heterogeneous continuous dynamic Bayesian networks with flexible structure and inter-time segment information sharing,” ICML, vol. 27, 2010.

[10] W. Zhao, E. Serpedin, and E. R. Dougherty, “Inferring gene regulatory networks from time series data using the minimum description length principle,” Bioinformatics, vol. 22, no. 17, pp. 2129–2135, 2006.

[11] F. Guo, S. Hanneke, W. Fu, and E. P. Xing, “Recovering temporally rewiring networks: a model-based approach,” ICML, vol. 24, pp. 321–328, 2007.


DETERMINATIVE POWER AND TOLERANCE TO PERTURBATIONS IN BOOLEAN NETWORKS

Reinhard Heckel¹, Steffen Schober² and Martin Bossert²

¹ Department of Information Technology and Electrical Engineering, ETH Zurich
² Institute of Telecommunications and Applied Information Theory, University of Ulm
[email protected], [email protected], [email protected]

ABSTRACT

Consider a large Boolean network with a feedforward structure. Given a probability distribution on the inputs, can one find collections of input nodes, possibly small, that determine the states of most other nodes in the network? To answer this question, a notion that quantifies the determinative power of an input over the states in the network is needed. We argue that the mutual information (MI) between a given subset of the inputs X = {X1, ..., Xn} of some node i and its associated function fi(X) quantifies the determinative power of this subset of inputs over node i. We compare the determinative power of a set of input nodes to the sensitivity to perturbations of these input nodes, and find that, maybe surprisingly, an input that has a large sensitivity to perturbations does not necessarily have large determinative power. However, for unate functions, which play an important role in genetic regulatory networks, we find a direct relation between MI and sensitivity to perturbations. As an application of our methods, we analyze the large-scale regulatory network of E. coli numerically: We identify the most determinative nodes and show that a small set of those reduces the overall uncertainty of network states significantly.

1. INTRODUCTION

A Boolean network (BN) is a discrete dynamical system, which is often used to study and model a variety of biochemical networks. BNs have been introduced by Kauffman [1] as models of gene regulatory networks. Amongst others, they are used to model large-scale networks such as the Escherichia coli regulatory network [2], which is analyzed in Sec. 4. In the analysis of BNs, it is common to consider measures that quantify the effect of perturbations, whereas determinative power has not received much attention, even though there are several settings where such a notion is of interest, e.g., the following.
Given a feedforward network where the states of the nodes are controlled by the states in the input layer, we might ask whether a possibly small set of inputs suffices to determine most states, i.e., reduces the uncertainty about the network’s states significantly. This can be addressed by quantifying the determinative power of the input nodes. For example, in the E. coli regulatory network it turns out that a small set of metabolites and other inputs determine most genes that account for E. coli’s metabolism (see Sec. 4).


In this paper, we view the state of each node in the network as an independent random variable. This modeling assumption applies, e.g., to networks with a tree-like topology, and is standard when studying the effect of perturbations. For this setting, determinative power of nodes and measures of perturbations are properties of single functions, hence the analysis of the BN reduces to the analysis of single functions. As the main tool for the analysis, we use Fourier analysis of Boolean functions. Fourier analytic techniques were first applied in the context of Boolean networks by Kesseli et al. [3], and also in [4]. However, the setting and problems addressed in [3, 4] are different from the problem considered here. The contributions of this paper are as follows. We argue that the MI between a set of nodes and the state of a node is a measure of the determinative power of this set of inputs, as MI is a measure that quantifies the mutual dependence of random variables. If a set of inputs to a node and the state of this node are strongly mutually dependent, then this set can be viewed as having large determinative power over this node. To understand determinative power and mutual dependencies in Boolean networks better, we systematically study the MI between sets of inputs and the state of a node. We relate the mutual information to measures of perturbations, and prove that, maybe surprisingly, a set of inputs that is highly sensitive to perturbations does not necessarily have determinative power. Conversely, an input that has determinative power must be sensitive to perturbations to some extent. These results are proven using Fourier analytic techniques. Moreover, we show that for the class of unate functions, which model functional dependencies in gene regulatory networks well, any input and the function's output are statistically dependent. For unate functions we also prove a direct relation between the mutual information and the influence of a variable.
As an application of the theoretical results in this paper, we show that mutual information can be used to identify the determinative nodes in the large-scale model of the control network of E. coli’s metabolism [2]. Due to limited space, proofs and a more detailed exposition are omitted but can be found in the preprint [5]. 2. PRELIMINARIES We start by shortly stating some standard facts about BNs and Fourier analysis of Boolean functions and introduce

notation. A Boolean network (BN) can be viewed as a collection of n nodes with memory. The state of a node i is described by a binary state xi (t) 2 { 1, +1} at discrete time t 2 N. Choosing the alphabet to be { 1, +1} rather then {0, 1} as more common in the literature on BNs, will turn out to be advantageous later. However, both choices are equivalent. In most of the Boolean network models used in biology, fi (x) does not depend on all arguments x1 , ..., xn , but on a small subset only. Obviously, to study determinative power and tolerance to perturbations, a probabilistic setup is needed. In our analysis, we assume that each state is an independent random variable Xi which follows the distribution Pr [Xi = xi ] , xi 2 { 1, +1}. The assumption of independence holds for networks with tree-like topology, but is not feasible for networks with strong local dependencies. In many relevant settings a BN has a tree-like topology, for instance the E. coli network which is analyzed in Sec. 4.

changes the function’s output. Hence influence captures the effect of a single perturbation of input i. In Boolean networks it is common to study the sum of all influences, i.e., the average sensitivity of function f . The average sensitivity of f to the variables in set A is defined as X IA (f ) , Ii (f ) i2A

and captures whether flipping an input, chosen uniformly at random from the set A affects the function’s output. Most commonly all inputs are taken into account, i.e., the average sensitivity of f , as(f ) , I{1,...,n} (f ) is studied. The average sensitivity with respect to A (and hence the influence, by setting A = {i}) can be expressed in terms of Fourier coefficients as X X 1 IA (f ) = fˆ(S)2 (1) 2. S✓[n]

Notation. We use [n] for the set {1, 2, ..., n}, and P all sets occurring in this paper are subsets of [n]. With S✓A we mean the sum over all sets S that are subsets of A. Throughout this paper, we use capital letters for random variables, e.g., X, and lower case letters for their realizations, e.g., x. Boldface letters denote vectors, e.g., X is a random vector, and x its realization. For a vector x and a set A ✓ [n], xA denotes the subvector of x corresponding to the entries indexed by A. Fourier Analysis of Boolean Functions . Let X = (X1 , ..., Xn ) be a binary, product distributed random vector, i.e., the entries of X are independent random variables Xi , i 2 [n] with distribution Pr [Xi = xi ] , xi 2 { 1, +1}. Throughout this paper, probabilities Pr [·] and expectations E[·] are with respect to a product distributed X. We denote p pi , Pr [Xi = 1], Var (Xi ) as the variance of Xi , , Var (Xi ) as its standard deviation and finally, i µi , E[Xi ] as its mean. The inner product of the Boolean functions f, g : { 1, +1}n ! { 1, +1} with respect to the distribution of X is defined as p hf, gi , E[f (X)g(X)] which induces the norm kf k = hf, f i. An orthonormal basis is given by the functions [6] Y xi µi , S ✓ [n] \ ; S (x) = i2S

i

and S (x) = 1, S = ;. Thus, each function f can P ˆ be uniquely expressed as f (x) = S✓[n] f (S) S (x), where fˆ(S) , hf, S i are the Fourier coefficients. Note that this is a representation of the function f as a multilinear polynomial, and the Fourier coefficients are the coefficients of that polynomial. Influence and Average Sensitivity. Next, we discuss measures of perturbations and their relation to the Fourier spectrum. We start with the influence of variable i, which is defined as [6] Ii (f ) , Pr [f (X) 6= f (X ei )] , where x ei is the vector obtained from x by flipping its ith entry. By definition, the influence of variable i is the probability that a perturbation of input i, i.e., flipping input i,

32

i2S\A

i

From (1) (by setting A = {1, .., n}) it becomes apparent that the average sensitivity as(f ) is large if the sum over the squared Fourier coefficients fˆ(S)2 of high deP gree d = |S|, is large. As S✓[n] fˆ(S)2 = 1, the terms fˆ(S)2 for which the degree d = |S| is small must then be small. Hence for f to be tolerant to single perturbations, i.e., to have a small average sensitivity, the Fourier coefficients must be concentrated on coefficients with low degree. Let’s see an example: Suppose p1 = p2 = p3 = 1/2 and consider the AND3 function, i.e., fAN D3 (x1 , x2 , x3 ) = 1 if and only if x1 = x2 = x3 = 1. The average sensitivity of the AND3 function is as(fAN D3 ) = 0.75. Hence, fAN D3 is tolerant to perturbations. The spectrum of fAN D3 is concentrated on the coefficients of low degree. In contrast, consider the parity of three variables: fP ARIT Y 3 (x1 , x2 , x3 ) = x1 x2 x3 , for which as(fP ARIT Y 3 ) = 3. Hence, PARITY3 is maximal sensitive to perturbations. The spectrum of the PARITY3 function is maximal concentrated on the coefficient of highest degree as fˆ({1, 2, 3}) = 1. 3. MAIN RESULTS In this section, we study the mutual information MI(f (X); XA ) between f (X) and XA , where XA consists of the entries of X corresponding to the indices in the set A ✓ [n]. Proofs are omitted due to space limitations; those and further details can be found in [5]. We start by defining the mutual information. Mutual information is the reduction of uncertainty of a random variable Y due to the knowledge of X, hence we define a measure of uncertainty first, which is entropy. As a reference for the following definitions see [7]. The entropy H(X) of a discrete randomP variable X with alphabet X is defined as H(X) , x2X Pr [X = x] log2 Pr [X = x] . The conditional entropy H(Y |X) of a pair of discrete and jointly distributed P random variables (Y, X) is defined as H(Y |X) , x2X Pr [X = x] H(Y |X = x). 
Finally, the mutual information $\mathrm{MI}(Y; X)$ between $Y$ and $X$ is defined as $\mathrm{MI}(Y; X) \triangleq H(Y) - H(Y|X)$. For a binary random variable $X$ with alphabet $\mathcal{X} = \{x_1, x_2\}$ and $p \triangleq \Pr[X = x_1]$, we have $H(X) = h(p)$, where $h(p)$ is the binary entropy function, defined as

$$h(p) \triangleq -p \log_2(p) - (1 - p) \log_2(1 - p). \tag{2}$$

Figure 1. $\mathrm{MI}(f(X); X_i)$ as a function of $\hat{f}(\{i\})$ and $\hat{f}(\emptyset)$ for $p_i = 0.3$.

Mutual information is a measure of determinative power for the following reasons. Consider a single variable $X_i$ of the argument $X$: if knowledge of $X_i$ reduces the uncertainty of $f(X)$, then $X_i$ determines the state of $f(X)$ to some extent, because knowledge about the state of $X_i$ then helps in predicting $f(X)$. Furthermore, we require from a measure of determinative power that not all variables can have large determinative power simultaneously. This is guaranteed for mutual information, as

$$\sum_{i=1}^{n} \mathrm{MI}(f(X); X_i) \le \mathrm{MI}(f(X); X) \le 1, \tag{3}$$

which follows from the chain rule [7] of mutual information and the independence of the variables $X_i$, $i \in [n]$. Hence, if $\mathrm{MI}(f(X); X_i)$ is large, i.e., close to 1, we can be sure that $X_i$ has some determinative power over $f(X)$, since (3) implies that $\mathrm{MI}(f(X); X_j)$ must be small for $j \neq i$. Influence lacks this property: each input can have large influence. An example is the parity function, where each input has influence 1. If variable $i$ has large influence, this just implies that input $i$ has the power to change the output, but not to determine it.

Our results are based on the following novel characterization of the mutual information in terms of Fourier coefficients: Let $X$ be product distributed and let $X_A = \{X_i : i \in A\}$ be a fixed set of arguments, where $A \subseteq [n]$. Then

$$\mathrm{MI}(f(X); X_A) = h\!\left(\tfrac{1}{2}\bigl(1 + \hat{f}(\emptyset)\bigr)\right) - \mathrm{E}\!\left[h\!\left(\tfrac{1}{2}\Bigl(1 + \sum_{S \subseteq A} \hat{f}(S)\,\Phi_S(X_A)\Bigr)\right)\right], \tag{4}$$

where $h(\cdot)$ is the binary entropy function as defined in (2). Let us start by discussing $\mathrm{MI}(f(X); X_i)$, based on (4). As seen from (4), $\mathrm{MI}(f(X); X_i)$ depends only on $\hat{f}(\{i\})$, $\hat{f}(\emptyset)$ and $p_i$. In Figure 1 we depict $\mathrm{MI}(f(X); X_i)$ for $p_i = 0.3$ as a function of $\hat{f}(\{i\})$ and $\hat{f}(\emptyset)$. It can be seen that $\mathrm{MI}(f(X); X_i) = 0$, i.e., $f(X)$ and $X_i$ are statistically independent, if and only if $\hat{f}(\{i\}) = 0$. Furthermore, it is seen that $\mathrm{MI}(f(X); X_i)$ is increasing in $|\hat{f}(\{i\})|$. Both observations can be proven rigorously. Hence $X_i$ has large determinative power, i.e., $\mathrm{MI}(f(X); X_i)$ is large, if and only if $|\hat{f}(\{i\})|$ is large (i.e., close to one). Next, let us consider the (trivial) case where $A = [n]$ and hence $X_A = X$. Then $\mathrm{MI}(f(X); X) = h\bigl(\tfrac{1}{2}(1 + \hat{f}(\emptyset))\bigr)$. It follows that $\mathrm{MI}(f(X); X)$ is maximized for $\hat{f}(\emptyset) = 0$, i.e., $\Pr[f(X) = 1] = 1/2$, i.e., if the variance of $f(X)$ is 1. In general, the closer $\hat{f}(\emptyset)$ is to zero, the larger the mutual information between a function's output and all its inputs. We continue by studying the relation of mutual information and average sensitivity.

Theorem 1. For any Boolean function $f$, for any product distributed $X$,

$$I_A(f) \ge \min_{i \in A} \frac{1}{\sigma_i^2} \Bigl(\mathrm{MI}(f(X); X_A) - \Psi\bigl(\mathrm{Var}(f(X))\bigr)\Bigr)$$

with

$$\Psi(x) \triangleq x^{1/\ln(4)} - x. \tag{5}$$

The term $\Psi(\mathrm{Var}(f(X)))$ should be understood as an error term which satisfies $0 \le \Psi(\mathrm{Var}(f(X))) < 0.12$ and which is close to zero for situations (i.e., functions and distributions of $X$) of interest. Theorem 1 shows that a large value of $\mathrm{MI}(f(X); X_A)$ implies that $f$ must be sensitive to perturbations of the entries of $X_A$. Moreover, if $I_A(f)$ is small, i.e., if $f$ is tolerant to perturbations of the entries of $X_A$, then $\mathrm{MI}(f(X); X_A)$ must be small, i.e., the entries of $X_A$ do not have large determinative power. For the case that $A = [n]$, Theorem 1 states that $\mathrm{as}(f)$ is lower-bounded by $\mathrm{MI}(f(X); X)$ minus some small term. Again, we discuss the special case that $A = \{i\}$. Theorem 1 evaluated for the case that $A = \{i\}$ yields that

$$I_i(f) \ge \frac{1}{\sigma_i^2} \Bigl(\mathrm{MI}(f(X); X_i) - \Psi\bigl(\mathrm{Var}(f(X))\bigr)\Bigr),$$

which shows that if $\mathrm{MI}(f(X); X_i)$ is large, then $I_i(f)$ is also large. This confirms the intuition that if an input determines $f(X)$ to some extent, the function must also be sensitive to errors in that input. Conversely, an input $i$ can have large influence and still $\mathrm{MI}(f(X); X_i) = 0$. An example of such a function is the PARITY function, where $I_i(f) = 1$ and $\mathrm{MI}(f(X); X_i) = 0$. Interestingly, the influence also has an information-theoretic interpretation:

$$I_i(f) = \frac{H\bigl(f(X) \mid X_{[n] \setminus \{i\}}\bigr)}{H(X_i)},$$

which shows that the influence of a variable is a measure for the uncertainty of the function's output that remains if all variables except variable $i$ are set. Finally, we characterize statistical independence of $f(X)$ and a set of its arguments $X_A$ in terms of Fourier coefficients. This result generalizes a theorem derived by Xiao and Massey [8] from uniform to product distributed $X$.
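The AND3 and PARITY3 examples, and the contrast between influence and mutual information, can be checked by brute-force enumeration. The following Python sketch is illustrative only: it assumes uniformly distributed inputs ($p_i = 1/2$) and the $\{-1, +1\}$ encoding, so that the basis functions reduce to products of the selected inputs. All function names are hypothetical and not taken from [5].

```python
from itertools import product
from math import log2, prod

def h(p):
    """Binary entropy function h(p) from eq. (2), with h(0) = h(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def inputs(n):
    """All 2^n input vectors in {-1,+1}^n (assumption: uniform distribution)."""
    return list(product((-1, 1), repeat=n))

def fourier_coefficient(f, n, S):
    """f^(S) = E[f(X) * prod_{i in S} X_i] for uniform X."""
    xs = inputs(n)
    return sum(f(x) * prod(x[i] for i in S) for x in xs) / len(xs)

def influence(f, n, i):
    """I_i(f) = Pr[f(X) != f(X with entry i flipped)]."""
    xs = inputs(n)
    flipped = lambda x: x[:i] + (-x[i],) + x[i + 1:]
    return sum(f(x) != f(flipped(x)) for x in xs) / len(xs)

def average_sensitivity(f, n):
    """as(f): sum of the influences of all n variables."""
    return sum(influence(f, n, i) for i in range(n))

def mutual_information(f, n, i):
    """MI(f(X); X_i) = H(f(X)) - H(f(X) | X_i)."""
    xs = inputs(n)
    p_one = sum(f(x) == 1 for x in xs) / len(xs)
    conditional = 0.0
    for v in (-1, 1):
        sub = [x for x in xs if x[i] == v]
        conditional += 0.5 * h(sum(f(x) == 1 for x in sub) / len(sub))
    return h(p_one) - conditional

def and3(x):
    return 1 if all(v == 1 for v in x) else -1

def parity3(x):
    return x[0] * x[1] * x[2]

if __name__ == "__main__":
    for name, f in (("AND3", and3), ("PARITY3", parity3)):
        print(name,
              "as =", average_sensitivity(f, 3),
              "I_1 =", influence(f, 3, 0),
              "MI(f; X_1) =", round(mutual_information(f, 3, 0), 4),
              "f^({1}) =", fourier_coefficient(f, 3, (0,)))
```

For PARITY3 every input has influence 1 while its degree-one Fourier coefficients and its single-variable mutual information vanish; for AND3 the degree-one coefficients are nonzero and so is the mutual information.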

Theorem 2. Let $A \subseteq [n]$ be fixed, $f$ be a Boolean function and $X$ be product distributed. Then $f(X)$ and the inputs $X_A = \{X_i : i \in A\}$ are statistically independent if and only if $\hat{f}(S) = 0$ for all nonempty $S \subseteq A$.

Theorem 2 shows that if a function is concentrated on the coefficients of low degree $d = |S|$, which is the case for functions that are tolerant to perturbations, then small sets of inputs and the function's output are statistically dependent.

Unate Functions. A Boolean function $f$ is said to be unate in variable $x_i$ if for each $x = (x_1, \ldots, x_n) \in \{-1, +1\}^n$ and for some fixed $a_i \in \{-1, +1\}$, $f(x_1, \ldots, x_i = -a_i, \ldots, x_n) \le f(x_1, \ldots, x_i = a_i, \ldots, x_n)$. The function $f$ is said to be unate if $f$ is unate in each variable $x_i$. For example, each linear threshold function and each nested canalizing function is unate, and one can suppose that the majority of regulatory interactions in a biological network are unate. The basic argument is that if an element acts either as a repressor or as an activator for some gene, but never as both, then the function determining the gene's state is unate by definition. For unate functions, we have that $\hat{f}(\{i\}) = a_i \sigma_i I_i(f)$, $\forall i \in [n]$, where $a_i \in \{-1, +1\}$ is the parameter given in the definition above. The proof goes along the same lines as the proof for monotone functions in [6, Lem. 4.5]. With (4) this yields an explicit relation between $I_i(f)$ and $\mathrm{MI}(f; X_i)$, based on which we find that for unate functions, the mutual information $\mathrm{MI}(f; X_i)$ is increasing in the influence $|I_i(f)|$. Moreover, if $f$ is unate and $x_i$ is a relevant variable, i.e., a variable on which the function actually depends, then $|\hat{f}(\{i\})| > 0$. We furthermore find that if $f$ is unate, the statement "$x_i$ is a relevant variable" is equivalent to $\mathrm{MI}(f(X); X_i) \neq 0$. In a Boolean model of a biological regulatory network, this implies that if the functions in the network are unate, then a regulator and the target gene must be statistically dependent.

4. E. COLI REGULATORY NETWORK

In [2], the authors presented a complex computational model of the E. coli transcriptional regulatory network that controls central parts of the E. coli metabolism. The network consists of 798 nodes and 1160 edges and has a layered feed-forward structure, i.e., no feedback loops exist. The 133 elements in the first layer can be viewed as the inputs of the system, and the elements in the following 7 layers are interacting genes representing the internal state of the system. Our investigations showed that all functions are unate, which is a non-typical property of the network. We identified the input nodes that have large determinative power using the MI. To this end, we define the determinative power of input $X_j$ over the states in the network as

$$D(j) \triangleq \sum_{i=1}^{m} \mathrm{MI}(f_i(X); X_j),$$

where the sum is over all $m$ nodes that represent genes, and hence are functions of the input nodes' states. We assumed that $\Pr[X_j = 1] = 1/2$ and computed $D(j)$ for each input variable. We found that $D(j)$ is large for only a few inputs, such as the variables o2 xt (36.9 bit), leul xt (20.9 bit) and glc-d xt (19.3 bit) (here we adopted the names from the original dataset), but is small for most other variables. From the previous section, it is clear that this cannot be explained solely by the fact that nodes with large values of $D(j)$ tend to have many outgoing edges, while most other nodes do not. This is also what we observed from analyzing the E. coli network: e.g., the state variable glc-d xt has 99 outgoing edges but D(glc-d xt) = 19.3 bit, whereas variable o2 xt has out-degree 72 but D(o2 xt) = 36.9 bit. Next, let $X_{\tau(1)}, \ldots, X_{\tau(l)}$ be the inputs with the $l$ largest determinative powers. To see whether knowledge about a small set of those reduces the entropy of the network's states significantly, we computed $H(\mathbf{Y} \mid X_{\tau(1)}, \ldots, X_{\tau(l)})$ as a function of $l$ and found that knowledge of merely the states of the most determinative nodes reduces the uncertainty about the network's states significantly. The quantity $H(\mathbf{Y} \mid X_{\tau(1)}, \ldots, X_{\tau(l)})$ can be interpreted as a measure of the size of a subset of the overall state space where the system is likely to be found, given knowledge about the states $X_{\tau(1)}, \ldots, X_{\tau(l)}$ [5].

5. REFERENCES

[1] S. Kauffman, Homeostasis and differentiation in random genetic control networks, Nature 224 (5215) (1969) 177–178.
[2] M. W. Covert, E. M. Knight, J. L. Reed, M. J. Herrgard, B. O. Palsson, Integrating high-throughput and computational data elucidates bacterial networks, Nature 429 (6987) (2004) 92–96.
[3] J. Kesseli, P. Rämö, O. Yli-Harja, On spectral techniques in analysis of Boolean networks, Phys. D: Nonlin. Phenom. 206 (1-2) (2005) 49–61.
[4] A. S. Ribeiro, S. A. Kauffman, J. Lloyd-Price, B. Samuelsson, J. E. S. Socolar, Mutual information in random Boolean models of regulatory networks, Phys. Rev. E 77 (1) (2008) 011901.
[5] R. Heckel, S. Schober, M. Bossert, Harmonic analysis of Boolean networks: Determinative Power and Perturbations, arXiv:1109.0807v1, (2010).
[6] N. H. Bshouty, C. Tamon, On the Fourier spectrum of monotone functions, J. ACM 43 (4) (1996) 747–770.
[7] T. M. Cover, J. A. Thomas, Elements of Information Theory, 2nd Edition, Wiley-Interscience, 2006.
[8] G. Xiao, J. Massey, A spectral characterization of correlation-immune combining functions, IEEE Trans. Inf. Theory 34 (3) (1988) 569–571.
[9] L. Raeymaekers, Dynamics of Boolean networks controlled by biologically meaningful functions, J. Theor. Biol. 218 (3) (2002) 331–341.

ALGORITHM FOR IN SILICO OPTIMIZATION OF PRODUCTION STRAINS

Elli Heikkinen1, Antti Larjo1, Ville Santala2, Olli Yli-Harja1, and Tommi Aho1

1 Department of Signal Processing, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland
2 Department of Chemistry and Bioengineering, Tampere University of Technology, P.O. Box 541, FI-33101 Tampere, Finland

ABSTRACT

Several computational approaches for designing efficient microbial production systems have been introduced during the last decade. This paper presents an algorithm that attempts to identify pathway modifications of microbial metabolism for enhanced production rates of desired metabolites coupled with growth. These modifications can include gene deletions in the wild-type genome and reaction additions from a set of non-native reactions from KEGG. The algorithm was tested by optimizing acetate production in the metabolic model of Acinetobacter baylyi ADP1. Considerable enhancements in both acetate and biomass production rates were predicted by implementing non-native reaction additions suggested by the algorithm. The results show that the algorithm has potential as a computational tool in designing metabolic engineering strategies.

1. INTRODUCTION

In recent years, metabolic engineering of microorganisms for enhanced production of desired metabolites has received great attention. This is because microbial production makes it possible to produce diverse compounds, including fuels, biomolecules and drugs, from sustainable sources, at a time of growing concern about environmental and energy issues. More importantly, the possibility of better economic efficiency in the production of more complex molecules, compared to traditional chemical methods, has encouraged researchers to deploy metabolic engineering for industrial production [1]. Already, a variety of compounds are being produced industrially by microbial production systems [2], and great effort is being put into developing many others [3, 4]. Over the recent decade, traditional random mutagenesis and experimental selection have been replaced by metabolic engineering. To understand the characteristics and production capability of the cell under certain genotypic and environmental conditions, computational models and simulations play an important role [5].

The number of organisms for which genome-scale metabolic models have been reconstructed is increasing owing to the growing amount of whole-genome sequencing data. Genome-scale metabolic models are already available for over 30 organisms, such as Clostridium acetobutylicum [6], Saccharomyces cerevisiae [7] and Escherichia coli [8]. Simultaneously, computational methods to simulate, analyze and predict metabolic phenotypes have been developed [9]. It follows that both the models themselves and the methods for analyzing metabolism are becoming more accurate and have been able to predict the consequences of genetic modifications [10, 11].

Several computational methods have been developed for analyzing models and for performing more systematic strain design [12, 13, 14]. Most of them provide strain design only through identification of gene knockouts or reaction activation/inhibition, but not through gene or reaction additions [12, 15]. In this paper, an algorithm for in silico strain design that also searches for reaction additions is presented. A framework called BNICE (Biochemical Network Integrated Computational Explorer) discovers novel biochemical pathways in a way somewhat similar to our approach, but it concentrates more on designing completely new reactions and does not search for gene deletions [16]. The approach presented here employs Genetic Design through Local Search (GDLS) [14] together with a search of feasible reaction additions from a set of candidate reactions. In addition to strain design, our approach provides predictions of maximized growth and production rates. We demonstrate our algorithm by designing modifications to Acinetobacter baylyi ADP1 metabolism with the aim of maximal acetate production. The results show that the algorithm has potential as a tool in computational strain design.

2. MATERIALS AND METHODS

The algorithm was implemented using MATLAB. A few ready-made software packages and databases running in the MATLAB environment were utilized. The model handling and the simulation of maximal cellular growth using flux balance analysis (FBA) were done using the constraint-based reconstruction and analysis (COBRA) toolbox [9]. Additionally, a method called Genetic Design through Local Search (GDLS) [14] is employed to identify favourable gene knockouts that couple cellular growth and objective metabolite production. A collection composed of 6626 individual candidate reactions was compiled from Kyoto

Encyclopedia of Genes and Genomes (KEGG) reaction database [17]. The algorithm performance was tested by maximizing acetate production in the genome-wide metabolic model of A. baylyi ADP1 [18]. A. baylyi is a widespread soil bacterium whose genome reveals several similarities to that of Escherichia coli. It is nutritionally versatile and has a compact, easily transformable genome, which makes it a potential choice as the production host of valuable biochemicals [18].

2.1. Algorithm procedure

The main idea of the algorithm is that it searches for non-native reaction pathways and gene deletions that both couple and improve the growth and/or product formation rate. The algorithm operates iteratively. At every iteration cycle, it attempts to find reaction additions and gene deletions that improve the current solution. As a result, the growth and product formation rates gradually improve. The outline of one iteration cycle is presented below:

Step 1. Identify non-native pathways starting from and ending at central metabolism metabolites using breadth-first search (BFS) over the candidate reaction set. KEGG identifiers were used to match the metabolites of the candidate reactions to those of the metabolic model. The pathways were constrained to consist of at most k reactions (often k = 1, ..., 3). In each iteration, the starting metabolite for the search is different.

Step 2. Screen out the non-native pathways that are non-functional in the host. A pathway can be non-functional because it contains dead-end metabolites or because it is stoichiometrically unbalanced. Pathway functionality is inspected by adding the pathway to the host model and simulating the maximal material flux through the pathway using FBA.

Step 3. From the functional non-native pathways found, find a set of pathways that, when acting together with gene knockouts identified by GDLS, couples the production of the desired metabolite with biomass formation. This is done by adding each pathway to the host and running GDLS, which predicts the maximal growth and product formation rates and the gene knockouts that are needed to produce that result.

Step 4. Select the best solution identified by GDLS and compare it to the current solution. If GDLS has succeeded in coupling the biomass and objective flux, i.e., the flux solution space has shifted so that maximal growth and objective flux are coupled and improved in comparison to the current solution, then implement the suggested modifications in the full model. Otherwise, no modifications are made to the full model. An additional requirement for the predicted solution is that the predicted growth rate is at least 10% of that of the wild type.

Step 5. Take another starting metabolite for the non-native pathway search and start again from Step 1 until all central metabolism metabolites have been gone through.

2.2. Reduction to central metabolism

The non-native pathway search could be performed using the full model. However, reaction additions are to be targeted to central metabolism and not to the periphery of the metabolism, where their effect might be negligible and have no connection to the objective metabolite synthesis. Therefore, central metabolism is used as the starting and finishing point of each new reaction pathway search. This also reduces the computational complexity of the algorithm as the search space becomes smaller. Central metabolism of a metabolic network is conventionally considered the most complex, core part of the metabolism, and it includes the most important pathways and metabolites. It has also been called a 'giant strong component' [19]. In our case, the central metabolism contains only the most vital reactions and metabolites, i.e., those that are always active independent of the medium used in growth simulations. We simulated 72 different conditions in which the cell was viable, and reactions that were not active in all simulations were left out. The central metabolism of A. baylyi then contained 389 reactions and 438 metabolites, while the full model had 996 reactions and 828 metabolites.

2.3. Curation of candidate reactions

Candidate reactions were compiled from the Kyoto Encyclopedia of Genes and Genomes (KEGG) reaction database [17]. Because FBA requires a definite structure of the stoichiometric matrix, some reactions were excluded from our candidate reaction set. Thus, reactions that, e.g., had unspecified numbers as coefficients or polymer metabolites with an unspecified number of units were screened out. Accepted reactions were parsed into a stoichiometric matrix. All reactions were considered reversible. While searching for candidate pathways to be added into the host, certain connections through various small molecules, also called 'currency metabolites', such as ATP, NADH, and H2O, were ignored. These small molecules usually take part in numerous reactions. Thus, most metabolites can be connected through them, but these connections do not represent real feasible product formation pathways [19]. Reactions that were finally added into the full model were complete.

3. RESULTS AND DISCUSSION

In the attempt to enhance acetate production, a 2-fold increase in the maximal production rate (from 500 to 1000 mmol/gDW/hr) was obtained according to FBA. Additionally, there was an approximately 2.5-fold improvement in the maximal biomass formation rate (from 12.3 to 30.7 1/hr). The uptake rate of the carbon source, succinate, was constrained between 0 and 1000 mmol/gDW/hr. The results are shown by production envelopes of both the wild-type and the mutant model in Figure 1. Surprisingly, the design strategy found by the algorithm did not include any gene knockouts but only reaction additions. In total, 27 reaction pathways, each consisting of 1-3 reactions, were added to reach the solution.

Figure 1. Acetate production envelopes as a function of the biomass formation rate of the wild-type A. baylyi model and the mutant model. The curves are produced by the COBRA Toolbox function that computes the minimal and maximal acetate production rates at different rates of biomass production.

These included pathways in coenzyme A (CoA) biosynthesis, amino acid metabolism and pyrimidine metabolism. All of these processes are essential in the growth, development and reproduction of organisms: amino acids are the building blocks of proteins, the nucleotides in DNA and RNA are derivatives of pyrimidine, and CoA acts as a substrate in several processes such as the citric acid cycle and the synthesis of fatty acids. Comparison with previous studies could not be done since no previous reports on acetate production using reaction additions were found.

The number of reaction additions suggested is quite high, as the number of modifications that can be done experimentally is limited. In theory, all except the essential genes could be deleted, but in practice, additions and deletions in the genome are technically difficult to perform. Additionally, there are only a limited number of available antibiotic resistance genes that are needed to screen out the successfully modified cells. At the moment, at most about 10 modifications can be implemented in a bacterium with reasonable effort.

The validity of the metabolic engineering strategies suggested by the algorithm depends on the validity of the metabolic model used and the correctness of the reactions in the non-native pathways. First, the metabolic models analyzed by the algorithm lack, for example, kinetic and regulatory constraints, which may result in flux distributions that are infeasible in practice. There have been attempts to incorporate these factors in models and production system design tools [20]. Imposing regulatory constraints makes the models better representations of the organisms and allows more varied metabolic engineering strategies to be identified. It also reduces the solution space, which makes the results easier to analyze [21]. Second, the models could be improved by refining the candidate reactions in non-native pathways. For example, all reactions in KEGG are marked as reversible, and that assumption was also used in this work. However, a careful inspection of the reactions would probably show that some of the reactions are irreversible in practice. This refined information would help the algorithm to produce more accurate results. Furthermore, elementally unbalanced reactions in reaction databases, especially with respect to hydrogen atoms, have been reported for example by Pharkya et al. (2004) [13].

The computing time of the algorithm depends on the number of paths found in BFS and the size of the central metabolism. Here, using a central metabolism of 438 metabolites, the total runtime varies from 8 hours to 40 hours, depending also on the objective metabolite. The majority of the time is used by GDLS and its SCIP solver [22]. The computations were done using 3.00 GHz Intel Core processors.

In the further development of the algorithm, the aim is the successful coupling of target product and biomass formation and even better computational efficiency. Tests with various microorganisms and products are to be performed and their results validated experimentally. Also, a more detailed analysis of the performance of the algorithm is to be conducted.
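The breadth-first pathway search of Step 1, whose number of returned paths drives the runtime discussed above, can be sketched in a few lines. The Python code below is a minimal illustration and not the authors' MATLAB implementation: the function name `find_pathways`, the data layout (a dictionary mapping reaction names to substrate and product lists) and the toy reactions are all assumptions. It treats every reaction as reversible and ignores connections through currency metabolites, as described in Section 2.3.

```python
from collections import deque

def find_pathways(reactions, central, start, k, currency=frozenset()):
    """Breadth-first search for non-native pathways (hypothetical sketch).

    reactions: dict name -> (substrates, products); treated as reversible.
    central:   set of central-metabolism metabolites (valid end points).
    start:     central metabolite where the search begins.
    k:         maximum number of reactions per pathway.
    currency:  metabolites (e.g. ATP) that must not be used as connections.
    Returns a list of pathways, each a list of reaction names.
    """
    # Build an undirected metabolite graph, skipping currency metabolites.
    adjacency = {}
    for name, (substrates, products) in reactions.items():
        for a in set(substrates) - currency:
            for b in set(products) - currency:
                adjacency.setdefault(a, []).append((name, b))
                adjacency.setdefault(b, []).append((name, a))

    found = []
    queue = deque([(start, [], {start})])
    while queue:
        metabolite, path, seen = queue.popleft()
        if len(path) == k:          # length limit reached, do not extend
            continue
        for reaction, neighbour in adjacency.get(metabolite, []):
            if neighbour in seen or reaction in path:
                continue
            new_path = path + [reaction]
            if neighbour in central:
                found.append(new_path)      # pathway re-enters central metabolism
            else:
                queue.append((neighbour, new_path, seen | {neighbour}))
    return found

# Toy candidate set (invented for illustration only).
toy_reactions = {
    "R1": (["A", "ATP"], ["X", "ADP"]),
    "R2": (["X"], ["B"]),
    "R3": (["A"], ["Y"]),
    "R4": (["Y"], ["Z"]),
    "R5": (["Z"], ["W"]),
    "R6": (["W"], ["B"]),
}

if __name__ == "__main__":
    print(find_pathways(toy_reactions, {"A", "B"}, "A", k=3,
                        currency=frozenset({"ATP", "ADP"})))
```

In this toy example only the two-step route through X fits within k = 3; raising k to 4 also admits the four-step route through Y, Z and W. In the procedure described above, each returned pathway would then be checked for functionality with FBA (Step 2) before being passed to GDLS.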

4. CONCLUSION

In this work, we have developed an algorithm for designing production strains in silico. In contrast to many earlier algorithms that focus on identifying gene deletions, our approach also searches for pathway additions. Our tests show that the approach is capable of production strain design and that it has potential for further development.

5. ACKNOWLEDGMENTS

Matti Karp, Suvi Santala and Noora Männistö are acknowledged for their valuable comments. The work was funded by the Academy of Finland project Butanol from sustainable sources (no. 140018).

6. REFERENCES

[1] M. Gavrilescu and Y. Chisti, "Biotechnology - a sustainable alternative for chemical industry," Biotechnology Advances, vol. 23, pp. 471–499, 2005.

[2] G. Chotani, T. Dodge, A. Hsu, M. Kumar, R. LaDuca, D. Trimbur, W. Weyler, and K. Sanford, "The commercial production of chemicals using pathway engineering," Biochimica et Biophysica Acta - Protein Structure and Molecular Enzymology, vol. 1543, pp. 434–455, 2000.

[3] E. M. Green, "Fermentative production of butanol - the industrial perspective," Current Opinion in Biotechnology, vol. 22, pp. 337–343, 2011.


[4] Z.-J. Zhao, C. Zou, Y.-X. Zhu, J. Dai, S. Chen, D. Wu, J. Wu, and J. Chen, “Development of L-tryptophan production strains by defined genetic modification in Escherichia coli,” Journal of Industrial Microbiology & Biotechnology, vol. 38, pp. 1921–1929, 2011.

[14] D. S. Lun, G. Rockwell, N. J. Guido, M. Baym, J. A. Kelner, B. Berger, J. E. Galagan, and G. M. Church, “Large-scale identification of genetic design strategies using local search,” Molecular Systems Biology, vol. 5, 2009. [15] P. Pharkya and C. Maranas, “An optimization framework for identifying reaction activation/inhibition or elimination candidates for overproduction in microbial systems,” Metabolic Engineering, vol. 8, pp. 1–13, 2006.

[5] J. M. Park, T. Y. Kim, and S. Y. Lee, "Constraints-based genome-scale metabolic simulation for systems metabolic engineering," Biotechnology Advances, vol. 27, pp. 979–988, 2009.

[16] K. C. Soh and V. Hatzimanikatis, “DREAMS of metabolism,” Trends in Biotechnology, vol. 28, no. 10, pp. 501–508, OCT 2010.

[6] J. Lee, H. Yun, A. M. Feist, B. O. Palsson, and S. Y. Lee, “Genome-scale reconstruction and in silico analysis of the Clostridium acetobutylicum ATCC 824 metabolic network,” Applied Microbiology and Biotechnology, vol. 80, pp. 849–862, 2008.

[17] M. Kanehisa and S. Goto, “KEGG: Kyoto Encyclopedia of Genes and Genomes,” Nucleic Acids Research, vol. 28, pp. 27–30, 2000.

[7] A. R. Zomorrodi and C. D. Maranas, “Improving the iMM904 S. cerevisiae metabolic model using essentiality and synthetic lethality data,” BMC Systems Biology, vol. 4, 2010.

[18] M. Durot, F. Le Fevre, V. de Berardinis, A. Kreimeyer, D. Vallenet, C. Combe, S. Smidtas, M. Salanoubat, J. Weissenbach, and V. Schachter, "Iterative reconstruction of a global metabolic model of Acinetobacter baylyi ADP1 using high-throughput growth phenotype and gene essentiality data," BMC Systems Biology, vol. 2, 2008.

[8] J. D. Orth, T. M. Conrad, J. Na, J. A. Lerman, H. Nam, A. M. Feist, and B. O. Palsson, “A comprehensive genome-scale reconstruction of Escherichia coli metabolism-2011,” Molecular Systems Biology, vol. 7, 2011.

[19] H. Ma and A. Zeng, “The connectivity structure, giant strong component and centrality of metabolic networks,” Bioinformatics, vol. 19, pp. 1423–1430, 2003.

[9] J. Schellenberger, R. Que, R. M. T. Fleming, I. Thiele, J. D. Orth, A. M. Feist, D. C. Zielinski, A. Bordbar, N. E. Lewis, S. Rahmanian, J. Kang, D. R. Hyduke, and B. O. Palsson, “Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0,” Nature Protocols, vol. 6, pp. 1290–1307, 2011.

[20] M. W. Covert, N. Xiao, T. J. Chen, and J. R. Karr, “Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli,” Bioinformatics, vol. 24, pp. 2044–2050, 2008.

[10] S. Fong and B. Palsson, “Metabolic gene-deletion strains of Escherichia coli evolve to computationally predicted growth phenotypes,” Nature Genetics, vol. 36, pp. 1056–1058, 2004.

[21] J. Kim and J. L. Reed, “OptORF: Optimal metabolic and regulatory perturbations for metabolic engineering of microbial strains,” BMC Systems Biology, vol. 4, 2010.

[11] N. E. Lewis, K. K. Hixson, T. M. Conrad, J. A. Lerman, P. Charusanti, A. D. Polpitiya, J. N. Adkins, G. Schramm, S. O. Purvine, D. Lopez-Ferrer, K. K. Weitz, R. Eils, R. Koenig, R. D. Smith, and B. O. Palsson, "Omic data from evolved E. coli are consistent with computed optimal growth from genome-scale models," Molecular Systems Biology, vol. 6, 2010.

[22] T. Achterberg, Constraint integer programming, Ph.D. thesis, TU Berlin, July 2007.

[12] A. Burgard, P. Pharkya, and C. Maranas, “OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization,” Biotechnology and Bioengineering, vol. 84, pp. 647–657, 2003. [13] P. Pharkya, A. Burgard, and C. Maranas, “OptStrain: A computational framework for redesign of microbial production systems,” Genome Research, vol. 14, pp. 2367–2376, 2004.


NEW METHODS FOR FINDING ASSOCIATIONS IN LARGE DATA SETS: GENERALIZING THE MAXIMAL INFORMATION COEFFICIENT (MIC)

Tomasz M. Ignac1,2, Nikita A. Sakhanenko2, Alexander Skupin1,2 and David J. Galas1,2

1 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 7, Avenue des Hauts-Fourneaux, L-4362 Esch-sur-Alzette
2 Institute for Systems Biology, 401 Terry Avenue North, Seattle, Washington 98109, USA

{tomasz.ignac, alexander.skupin}@uni.lu
{nsakhanenko, dgalas}@systemsbiology.org

ABSTRACT

We propose here a natural, but substantive, extension of the MIC. Defined for two variables, MIC has a distinct advantage for detecting potentially complex dependencies. Our extension provides a similar means for dependencies among three variables. This itself is an important step for practical applications. We show that by merging two concepts, the interaction information, which is a generalization of the mutual information to three variables, and the normalized information distance, which measures informational sharing between two variables, we can extend the fundamental idea of MIC. Our results also exhibit some attractive properties that should be useful for practical applications in data analysis. Finally, the conceptual and mathematical framework presented here can be used to generalize the idea of MIC to the multi-variable case.

1. INTRODUCTION

Data sets that represent measurements on complex systems often embody functional relationships between variables that are difficult to discover. In a complex system the dependency between measured variables can have a functional form that is itself complex, and can therefore be difficult to detect by standard methods. The maximal information coefficient (MIC) represents an interesting new approach [1] to measuring dependency between two random variables. It is able to capture a wide range of functional associations, which makes it a particularly useful tool for exploring large, complex data sets, and thus it is especially appropriate for investigating large biological data sets with many variables. For example, the yeast-based data set discussed in [2] contains 225 variables representing genetic markers and 374 yeast strains, leading to a potentially huge number of genetic interactions.

In his commentary on reference [1], Speed pointed out that an interesting challenge is presented by this new approach. The challenge is the generalization of MIC, which he called a "correlation for the 21st century" [3], to more than two variables. In this paper, we take up this challenge by making use of our earlier work [2], where we used the concept of interaction information [4, 5], a multivariate generalization of mutual information, to find a suitable binning of a continuous variable while preserving the relationships between the other variables. We use the interaction information idea as a framework for extending MIC from two to three variables. We propose substituting a normalized information distance for mutual information, which is used in MIC, as the key measure of dependence. The approach we propose also offers a clear conceptual starting point for extending the theory of MIC much further.

The central aim of this paper is to provide a theoretical framework for future work on constructing measures of association between three variables. There are two general approaches to this problem. The first is based on information theory methods like MIC and interaction information. The second is the more traditional approach similar to the partial correlation coefficient, which is an extension of the linear correlation between variables X1 and X2 while a third variable Y is fixed at some value [3]. In the current paper we focus our attention on the first approach; however, we briefly discuss possible alternative solutions to the problem.

2. THE MAXIMAL INFORMATION COEFFICIENT

Here we present a brief description of the process of calculating the MIC. A more detailed discussion can be found in [1]. Let $X_1$ and $X_2$ be two continuous random variables, and let $D$ be a set of pairs drawn from the joint probability distribution $P(x_1, x_2)$. Then the value $MIC(D)$, which stands for the MIC computed for the sample set $D$, is the maximal possible mutual information [6] between all possible binnings of these variables. More precisely, let $X_1'$ and $X_2'$ be binned (discretized) versions of $X_1$ and $X_2$. Based on the data set $D$, we can approximate the mutual information, $I(X_1'; X_2')$, between $X_1'$ and $X_2'$. Subsequently, $I(X_1'; X_2')$ is normalized by $\log(\min(|X_1'|, |X_2'|))$, where $|X_i'|$, $i = 1, 2$, is the number of states (bins) of $X_i'$. Finally, $MIC(D)$ is the maximal normalized mutual information estimated over all possible binnings of $X_1$ and $X_2$. An algorithm for approximating this value was presented in the supplementary materials of [1]. An improvement of this algorithm will be one of the most important topics of our future work.

The basic concept behind the idea of MIC is simply that the mutual information is a good measure of association between two variables, and maximizing this discretized measure by choice of binnings produces MIC. It simply finds the most informative binning of the two variables of interest. Unfortunately, however, the use of mutual information in practical applications can be difficult. There are two sources of these difficulties: 1) estimation of mutual information from data is difficult, especially for continuous random variables, and 2) mutual information itself is not a normalized measure; thus, the interpretation of the results may sometimes be problematic [3, 6]. MIC cannot be treated as an approximation of the mutual information between $X$ and $Y$, however. On the other hand, there are two theorems [1] showing that if $X$ and $Y$ are independent or if $X = f(Y)$, then $MIC(D)$ converges to 0 or 1, respectively, as the size of $D$ goes to infinity. What is more, MIC offers a natural normalization of the obtained results.
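The procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the algorithm of [1]: it searches a small family of equal-frequency grids rather than all binnings, and the function names (`mutual_information`, `equal_frequency_bins`, `mic_sketch`) and the `max_bins` limit are hypothetical choices made for this example.

```python
import numpy as np

def mutual_information(x_bins, y_bins):
    """Plug-in mutual information (in bits) of two integer-coded samples."""
    joint = np.zeros((x_bins.max() + 1, y_bins.max() + 1))
    np.add.at(joint, (x_bins, y_bins), 1.0)      # count co-occurrences
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)        # marginal of the first variable
    py = joint.sum(axis=0, keepdims=True)        # marginal of the second variable
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins of (nearly) equal size, by rank."""
    ranks = np.argsort(np.argsort(values))
    return ranks * k // len(values)

def mic_sketch(x, y, max_bins=6):
    """Largest normalized mutual information over a small family of grids.

    Assumption: only equal-frequency grids with 2..max_bins bins per axis
    are tried, which is a strong simplification of the search in [1].
    """
    best = 0.0
    for kx in range(2, max_bins + 1):
        for ky in range(2, max_bins + 1):
            mi = mutual_information(equal_frequency_bins(x, kx),
                                    equal_frequency_bins(y, ky))
            best = max(best, mi / np.log2(min(kx, ky)))
    return best
```

For a noiseless monotone relationship such as `y = x ** 2` on `[0, 1]` the sketch should return a value very close to 1, and for independent samples a value close to 0, in line with the two limiting cases mentioned above.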

3. INTERACTION DISTANCE

To extend MIC we have to have a multivariate generalization of the mutual information. The natural choice is the conditional mutual information $I(X_1; X_2 | Y)$. However, it can be easily shown that this is not the best solution. For example, $I(X_1; X_2 | Y)$ equals zero when either all the three variables are mutually independent or when $X_1$ and $X_2$ are independent given $Y$. These are two completely different situations, which should be differentiated by a good measure, but are not distinguished by conditional mutual information.

In [2] we used the concept of interaction information, $I(X_1; X_2; Y)$ [4, 5, 7], which is defined as the difference between $I(X_1; X_2 | Y)$, the mutual information between $X_1$ and $X_2$ given $Y$, and $I(X_1; X_2)$. The value of $I(X_1; X_2; Y)$ can be either positive or negative, as can be seen from the range of values of these two terms. A positive value suggests a synergy between $X_1$ and $X_2$; i.e., both variables together contain more information about $Y$ than separately. A negative value suggests redundancy between $X_1$ and $X_2$. While the interaction information appears to provide the path to a natural extension of MIC, we note that $I(X_1; X_2; Y)$ is symmetric: for example, $I(X_1; X_2; Y) = I(X_1; Y; X_2)$. Consequently, $I(X_1; X_2; Y)$, a single number describing a relationship between three variables, still does not capture all possible associations between the variables. This is illustrated by Example 3.1.

A better generalization of MIC can be achieved by replacing the mutual information measure, $I(X_1; X_2)$, with the normalized information distance, $d(X_1; X_2)$, and then extending $d(X_1; X_2)$ to three variables using the concept of interaction information. The normalized information distance [8, 9] is a metric defined as

$$ d(X_1; X_2) = \frac{\max[H(X_1 | X_2), H(X_2 | X_1)]}{\max[H(X_1), H(X_2)]}, $$

which can be rewritten as

$$ d(X_1; X_2) = \frac{\max[H(X_1), H(X_2)] - I(X_1; X_2)}{\max[H(X_1), H(X_2)]}. $$

Here, $H(\cdot)$ stands for the entropy. The normalized information distance was defined in [8] in terms of Kolmogorov complexity [10]. However, it can be easily adapted to Shannon's formalism [6]. We want to point out that this distance offers an alternative approach for the normalization of the mutual information between $X_1$ and $X_2$. The extension to three variables, $d(X_1; X_2 . Y)$, can then be made in the same way as in the case of the interaction information:

$$ d(X_1; X_2 . Y) = d(X_1; X_2 | Y) - d(X_1; X_2). $$

We call this quantity the interaction distance by analogy, even though it is not a metric, as opposed to the two-variable form. Here $d(X_1; X_2 | Y)$ is a conditional version of the normalized distance, i.e.,

$$ d(X_1; X_2 | Y) = \frac{\max[H(X_1 | Y), H(X_2 | Y)] - I(X_1; X_2 | Y)}{\max[H(X_1 | Y), H(X_2 | Y)]}. $$

It can be shown that $d(X_1; X_2 | Y)$ is a metric. The proof of that property will be presented in the extended version of this paper. Here, we want to note that the proof for Shannon's form of the distance is relatively simple; i.e., it is an extension of the original proof [8] that $d(X_1; X_2)$ is a metric. This extension is based on the fact that

$$ H(X, Y | Z) = H(X | Z) + H(Y | X, Z). $$

Unfortunately, the Kolmogorov counterpart of this property does not hold exactly [11]. Therefore, it is unclear whether Kolmogorov's version of $d(X_1; X_2 | Y)$ is a metric.

To prove the usefulness of the interaction distance, we need to describe the basic properties of $d(X_1; X_2 . Y)$. In order to do that, we need to go back to a theorem which is a key result of [7]. This theorem describes the behavior of the interaction information in the context of various relationships between the three variables. Here, we present a lemma that is a counterpart of this theorem. The detailed proof is omitted here due to space limitations. The lemma can be treated as a corollary of the theorem.

Lemma 3.1 The following three properties hold for arbitrary functions $f$ and $g$:


1. $-1 \le d(X_1; X_2 . Y) \le 1$.

2. $d(X_1; X_2 . Y) = -1$ if and only if $X_1$, $X_2$ are independent and $X_i = f_j(X_j, Y)$, $i, j = 1, 2$.

3. $d(X_1; X_2 . Y) = 1$ if and only if $X_i = g_i(Y) = f_j(X_j)$, $i, j = 1, 2$.

The first property simply shows the range of the interaction distance. The limit values are obtained if and only if certain functional associations between the three variables occur. The second property implies that $Y$ is fully determined by two independent variables. The third statement describes the situation when $X_1$ and $X_2$ are determined by $Y$. In a more general setting, see [4, 5], when the relation between variables is not functional, negative values of the interaction distance suggest synergy of $X_1$ and $X_2$, while a positive distance means redundancy of the information of these two variables with respect to the conditioning variable $Y$. This also includes the case when $X_1$ and $X_2$ are independent given $Y$. Note that, in contrast to the interaction information, the interaction distance is negative in the case of a synergy between $X_1$ and $X_2$, and positive if the variables are redundant.

Example 3.1 Here we present an example that demonstrates the difference between the interaction information and the interaction distance. Let $X_1$ and $X_2$ be two independent, binary, random variables such that $P(X_i = 0) = 0.5$, $i = 1, 2$. Let us define a third variable $Y$ as follows:

• $Y = 0$ for $X_1 = 0$ and $X_2 = 0$;

• $Y = 1$ for $X_1 = 1$ and $X_2 = 0$;

• $Y = 2$ for $X_1 = 0$ and $X_2 = 1$;

• $Y = 3$ for $X_1 = 1$ and $X_2 = 1$.

Note that the triple $X_1$, $X_2$, $Y$ fulfills the requirements of Lemma 3.1, point 2. This is an obvious case of synergy between $X_1$ and $X_2$; i.e., on one hand, knowledge about the state of only one of these two variables leaves the state of $Y$ uncertain. On the other hand, when the states of both $X_1$ and $X_2$ are known, the state of $Y$ becomes certain. Let us now analyze the behavior of the interaction information and interaction distance in this case. Clearly, $I(X_1; X_2) = 0$; subsequently, elementary calculations reveal that $I(X_1; X_2 | Y) = H(X_i | Y) = 0$ and thus $I(X_1; X_2; Y) = 0$. Since $X_1$ and $X_2$ are independent, $d(X_1; X_2) = 1$; then, we can show that $d(X_1; X_2 | Y) = 0$. To this end, we note that

$$ I(X_1; X_2 | Y) = H(X_i | Y) \quad \text{for } i = 1, 2. $$

Thus, it follows from the definition of the conditional distance that $d(X_1; X_2 | Y) = 0$. Hence, we have

$$ d(X_1; X_2 . Y) = d(X_1; X_2 | Y) - d(X_1; X_2) = 0 - 1 = -1. $$

Note that in the example $d(X_1; X_2 | Y)$ looks like zero over zero. We treat it as zero since this is the special case when the conditional mutual information is equal to the conditional entropy. Thus, $d(X_1; X_2 | Y) = 0$ for arbitrarily small values of these quantities. Hence, in the limit, we obtain the zero interaction distance.

The immediate advantage of using $d(X_1; X_2 . Y)$ is its ability to capture a broader spectrum of relations than the interaction information itself. A positive value of the interaction information indicates synergy between variables [4, 5]. However, the above example shows that the reverse is not always true. To understand the difference between the interaction distance and the interaction information we need to go back to the key theorem of [7]. From this theorem it follows that in situations similar to that in our example the interaction information equals $H(X_i | Y)$. Hence, when the conditional entropy tends to zero, the interaction information follows. We can see that the distance is independent of the values of the entropies and conditional entropies of $X_1$ and $X_2$. This allows us to detect associations that cannot be captured by the interaction information itself. The price for this capacity is that the distance is not symmetric, and not a metric. Thus, sometimes, one may need to consider three cases of conditioning by all variables.

4. CONCLUSIONS AND DISCUSSION

One of the directions for future work is the exploration of the question of the statistical power of MIC. However, this is not specific to the three (or more) variable case. There are two main directions for future work. The first is the practical implementation of the interaction distance. The second is to find possible alternatives to the interaction distance. Further work will involve using the present framework, used here to generalize to three variables, to extend MIC to multi-variable cases.

4.1. Implementation

For a given data set the information distance, $d$, is approximated from the data in a manner very similar to the MIC algorithm [1]. The details of these operations are beyond the scope of this short paper and will be presented in its extended version. We have simply adapted the algorithm presented in [1]. In short, we estimate $d(X_1; X_2 . Y)$ in two steps. In the first step, we maximize $d(X_1; X_2)$ and in the second step we maximize $d(X_1; X_2 . Y)$. Note that the first step is a simple adjustment of the existing MIC algorithm. On the other hand, the second step is more complex, as it requires taking into account the conditioning variable $Y$. The simplest solution here is to impose an equi-partition on the values of $Y$; cf. the supplementary materials of [1]. It is unclear if this solution is generally optimal in practical applications. In [1] a similar approach is used where one of the variables is equi-partitioned. Note that by maximizing $d(X_1; X_2 | Y)$ and $d(X_1; X_2)$ separately, different discretizations of $X_1$ and $X_2$ can be obtained. Consequently, an important direction for future research is to define a context-dependent discretization of a random variable: such a discretization will be different when we change the contextual variables of the variable of interest.
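The calculation in Example 3.1 can be checked numerically with plug-in entropy estimates on already discretized data. The following is a minimal sketch, not the adapted MIC algorithm mentioned in the text: the function names and the tolerance `eps` are hypothetical, no search over binnings is performed, and the zero-over-zero case is set to zero following the convention stated for Example 3.1 (other degenerate cases may call for a different limit).

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable observations."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def cond_entropy(a, b):
    """H(A|B) = H(A,B) - H(B), estimated from paired samples."""
    return entropy(list(zip(a, b))) - entropy(list(b))

def interaction_distance(x1, x2, y, eps=1e-12):
    """d(X1;X2.Y) = d(X1;X2|Y) - d(X1;X2) for discrete samples."""
    # Unconditional distance: (max H - I) / max H
    h1, h2 = entropy(x1), entropy(x2)
    mi = h1 + h2 - entropy(list(zip(x1, x2)))
    top = max(h1, h2)
    d = (top - mi) / top if top > eps else 0.0
    # Conditional distance, with every entropy conditioned on Y
    h1y, h2y = cond_entropy(x1, y), cond_entropy(x2, y)
    h12y = cond_entropy(list(zip(x1, x2)), y)
    cmi = h1y + h2y - h12y                       # I(X1;X2|Y)
    top_y = max(h1y, h2y)
    d_y = (top_y - cmi) / top_y if top_y > eps else 0.0   # 0/0 treated as 0
    return d_y - d

# Example 3.1: the four equally likely outcomes, with Y encoding (X1, X2)
x1 = [0, 1, 0, 1]
x2 = [0, 0, 1, 1]
y = [a + 2 * b for a, b in zip(x1, x2)]
print(interaction_distance(x1, x2, y))
```

Under these assumptions the printed value agrees with the result $-1$ derived in the example, while the interaction information for the same data is zero.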


The complexity and time for the computations could be a key issue for future applications. We have performed some preliminary tests on the yeast data set mentioned above, with 225 binary variables and 375 samples. We calculated the interaction distances $d(X_i; X_j . Y)$, where $i, j$ went across all the pairs of the 225 variables and $Y$ was an additional variable that represents the phenotype. The phenotype variable was binned into four states. The running time on a laptop was about ten seconds. We artificially generated a similar set with 1000 samples and 1200 variables: the running time in this case was about nine minutes. Even if we take into account that we may need to test various binnings, the running time should be acceptable. A detailed discussion of this issue will be given in future papers.

4.2. Alternatives

The problem with the interaction distance is the large number of samples required to obtain a sound estimate of $d(X_1; X_2 . Y)$. Even for the MIC between two variables we need a relatively large number of samples: the minimal practical size of $D$ is about 100. Since the interaction distance involves three variables, an order of magnitude more samples will be required here. In many practical applications we may not be able to collect a sufficient number of observations. Thus, we need to find an alternative. This could be based on the idea that underlies the partial correlation between $X_1$ and $X_2$ given $Y$, denoted by $\rho_{X_1 X_2 \cdot Y}$. This is the correlation of the residuals of $X_1$ and $X_2$ calculated from the linear regression of $X_1$ given $Y$ and $X_2$ given $Y$.

In our future research we want to develop a similar approach but replace the correlation of the residuals by a measure or a statistical test that can capture more than only linear relationships. A good candidate for such a measure seems to be the distance correlation introduced by Szekely [12].

In the Science perspective commenting on [1] a challenge was presented [3]: "MIC is a great step forward, but there are many more steps to take." To take the first step we have proposed here a relatively natural, but substantive, extension of the MIC for detecting potentially complex associations among three random variables. This itself is an important step for practical applications. We have shown that by merging two concepts, the interaction information, which is a generalization of the mutual information to three variables, and the normalized information distance, which measures informational sharing between two variables, we are able to extend the fundamental idea of MIC. The interaction distance we propose exhibits some attractive properties that should also be useful for practical applications in many aspects of data analysis, and the framework presented here can be used to generalize to the multi-variable case. The technical details of our method will be a topic of a future publication.

5. ACKNOWLEDGEMENTS

This work was supported by the ISB-Luxembourg Program, and by the FIBR program of NSF (0527023). We thank Nikos Vlassis from the Luxembourg Centre for Systems Biomedicine for discussions.

6. REFERENCES

[1] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, "Detecting novel associations in large data sets," Science, vol. 334, no. 64, pp. 1518–1524, 2011.

[2] N. A. Sakhanenko and D. J. Galas, "Interaction information in the discretization of quantitive phenotype data," in Proceedings of the 8th International Workshop on Computational Systems Biology, Zurich, Switzerland, June 2011.

[3] T. Speed, "A correlation for the 21st century," Science, vol. 334, no. 64, pp. 1502, 2011.

[4] A. Jakulin and I. Bratko, "Testing the significance of attribute interactions," in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, June 2004.

[5] A. Jakulin and I. Bratko, "Quantifying and visualizing attribute interactions: An approach based on entropy," http://arxiv.org/abs/cs.AI/0308002 v3, 2004.

[6] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991.

[7] T. Tsujishita, "On triple mutual information," Advances in Applied Mathematics, vol. 16, no. 3, pp. 269–274, 1995.

[8] M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, "The similarity metric," IEEE Transactions on Information Theory, pp. 863–872, Sep. 2003.

[9] D. J. Galas, M. Nykter, G. W. Carter, N. D. Price, and I. Shmulevich, "Biological information as set-based complexity," IEEE Transactions on Information Theory, vol. 56, pp. 667–677, Feb. 2010.

[10] A. N. Kolmogorov, "Three approaches to the definition of the concept quantity of information (Russian)," Problemy Peredachi Informacii, vol. 1, pp. 3–11, 1965.

[11] P. Gacs, J. T. Tromp, and P. M. B. Vitanyi, "Algorithmic statistics," IEEE Transactions on Information Theory, vol. 47, pp. 2443–2463, Sep. 2001.

[12] G. J. Szekely, M. L. Rizzo, and N. K. Bakirov, "Measuring and testing dependence by correlation of distances," The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.


BLIND SOURCE SEPARATION USING LATENT GAUSSIAN GRAPHICAL MODELS

Katrin Illner, Christiane Fuchs and Fabian J. Theis

Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München and Institute for Mathematical Sciences, Technische Universität München
{katrin.illner,christiane.fuchs,fabian.theis}@helmholtz-muenchen.de

ABSTRACT

We want to identify independent projections of interest in multivariate data in an unsupervised manner. Dealing with data of a specific temporal or spatial structure is well established for this so-called blind source separation problem. The recently published GraDe-algorithm addresses more complex network structures often found in biology; it analytically separates sources with respect to a given network. We translate these separation assumptions and propose a flexible Bayesian model to include for instance missing observations and parameter priors. Technically, we define a Gaussian graphical model with latent variables and estimate the parameters using expectation maximization. As a large-scale application we consider gene expression data, where the dependence structure is given by a literature-derived transcriptional regulation network. We demonstrate that the model identifies relevant biological processes.

1. INTRODUCTION

In blind source separation (BSS) one assumes informative sources underlying the mixture of observed signals, and BSS has often been adopted in computational biology [1, 2]. Approaches differ with respect to the mixing model, the assumptions on the sources and the estimation method. If the data has additional structure, one often finds structure-specific separation assumptions; a widely used approach for time series data is that different sources have vanishing time-delayed covariance. The separated sources then represent single time processes [3, 4]. This concept has already been translated to general networks in [5], where different sources are assumed to have vanishing graph-delayed covariance. However, the proposed GraDe-algorithm [5] diagonalizes the graph-delayed correlation matrix analytically and is not very flexible in terms of modeling.

In order to deal with missing observations, to evaluate the parameter estimates in terms of their posterior distributions, and to perform network selection, we combine latent Gaussian graphical modeling and blind source separation. We discuss in detail how the distribution of the latent variables can mirror complex network structures and how the separation assumptions translate to the model. The parameters and sources are estimated using expectation maximization, where we exploit the restrictions given by the separation assumptions. We evaluate the performance of the algorithm on synthetic data and on observed gene expression data. For the latter we show how the separation of the proposed model coincides with results found in [5].

2. THE MODEL

2.1. The mixing

We consider the following mixing model: $X = (x(i))_{i=1}^{N}$ are observable Gaussian variables with state space $\mathbb{R}^m$, and we assume latent Gaussian variables $S = (s(i))_{i=1}^{N}$ with state space $\mathbb{R}^q$ ($q \le m$), such that each variable $x(i)$ is the linear mixture of the components of the latent variable $s(i)$:

$$ x(i) = A\, s(i) + \varepsilon(i), \qquad i = 1, \dots, N. \tag{1} $$

Here $\varepsilon(i) \in \mathbb{R}^m$ is additive i.i.d. noise, $\varepsilon(i) \sim N(0, \sigma^2 I)$, independent of the latent variables. $A \in \mathbb{R}^{m \times q}$ denotes the mixing matrix, and we refer to the components of the latent variables as sources, i.e. for $k = 1, \dots, q$ we have sources $s_k = (s_k(i))_{i=1}^{N}$.
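The mixing model (1) can be simulated directly. The sketch below is only illustrative: the function name `sample_mixture` and the concrete sizes are assumptions, and the latent variables are drawn i.i.d. as a placeholder, so the network structure introduced in the following sections is deliberately ignored here.

```python
import numpy as np

def sample_mixture(A, S, sigma2, rng):
    """Draw observations with rows x(i) = A s(i) + eps(i), eps(i) ~ N(0, sigma2 * I).

    A has shape (m, q); S has shape (N, q) with one latent variable per row.
    """
    n, m = S.shape[0], A.shape[0]
    noise = rng.normal(scale=np.sqrt(sigma2), size=(n, m))
    return S @ A.T + noise

rng = np.random.default_rng(0)
N, m, q = 50, 3, 3                      # hypothetical sizes for illustration
A = rng.normal(size=(m, q))             # mixing matrix
S = rng.normal(size=(N, q))             # placeholder sources without network structure
X = sample_mixture(A, S, sigma2=0.3, rng=rng)
```

Each column of `S` plays the role of one source $s_k$, and each row of `X` is one observed variable $x(i)$.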

2.2. Latent Gaussian graphical model

We now define $X$ and $S$ in terms of a latent Gaussian graphical model with hidden part $S$. We assume a network structure given by a (weighted) directed acyclic graph $G = (S, E)$ with edges $E \subseteq S \times S$ that connects the latent variables. Let the sequence of the nodes be ordered in a way that all parents of a node $s(i)$ have indices lower than $i$, and the first $r - 1$ nodes are roots. (Such an ordered sequence always exists for directed acyclic graphs.) We further connect all nodes $x(i)$ to $S$ by an edge $(s(i), x(i))$ and assume that the joint distribution of $X$ and $S$ decomposes as

$$ p(X, S) = \prod_{i=1}^{N} p(x(i) \mid s(i)) \; \prod_{i=r}^{N} p(s(i) \mid \mathrm{Pa}(i)) \; \prod_{i=1}^{r-1} p(s(i)), \tag{2} $$

where $\mathrm{Pa}(i)$ is the vector of all parents of $s(i)$, i.e. all direct predecessors. Note that root nodes are mutually independent, and so are any two non-adjacent nodes $s(j)$ and $s(i)$ with $j < i$, if we condition on all parent nodes of $s(i)$. Such a graphical model is also known as a Bayesian network, and in the following we specify the stochastic properties of $S$.
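The node ordering assumed above (parents before children, all roots first) can be obtained with a breadth-first variant of Kahn's algorithm. This is a minimal sketch with a hypothetical function name and a plain edge-list representation of the graph; it is not taken from the authors' implementation.

```python
from collections import deque

def roots_first_order(nodes, edges):
    """Order nodes so every parent precedes its children and all roots come first.

    `edges` is an iterable of (parent, child) pairs of a directed acyclic graph.
    """
    n_parents = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for parent, child in edges:
        children[parent].append(child)
        n_parents[child] += 1
    # Seeding the FIFO queue with all roots guarantees they are emitted first.
    queue = deque(v for v in nodes if n_parents[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            n_parents[c] -= 1
            if n_parents[c] == 0:       # all parents of c are already placed
                queue.append(c)
    if len(order) != len(nodes):
        raise ValueError("graph contains a directed cycle")
    return order
```

In the notation of the text, the number of roots returned at the front of the list equals $r - 1$, and the function raises an error when the graph has a directed cycle, in which case no such ordering exists.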


2.3. Graph-delayed covariance

We extend the concept of wide sense stationary time processes [3, 4] to the case of (weighted) networks. For an edge $(s(j), s(i)) \in E$ let $w_{ji} \in \mathbb{R}$ denote its weight and we assume

$$ E[s(i)] = \mu, \quad \text{independent of } i, \tag{3} $$

$$ \mathrm{Cov}(s(i), s(i)) = \Sigma, \quad \text{independent of } i, \tag{4} $$

$$ \mathrm{Cov}(s(j), s(i)) = w_{ji}\, D, \quad \text{for } (s(j), s(i)) \in E. \tag{5} $$

Up to scaling factors we assume a unique covariance between two adjacent nodes in the network (5), and we therefore refer to $D$ as graph-delayed covariance (of delay 1). Our main assumption concerning the issue of source separation is that different sources have vanishing graph-delayed covariance. For $j = i$ and for $j$ with $s(j)$, $s(i)$ adjacent nodes we claim

$$ \mathrm{Cov}(s_k(j), s_l(i)) = 0, \quad \text{for } k \ne l. \tag{6} $$

In other words, the covariance matrices $\Sigma$ and $D$ defined in (4) and (5) are diagonal. There is a well-known indeterminacy in the proposed linear mixing model. For any invertible matrix $B \in \mathbb{R}^{q \times q}$ it holds that $x(i) = A\, s(i) + \varepsilon(i) = (A B^{-1})(B\, s(i)) + \varepsilon(i)$, and there is no unique solution for the decomposition into $A$ and $s(i)$. We therefore scale $\mathrm{Cov}(s(i), s(i)) = I$ and claim that the components of $D$ are pairwise different. The solution is then unique up to a signed permutation matrix, i.e. $B$ has exactly one non-zero entry in each row and each column and this entry equals $\pm 1$. If we redefine the mixing model with $A_\mu = (A, \mu)$ and $s_\mu(i) = (s(i)', 1)'$, enlarged by a constant component, we can assume $s(i) \sim N(0, I)$ and have

$$ x(i) = A_\mu\, s_\mu(i) + \varepsilon(i), \qquad i = 1, \dots, N. \tag{7} $$

2.4. Conditional distributions

The above assumptions define a unique joint distribution $p(S)$: Let $p(s(1), \dots, s(i-1))$ with $i \ge r$ be given. For each node $s(j)$ with $j < i$ and $(s(j), s(i)) \notin E$ we can determine $\mathrm{Cov}(s(j), s(i))$ using the variance theorem, where we condition on all parent nodes of $s(i)$. We have

$$ \mathrm{Cov}(s(j), s(i)) = \mathrm{Cov}\big(E[s(j) \mid \mathrm{Pa}(i)],\, E[s(i) \mid \mathrm{Pa}(i)]\big) + E\big[\mathrm{Cov}(s(j), s(i) \mid \mathrm{Pa}(i))\big], \tag{8} $$

with the second term equal to zero (Section 2.2). From $p(S)$ and the mixing (7) we get all conditional distributions in (2):

$$ s(i) \sim N(0, I) \quad (i \le r-1), \qquad s(i) \mid \mathrm{Pa}(i) \sim N\big(\omega_D(i)\, \mathrm{Pa}(i),\, \Sigma_D(i)\big) \quad (i \ge r), \tag{9} $$

$$ x(i) \mid s(i) \sim N\big(A_\mu\, s_\mu(i),\, \sigma^2 I\big) \quad (i \ge 1). \tag{10} $$

Here $\omega_D(i) \in \mathbb{R}^{q \times q n_i}$ and $\Sigma_D(i) \in \mathbb{R}^{q \times q}$, with $n_i$ the number of parents of $s(i)$, depend only on the graph-delayed covariance. Since we assume that $D$ is diagonal we find that $\Sigma_D(i)$ is also diagonal and $\omega_D(i)$ consists of blocks of diagonal matrices. Dependent on the weighted graph $G$ one can determine an interval $I^G \subseteq \mathbb{R}$ such that all covariance matrices $\Sigma_D(i)$ are positive definite for $D$ with components in $I^G$.

3. PARAMETER ESTIMATION

The unknown components of our model are the parameters $\theta = (A, \mu, \sigma^2, D)$ and the latent variables $S$, and we are interested in both. A common approach for latent graphical models is to use expectation maximization. We briefly introduce this method for the proposed model.

3.1. Expectation maximization

Let $E_{post}$ denote the expectation with respect to the posterior distribution $p(S \mid X, \theta)$ of the latent variables given observations $X$ and parameters $\theta$. We consider the expectation of the complete data log-likelihood that is given by $E_{post}[\ln p(X, S)] = E_{post}[\ln p(X \mid S)] + E_{post}[\ln p(S)]$. With the decomposition of $p(X, S)$ in (2) and the conditional distributions (9) and (10) we find:

$$ E_{post}[\ln p(X \mid S)] = -\frac{Nm}{2} \ln(2\pi) - \frac{Nm}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \Big[ \mathrm{Tr}\, E_{post}[x(i)x(i)'] - 2\, \mathrm{Tr}\big( E_{post}[s_\mu(i)x(i)']\, A_\mu \big) + \mathrm{Tr}\big( E_{post}[s_\mu(i)s_\mu(i)']\, A_\mu' A_\mu \big) \Big], \tag{11} $$

$$ E_{post}[\ln p(S)] = -\frac{Nq}{2} \ln(2\pi) - \frac{1}{2} \sum_{i=1}^{r-1} \mathrm{Tr}\, E_{post}[s(i)s(i)'] - \frac{1}{2} \sum_{i=r}^{N} \ln\det(\Sigma_D(i)) - \frac{1}{2} \sum_{i=r}^{N} \Big[ \mathrm{Tr}\big( E_{post}[s(i)s(i)']\, \Sigma_D(i)^{-1} \big) - 2\, \mathrm{Tr}\big( E_{post}[s(i)\mathrm{Pa}(i)']\, \omega_D(i)'\, \Sigma_D(i)^{-1} \big) + \mathrm{Tr}\big( E_{post}[\mathrm{Pa}(i)\mathrm{Pa}(i)']\, \omega_D(i)'\, \Sigma_D(i)^{-1}\, \omega_D(i) \big) \Big]. \tag{12} $$

The EM-algorithm consists of two steps which are repeated alternately until convergence. In the E-step we estimate the occurring expectations from the posterior distribution of the latent variables. For unobserved variables $z_1$ and $z_2$ one has the well-known property $E[z_1 z_2'] = \mathrm{Cov}(z_1, z_2) + E[z_1]\, E[z_2]'$. If only $z_2$ or both variables are observed, the l.h.s. equals $E[z_1]\, z_2'$ and $z_1 z_2'$, respectively. For Bayesian networks we can use the junction tree algorithm to infer the posterior distribution of the unobserved variables. We use a Matlab implementation from the Bayes Net Toolbox [6].

In the M-step we maximize $E_{post}[\ln p(X, S)]$ with respect to the parameters. From (11) we get parameter updates for $A_\mu$ and $\sigma^2$. For better readability we define $E_{sx} = \sum_{i=1}^{N} E_{post}[s_\mu(i)x(i)']$ and accordingly $E_{ss}$ and


N = 50 nodes and Ne = 500 edges and assign weights 2 , D) ji 2 [ 1, 1] to the edges. For parameters ✓ = (A, µ, G with components of D in I we first sample S from the distribution given in (9) and then mix the observations X with noise variance 2 = 0.3 . Figure 1 and 2 show that parameter and source estimation work satisfyingly well. Only in case of the noise variance 2 we observe the tendency of underestimation. We now compare the estimation performance in three different settings with one varying variable in each: a) increasing noise variance 2 , b) decreasing number of edges Ne , and c) increasing number of missing observation values n0  N . Table 1 summarizes the parameter combinations in detail. Since we keep the number of nodes fixed (N = 50) we can compare the estimation results in terms of the data log-likelihood L(✓) = ln p(X | ✓). We find that in settings a) and c) the mean data log-likelihood (over 10 runs) decreases, which is plausible. In contrast, a low number of edges has no substantial impact on the estimation quality (Figure 3).

3 estimate true value

A

2

A:2

A:3

2

3

D

D22

D33

2

Figure 1. Parameter estimation for synthetic data. The estimates of A, µ, 2 , and D after 500 iterations of the algorithm are shown together with the empirical 95% confidence intervals. True values are marked as squares.

4.2. Gene expression data We apply the model to gene expression data and compare the outcome to the results of the GraDe-algorithm in [5]. In the experiments of the original paper mice are stimulated with cytokine interleukin IL-6 over 4 hours. The data consists of the expression values of N = 5709 genes at 4 different time points. Referring to the paper we assume that the genes are linked along a gene regulatory network derived from the Transpath database. We find 630 (directed) edges corresponding to activation and inhibition, and 5360 genes are not linked at all. Since these genes contain no information about the graph-delayed covariance we only consider the remaining N0 = 349 genes for parameter estimation. This restriction also leads to a substantial decrease in computing time. For the impact of all activators and inhibitors of a gene there is no further information available. We therefore assume that this impact is constant over the network, and we define edge weights as ji = ±1/#{parents of s(i)} , where +1 and 1 correspond to activation and inhibition, respectively. To that end our model requires an acyclic graph which is fulfilled if a small amout of edges is deleted from the above network. Choosing egdes with minimal number of parents results in a low number of deletions (15 egdes). Since the mean out-degree (2.65) of the network is substantially larger than the mean in-degree (0.11) this deletion preserves the upstream gene regulation. N0 The observations x(i) i=1 are now given by the expression values x(i) 2 R4 of the selected genes, and we estimate the parameters A, µ, 2 , and D with q, d = 4. With these parameters we can estimate all latent variables s(i) for i = 1, . . . , N (N = 5709) as the mean of their posterior distribution µs(i)|x(i) = A(AA0 + 2 I) 1 (x(i) µ) . This

Figure 2. Source estimation for synthetic data. The estimates of s1 , s2 , and s3 after 500 iterations of the algorithm are plotted as solid lines together with the empirical pointwise 95% confidence intervals. For better visualization true values are shown as dots. Exx. We have the following updates 0

1

Aµ = Esx Ess , ⇥ ⇤ 1 2 = Tr(Exx) 2Tr(Esx Aµ ) + Tr(Ess A0µ Aµ ) . Nm

To update D we numerically maximize (12). Since D is diagonal and because of the shape of ! D (i) and ⌃D (i) discussed in Section 2.4 we can maximize with respect to each component separately. The search space is given by the interval I G . According to our definition of the distribution p(S), the parameter D occurs in all ! D (i) and ⌃D (i) in a different manner. Such a specific parameter tying is not supported by the Bayes Net Toolbox, and we implemented own code for the maximization step. Missing values. In applications one often faces the problem of missing observation values. If for variables X0 ⇢ X the observation values are missing we can treat X0 as additional latent variables and estimate S and X0 from the posterior distribution p(S, X0 | X\X0 , ✓) in the Estep. In Section 4.1 we demonstrate the impact of missing values in terms of parameter estimation quality. 4. APPLICATIONS 4.1. Synthetic data

We show the performance of our estimation approach on synthetic data. To that end we fix m, q = 3. We generate a graphical model S as a directed acyclic graph G with

Table 1. Parameter values for a comparison of the estimation performance in case of a) increasing noise variance $\sigma^2$, b) decreasing number of edges $N_e$, and c) increasing number of missing values $n_0$.

setting   varied parameter   v1    v2    v3    v4    fixed values
a)        $\sigma^2$         0.1   0.3   0.5   0.7   $N_e = 100$, $n_0 = 0$
b)        $N_e$              500   250   100   10    $n_0 = 0$, $\sigma^2 = 0.3$
c)        $n_0$              0     10    25    35    $N_e = 100$, $\sigma^2 = 0.3$

[Figure 3 plot: mean data log-likelihood $\ln p(X \mid \theta)$ (vertical axis, about −300 to −160) over the settings v1 to v4, with one curve each for a) noise level ↑, b) number of edges ↓, c) number of missing values ↑.]

Figure 3. Comparison of the estimation performance. Mean data log-likelihood of 10 runs with one varying variable in setting a)-c), cf. Table 1.

yields to sources $s_k = (s_k(i))_{i=1}^{N}$ for k = 1, . . . , 4. Associated to the largest graph-delayed covariances (absolute values) we find two dominating sources s1 and s2. Both show an immediate response at the earlier time points (Figure 4b). We further selected genes with absolute z-score > 1.5 and performed enrichment analysis with Gene Ontology terms and KEGG pathways; p-values are corrected by false discovery rates and we used 0.05 as cutoff. Associated to s1 we found more treatment-specific terms like "(acute) inflammatory response" and "(external) stimulus". In contrast, associated to s2, we found for example processes involved in "cell cycle". This separation indeed coincides with results discussed in [5], and for details on the biological meaning we refer to the original paper.

Figure 4. Parameter estimation for real data. a) absolute values of the graph-delayed covariance; the clearly larger values d1 and d2 indicate two dominating sources s1 and s2. b) time response (columns of A) of the dominating sources.

5. CONCLUSION AND OUTLOOK

We propose a flexible Bayesian model for blind source separation that deals with data of general network structure. The separation assumptions from [5] are assigned to a latent Gaussian graphical model in a way that indeed leads to similar results. As an additional feature we are able to deal with missing observation values and evaluate the parameter estimates and sources in terms of their distributions. In future and ongoing work we include priors over the parameters (e.g. for mixing matrices with strictly positive entries) and consider networks with directed cycles. The latter is mainly a question of defining the joint distribution p(S). A further issue for biological applications is dealing with heterogeneous data, i.e. data of different scales like mRNA, miRNA, and protein measurements. In order to determine the affinity of the data to multiple networks (e.g. given by single pathways) we can use Bayesian model selection, and the proposed model is a promising starting point.

6. ACKNOWLEDGEMENTS

This work was supported by the European Union within the ERC grant ’LatentCauses’ and the Federal Ministry of Education and Research (BMBF) in its GerontoSys initiative (project ’Stromal Aging’).

7. REFERENCES

[1] Teschendorff, A. E., Journe, M., Absil, P. A., Sepulchre, R., and Caldas, C.: Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Comput Biol, 3(8), (2007)
[2] Lutter, D., Ugocsai, P., Grandl, M., Orso, E., Theis, F., Lang, E., and Schmitz, G.: Analyzing m-csf dependent monocyte/macrophage differentiation: expression modes and meta-modes derived from an independent component analysis. BMC Bioinformatics, 9(100), (2008)
[3] Tong, L., Soon, V.C., Huang, Y.F., Liu, R.: Amuse: A new blind identification algorithm. IEEE International Symposium on Circuits and Systems, pp. 1784–1787 (1990)
[4] Belouchrani, A., Abed-Meraim, K., Cardoso, J.F., Moulines, E.: A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing, 45(2), pp. 434–444 (1997)
[5] Kowarsch, A., Blöchl, F., Bohl, S., Saile, M., Gretz, N., Klingmüller, U., Theis, F. J.: Knowledge-based matrix factorization temporally resolves the cellular responses to IL-6 stimulation. BMC Bioinformatics, 11, pp. 585–598 (2010)
[6] Murphy, K., and others: The Bayes net toolbox for Matlab. Computing Science and Statistics, 33(2), pp. 1024–1034 (2001)


WARPED GAUSSIAN PROCESS MODELLING OF TRANSCRIPTIONAL REGULATION

Ruirui Ji1,2 and Dirk Husmeier1
1 School of Mathematics & Statistics, University of Glasgow, United Kingdom
2 School of Automation and Information Engineering, Xi’an University of Technology, China
[email protected], [email protected]

ABSTRACT

This article extends recent work on Gaussian process modelling of transcriptional regulation, which assumed additive Gaussian noise of constant variance, to heteroscedastic noise. Our work is based on an explicit noise model for transcriptional profiling and the concept of warped Gaussian processes.

1. INTRODUCTION

A linear model of gene expression was proposed by Barenco et al. [1]:

$$\frac{dx_i(t)}{dt} = B_i + S_i f(t) - D_i x_i(t) \qquad (1)$$

where $i \in \{1, \ldots, G\}$ indexes a set of genes regulated by the same transcription factor (TF), $x_i(t)$ are the (unknown) true gene expression levels at time point t, $f(t)$ is the (unknown) TF activity, $B_i$ is the basal transcription rate of gene i, $S_i$ is the sensitivity to binding of the TF, and $D_i$ is a decay rate. We assume that (noisy) measurements of $x_i(t)$ can be obtained, e.g. with microarrays or RT-PCR scans. However, TF activity may be subject to post-transcriptional regulation and hence not amenable to transcriptional profiling techniques. We therefore assume that $f(t)$ is unobservable. Equation (1) has the analytic solution

$$x_i(t) = \frac{B_i}{D_i} + S_i \int_0^t \exp\big(-D_i (t - u)\big) f(u)\, du \qquad (2)$$

where transient terms have been ignored. Gao et al. [2] proposed a nonparametric Bayesian approach to inference in this model by placing a Gaussian process prior with squared exponential covariance matrix on the unknown TF activities $\mathbf{f} = (f(t_1), \ldots, f(t_T))$ at timepoints $\mathbf{t} = (t_1, \ldots, t_T)$:

$$p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K_{f,f}) \qquad (3)$$

that is, the prior probability of the TF activities, $p(\mathbf{f})$, is a zero-mean multivariate Gaussian distribution with covariance matrix $K_{f,f}$, whose elements are

$$K_{f,f}(t, t') = \exp\left(-\frac{(t - t')^2}{l^2}\right) \qquad (4)$$

where l is a scale hyperparameter. The linear form of equation (2) implies that the joint prior distribution of the expression profiles of all regulated genes

$$\mathbf{x}_i = [x_i(t_1), \ldots, x_i(t_T)]; \quad i = 1, \ldots, G \qquad (5)$$

is described by a Gaussian process prior with a covariance matrix, K, that depends on the scale hyperparameter l and the parameters that characterise the transcriptional regulation processes via (1):

$$p(\mathbf{x} \mid \theta') = \mathcal{N}(\mathbf{0}, K); \quad K = K(\theta') \qquad (6)$$

$$\theta' = (l, B_1, \ldots, B_G, S_1, \ldots, S_G, D_1, \ldots, D_G)$$

See [2, 3] for explicit expressions. To relate the unknown true gene expression profiles $\mathbf{x}_i$ to noisy measurements $\mathbf{y}_i = [y_i(t_1), \ldots, y_i(t_T)]$; $i = 1, \ldots, G$, the standard approach (e.g. [4], Sect. 6.4.2) assumes additive Gaussian noise of constant variance $\sigma^2$:

$$p(\mathbf{y} \mid \mathbf{x}, \sigma^2) = \mathcal{N}(\mathbf{y} \mid \mathbf{x}, \sigma^2 I) \qquad (7)$$

where I is the identity matrix. The marginalisation over $\mathbf{x}$ is analytically tractable and gives, with the definition $\theta = (\theta', \sigma^2)$:

$$p(\mathbf{y} \mid \theta) = \int p(\mathbf{y} \mid \mathbf{x}, \sigma^2)\, p(\mathbf{x} \mid \theta')\, d\mathbf{x} = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, C(\theta)), \qquad C(\theta) = K(\theta') + \sigma^2 I \qquad (8)$$

Inference of the parameters $\theta$ can then be achieved in a maximum likelihood or Bayesian framework, as described in standard textbooks on Gaussian processes [4, 5].

2. METHODOLOGICAL INNOVATION

The assumption of additive Gaussian noise is not biologically realistic. Gao et al. [2] formally introduced a case-dependent variance, $\sigma_t^2$, but such an over-flexible model is not amenable to statistical inference. Our approach is based on Durbin et al. [6], who proposed a general noise model for transcriptional profiling with microarrays:

$$y_i(t) = c + x_i(t) \exp(\mu_t) + \varepsilon_t, \qquad \mu_t \sim \mathcal{N}(0, \sigma_\mu^2); \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2) \qquad (9)$$

where $y_i(t)$ is the measured, and $x_i(t)$ the unknown true expression level of gene i at time point t, c is the mean background noise, and $\sigma_\mu^2$ and $\sigma_\varepsilon^2$ are unknown variance parameters. Note that this form of noise reduces to additive Gaussian noise in the limit of low expression levels,


$x_i(t) > 1$. Inserting (9) into (7) does not give a closed-form solution, and renders inference with Gaussian processes intractable. To proceed, we apply a result found in [7]. Using the delta method of classical statistical inference, the authors derived a variance-stabilising transformation for measured gene expression levels y of the form:

$$h(y) = \gamma\, \mathrm{arsinh}(\alpha + \beta y), \qquad \mathrm{arsinh}(u) = \log\left(u + \sqrt{u^2 + 1}\right) \qquad (10)$$

Following [8] we define the warping function

$$\big(z_i[t_1], \ldots, z_i[t_T]\big) = \mathbf{z}_i = h(\mathbf{y}_i; \Psi) = \big(h(y_i[t_1]; \Psi), \ldots, h(y_i[t_T]; \Psi)\big) \qquad (11)$$

where $\Psi = (\alpha, \beta, \gamma)$, and we model $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_G)$ with a Gaussian process of (8):

$$p(\mathbf{z} \mid \theta) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, C(\theta)) \qquad (12)$$

Using the standard variable transformation rule for probability densities, (12) implies the following distribution for the measured gene expression levels $\mathbf{y}$:

$$p(\mathbf{y} \mid \theta, \Psi) = \mathcal{N}\big(h(\mathbf{y}) \mid \mathbf{0}, C(\theta)\big) \left(\frac{\partial h(\mathbf{y}; \Psi)}{\partial \mathbf{y}}\right) = \mathcal{N}\big(h(\mathbf{y}) \mid \mathbf{0}, C(\theta)\big) \prod_{i=1}^{G} \prod_{k=1}^{T} \left(\frac{\partial h_i(y_i(t_k); \Psi)}{\partial y_i(t_k)}\right) \qquad (13)$$

Inference is achieved by taking derivatives of the log likelihood $\log p(\mathbf{y} \mid \theta, \Psi)$ with respect to both $\theta$ and $\Psi$, and applying a scaled conjugate gradient search for the maximum likelihood parameters. In this way, both the parameters of the covariance matrix, $\theta$, and those of the nonlinear transformation, $\Psi$, are learnt simultaneously under the same probabilistic framework. As we demonstrate in our simulation study, this can be expected to achieve better results than applying the transformation (10) in a separate data preprocessing step. The distribution of $z_i(t_*)$ at a new time point $t_*$ is Gaussian:

$$p(z_i(t_*) \mid \mathbf{x}_1, \ldots, \mathbf{x}_G, \theta) = \mathcal{N}\big(z_i(t_*) \mid \hat{z}(\theta), \hat{\sigma}^2(\theta)\big) \qquad (14)$$

where $\hat{z}(\theta)$ and $\hat{\sigma}^2(\theta)$ are obtained by standard transformations of multivariate Gaussian distributions; see [4, 5] for explicit expressions. The distribution in the original data space is obtained by passing this Gaussian distribution through the non-linear warping function (10):

$$p(y_i(t_*) \mid \mathbf{x}_1, \ldots, \mathbf{x}_G, \theta) = \frac{dh[y_i(t_*)]}{dy_i(t_*)}\, \mathcal{N}\big(h[y_i(t_*)] \mid \hat{z}(\theta), \hat{\sigma}^2(\theta)\big) \qquad (15)$$

Note that as opposed to (14), the distribution in (15) is not Gaussian. When we require a point prediction, rather than the whole distribution, it is convenient (for analytical tractability) to take the median

$$\mathrm{median}[y_i(t_*)] = h^{-1}\big(\hat{z}(\theta); \Psi\big) \qquad (16)$$

We note that the proposed approach is a modification of the one proposed in [8], with a warping function that is motivated by the transcriptional noise model (9).

3. SIMULATION

We tested the performance of the proposed scheme on data simulated in a similar manner as described in [9]. We assume that the unobservable TF activity has the form

$$f(t) = \sum_{j=1}^{4} a_j \exp\left(-\frac{(t - \mu_j)^2}{\sigma^2}\right) \qquad (17)$$

with $\sigma = 1.5$, $a_1 = a_2 = 1.5$, $a_3 = a_4 = 0.5$, $\mu_1 = 4$, $\mu_2 = 6$, $\mu_3 = 8.5$ and $\mu_4 = 10.5$. We generated three gene expression profiles $\mathbf{x}_i = [x_i(t_1), \ldots, x_i(t_T)]$, i = 1, 2, 3, from (1–2) over 100 time points regularly spaced between 0 and 18, using the settings $B_1 = 0.01$, $B_2 = 7.5 \times 10^{-2}$, $B_3 = 2.5 \times 10^{-3}$, $S_1 = 1.0$, $S_2 = 0.4$, $S_3 = 0.4$, $D_1 = 1.0$, $D_2 = 0.05$, $D_3 = 0.001$. We then sampled six data points from each target gene, which provided the training data. Unlike [9], we did not simply add iid Gaussian noise, but simulated noisy measurements from (9), with parameter settings $c = 0$, $\sigma_\mu^2 = 0.01$, $\sigma_\varepsilon^2 = 0.01$. We compared five approaches:

GP: standard Gaussian processes, as in [2];
asinhGP: standard Gaussian processes after pre-processing the data according to the transformation (10), estimating the parameters as described in [7];
GPtanh: warped Gaussian processes, with the mixture-of-tanh warping function proposed in [8];
GPlog: warped Gaussian processes with a logarithm warping function;
GPasinh: warped Gaussian processes with the warping function proposed in the present paper.

For all applications, we optimized the parameters $\theta$ and (if applicable) $\Psi$ in a maximum likelihood sense using scaled conjugate gradients, and computed the posterior median according to (16). Table 1 shows the mean absolute prediction error for the estimated chemical kinetic parameters defined in (1), the gene expression profiles and TF activities at the fixed time points. Figures 1-5 show a comparison between the five methods in terms of the error distribution for various quantities, obtained from 20 datasets.

Table 1. The errors of estimated parameters, inferred gene expression profiles and transcription factor activity on the simulated dataset (mean and median for 20 repeats).

model     B_mean  D_mean  S_mean  Gene_mean  TF_mean  B_median  D_median  S_median  Gene_median  TF_median
GP        0.597   0.319   0.377   0.515      0.430    0.486     0.269     0.410     0.252        0.371
asinhGP   0.591   0.356   0.265   0.589      0.497    0.442     0.251     0.178     0.439        0.507
GPtanh    0.459   0.193   0.298   0.533      0.232    0.384     0.181     0.266     0.159        0.199
GPlog     0.413   0.367   0.479   0.433      0.281    0.353     0.217     0.395     0.144        0.190
GPasinh   0.181   0.182   0.292   0.078      0.168    0.165     0.197     0.300     0.069        0.162
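The data-generating procedure of the simulation study can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes that the TF activity in (17) is a sum of four Gaussian bumps with width σ = 1.5, evaluates the solution (2) with a simple trapezoidal rule, and draws measurements from the noise model (9). The function names, the random seed, and the choice of six evenly spread training indices are hypothetical, since the text does not specify them.

```python
import numpy as np

A = np.array([1.5, 1.5, 0.5, 0.5])    # amplitudes a_1..a_4 from the text
MU = np.array([4.0, 6.0, 8.5, 10.5])  # centres mu_1..mu_4 from the text
SIGMA = 1.5                           # width sigma from the text


def tf_activity(t):
    """Assumed form of (17): f(t) = sum_j a_j exp(-(t - mu_j)^2 / sigma^2)."""
    t = np.asarray(t, dtype=float)[..., None]
    return np.sum(A * np.exp(-((t - MU) ** 2) / SIGMA**2), axis=-1)


def expression_profile(t, B, S, D):
    """Evaluate (2) on an increasing grid t with t[0] = 0 (trapezoidal rule)."""
    f = tf_activity(t)
    x = np.empty_like(t, dtype=float)
    for k in range(len(t)):
        u = t[: k + 1]
        g = np.exp(-D * (t[k] - u)) * f[: k + 1]
        integral = 0.5 * np.sum((g[1:] + g[:-1]) * np.diff(u))
        x[k] = B / D + S * integral
    return x


def noisy_measurements(x, rng, c=0.0, var_mu=0.01, var_eps=0.01):
    """Noise model (9): y = c + x * exp(mu) + eps with Gaussian mu and eps."""
    mu = rng.normal(0.0, np.sqrt(var_mu), size=x.shape)
    eps = rng.normal(0.0, np.sqrt(var_eps), size=x.shape)
    return c + x * np.exp(mu) + eps


rng = np.random.default_rng(0)                          # hypothetical seed
t = np.linspace(0.0, 18.0, 100)                         # 100 regularly spaced time points
x1 = expression_profile(t, B=0.01, S=1.0, D=1.0)        # settings of gene 1
train_idx = np.linspace(0, len(t) - 1, 6).astype(int)   # assumed: six evenly spread points
y1 = noisy_measurements(x1[train_idx], rng)
```

Because the multiplicative term scales with the expression level, the spread of `y1` around `x1[train_idx]` grows with `x1`, which is the heteroscedastic behaviour that the warping function is meant to absorb.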

Our findings suggest that the proposed warped Gaussian process tends to achieve the lowest prediction error


and outperforms the competing approaches for the majority of the evaluation criteria. For certain reconstruction errors (reconstruction of the decay $D_i$, Figure 2, and the TF activity, Figure 5) the transformation based on the mixture-of-tanh warping function, proposed in [8], is on a par with our method, but not consistently. It is particularly striking that nonlinearly transforming the data in a pre-processing step, as in [7], using the same warping function (10) as for our warped GP, achieves comparable results only for estimating the kinetic parameters $S_i$ (Figure 3), but is otherwise outperformed by our method. This confirms the conjecture raised earlier that systematically inferring the parameters of the warping transformation simultaneously with the hyperparameters of the GP achieves better results than following [7] and applying the warping transformation in a separate data preprocessing step.

[Figures 1-5: boxplots per method (GP, asinhGP, GPtanh, GPlog, GPasinh); vertical axes: absolute error of B, absolute error of D, absolute error of S, mean squared error of gene, mean squared error of TF.]

Figure 1. Boxplot of the median, the 25th and 75th percentiles, and the outliers of the absolute error of the basal transcription rate, $B_i$ in (1), over 20 datasets.

Figure 2. Boxplot of the median, the 25th and 75th percentiles, and the outliers of the absolute error of the decay rate, $D_i$ in (1), over 20 datasets.

Figure 3. Boxplot of the median, the 25th and 75th percentiles, and the outliers of the absolute error of the sensitivity, $S_i$ in (1), over 20 datasets.

Figure 4. Boxplot of the median, the 25th and 75th percentiles, and the outliers of the mean squared error for all gene expression profiles, over 20 datasets.

Figure 5. Boxplot of the median, the 25th and 75th percentiles, and the outliers of the mean squared error for the inferred transcription factor activity, over 20 datasets.

4. REAL-DATA APPLICATION

We have applied the warped Gaussian processes to the transcriptional profiles from Barenco’s study [1], which reflect the expression levels of five target genes, DDB2, BIK, TNFRSF10b, p21 and hPA26, under the influence of regulation by a known transcription factor, p53. The


training data encompass five gene expression levels at 7 time points (0, 2, 4, 6, 8, 10, 12 hours). We compared our estimates of the kinetic parameters, defined in (1), with those referenced in Barenco’s work. The results are shown in Table 2 and Figures 6-7 and suggest that the proposed warped Gaussian process with the arsinh transfer function of (10) infers kinetic parameters that, overall, are in good agreement with the parameters found in [1]. In terms of reconstructing the gene expression profiles, the warped GPs with both warping functions, the arsinh function of (10) and the mixture-of-tanh functions from [8], are on a par, both outperforming the competing schemes. In terms of agreement of the inferred kinetic parameters with those from [1], our new warped GP with the arsinh function outperforms all other methods.

Table 2. The errors of estimated parameters and inferred gene expression profiles obtained from the real transcription profiles described in [1].

model     D      S      Gene
GP        0.117  0.467  0.182
asinhGP   0.157  0.336  0.214
GPtanh    0.049  0.234  0.158
GPlog     0.036  0.251  0.164
GPasinh   0.02   0.178  0.158

[Figures 6-7: bar charts of D and S per gene (DDB2, PA26, TNFRSF10b, BIK); legend: True, GP, asinhGP, GPlog, GPtanh, GPasinh.]

Figure 6. Comparison of the true and inferred decay rates, $D_i$ in (1), as obtained with different methods from the transcription profiles in [1].

Figure 7. Comparison of the true and inferred sensitivities, $S_i$ in (1), as obtained with different methods from the transcription profiles in [1].

5. CONCLUSIONS

Gaussian processes have been proposed as a promising tool for modelling transcriptional regulation. However, the widely applied constant variance additive noise model (e.g. [4], Sect. 6.4.2) is oversimplistic and does not adequately reflect the intrinsic heteroscedastic nature of the noise. While warping the Gaussian process to correct for this effect tends to achieve a more reliable parameter and gene expression profile reconstruction, the empirical warping function proposed in [8] does not take into account the specific nature of the noise inherent in transcriptional regulation. In the present article we have proposed a warping function based on an explicit noise model for transcriptional regulation [6], which has been widely applied as a variance stabilising transformation for transcriptional data [7]. We have shown that integrating this approach into the Gaussian process inference scheme achieves better results than transforming the data in a separate preprocessing step, and that our novel scheme outperforms warped Gaussian processes based on other warping functions.

6. REFERENCES

[1] M. Barenco et al., “Ranked prediction of p53 targets using hidden variable dynamic modeling,” Genome Biology, vol. 7, no. 3, pp. R25, 2006.
[2] P. Gao, A. Honkela, M. Rattray, and N. D. Lawrence, “Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities,” Bioinformatics, vol. 24, pp. i70–i75, 2008.
[3] A. Honkela et al., “Model-based method for transcription factor target identification with limited data,” PNAS, vol. 107, no. 17, pp. 7793–7798, 2010.
[4] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, Singapore, 2006.
[5] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
[6] B. P. Durbin, J. S. Hardin, D. M. Hawkins, and D. M. Rocke, “A variance-stabilizing transformation for gene-expression microarray data,” Bioinformatics, vol. 18, no. suppl1, pp. S105–S110, 2002.
[7] W. Huber, A. von Heydebreck, H. Sultmann, A. Poustka, and M. Vingron, “Variance stabilization applied to microarray data calibration and to the quantification of differential expression,” Bioinformatics, vol. 18, no. suppl1, pp. S96–104, 2002.
[8] E. Snelson, C. E. Rasmussen, and Z. Ghahramani, “Warped Gaussian processes,” Neural Information Processing Systems (NIPS), vol. 16, 2004.
[9] N. D. Lawrence, M. Rattray, P. Gao, and M. Titsias, “Gaussian Processes for Missing Species in Biochemical Systems,” in N. D. Lawrence, M. Girolami, M. Rattray, and G. Sanguinetti (eds.), Learning and Inference in Computational Biology, MIT Press, 2010.


SPATIAL STOCHASTIC SIMULATION OF TRANSCRIPTION FACTOR BINDING REVEALS MECHANISMS TO CONTROL GENE ACTIVATION

Michael Klann1* and Heinz Koeppl1
1 BISON Group, Automatic Control Laboratory, ETH Zürich, Switzerland
*[email protected], [email protected]

ABSTRACT

The activation of genes is one of the most prominent yet most stochastic processes in the cell. Gene activation controls the state of the cell. But normally only one or a few copies of each gene exist, causing a highly stochastic process. In addition, transcription factors (TF) that activate the respective gene have to find the corresponding target site on the tremendously long DNA strand. In order to elucidate the process, we have developed a particle-tracking algorithm including both the spatial and stochastic aspects of the process. TFs can diffuse through the nucleus, which is filled with DNA strands. They can unspecifically bind to the DNA backbone or directly to the target gene sequence in a reversible manner. The results of the detailed simulation show that the activation of the genes does not only depend on the number of TFs and the specific binding process to the target sequence but is strongly controlled by the unspecific binding process to all other nucleotide sequences along the DNA.

1. INTRODUCTION

This abstract aims at elucidating the binding of transcription factors to the promoter regions of DNA in order to promote (or block) the activation of the respective genes. The binding of the TFs to the DNA is governed by the interaction between atoms in the DNA-binding domain of the TF and the structure of the DNA [1, 2]. The overall backbone structure of the DNA is independent of the location, which means that the TFs can bind anywhere to the DNA in an unspecific binding process (see Fig. 2). According to their function, the binding of the TF also depends on the DNA sequence, adding the specific interaction with the target site to the model [3, 2]. The key question in this process is how the TF can find the target sequence efficiently amongst the billions of base pairs in the nucleus [4, 5, 6]. Note that this work omits the more detailed structure of the DNA, which is coiled around histones and arranged in a complex, (fractal?) structure [7]. So some parts of the DNA are (temporarily) not accessible. Thus the question remains how the accessible target sequence can be found within all other accessible sequences.

As also shown in Fig. 2, it is suggested that the TF can slide along the DNA in a 1D random walk [2]. This could lead to a facilitated diffusion of TFs, which can reduce the search time until the TF finds the correct binding site [8]. This reduced search time is needed to explain the observed high binding rate constant of $\approx 7 \times 10^{9}\ \mathrm{M^{-1} s^{-1}}$ [9], which seems to be above the diffusion limit. However, the specific, crowded in vivo conditions together with the electrostatic interactions between the DNA and the TF could change the diffusion limit and thus explain the high value without the need of sliding [4]. After all, sliding can only marginally contribute to finding the target site because the TF might have to walk along millions of nucleotides until it reaches the target site, while diffusion in 1D is not an efficient means of transport (the average displacement is zero) [10, 11], and observed 1D diffusion coefficients are much smaller than the 3D diffusion [12, 5].

Here, we employ a particle-tracking algorithm based on Brownian dynamics to simulate the search process [13]. To our knowledge, this is the first study of protein-DNA interaction in 3D at this mesoscopic resolution. Only Wunderlich and Mirny [11] have conducted a similar simulation in 2D; all other studies were only exploring the binding, sliding and hopping probabilities. Molecular dynamics simulations, in contrast, can explore the detailed interaction between the TF and the DNA but cannot track how the TF reaches the target.

2. MODEL AND SIMULATION METHOD

It remains unclear which driving force leads to a sliding step of the TF along the DNA. All the bonds between the DNA and the TF have to be broken for both dissociation and sliding [4]. Accordingly, in the model presented here, a transcription factor can be either bound to the DNA or free to diffuse in 3D, but not sliding in 1D along the DNA. Fig. 1a shows the interactions of the model. The model nucleus of this study has a radius of 0.75 µm, which corresponds to a small cell like, for example, yeast. It is filled with 5000 cylinders with a diameter of 2.0 nm (equal to DNA) and a total length of $L_{DNA} = 2.5$ mm (corresponding to ≈7.3 million base pairs; the DNA of yeast consists of 13 million base pairs, but not the whole DNA is accessible at any time). In the center of 1000 out of the 5000 cylinders a target sequence for the TFs is included, modeled as a sphere with a slightly bigger radius than the cylinder, $r_{site} = 1.1$ nm (cf. Figure 2). The TF is also modeled as a sphere with $r_{TF} = 1.1$ nm and it is diffusing with $D_{TF} = 1\ \mathrm{\mu m^2/s}$. The simulation


contains 2000 TFs. Unbound TFs can move in a discrete-time continuous-space random walk [13]:

$$x_i(t + \Delta t) = x_i(t) + \sqrt{2 D_i \Delta t}\; \xi \qquad (1)$$

($\xi$ is a standard normal random number). If the TFs enter the restricted volume the step is rejected. If the TFs enter the interaction volume, they react and convert to the bound state (cf. Figure 2a). The reaction volume is adjusted so that the desired binding rate is achieved as indicated in Figure 1b. Bound TFs are immobile. Unbinding is modeled with a first-order reaction rate constant $k_*^{off}$ [s$^{-1}$], and the unbinding probability in every time step is given by $P_*^{off} = 1 - \exp(-k_*^{off} \Delta t) \approx k_*^{off} \Delta t$ at the chosen $\Delta t$ [15]. Binding to the target site is determined by the bimolecular rate constant $k_{site}^{on}$ [M$^{-1}$ s$^{-1}$], which can be converted to [µm³/s] and results in a spherical interaction volume [13]. Only its part outside of the restricted volume is accessible (cf. Figure 1b). In this study we used $k_{site}^{on} = 2.66 \times 10^{3}\ \mathrm{M^{-1} s^{-1}}$, $k_{site}^{off} = 0.01\ \mathrm{s^{-1}}$.

Figure 1. (a) Model for the interaction of transcription factors (TF) with DNA. Since the TF is highly specific, the binding strength with the correct site should be stronger than the plain binding along the remaining DNA, i.e. $k_{site}^{on} > k_{DNA}^{on}$. Sliding is not present in our model (i.e., the 1D diffusion is set to 0). (b) Simulation definitions:
• Excluded volume: the center of mass of the TF cannot be closer to the centerline of the DNA object than $r_{TF} + r_{DNA}$.
• Unspecific binding volume: the unspecific binding volume is uniformly distributed along the DNA surface. The total unspecific binding volume leads to the desired binding rate based on $k_{DNA}^{on} c_{DNA}$, where $c_{DNA} = L_{DNA}/V$ is the concentration of DNA in [µm/µm³]. Thus $k_{DNA}^{on}$ has units [µm²/s] and $k_{DNA}^{on} \Delta t\, L_{DNA}$ is the required reaction volume [14].
• Specific binding volume: the specific interaction volume between TF and target site is exactly their collision volume, but obviously only the volume fraction outside of the excluded volume contributes to the interaction. The reaction probability for TFs inside of the interaction volume is determined as described in [13].

$k_{DNA}^{on} c_{DNA}$ [s$^{-1}$]   $k_{DNA}^{off}$ [s$^{-1}$]   $N_b$   $N_f$   $N_s$
0                                   0                            0       1701    299
1.4                                 0.5                          1356    484     160
4.2                                 1.5                          1298    464     238
14.0                                5.0                          1165    417     418
41.9                                15.0                         1017    364     619

Table 1. Rate constants and steady state numbers.

3. RESULTS AND DISCUSSION

Let us denote the number of free/unbound transcription factors with $N_f$, the number of specifically bound ones at the target site with $N_s$ and the number of unspecifically bound with $N_b$, which follow the dynamics

$$N_b \rightleftharpoons N_f \rightleftharpoons N_s.$$

A deterministic model will predict a steady state that solely depends on the fraction of the binding and unbinding rate constants:

$$\frac{N_b}{N_f} = \frac{k_{DNA}^{on}}{k_{DNA}^{off}} \quad \text{and} \quad \frac{N_s}{N_f} = \frac{k_{site}^{on} N_t}{k_{site}^{off} V_{nucl}} \qquad (2)$$

where $N_t$ is the number of free specific target binding sites. In the present simulation we kept the specific binding and unbinding rate constants $k_{site}^{on}$ and $k_{site}^{off}$ constant while changing the unspecific binding rate constants $k_{DNA}^{on}$ and $k_{DNA}^{off}$ simultaneously. Table 1 shows parameters and results, in particular that $N_b/N_f \approx 2.8$ for all parameter sets with unspecific binding. Figure 2 shows how the number of activated sites depends on the unspecific binding rate constant in the stochastic simulation. First of all, unspecific binding reduces $N_f$, which reduces $N_s$ as long as $k_{site}^{on}$ remains constant. However, the unspecifically bound molecules are directly at the DNA, which can be close to the binding site (lowest curve in Figure 2). Increasing the frequency of the unspecific binding and dissociation process alters the outcome of the simulation. While the fraction $N_b/N_f$ remains constant, the fraction $N_s/N_f$ is rising with an increased unspecific binding rate. This means that the resulting effective specific binding rate is increased due to the faster unspecific binding process. This result is in agreement with the findings of Berg [12], stating that the unspecific association rate should be as large as possible. Unbinding makes the bound molecules temporarily mobile such that they can explore their neighborhood, while the stronger binding catches them before they really diffuse away. Similar results have been reported by Zhou and Szabo [16] for a nonspecific attractive potential, which increases the local TF concentration at the target site. Likewise a study in [14] shows that fast transitions between the bound and unbound state allow lumping them into one state. Accordingly the ability to react (based on collisions) from the mobile state and the closer location from


the bound state are mixed ([14], Chapter 5). Thus also the unspecifically bound TFs can interact with the target site, and make use of their high local concentration along the DNA. Based on the theory of diffusion-controlled reactions also the reduced mobility of the TFs due to unspecific binding comes into play. Not only associations can be diffusion limited, but also the dissociation process in reversible reactions: if the molecule does not diffuse away fast enough, its fate will be a geminate recombination rather than the unbound state [17]. We found that the increased number of $N_s$ strongly depends on the increased fraction of geminate recombinations.

Figure 3 shows that the effective $k_{site}^{on}$ is increased while $k_{site}^{off}$ is reduced if we fit our simulated data to the following ODE model:

$$\frac{d[N]_t}{dt} = -k_{site}^{on} [N]_t [N]_f + k_{site}^{off} [N]_s$$
$$\frac{d[N]_f}{dt} = -k_{site}^{on} [N]_t [N]_f + k_{site}^{off} [N]_s$$
$$\frac{d[N]_s}{dt} = +k_{site}^{on} [N]_t [N]_f - k_{site}^{off} [N]_s,$$

from which the steady state of Equation (2) can be derived. If instead of $[N]_f$ the total TF concentration ($[N]_f + [N]_b$) is used, the effective $k_{site}^{on}$ is lower accordingly. We are planning to investigate with more detailed simulations the change in the effective rate constants and the mixing of the bound and free TF state. On the level of the ODE model we therefore will include the corresponding states. The jumps between two unspecifically bound phases of a TF can look like a sliding effect if the respective rates become faster: a sliding step is nothing but an immediate rebinding of the TF to the DNA with a minimal displacement along it. Frequent changes between the bound and unbound state can in addition cause anomalous diffusion of the TF [13]. Arrestingly, a reaction-diffusion analysis of target site search on a wormlike chain DNA also reveals superdiffusive behavior [18]. Anomalous diffusion, in turn, can have a significant impact on the reaction rates. Finally it is also worth noting that the reversible binding process plays an important role in other cellular processes, for instance at the plasma membrane with respect to receptor clustering [19] or on the level of the signaling cascade [17]. The microcompartmentalization and localization of certain molecules also depends on such reactions [20, 21].

4. CONCLUSIONS

Our results show that the activity of the genes can be controlled by two factors, namely

1. The fraction $k_{DNA}^{off}/k_{DNA}^{on}$, which determines the fraction of unbound transcription factors that can diffuse around to search for the target site.

2. The speed of the unspecific binding process, namely the rate constant $k_{DNA}^{on}$, which couples the bound and unbound state.

The present results show the importance of stochastic modeling and especially the necessity to include spatial aspects into the models. In the present example the results from the spatial stochastic model strongly differ from those obtained in a deterministic ODE model with the same rate constant. The rate constants have to be adjusted in the ODE model in order to reproduce the results from the spatial simulation [12]. We will quantify this effect on the effective rate constants in future work also in a non-spatial stochastic model and extend it to investigate also enhanced reaction processes on membranes [16].

[Figure 2 plot: number of activated binding sites (0 to 700) versus time (0 to 15 min); curves: simulations with fast unspecific binding; simulation and ODE without unspecific binding; simulations with slow unspecific binding; ODE model with unspecific binding.]

Figure 2. Results of the TF-DNA binding study. Bold curve: without unspecific binding to the DNA. The other curves include the unspecific binding to the DNA with $K_{eq} = 2.8$ and parameters given in Table 1. Faster unspecific binding leads to a higher number of TFs that are bound to a target site. The dashed lines indicate mean ± standard deviation of the stationary distribution.
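As a rough cross-check of the deterministic reference case without unspecific binding, the ODE model can be evaluated with the parameter values quoted in the text. The sketch below is an illustration under stated assumptions, not the authors' implementation: it converts $k_{site}^{on}$ from M$^{-1}$ s$^{-1}$ to µm³/s using Avogadro's number, works with molecule numbers instead of concentrations (dividing the bimolecular term by the nucleus volume), and uses a plain explicit Euler scheme with a hypothetical step size. All function and variable names are made up for this example.

```python
import math

AVOGADRO = 6.02214076e23   # molecules per mole
UM3_PER_LITRE = 1e15       # 1 L = 1e15 um^3

N_TF, N_SITES = 2000, 1000                      # TFs and target sites in the model nucleus
K_ON_MOLAR, K_OFF = 2.66e3, 0.01                # k_site^on [1/(M s)], k_site^off [1/s]
VOLUME = 4.0 / 3.0 * math.pi * 0.75**3          # nucleus volume [um^3], radius 0.75 um
K_ON = K_ON_MOLAR * UM3_PER_LITRE / AVOGADRO    # k_site^on converted to [um^3/s]


def steady_state_bound_sites():
    """Solve N_s = K (N_TF - N_s)(N_SITES - N_s) with K = k_on / (k_off V)."""
    K = K_ON / (K_OFF * VOLUME)
    b = K * (N_TF + N_SITES) + 1.0
    disc = b * b - 4.0 * K * K * N_TF * N_SITES
    return (b - math.sqrt(disc)) / (2.0 * K)   # physically meaningful (smaller) root


def integrate_bound_sites(t_end=900.0, dt=0.1):
    """Explicit Euler integration of dN_s/dt = (k_on / V) N_t N_f - k_off N_s."""
    n_s = 0.0
    for _ in range(int(round(t_end / dt))):
        n_f, n_t = N_TF - n_s, N_SITES - n_s
        n_s += dt * (K_ON / VOLUME * n_t * n_f - K_OFF * n_s)
    return n_s


n_s_steady = steady_state_bound_sites()
n_s_15min = integrate_bound_sites()   # 900 s = 15 min, the time span shown in Figure 2
```

Under these assumptions the steady state lies close to the value $N_s = 299$ listed in Table 1 for the case without unspecific binding, which is consistent with the statement that simulation and ODE agree in that case.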

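[Figure 3: relative effective on and off rate constants versus the unspecific unbinding rate constant [1/s]; the on rate constant is shown based on $N_f$ and based on $(N_f + N_b)$.]

Figure 3. Relative effective $k_{site}^{off}/k_{site,0}^{off}$ and $k_{site}^{on}/k_{site,0}^{on}$ obtained by fitting the simulation result to the ODE model. $k_{site}^{on}/k_{site,0}^{on}$ is also calculated for activation based on all TFs $(N_f + N_b)$ or only the free ones $N_f$.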


5. ACKNOWLEDGMENTS

Funding: SystemsX.ch initiative and the Swiss Confederation’s Commission for Technology and Innovation (CTI) project 12532.1 PFLS-LS. H. Koeppl acknowledges the support from the Swiss National Science Foundation, grant no. PP00P2 128503.

6. REFERENCES

[1] U. Gerland, J. Moroz, and T. Hwa, “Physical constraints and functional characteristics of transcription factor–DNA interaction,” PNAS, vol. 99, no. 19, pp. 12015–12020, 2002.

[2] P. von Hippel and O. Berg, “Facilitated target location in biological systems,” Journal of Biological Chemistry, vol. 264, no. 2, pp. 675–678, 1989.

[3] T. Hardiman, H. Meinhold, J. Hofmann, J. Ewald, M. Siemann-Herzberg, and M. Reuss, “Prediction of kinetic parameters from DNA-binding site sequences for modeling global transcription dynamics in Escherichia coli,” Metabolic engineering, vol. 12, no. 3, pp. 196–211, 2010.

[4] S. Halford, “An end to 40 years of mistakes in DNA-protein association kinetics?,” Biochem. Society trans., vol. 37, pp. 343–348, 2009.

[5] J. Schonhoft and J. Stivers, “Timing facilitated site transfer of an enzyme on DNA,” Nature Chemical Biology, 2012.

[6] A. Kolomeisky and A. Veksler, “How to accelerate protein search on DNA: Location and dissociation,” J Chem Phys, vol. 136, pp. 125101, 2012.

[7] Lieberman-Aiden et al., “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome,” Science, vol. 326, no. 5950, pp. 289–293, 2009.

[8] K. Klenin, H. Merlitz, J. Langowski, and C. Wu, “Facilitated diffusion of DNA-binding proteins,” Phys.rev.let., vol. 96, no. 1, pp. 18104, 2006.

[9] A. Riggs, S. Bourgeois, and M. Cohn, “The lac repressor-operator interaction. 3. Kinetic studies,” J. mol. biol., vol. 53, no. 3, pp. 401–417, 1970.

[10] S. Halford and J. Marko, “How do site-specific DNA-binding proteins find their targets?,” Nucleic acids research, vol. 32, no. 10, pp. 3040, 2004.

[11] Z. Wunderlich and L. Mirny, “Spatial effects on the speed and reliability of protein-DNA search,” Nucleic Acids Research, vol. 36, no. 11, pp. 3570–3578, 2008.

[12] O. Berg, “Surface diffusion as a rate-enhancing mechanism in macromolecular association: The effectiveness of one-and two-dimensional surface sliding,” in Makromolekulare Chemie. Macromolecular Symposia. Wiley Online Library, 1988, vol. 17, pp. 161–175.

[13] M. Klann, A. Lapin, and M. Reuss, “Agent-based simulation of reactions in the crowded and structured intracellular environment: Influence of mobility and location of the reactants,” BMC Systems Biology, vol. 71, no. 5, 2011.

[14] M. Klann, Development of a Stochastic Multi-Scale Simulation Method for the Analysis of Spatiotemporal Dynamics in Cellular Transport and Signaling Processes, Ph.D. thesis, Universität Stuttgart, Germany, 2011.

[15] S. Andrews and D. Bray, “Stochastic simulation of chemical reactions with spatial resolution and single molecule detail,” Phys. Biol., vol. 1, pp. 137–151, 2004.

[16] H. Zhou and A. Szabo, “Enhancement of association rates by nonspecific binding to DNA and cell membranes,” Phys Rev Let, vol. 93, no. 17, pp. 178101, 2004.

[17] M. Morelli and P. Ten Wolde, “Reaction Brownian dynamics and the effect of spatial fluctuations on the gain of a push-pull network,” J Chem Phys, vol. 129, pp. 054112, 2008.

[18] M. de la Rosa, E. Koslover, P. Mulligan, and A. Spakowitz, “Dynamic strategies for target-site search by DNA-binding proteins,” Biophys J, vol. 98, no. 12, pp. 2943–2953, 2010.

[19] A. Mugler, A. Bailey, K. Takahashi, and P. Wolde, “Membrane clustering and the role of rebinding in biochemical signaling,” Biophys J, vol. 102, no. 5, pp. 1069–1078, 2012.

[20] A. Berezhkovskii and L. Dagdug, “Effect of binding on escape from cavity through narrow tunnel,” J Chem Phys, vol. 136, pp. 124110, 2012.

[21] M. Byrne, M. Waxham, and Y. Kubota, “The impacts of geometry and binding on CaMKII diffusion and retention in dendritic spines,” Journal of computational neuroscience, vol. 31, no. 1, pp. 1–12, 2011.


MODELING AND ANALYSIS OF DIVISION-, AGE-, AND LABEL-STRUCTURED CELL POPULATIONS

Patrick Metzger, Jan Hasenauer and Frank Allgöwer

Institute for Systems Theory and Automatic Control, Pfaffenwaldring 9, 70550 Stuttgart, Germany
{patrick.metzger, jan.hasenauer, frank.allgower}@ist.uni-stuttgart.de

ABSTRACT In this paper a model for cell proliferation is presented. This model describes the dynamics of a division-, age-, and label-structured population (DALSP), using a system of coupled partial differential equations (PDEs). We show that this system of PDEs can be decomposed into two lower-dimensional models. For both models, analytical solutions are presented which provide an analytical solution of the DALSP model.


1. INTRODUCTION

Cell proliferation is an essential biological process. It plays an eminent role in many different fields of the life sciences, such as immunology, cancer growth or stem cell induced tissue remodeling. To analyze proliferation dynamics, proliferation assays are used. These assays use fluorescent dyes, for instance carboxyfluorescein succinimidyl ester (CFSE), which are split among the daughter cells at cell division. The resulting time-dependent fluorescence distribution conveys information about the subpopulation structure. To study proliferation processes on the level of the population, mainly four modeling approaches are used: exponential growth (EG) models, division-structured population (DSP) models, age-structured population (ASP) models, and label-structured population (LSP) models. EG models describe the dynamics of the overall cell population size using a single ODE [1]. DSP models are systems of ODEs and take into account that cells have undergone different numbers of divisions [2]. The PDE-based ASP models have been introduced to study the age of individual cells and age-dependent division rates [3]. Finally, the LSP model describes the time-dependent label distribution within the cell population [4, 5] using a single PDE. While DSP models and ASP models enable the detailed description of the individual cell (focusing on different aspects), the LSP model allows for a direct comparison of model prediction and data. To partially combine these two aspects, we introduced the division- and label-structured population (DLSP) model [6, 7]. The DLSP model describes the dynamics of cell proliferation by employing a system of PDEs. Unfortunately, it still

does not account for the age of the individual cells. This is critical as the rates of cell division and cell death are clearly time-dependent. To overcome this shortcoming, we introduce a population model which accounts for the single cell properties cell age, division number, and label concentration. The model is denoted as the division-, age- and label-structured population (DALSP) model. The dependencies of the different model classes are illustrated in Figure 1. In Section 2, the DALSP model is introduced. This system of PDEs is decomposed into two lower-dimensional systems of PDEs, for which analytical solutions are presented. To allow for a biological interpretation, we discuss the relation between the rates contained in the model and biological properties, such as the time-dependent probability of cell division. In Section 3 we present a small simulation study, before we conclude the paper in Section 4.

Notation: In the following, we denote the probability density of property x by $p(x)$, the number density of property x by $n(x)$, and the number of cells by $N$. The difference between probability densities and number densities is that the integral over the former is one.

[Figure 1: diagram arranging the EG, DSP, ASP, LSP, DLSP, DASP and DALSP models along the axes age a, label x and division number i.]

Figure 1. Illustration of existing models, their properties and the proposed DALSP model. Furthermore, the DASP model, which results as a byproduct of our analysis of the DALSP model, is illustrated.


2. METHODS

2.1. Formulation of DALSP model

The state variable of the DALSP model is the joint number density of age, label concentration and cell division number, $n(a, x, i \mid t)$. Given $n(a, x, i \mid t)$, the number of cells with $i$ divisions, $a \in \Omega_a$, and $x \in \Omega_x$ is

$$\int_{\Omega_a} \int_{\Omega_x} n(a, x, i \mid t) \, dx \, da.$$

The dynamics of $n(a, x, i \mid t)$ are governed by a system of PDEs,

$$\frac{\partial n(a, x, i \mid t)}{\partial t} + \frac{\partial n(a, x, i \mid t)}{\partial a} + \frac{\partial \big(\nu(t, x)\, n(a, x, i \mid t)\big)}{\partial x} = -\big(\alpha_i(t, a) + \beta_i(t, a)\big)\, n(a, x, i \mid t), \qquad (1)$$

with initial conditions (ICs)

$$i = 0: \; n(a, x, 0 \mid 0) = n_0(a)\, p_0(x), \qquad \forall i \geq 1: \; n(a, x, i \mid 0) = 0,$$

and boundary conditions (BCs)

$$i = 0: \; n(0, x, 0 \mid t) = 0, \qquad \forall i \geq 1: \; n(0, x, i \mid t) = 2\gamma \int_{\mathbb{R}_+} \alpha_{i-1}(t, a)\, n(a, \gamma x, i-1 \mid t)\, da,$$

with $\mathbb{R}_+ := [0, \infty)$. The i-th PDE describes the dynamics of the subpopulation of cells with i cell divisions. The product structure of the initial condition, $n(a, x, 0 \mid 0) = n_0(a)\, p_0(x)$, is assumed to allow for a simple solution algorithm; however, a generalization is straightforward. The function $\alpha_i(t, a): \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ is the rate at which cells of subpopulation $i$ transition into subpopulation $i + 1$ due to cell division, whereas $\beta_i(t, a): \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ denotes the rate of cell death in the i-th subpopulation. The function $\nu(t, x): \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}$ denotes the label dilution in each cell. To ensure the existence of $n(a, x, i \mid t)$ it is assumed that $\alpha_i(t, a)$, $\beta_i(t, a)$, $\nu(t, x)$, and $p_0(x)$ are in $C^0$ with respect to $t$, $a$ and $x$. The fluxes affecting the distributions $n(a, x, i \mid t)$ are:

- $\partial \big(\nu(t, x)\, n(a, x, i \mid t)\big) / \partial x$, change of label x with rate $\nu(t, x)$,
- $\partial n(a, x, i \mid t) / \partial a$, increase of cell age a,
- $\big(\alpha_i(t, a) + \beta_i(t, a)\big)\, n(a, x, i \mid t)$, loss of cells from the i-th subpopulation due to cell division with rate $\alpha_i(t, a)$ and due to cell death with rate $\beta_i(t, a)$,
- $2\gamma \int_{\mathbb{R}_+} \alpha_{i-1}(t, a)\, n(a, \gamma x, i-1 \mid t)\, da$, birth of cells in the i-th subpopulation with age $a = 0$.

The label dilution at cell division is determined by the factor $\gamma \in (1, 2]$. For the detailed derivation of the model and more detailed versions of the following proofs we refer to [8].

Remark 1 In the remainder, we consider time-dependent linear degradation rates, $\nu(t, x) = -k(t)\, x$, which have been shown to allow for an accurate description of the degradation dynamics [5].

2.2. Information provided by DALSP model

The number density $n(a, x, i \mid t)$ contains a variety of biologically relevant information about the cell population. By marginalizing $n(a, x, i \mid t)$ over the label concentration, x, the joint number density of age and cell division number,

$$n(a, i \mid t) = \int_{\mathbb{R}_+} n(a, x, i \mid t)\, dx,$$

is obtained. Marginalization over the cell age, a, yields

$$n(x, i \mid t) = \int_{\mathbb{R}_+} n(a, x, i \mid t)\, da,$$

the joint number density of label and cell division number. Moreover, the number of cells which underwent i divisions is given by

$$N(i \mid t) = \int_{\mathbb{R}_+} \int_{\mathbb{R}_+} n(a, x, i \mid t)\, dx\, da.$$

The number of cells in the overall population is

$$N(t) = \sum_{i \in \mathbb{N}_0} N(i \mid t).$$

The overall label distribution, that is, the label density of the complete cell population, is obtained by summing $n(x, i \mid t)$ over the division number i. It is denoted by

$$n(x \mid t) = \sum_{i \in \mathbb{N}_0} n(x, i \mid t).$$

In addition, $\forall i \in \mathbb{N}_0$ and $\forall t \in \mathbb{R}_+$ for which $N(i \mid t) > 0$,

$$p(x \mid i, t) = \frac{n(x, i \mid t)}{N(i \mid t)}$$

defines the probability density that a cell which divided i times has label concentration x.

2.3. Solution of the DALSP model

Systems of coupled PDEs are in general difficult to solve. Fortunately, for (1) a complexity reduction can be achieved by decomposition into two models.

Theorem 1 The solution of (1) (for $\nu(t, x) = -k(t)\, x$) is

$$n(a, x, i \mid t) = n(a, i \mid t)\, p(x \mid i, t) \qquad (2)$$

in which $n(a, i \mid t)$ solves

$$\frac{\partial n(a, i \mid t)}{\partial t} + \frac{\partial n(a, i \mid t)}{\partial a} = -\big(\alpha_i(t, a) + \beta_i(t, a)\big)\, n(a, i \mid t), \qquad (3)$$

with ICs

$$i = 0: \; n(a, 0 \mid 0) = n_0(a), \qquad \forall i \geq 1: \; n(a, i \mid 0) = 0,$$

and BCs

$$i = 0: \; n(0, 0 \mid t) = 0, \qquad \forall i \geq 1: \; n(0, i \mid t) = 2 \int_{\mathbb{R}_+} \alpha_{i-1}(t, a)\, n(a, i-1 \mid t)\, da, \qquad (4)$$

and $p(x \mid i, t)$ solves

$$\frac{\partial p(x \mid i, t)}{\partial t} + \frac{\partial \big(\nu(t, x)\, p(x \mid i, t)\big)}{\partial x} = 0, \qquad (5)$$

with ICs

$$\forall i \in \mathbb{N}_0: \; p(x \mid i, 0) = \gamma^i\, p_0(\gamma^i x).$$

Proof of Theorem 1. To prove Theorem 1, we verify that $n(a, x, i \mid t) = n(a, i \mid t)\, p(x \mid i, t)$ with the dynamics of $n(a, i \mid t)$ and $p(x \mid i, t)$ governed by (3) and (5) solves (1). Therefore, equivalence of the dynamics, the ICs and the BCs is checked. The left-hand side $(\star)$ of the original PDE model (1) is reformulated by inserting the ansatz (2), yielding

$$(\star) = \left(\frac{\partial n(a, i \mid t)}{\partial t} + \frac{\partial n(a, i \mid t)}{\partial a}\right) p(x \mid i, t) + n(a, i \mid t) \underbrace{\left(\frac{\partial p(x \mid i, t)}{\partial t} + \frac{\partial \big(\nu(t, x)\, p(x \mid i, t)\big)}{\partial x}\right)}_{=\,0, \text{ due to (5)}}.$$

Subsequently, the first term in brackets is substituted with the right-hand side of (3), yielding

$$(\star) = -\big(\alpha_i(t, a) + \beta_i(t, a)\big)\, n(a, i \mid t)\, p(x \mid i, t).$$

This is identical to the result obtained by inserting the ansatz (2) in the right-hand side of (1), proving the consistency of the dynamics. The equivalence of the ICs follows directly by comparison. To verify the equivalence of the BCs, they are equated,

$$\forall i \in \mathbb{N}_0: \; n(0, x, i \mid t) \overset{!}{=} n(0, i \mid t)\, p(x \mid i, t). \qquad (6)$$

For $i = 0$, these BCs vanish and the identity is ensured. In contrast, for $i \geq 1$ from (6) it follows that

$$2\gamma \int_{\mathbb{R}_+} \alpha_{i-1}(t, a)\, n(a, i-1 \mid t)\, p(\gamma x \mid i-1, t)\, da \overset{!}{=} 2 \int_{\mathbb{R}_+} \alpha_{i-1}(t, a)\, n(a, i-1 \mid t)\, da \; p(x \mid i, t).$$

Hence, $\gamma\, p(\gamma x \mid i-1, t) = p(x \mid i, t)$ has to hold for equivalence of the BCs. This is indeed the case for $\nu(t, x) = -k(t)\, x$ and can be shown, e.g., using the analytical solution of (5) stated below (similar to [6, 7]). This verifies that the BCs are equivalent and concludes the proof. □

The decomposition of the DALSP model nicely illustrates that the label dynamics are in some sense orthogonal to the dynamics of the age and division number distribution. Therefore, the label dynamics can be separated and solved analytically using the method of characteristics, yielding

$$\forall i \in \mathbb{N}_0: \; p(x \mid i, t) = \gamma^i \exp\left(\int_0^t k(\tau)\, d\tau\right) p_0\left(\gamma^i \exp\left(\int_0^t k(\tau)\, d\tau\right) x\right)$$

(cf. [7]). Thus, only the division- and age-structured population (DASP) model (3) remains. This model describes the joint dynamics of age and division number, and generalizes the existing age-structured population models [3]. Here it is obtained as a byproduct of the analysis of the DALSP model. In the DASP model, the dynamics of the individual subpopulations are governed by a Von Foerster-like equation. For the Von Foerster equation, a solution has been derived in [9]. By applying this solution repeatedly to the individual subpopulations, we obtain for $i = 0$

$$n(a, 0 \mid t) = \begin{cases} n_0(a - t) \exp\left(-\int_0^t \delta_0(\tau, a - t + \tau)\, d\tau\right), & a \geq t, \\ 0, & \text{otherwise}, \end{cases}$$

and for $i \geq 1$

$$n(a, i \mid t) = \begin{cases} 0, & a \geq t, \\ 2 \int_{\mathbb{R}_+} \alpha_{i-1}(t - a, \tilde{a})\, n(\tilde{a}, i-1 \mid t - a)\, d\tilde{a} \; \exp\left(-\int_0^a \delta_i(\tilde{a} + t - a, \tilde{a})\, d\tilde{a}\right), & \text{otherwise}, \end{cases}$$

in which $\delta_i(t, a) := \alpha_i(t, a) + \beta_i(t, a)$. This solution can also be derived using the method of characteristics. The decomposition and the analytical solutions for (3) and (5) simplify the analysis of the DALSP model. Instead of dealing with a rather complex PDE system, the solution of the DALSP model $n(a, x, i \mid t)$ can then be determined by multiplying the solutions $p(x \mid i, t)$ and $n(a, i \mid t)$.

2.4. Relation between rates and single-cell properties

In the previous section, we introduced the DALSP model, which provides a population level description of cell proliferation. The properties of this cell population model can be related to the properties of single cells. The rates $\alpha_i$ and $\beta_i$ determine the occurrence of the events A (cell division) and B (cell death). Following [3, 8, 10], it can be shown that for $\beta_i(a) = 0$, the probability density of observing a cell division, A, of a cell with age a is

$$p(a, A \mid \alpha_i) = \alpha_i(a) \exp\left(-\int_0^a \alpha_i(\tilde{a})\, d\tilde{a}\right).$$

Accordingly, for $\alpha_i(a) = 0$, the probability density of observing a cell death, B, of a cell with age a is

$$p(a, B \mid \beta_i) = \beta_i(a) \exp\left(-\int_0^a \beta_i(\tilde{a})\, d\tilde{a}\right).$$

The probability density that either of the mutually exclusive events A or B occurs at age a is

$$p(a, A \cup B \mid \alpha_i, \beta_i) = p(a, A \mid \alpha_i) \left(1 - \int_0^a p(\tilde{a}, B \mid \beta_i)\, d\tilde{a}\right) + p(a, B \mid \beta_i) \left(1 - \int_0^a p(\tilde{a}, A \mid \alpha_i)\, d\tilde{a}\right),$$

(cf. [8]). Given the probability density $p(a, A \cup B \mid \alpha_i, \beta_i)$, also the probability density of cell division in the presence of cell death ($\beta_i(a) \neq 0$) can be determined,

$$p(a, A \mid \alpha_i, \beta_i) = p(a, A \cup B \mid \alpha_i, \beta_i)\, \frac{p(a, A \mid \alpha_i)}{p(a, A \mid \alpha_i) + p(a, B \mid \beta_i)}.$$

The equation for $p(a, B \mid \alpha_i, \beta_i)$ is alike.
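The analytical label density of Section 2.3 lends itself to a quick numerical check. The following sketch assumes a constant degradation rate k, so that the integral of k over [0, t] is simply kt, and a log-normal initial label density p0. Both choices, all function names and the parameters of p0 are illustrative assumptions and are not taken from the paper. The sketch evaluates $p(x \mid i, t) = \gamma^i e^{kt} p_0(\gamma^i e^{kt} x)$ and integrates it numerically.

```python
import math

K = 0.144    # label degradation rate [1/day], value used in Section 3
GAMMA = 2.0  # label dilution factor at division, value used in Section 3


def p0(x, mu=math.log(1000.0), sigma=0.3):
    """Hypothetical log-normal initial label density (an assumption)."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def label_density(x, i, t, k=K, gamma=GAMMA):
    """p(x | i, t) for nu(t, x) = -k * x with constant k."""
    scale = gamma ** i * math.exp(k * t)
    return scale * p0(scale * x)


def integrate(f, lo, hi, n=4000):
    """Trapezoidal rule on a logarithmic grid, using x = exp(u), dx = x du."""
    u_lo, u_hi = math.log(lo), math.log(hi)
    h = (u_hi - u_lo) / n
    total = 0.0
    for j in range(n + 1):
        x = math.exp(u_lo + j * h)
        weight = 0.5 if j in (0, n) else 1.0
        total += weight * f(x) * x
    return total * h


if __name__ == "__main__":
    for i in range(3):
        mass = integrate(lambda x: label_density(x, i, 4.0), 1e-3, 1e6)
        mean = integrate(lambda x: x * label_density(x, i, 4.0), 1e-3, 1e6)
        print(i, mass, mean)
```

Under these assumptions each density integrates to one, the mean label concentration is divided by γ with every division, and the relation $\gamma\, p(\gamma x \mid i-1, t) = p(x \mid i, t)$ required for the boundary conditions in the proof of Theorem 1 can be checked pointwise.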

3. ILLUSTRATIVE EXAMPLE

In this section, we compare a system with constant division rate, $\alpha_i(a) = 0.092$ [1/day], to a system with sigmoidal division rate, $\alpha_i(a) = 0.3\, a^8 / (4^8 + a^8)$. Similar to [6], we choose $\beta_i(a) = 0.05$ [1/day], $k = 0.144$ [1/day], and $\gamma = 2$ [-]. Figure 2(a) and (b) depict the label dynamics of the overall population during the first 16 days for both model alternatives. Although the constant and the sigmoidal division rate result in the same mean inter-division time, $\int_{\mathbb{R}_+} a\, p(a, A \mid \alpha_i, \beta_i)\, da$, there are key differences between the distributions. The most important difference is that the model with constant $\alpha_i$, which can be reduced to a DLSP model [6], predicts the existence of a significant number of cells with high division numbers. This is a result of the exponentially distributed division times obtained for constant division rates. This exponential distribution is biologically not plausible as arbitrarily small inter-division times are possible. As shown in Figure 2(b), this is not the case for the model with sigmoidal age dependence, which renders the model with age-dependent division rate biologically more plausible.
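To make the difference between the two division rates concrete, the sketch below evaluates the division-time density $p(a, A \mid \alpha_i) = \alpha_i(a) \exp(-\int_0^a \alpha_i(\tilde{a})\, d\tilde{a})$ of Section 2.4, that is, the case without cell death, for the constant and for the sigmoidal rate. It assumes the Hill form $0.3\, a^8 / (4^8 + a^8)$ for the sigmoidal rate. The function names, the age grid and the trapezoidal quadrature are illustrative choices and not part of the original study.

```python
import math


def alpha_const(a):
    """Constant division rate [1/day]."""
    return 0.092


def alpha_sigmoid(a):
    """Sigmoidal division rate; Hill form with a 4 day threshold (assumed)."""
    return 0.3 * a ** 8 / (4.0 ** 8 + a ** 8)


def division_density(alpha, a_max=200.0, n=20000):
    """Return the age step h and p(a, A | alpha) on a uniform age grid."""
    h = a_max / n
    density = []
    cumulative = 0.0  # running trapezoidal integral of alpha over [0, a]
    previous = alpha(0.0)
    for j in range(n + 1):
        current = alpha(j * h)
        if j > 0:
            cumulative += 0.5 * h * (previous + current)
        density.append(current * math.exp(-cumulative))
        previous = current
    return h, density


def trapezoid(values, h):
    """Trapezoidal rule for equally spaced samples."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def probability_before(alpha, age, a_max=200.0, n=20000):
    """Probability that a cell divides before reaching the given age."""
    h, density = division_density(alpha, a_max, n)
    last = int(round(age / h))
    return trapezoid(density[: last + 1], h)


if __name__ == "__main__":
    for name, rate in (("constant", alpha_const), ("sigmoidal", alpha_sigmoid)):
        print(name, probability_before(rate, 1.0))
```

For the constant rate the density is exponential and starts at its maximum, 0.092 per day at age zero, so a noticeable fraction of cells, $1 - e^{-0.092}$, divides within the first day. For the sigmoidal rate the density vanishes at age zero and division within the first day is practically excluded, which is the qualitative difference discussed above.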

[Figure 2: panels (a) and (b). Left: $n(x \mid t) \cdot x$ versus label concentration $x$ (logarithmic axis) at days 0, 4, 8, 12 and 16. Right: probability densities pAB, pA and pB versus cell age $a$ [day].]

Figure 2. Illustration of the population dynamics for (a) constant and (b) sigmoidal division rate, and the corresponding event distributions ($p_{AB}(a) = p(a, A \cup B \mid \alpha_i, \beta_i)$, $p_A(a) = p(a, A \mid \alpha_i, \beta_i)$, and $p_B(a) = p(a, B \mid \alpha_i, \beta_i)$).

4. CONCLUSION

In this paper, we introduced a novel model for cell proliferation dynamics. This model is an extension of the DLSP model [6, 7] and takes the label concentration, the division number, and the cell age into account. The dynamics of DALSP cell populations are governed by a system of PDEs, and we proved that this system can be decomposed into two lower-dimensional systems of PDEs (one coupled, one decoupled) for which analytical solutions were stated. Using these analytical solutions, the response of the DALSP model can be determined efficiently, allowing for advanced analysis tools and parameter estimation. Finally, the link between division and death rates and the corresponding age distribution for the individual cell was established. The linking of population level information and single-cell properties is crucial to gain biological insight into the process, e.g., to design appropriate experiments. This might be improved further by accounting for different cell types, as discussed in [11].

5. ACKNOWLEDGEMENTS

The authors acknowledge financial support by the German Research Foundation (DFG) within the Cluster of Excellence in Simulation Technology (EXC 310/1) at the University of Stuttgart.

6. REFERENCES

[1] M. Zwietering, I. Jongenburger, F. Rombouts, and K. van ’t Riet, “Modeling of the bacterial growth curve,” Appl. Environ. Microbiol., vol. 56, no. 6, pp. 1875–1881, 1990.

[2] R. De Boer, V. Ganusov, D. Milutinović, P. Hodgkin, and A. Perelson, “Estimating lymphocyte division and death rates from CFSE data,” Bull. Math. Biol., vol. 68, no. 5, pp. 1011–1031, 2006.

[3] G.F. Webb, “α- and β-curves, sister-sister and mother-daughter correlations in cell population dynamics,” Computers Math. Appl., vol. 18, no. 10–11, pp. 973–984, 1989.

[4] T. Luzyanina, D. Roose, T. Schenkel, M. Sester, S. Ehl, A. Meyerhans, and G. Bocharov, “Numerical modelling of label-structured cell population growth using CFSE distribution data,” Theor. Biol. Med. Model., vol. 4, pp. 26, 2007.

[5] H.T. Banks, K.L. Sutton, W.C. Thompson, G. Bocharov, M. Doumic, T. Schenkel, J. Argilaguet, S. Giest, C. Peligero, and A. Meyerhans, “A new model for the estimation of cell proliferation dynamics using CFSE data,” J. Immunological Methods, vol. 373, no. 1–2, pp. 143–160, 2011.

[6] D. Schittler, J. Hasenauer, and F. Allgöwer, “A generalized population model for cell proliferation: integrating division numbers and label dynamics,” in Proc. of Workshop Comp. Syst. Biol., Zürich, Switzerland, pp. 165–168, 2011.

[7] J. Hasenauer, D. Schittler, and F. Allgöwer, “A computational model for proliferation dynamics of division- and label-structured populations,” Tech. Rep., arXiv:1202.4923v1 [q-bio.PE], 2012.

[8] P. Metzger, “A unified growth model for division-, age- and label-structured cell populations,” Diploma Thesis, University of Stuttgart, Stuttgart, Germany, 2012.

[9] E. Trucco, “Mathematical models for cellular systems: The Von Foerster equation: Part 1,” Bull. Math. Biophysics, vol. 27, no. 3, pp. 285–304, 1965.

[10] O. Diekmann, M. Gyllenberg, J. Metz, and H. Thieme, “On the formulation and analysis of general deterministic structured population models. I. Linear theory,” J. Math. Biol., vol. 36, no. 4, pp. 349–388, 1998.

[11] D. Schittler, J. Hasenauer, and F. Allgöwer, “Extension of a proliferation model to account for cell types,” accepted for publication at the Workshop Comp. Syst. Biol., Ulm, Germany, 2012.


IDENTIFICATION OF FEEDBACK CIRCUITS THAT ARE CONNECTED TO MULTIPLE FIXED POINTS IN BIOLOGICAL NETWORKS

Nicole Radde

Institute for Systems Theory and Automatic Control, University of Stuttgart, Pfaffenwaldring 9, 70569 Stuttgart
Email: [email protected]

ABSTRACT

This paper introduces a new approach to identify circuits within a regulatory network that can cause the existence of multiple fixed points in the overall system. The method is based on the circuit-breaking algorithm, an algorithm that uses the topology of the interaction graph to make fixed point calculations in nonlinear systems more efficient. We apply our method to a regulatory network model for cytoplasmic calcium oscillations based on experiments in hepatocytes and reveal the circuits that can give rise to multiple fixed points.

1. INTRODUCTION

Describing the dynamic behavior of intracellular regulation networks using ordinary differential equations that are based on chemical reaction kinetics has become a standard approach in recent years. Simulation of the dynamic behavior of those systems requires knowledge of initial concentrations and kinetic rate constants, which are often not known and also hardly accessible experimentally. On the other hand, at least for some regulatory modules it is known which molecules react with each other, i.e. there is information available on the interaction graph (I-graph) topology. This has facilitated research on the question of which information about the behavior of the system is encoded solely in the topology of the I-graph and is thus independent of the exact parameters. In this context, feedback circuits are important network motifs that are linked to complex dynamic behavior such as multistability or oscillations. While the role of single circuits has already been investigated intensively (see for example [1, 2]), several studies indicate that the interconnection of multiple feedback circuits increases functional robustness [3]. So far, there exist only few approaches for the analysis of more complex networks with interrelated circuits. One example is described in [4], where methods from linear systems theory are used to determine subnetworks that are mainly responsible for bifurcations which destabilize a fixed point.

This method as well as many others assume the fixed points of the system to be known. While stable fixed points can in principle be found via simulations, this is not possible for unstable fixed points. Moreover, with simulations one can never guarantee to get the complete set of fixed points of the system. An approach for fixed point calculations, the circuit breaking algorithm (CBA), was introduced in [5]. The CBA uses the graph topology to make fixed point calculations more efficient by exploiting independence relations between the variables. This is done by constructing a one-dimensional characteristic $c(\kappa)$ whose zeros correspond to the fixed points of the system. We have extended this work in [6] and showed that $c(\kappa)$ also partly contains information about the stability of the fixed points. Here we demonstrate how our algorithm can be used to identify circuits that can give rise to multiple fixed points within a larger network. The approach is applied to a regulatory network model for calcium oscillations in hepatocytes [7], which consists of several interrelated positive and negative feedback circuits.

This work was supported by the German Research Foundation (DFG) within the Cluster of Excellence in Simulation Technology (EXC 310/1) at the University of Stuttgart.

2. IDENTIFICATION OF CIRCUITS THAT LEAD TO MULTIPLE FIXED POINTS

2.1. The circuit-breaking algorithm

We start by explaining the rough idea of the CBA, which is described in more detail in [5]. We consider a regulatory network model, that is, a dynamical system

$$\dot{x} = f(x), \qquad x \in \mathbb{R}^n, \quad f: \mathbb{R}^n \to \mathbb{R}^n \in C^1 \qquad (1)$$

with underlying I-graph $G(V, E)$ whose vertices correspond to the system’s variables and whose edges indicate regulatory influences. Formally, the edges are defined as

$$e_{ii} \in E \;\Leftrightarrow\; \exists x \text{ with } \frac{\partial f_i(x)}{\partial x_i} > 0 \qquad (2)$$

$$e_{ij} \in E \;\Leftrightarrow\; \exists x \text{ with } \frac{\partial f_i(x)}{\partial x_j} \neq 0. \qquad (3)$$

To keep notation simple we consider for the moment I-graphs that are strongly connected. A generalization to arbitrary graph topologies is straightforward. In the first step we break all circuits in the I-graph by assigning fixed values to a subset of the variables, which means that we skip all regulations onto those vertices. Let us assume that we fix vertices $\tilde{V} = \{v_1, \ldots, v_k\}$ by setting $v_i = \kappa_i$ and collecting these values in a vector $\kappa \in \mathbb{R}^k$. The fixed point coordinates $\bar{x}_j(\kappa)$ of the remaining vertices $\hat{V} = \{v_{k+1}, \ldots, v_{|V|}\}$ can then be calculated in dependence of the input $\kappa$. In the following steps the circuits are iteratively closed by releasing vertices in $\tilde{V}$ one after another, starting with vertex $v_k$ and going backwards until finally $v_1$ is released. Mathematically, this translates into reducing $\kappa$ by its last entry $\kappa_z$, shifting the respective vertex $v_z$ from $\tilde{V}$ to $\hat{V}$ and solving the implicit equation

$$f_z(\kappa_z, \kappa, \bar{x}_j(\kappa, \kappa_z)) = 0, \qquad j = z + 1, \ldots, |V| \qquad (4)$$

for $\kappa_z$, which gives the set $\bar{x}_z(\kappa)$ of fixed point coordinates of variable $x_z$ in dependence of the reduced input $\kappa$. The fixed point coordinates of the remaining variables in $\hat{V}$ can easily be adapted by inserting these values for $\kappa_z$ into $\bar{x}_j(\kappa)$. We also refer to the left hand side of equation (4) as partial circuit characteristic $p_z(\kappa, \kappa_z): \mathbb{R}^{|\kappa|+1} \to \mathbb{R}$. Since the dimension of $\kappa$ is reduced in each circuit-closing step, it is empty when the last vertex, here $v_1$, is released, and thus $p_1(\kappa_1): \mathbb{R} \to \mathbb{R}$ is a one-dimensional characteristic, the circuit-characteristic $c_1(\kappa_1)$ of the system. Its zeros correspond to the fixed point coordinates $\bar{x}_1$ of variable $x_1$. The respective coordinates of the other vertices are obtained by inserting those values into the partial circuit-characteristics in reverse order.

2.2. Using the CBA for subnetwork identification

The role of different subnetworks can be investigated by repeated application of the CBA, whereby the vertices in $\tilde{V}$ are released in different orders, and analysis of the corresponding partial circuit characteristics. Since the graphs $(\kappa, \bar{x}_j(\kappa))$ are bifurcation diagrams with bifurcation parameter $\kappa$ that describe the set of fixed points of the subgraph spanned by vertices in $\hat{V}$ in dependence of the input $\kappa$, these contain the information about the maximal possible number of fixed points caused by this subgraph. If this number is increased when releasing another vertex in the I-graph, it is clear that these additional fixed points can be assigned to the circuits that have been closed in this step. Note that the actual number of fixed points of a subnetwork might vary with $\kappa$, since the subnetwork might undergo a bifurcation if $\kappa$ is varied. Thus, since closing circuits corresponds to assigning fixed values to the components of $\kappa$, the respective circuits might not always be active, i.e. they do not cause the appearance of additional fixed points in the overall network if the values are not in the right range. However, this can also be seen in the characteristics, and might help to find appropriate parameters for a desired fixed point set.

3. APPLICATION TO A MODEL FOR CALCIUM OSCILLATIONS IN HEPATOCYTES

We demonstrate our approach on a network model for cytoplasmic calcium oscillations based on experiments in hepatocytes, which is described in [7]:

$$\begin{aligned}
\dot{x}_1 &= k_1 + k_2 x_1 - k_3 x_2\, m(x_1, \theta_1) - k_4 x_3\, m(x_1, \theta_2) \\
\dot{x}_2 &= k_5 x_1 - k_6\, m(x_2, \theta_3) \\
\dot{x}_3 &= k_7 x_2 x_3\, m(x_4, \theta_4) + k_8 x_2 + k_9 x_1 - k_{10}\, m(x_3, \theta_5) - k_{11}\, m(x_3, \theta_6) \\
\dot{x}_4 &= -k_7 x_2 x_3\, m(x_4, \theta_4) + k_{11}\, m(x_3, \theta_6)
\end{aligned} \qquad (5)$$

with $x = (\mathrm{Ga}^*, \mathrm{PLC}^*, \mathrm{Ca}_{cyt}, \mathrm{Ca}_{er})$ denoting concentrations of an active G-protein linked receptor, active phospholipase C enzyme, free calcium in the cytoplasm, and calcium in the endoplasmic reticulum, respectively. The functions

$$m(x, \theta) = \frac{x}{x + \theta} \qquad (6)$$

are Michaelis-Menten terms, and all rate constants $k$ and Michaelis constants $\theta$ are non-negative. The model can show periodic oscillations, multiple fixed points and bursting. We are interested in the fixed points of the system in the positive orthant for parameter values used in [8] (Table 1). The I-graph of the system is strongly connected and contains seven circuits, all of which can be broken by setting $\tilde{V} = \{v_1, v_3\}$. Figure 1 illustrates the circuit-closing steps during the CBA in case that $x_1$ is released first (top row) and in case that $x_3$ is released first (bottom row). In the first case we set $x_1 = \kappa_2$ and $x_3 = \kappa_1$. The fixed point coordinates of the other two vertices are given by

$$\bar{x}_2(\kappa) = \frac{k_5 \kappa_2 \theta_3}{k_6 - k_5 \kappa_2} \qquad (7)$$

$$\bar{x}_4(\kappa) = \frac{k_{11}\, m(\kappa_1, \theta_6)\, \theta_4}{k_7\, \bar{x}_2(\kappa_2)\, \kappa_1 - k_{11}\, m(\kappa_1, \theta_6)}. \qquad (8)$$

The partial circuit characteristic reads $p_1(\kappa) = f_1(x_1 = \kappa_2, x_2 = \bar{x}_2(\kappa), x_3 = \kappa_1, x_4 = \bar{x}_4(\kappa))$, whose zeros $\kappa_2(\kappa_1) = \bar{x}_1(\kappa_1)$, together with the fixed point coordinates of the other variables, are shown in Figure 2 (left column). It can be seen that the subgraph spanned by the vertices $v_1$, $v_2$ and $v_4$ with the two circuits $\{v_1\}$ and $\{v_1, v_2\}$ can show bistable behavior through two saddle node bifurcations. There are two stable fixed point branches indicated with gray and black solid lines, which are separated by an unstable one (dashed line). Note that the discontinuity in $\bar{x}_4$ comes from the fact that $f(x)$ is not continuously differentiable on the whole $\mathbb{R}^n$. The circuit-characteristic at the bottom of the figure has only one zero, which belongs to the gray fixed point branch, indicating that the system has one single fixed point. The coordinates $\bar{x} = (6.5, 10.0, 0.2, 23)$ can be read off the graphs as indicated. Thus, the subnetwork is not active in this case. Furthermore, all circuits that are closed in the second step, namely $\{v_3\}$, $\{v_1, v_3\}$, $\{v_1, v_2, v_3\}$, $\{v_3, v_4\}$ and $\{v_1, v_2, v_3, v_4\}$, cannot give rise to further fixed points, since the shape of the characteristic only allows for three fixed points at most. In comparison, releasing $v_3$ first leads to the partial circuit characteristic

$$p_3(\kappa) = f_3(x_1 = \kappa_1, x_2 = \bar{x}_2(\kappa), x_3 = \kappa_2, x_4 = \bar{x}_4(\kappa))$$


Table 1. Parameter set copied from [8].

k1    k2   k3    k4    k5    k6     k7   k8    k9     k10   k11
0.09  2    1.27  3.73  1.27  32.24  2    0.05  13.58  153   4.85

θ1    θ2    θ3     θ4    θ5    θ6
0.19  0.73  29.09  2.67  0.16  0.05

[Figure 1: I-graphs on the vertices 1 to 4 after each circuit-closing step, annotated with the fixed point coordinates and the conditions $p_1(\kappa_1, \kappa_2) = 0$, $p_3(\kappa_1, \kappa_2) = 0$ and $c(\kappa_1) = 0$.]

Figure 1. The CBA applied to the calcium oscillation network described in [7]. Upper row: variable $x_1$ is released first; lower row: variable $x_3$ is released first.


Figure 2. Fixed point sets x ¯() together with the circuit-characteristics c(1 ) in case that x1 is released first (left column) and x3 is released first (right column). The zeros of the characteristic are the fixed point coordinates of variable x3 (left) and x1 (right), respectively. The other fixed point coordinates can be read off the graphs, as indicated with lines here, and x ¯ = (6.5, 10, .2, 23).


with x1 = 1 and x3 = 2 . Its zeros 2 (1 ) = x ¯3 (1 ) are shown in Figure 2 (right column), along with the other fixed point coordinates. There is only a single fixed point branch, so the subnetwork spanned by v2 , v3 and v4 cannot give rise to multiple fixed points of the system, regardless of the value of 1 . Furthermore, the circuit characteristic shown at the bottom is not monotone, so it might have multiple zeros for appropriately chosen parameter values. Thus, an observed bistability in the system can be assigned to the circuits that are closed in the second step, i.e. {v1 }, {v1 , v2 }, {v1 , v3 }, {v1 , v2 , v3 } and {v1 , v2 , v3 , v4 }. Since the last three have already been excluded by our previous analysis, in which v1 was released first, we conclude that only the subgraph spanned by v1 and v2 can give rise to bistability in this system.
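The idea that the zeros of a circuit characteristic are the fixed points, and that the shape of the characteristic bounds their number, can be illustrated with a toy sketch. The one-variable system below, its Hill-type self-activation and all parameter values are hypothetical and chosen only for illustration; this is not the calcium model of [7], and the function names are assumptions.

```python
# Toy illustration only: a hypothetical one-variable system with one positive
# feedback circuit, dx/dt = b + k*x^2/(theta^2 + x^2) - gamma*x.
# It is NOT the calcium oscillation model discussed in the text.

def characteristic(kappa, b=0.04, k=1.0, theta=0.5, gamma=1.0):
    """Right-hand side evaluated at x = kappa; its zeros are the fixed points."""
    return b + k * kappa**2 / (theta**2 + kappa**2) - gamma * kappa


def find_zeros(c, lo, hi, steps=3000, tol=1e-10):
    """Scan [lo, hi] for sign changes of c and refine each one by bisection."""
    zeros = []
    width = (hi - lo) / steps
    for i in range(steps):
        left, right = lo + i * width, lo + (i + 1) * width
        c_left = c(left)
        if c_left * c(right) >= 0:
            continue  # no sign change in this cell
        # In one dimension a zero crossed from + to - is a stable fixed point.
        stable = c_left > 0
        while right - left > tol:
            mid = 0.5 * (left + right)
            if c_left * c(mid) <= 0:
                right = mid
            else:
                left, c_left = mid, c(mid)
        zeros.append((0.5 * (left + right), "stable" if stable else "unstable"))
    return zeros


if __name__ == "__main__":
    for x, kind in find_zeros(characteristic, 0.0, 3.0):
        print(f"fixed point near x = {x:.4f} ({kind})")
```

With these assumed parameters the characteristic has three zeros, two stable branches separated by an unstable one, which is the bistable pattern described above for the subgraph spanned by v1 and v2.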

4. CONCLUSION We have proposed a method to identify feedback circuits in a regulatory network that can give rise to multiple fixed points of the system. This method works on the topology of the interaction graph and calculates fixed point coordinates by first regarding a circuit-free system and then closing the circuits one after another. The circuit characteristics constructed in each of these steps can be used to investigate the behavior that can be caused by the set of circuits closed in that step. We have applied our approach to a regulatory network model for calcium oscillations that consists of four variables and seven interrelated feedback circuits. With our analysis we could identify the subgraph that can cause bistability in the system. Realistically, the method is practicable only for networks of at most medium size, with 5–10 components, depending on the equations and the network topology. Since the number of circuit closing steps and the dimension of the implicit equations that have to be solved are both determined by the number of variables that are set to fixed values, the computational effort is mainly determined by the cardinality of this set Ṽ. If this is larger than two, advanced visualization methods for the analysis of the partial circuit characteristics are needed. However, subnetwork identification can still be done, for example by applying the CBA several times and changing the order in which the vertices are released, as we have demonstrated in our example. The characteristics constructed in the CBA can also be used to find bifurcation points and parameter values for which the system does have multiple fixed points, as demonstrated, for example, in [5]. In the future we will apply our analysis method to other network models as well. We will also investigate further properties of the circuit characteristics, besides their zeros, that are related to the dynamic behavior of the system.

5. REFERENCES
[1] J.-L. Gouzé, “Positive and negative circuits in dynamical systems,” J. Biol. Syst., vol. 6, no. 21, pp. 11–15, 1998.
[2] R. Thomas, “Laws for the dynamics of regulatory networks,” J. Dev. Biol., vol. 42, pp. 479–485, 1998.
[3] C. Trané and E. Jacobsen, “Network structure and robustness of intracellular oscillators,” in Proc. of the 17th World Congress of the Int Federation of Autom Ctrl, 2008, pp. 10989–10994.
[4] H. Schmidt and E. Jacobsen, “Identifying feedback mechanisms behind complex cell behavior,” IEEE Contr Syst, vol. 24, no. 4, pp. 91–102, 2004.
[5] N. Radde, “Fixed point characterization of differential equations with complex graph topology,” Bioinf., vol. 26, no. 22, pp. 2874–80, 2010.
[6] N. Radde, “Analyzing fixed points of intracellular regulation networks with complex graph topology,” 2012, BMC Syst Biol, accepted.
[7] U. Kummer, L. Olsen, C. Dixon, A. Green, E. Bornberg-Bauer, and G. Baier, “Switching from simple to complex oscillations in calcium signaling,” Biophys. J., vol. 79, pp. 1188–1195, 2000.
[8] M. Peifer and J. Timmer, “Parameter estimation in ordinary differential equations for biochemical processes using the method of multiple shooting,” IET Syst. Biol., vol. 1, no. 2, pp. 78–88, 2007.


IN VIVO KINETICS OF ASYMMETRIC DISPOSAL OF INDIVIDUAL PROTEIN AGGREGATES IN E. COLI, ONE AGGREGATE AT A TIME

Andre S. Ribeiro1,*, Jason Lloyd-Price1, Antti Häkkinen1, Meenakshisundaram Kandhavelu1, Ines J. Marques1, Sharif Chowdhury1, Eero Lihavainen1, and Olli Yli-Harja1,2

1 Laboratory of Biosystem Dynamics, Computational Systems Biology Research Group, Department of Signal Processing, Tampere University of Technology, FI-33101 Tampere, Finland
2 Institute for Systems Biology, 1441N 34th St, Seattle, WA, 98103-8904, USA.

ABSTRACT
Evidence suggests that Escherichia coli employs a selective strategy at division, asymmetrically segregating unwanted substances in the older pole. We investigated the kinetics of this process in vivo by tracking RNAs tagged with MS2-GFP fluorescent molecules. These complexes appear at midcell and reach a pole before another is formed, usually remaining there. The choice of pole is found to be probabilistic and biased towards one pole. There is an accurate classification for disposal since, although the RNA-MS2-GFP complexes are disposed of, the MS2-GFP tagging molecules alone are not. Such division asymmetries can be used to define a parent cell, which inherits most unwanted substances, and a daughter cell, with better survival chances. Notably, this strategy is either imperfect or purposefully tempered.

1. INTRODUCTION
A bacterium, in suitable environments, appears to be functionally immortal as it perpetuates itself by dividing into cells with a genotype identical to that of the mother cell, aside from mutations. Recent evidence suggests that this strategy may have costs and requires mechanisms to deal with them. Such costs include the accumulation of unwanted substances and degradation of internal structures. One mechanism that may be used to cope with these problems is the deliberate asymmetry of the partitioning process in cell division (1). The underlying mechanisms causing the asymmetries (and their kinetics) remain unknown. The functional asymmetries between sister cells in Escherichia coli have been associated with aging (1), since unwanted aggregates tend to concentrate at the older pole of the mother cell, causing cumulatively slower growth of the daughter cells receiving the aggregates. This pattern of segregation of unwanted substances appears to occur by a common mechanism irrespective of the unwanted substance segregated (2). To better understand the disposal mechanism we need to observe it in action. A recent technique allows tracking RNA molecules in E. coli by tagging them with fluorescent proteins. It uses the MS2 protein fused to green fluorescent protein (MS2-GFP) and a reporter construct with MS2 binding sites (3)(4). RNA-MS2-GFP complexes can be tracked for hours in live cells by fluorescence microscopy, allowing the study of their spatial kinetics. The binding of MS2-GFP to the target RNA is fast, making the RNAs visible once produced (3). Each RNA-MS2-GFP spot generally comprises one RNA and approximately 70 MS2-GFP (3). Here, we study the spatial kinetics of RNA-MS2-GFP complexes, determine where novel complexes appear in the cells, and track their positions over time. We discuss the classification and disposal of these complexes.

2. MATERIALS AND METHODS

The method of RNA detection and quantification (3)(4) exploits the ability of bacteriophage MS2 coat protein to bind specific RNA sequences. High resolution detection of single RNA transcripts with 96 tandem repeats of the MS2 binding sites uses dimeric MS2 fused to GFP (MS2-GFP) as a detection tag. The method uses two genetic constructs. One is a medium-copy vector expressing the MS2-GFP fused protein, whose promoter (PtetO) is regulated by tetracycline. The other is a single copy F-based vector, with a Plac/ara-1 promoter controlling the production of the transcript target, mRFP1 followed by a 96 MS2 binding site array. Both constructs were generously provided by I. Golding (University of Illinois). Cells with MS2-GFP and transcript target plasmids were grown overnight at 37 °C in LB and appropriate antibiotics. The next day, cells were diluted in fresh medium plus antibiotics. To induce expression of MS2-GFP, 100 ng/mL anhydrotetracycline (IBA GmbH, Germany) was added to the diluted bacterial culture. Expression of the target RNA was induced by IPTG (2 mM, Fermentas, Finland) and L-arabinose (13.4 mM, Sigma-Aldrich, Germany). Cells were incubated with inducers at 37 °C for 1 h with shaking to an optical density (600 nm) of ~0.4. Afterwards, 8 µL of culture was placed on a microscope slide between a cover slip and a 0.8% LB-agarose gel pad. Cells were visualized with a Nikon Eclipse (TE-2000-U, Nikon, Japan) inverted confocal laser-scanning microscope with a 100X (1.49NA) objective. GFP fluorescence was measured by a 488 nm laser (Radius 405 laser, Coherent, Inc., CA) and a 515/30 nm detection filter (100–110 gain). Images were taken by a Nikon Digital Sight camera, and acquired with Nikon EZ-C1 FreeViewer software 3.30 (Nikon Corp). Images of cell populations were taken ~1 h after induction. For time series, images were taken starting ~7 min after induction, once per minute, for ~2 h. On the slide, cell division time is ~90 minutes.

We detect cells from images by first dividing images into background, border and cell regions (5). We apply an iterative cell segmentation based on size and edge information. Clumped cells are discarded by a threshold based on cell size. We segment spots with kernel density estimation (Gaussian kernel) and Otsu thresholding (6). The number of complexes in a spot is given by the spot intensity distribution slicing approach (3). We use |ΔN|, the absolute difference between the number of RNA-MS2-GFP complexes in each pole of a cell, as a measure of the bias in pole partitioning. Assuming that the partitioning between the poles follows a biased binomial distribution with a given p, the expected |ΔN| for a given number (N) of complexes in a cell is:

E[|\Delta N|](N, p) = \sum_{k=0}^{N} |2k - N| \binom{N}{k} p^{k} (1 - p)^{N-k}

3. RESULTS
To analyze the production and spatial kinetics of RNA-MS2-GFP complexes, we measured their number and location over time in individual cells (Figure 1).

Figure 1. Sequence of images of a cell with RNA-MS2-GFP complexes. Arrows point to a spot as it is formed at midcell and moves to a pole.

We studied where RNA-MS2-GFP complexes first appear. Since complexes are visible shortly after the target RNA is transcribed (3), the location where they are first seen should be near the F-plasmid. In E. coli, F-plasmids preferentially reside near the center of the cell, migrating to the one-quarter and three-quarter positions along the cell’s major axis following duplication (7). Consequently, the RNA molecules should be produced and first appear near the center or at the one-quarter or three-quarter positions. We recorded the position along the major axis where each complex is first detected. Results in Figure 2 show that spots are first observed at these positions. Most spots appear close to the center and a few appear at the one-quarter and three-quarter positions.

Figure 2. Spatial distribution of RNA-MS2-GFP complexes in cells when first observed.

We determined how many RNA-MS2-GFP complexes are in each spot from the intensity of fluorescence when they first appeared. All spots observed contained one target RNA. A previous assessment (3) also reported that the complexes at midcell contain only one RNA.

Figure 3. Spatial distribution of RNA-MS2-GFP complexes in a cell population.


We tracked the movement of RNA-MS2-GFP complexes. All but 5% travelled independently to one pole during the observation period. After a complex reaches a pole, it tends to remain there, in agreement with previous observations (3). Even when only one tagged RNA was present, it travelled to and remained at a pole most of the time, implying that this dynamic is not due to interactions between RNA-MS2-GFP complexes. To verify that RNA-MS2-GFP complexes preferentially locate at the poles, we imaged 762 cells, 1 h after induction of the target gene. We tested different concentrations of inducers to examine whether the spatial distribution would differ with mean expression. No differences were found, and we thus merged the results of these tests. Figure 3 shows the distribution of the distances of all individual RNA-MS2-GFP complexes to the center of the cells, normalized by half of the cell length. The distribution is bimodal with a strong peak centered at 0.8, showing that most spots are located at the poles, and a small “peak” at the midcell region, likely due to the RNA-MS2-GFP complexes visible while the target RNA is being transcribed and anchored to the plasmid (3). This distribution is strikingly similar to the one reported in (8) for the spatial distribution of a tagged chaperone (IbpA) involved in aggregate processing. From time series measurements, we registered the time between consecutive productions of complexes, and recorded the pole to which each complex went. The time to reach a pole is much smaller than the mean interval between production events, indicating that newly created RNA-MS2-GFP complexes reach a pole prior to the transcription of the next target RNA. Finally, while the choice of pole that the RNA-MS2-GFP complexes travel to in each cell is probabilistic, it is strongly biased towards one of the poles of the cell.

Next, we quantified the bias in the choice of pole. Figure 4 shows the progression in time of the mean absolute difference between the numbers of RNA-MS2-GFP complexes in each pole of each cell (|ΔN|), taken over all cells for each time point (50 cells observed, each of which produced at least 1 RNA-MS2-GFP complex). For comparison, we show how this quantity would vary over time, assuming unbiased binomial partitioning of complexes by the poles (p = 0.5), a totally biased partitioning (p = 1.0), and a partitioning that follows a binomial distribution with a bias of p = 0.85 towards one of the poles, which was found to fit the measurements well. In all estimations, we assume the same total number of RNA-MS2-GFP complexes as those detected in the measurements at each moment in time. We conclude that RNA-MS2-GFP complexes are partitioned in a biased fashion, with a bias close to 0.85, which appears to be consistent over time. In Figure 5, we show the mean difference in RNA-MS2-GFP complexes between poles in a population of 367 cells, 60 minutes after induction, as a function of the number of RNA-MS2-GFP complexes in each cell (N). For comparison, we show the expected value of |ΔN| as a function of N, assuming binomial partitioning of RNA-MS2-GFP complexes, with a bias of 0.5, 0.85 and 1.0 (see Methods section). The results are in clear agreement with the temporal measurements since, again, a good fit is obtained with a highly biased binomial pole-to-pole partitioning (p = 0.85). The results also show that the bias is independent of N, the number of RNA-MS2-GFP complexes in a cell, as would be expected if the bias results from asymmetric segregation of aggregates.

Figure 4. Temporal evolution of the pole-to-pole bias in number of RNA-MS2-GFP complexes. The observations are from three measurements, in which the cells were subject to different induction levels. No significant difference was found in the mean bias in the three sets. Due to that, we merged the results from the measurements. In conclusion, we find that RNA-MS2-GFP complexes are partitioned in a biased fashion with a bias that is consistent both over time and with respect to the number of previously segregated complexes.
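The reference curves for p = 0.5, 0.85 and 1.0 follow from the expected absolute pole-to-pole difference under biased binomial partitioning defined in the Methods. A minimal sketch of that computation is given below; the function name is hypothetical, and the code only evaluates the binomial expectation, not the authors' fitting procedure.

```python
from math import comb


def expected_abs_pole_difference(n, p):
    """Expected |dN| when each of n complexes goes to one pole with probability p.

    With k complexes at one pole and n - k at the other, |dN| = |2k - n|,
    and k follows a binomial distribution with parameters (n, p).
    """
    return sum(
        abs(2 * k - n) * comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
    )


if __name__ == "__main__":
    # Compare the unbiased, fitted and fully biased cases for a few cell loads.
    for n in (1, 2, 5, 10):
        row = [round(expected_abs_pole_difference(n, p), 3) for p in (0.5, 0.85, 1.0)]
        print(n, row)
```

For a single complex the expectation is 1 regardless of p, so the bias only becomes distinguishable in cells with two or more complexes.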


Figure 5. Cellular pole-to-pole bias in number of RNA-MS2-GFP complexes versus total number of complexes. Error bars show standard deviation.

It is possible, given the cell’s long division time, that the segregation of the F-plasmids to the first and third quarter following duplication could affect the results. We tested for a correlation between the side of the cell at which an RNA-MS2-GFP complex first appears (when it appears at a quarter region) and the pole to which it travels. We tracked 17 RNA-MS2-GFP complexes that first appear at the first or third quarter positions and registered the pole to which they travelled. The Pearson correlation coefficient between the initial and the final normalized locations along the cell’s major axis is -0.1, indicating no strong correlation between where complexes appear and the pole they travel to. It is also important to determine the kinetics of the MS2-GFP tagging molecules alone. In (3) it was shown that MS2-GFP distributes itself homogeneously in the cell cytoplasm. We confirmed this result. From this, we conclude that only the RNA-MS2-GFP complexes are disposed of by the cells, while the tagging molecules alone or the RNA alone are not. This suggests that this disposal mechanism is capable of accurately distinguishing protein aggregates from functional proteins.

4. CONCLUSION
In (8), the appearance and inheritance of spontaneous protein aggregates in lineages of E. coli in nonstressed conditions was followed using time-lapse microscopy and a fluorescently tagged chaperone (IbpA) involved in aggregate processing. These measurements revealed that these aggregates accumulated upon cell division in cells with older poles. The authors suggested the existence of an asymmetric strategy whereby dividing cells segregate damage at the expense of aging individuals. Examples exist of such mechanisms in other organisms as well (8). Since RNA-MS2-GFP complexes are not native to E. coli, it may be that the cell recognizes them as undesirable. We observed that the disposal mechanism acts on individual aggregates, and that the choice of pole is probabilistic and biased. Since the cells do not accumulate MS2-GFP molecules at the poles, even though these protein complexes are also not indigenous to the cells, it is possible to conclude that this mechanism is capable of targeting very specific protein complexes. Several questions remain unanswered. For example, the choice of pole is well fit by a binomial distribution with a bias of 0.85. It would be of interest to determine if this bias is similar in all cases, e.g. for all aggregates, or varies with the aggregate being segregated. Also, is the bias observed a demonstration of some inefficiency of the mechanism, a “safety precaution” against the absolute segregation of substances wrongly identified as harmful, or a compromise between energy costs and removal of harmful substances? Finally, what are the mechanisms targeting specific protein complexes and segregating them? Future studies are needed to address these questions, and may be relevant to understanding the adaptability of these organisms and how they cope with aging.

5. ACKNOWLEDGMENTS
Work supported by Academy of Finland and Finnish Funding Agency for Technology and Innovation.

6. REFERENCES
[1] Stewart, E.J., Madden, R., Paul, G., Taddei, F. 2005. Aging and Death in an Organism That Reproduces by Morphologically Symmetric Division. PLoS Bio. 3:295-300.
[2] Gordon, S., Rech, J., Lane, D., and Wright, A. 2004. Kinetics of plasmid segregation in Escherichia coli. Mol. Microbiol. 51:461-469.
[3] Golding, I., Paulsson, J., Zawilski, S. M., and Cox, E. C. 2005. Real-time kinetics of gene activity in individual bacteria. Cell 123:1025-1036.
[4] Fusco, D., Accornero, N., Lavoie, B., Shenoy, S. M., Blanchard, J. M., Singer, R. H. et al. 2003. Single mRNA molecules demonstrate probabilistic movement in living mammalian cells. Curr. Biol. 13:161-167.
[5] Chen, T. B., Lu, H., Lee, Y.-S., and Lan, H.-J. 2008. Segmentation of cDNA microarray images by kernel density estimation. J. Biomed. Inform. 41:1021-1027.
[6] Otsu, N. 1979. A threshold selection method from gray-level histograms. IEEE Trans. Sys. Man. Cyber. 9:62-66.
[7] Gordon, G. S., Sitnikov, D., Webb, C. D., Teleman, A., Straight, A., Losick, R. et al. 1997. Chromosome and low copy plasmid segregation in E. coli: visual evidence for distinct mechanisms. Cell 90:1113-1121.
[8] Lindner, A. B., Madden, R., Demarez, A., Stewart, E. J., and Taddei, F. 2008. Asymmetric segregation of protein aggregates is associated with cellular aging and rejuvenation. Proc. Natl. Acad. Sci. U.S.A. 105:3076-3081.


IMAGE ANALYSIS OF NUCLEAR ENVELOPE BREAKDOWN EVENTS USING KNIME

Thorsten Rieß1, Joseph Marino2, Cornelia Wandke2, Dorit Merhof1, Oliver Deussen1, Gábor Csúcs3, Ulrike Kutay2, Péter Horváth3

1 INCIDE, University of Konstanz, Germany
2 Institute of Biochemistry, ETH Zürich, Switzerland
3 Light Microscopy and Screening Centre, ETH Zürich, Switzerland

{thorsten.riess, dorit.merhof, oliver.deussen}@uni-konstanz.de, {jomarino, cwandke, csucsga, ulkutay, phorvath}@ethz.ch

ABSTRACT
Insights into the molecular mechanism underlying nuclear envelope breakdown (NEBD) are crucial to understand dynamic changes of the nuclear envelope (NE) that occur at mitotic entry of eukaryotic cells. In this paper, we present an image processing algorithm and its implementation in KNIME, which allows automatic quantification of image data obtained from the in vitro NEBD assay. The algorithm consists of image alignment via phase correlation, nucleus identification using adaptive thresholding, morphological operations, and intensity measurements. The results of these measurements are verified using images from a known nuclear envelope disassembly assay.

1. INTRODUCTION
Higher eukaryotic cells undergo an open mitosis which is characterized by the breakdown of the nuclear envelope (NE). The resulting loss of nucleo-cytoplasmic compartmentalization allows the formation of the mitotic spindle, which is essential for chromosome segregation during cell division [1]. In order to visualize morphological changes in the NE during nuclear disassembly at mitotic onset, we have previously established an in vitro nuclear disassembly assay [2]. In this assay semi-permeabilised HeLa cells expressing a fluorescently-tagged NE protein are incubated with mitotic HeLa cell extracts to induce NEBD. The gradual loss of the nuclear permeability barrier is followed by tracking fluorescent dextran influx into the nucleus using time-lapse confocal microscopy. At the same time, loss of the GFP-tagged NE marker from the nuclear rim is followed. This results in a time sequence of microscopic images showing the nuclei as the envelope breakdown occurs. Typically, the nuclei are clearly visible as dark regions in the first image of the sequence, and are then flooded with the fluorescent marker in the course of the experiment.

From the image processing point of view, the problem is challenging for two main reasons: it is very hard to visually detect the cell nuclei once the envelope has broken down, and the cells move during the experiment. The movement of the cells is due either to individual cell movement (which is, however, very small and thus negligible), or to the limited reproducibility of the motorized microscope stage location during multi-location image acquisition, which usually results in significant paraxial shifts. We propose an algorithm and its implementation using KNIME [3] which solves the above-mentioned problems and enables automated analysis of large data sets. The algorithm consists of the following steps: spatial image alignment, adaptive thresholding of the aligned image sequence, object identification, intensity measurements, and report generation. To our knowledge, no comparable approach has been published; alternative approaches using cell tracking instead of image alignment are almost certainly bound to fail due to the drastic change of the cell appearance in the course of the experiment. The paper is organized as follows: Section 2 describes the biological background, and the image processing pipeline is presented in Section 3. In Section 4 the implementation of the algorithm is presented, and in Section 5 results for some applications are shown. Finally, Section 6 gives a summary and outlook.

2. BIOLOGICAL BACKGROUND: NUCLEAR ENVELOPE BREAKDOWN
The genetic material of eukaryotic cells is enclosed by a double lipid bilayer termed NE. The NE consists of an outer and an inner nuclear membrane (ONM and INM, respectively) that are fused at sites of nuclear pores. The ONM is continuous with the endoplasmic reticulum (ER) and shares similar characteristics. In contrast, the INM contains a unique set of integral membrane proteins that are linked to chromatin and the nuclear lamina. ONM and INM are fused at numerous sites where nuclear pore complexes (NPCs) are inserted into the NE to allow for exchange of material between nucleus and cytoplasm. NPCs are composed of about 30 nucleoporins (nups) that together form assemblies of octagonal rotational symmetry mediating nucleo-cytoplasmic transport of macromolecules [1].

The molecular features of NPCs determine both the barrier and permeability characteristics of the NE. Open mitosis requires nuclear envelope breakdown (NEBD), which is accompanied by a gradual loss of the NE permeability barrier. Besides other mitotic kinases, the cyclin-dependent kinase 1 (CDK1) plays a key role in the nuclear disassembly process. Furthermore, the disintegration of the NE membrane involves the dispersal of nups into the cytoplasm, the depolymerisation of the lamina and the absorption of NE membranes and their INM proteins into the ER [1]. When chromosome segregation is completed, the NE reforms around the decondensing chromatin masses of each daughter cell, thereby re-establishing the NE permeability barrier [4]. The previously established in vitro NE disassembly assay has proven to be a powerful tool to decipher the molecular requirements of NEBD [2, 5]. However, image processing of the data is still not fully automated and is thus a time-consuming task. The KNIME-based application described here simplifies and accelerates the quantification of in vitro NEBD.

3. METHOD
The image processing algorithm consists of three steps. In the first step, the image sequence is aligned to account for the paraxial movement of the microscope. This registration is based on phase correlation [6] and it is crucial for the following steps. The paraxial nature of the alignment problem reduces the complexity to the detection of translation between consecutive frames. Let I_j and I_{j+1} denote two consecutive frames. For the phase correlation image of I_j and I_{j+1}, the Fourier transforms F{I_j} and F{I_{j+1}} are computed. Then the phase correlation image P is the inverse Fourier transform of the pixel-wise multiplication of F{I_j} and the complex conjugate F{I_{j+1}}^c. The position of the maximum value in the phase correlation image denotes the shift from frame I_j to I_{j+1} that minimizes the difference between the frames. Note that it is also possible to compute only the phase correlation image of the first frame versus the remaining frames, but due to the significant change of the nucleus appearance in the course of the experiment, the use of consecutive frames is more robust.

The second step is the segmentation of the cell nuclei in the first image of the sequence. At the beginning of the experiment, the cell nuclei are fully visible as dark areas (see Figure 2 upper right image), but after the NEBD event they cannot clearly be distinguished from the background. The segmentation in the first image is performed using the Otsu adaptive thresholding method [7]. Touching nuclei are separated with morphological operations; nuclei that cannot be separated are excluded by filtering out segments that are too large. To avoid measurement errors caused by individual cell movements, the cell masks are reduced to a disk of significantly smaller size.

Finally, in the third step, the location of the segmented nuclei is used to measure the mean intensity of each individual nucleus in each image of the registered sequence (note that nuclei that move too much or that have dextran influx already at the beginning of the experiment need to be excluded manually from the measurements). This intensity is divided by the mean intensity of the background to obtain an intensity ratio r (typically, r ∈ [0, 1]), where r = 0 means that the nuclear envelope is intact and no dextran has entered, and r = 1 means that the background intensity is identical to the intensity inside the nucleus. Nuclei whose intensity ratio reaches the threshold r = 0.3 are defined to be dextran positive (this threshold is used in [5]). Algorithm 1 shows the detailed image processing pipeline in pseudo-code.

Algorithm 1 The image processing pipeline.
Require: Image sequence of N images: I = {I_j}, j ∈ 𝒩, 𝒩 = {1, 2, ..., N}.
Ensure: Ratio functions for the M observed nuclei: r_k : 𝒩 → [0, 1], k = 1, ..., M.
  for j = 2 → N do
    P ← phase correlation image of I_{j−1} and I_j.
    (x̂_j, ŷ_j) ← argmax_{(x,y)} P(x, y).
    I_j ← shift I_j by Σ_{k=2}^{j} (x̂_k, ŷ_k).
  end for
  I ← cropped image sequence I.
  T ← thresholded binary image I_1.
  C ← connected components of T.
  C ← eroded components in C.
  M ← |C|. {C = {C_1, ..., C_M}}
  B ← eroded T^c {background}
  for j = 1 → N do
    {I|J means: image I restricted to the ROI J}
    b ← mean intensity of I_j|B.
    for k = 1 → M do
      i ← mean intensity of I_j|C_k.
      r_k(j) ← i/b.
    end for
  end for
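The alignment step described in Section 3 can be sketched with NumPy as follows. This is an illustrative sketch, not the KNIME node used by the authors: the function names are hypothetical, the cross-power spectrum is normalized to unit magnitude (the usual phase correlation variant, which is an assumption here), the conjugate is placed on the earlier frame so that the peak position directly gives the shift of the later frame, and `np.roll` wraps pixels around the border instead of cropping to the common region.

```python
import numpy as np


def phase_correlation_shift(prev, curr):
    """Return (dy, dx) such that curr is approximately prev shifted by (dy, dx)."""
    f_prev = np.fft.fft2(prev)
    f_curr = np.fft.fft2(curr)
    cross = f_curr * np.conj(f_prev)
    # Assumption: normalize to unit magnitude so only phase information remains.
    cross /= np.maximum(np.abs(cross), 1e-12)
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative (wrapped) shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, corr.shape))


def align_sequence(frames):
    """Align every frame to the first one by accumulating consecutive shifts."""
    aligned = [frames[0]]
    total = np.zeros(2, dtype=int)
    for prev, curr in zip(frames, frames[1:]):
        total += np.array(phase_correlation_shift(prev, curr))
        # Undo the accumulated shift; a real pipeline would crop instead of wrap.
        undo = tuple(int(v) for v in -total)
        aligned.append(np.roll(curr, undo, axis=(0, 1)))
    return aligned


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.random((64, 64))
    shifted = np.roll(base, (5, -3), axis=(0, 1))
    print(phase_correlation_shift(base, shifted))
```

Estimating shifts between consecutive frames and summing them mirrors the cumulative shift in Algorithm 1, which the text prefers because the nuclei change appearance too much for a direct comparison with the first frame.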

4. IMPLEMENTATION IN KNIME

KNIME [3] is a data mining and integration platform developed at the University of Konstanz. It is freely available for download at http://www.knime.org and runs on the major platforms. We choose KNIME as the main implementation framework because it is flexible, has an intuitive graphical user interface and has recently been enriched with a powerful image processing plug-in, which provides basic image processing algorithms that can be readily used. The concept of KNIME is that data is processed by a workflow consisting of nodes and connections between nodes. Each node can transform, input, output or visualize data. The intrinsic data structure that is passed from one node to another is a simple table structure, where each cell represents a value of a certain data type. The image processing plug-in is not part of the standard KNIME distribution, but can be downloaded at http://tech.knime.org. For the presented implementation KNIME version 2.5.1 was used.

Figure 1. (a) The main KNIME workflow. The segmentation of the nuclei and background as well as the intensity computation and the postprocessing are wrapped into sub-workflows to keep the main workflow as simple and clear as possible. (b) Sub-workflow for the segmentation of the nuclei and the background in the first image of the sequence. (c) Sub-workflow for the intensity measurements in each image of the sequence.

The image processing pipeline presented in Algorithm 1 can readily be modeled as a KNIME workflow using nodes from the image processing plug-in. After reading in the image sequence, the alignment node aligns and crops the images according to the first part of the algorithm. Subsequent nodes handle the segmentation of the cell nuclei and background in the first image of the sequence and the transformation of this segmentation using standard binary morphological operations. Finally, a loop iterates through all images in the sequence and the mean intensities of the segmented nuclei and the background are computed for each image. The result is a table holding the ratio of these intensities for each nucleus at each image in the sequence. Standard KNIME nodes can be used to either save the result table or to visualize the result as a line plot. Figure 1(a) shows the main workflow as a skeleton for the registration, segmentation and measurement steps. The actual segmentation and measurement steps are implemented as sub-workflows shown in Figure 1(b) and (c).
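The measurement loop described above can be illustrated outside KNIME with a few lines of Python. This is only a minimal sketch under stated assumptions, not the KNIME workflow itself: the function names `intensity_ratios` and `is_dextran_positive`, the nested-list image representation, and the integer label image (0 meaning "no nucleus") are hypothetical choices made for the example. The 0.3 threshold is the one quoted in the caption of Figure 2.

```python
from statistics import mean


def intensity_ratios(frames, nuclei_labels, background_mask):
    """Return one dict per frame mapping nucleus label -> intensity ratio r.

    frames          : list of registered images, each a 2-D list of numbers
    nuclei_labels   : 2-D list of ints from the first image (0 = no nucleus)
    background_mask : 2-D list of bools marking background pixels
    Assumes the mean background intensity is non-zero.
    """
    pixels = [(y, x) for y, row in enumerate(nuclei_labels) for x in range(len(row))]
    labels = sorted({nuclei_labels[y][x] for y, x in pixels} - {0})
    table = []
    for frame in frames:
        # Mean background intensity of this frame.
        background = mean(frame[y][x] for y, x in pixels if background_mask[y][x])
        row = {}
        for lab in labels:
            nucleus = mean(frame[y][x] for y, x in pixels if nuclei_labels[y][x] == lab)
            row[lab] = nucleus / background
        table.append(row)
    return table


def is_dextran_positive(ratio, threshold=0.3):
    """Classify a nucleus as dextran positive if its ratio exceeds the threshold."""
    return ratio > threshold
```

Each returned row corresponds to one line of the result table (one ratio per nucleus per image); a nucleus with r close to 0 is intact, while r approaching 1 indicates that dextran has entered.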

5. RESULTS

We applied the image processing pipeline described in Section 3 to real image data. In a typical experimental setup, the analyzed image sequences consist of 15 images taken every 4 minutes between 16 and 68 minutes after addition of mitotic cell extracts to semi-permeabilized HeLa cells stably expressing GFP-Nup58. These image sequences are acquired at 4 different locations (each showing 10-30 nuclei), and the whole experiment is repeated 3 times. The individual images usually have two channels, one for the fluorescent dextran and the other for the GFP-tagged nucleoporin. This results in 12 image sequences for each experimental condition. Due to slight shifts of the microscope alignment, the individual image sequences are usually shifted paraxially. (If the microscope delivers 3D image stacks, the best focal plane is manually selected for the computational analysis.) The timing of NEBD is compared in cells incubated with either mitotic extracts (ME) or with extracts supplemented with the CDK1 inhibitor alsterpaullone (ME + AP). Figure 2 (left) shows ME images at 16, 36 and 56 minutes.

Paraxial microscope stage shifts are detected using the presented phase-correlation algorithm, and images are cropped and aligned accordingly. In Figure 2 (left), red dashed rectangles show the detected common parts of the images. Cell nuclei are segmented on the first image of every sequence; the results are shown in Figure 2 (right). The intensity measurements are taken using a disk mask significantly smaller than the nuclear size, placed on the centroid of each segmented region. Figure 3 shows the intensity ratio measurements over time for the two experimental conditions, analyzing image series taken in 3 independent experiments. It can be observed that CDK1 inhibition by alsterpaullone strongly delays the time required for 50% of the nuclei to undergo NEBD.

Figure 2. The influx of TRITC-labelled 155 kDa dextran into nuclei is shown for three different time points, t = 16, 36 and 56 min (left), together with the segmented nuclei (right). The discs overlaid on the right show the area of the actual intensity measurement and are green for intact nuclei and red for nuclei classified as dextran positive (intensity ratio larger than 0.3).

Figure 3. Quantification of dextran positive nuclei over time (x axis: time in minutes, from 16 to 68; y axis: % dextran positive nuclei; series: GFP-Nup58 ME and GFP-Nup58 ME+AP). The purple line marks the time point at which 50% of the nuclei are dextran positive. Error bars represent the standard error of the mean.

6. DISCUSSION

In this paper we present an image processing pipeline for the automated image analysis of NEBD events. The implementation of the pipeline is based on the image processing plug-in of the data analysis framework KNIME, which allows for a very flexible prototype workflow and can readily be used by biologists. The resulting intensity ratios can either be saved and used externally, or they can be further processed or visualized using the tools already built into KNIME. Moreover, the workflow-based approach allows batch-processing of large amounts of experimental data, also in parallel. The presented tools allow the biologist to use the workflow as-is, but also allow tuning all parameters such that the workflow can be used for a wide range of experiments and does not require a computer science expert to adjust to different image data. The measurement results allow an accurate estimation of the nuclear breakdown events for individual nuclei, which is neither biased nor subjective as a manual evaluation would be.

7. REFERENCES

[1] S. Guttinger, E. Laurell, and U. Kutay, “Orchestrating nuclear envelope disassembly and reassembly during mitosis,” Nat Rev Mol Cell Biol, vol. 10, no. 3, pp. 178–91, 2009.

[2] P. Muhlhausser and U. Kutay, “An in vitro nuclear disassembly system reveals a role for the RanGTPase system and microtubule-dependent steps in nuclear envelope breakdown,” J Cell Biol, vol. 178, no. 4, pp. 595–610, 2007.

[3] M. Berthold, N. Cebron, F. Dill, T. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel, “KNIME: The Konstanz Information Miner,” in Proc. Data Analysis, Machine Learning and Applications, 2008, pp. 319–326.

[4] B. Burke and J. Ellenberg, “Remodelling the walls of the nucleus,” Nat Rev Mol Cell Biol, vol. 3, no. 7, pp. 487–97, 2002.

[5] E. Laurell, K. Beck, K. Krupina, G. Theerthagiri, B. Bodenmiller, P. Horvath, R. Aebersold, W. Antonin, and U. Kutay, “Phosphorylation of Nup98 by multiple kinases is crucial for NPC disassembly during mitotic entry,” Cell, vol. 144, no. 4, pp. 539–50, 2011.

[6] E. De Castro and C. Morandi, “Registration of translated and rotated images using finite Fourier transforms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 9, no. 5, pp. 700–703, 1987.

[7] N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.


MIXTURE MODEL CLUSTERING FOR PEAK FILTERING IN METABOLOMICS

Simon Rogers1, Rónán Daly2 and Rainer Breitling2,3

1 School of Computing Science, University of Glasgow, UK
2 Institute of Molecular, Cell and Systems Biology, University of Glasgow, UK
3 Groningen Bioinformatics Centre, University of Groningen, Netherlands
[email protected], [email protected], [email protected]

ABSTRACT

In recent years, the use of liquid chromatography coupled to mass spectrometry has enabled the high-throughput profiling of the metabolic composition of biological samples. However, the large amount of data obtained is often difficult to analyse. This paper focuses on a particular problem, that of detecting and potentially removing derivative peaks of a substance of interest. A mixture model for clustering peaks based on chromatographic peak shape correlation is presented, and comparison of this model to the behaviour of a leading mass spectrometry analysis tool is presented. Based on the results, the mixture model is shown to have better overall performance characteristics.

1. INTRODUCTION

Recent studies have shown that changes in an organism’s metabolome composition are more closely correlated with phenotypic variation than changes in either the transcriptome or the proteome [1], motivating the use of high-throughput metabolomic assays for a wide variety of biomedical applications. The most popular method for analysing the metabolome is mass spectrometry (MS) coupled to a separation phase resulting in data consisting of a set of chromatographic peaks characterised by their mass and retention time. Identifying the peaks in these spectra makes metabolomic analysis more challenging than, for example, microarray analysis, where the structure of the mRNA makes it possible to build a probe to which the molecule of interest will bind with high specificity. Whilst, in some cases, matching a measured mass to a database of known masses will result in a correct identification, sometimes several matches might be found within a window defined by the mass accuracy of the equipment [2]. The problem is made more complicated by the fact that an experimental sample containing a few dozen metabolites will result in the production of several hundred peaks.

The large number of additional peaks can come from various sources, including isotopes, adducts, molecular fragments, and multiply charged ions [3]. As some of these derivative peaks will have masses very similar to known metabolites, filtering them out is crucial to avoid the overwhelming number of false identifications that would be produced if they were all compared to a database of known masses.

Figure 1. A chromatogram, showing peak intensity as a function of retention time in the separation phase. Co-eluting peaks often have a similar shape; the peaks tightly clustered around 863 s (boxed) are derived from the same metabolite.

Fortunately, peaks that are derived from the same metabolite share characteristics that make it possible to group them together. In particular, they will elute from the separation phase at the same time (their retention times will be very similar) and their peaks will have similar shapes [3], as can be seen in Figure 1. Here, we introduce a mixture model for clustering peaks based on the correlation between their peak shapes. The remainder is organised as follows: In the next section we briefly introduce mzMatch, one of the most popular open-source metabolomics analysis pipelines, as this is the system against which we compare our proposed approach. In Section 3 we introduce our model and in Section 4 demonstrate its performance on some standard metabolomic datasets. Finally, in Section 5 we present a brief discussion and conclusions.

2. MZMATCH

mzMatch is a popular open-source metabolomic data analysis pipeline that enables researchers to perform end-to-end analysis of diverse biological datasets [4]. Whilst the complete toolkit performs many functions, from peak extraction from raw data files to metabolite identification, for the purposes of this paper we are focusing on the peak-derivative detection phase, also known as peak clustering. Currently, peak clustering in mzMatch is done via a simple greedy algorithm. The following algorithm describes mzMatch’s behaviour.

Figure 2. Example of the generative densities $p(q_{nm} \mid r_{nm} = 1, \epsilon_{nm} = 1)$ and $p(q_{nm} \mid r_{nm} = 1, \epsilon_{nm} = 0)$ (assuming the value is observed, $r_{nm} = 1$), where $\lambda = 4$, $\mu = 0$, $\sigma^2 = 0.3$.

Algorithm 1 mzMatch clustering algorithm
while there are unclustered peaks do
  • Find the most intense unclustered peak.
  • Create a new cluster based on this peak.
  • Find other unclustered peaks whose Pearson correlation over retention time with this peak is greater than a pre-defined threshold (0.75) and add these to the cluster.
end while
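Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reading of the pseudocode only, not mzMatch's actual implementation: the function name `greedy_cluster` and its inputs (a list of peak intensities and a precomputed matrix of pairwise Pearson correlations) are assumptions made for the example.

```python
def greedy_cluster(intensities, corr, threshold=0.75):
    """Greedy peak clustering in the spirit of Algorithm 1.

    intensities : list of peak intensities, one per peak
    corr        : N x N list of lists of Pearson correlations between peak shapes
    Returns a list of clusters; each cluster is a list of peak indices
    whose first element is the base (most intense) peak.
    Ties in intensity are broken arbitrarily.
    """
    unclustered = set(range(len(intensities)))
    clusters = []
    while unclustered:
        # Find the most intense unclustered peak and start a cluster from it.
        base = max(unclustered, key=lambda i: intensities[i])
        unclustered.remove(base)
        # Only correlations with the base peak are consulted.
        members = [i for i in sorted(unclustered) if corr[base][i] > threshold]
        unclustered.difference_update(members)
        clusters.append([base] + members)
    return clusters
```

Note that the membership test only ever reads row `corr[base]`, so a peak that correlates strongly with a cluster member but weakly with the base peak is left for a later cluster.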

There are several potential issues with this approach. Firstly, the algorithm is greedy: whilst it is likely that the true metabolite peaks will have high intensity [3], repeatedly picking the most intense remaining peak and building a cluster around it results in an algorithm that could be highly sensitive to small changes in intensity values. Secondly, correlation is only computed between the initial peak and others. Most inter-peak correlation values are not used to produce the clustering, so information is being thrown away.

3. MIXTURE MODEL FOR CORRELATIONS

Our observations consist of a symmetric $N \times N$ matrix $Q$ of peak shape correlations (Pearson) between the $N$ observed peaks, the $(n, m)$th element of which (the correlation between peaks $n$ and $m$) is denoted by $q_{nm}$. Note that it would be possible to start from the actual peaks themselves rather than correlation values, and this is an interesting avenue for future work. Our aim is to partition the peaks into $K$ clusters, each of which will potentially correspond to one metabolite and its derivatives. We use a set of binary indicator variables $z_{nk}$ to indicate cluster membership, i.e. $z_{nk} = 1$ if peak $n$ is assigned to cluster $k$. Collectively, these indicators are denoted as $Z$, and our task is therefore to infer $Z$ from $Q$. It will be convenient to define a second set of indicators, $\epsilon_{nm} = \sum_{k=1}^{K} z_{nk} z_{mk}$, i.e. $\epsilon_{nm} = 1$ if peaks $n$ and $m$ are in the same cluster, and zero otherwise. We assume that the correlation values were generated by a mixture model with two components, one describing correlations between peaks within the same cluster, and one describing correlations between peaks in different clusters. Inspection of typical outputs from mzMatch [4] led us to choose an exponential-type distribution for the former, and a Gaussian for the latter. In addition, to allow the correlation matrices to be produced at low computational cost, peak correlations are only calculated for peaks that co-elute within a particular retention time window. Hence, we additionally observe a binary $N \times N$ matrix $R$, the elements of which, $r_{nm}$, equal 1 if the correlation between peaks $n$ and $m$ is observed and 0 otherwise. We therefore assume the following generative distributions:

$$p(q_{nm}, r_{nm} = 1 \mid \epsilon_{nm} = 1) = (1 - p_1)\, \lambda e^{-\lambda (1 - q_{nm})} \tag{1}$$

$$p(q_{nm}, r_{nm} = 0 \mid \epsilon_{nm} = 1) = p_1\, \delta(q_{nm}) \tag{2}$$

$$p(q_{nm}, r_{nm} = 1 \mid \epsilon_{nm} = 0) = (1 - p_0)\, \mathcal{N}(q_{nm} \mid \mu, \sigma^2) \tag{3}$$

$$p(q_{nm}, r_{nm} = 0 \mid \epsilon_{nm} = 0) = p_0\, \delta(q_{nm}), \tag{4}$$

where conditioning on $p_0$, $p_1$, $\lambda$, $\mu$, $\sigma^2$ is omitted. An example is given in Figure 2. The density of $q_{nm}$ conditioned on a particular value of $r_{nm}$, e.g.

$$p(q_{nm} \mid r_{nm}, \epsilon_{nm} = 1) = r_{nm}\, \lambda e^{-\lambda (1 - q_{nm})} + (1 - r_{nm})\, \delta(q_{nm}),$$

is similar to the ‘spike-and-slab’ distributions used in DNA microarray analysis (see, e.g., [5]). The likelihood of the complete set of observed correlations is given by:

$$L(Q, R \mid Z) = \prod_{n=1}^{N-1} \prod_{m=n+1}^{N} p(q_{nm}, r_{nm} \mid \epsilon_{nm} = 1)^{\epsilon_{nm}} \times p(q_{nm}, r_{nm} \mid \epsilon_{nm} = 0)^{1 - \epsilon_{nm}},$$

where the various components in the product are given by Equations 1 to 4. To complete our model specification, we require a prior density over the membership of cluster $k$, i.e. $p(z_{nk} = 1)$. To avoid having to a priori specify the number of components, we use a Dirichlet Process (DP) prior, with concentration parameter $\alpha$ (see, e.g., [6] for more details).

3.1. Inference

The distribution of interest here is the posterior density over cluster assignments, $p(Z \mid Q, R)$. Analytical inference is not tractable, but it is possible to generate samples using a Gibbs sampling scheme. At each stage, we re-sample the assignment for one peak conditioned on the current assignments of all other peaks. Using $Z^{-n}$ to denote all of the assignments except that for peak $n$, the conditional distributions required are given by:

$$P(z_{nk} = 1 \mid \ldots) \propto s_k\, L(Q, R \mid Z^{-n}, z_{nk} = 1),$$

for the current cluster $k$ that has $s_k$ members (not including peak $n$), and:

$$P(z_{nk^*} = 1 \mid \ldots) \propto \alpha\, L(Q, R \mid Z^{-n}, \epsilon_{nm} = 0 \;\forall m),$$

for a new cluster, $k^*$, where $\epsilon_{nm} = 0 \;\forall m$ describes the fact that $n$ is in its own cluster and hence not in the same cluster as any other peak. As we are interested in a single set of assignments to compare against the clustering produced by mzMatch, we typically run the sampler for a fixed number of iterations and keep the sample with the highest posterior value.

4. TESTING AND RESULTS

In order to compare the clusterings produced by the mixture model against those produced by mzMatch, both algorithms were run against the data files produced from a set of LC-MS experiments. In this case, the data are based on standard samples used to calibrate chromatographic columns. Each sample consisted of a mixture of known compounds, with known mass and expected retention time. Three samples were used, comprising 104, 96 and 40 compounds respectively, with runs in both positive and negative mode, for a total of six data sets. The mixture model parameters were set at $\mu = 0$, $\sigma^2 = 0.4$, $\lambda = 8$, $p_1 = 0.001$, $p_0 = 0.97$, $\alpha = 1$. This is our initial choice, and it is unlikely that these will be optimal. Note also that the Gaussian density will provide some probability mass for values outside the feasible correlation range. Optimising the form and parameters of these densities based on the known constituents of standard samples is ongoing work. After the clusterings were produced, each of the peaks in each of the clusters was labelled as a base peak or a derivative of a base peak, using the algorithm given by mzMatch (the base peak is defined as the most intense peak in a mass spectrogram or in a cluster). An identification was also attempted on each peak, using the known masses of the compounds in the sample and the mzMatch identification algorithm. To assess the performance characteristics of the clusterings, each peak was compared against the known masses of the compounds and assigned a label. The labels were:

True Positive (TP) If the peak is a base peak and identified as a compound that is known to be present in the sample.

False Positive (FP) If the peak is a base peak and does not correspond to an expected compound.

True Negative (TN) If the peak is not a base peak and not identified as an expected compound.

False Negative (FN) If the peak is not a base peak, but matches an expected compound.

The following statistics were calculated:

True Positive Rate: $TPR = \frac{TP}{TP + FN}$.

False Positive Rate: $FPR = \frac{FP}{FP + TN}$.

Balanced Accuracy: $BA = \frac{0.5 \cdot TP}{TP + FN} + \frac{0.5 \cdot TN}{TN + FP}$.

Table 1. Comparison of mixture model to mzMatch

                                       mzMatch                            Mixture Model
Sample      Ionisation Mode  # Peaks   TPR    FPR    BA     # Clusters    TPR    FPR    BA     # Clusters
Standard 1  Negative         3664      0.548  0.211  0.668  799           0.466  0.060  0.703  250
Standard 1  Positive         6291      0.583  0.220  0.682  1403          0.444  0.061  0.692  409
Standard 2  Negative         2386      0.571  0.247  0.662  621           0.446  0.071  0.688  206
Standard 2  Positive         6883      0.597  0.340  0.628  2370          0.5    0.076  0.711  565
Standard 3  Negative         999       0.632  0.418  0.607  421           0.474  0.212  0.631  216
Standard 3  Positive         2316      0.632  0.497  0.567  1151          0.474  0.187  0.643  440

The results of the experiments are summarised in Table 1. The first thing to notice is that mzMatch produces many more clusters than the mixture model. This gives rise to higher TPR (if every peak were assigned to its own cluster, we would observe a TPR of 1). The smaller number of clusters given by the mixture model causes a considerable improvement in the FPR. This improvement outweighs the lower TPR, as can be seen by the mixture model having a higher BA in all samples. These results suggest that the mixture model represents a very promising alternative to the clustering currently used by mzMatch. In addition to these results, Figure 3 shows the correlation matrix $Q$ for Standard 3 in Negative mode unordered (a) and then ordered according to the mzMatch clustering (b) and the mixture model clustering (c). The larger clusters produced by the mixture model are clearly visible, as is the marked reduction in high correlation values between peaks in different clusters (white colouring off the diagonal).

5. DISCUSSION AND CONCLUSION

The results presented in the previous section suggest that the mixture model approach can cluster, and subsequently filter, peaks more effectively than the algorithm used in mzMatch. Whilst mzMatch finds more compounds, it does so as the result of degenerate behaviour when clustering peaks

together, and hence produces many spurious identifications. This behaviour can be understood as a consequence of the greedy behaviour and fixed threshold of mzMatch; by committing early to a particular clustering, and by not being able to update clusters in the light of new information, peaks that might naturally be clustered with others are left out if they do not correlate to the base peak. With no other peaks to correlate to (they are all in the original cluster) they are left by themselves in singleton clusters.

Figure 3. Various orderings of the correlation matrix for Standard 3 Negative data: (a) unordered Q matrix; (b) Q ordered by mzMatch assignments (clusters ordered by size); (c) Q ordered by mixture model assignments (clusters ordered by size). White corresponds to high positive correlation, black to high negative correlation. Unobserved values are given the value zero and are shown as grey.

The clustering produced by the mixture model is still far from perfect. One way of further improving the filtering process is through the incorporation of additional information into the clustering process. For example, it is straightforward to extend the model to handle technical and/or biological replicates by assuming a single clustering across all replicates and taking the product of the likelihoods of the observed data in each replicate conditioned on the assignments. Alternatively, mass is currently not used in the clustering at all. Many peaks deriving from the same metabolite should have explainable mass differences; development of a likelihood function that can take this information into account is a promising avenue for future work [7].

Taking a longer view, peak filtering is just one step in the complete analysis pipeline. Often, one will be interested in identifying the metabolites and performing differential analysis across time, or different experimental conditions. Rather than extracting a single clustering from the Gibbs sampler, one could combine this filtering model with other probabilistic models that assign peaks to metabolites [7] or perform differential analysis [8] and thus propagate any uncertainty in the clustering stage through the analysis in a manner similar to that done for microarray data in [9].
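The replicate extension mentioned above amounts to summing per-replicate log-likelihoods under one shared set of assignments. The following Python sketch illustrates this under explicit assumptions: the rate of the exponential-type component is called `lam` (a name chosen here), the delta terms of Equations 2 and 4 are treated as point masses contributing the factors p1 and p0 for unobserved pairs, and the default parameter values are those quoted in Section 4. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def log_pair_density(q, observed, same_cluster,
                     lam=8.0, mu=0.0, sigma2=0.4, p1=0.001, p0=0.97):
    """Log of the pairwise terms of Equations 1 to 4 for one pair of peaks."""
    if same_cluster:
        if not observed:
            return math.log(p1)                      # Equation 2 (point mass)
        return math.log(1 - p1) + math.log(lam) - lam * (1 - q)   # Equation 1
    if not observed:
        return math.log(p0)                          # Equation 4 (point mass)
    # Equation 3: Gaussian component for peaks in different clusters.
    return (math.log(1 - p0)
            - 0.5 * math.log(2 * math.pi * sigma2)
            - (q - mu) ** 2 / (2 * sigma2))


def log_likelihood(Q, R, z):
    """log L(Q, R | Z): sum over all pairs n < m; z[n] is the cluster of peak n."""
    n_peaks = len(z)
    total = 0.0
    for n in range(n_peaks - 1):
        for m in range(n + 1, n_peaks):
            total += log_pair_density(Q[n][m], bool(R[n][m]), z[n] == z[m])
    return total


def replicate_log_likelihood(replicates, z):
    """Single clustering z shared by all replicates, each given as a (Q, R) pair."""
    return sum(log_likelihood(Q, R, z) for Q, R in replicates)
```

Working in log space turns the product over replicates into a sum, which also avoids numerical underflow when the number of peak pairs is large.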

6. ACKNOWLEDGEMENTS

Many thanks to Andris Jankevics and Stefan Weidt for very useful discussions and for supplying the data.

7. REFERENCES

[1] J. Fu, J. J. B. Keurentjes, H. Bouwmeester, T. America, F. W. A. Verstappen, J. L. Ward, M. H. Beale, R. C. H. de Vos, M. Dijkstra, R. A. Scheltema, F. Johannes, M. Koornneef, D. Vreugdenhil, R. Breitling, and R. C. Jansen, “System-wide molecular evidence for phenotypic buffering in Arabidopsis,” Nature Genetics, vol. 41, pp. 166–167, 2009.

[2] T. Kind and O. Fiehn, “Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm,” BMC Bioinformatics, vol. 7, pp. 234, 2006.

[3] R. A. Scheltema, S. Decuypere, J.-C. Dujardin, D. G. Watson, R. C. Jansen, and R. Breitling, “A simple data reduction method for high resolution LC-MS data in metabolomics,” Bioanalysis, vol. 1, no. 9, pp. 1551–1557, 2009.

[4] R. A. Scheltema, A. Jankevics, R. C. Jansen, M. A. Swertz, and R. Breitling, “PeakML/mzMatch: A file format, Java library, R library, and tool-chain for mass spectrometry data analysis,” Analytical Chemistry, vol. 83, no. 7, pp. 2786–2793, 2011.

[5] C. Carvalho, J. Chang, J. Lucas, J. Nevins, W. Wang, and M. West, “High-dimensional sparse factor modelling: Applications in gene expression genomics,” J Am Stat Assoc, vol. 103, no. 484, pp. 1438–1456, 2008.

[6] C. E. Rasmussen, “The infinite Gaussian mixture model,” in Advances in Neural Information Processing Systems 12. 2000, pp. 554–560, MIT Press.

[7] S. Rogers, R. A. Scheltema, M. Girolami, and R. Breitling, “Probabilistic assignment of formulas to mass peaks in metabolomics experiments,” Bioinformatics, vol. 25, no. 4, pp. 512–518, 2009.

[8] I. Huopaniemi, T. Suvitaival, J. Nikkila, M. Oresic, and S. Kaski, “Two-way analysis of high-dimensional collinear data,” Data Mining and Knowledge Discovery, vol. 19, no. 2, pp. 261–276, 2009.

[9] R. Pearson, X. Liu, G. Sanguinetti, M. Milo, N. Lawrence, and N. Rattray, “PUMA: A Bioconductor package for propagating uncertainty in microarray analysis,” BMC Bioinformatics, vol. 10, 2009.

QUANTIFYING THE RELATIONSHIP BETWEEN THE STRUCTURE AND THE DYNAMICS OF RANDOM BOOLEAN NETWORKS FROM TIME SERIES DATA

Septimia Sarbu, Juha Kesseli, Matti Nykter

Department of Signal Processing, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland
[email protected], [email protected], [email protected]

ABSTRACT

The normalized compression distance (NCD) is used to quantify the dynamical behaviour of random Boolean network models, using time series measurements of the state of the network. The system can only be measured in a limited number of time points sampled as time series. This is commonly the case for biological networks. The experimental measure of system dynamics we derive here from simulated data will be applied to real-world biological data from various dynamical systems. In addition, we construct a method of inferring the two parameters of the Boolean network model that define its structure and its dynamics.

1. BACKGROUND

Random Boolean networks [1] represent the simplest model for gene regulatory networks [2], [3]. They are characterized by three parameters: the number of nodes, N, the number of input nodes to each node, K, and the function bias, p. In our case, the in-degree K is fixed and both the input and output connections are random. The function bias p is defined as the probability that the output of each node is either 0 or 1. The parameter K and the type of connections define the structure of the network and the parameter p defines its dynamical behaviour. The Derrida plot [4] is a method of determining the dynamical regime of a discrete network. The normalized compression distance (NCD) [5], [6] is used to create the Derrida plot [7] instead of the Hamming distance because the NCD is universal and can be applied to any type of objects to assess their similarity. The Hamming distance can only be applied to strings. The NCD is computed as

$$NCD = \frac{C_{xy} - \min(C_x, C_y)}{\max(C_x, C_y)}. \tag{1}$$

$C_x$ and $C_y$ represent the compressed sizes of the objects x and y, respectively. $C_{xy}$ represents the compressed size of the concatenation of x and y. The range of the NCD is between 0 and 1 and it reveals how much information two objects share, based on their compressed sizes. Lossless real-world compressors, such as gzip, bzip2, PPM etc., are used to compress two states of the system. If NCD = 0, then the two states are identical and if NCD = 1, then the two states are totally random.

The order parameter for random Boolean networks with random functions with bias p, $s = 2 \cdot K \cdot p \cdot (1 - p)$ [8], shows the dynamical regime of the network. The network is ordered if s < 1, critical if s = 1 and chaotic if s > 1. The actual value of s represents the degree of regularity or the degree of chaoticity, depending on the dynamical region of the system’s behaviour.

2. DATA

For different pairs of (K, p) and N = 1000, time series of network states are generated starting from a random state and run forward in time for 15 time points. We observe from computer simulations that, approximately after this time point, ordered networks have reached the attractor and all the states are identical. There is no additional information that can be gained by running the simulation further. The analysis of time series measurements is performed in the transient phase of the evolution in time of the networks, where the largest differences in the network dynamics take place, according to the network topologies. Since the nodes of the network are randomly connected, each experiment is repeated 20 times and the average value of the quantitative measure of interest is taken. Different network structures are created by varying the fixed in-degree K from 1 to 10 and different dynamics by varying the function bias p from 0.01 to 0.5, with 20 linearly spaced values in this range. 3. METHODS For a chosen pair (K, p), the network starts from a random state and is run forward in time for 15 time steps. The NCD is computed between the state of the system at time point t1 and the state of the system at time point t2 , where t1 = {1, 2, ..., 15} and t2 = {t1 + 1, t1 + 2, ..., 15}. In order to determine the dynamical regime of the network, the NCD based Derrida plot [7] is extended to the following representation: the NCD(system(t1 )),system(t2 )), denoted as N CDx , is plotted on the x axis and the NCD (system(t1 + 1)),system(t2 + 1)), denoted as N CDy , is plotted on the y axis. The NCD points are interpolated using the Matlab function smooth.m, with the option ’lowess’ [9]. This function performs polynomial interpolation using a moving average filter for smoothing the data. In the ’lowess’ op-

tion, the method of interpolation is local regression using weighted least squares and the degree of the polynomial is 1. The results for unbiased and biased networks are shown in Figure 1 and Figure 2, respectively. The area between the fitted curve and the diagonal is computed using the Matlab function trapz.m [9]. It numerically integrates the curve given by the vectors of points N CDx and N CDy , using the trapezoidal rule. This method approximates the curve between two consecutive points by a line. For each pair (K, p), the previous steps are repeated 20 times. The median of the area values is taken between the 20 experiments. The median value of the area (A) is used in subsequent computations. A three dimensional scatter plot is created with the vector of points K, p and A. A surface fitting is performed on the two dimensional grid given by (K, p) using the scatter data A, with the Matlab function gridfit.m, developed by John D’Errico and available at the Matlab Central website [10]. The results are shown in Figure 3. 4. RESULTS AND DISCUSSION There are fewer states available to measure in the case of time series data, than in the case of a perturbation analysis. The classical NCD based Derrida curve [7], applied to this type of data, does not provide enough points for an accurate interpolation. As a result, the Derrida plot is extended by taking the NCD between states that are more than one time step apart, as described in the Methods section. Figure 1 is created for unbiased networks. The function bias is fixed p = 0.5 and the in-degree K is varied from 1 to 4. The first panel of Figure 1 shows the extended Derrida plot for an ordered random Boolean network with sensitivity s = 0.5. Ordered networks have points below the diagonal, covering almost the entire range of the NCD measure. The area between the fitted curve and the diagonal decreases to zero as the network dynamics change from extremely ordered behaviour to critical behaviour. 
For critical networks, i.e. sensitivity s = 1, shown in the second panel, the NCD points form an elongated cloud stretching along the diagonal. The area between the fitted curve and the diagonal is approximately zero. The third and fourth panels show chaotic networks. The NCD points are tightly clustered on the diagonal around the point (0.65, 0.65). It is not possible to quantitatively separate different chaotic networks. It is possible, however, to qualitatively distinguish them from ordered and critical networks, based on the shape of the clouds of points. Chaotic behaviour is defined as an exponential amplification of small perturbations. When dealing with time series of the evolution of the system, applying small perturbations and measuring their progress in time may not be possible. In this case, extended Derrida plots such as those in panels 3 and 4 are used to show the presence of chaos.

Figure 2 is created for biased networks. The in-degree is fixed at K = 3 and the function bias p takes the values [0.036 0.113 0.216 0.423]. The first two panels of this figure contain ordered networks with different degrees of regularity. The sensitivity of the first network is s = 0.208 and the sensitivity of the second network is s = 0.601. The area between the fitted curve and the diagonal is larger for the first network than for the second. In the case of ordered networks, the area measure is a valid indicator of system dynamics, because it acts as a proxy for the network sensitivity. The elongation of the cloud of NCD points along the diagonal in panel 3 shows that the network operates at the phase transition between the ordered and chaotic regimes. In panel 4 a tight cloud of NCD points can be observed, similar to the one for the unbiased chaotic networks in panels 3 and 4 of Figure 1. The states of a chaotic network are always far apart, such that the pairs of NCD values fluctuate around the point (0.65, 0.65). This causes the NCD points to cluster in a small compact cloud. Ideally, the NCD value between totally random states should be equal to 1. However, due to the finite size of the network, the states cannot diverge indefinitely and become totally random, so the NCD can never reach 1. The finite size effect is eliminated by interpolating over the interval [0, 1] to find the polynomial for the whole range of NCD values.

The purpose of analysing time series data is to determine the dynamical regime of the system when either K or p or both are unknown, and to infer the value of the unknown parameter. The first question is answered by examining the extended Derrida plots. The inference problem is solved using the nonlinear relationship between the area A, K and p, shown in Figure 3. This is possible only when exactly one of K and p is unknown. Given one of the parameters K or p, the other one can be computed using the fitted surface of the area A as a function of K and p. The case when both of them are unknown is not dealt with in this paper.

5. CONCLUSIONS

We have investigated the dynamical behaviour of random Boolean networks from time series measurements of the state of the network, shown in Figures 1 and 2. We have developed an experimental method of inferring the parameters K and p when one of them is unknown. It is based on surface interpolation, illustrated in Figure 3. The extended Derrida plot provides a qualitative identification procedure for ordered, critical or chaotic networks. The time series data only carries quantitative information about the system dynamics for ordered and critical networks. The limitation of our area measure lies in the range of chaotic behaviour. It is not possible to quantitatively tell apart chaotic networks that have different K and p parameters, because the area between the fitted curve and the diagonal in the extended Derrida plot is approximately zero in all cases. However, as biological networks have been shown to be either ordered or critical [11], [12], [13], this drawback does not affect the analysis of real biological time series data. Future work will develop a method of inferring K and p from time series data when both of them are unknown. The next step will be to apply these methods to real time series measurements of biological networks.
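The two quantities that the discussion above relies on, the NCD between states and the area between a curve and the diagonal, can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: zlib stands in for the compressor (the compressor actually used is not stated in this section), the NCD follows the usual definition from [5], [6], and the function names `compressed_size`, `ncd` and `area_below_diagonal` as well as the sign convention (diagonal minus curve) are hypothetical choices made here.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def area_below_diagonal(xs, ys):
    """Trapezoidal estimate of the signed area between the diagonal y = x
    and a curve sampled at increasing xs (positive if the curve lies below)."""
    area = 0.0
    for k in range(1, len(xs)):
        gap_left = xs[k - 1] - ys[k - 1]
        gap_right = xs[k] - ys[k]
        area += 0.5 * (gap_left + gap_right) * (xs[k] - xs[k - 1])
    return area


if __name__ == "__main__":
    rng = random.Random(0)
    a = bytes(rng.getrandbits(8) for _ in range(1000))
    b = bytes(rng.getrandbits(8) for _ in range(1000))
    # Identical states give a small NCD, unrelated random states a value near 1.
    print(ncd(a, a), ncd(a, b))
    # A curve lying below the diagonal encloses a positive area.
    print(area_below_diagonal([0.0, 0.5, 1.0], [0.0, 0.25, 0.5]))
```

With a real compressor the NCD of two unrelated states stays slightly below 1, which is in line with the finite-size remark made above.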

Figure 1. The extended NCD based Derrida plot for unbiased RBNs, i.e. p = 0.5, with fixed in-degree K = [1 2 3 4].

Figure 2. The extended NCD based Derrida plot for biased RBNs, p = [0.036 0.113 0.216 0.423], with fixed in-degree K = 3.

Figure 3. The area between the interpolated NCD points and the diagonal of the extended Derrida plot as a function of the fixed in-degree K and the Boolean function bias p.

6. ACKNOWLEDGMENTS

The funding for this work has come from the Academy of Finland project no. 132877, project no. 251937 and from the Graduate School in Electronics, Telecommunication and Automation (GETA), Aalto University.

7. REFERENCES

[1] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology, vol. 22, pp. 437–467, 1969.

[2] S. Kauffman, C. Peterson, B. Samuelsson, and C. Troein, “Random boolean network models and the yeast transcriptional network,” PNAS, vol. 100, pp. 14796–14799, 2003.

[3] A. S. Ribeiro, S. A. Kauffman, J. Lloyd-Price, B. Samuelsson, and J. E. Socolar, “Mutual information in random boolean models of regulatory networks,” Physical Review E, vol. 77, 2008.

[4] B. Derrida and Y. Pomeau, “Random networks of automata: a simple annealed approximation,” Europhysics Letters, vol. 1, pp. 45–49, 1986.

[5] M. Li, X. Chen, X. Li, B. Ma, and P. M. Vitanyi, “The similarity metric,” IEEE Transactions on Information Theory, vol. 50, 2004.

[6] R. Cilibrasi and P. M. Vitanyi, “Clustering by compression,” IEEE Transactions on Information Theory, vol. 51, 2005.

[7] M. Nykter, N. D. Price, M. Aldana, S. A. Ramsey, S. A. Kauffman, L. E. Hood, O. Yli-Harja, and I. Shmulevich, “Gene expression dynamics in the macrophage exhibit criticality,” PNAS, vol. 105, no. 6, pp. 1897–1900, February 2008.

[8] I. Shmulevich and S. A. Kauffman, “Activities and sensitivities in boolean network models,” Physical Review Letters, vol. 93, no. 4, 2004.

[9] http://www.mathworks.se/help/toolbox/curvefit/smooth.html; http://www.mathworks.se/help/techdoc/ref/trapz.html

[10] http://www.mathworks.com/matlabcentral/fileexchange/8998

[11] I. Shmulevich, S. Kauffman, and M. Aldana, “Eukaryotic cells are dynamically ordered or critical but not chaotic,” PNAS, vol. 102, pp. 13439–13444, 2005.

[12] E. Balleza, E. Alvarez-Buylla, A. Chaos, S. Kauffman, I. Shmulevich, and M. Aldana, “Critical dynamics in genetic regulatory networks: examples from four kingdoms,” PLoS ONE, vol. 3, no. 6, 2008.

[13] M. Nykter, N. D. Price, A. Larjo, T. Aho, S. A. Kauffman, O. Yli-Harja, and I. Shmulevich, “Critical networks exhibit maximal information diversity in structure-dynamics relationships,” Physical Review Letters, vol. 100, February 2008.


A MODEL FOR PROLIFERATING CELL POPULATIONS THAT ACCOUNTS FOR CELL TYPES

Daniella Schittler, Jan Hasenauer, and Frank Allgöwer

Institute for Systems Theory and Automatic Control, University of Stuttgart, Germany
{daniella.schittler, jan.hasenauer, frank.allgower}@ist.uni-stuttgart.de

ABSTRACT

In this paper, we propose an extended model for cell proliferation to account for distinct cell types. This division-, cell type-, and label-structured population model allows for division number- and cell type-dependent parameters, as well as the comparison with experimental data. Using this model, the effects of proliferation properties on the population structure are analyzed for asymmetrically dividing stem cell populations.

1. INTRODUCTION

Cell proliferation is a central topic of systems biological research. Mathematical models have been tailored to analyze measurement data from proliferation assays, e.g., using CFSE labeling [1, 2, 3]. Recently, a cell proliferation model has been introduced which integrates cell division numbers as well as label dynamics [4, 5, 6]. This model offers several advantages compared to previous models [1, 2], but does not account for the fact that cell populations often comprise subpopulations with distinct proliferation properties [6, 7, 8, 9, 10]. Examples are active versus quiescent stem cells, dividing stem cells versus committed cells, or healthy versus cancerous cells. In this paper, we propose a model that accounts for distinct cell types. It incorporates label dynamics, cell division, and transitions between distinct cell types. Therefore, we call it “division-, cell type-, and label-structured population” (DCLSP) model, in analogy to existing models [4, 5, 6]. In Section 2, we introduce the DCLSP model and delineate a decomposition approach, which allows for an efficient solution scheme. In Section 3, the DCLSP model is employed to study an asymmetrically dividing stem cell population.

2. METHODS

2.1. Structured population model

A recently developed cell proliferation model [4, 5, 6], for which we coined the term “division- and label-structured population model” (DLSP), describes individual cells in proliferation assays using (a) the label concentration x (continuous) and (b) the number of cell divisions i (discrete) of a cell. In the paper at hand, we extend this to a DCLSP model by (c) the cell type j (discrete) that a cell belongs to.

While the number of cell divisions i is theoretically unlimited, $i \in \mathbb{N}_0$, the number of cell types j is finite, $j \in \{1, \dots, J\}$. According to the considered properties, the state of an individual cell is defined by (x, i, j). As our model describes the population statistics of the individual cells contained in the population, the model’s state variables are the number densities $n(x, i, j|t)$ of observing a cell with label concentration x, division number i, and cell type j at time t. This number density changes due to label degradation, as well as fluxes between distinct subpopulations (division numbers and cell types). When a cell of division number i divides, two cells of division number (i + 1) appear. In addition, a cell of type $j_1$ can change into another cell type $j_2$. The DCLSP model, representing the dynamics of $n(x, i, j|t)$, is given by a system of PDEs, $\forall i \in \mathbb{N}_0$, $\forall j \in \{1, \dots, J\}$:

\[
\frac{\partial n(x,i,j|t)}{\partial t} - k(t)\frac{\partial (x\, n(x,i,j|t))}{\partial x}
= -\sum_{\mu=1}^{J} \delta_i^{\mu j}(t)\, n(x,i,j|t) + \sum_{\mu=1}^{J} \delta_i^{j\mu}(t)\, n(x,i,\mu|t)
- \beta_i^{j}(t)\, n(x,i,j|t) - \alpha_i^{j}(t)\, n(x,i,j|t)
+ \begin{cases} 0, & i = 0 \\ 2\gamma \sum_{\mu=1}^{J} \alpha_{i-1}^{\mu}(t)\, w_{i-1}^{j\mu}(t)\, n(\gamma x, i-1, \mu|t), & i \geq 1 \end{cases}
\]

with initial conditions

\[
n(x,0,j|0) = n_{ini}(x,0,j), \qquad \forall i \geq 1: \; n(x,i,j|0) = 0.
\]

In the following, it is assumed that the labeling efficiency is independent of the cell type, which is a plausible assumption, yielding $n_{ini}(x,0,j) = N_{ini}(0,j) \cdot p_{ini}(x)$. The dynamics of $n(x,i,j|t)$ are determined by the fluxes:

• $-k(t)\,\partial(x\, n(x,i,j|t))/\partial x$: label degradation with rate $k(t)x$,

• $-\beta_i^{j}(t)\, n(x,i,j|t)$ and $-\alpha_i^{j}(t)\, n(x,i,j|t)$: cell death and cell division in subpopulation (i, j),

• $+\alpha_{i-1}^{j_1}(t)\, w_{i-1}^{j_2 j_1}(t)\, n(\gamma x, i-1, j_1|t)$: cell division in subpopulation $(i-1, j_1)$, and simultaneous transition to cell type $j_2$ with transition probability $w_i^{j_2 j_1}(t)$ ($\forall i, j_1, t$: $\sum_{j_2=1}^{J} w_i^{j_2 j_1}(t) = 1$), and

• $\pm\delta_i^{j_2 j_1}(t)\, n(x,i,j_1|t)$: transitions from cell type $j_1$ to $j_2$ (independent of cell division).
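As a brief consistency check on the division term (a sketch added for illustration, assuming that $n(\cdot, i, j|t)$ is integrable over $x \geq 0$, and writing $\gamma$ for the label dilution factor, $\alpha$ for the division rates and $w$ for the transition probabilities), integrating the division inflow over the label concentration with the substitution $u = \gamma x$ shows why a prefactor $2\gamma$ accompanies $n(\gamma x, \cdot)$:

```latex
\int_0^\infty 2\gamma\, \alpha_{i-1}^{\mu}(t)\, w_{i-1}^{j\mu}(t)\, n(\gamma x, i-1, \mu|t)\, \mathrm{d}x
  = 2\, \alpha_{i-1}^{\mu}(t)\, w_{i-1}^{j\mu}(t) \int_0^\infty n(u, i-1, \mu|t)\, \mathrm{d}u .
```

Each dividing cell therefore contributes exactly two daughter cells to the cell count, while the argument $\gamma x$ encodes that a daughter observed at label concentration $x$ stems from a mother cell at concentration $\gamma x$.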

The division rates $\alpha_i^{j}(t) \in [0, \infty)$, the death rates $\beta_i^{j}(t) \in (0, \infty)$, and the transition rates $\delta_i^{j_2 j_1}(t) \in [0, \infty)$ may depend on division number i, cell type j, and time t. The parameter $\gamma \in (1, 2]$ denotes the label dilution upon cell division (cf. [2, 3, 4]). The DCLSP model provides information about:

• Number of cells in each subpopulation: The number of cells in the subpopulation (i, j) is

\[
N(i,j|t) = \int_{\mathbb{R}_+} n(x,i,j|t)\, dx. \tag{1}
\]

Furthermore, the number of cells with the same division number i (regardless of cell type), as well as the number of cells of the same cell type j (regardless of division number), can be obtained by summing over all j or all i, respectively.

• Label density in each subpopulation: The probability density over x, given that a cell belongs to the subpopulation (i, j), is for all (i, j, t) for which $N(i,j|t) > 0$ given by

\[
p(x|i,j,t) = \frac{n(x,i,j|t)}{N(i,j|t)}. \tag{2}
\]

• Overall label density: The overall label density over x in the whole population follows as:

\[
m(x|t) = \sum_{i=0}^{\infty} \sum_{j=1}^{J} n(x,i,j|t), \tag{3}
\]

which corresponds to the experimental output in proliferation assays.

2.2. Decomposition of model and solution

Although the DCLSP model is a system of coupled PDEs, it can be split up into two independent parts, each representing the dynamics of a biologically distinct process: A system of ordinary differential equations (ODEs) describing the dynamics of numbers of cells $N(i,j|t)$ that belong to subpopulations (i, j) at time t; and a set of decoupled PDEs that governs the probability density $p(x|i,j,t)$ of finding a cell at label intensity x, given that it belongs to (i, j) at time t. The overall solution is assembled from the solutions of these two parts:

\[
\forall i, \forall j: \quad n(x,i,j|t) = N(i,j|t) \cdot p(x|i,j,t). \tag{4}
\]

The proof works similar to the case without cell types (J = 1) for which we have proven the decomposition previously [4, 5]. This decomposition yields two decoupled model parts:

(i) Number of cells with i divisions, cell type j at time t, $N(i,j|t)$:

\[
\frac{dN(i,j|t)}{dt} = -\sum_{\mu=1}^{J} \delta_i^{\mu j}(t)\, N(i,j|t) + \sum_{\mu=1}^{J} \delta_i^{j\mu}(t)\, N(i,\mu|t)
- \beta_i^{j}(t)\, N(i,j|t) - \alpha_i^{j}(t)\, N(i,j|t)
+ \begin{cases} 0, & i = 0 \\ 2 \sum_{\mu=1}^{J} \alpha_{i-1}^{\mu}(t)\, w_{i-1}^{j\mu}(t)\, N(i-1,\mu|t), & i \geq 1 \end{cases} \tag{5}
\]

with initial conditions $\forall j: N(0,j|0) = N_{ini}(0,j)$, $\forall i \geq 1: N(i,j|0) = 0$. This ODE system (5) governs the dynamics of a division- and cell type-structured population (DCSP).

(ii) Probability density of a cell with i divisions and cell type j to have label concentration x at time t, $p(x|i,j,t)$:

\[
\frac{\partial p(x|i,j,t)}{\partial t} - k(t)\frac{\partial (x\, p(x|i,j,t))}{\partial x} = 0 \tag{6}
\]

with initial conditions $\forall i, j: p(x|i,j,0) = \gamma^i p_{ini}(\gamma^i x)$. This decomposition is powerful as it renders the model amenable to established analysis tools. The PDE (6) can be solved via the method of characteristics (cf. [4, 5]), and the system of ODEs (5) can be analyzed by frequency domain approaches, deriving dependencies between parameters and the model solution. In the following section we show how this feature can be exploited.

3. RESULTS

3.1. Asymmetrically dividing stem cell population

In this section, we will investigate the dynamics of a cell population which descends from a stem cell pool, but where cells can commit towards a second, more mature stage upon cell division. In asymmetric cell division, one daughter cell remains a stem cell, while the second cell is a committed cell. In symmetric cell division, both daughter cells are of the same cell type (either stem cells or committed cells) [8]. Our case study is based on the following assumptions:

• There are two distinct cell types: stem cells (j = 1) and committed cells (j = 2).

• Only stem cells divide ($\forall i: \alpha_i^{1} \geq 0$), but not committed cells: $\forall i: \alpha_i^{2} = 0$.

• Cell type transitions only occur in the context of asymmetric cell division, not without cell division: $\delta_i^{j_2 j_1} = 0 \;\; \forall i, j_1, j_2$.

Time and division number dependency of parameters is omitted, $\forall i: \beta_i^{j}(t) = \beta^{j} > 0$, $\alpha_i^{j}(t) = \alpha^{j} \geq 0$, and $w_i^{11} = w^{11}$, $w_i^{21} = w^{21} \in [0, 1]$, with $w^{21} = 1 - w^{11}$. The resulting model structure for this scenario is illustrated in Fig. 1, with subpopulations (i, j) depicted as compartments, and fluxes due to cell division.

Figure 1. Illustration of the model for asymmetrically dividing stem cells: Subpopulation (i, j) of cells with i cell divisions and cell type j; and the fluxes between these subpopulations. [Diagram: stem cell compartments (0, 1), (1, 1), (2, 1), ... and committed cell compartments (1, 2), (2, 2), ..., connected by division fluxes $w_i^{11}\alpha_i^{1}$ and $w_i^{21}\alpha_i^{1}$.]
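Under the case-study assumptions listed above, the ODE system (5) reduces to a linear chain in which each division class i is fed only by stem cells of class i - 1. The following is a minimal numerical sketch of that reduced system, not the authors' implementation: it truncates the division number at a hypothetical `i_max`, uses a hand-written classical Runge-Kutta step, and the names `dcsp_rhs` and `simulate` as well as all parameter values are illustrative assumptions.

```python
import math


def dcsp_rhs(N, alpha1, beta1, beta2, w11):
    """Right-hand side of the reduced DCSP model.

    N[i] = (stem cells, committed cells) with i divisions.
    Only stem cells divide; a daughter stays a stem cell with probability w11.
    """
    w21 = 1.0 - w11
    dN = []
    for i, (stem, committed) in enumerate(N):
        # Two daughters per division of a stem cell with i - 1 divisions.
        inflow = 2.0 * alpha1 * N[i - 1][0] if i >= 1 else 0.0
        d_stem = -(alpha1 + beta1) * stem + w11 * inflow
        d_committed = -beta2 * committed + w21 * inflow
        dN.append((d_stem, d_committed))
    return dN


def simulate(t_end, alpha1, beta1, beta2, w11, n_ini=1.0, i_max=40, steps=400):
    """Integrate the truncated chain from t = 0 to t_end with RK4 steps."""
    N = [(n_ini, 0.0)] + [(0.0, 0.0)] * i_max
    h = t_end / steps

    def shifted(state, slope, c):
        return [(a + c * ka, b + c * kb) for (a, b), (ka, kb) in zip(state, slope)]

    args = (alpha1, beta1, beta2, w11)
    for _ in range(steps):
        k1 = dcsp_rhs(N, *args)
        k2 = dcsp_rhs(shifted(N, k1, h / 2), *args)
        k3 = dcsp_rhs(shifted(N, k2, h / 2), *args)
        k4 = dcsp_rhs(shifted(N, k3, h), *args)
        N = [
            (a + h / 6 * (p1 + 2 * p2 + 2 * p3 + p4),
             b + h / 6 * (q1 + 2 * q2 + 2 * q3 + q4))
            for (a, b), (p1, q1), (p2, q2), (p3, q3), (p4, q4)
            in zip(N, k1, k2, k3, k4)
        ]
    return N


if __name__ == "__main__":
    N = simulate(2.0, alpha1=0.5, beta1=0.1, beta2=0.2, w11=0.8)
    print("total stem cells:     ", sum(s for s, _ in N))
    print("total committed cells:", sum(c for _, c in N))
```

The truncation at `i_max` discards the daughters of cells in the last class, so `i_max` should be chosen well above the number of divisions expected within the simulated time span.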

To study the dynamics of subpopulation sizes $N(i,j|t)$, we study the corresponding DCSP model: $\forall i \in \mathbb{N}_0$:

\[
\frac{dN(i,1|t)}{dt} = -\beta^{1} N(i,1|t) - \alpha^{1} N(i,1|t) + \begin{cases} 0, & i = 0 \\ 2\alpha^{1} w^{11} N(i-1,1|t), & i \geq 1 \end{cases}
\]
\[
\frac{dN(i,2|t)}{dt} = -\beta^{2} N(i,2|t) + \begin{cases} 0, & i = 0 \\ 2\alpha^{1} w^{21} N(i-1,1|t), & i \geq 1 \end{cases}
\]

with initial conditions $N(0,1|0) = N_{ini}$, $\forall (i,j) \neq (0,1): N(i,j|0) = 0$. For this model the analytical solution can be determined,

\[
N(i,1|t) = \frac{(2w^{11}\alpha^{1} t)^i}{i!}\, e^{-(\beta^{1}+\alpha^{1})t}\, N_{ini}, \qquad
N(i,2|t) = \frac{w^{21}}{w^{11}} \frac{(2w^{11}\alpha^{1})^i}{(i-1)!}\, e^{-\beta^{2} t} \left( \int_0^t \tau^{i-1} e^{(\beta^{2}-\beta^{1}-\alpha^{1})\tau}\, d\tau \right) N_{ini}. \tag{7}
\]

Given the solution of (7), the solution $p(x|i,j,t)$ of (6) (which can be found in [4, 5]) can be used to reassemble the overall solution (3). This is the case as the previously introduced decomposition principle holds.

Remark. The derivation of (7), as well as the solution to the more general case with division number dependent parameters, can be found in the online supplementary file at www.ist.uni-stuttgart.de/~schittler/Publications.shtml.

3.2. Analysis of proliferation properties

Since high-quality stem cell pools are rare and costly to achieve, it is relevant to identify conditions under which they will not diminish too fast. Therefore, a crucial question in the introduced scenario of stem cell division is:

(Q1) How can the total amount of stem cells be kept constant, and which parameters have to be controlled in order to achieve this?

In order to answer this question, we consider the total amount of stem cells

\[
\bar{N}(1|t) = \sum_{i=0}^{\infty} N(i,1|t) = e^{((2w^{11}-1)\alpha^{1} - \beta^{1})t}\, N_{ini}. \tag{8}
\]

From the sign of the exponent in (8), the total number of stem cells

\[
\begin{aligned}
\bar{N}(1|t) \text{ increases} &\Leftrightarrow \beta^{1} < (2w^{11}-1)\alpha^{1}, \\
\bar{N}(1|t) \text{ is constant} &\Leftrightarrow \beta^{1} = (2w^{11}-1)\alpha^{1}, \text{ and} \\
\bar{N}(1|t) \text{ decreases} &\Leftrightarrow \beta^{1} > (2w^{11}-1)\alpha^{1}.
\end{aligned} \tag{9}
\]

The parameter subspace on which the amount of stem cells remains constant, along with illustrative projections, is depicted in Fig. 2. Since biologically feasible parameter values are $\alpha \in [0, \infty)$ and $\beta \in (0, \infty)$, a necessary condition to conserve the amount of stem cells is $w^{11} > 0.5$. This observation, along with the feasible parameter value ranges, answers (Q1).

Figure 2. (a) Parameter subspace on which the total amount of stem cells remains constant. For parameter values above the plane, $\bar{N}(1|t)$ increases, whereas below, $\bar{N}(1|t)$ decreases. (b,c) Projections for exemplary fixed values in $\beta^{1}$-$w^{11}$-plane and in $\alpha^{1}$-$w^{11}$-plane.

A second interesting question originates from two conflicting goals: Minimization of the number of committed cells, in order to keep the stem cell pool as “pure” as possible; or maximization of the number of committed cells, to build up or repair tissue. One may therefore ask:

(Q2) What is the expected ratio of the number of committed cells to the number of stem cells?

As also demonstrated, e.g., in [9], this question of optimizing the ratio of committed-to-stem cells may be addressed by mathematical modeling to gain additional biological insight. To approach this question, we consider the committed-to-stem cells ratio, for $\beta^{2} - \beta^{1} + (2w^{11}-1)\alpha^{1} \neq 0$:

\[
\frac{\bar{N}(2|t)}{\bar{N}(1|t)} = \frac{2w^{21}\alpha^{1}}{\beta^{2} - \beta^{1} + (2w^{11}-1)\alpha^{1}} \cdot \left( 1 - e^{-(\beta^{2} - \beta^{1} + (2w^{11}-1)\alpha^{1})t} \right)
\]

(detailed derivation in supplement, cf. Remark). The sign of $(\beta^{2} - \beta^{1} + (2w^{11}-1)\alpha^{1})$ determines the dynamics of the committed-to-stem cells ratio:

(i) If $\beta^{1} - \beta^{2} < (2w^{11}-1)\alpha^{1}$, then $\bar{N}(2|t)/\bar{N}(1|t)$ will grow with saturation to the limit

\[
R^{max}_{2/1} = \lim_{t \to \infty} \frac{\bar{N}(2|t)}{\bar{N}(1|t)} = \frac{2w^{21}\alpha^{1}}{\beta^{2} - \beta^{1} + (2w^{11}-1)\alpha^{1}}.
\]

(ii) If $\beta^{1} - \beta^{2} = (2w^{11}-1)\alpha^{1}$, then $\bar{N}(2|t)/\bar{N}(1|t) = 2w^{21}\alpha^{1}\, t\, e^{-(\beta^{2} - \beta^{1} + (2w^{11}-1)\alpha^{1})t}$, which reduces to linear growth $2w^{21}\alpha^{1} t$ because the exponent vanishes in this case.

(iii) If $\beta^{1} - \beta^{2} > (2w^{11}-1)\alpha^{1}$, then $\bar{N}(2|t)/\bar{N}(1|t)$ will approach exponential growth.

Note that if $\bar{N}(1|t)$ increases according to (9), then case (i) always holds since $\beta^{2} > 0$. Further effects of the individual parameters on the maximum ratio can be assessed via the derivatives $\partial R^{max}_{2/1}/\partial \bullet$: Increasing $\beta^{1}$ or $w^{21}$ always increases the fraction of committed cells, whereas increasing $\beta^{2}$ or $w^{11}$ reduces $R^{max}_{2/1}$. The effect of the cell division rate $\alpha^{1}$ interestingly depends on the relationship of the death rates: For $\beta^{2} > \beta^{1}$, increasing $\alpha^{1}$ has a positive effect on $R^{max}_{2/1}$; whereas for $\beta^{2} < \beta^{1}$, it has a negative effect. These observations answer (Q2).

The focus of this exemplary case study was on the dynamics of the number of cells of the distinct subpopulations. Clearly, for comparison to measurement data, the label dynamics and finally the overall label density also have to be considered. We confirmed our analytically derived results by simulations of the overall DCLSP model (results not shown). Although a comparison with measured label densities was beyond the scope of this contribution, the study of cell numbers alone already highlights crucial properties of the cell proliferation dynamics.

4. CONCLUSION & OUTLOOK

In this paper, we presented an extension of a cell proliferation model [4, 5] to account for distinct cell types. The model proposed here allows representing the dynamics of proliferating cell populations which comprise several cell types. Each subpopulation may have distinct proliferation properties, corresponding to distinct parameter values. In contrast to most existing structured population models (see, e.g., [10]) which consider different cell types (but no division number), we managed to derive a partially analytical solution, which reduces the computational complexity. We demonstrated how our model can serve to analyze specific scenarios, arising, e.g., for populations of stem cells, which may partly differentiate into a committed cell type upon asymmetric or symmetric cell division. Such an analysis can yield conclusions about parameter settings for which qualitatively different population dynamics arise. This renders the model especially valuable if biological experiments are time consuming or not at hand, or it can be used to complement experimental investigations, e.g., by comparing and validating hypotheses.

Our model can furthermore be used to infer proliferation parameters from CFSE data, as done already with existing models [2, 3, 6]. Moreover, it offers the possibility of studying the subpopulation structure using typical measurements from CFSE labeling assays, which so far was not possible with existing models. To this end, the overall label density has to be reassembled and compared to measurement data. This was out of the scope of this work, but will be addressed in future work. For cell populations containing cell types with distinct proliferation properties, our model will allow for more realistic dynamics and better explanation of data from cell proliferation assays. It might be combined with models considering the age distribution in cell populations [11] in future work.

5. ACKNOWLEDGMENTS

The authors gratefully acknowledge financial support from the German Research Foundation (DFG) within the Cluster of Excellence in Simulation Technology (EXC 310/1) at the University of Stuttgart, from the German Federal Ministry of Education and Research (BMBF) within the SysTec program (grant nr. 0315-506A), and from The MathWorks Foundation of Science and Engineering.

6. REFERENCES

[1] R.J. De Boer, V. Ganusov, D. Milutinović, P. Hodgkin, and A. Perelson, “Estimating lymphocyte division and death rates from CFSE data,” Bull. Math. Biol., vol. 68, pp. 1011–1031, 2006.

[2] T. Luzyanina, D. Roose, T. Schenkel, M. Sester, S. Ehl, A. Meyerhans, and G. Bocharov, “Numerical modelling of label-structured cell population growth using CFSE distribution data,” Theor. Biol. Med. Model., vol. 4, pp. 1–14, 2007.

[3] H. Banks, K. Sutton, W. Thompson, G. Bocharov, D. Roose, T. Schenkel, and A. Meyerhans, “Estimation of cell proliferation dynamics using CFSE data,” Bull. Math. Biol., vol. 73, pp. 116–150, 2010.

[4] D. Schittler, J. Hasenauer, and F. Allgöwer, “A generalized population model for cell proliferation: Integrating division numbers and label dynamics,” Proc. of 8th Workshop on Comp. Syst. Biol. (WCSB), TICSP series #57, pp. 165–168, 2011.

[5] J. Hasenauer, D. Schittler, and F. Allgöwer, “A computational model for proliferation dynamics of division- and label-structured populations,” Technical Report arXiv:1202.4923v1 [q-bio.PE], 2012.

[6] W.C. Thompson, “Partial differential equation modeling of flow cytometry data from CFSE-based proliferation assays,” Ph.D. thesis, North Carolina State University, U.S.A., 2011.

[7] A. Wilson, E. Laurenti, G. Oser, R.C. van der Wath, W. Blanco-Bose, M. Jaworski, S. Offner, C.F. Dunant, L. Eshkind, E. Bockamp, P. Lio, H.R. MacDonald, and A. Trumpp, “Hematopoietic stem cells reversibly switch from dormancy to self-renewal during homeostasis and repair,” Cell, vol. 135, pp. 1118–1129, 2008.

[8] S. J. Morrison and J. Kimble, “Asymmetric and symmetric stem-cell divisions in development and cancer,” Nature, vol. 441, pp. 1068–1074, 2006.

[9] G. Balazsi, A. van Oudenaarden, and J. Collins, “Cellular decision making and biological noise: from microbes to mammals,” Cell, vol. 144, no. 6, pp. 910–925, 2011.

[10] M. Gyllenberg and G.F. Webb, “A nonlinear structured population model of tumor growth with quiescence,” J. Theor. Biol., vol. 28, pp. 671–694, 1990.

[11] P. Metzger, J. Hasenauer, and F. Allgöwer, “Modeling and analysis of division-, age-, and label-structured cell populations,” publication at the 9th Workshop on Comp. Syst. Biol. (WCSB), 2012.


PARETO-OPTIMAL RNA SEQUENCE-STRUCTURE ALIGNMENTS

Thomas Schnattinger^1, Uwe Schöning^1 and Hans A. Kestler^{2,*}

^1 Institute of Theoretical Computer Science, Ulm University; ^2 Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, Ulm University
Ulm University, D-89069 Ulm, Germany

T.S. is supported by the DFG (Scho302/8-2). H.A.K. acknowledges support by the BMBF (SyStaR). * To whom correspondence should be addressed.

ABSTRACT

Functional RNA molecules often are conserved in their secondary structure rather than in their primary sequence. To assess functional similarity, primary sequence as well as secondary structure information need to be taken into account. Based on a Sankoff-style algorithm (cf. [1]) for sequence-structure alignment, we developed a method which results in a set of Pareto-optimal alignments, so that a prior weighting of the structure and alignment objectives is not necessary. We also show that a conventional algorithm which calculates an optimal alignment regarding a single objective function may not always be able to find all biologically relevant secondary structures.

1. INTRODUCTION

Unlike messenger RNA, which is translated into protein, non-coding RNA is known to play important roles in the organism, for example in translation or gene regulation [2]. To investigate RNA families and their structural similarity, multiple alignments are important instruments for molecular biologists. Multiple alignments can be produced by first calculating pairwise alignments and combining these into a progressive multiple alignment [3]. This is a common concept that is also found in agglomerative hierarchical cluster algorithms, in which a pairwise distance is then extended to groups. Many families of non-coding RNA show little sequence similarity, but a conserved secondary structure. This is because the function of RNA molecules, besides their primary sequence, is mainly defined by their secondary structure (cf. [4]; see Figure 1). Standard sequence alignment tools fail when the sequence identity is too low [5]. To produce reliable alignments of RNA, both the primary sequence and the secondary structure need to be taken into account.

Figure 1. Example of an RNA molecule. The primary sequence is the sequence of nucleotides (encircled), whereas the base pairs (in light grey) between nucleotides make up the secondary structure.

To our knowledge Sankoff [1] was the first to suggest finding an optimal alignment and the corresponding consensus secondary structure simultaneously. This algorithm, however, is computationally expensive, having a runtime of $O(n^6)$ for two sequences of length n. As this is not practical, more efficient heuristic approaches have been developed [6, 7]. Instead of computing the alignment and the folding at the same time, the secondary structures can be pre-calculated. Hofacker and co-workers [8, 9] showed that a sequence-structure alignment can be found more efficiently by first computing McCaskill’s base pair probability matrices (BPPM) of the two sequences [10], which contain structure information of all possible secondary structures. Using these BPPMs, a pairwise alignment together with a consensus secondary structure can be calculated that maximizes an objective function which is composed of specific parts for sequence and structure.

One of the limits of this approach can be the fixed weighting between the scoring for the sequence alignment and for the consensus structure. This fixing of the influence of different parts of an objective function is a well known problem in decision theory [11, 12]. In the sequence alignment context, multi-objective optimization has been applied to overcome fixed definitions for matches and indels [13, 14]. Taneda [15] also used this principle as part of an evolutionary computation that approximates sets of sequence-structure alignments. If this weighting is estimated or optimized prior to a calculation of functional similarity, alignments may be missed which show an energetically stable secondary structure. We propose to split this objective function into multiple independent objectives and utilize them in a dynamic programming optimization, generating a set of feasible Pareto-optimal solutions rather than one single solution. To the best of our knowledge, this is the first approach which applies Pareto-optimality in this way.

2. RNA SEQUENCE-STRUCTURE ALIGNMENT

The function of an RNA molecule is determined not only by its sequence, but also by its secondary structure. RNA sequences can be grouped into families that feature high sequence similarity and a common secondary structure. For example, the Rfam database [16] is a large collection of RNA sequences, currently arranged in 1973 different families. Since the construction of a good sequence alignment is impossible if the sequence similarity is too low [5], the secondary structure can be used as another criterion to improve the alignments. Sankoff’s algorithm constructs an optimal sequence alignment together with a consensus secondary structure. For two sequences of length n, it uses $O(n^6)$ time and $O(n^4)$ memory, and is therefore not practicable for larger n in many applications. Hofacker and co-workers [8, 9] present a simplified variant of the Sankoff algorithm, which pre-calculates structure information of the RNA sequences A and B. The tool RNAfold from the Vienna RNA Package [17] is used to compute the BPPMs $P^A$ and $P^B$. Since the logarithm of a base pair probability represents the free energy of this base pair [8], the structure score $\Psi^A$ for a base pair $(A_i, A_j)$ is defined (analogously $\Psi^B$) as

\[
\Psi^A_{ij} = \begin{cases} \log(P^A_{ij}/p_{min}), & \text{if } P^A_{ij} \geq p_{min} \\ 0, & \text{otherwise.} \end{cases} \tag{1}
\]

A sequence-structure alignment is a sequence alignment R of A and B, together with a secondary structure S. Here, $(i,k) \in R$ means that $A_i$ is matched with $B_k$, and S is a secondary structure, where $(i,j;k,l) \in S$ means that $(i,k) \in R$, $(j,l) \in R$, $A_i$ can form a base pair with $A_j$ and $B_k$ can form a base pair with $B_l$. The sequence-structure alignment problem is to find a sequence-structure alignment of A and B which maximizes the following equation ([8]):

\[
f(R,S) = \gamma \cdot N_{gap} + \sum_{\substack{(i,k) \in R \\ \text{unpaired}}} \sigma(A_i, B_k) + \sum_{(ij;kl) \in S} \left( \Psi^A_{ij} + \Psi^B_{kl} + \tau(A_i, A_j; B_k, B_l) \right) \tag{2}
\]

The sum in Equation (2) consists of three parts. The first one penalizes the insertion of gaps into the alignment, where $\gamma$ is the (negative) gap score and $N_{gap} = |A| + |B| - 2|R|$ is the number of gaps. The second part sums up the matches of all unpaired bases $A_i$ and $B_k$, which are scored by $\sigma(A_i, B_k)$. The last one sums up all weights $\Psi^A_{ij}$ and $\Psi^B_{kl}$ of matched base pairs in S, together with the base pair substitution score $\tau(A_i, A_j; B_k, B_l)$. An alignment $R^*$ with a structure $S^*$ which optimizes this objective function can be found by dynamic programming [8]. For the scoring functions $\tau$ and $\sigma$, experimentally determined substitution matrices like RIBOSUM85-60 can be used [18], but the remaining parameters $p_{min}$ and $\gamma$ must be estimated or optimized [8]. The problem is that the ratio of these parameters to one another is not clear. There may be applications where one does not know, or does not want to use, a fixed weighting between “sequence” and “structure” in this objective function.

3. MULTI-OBJECTIVE OPTIMIZATION

We split the single objective function from Equation (2) into d components. Since these components are treated separately, the concept of optimality of an alignment has to be generalized. A vector $a = (a_1, \dots, a_d)$ dominates $b = (b_1, \dots, b_d)$ if $a_i \geq b_i$ for all $1 \leq i \leq d$ and $a_j > b_j$ for at least one $1 \leq j \leq d$. A vector $a \in S$ is called Pareto-optimal regarding S if there is no $b \in S$ which dominates a. Accordingly, we define the problem of multi-objective optimization as the identification of all Pareto-optimal solutions in terms of given objective functions [11, 12]. The idea behind this definition is that the set of Pareto-optimal solutions represents all possible trade-offs in the weighting of different, possibly even conflicting objectives. Here, we will split the objective function into two parts: the “sequence” score $f_{seq}$ (Equation (3)), which sums up the matches and gaps of an alignment, and the “structure” score $f_{str}$ (Equation (4)), which sums up the matched base pairs (assuming a base pair substitution score $\tau = 0$).

\[
f_{seq}(R,S) = \gamma \cdot N_{gap} + \sum_{\substack{(i,k) \in R \\ \text{unpaired}}} \sigma(A_i, B_k) \tag{3}
\]
\[
f_{str}(R,S) = \sum_{(ij;kl) \in S} \left( \Psi^A_{ij} + \Psi^B_{kl} \right) \tag{4}
\]

Now, the gap penalty is $\gamma = (-1, 0)$, and the sequence scoring function $\sigma$ likewise is vector-valued with the second component zero. Define the new structure weight $\Psi^A$ (analogously $\Psi^B$) to be

\[
\Psi^A_{ij} = \begin{cases} \left( 0, \log(P^A_{ij}/p_{min}) \right), & \text{if } P^A_{ij} \geq p_{min} \\ (0, 0), & \text{otherwise.} \end{cases} \tag{5}
\]

We generalize the dynamic programming algorithm to use a vector-valued recursive scoring function $f(R,S) = (f_{seq}(R,S), f_{str}(R,S))$ (cf. [19]). Define $S_{i,j;k,l}$ to be the set of Pareto-optimal scoring vectors of alignments of the subsequences A[i..j] and B[k..l]. We obtain the following recursion:

\[
S_{i,j;k,l} = \text{Pareto-max of: } \left(
\begin{array}{l}
\{ s + \gamma : s \in S_{i+1,j;k,l} \} \;\cup \\
\{ s + \gamma : s \in S_{i,j;k+1,l} \} \;\cup \\
\{ s + \sigma(A_i, B_k) : s \in S_{i+1,j;k+1,l} \} \;\cup \\
\bigcup_{h \leq j,\, q \leq l} \left\{ s_1 + \Psi^A_{i,h} + \Psi^B_{k,q} + s_2 : s_1 \in S_{i+1,h-1;k+1,q-1},\; s_2 \in S_{h+1,j;q+1,l} \right\}
\end{array}
\right) \tag{6}
\]

Figure 2. Two dimensional scatter plot of the Pareto-optimal alignments of two given tRNA sequences, with the sequence score fseq on the horizontal and the structure score fstr on the vertical axis. We observe alignments featuring several different secondary structures. On the right is an alignment which shows the typical tRNA cloverleaf structure. On the left is another alignment having a different structure, which also is energetically stable. with the initializations Si,i

1;k,l

$$S_{i,i-1;k,l} = \{(\gamma \cdot (l - k + 1),\ 0)\} \qquad (7)$$

$$S_{i,j;k,k-1} = \{(\gamma \cdot (j - i + 1),\ 0)\} \qquad (8)$$

$$S_{i,i-1;k,k-1} = \{(0, 0)\}. \qquad (9)$$

The first two sets in Equation (6) stand for the insertion of gaps into the alignment. The third set matches nucleotides $A_i$ and $B_k$. The last term matches all possible base pairs $(A_i, A_h)$ to $(B_k, B_q)$. The union of these sets contains many alignments which are not Pareto-optimal, so the operator Pareto-max removes all elements that are dominated by another element. The base cases for the recursion are defined in Equations (7) to (9). For example, $S_{4,3;5,6} = \{(2\gamma, 0)\}$ (Equation (7)) means that the substring B[5..6] is matched with two gaps between A[3] and A[4], and therefore the gap penalty in the first component equals $2\gamma$, where $\gamma$ denotes the penalty for a single gap. As in the one-dimensional approach [8], the algorithm can be restricted to ignore biologically unlikely alignments in order to reduce the computational cost. In our implementation, we therefore limit the difference between the positions of the nucleotides of a match, $|i - k|$, and the difference between the spans of a matched base pair, $|(h - i) - (q - k)|$, to some constant. At the end of the computation, the set $S_{1,|A|;1,|B|}$ contains the Pareto-optimal scoring vectors of the RNAs A and B. To obtain the actual alignments, a backtracking procedure is used. Starting from each scoring vector $a \in S_{1,|A|;1,|B|}$, we use Equation (6) to simulate all steps which may have led to $a$. This is repeated recursively until one of the base cases from Equations (7) to (9) is encountered.

4. EXAMPLE

To illustrate the results from our approach, let A be the sequence of tRNA-Met from Sulfolobus islandicus and B the sequence of tRNA-Lys from Xenopus laevis mitochondrial DNA (see Figure 2). Both sequences were taken from the Rfam database [16]. We observe |A| = 74 and |B| = 75. First, the two BPPMs $P^A$ and $P^B$ and the corresponding base pair weights for A and B were calculated. By repeatedly applying Equations (6) to (9) we compute $S_{1,74;1,75}$, the set of 54 Pareto-maximal scoring vectors of the sequence-structure alignments of A and B, see Figure 2. Here, each point represents one alignment including its structure information. The horizontal axis stands for the sequence score, i.e. the sum of gap penalties and base match scores. The vertical axis stands for the structure score, that is the sum of the base pair weights of all matched base pairs $(A_i, A_j)$ and $(B_k, B_l)$. As described, we now use backtracking to obtain the actual sequence-structure alignments from the dynamic programming table S.

To show that there is more than one biologically relevant RNA structure in the set of Pareto-optimal alignments, we use RNAeval [17] to recalculate their minimum free energy. This is a measure for the stability of an RNA molecule. Let $\mathrm{mfe}_A(S)$ be the minimum free energy of the RNA molecule with sequence A and secondary structure S. On the right side of Figure 2, one alignment is highlighted which has the standard tRNA cloverleaf structure $S_1$, having $\mathrm{mfe}_A(S_1) = -23.0$ kcal/mol and $\mathrm{mfe}_B(S_1) = -12.2$ kcal/mol. On the left side, there is another alignment, which has a structure $S_2$ that is only slightly less stable, with $\mathrm{mfe}_A(S_2) = -16.2$ kcal/mol and $\mathrm{mfe}_B(S_2) = -10.1$ kcal/mol. The alignment $(R, S_2)$ contains more gaps than $(R, S_1)$; however, its structure score is significantly higher. An algorithm like pmcomp [8], which uses a single objective function, finds only one optimal solution, which in this case is the cloverleaf structure of $(R, S_1)$. Using our approach, one has the freedom to choose the best alignment.

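The Pareto-max operator used in the recursion can be illustrated with a small sketch. It assumes that both objectives (sequence score and structure score) are to be maximized and that a scoring vector is a plain pair of numbers; the function names `dominates` and `pareto_max` and the example scores are hypothetical and are not taken from the authors' implementation.

```python
from typing import Iterable, List, Tuple

# A scoring vector: (sequence score, structure score). Both are maximized.
Score = Tuple[float, float]


def dominates(a: Score, b: Score) -> bool:
    """Return True if a is at least as good as b in both objectives
    and differs from b (so it is strictly better in at least one)."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_max(candidates: Iterable[Score]) -> List[Score]:
    """Keep only the scoring vectors that no other candidate dominates."""
    unique = set(candidates)
    front = [s for s in unique if not any(dominates(t, s) for t in unique)]
    return sorted(front)


# Hypothetical union of candidate scoring vectors for one table cell.
cell = [(-4, 0), (-2, 1), (-3, 1), (0, 0.5), (-6, 3), (-6, 2)]
print(pareto_max(cell))
```

This quadratic filter is only meant to show the dominance test; a practical implementation would typically sort the candidates by one objective and sweep once over the other.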

5. DISCUSSION AND CONCLUSION

Even though many different ideas have been proposed, sequence-structure alignment is still an open problem [5]. Our approach does not give one optimal solution, but provides the human expert with a set of best alignments, and aids in choosing an optimal sequence-structure alignment. A comparable approach has been proposed by Taneda [15]. Here, a genetic algorithm (GA) which optimizes a set of pairwise sequence alignments with respect to two objective functions is used. The scoring function for the sequence similarity is the same as ours, whereas the structure score is defined as the sum of all base pairs, even mutually incompatible ones, which both RNAs can have. So one solution does not represent one secondary structure, but all secondary structures which are compatible to the alignment. In our algorithm, on the other hand, every solution has its own secondary structure, allowing two solutions to have the same sequence alignment, but with two different secondary structures. Even with the potential drawback of a higher computational demand, our algorithm is able to calculate the exact set of Pareto-optimal sequence-structure alignments regarding two given objective functions, where the GA only gives an approximation.

Apart from that, our results give a completely new perspective on the interplay between primary sequence and secondary structure of RNA molecules. Algorithms which use fixed parameters may be too inflexible to find the biologically most relevant solution. Furthermore, it can be speculated that there may not even be one single optimal solution, and, accordingly, no fixed weighting for an objective function. Consequently, our observations might even suggest that an RNA molecule can have more than one secondary structure, each one serving a different purpose.

6. REFERENCES

[1] D. Sankoff, “Simultaneous solution of the RNA folding, alignment and protosequence problems,” SIAM Journal on Applied Mathematics, vol. 45, no. 5, pp. 810–825, 1985.

[2] D. Latchman, Gene regulation: a eukaryotic perspective, Taylor & Francis, 2005.

[3] D.-F. Feng and R. Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic trees,” Journal of Molecular Evolution, vol. 25, pp. 351–360, 1987.

[4] M. Zuker and D. Sankoff, “RNA secondary structures and their prediction,” Bulletin of Mathematical Biology, vol. 46, pp. 591–621, 1984.

[5] P. P. Gardner, A. Wilm, and S. Washietl, “A benchmark of multiple sequence alignment programs upon structural RNAs,” Nucleic Acids Research, vol. 33, no. 8, pp. 2433–2439, 2005.

[6] J. Gorodkin, L. J. Heyer, and G. D. Stormo, “Finding the most significant common sequence and structure motifs in a set of RNA sequences,” Nucleic Acids Research, vol. 25, no. 18, pp. 3724–3732, 1997.

[7] D. H. Mathews and D. H. Turner, “Dynalign: an algorithm for finding the secondary structure common to two RNA sequences,” Journal of Molecular Biology, vol. 317, no. 2, pp. 191–203, 2002.

[8] I. L. Hofacker, S. H. F. Bernhart, and P. F. Stadler, “Alignment of RNA base pairing probability matrices,” Bioinformatics, vol. 20, no. 14, pp. 2222–2227, 2004.

[9] S. Will, K. Reiche, I. L. Hofacker, P. F. Stadler, and R. Backofen, “Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering,” PLoS Comput Biol, vol. 3, no. 4, pp. e65, 2007.

[10] J. S. McCaskill, “The equilibrium partition function and base pair binding probabilities for RNA secondary structure,” Biopolymers, vol. 29, no. 6-7, pp. 1105–1119, 1990.

[11] H. Laux, Entscheidungstheorie, Springer-Verlag, Berlin, Germany, 6th edition, 2005.

[12] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms, John Wiley & Sons, Inc., New York, NY, USA, 2001.

[13] M. Roytberg, M. Semionenkov, and O. Tabolina, “Pareto-optimal alignment of biological sequences,” Biophysics, vol. 44, no. 4, pp. 565–577, 1999.

[14] L. Paquete and J. P. O. Almeida, “Experiments with bicriteria sequence alignment,” in Cutting-Edge Research Topics on Multiple Criteria Decision Making, vol. 35 of Communications in Computer and Information Science, pp. 45–51. Springer Berlin Heidelberg, 2009.

[15] A. Taneda, “Multi-objective pairwise RNA sequence alignment,” Bioinformatics, vol. 26, no. 19, pp. 2383–2390, 2010.

[16] P. P. Gardner, J. Daub, J. Tate, B. L. Moore, I. H. Osuch, S. Griffiths-Jones, R. D. Finn, E. P. Nawrocki, D. L. Kolbe, S. R. Eddy, and A. Bateman, “Rfam: Wikipedia, clans and the “decimal” release,” Nucleic Acids Research, vol. 39, pp. D141–D145, 2011.

[17] I. L. Hofacker, W. Fontana, P. F. Stadler, S. L. Bonhoeffer, M. Tacker, and P. Schuster, “Fast folding and comparison of RNA secondary structures,” Monatsh. Chem., vol. 125, pp. 167–188, 1994.

[18] R. Klein and S. Eddy, “RSEARCH: Finding homologs of single structured RNA sequences,” BMC Bioinformatics, vol. 4, no. 44, 2003.

[19] M. I. Henig, “Vector-valued dynamic programming,” SIAM Journal on Control and Optimization, vol. 21, no. 3, pp. 490–499, 1983.


AN ADVANCED IMAGE PROCESSING APPROACH BASED ON PARALLEL GROWTH AND OVERLAP HANDLING TO QUANTIFY NEURITE GROWTH

Felix Schönenberger1,2, Anne Krug3, Marcel Leist3, Elisa Ferrando-May1,2, Dorit Merhof1,4

1 Interdisciplinary Center for Interactive Data Analysis, Modelling and Visual Exploration (INCIDE), University of Konstanz
2 Bioimaging Center (BIC), University of Konstanz
3 Doerenkamp-Zbinden Chair for in-vitro Toxicology and Biomedicine, University of Konstanz
4 Visual Computing, University of Konstanz
Email: [email protected]

ABSTRACT

Methods to assess neurite growth in populations of neuronal cells are required for many applications in biological image analysis. In response to this need, we have previously proposed an image processing framework to quantify the number of viable cells and the extent of neurite growth [1]. The approach is based on region growing and uses cost penalties and an upper cost limit to ensure that neurite outgrowth is not overestimated. However, these thresholds need to be defined manually and are set to a fixed value for the entire image. Also, the approach is not able to account for overlapping neurites in dense cell populations. For this reason, we propose two extensions to overcome these shortcomings. First, by growing all regions originating from individual cell nuclei simultaneously, the approach adapts to the underlying microscopy image and does not require manually defined cost limits. Second, an overlap handling is introduced, which is particularly valuable in dense cell populations with overlapping neurites. The results demonstrate that our advanced image processing approach generates results which are even closer to the manual ground truth.

1. INTRODUCTION

The quantitative and rapid analysis of large amounts of image data is one major bottleneck of high-throughput fluorescence microscopy assays. Existing tools provided by the manufacturers of automated microscopes and of integrated high content imaging systems usually do not deliver satisfying results due to the large variability of the tested biological systems. For this reason, there is a need for custom tools optimized for the analysis of specific microscopy endpoints.

Neurite outgrowth is a hallmark phenotype for de- and regeneration in the nervous system. Due to its robustness and sensitivity it is an endpoint of choice for the in vitro testing of toxic chemicals, in particular those suspected to affect neuronal development. For example, of the 3000 high production volume chemicals (chemicals produced or imported into the United States at or above one million pounds per year), nearly half have no basic toxicity data available [2], and only 7% have a complete set of toxicity data, including developmental toxicity. In the absence of data, the risk of developmental neurotoxicity for these chemicals is unknown, but it is estimated to be high [3]. Accordingly, there is an increased public concern that exposure to chemicals in the environment may be partially responsible for the increased number of cases of neurological disorders in children and adults.

Lund human mesencephalic (LUHMES) cells are a fetal human mesencephalic cell line which has been established as a general human neuronal cell model [4]. They are easy to handle and differentiate in vitro into highly homogeneous cultures of mature dopaminergic neurons. LUHMES cells are thus particularly well suited for large scale testing of neurotoxicants in single-cell based assays.

In previous work, we have established a high-throughput live cell imaging system for identifying neurotoxic agents [5], which enables quantifying the overall neurite mass as well as cell viability in differentiated LUHMES cultures. We presented an image processing framework to compute the number of viable cells with and without neurite growth in these images [1]. This approach is based on region growing performed for one cell at a time. In order to make sure that neurite outgrowth is not overestimated, as shown in Figure 1 (left), cost penalties and an upper cost limit are employed to ensure that the region growing maintains a minimum distance to other cell nuclei. However, these thresholds need to be defined manually and are set to a fixed value for the entire image. Also, the approach was not able to account for overlapping neurites.

In this work, we propose an advanced approach which comprises two extensions to overcome the aforementioned shortcomings. By growing all regions originating from individual cell nuclei simultaneously, as shown in Figure 1 (right), the approach adapts to the underlying microscopy image and does not require manually defined cost limits. Furthermore, an overlap handling is introduced, which is particularly valuable in dense cell populations with overlapping neurites.

The microscopy images used in this work were acquired in a toxicity study of U0126, which has previously been shown to influence neurite outgrowth: U0126 is a potent inhibitor of the mitogen-activated protein kinase (MAPK) pathway, which is involved in regulating neurite growth. MAPKs have been implicated in a variety of cellular functions, including neuronal differentiation [6]. In this work, only non-invasive labelling and detection methods are applied. The neuronal cells are grown at high density to allow extensive networking, which results in microscopy images that are challenging to process from an image analysis point of view.

In order to investigate whether the considered compound affects neurite outgrowth at the single cell level, an image processing system is needed that counts viable cells with and without neurite growth. The presented image processing framework allows quantifying the number of cells with extensions longer than one cell diameter (defined here as neurites) and outputs the counts of neuronal cells with and without neurites. Compared to [1], this advanced approach does not require manually defined cost limits for region growing and is also able to account for cells with overlapping neurites, which is particularly important for cell populations grown at high density. Results show that the proposed advanced approach follows the manual ground truth more closely and is therefore preferred by the biologists.

Figure 1. Red: Grown region. Green: Path with maximum length. Left: Neurite growth according to [1] grows one region at a time. Cost penalties stop growth when approaching other nuclei. Right: Proposed algorithm, grows all nuclei simultaneously, no cost penalties needed.

2. MATERIAL AND METHODS

2.1. Image data

Microscopy images of LUHMES cells were acquired on an Assay-Scan II High Content Screening (HCS) System, Cellomics. In order to clearly identify the nuclei, neuronal cells were stained with the DNA dye H-33342. Imaging of the cell shape region (cytoplasm), including the cell body (soma) and its extensions (neurites), was performed using the vital dye calcein-AM. Since dead cells cannot accumulate and retain this dye in their cytoplasm, the calcein channel is also used to exclude dead cells from further analysis. Microscopy images were acquired for cell populations which had been treated with control medium or increasing concentrations of the test compound. Partly visible neuronal cells at the border of the microscopy image are automatically detected and labelled by the screening system and excluded from subsequent analysis.

2.2. Software framework

The presented approach was developed using the software platform KNIME (The Konstanz Information Miner [7]), which is an open-source tool for data integration, processing, analysis and exploration. Essentially, KNIME workflows consist of interacting nodes which exchange data via data tables that are passed from one node to another according to their connections. The graphical user interface makes it possible to construct workflows consisting of different nodes and their interconnections via a simple drag-and-drop mechanism. The advanced image processing workflow presented in this work consists of several custom KNIME nodes that extend the workflow presented in [1].

2.3. Quantification of neurite growth

The approach for quantifying the number of cells with neurites presented in [1] as well as the advanced method introduced in this work comprise two general steps: First, the nuclei of the neuronal cells are segmented from the H-33342 images, and for each neuronal cell all pixels belonging to the nucleus are classified as seed points. Secondly, the cytoplasm region of viable cells is grown from these seed points to expand the initial contour of the nucleus outwards. For each pixel added to the expanding region, the length of the path to the initial boundary of the nucleus is computed and stored. The approach presented in this work has two major extensions, which are outlined in the following.

Parallel growth: The approach operates in the style of the Dijkstra algorithm [8], which is commonly used in computer science for different types of search problems. It builds up a graph with nodes (corresponding to pixels) and edges, where the edges are assigned a local cost corresponding to the Euclidean distance between pixel centers plus the normalized inverted intensity. For performing the search, the algorithm maintains two lists, an open list comprising all pixels currently under consideration and a closed list containing pixels that have already been visited. In the beginning, the open list comprises all pixels at the border of every cell nucleus and the closed list is empty. Each node c stores the accumulated cost required to travel along the path to the respective node, the accumulated path length, the previous node and the nucleus from which the node was reached. In each iteration, the algorithm selects the node with lowest accumulated cost from the open list, adds all neighbor pixels n with an intensity value above the cytoplasm intensity threshold t_cyto to the open list and moves the selected node to the closed list. If the path length exceeds the length threshold l_min, the corresponding nucleus is marked as grown. These processing steps continue until the open list is empty.

Overlap handling: Although parallel growth eliminates the need for a cost penalty, it is too restrictive in case of cell clusters with overlapping neurites. The proposed overlap handling softens the region border and introduces a permeable area where regions from multiple nuclei can overlap. Algorithm 1 shows the pseudocode of the overlap handling. The algorithm maintains a global open list O. Each individual node c on the open list keeps track of the accumulated distance, the cost and which nuclei reached and closed the node. In each iteration, the algorithm selects the node with lowest accumulated cost from the open list, adds all neighbor pixels n with an intensity value above t_cyto to the open list and marks the current node as partly closed. If a node is only partly closed, it can be reopened for another nucleus. The permeable area is updated by backtracking and completely closing all nodes with a distance to the open list greater than l_offset. This permeable area makes it possible for neurites to overlap to a certain, user-defined extent. These extensions narrow down the parameter set to the neurite length threshold l_min, the intensity threshold t_cyto and the permeable area offset l_offset.
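The parallel growth step described above can be sketched as a multi-source Dijkstra search over the pixel grid. The sketch is a simplified illustration under several assumptions, not the KNIME node itself: intensities are assumed to be normalized to [0, 1], the inverted intensity is taken from the neighbor pixel, an 8-neighborhood is used, and the overlap handling is omitted. The names `grow_regions`, `seeds`, `t_cyto` and `l_min` are hypothetical.

```python
import heapq
import math


def grow_regions(intensity, seeds, t_cyto, l_min):
    """Grow all nuclei simultaneously.

    intensity: 2D list with values in [0, 1] (assumed normalization).
    seeds: dict mapping nucleus id -> list of (row, col) border pixels.
    Returns (owner, grown): pixel -> nucleus id, and the set of nuclei
    whose longest path exceeds l_min.
    """
    rows, cols = len(intensity), len(intensity[0])
    owner = {}   # closed list: pixel -> nucleus that reached it first
    best = {}    # cheapest known cost per pixel on the open list
    heap = []    # open list ordered by accumulated cost
    for nuc, pixels in seeds.items():
        for p in pixels:
            best[p] = 0.0
            heapq.heappush(heap, (0.0, 0.0, p, nuc))
    grown = set()
    while heap:
        cost, dist, (r, c), nuc = heapq.heappop(heap)
        if (r, c) in owner:
            continue  # stale entry, pixel was closed via a cheaper path
        owner[(r, c)] = nuc
        if dist > l_min:
            grown.add(nuc)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                n = (r + dr, c + dc)
                if n == (r, c) or not (0 <= n[0] < rows and 0 <= n[1] < cols):
                    continue
                value = intensity[n[0]][n[1]]
                if n in owner or value <= t_cyto:
                    continue
                step = math.hypot(dr, dc)  # distance between pixel centers
                new_cost = cost + step + (1.0 - value)
                if new_cost < best.get(n, math.inf):
                    best[n] = new_cost
                    heapq.heappush(heap, (new_cost, dist + step, n, nuc))
    return owner, grown


# Two nuclei at the ends of a bright one-pixel-high strip.
image = [[1.0] * 10]
owner, grown = grow_regions(image, {1: [(0, 0)], 2: [(0, 9)]}, 0.5, 3)
print(owner[(0, 4)], owner[(0, 5)], sorted(grown))
```

Because every nucleus starts on the same open list, each pixel is claimed by whichever nucleus reaches it at the lowest accumulated cost, so the regions stop at each other without a manually chosen cost limit.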

Figure 2. Decreasing neurite outgrowth for U0126. Purple: Manual ground truth. Blue: Automated method [1]. Black: Proposed advanced method.

3. RESULTS

The image processing framework was applied to microscopy images (two-channel images, 512×512 pixels per channel) of LUHMES human neuronal precursor cells treated with U0126. On a PC with an Intel Xeon W5130, 2 GHz, 4 GB RAM, image analysis took 2 minutes, 27 seconds for ten concentration levels and three microscopy images per level. In Figure 2, a direct comparison of the automated approach proposed in [1] (blue) and the advanced method presented in this work (black) is provided. Both approaches are compared against a manual ground truth (purple) for the chemical U0126. The x-axis denotes an increasing concentration of the chemical, the y-axis the decrease of neurite outgrowth. The curve obtained from the advanced method follows the manual ground truth more closely. This is also reflected by the difference between the manual evaluation and either method at each of the ten measured concentration levels: the difference between the mean values of the manual evaluation and the automated analysis [1] amounts to an average value of 0.185, compared to 0.081 for the proposed advanced approach.

In Figure 3, a visual result of the implemented method is provided. Red nuclei indicate dead cells according to the calcein staining. Green corresponds to cell nuclei with neurite growth, and grey to nuclei without neurite growth according to the analysis. For each cell with neurite growth, the maximum path length between the border of the nucleus and the border of the cytoplasm is indicated by a green line. Pixels that have been visited by the algorithm during front propagation are denoted by dark green. The maximum extension of each region grown from the outline of its nucleus is outlined by a red rim, which may, however, disappear in case of overlapping neurites.

while O ≠ ∅ do
    c ← removeBest(O)
    if c.dist > l_min then State[c.nuc] ← “grown”
    foreach n ∈ neighbors(c) do
        if c.nuc ∈ n.Closed then continue
        if n.intensity < t_cyto then
            backtrackCloseCompletely(c, 0)
            continue
        if isClosedCompletely(n.nuc) then continue
        cost ← c.cost + |c, n| + normalizedInvertedIntensity(c.intensity)
        dist ← c.dist + |c, n|
        if n.Closed ≠ ∅ then
            if dist + n.dist > l_min then
                State[c.nuc] ← “grown”
                State[n.nuc] ← “grown”
        if n ∈ O then
            if n.cost > cost then
                n.cost, n.dist ← cost, dist
                n.nuc, n.Path[c.nuc] ← c.nuc, c
        else
            n.cost, n.dist ← cost, dist
            n.nuc, n.Path[c.nuc] ← c.nuc, c
            O ← O ∪ {n}
    backtrackCloseCompletely(c, l_offset)
    c.Closed ← c.Closed ∪ {c.nuc}

Algorithm 1: A node c holds intensity c.intensity, nucleus c.nuc, cost c.cost, distance c.dist, parent nodes c.Path. |c, n| denotes the distance between the pixel centers of c and n. The intensity term in the cost is the normalized inverted intensity, and l_offset controls the local offset of the permeable area. Lower case variables hold scalar values, upper case variables refer to lists.

4. DISCUSSION

The development of automated methods to assess the effects of treatment with toxic substances is challenging, particularly at low concentrations, which define the sensitivity of the toxicity test. The neurons naturally tend to form clusters, which makes it difficult to detect the length of single neurites.

The result obtained by the proposed algorithm is a binary decision whether a cell exhibits neurite growth or not. According to Figure 2, the result is very close to the human ground truth, even for low concentrations of the neurotoxicant, and also reaches the same absolute extent of neurite growth inhibition for the maximum concentration of U0126.

Figure 3. Neuronal cells without (left) and with (right) neurite growth. Original images (upper row) and image analysis result (lower row). Red nuclei correspond to dead cells, green nuclei to cells with neurite growth, and grey nuclei to viable cells without neurite growth. In case of neurite growth, the maximum distance between nucleus and boundary of the cytoplasm is indicated by a green line. Pixels that have been visited by the algorithm are denoted by dark green. The red rim delineates the boundary of the maximally dilated regions after growing from individual nuclei. In case of overlapping neurites this boundary disappears.

5. CONCLUSION

In this work, an advanced system is presented that outputs the counts of neuronal cells with and without neurites in LUHMES human neuronal precursor cells. The results demonstrate that our advanced image processing approach can reliably quantify chemical effects on initial neurite outgrowth. Compared to the manual ground truth, improved results were obtained with the advanced approach presented in this work. Identifying chemicals that act as developmental neurotoxicants is a major challenge in current research. Computational tools that facilitate the extraction of quantitative data from respective experiments are therefore of great interest to the biology community.

6. REFERENCES

[1] D. Merhof, F. Schönenberger, N. Stiegler, A. Krug, T. Rieß, C. Karreman, M. Leist, and E. Ferrando-May, “Automated image processing to quantify neurite growth in LUHMES human neuronal precursor cells,” in Proc. 8th International Workshop on Computational Systems Biology (WCSB), 2011, pp. 133–136.

[2] U.S. Environmental Protection Agency (EPA), “Chemical hazard data availability study: What do we really know about the safety of high production volume chemicals?,” in Washington, DC: Office of Pollution Prevention and Toxics, 1998.

[3] P. Grandjean and P. Landrigan, “Developmental neurotoxicity of industrial chemicals,” Lancet, vol. 368, no. 9553, pp. 2167–2178, 2006.

[4] D. Scholz, D. Pöltl, A. Genewsky, M. Weng, T. Waldmann, S. Schildknecht, and M. Leist, “Rapid, complete and large-scale generation of post-mitotic neurons from the human LUHMES cell line,” J Neurochem., vol. 119, no. 5, pp. 957–971, 2011.

[5] N. Stiegler, A. Krug, F. Matt, and M. Leist, “Assessment of chemical-induced impairment of human neurite outgrowth by multiparametric live cell imaging in high-density cultures,” Toxicol Sci., vol. 121, no. 1, pp. 73–87, 2011.

[6] T. Lewis, P. Shapiro, and N. Ahn, “Signal transduction through MAP kinase cascades,” Adv Cancer Res., vol. 74, pp. 49–139, 1998.

[7] M. Berthold, N. Cebron, F. Dill, T. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel, “KNIME: The Konstanz Information Miner,” in Proc. Data Analysis, Machine Learning and Applications, 2008, pp. 319–326.

[8] E. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, pp. 269–271, 1959.


SET BASED UNCERTAINTY ANALYSIS AND PARAMETER ESTIMATION OF BIOLOGICAL NETWORKS WITH THE BIOSDP TOOLBOX

Steffen Waldherr, Jan Hasenauer, and Frank Allgöwer

Institute for Systems Theory and Automatic Control, University of Stuttgart
Pfaffenwaldring 9, 70550 Stuttgart, Germany
{waldherr, hasenauer, allgower}@ist.uni-stuttgart.de

ABSTRACT

We present the bioSDP toolbox for Matlab, which provides methods for the analysis of biological networks with parametric uncertainty. Its set based approach makes it possible to compute guaranteed bounds on the network behaviour under uncertainty, or on the parameter values consistent with uncertain measurement data. These two applications of the toolbox are illustrated with two case studies for specific biological networks.

1. INTRODUCTION

The main goal of systems biology is a quantitative description of cellular processes. Unfortunately, achieving this is rather difficult, as models of biological systems are subject to a significant degree of uncertainty. This uncertainty arises from limited knowledge about the system, mostly due to limitations in the experimental technology, and/or large variations in environmental and internal boundary conditions. These are manifested by the large parametric uncertainty shown by most models of biochemical networks. Drawing reliable conclusions about various system properties is a major computational challenge under such uncertainty.

We present the Matlab toolbox bioSDP, which provides methods to analyse biological networks modeled by ordinary differential equations with polynomial and rational terms. It makes it possible to study the variation in steady states of a network under parametric uncertainty, and to upper-bound the size of the remaining parametric uncertainty from noise corrupted measurement data. In this way, guaranteed predictions about important system properties can be derived despite the uncertainty in the input data. The bioSDP toolbox is available under an open source license from [1].

2. THEORETICAL BACKGROUND

In this section, we present the steady state uncertainty analysis problem, the parameter estimation problem and the algorithms used to solve them.

2.1. The steady state uncertainty analysis problem

The steady state uncertainty analysis problem [2, 3] is defined by a system of polynomial or rational equalities

$$F(x, p) = 0, \qquad (1)$$

where $x \in \mathbb{R}^n$ is the network’s state vector (e.g. protein or metabolite concentrations), $p \in \mathbb{R}^m$ is the parameter vector, and $F$ is a vector of polynomial functions in $x$ and $p$, typically n-dimensional, describing the network’s steady state conditions. The values of the parameters $p$ in (1) are uncertain in that only a bounding box for the parameters is given, but not the exact values. This box is defined by element-wise inequalities, yielding the set

$$\mathcal{P} = \{ p \in \mathbb{R}^m \mid \check{p} \le p \le \hat{p} \}, \qquad (2)$$

where $\check{p}$ and $\hat{p}$ are element-wise lower and upper bounds on the parameter vector $p$. The task in the steady state uncertainty analysis problem is to compute a tight outer approximation $\hat{\mathcal{X}}$ for the set $\mathcal{X}$ of all feasible solutions $x$ of (1), i.e.,

$$\hat{\mathcal{X}} \supseteq \mathcal{X} = \{ x \in \mathbb{R}^n \mid \exists p \in \mathcal{P} : F(x, p) = 0 \}. \qquad (3)$$

In the algorithm implemented in the bioSDP toolbox, the approximation $\hat{\mathcal{X}}$ is either computed as one big bounding box for all feasible steady states, or as the union of many small boxes, depending on the user’s choice. The second option generally yields a tighter approximation at the expense of a higher computational effort.

2.2. The parameter estimation problem

In the parameter estimation problem [4, 5], we consider a dynamic network defined by the difference equation

$$x_k = F(x_{k-1}, p), \qquad y_k = H(x_k, p), \qquad (4)$$

where $x_k \in \mathbb{R}^n$, $p \in \mathbb{R}^m$, and $F$ are as in Section 2.1, and $y_k \in \mathbb{R}^q$ is a vector of measurements, which depends on the network’s state via the measurement function $H$. The index $k = 1, \ldots, N$ denotes the discrete time steps. We assume that both $F$ and $H$ are polynomial or rational functions in both the state $x$ and the parameter $p$. Uncertain measurement data are given as a bounding box $\mathcal{Y}_k$ on the output vector $y_k$ for each time point $k$,

$$\mathcal{Y}_k = \{ y \in \mathbb{R}^q \mid \check{y}_k \le y \le \hat{y}_k \}, \qquad k = 1, \ldots, N. \qquad (5)$$

Also, a bounding box on the states $x_k$ is given as

$$\mathcal{X} = \{ x_k \in \mathbb{R}^n \mid \check{x} \le x_k \le \hat{x} \}. \qquad (6)$$

A parameter p is called consistent, if there exist a sequence of states xk 2 X and a sequence of outputs yk 2 Yk with k = 1, . . . , N satisfying (4). The goal in the parameter estimation problem is to compute a tight outer b of the set of consistent parameters P: approximation P b P

P = p 2 Rm | p is consistent .

(7)

As in the steady state uncertainty analysis problem, the algorithm implemented in bioSDP can compute an outer approximation $\hat{P}$ either as one big bounding box or as the union of many smaller boxes.

2.3. Iterative set exclusion with an infeasibility test

The algorithms implemented in bioSDP compute the outer approximations $\hat{X}$ and $\hat{P}$ with an iterative set exclusion approach. As a priori information, an initial estimate $\hat{X}_0$ or $\hat{P}_0$ is required. The basis of the set exclusion approach is an infeasibility test, which applies to a system of polynomial equalities and a box constraint of the form

$$G(\xi) = 0, \qquad \check{\xi} \le \xi \le \hat{\xi}, \tag{8}$$

where $\xi$ is a vector of uncertain variables, e.g., the state $x$ and the parameters $p$ for the steady state uncertainty analysis problem, or the state and output sequences $x_k$, $y_k$ and parameters $p$ for the parameter estimation problem. $G$ is a polynomial vector-valued function representing the equality constraints. In each subsequent iteration step, the approximation is refined by excluding subsets which pass an infeasibility test. In iteration $i$, the algorithm generates an appropriate list of test sets $\tilde{X}_{i,1}, \dots, \tilde{X}_{i,r}$ (or $\tilde{P}_{i,1}, \dots, \tilde{P}_{i,r}$ for the parameter estimation problem), and applies the infeasibility test to each test set. Those test sets which pass the infeasibility test are then united to form the exclusion set $\tilde{X}_i$, and the refined approximation for the next iteration is obtained as

$$\hat{X}_{i+1} = \hat{X}_i \setminus \tilde{X}_i. \tag{9}$$

In each iteration, the test sets are reduced in size, and the algorithm terminates when the size drops beneath a predefined threshold. The infeasibility tests are solved by a quadratic reformulation of the problem (8) and semidefinite programming [6]. Details on the construction of (8) and the infeasibility test are given in [2, 3] for the steady state uncertainty analysis, and in [4] for the parameter estimation.

3. STRUCTURE OF THE bioSDP TOOLBOX

3.1. Analysis tasks

The main functionality of bioSDP is to offer methods for solving the steady state uncertainty analysis problem from Section 2.1, and the parameter estimation problem from Section 2.2. The main bioSDP routine for steady state uncertainty analysis is stationary uncertainty. This routine takes the problem setup specified in Matlab structure variables called system and uncertainty, and computes an outer approximation of the set of feasible steady states. The system variable contains the definition of the model variables and equations, while the uncertainty variable describes the parameter range to consider as well as the a priori state bounds. The behaviour of the uncertainty analysis function is further controlled by several options, which are passed to the function as an additional structure variable, here called options. A required option is set exclusion.method, which specifies how regions in state space are removed from the feasible set. The default choice is 'box shrinkage', which only tries to reduce the size of the initial box as far as possible. A more refined uncertainty set is obtained with 'bisection', in which the uncertainty set is computed by multi-dimensional bisection. However, the bisection method is only recommended for state spaces of dimension up to three, because the computational effort increases significantly in higher dimensions. The steady state uncertainty analysis is then performed by a call to stationary uncertainty, passing the problem setup in the variables system and uncertainty as well as the algorithm options as arguments.

The parameter estimation problem is handled by the function parameter estimation. It also takes two problem definition variables as arguments. The first one, system, contains the model definition, and the second one, uncertainty, contains the measurement bounds as well as the a priori state and parameter bounds. As in the steady state uncertainty analysis, either 'box shrinkage' or 'bisection' can be chosen as set exclusion method. In addition to computing outer bounds by set exclusion, bioSDP offers methods to compute samples of feasible steady states for the uncertainty analysis or of consistent parameters for the parameter estimation. These can, for example, be used to evaluate the tightness of the computed outer approximations.

3.2. Generic SDP problem solver

At the core of the bioSDP toolbox is an optimization algorithm which solves set exclusion problems as discussed in Section 2.3 with semidefinite programming methods. This algorithm performs the iteration required for the set exclusion, constructs the appropriate semidefinite programs for the infeasibility tests, and refines the bounding sets based on the solutions to the semidefinite programs. The semidefinite programs are not solved by bioSDP itself, but are handed to specialised optimisation toolboxes, for example SeDuMi [7]. In short, the algorithm takes a system of constraints as defined in (8), together with an initial bounding box on the uncertain variables $\xi$. The vector $\xi$ is thereby structured into variables where the uncertainty remains fixed to the initial bounds (e.g., the parameter bounds in the steady state uncertainty analysis problem), and variables where the uncertainty is to be reduced as much as possible, while ensuring that all feasible solutions to (8) are retained in the set (e.g., the steady state bounds for the steady state uncertainty analysis problem). The algorithm then iteratively prunes subsets of this initial uncertainty set, based on results from the infeasibility tests.
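The pruning iteration can be illustrated with a minimal Python sketch of box shrinkage in one state dimension. This is not bioSDP code: the toy steady state condition G(x, p) = p - x^2 = 0, the function names, and the tolerance are assumptions made for illustration, and the infeasibility test uses plain interval arithmetic in place of the semidefinite programming relaxation used by the toolbox.

```python
def g_range(x_lo, x_hi, p_lo, p_hi):
    """Interval enclosure of the toy constraint G(x, p) = p - x**2 (needs x_lo >= 0)."""
    return p_lo - x_hi ** 2, p_hi - x_lo ** 2


def is_infeasible(x_lo, x_hi, p_lo, p_hi):
    """True if G(x, p) = 0 has no solution in the box (0 lies outside the enclosure)."""
    g_lo, g_hi = g_range(x_lo, x_hi, p_lo, p_hi)
    return g_lo > 0.0 or g_hi < 0.0


def box_shrinkage(x_lo, x_hi, p_lo, p_hi, tol=1e-6):
    """Shrink the state interval by cutting infeasible slabs from both ends.

    The parameter interval [p_lo, p_hi] stays fixed; only the state bounds
    are reduced. The slab width is halved whenever no slab can be removed.
    """
    width = (x_hi - x_lo) / 2.0
    while width > tol and x_hi > x_lo:
        removed = False
        slab = min(width, x_hi - x_lo)
        if is_infeasible(x_lo, x_lo + slab, p_lo, p_hi):
            x_lo += slab  # exclude the slab at the lower end
            removed = True
        slab = min(width, x_hi - x_lo)
        if slab > 0.0 and is_infeasible(x_hi - slab, x_hi, p_lo, p_hi):
            x_hi -= slab  # exclude the slab at the upper end
            removed = True
        if not removed:
            width /= 2.0  # refine the test sets
    return x_lo, x_hi


# A priori state bound [0, 10], parameter uncertainty p in [1, 4].
lo, hi = box_shrinkage(0.0, 10.0, p_lo=1.0, p_hi=4.0)
print(f"outer bound on the steady states: [{lo:.6f}, {hi:.6f}]")
```

For this toy problem the exact set of steady states is the interval [1, 2], so the computed outer bound can be compared directly with the true set.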

Table 1. Nominal parameter values for the insulin pathway model (10).
k1 k0 k2 kR k3 km3 ins 0.05 10 6 1.0 0.5 1.0 30 1.0

In typical use cases, this generic algorithm need not be called directly by the user. Instead, the wrapper methods for individual analysis tasks as described in Section 3.1 should be used. These methods take care of constructing the constraints (8) and the vector of uncertain variables $\xi$ as appropriate for the specific task from the problem-specific user input data.

3.3. Visualisation routines

The bioSDP toolbox offers two main possibilities for visualising the set based estimates resulting from the implemented analysis algorithms: box plots for two- and three-dimensional state or parameter spaces, and parallel coordinates plots for problems of any dimension. The output of the analysis functions described in Section 3.1 can directly be passed to the bioSDP routine visualize uncertainty for this purpose.

4. EXAMPLE APPLICATIONS

We present the two main applications of the bioSDP toolbox, steady state uncertainty analysis and parameter estimation, with two exemplary studies. In the interest of brevity, we discuss for each example only the problem setup, some key points of their implementation in the bioSDP toolbox, and significant conclusions drawn from the analysis. For the full implementation of both examples, we refer the reader to the toolbox's software package available from [1], which contains both examples implemented as Matlab scripts.

4.1. Uncertainty analysis of an insulin pathway model

The steady state uncertainty analysis is exemplified with a simple insulin pathway model taken from [8]. The model equations are given as

$$\begin{aligned}
\dot{\mathrm{IR}} &= -k_0\,\mathrm{IR} - k_1\,\mathrm{ins}\,\mathrm{IR} + k_R\,(\mathrm{IR_{tot}} - \mathrm{IR} - \mathrm{IRP}) \\
\dot{\mathrm{IRP}} &= k_0\,\mathrm{IR} + k_1\,\mathrm{ins}\,\mathrm{IR} - k_2\,\mathrm{IRP} \\
\dot{\mathrm{IRSP}} &= k_3\,\mathrm{IRP}\,(\mathrm{IRS_{tot}} - \mathrm{IRSP}) - k_{m3}\,\mathrm{IRSP},
\end{aligned} \tag{10}$$

with IR, IRP, and IRSP the concentrations of insulin receptor, phosphorylated insulin receptor, and phosphorylated insulin receptor substrate, respectively. The total protein amounts are conserved and given by IRtot = 10 and IRStot = 10. Note that the conservation relations can directly be used to compute an a priori outer bound $\hat{X}_0$ independent of any parameter values. For the steady state uncertainty analysis problem, we assume that all of the model parameters may vary by a factor of 2 around their nominal values, which are given in Table 1. Using both the multi-dimensional bisection method and the simpler box shrinkage, we let bioSDP compute an outer approximation to the set of feasible steady states under this uncertainty. The resulting outer estimate $\hat{X}$ on the feasible steady states is shown in Figure 1, together with some sampled steady states. These plots are generated directly by bioSDP's built-in visualisation routines discussed in Section 3.3. The comparison between the outer bounds obtained with the set exclusion methods and the sampled steady states shows that the outer bounds are reasonably tight. The computation time on a standard desktop computer was about 50 seconds for the box shrinkage method and 140 seconds for the bisection method. Both methods achieved similar bounds in this example. In the parallel coordinates plot, the thinning waist between IR and IRP, from the bounds obtained by bisection, indicates a slight negative correlation between these two variables. This correlation is also seen more explicitly in the three-dimensional box plot.

Figure 1. Results for the steady state uncertainty analysis of the insulin pathway model (10). Top: Outer bounds to the set of feasible steady states in a box plot. Bottom: Outer bounds to the set of feasible steady states in a parallel coordinates plot (concentration over the variables IR, IRP, IRSP) together with sampled steady states. Feasible intervals from the box shrinkage algorithm are shown as light gray, and from the bisection algorithm as dark gray.

4.2. Parameter estimation for a reversible modification reaction

The set based parameter estimation with bioSDP is illustrated with a very simplistic biological model of a reversible modification reaction. The time-continuous model for this reaction is given by the scalar differential equation

$$\dot{x} = -k_m x + k_d (1 - x), \tag{11}$$

where $x$ is the concentration of the unmodified variant of the considered molecular species measured relative to its total concentration, and $k_m$ and $k_d$ are unknown parameters to be estimated. The parameter estimation is done for artificial measurement data for $x$ from five time points which are 0.1 time units apart. The measurement is uncertain in that only upper and lower bounds are available. For the purpose of this example, we set $\check{x} = (0.0, 0.44, 0.57, 0.6, 0.56)$ as lower bound and $\hat{x} = (0.1, 0.54, 0.67, 0.70, 0.66)$ as upper bound. First, we transform the differential equation model (11) to a difference equation using the Euler-forward discretisation scheme. In bioSDP, this is simply done by calling the auxiliary function discretize ode, passing the continuous model (11), its state variable $x$, the length of the time step (here 0.1) and the number of steps to take (here 4) as arguments. bioSDP then automatically generates the discrete equations (4) and the internally required sequence variables $x_k$. In the next step, the set based parameter estimation is carried out by a call to the parameter estimation function. The results from the set based analysis are complemented by samples from the set of consistent parameters obtained through Monte Carlo sampling. Result plots from bioSDP's visualisation routines are shown in Figure 2. From the parallel coordinates plot, we observe that there seems to be a correlation in the consistent parameters: from the sampled parameter vectors, either both elements are low, or both are high within the respective intervals. Using the box plot with the 'bisection' set exclusion method, this observation is confirmed by the resulting outer bound on the set of consistent parameters.

Figure 2. Results for the parameter estimation of the reversible modification reaction (11). Top: Outer bounds to the set of consistent parameters in a box plot. Bottom: Outer bounds to the set of consistent parameters in a parallel coordinates plot (parameter value over the variables km, kd) together with sampled parameters.

5. CONCLUSIONS

We have presented the bioSDP toolbox for Matlab. bioSDP provides methods for steady state uncertainty analysis and parameter estimation from uncertain measurement data in biological networks. While the examples presented here only involve small networks, the methods are also applicable to medium-sized networks. For example, a case study for a rather detailed model of tumor necrosis factor signalling is presented in [9].

6. ACKNOWLEDGEMENTS

We thank Ganzhou Wang and Shen Zeng for their help in the implementation of the bioSDP toolbox, Pelle Lundberg and Gunnar Cedersund for testing, and Julian Heinrich for helpful comments on the parallel coordinate plots.

7. REFERENCES

[1] "The biosdp toolbox," http://biosdp.sourceforge.net, 2012, Accessed 26th March 2012.
[2] S. Waldherr, R. Findeisen, and F. Allgöwer, "Global sensitivity analysis of biochemical reaction networks via semidefinite programming," in Proc. of the 17th IFAC World Congress, Seoul, Korea, 2008, pp. 9701–06.
[3] J. Hasenauer, P. Rumschinski, S. Waldherr, S. Borchers, F. Allgöwer, and R. Findeisen, "Guaranteed steady state bounds for uncertain (bio-)chemical processes using infeasibility certificates," J. Process Control, vol. 20, no. 9, pp. 1076–1083, 2010.
[4] J. Hasenauer, S. Waldherr, K. Wagner, and F. Allgöwer, "Parameter identification, experimental design and model falsification for biological network models using semidefinite programming," IET Syst. Biol., vol. 4, no. 2, pp. 119–130, 2010.
[5] Ph. Rumschinski, S. Borchers, S. Bosio, R. Weismantel, and R. Findeisen, "Set-base dynamical parameter estimation and model invalidation for biochemical reaction networks," BMC Syst. Biol., vol. 4, 69, 2010.
[6] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996.
[7] J. F. Sturm, "Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones," Optim. Meth. Softw., vol. 11, no. 1, pp. 625–653, 1999.
[8] C. Brännmark, R. Palmer, S.T. Glad, G. Cedersund, and P. Strålfors, "Mass and information feedbacks through receptor endocytosis govern insulin signaling as revealed using a parameter-free modeling framework," J. Biol. Chem., vol. 285, no. 26, pp. 20171–20179, 2010.
[9] S. Waldherr, J. Hasenauer, M. Doszczak, P. Scheurich, and F. Allgöwer, "Global uncertainty analysis for a model of TNF-induced NF-κB signalling," in Advances in the Theory of Control, Signals and Systems with Physical Modeling, Jean Lévine and Philippe Müllhaupt, Eds., vol. 407 of LNCIS, pp. 365–377. Springer Verlag, Berlin Heidelberg, 2010.

ABSTRACTS

COMPARATIVE STUDY OF MICRORNA PATHWAY-ANALYSIS METHODOLOGIES IN COLORECTAL CANCER

Teresa Creanza1, Vania Liuzzi2, Rosalia Maglietta2, Roberto Anglani2, Patrizia Stifanelli2, Ada Piepoli3, Sayan Mukherjee4, Francesco Schena1, and Nicola Ancona2

1 University of Bari, Bari, Italy
2 Institute of Intelligent Systems for Automation - National Research Council, Bari, Italy
3 IRCCS, Casa Sollievo della Sofferenza, San Giovanni Rotondo, Italy
4 Institute for Genome Science and Policy, Duke University, USA

microRNAs (miRNAs) are short non-coding RNAs involved in the post-transcriptional control of protein levels. They guide the RNA-induced silencing complex to target sites located in the 3'UTR of mRNAs, with the effect of cleaving them or repressing their translation. It is well known that miRNAs can contribute to cancer development and progression and are differentially expressed between normal and neoplastic tissues, but their functional role in tumors is still not completely clear. The prediction of miRNA targets is commonly achieved by combining sequence complementarity, free energy calculations of duplex formation, and conservation analysis. With the aim of predicting miRNA functions in the context of biological processes, several studies were performed under the assumption that if the targets of a specific miRNA are enriched in genes annotated to a specific biological pathway, it is realistic to hypothesize that the miRNA is involved in the same pathway. This approach does not take into account the indirect effects of miRNAs on biological pathways, which could be essential in the disease context. We present a strategy to infer miRNA-driven pathways potentially altered in tumor development by integrating mRNA and miRNA expression profiles. Our scoring algorithm is based on the inference of pathways in terms of miRNAs determined by correlations evaluated on paired expression levels of human miRNAs and mRNAs. An enrichment analysis of the resulting miRNA pathways allows the identification of pathways associated with differentially expressed miRNAs. The application of this approach to colorectal cancer unveils many cancer-related biological pathways that sequence-based methods for the prediction of miRNA targets were not able to reveal. Combining miRNA targets inferred by sequence analysis with those highlighted by gene expression data may lead to an improved understanding of colon cancer biology and to the identification of novel targets for therapy.

SELECTING DIFFERENTIALLY EXPRESSED GENES WITH HIDDEN SUBCLASSES

Carina Fortes1, Antonia Turkman2 and Lisete Sousa2

1 Higher School of Health Technology of Lisbon, Lisbon, Portugal
2 Faculty of Sciences of University of Lisbon, Lisbon, Portugal

Microarray technology allows the expression of thousands of genes to be analyzed in a single experiment. The identification of genes whose expression changes between two (or more) experimental conditions is a very common aim of microarray studies. To this end, different statistical tests, generally based on means or medians, have been proposed. However, if genes are differentially expressed only in subclasses, those techniques do not select them, because the mean or median values tend to be similar between the considered groups. Genes with a bimodal or a multimodal distribution within a class (considering a binary study) may indicate the presence of unknown subclasses with different expression values. As a result, important genes differentially expressed in a subset of samples can be missed by gene selection criteria based on the difference of sample means. The particular application that motivated our work concerns the development of a methodology which could simultaneously identify up- and down-regulated genes as well as genes that are differentially expressed with bimodal or multimodal distributions but similar means in both groups. For convenience, the latter case is referred to as mixed genes. We propose using the area under the receiver operating characteristic curve (AUC) and the coefficient of overlapping (OVL) simultaneously to select different types of differentially expressed genes. Plotting OVL against AUC gives a graph which we call the arrow plot. We conducted a simulation study in order to evaluate the detection performance of the proposed method.

We compared the performance of our methodology with some usual methods to select differentially expressed genes between two conditions, namely, Fold Change (FC), Average Difference (AD), Weighted Average Difference (WAD), Rank Products (RP), Welch t-statistic (Welch-t), Significance Analysis of Microarrays (SAM), Moderated t-statistic (modT), intensity-based moderated t-statistic (ibmT), empirical area under the ROC curve (empAUC), area under the ROC curve estimated using kernel smoothing (kerAUC) and SAMROC. The arrow plot showed the best performance. The arrow plot is an exploratory graphical tool for microarray experiments, useful in the identification of different kinds of differentially expressed genes, particularly genes with a special behavior which are not detected by usual methods and yet can bring relevant biological information.
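The two coordinates of the arrow plot can be sketched in a few lines of Python. The abstract does not specify the estimators, so the following is only an illustration under assumptions: the AUC is computed empirically (Mann-Whitney form), the OVL is approximated from histograms with a hypothetical bin count, and the expression values are simulated rather than taken from any data set.

```python
import random


def empirical_auc(x, y):
    """Empirical AUC: estimate of P(Y > X) + 0.5 * P(Y = X)."""
    wins = sum((b > a) + 0.5 * (b == a) for a in x for b in y)
    return wins / (len(x) * len(y))


def overlap_coefficient(x, y, bins=20):
    """Histogram estimate of the overlapping coefficient (OVL) of two samples."""
    lo, hi = min(x + y), max(x + y)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    return sum(min(p, q) for p, q in zip(histogram(x), histogram(y)))


rng = random.Random(0)
control = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Up-regulated gene: shifted mean in the second group.
up_gene = [rng.gauss(3.0, 1.0) for _ in range(200)]
# "Mixed" gene: two hidden subclasses, same overall mean as the control.
mixed_gene = ([rng.gauss(-3.0, 1.0) for _ in range(100)]
              + [rng.gauss(3.0, 1.0) for _ in range(100)])

for name, case in [("up-regulated", up_gene), ("mixed", mixed_gene)]:
    auc = empirical_auc(control, case)
    ovl = overlap_coefficient(control, case)
    print(f"{name}: AUC = {auc:.3f}, OVL = {ovl:.3f}")
```

In this simulation the mixed gene has an AUC near 0.5, so a mean-based or AUC-only criterion would miss it, while its low OVL still separates it from genes that are not differentially expressed.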

ATTRACTOR ROBUSTNESS DOES NOT RESTRICT INFORMATION PROPAGATION IN CRITICAL RANDOM BOOLEAN NETWORKS

Abhishekh Gupta, Jason Lloyd-Price, Olli Yli-Harja, and Andre Ribeiro

Department of Signal Processing, Tampere University of Technology, Tampere, Finland

Introduction: Gene regulatory networks (GRNs) are information processing systems which robustly perform their functions, yet remain responsive to external stimuli. One gene network codes for multiple cell types, each expressing a unique set of genes. The robustness of the phenotype of each cell type suggests that these correspond to dynamical attractors of the GRN. Using Random Boolean Networks as models of GRNs, we study how the capacity of a network to propagate information within an attractor, as quantified by the average pairwise mutual information (IA), relates to the robustness of that attractor to external perturbations. Results: We find that the dynamical regime of the network affects the relationship between IA and robustness. In the ordered and chaotic regimes, IA is anti-correlated with robustness. For the attractors of networks operating on the boundary between order and chaos (so-called "critical" networks), these quantities are uncorrelated. Conclusions: The anti-correlation between IA and robustness in the ordered and chaotic regimes implies that attractors of these networks which are highly robust to external perturbations have limited capacity to propagate information. Meanwhile, IA and robustness are uncorrelated in critical networks, indicating that these attractors are able to be arbitrarily robust to external perturbations without imposing limitations on their information propagation capabilities.
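As a rough illustration of the quantity studied here, the following Python sketch builds a small random Boolean network, finds the attractor reached from a random initial state, and averages the pairwise mutual information over that attractor. The abstract does not give the exact definition of IA, so the sketch assumes it is the mutual information between the state of node i at time t and the state of node j at time t+1, averaged over all ordered node pairs. Network size, connectivity, and all function names are hypothetical.

```python
import math
import random


def random_boolean_network(n, k, rng):
    """Each node gets k distinct random inputs and a random truth table."""
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n)]
    return inputs, tables


def step(state, inputs, tables):
    """Synchronous update of all nodes."""
    nxt = []
    for node_inputs, table in zip(inputs, tables):
        index = 0
        for j in node_inputs:
            index = 2 * index + state[j]
        nxt.append(table[index])
    return tuple(nxt)


def find_attractor(state, inputs, tables):
    """Iterate until a state repeats and return the states on the cycle."""
    seen, trajectory = {}, []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, inputs, tables)
    return trajectory[seen[state]:]


def mutual_information(pairs):
    """Mutual information (in bits) of the empirical distribution of pairs."""
    n = len(pairs)
    joint, px, py = {}, {}, {}
    for a, b in pairs:
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def average_pairwise_mi(cycle):
    """Average over node pairs (i, j) of MI between s_i(t) and s_j(t+1)."""
    n, length = len(cycle[0]), len(cycle)
    total = 0.0
    for i in range(n):
        for j in range(n):
            pairs = [(cycle[t][i], cycle[(t + 1) % length][j])
                     for t in range(length)]
            total += mutual_information(pairs)
    return total / n ** 2


rng = random.Random(42)
inputs, tables = random_boolean_network(n=8, k=2, rng=rng)
start = tuple(rng.randint(0, 1) for _ in range(8))
cycle = find_attractor(start, inputs, tables)
print(f"attractor length {len(cycle)}, "
      f"average pairwise MI {average_pairwise_mi(cycle):.3f} bits")
```

A fixed-point attractor carries no information under this measure, whereas a cycle in which nodes toggle in a coordinated way gives values up to 1 bit.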

DERRIDA VALUES FOR NETWORKS BASED ON NESTED CANALIZING FUNCTIONS

Claus Kadelka1,2, David Murrugarra1,2, Reinhard Laubenbacher1,2

1 Department of Mathematics, Virginia Tech, Blacksburg, VA 24061, USA
2 Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA

It has been discovered that gene regulation is a flexible, yet still robust process. One well-known indicator used to investigate the effect of small perturbations on the dynamics of gene regulatory networks is the so-called Derrida plot. For a fixed number of perturbations, the Derrida value is the expected Hamming distance after one time step. This value is generally small for networks that exhibit stable behavior and large for networks with chaotic behavior. Thus far, obtaining these values has depended on time-consuming Monte Carlo simulations. However, for networks based on Boolean nested canalizing functions, a class of functions found to be prevalent in gene regulation, explicit formulas for the Derrida values can be found, both for general nested canalizing functions and for nested canalizing functions of a particular Hamming weight. By assigning a so-called layer number to each function, the class of nested canalizing functions was recently further partitioned. An analysis of the derived formulas shows that the effect of small perturbations on the network increases monotonically in the layer number. Therefore, networks based on nested canalizing functions do not exhibit any particular type of dynamics in general: networks based on functions with a low layer number operate in the frozen regime, whereas large layer numbers lead to chaotic behavior. Having explicit formulas for the Derrida values of networks governed by any nested canalizing function makes simulation unnecessary, which simplifies the use of Derrida plots for robustness investigations of complex networks. Thus, this research contributes to a better understanding of the robustness of gene regulation.
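For comparison with such explicit formulas, the Monte Carlo estimate of a Derrida value can be sketched as follows. The two ring networks are hypothetical toy examples chosen because their values can be checked by hand. They are not taken from the abstract, and the layer-number formulas themselves are not reproduced here.

```python
import random


def derrida_value(update, n, d, samples, rng):
    """Monte Carlo estimate of the expected Hamming distance after one
    synchronous update, starting from two states that differ in d bits."""
    total = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = list(x)
        for i in rng.sample(range(n), d):
            y[i] ^= 1  # perturb d randomly chosen nodes
        fx, fy = update(x), update(y)
        total += sum(a != b for a, b in zip(fx, fy))
    return total / samples


def ring_and(x):
    """Each node is the AND of its two ring neighbours (nested canalizing)."""
    n = len(x)
    return [x[i - 1] & x[(i + 1) % n] for i in range(n)]


def ring_xor(x):
    """Each node is the XOR of its two ring neighbours (not canalizing)."""
    n = len(x)
    return [x[i - 1] ^ x[(i + 1) % n] for i in range(n)]


rng = random.Random(0)
for name, update in [("AND ring", ring_and), ("XOR ring", ring_xor)]:
    value = derrida_value(update, n=10, d=1, samples=20000, rng=rng)
    print(f"{name}: Derrida value for one perturbation = {value:.3f}")
```

For the AND ring, a single flip changes each of the two dependent nodes with probability 1/2, so the estimate should be close to 1. For the XOR ring both dependent nodes always change, which gives exactly 2.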

PREDICTION OF REGULATORY IMPACT UNDER DEFINED ENVIRONMENTAL CONDITIONS BASED ON A COMBINATION OF METABOLIC AND TRANSCRIPTIONAL NETWORKS

Johannes Georg Klotz1, Gero Viertel1, Ronny Feuer2, Katrin Gottlieb3, Oliver Sawodny2, Georg Sprenger3, Martin Bossert1, Michael Ederer2, and Steffen Schober1

1 Ulm University, Ulm, Germany
2 Institute for System Dynamics, University of Stuttgart, Stuttgart, Germany
3 Institute for Microbiology, University of Stuttgart, Stuttgart, Germany

Motivation and Aims: When designing knockout studies it is of interest to predict the changes of the phenotypical parameters by in silico studies. It has been shown (Feuer et al. 2011 [1]) that the environmental conditions affect the predictive effectiveness of such in silico studies. Therefore, we investigate the influence of the genes modeled in a transcriptional Boolean network (TN) on the phenotypical parameters for a large number of conditions. We are interested in finding regulatory genes that significantly influence the biomass yield or the growth behavior of Escherichia coli, but only under certain environmental conditions. For other environmental conditions a change of transcription rates of these genes should have no effect. Hence, this environmental condition can be used as an experimental control. Due to the high costs of experiments and the large number of potential candidates, we use in silico simulations of a combined model (transcriptional network together with a metabolic network model). First results (Feuer et al. 2011, [1]) show that the influence of the TN in this combination on the phenotypical parameters is strongly dependent on the environmental conditions. Methods: We combine a transcriptional Boolean network (TN) and a metabolic network model (MN), which have been presented by Covert et al. 2004 [2] and Feist et al. 2007 [3], respectively. The environmental conditions are represented in the model by the binary input variables of the TN. The overall states of the TN and, hence, its impact on the metabolic network depend solely on these inputs. By iteratively flipping the states of all regulatory genes in the TN for a large number of environmental conditions, followed by a Flux-Balance Analysis to evaluate the MN, we can quantify the impact of a particular gene. By using graph-theoretic methods we identify not only direct influences of those genes, but also indirect ones.

For the analysis of the simulations we use the so-called influence to identify a small set of regulatory genes and combinations of genes which seem to fulfill the conditions mentioned above. The predicted influence of these candidates on the metabolism should be verified in experiments. Conclusion: We present methods to quantify the impact of a particular regulatory gene on the growth behavior of Escherichia coli. Our results can then be used to design knockout studies to verify and extend our in silico findings. Acknowledgments: This work was supported by the German research council "Deutsche Forschungsgemeinschaft" (DFG) under Grants Bo 867/25-2, Sa 847/11-2 and Sp 503/5-2.

References
[1] R. Feuer, K. Gottlieb, J. G. Klotz, S. Schober, M. Bossert, O. Sawodny, G. Sprenger, and M. Ederer, "Model-based analysis of adaptive evolution," in Proceedings of the 8th International Workshop on Computational Systems Biology (WCSB), (Zuerich, Switzerland), June 2011.
[2] M. W. Covert, E. M. Knight, J. L. Reed, M. J. Herrgard, and B. O. Palsson, "Integrating high-throughput and computational data elucidates bacterial networks," Nature, vol. 429, pp. 92–96, May 2004.
[3] A. M. Feist, C. S. Henry, J. L. Reed, M. Krummenacker, A. R. Joyce, P. D. Karp, L. J. Broadbelt, V. Hatzimanikatis, and B. O. Palsson, "A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information," Molecular Systems Biology, vol. 3, p. 121, Jun 2007.

USING PHYSIOLOGY-BASED PHARMACOKINETIC MODELING FOR PHARMACEUTICAL RESEARCH AND DEVELOPMENT

Lars Kuepfer

Bayer Technology Services GmbH, Leverkusen, Germany

Computational models play an increasing role in pharmaceutical research and development, since they offer an efficient way of storing, representing and analyzing experimental data at each stage of (pre-)clinical development. Physiologically-based pharmacokinetic (PBPK) models are a special form of pharmacokinetic models which represent the processes underlying the distribution of a substance within the human body mechanistically at a high level of detail. Based on generic drug distribution models and extensive collections of physiological parameters, they enable a comprehensive simulation of drug pharmacokinetics at the whole-body scale. Moreover, structural refinements such as metabolization processes or active transport can easily be introduced into the basic PBPK models such that structural hypotheses can be evaluated. Using different exemplary case studies we show here how computational models allow the investigation of mechanisms governing a specific pharmacokinetic behavior. Simultaneous consideration of a drug's modes of action at the cellular level enables the construction of integrated multi-scale models, based on which dose-effect relationships can be investigated from a systems perspective. Such multi-scale models have, among other applications, been used to correlate genetic predisposition in a patient subgroup with clinical endpoints. Computational models thereby significantly support the assessment of crucial points in drug development, such as the general evaluation of drug efficacy or the analysis of inter-individual variability in therapeutic outcomes.

MODELING METABOLISM OF AN ARTIFICIALLY CONSTRUCTED SYSTEM OF TWO BACTERIAL STRAINS

Antti Larjo1, Suvi Santala2, Matti Karp2, and Ville Santala2

1 Department of Signal Processing, Tampere University of Technology, Tampere, Finland
2 Department of Chemistry and Bioengineering, Tampere University of Technology, Finland

Introduction: While most studies concentrate only on culturing and modeling single species, bacteria in nature are practically always part of bacterial communities, competing for the same substrates and utilizing products of other bacteria. Also production of substances would often be preferable in a microbial microecosystem instead of pure cultures due to difficulty in maintaining culture purity and more efficient use of heterogeneous source materials. Here we discuss modeling of an artificial co-culture of two bacterial strains. Results: Making Acinetobacter baylyi ADP1 deficient of utilizing glucose and culturing it together with Escherichia coli created an artificial co-culture system. Using this system and glucose as sole carbon source produces a unidirectional carbon flow as E. coli consumes the glucose while its end products (acetate) are utilized by A. baylyi. Measurements of growth rates and key metabolites were done for a series of time points and were used for validating an in silico model. One issue for metabolic modeling is a phenomenon in several organisms called overflow metabolism, which takes place under conditions of excess substrate availability and where the organism utilizes resources in a non-optimal way. This is not handled well by metabolic models by default even though it can produce marked deviations in growth and production rates. Our dynamic flux balance analysis based model aims to capture the above-mentioned set-up as well as possible. Conclusions: Ability to model, construct and control novel biological systems consisting of more than one species will have a great significance in a number of biotechnological applications. Our model and results are for the simplest case and represent a necessary first step towards in silico modeling of whole bacterial communities.

USING KERNEL DENSITY ESTIMATORS TO DETECT BISTABLE STATES IN STOCHASTIC SIMULATIONS OF GENETIC CIRCUITS

E. Monzon1, A. Tejeda1, C. Winstead1, C. J. Myers2, and C. Madsen3

1 Dept. of Electrical and Computer Engineering, Utah State University, Logan, UT, USA
2 Dept. of Electrical and Computer Engineering, University of Utah, SLC, UT, USA
3 School of Computing, University of Utah, SLC, UT, USA

Introduction: Detecting bifurcations is a problem germane to genetic circuit simulations. In the context of disease networks, it is believed that bistable circuits may drive transitions from one locked-in state (healthy state) to another (disease state). Thus, understanding circuits that exhibit bistable behavior is of utmost importance for classifying disease states and identifying potential drug targets or biomarkers. The stable states of a bistable system can be detected by analysing the results obtained from many stochastic simulation runs at different times. If one molecular species in a reaction system has more than one stable state, then its molecular count for independent simulation runs will tend to cluster around one of these states. These clusters can be observed by analysing and detecting the modes of the underlying distribution at time t. We propose the use of kernel density estimates and a simple peak detection algorithm to perform this bifurcation analysis from stochastic simulations without intervention from the designer. Results: The stochastic bifurcation detection method has been successfully applied to a genetic toggle switch. Twenty independent Gillespie SSA runs were performed for a length of 2,000 time units. Every 50 time units the density of molecular species TetR was estimated and the peaks were detected. After the simulation the two stable states of species TetR were readily visible and clearly distinguishable. Conclusion: A new method to detect bistability from stochastic simulations of genetic circuits is proposed. The bifurcation detection method shown in this work selects representative results of bistable paths from a collection of stochastic simulations using the well-known kernel density estimators and a simple peak detection algorithm.
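The combination of a kernel density estimate with simple peak detection can be sketched in Python as follows. This is only an illustration under assumptions: the twenty molecule counts are made-up values standing in for the TetR counts of twenty simulation runs at one time point, and the Gaussian kernel, bandwidth, grid, and peak-height threshold are hypothetical choices not stated in the abstract.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


def find_peaks(density, lo, hi, points=221, min_rel_height=0.1):
    """Local maxima of the density on a regular grid, ignoring small bumps."""
    xs = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    ys = [density(x) for x in xs]
    top = max(ys)
    return [xs[i] for i in range(1, points - 1)
            if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]
            and ys[i] >= min_rel_height * top]


# Hypothetical molecule counts from 20 independent runs at one time point.
counts = [12, 15, 17, 18, 20, 21, 22, 24, 26, 29,
          70, 74, 76, 78, 80, 81, 83, 85, 88, 91]

density = gaussian_kde(counts, bandwidth=8.0)
peaks = find_peaks(density, lo=0.0, hi=110.0)
print("detected modes near:", [round(p, 1) for p in peaks])
```

Repeating this at regular time intervals, as described in the abstract, would show whether a second mode appears and persists over the course of the simulation.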

SINGLE-MOLECULE DYNAMICS OF THE BIDIRECTIONAL ARABINOSE PROMOTER

Samuel Oliveira, Jarno Mäkelä, Meenakshisundaram Kandhavelu, Olli Yli-Harja, and Andre S. Ribeiro

Department of Signal Processing, Tampere University of Technology, Tampere, Finland

Introduction: Arabinose plays a central role in the metabolism of organisms such as Escherichia coli, controlling the expression of several genes driven by the pBAD promoter. Arabinose acts as an activator by de-repressing this promoter, which allows the expression of proteins that are responsible for the conversion of L-arabinose into forms that are then used in the cellular metabolism. Results: We measured the in vivo kinetics of production of individual RNA molecules under the control of the native pBAD promoter in E. coli. Assuming that transcription initiation consists of successive steps, each exponential in duration, we infer the number and duration of the steps. We find that the RNA production is regulated by a multi-step process with at least two major steps. As a result, this process is found to be sub-Poissonian. Conclusions: From this finding, along with similar findings for other promoters such as tet and lac derivatives, we hypothesize that transcription in bacteria may, in general, be a sub-Poissonian process. If so, it suggests that the results from measurements of cell-to-cell diversity in RNA numbers are affected by processes other than gene expression, such as biased partitioning in division or complex kinetics of RNA degradation.
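Why a multi-step initiation process is sub-Poissonian can be seen in a short simulation. The sketch below assumes that each interval between consecutive RNA productions is the sum of independent, exponentially distributed steps. The step durations are made-up numbers, not the values inferred in this work. A squared coefficient of variation of the intervals below 1 corresponds to production that is less noisy than a Poisson process.

```python
import random
import statistics


def production_interval(rng, step_means):
    """One interval between productions: sum of exponential step durations."""
    return sum(rng.expovariate(1.0 / mean) for mean in step_means)


def squared_cv(values):
    """Squared coefficient of variation (variance divided by squared mean)."""
    mean = statistics.fmean(values)
    return statistics.pvariance(values, mean) / mean ** 2


rng = random.Random(0)
# Hypothetical step durations in seconds; both models have the same mean interval.
one_step = [production_interval(rng, [600.0]) for _ in range(20000)]
two_step = [production_interval(rng, [300.0, 300.0]) for _ in range(20000)]

print(f"one step : CV^2 = {squared_cv(one_step):.3f}")
print(f"two steps: CV^2 = {squared_cv(two_step):.3f}")
```

For steps with means m1 and m2 the squared coefficient of variation is (m1^2 + m2^2) / (m1 + m2)^2, which is 1 for a single step and 1/2 for two equal steps, so the simulated values can be compared against these numbers.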


TOWARDS CONFIDENCE INTERVALS FOR THE MUTUAL INFORMATION BETWEEN TWO BINARY RANDOM VARIABLES
Arno G. Stefani1, Johannes B. Huber1, Christophe Jardin2 and Heinrich Sticht2
1 Institute for Information Transmission (LIT), FAU Erlangen-Nuremberg, Cauerstr. 7, 91058 Erlangen, Germany
2 Bioinformatics, Institute for Biochemistry, FAU Erlangen-Nuremberg, Fahrstr. 17, 91054 Erlangen, Germany
{stefani, huber}@LNT.de, {christophe.jardin, h.sticht}@biochem.uni-erlangen.de

1. INTRODUCTION

Inspired by the work of Ho and Yeung [1] we have found conjectures for tight upper and lower bounds on the mutual information (MI) of two binary random variables with a joint distribution having a maximal variational distance (L1 deviation) to some distribution (e.g. empirical distribution). Combined with a lower bound on the probability of a maximal variational distance [1, Lemma 3] between the true joint distribution and an empirical distribution, this gives a confidence interval for the mutual information, given an empirical joint distribution of two binary random variables. To our best knowledge this is the first result which does not make any assumptions on the true joint distribution and works in a non-asymptotic regime.

2. RESULTS

We will state a lower and upper bound for the MI of two binary random variables $X, Y \in \{1, 2\}$ with a variable joint probability distribution

$$Q = \begin{pmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{pmatrix},$$

written in matrix form for convenience, which has a variational distance $V(P,Q) = \|P - Q\|_1 \le \epsilon$, $0 \le \epsilon \le 2$, to a given and fixed joint probability distribution $P$ (e.g. empirical distribution). Note that $I(X;Y) = I(Q) = 0$ iff $\det(Q) = 0$. First, we will show how to determine $\inf_{V(P,Q)\le\epsilon} |\det(Q)|$, $\inf_{V(P,Q)\le\epsilon} \det(Q)$, and $\sup_{V(P,Q)\le\epsilon} \det(Q)$.

$\inf_{V(P,Q)\le\epsilon} \det(Q)$ can be determined by separately minimizing $q_{11} q_{22}$ and maximizing $q_{12} q_{21}$. Of course, the sum of differences of entries of $P$ and $Q$ has to sum up to 0. At first $q_{11} q_{22}$ is minimized by taking $q_{ii} = \max(0, p_{ii} - \frac{\epsilon}{2})$, where $ii$ denotes the index of the minimizer of $\min(p_{11}, p_{22})$ and $\bar{i}\bar{i}$ the index of the other element. If $q_{ii} = 0$, we further choose $q_{\bar{i}\bar{i}} = \max(0, p_{11} + p_{22} - \frac{\epsilon}{2})$, otherwise we choose $q_{\bar{i}\bar{i}} = p_{\bar{i}\bar{i}}$. In order to maximize $q_{12} q_{21}$ we take $r = \min(p_{ij} + \frac{\epsilon}{2}, p_{\bar{i}\bar{j}})$, where $ij$ denotes the index of the minimizer of $\min(p_{12}, p_{21})$ and $\bar{i}\bar{j}$ the index of the other element. If $r < p_{\bar{i}\bar{j}}$, then $q_{ij} = r$ and $q_{\bar{i}\bar{j}} = p_{\bar{i}\bar{j}}$, otherwise we choose $q_{ij} = q_{\bar{i}\bar{j}} = \min(\frac{1}{2}, \frac{1}{2}(p_{12} + p_{21} + \frac{\epsilon}{2}))$.

$\sup_{V(P,Q)\le\epsilon} \det(Q)$ can be found in an analogous way.

$\inf_{V(P,Q)\le\epsilon} |\det(Q)|$ can be solved by taking the supremum or the infimum depending on the sign of $\det(P)$ and in addition setting the optimization result to 0 if the leading sign of the determinant has changed after the optimization. Now we can state our conjectures, that MI is closely related to $\det(Q)$.

Conjecture 1

$$\inf_{V(P,Q)\le\epsilon} I(Q) = \begin{cases} I(\tilde{Q}) & \text{if } \inf_{V(P,Q)\le\epsilon} |\det(Q)| > 0 \\ 0 & \text{else} \end{cases}$$

with $\det(\tilde{Q}) = \inf_{V(P,Q)\le\epsilon} |\det(Q)|$.

Conjecture 2

$$\sup_{V(P,Q)\le\epsilon} I(Q) = \begin{cases} \max(I(\tilde{Q}_1), I(\tilde{Q}_2)) & \text{if } \epsilon < \min\left( V\left( \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}, P \right), V\left( \begin{pmatrix} 0 & 1/2 \\ 1/2 & 0 \end{pmatrix}, P \right) \right) \\ 1 & \text{else} \end{cases}$$

with $\det(\tilde{Q}_1) = \sup_{V(P,Q)\le\epsilon} \det(Q)$ and $\det(\tilde{Q}_2) = \inf_{V(P,Q)\le\epsilon} \det(Q)$.

We have a promising approach for an analytical proof, which already proves the conjecture for the lower bound for most cases, but so far unfortunately not for all cases. Extensive numerical computations did not reveal any results which contradict the conjectures.

3. CONCLUSIONS

Reliable estimation of MI has lots of applications in biology, e.g. protein-protein docking analysis, microarray gene expression analysis, and spike train analysis. Therefore we think that our contribution is a first step towards a broader applicability of methods involving MI in biology.
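The construction described in Section 2 for the infimum of det(Q) can be sketched in a few lines. This is only one reading of that description, not the authors' code: the function names are hypothetical, the variational-distance bound is passed as `eps`, ties between equal entries are broken towards the first index, and MI is measured in bits (an assumption consistent with the maximum value of 1 in Conjecture 2).

```python
import math

def mutual_information(q):
    """Mutual information in bits of a 2x2 joint distribution q."""
    rows = [q[0][0] + q[0][1], q[1][0] + q[1][1]]
    cols = [q[0][0] + q[1][0], q[0][1] + q[1][1]]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if q[i][j] > 0.0:
                mi += q[i][j] * math.log2(q[i][j] / (rows[i] * cols[j]))
    return mi

def det2(q):
    """Determinant of a 2x2 matrix given as nested lists."""
    return q[0][0] * q[1][1] - q[0][1] * q[1][0]

def min_det_candidate(p, eps):
    """Candidate Q with V(P, Q) <= eps that makes det(Q) small, per Section 2."""
    half = eps / 2.0
    q = [row[:] for row in p]
    # Minimize q11*q22: take mass from the smaller diagonal entry first.
    i, other = (0, 1) if p[0][0] <= p[1][1] else (1, 0)
    q[i][i] = max(0.0, p[i][i] - half)
    if q[i][i] == 0.0:
        q[other][other] = max(0.0, p[0][0] + p[1][1] - half)
    # Maximize q12*q21: add mass to the smaller off-diagonal entry first.
    small, large = ((0, 1), (1, 0)) if p[0][1] <= p[1][0] else ((1, 0), (0, 1))
    r = min(p[small[0]][small[1]] + half, p[large[0]][large[1]])
    if r < p[large[0]][large[1]]:
        q[small[0]][small[1]] = r
    else:
        q[0][1] = q[1][0] = min(0.5, 0.5 * (p[0][1] + p[1][0] + half))
    return q

# Hypothetical empirical distribution and distance bound.
p_emp = [[0.4, 0.1], [0.1, 0.4]]
q_min = min_det_candidate(p_emp, eps=0.2)
print(det2(p_emp), det2(q_min))
print(mutual_information(p_emp), mutual_information(q_min))
```

In this example the determinant shrinks towards zero, and the MI of the candidate is lower than that of `p_emp`, in line with the stated link between MI and det(Q).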

4. REFERENCES

[1] S.-W. Ho and R. Yeung, "The interplay between entropy and variational distance," IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 5906–5929, Dec. 2010.


ADAPTIVE METRICS FOR RNA-PROTEIN INTERACTION STUDIES
Marc Strickert, Marco Mernberger, and Eyke Hüllermeier
Knowledge Engineering and Bioinformatics, University of Marburg, Germany

INTRODUCTION

RNA-protein interactions are fundamental events for gene regulation and protein synthesis. A prediction of potential activity is thus a first step towards the characterization and understanding of biological response mechanisms. With the increasing availability of protein and RNA sequencing information, the question of their molecular interplay has recently been addressed successfully at the sequence level [1]. The approach proposed in [1] operates on a k-mer frequency encoding of RNA and target proteins according to the conjoint triad feature code [2]: RNA strings are represented by nucleotide 4-mers that yield a 256-dimensional frequency profile, and protein sequences are encoded as 343-dimensional 3-mer frequency profiles of amino acids grouped into a 7-character alphabet for dipole moments and spatial volumes. Using this encoding, the authors of [1] predicted functional connections between both sequence spaces by mapping their 599-dimensional concatenation to on and off states. Random forests and support vector machines with radial basis function kernels were used for this task.

RESULTS

In this work we investigate the use of adaptive metric methods [3] for creating interpretable connections between both spaces for the representation of RNA and protein sequences. We consider the following adaptive distance measure for the comparison of two vectors X and Y from the feature space:

$$d_\lambda(X, Y) = \left( (X - Y)\, \lambda^T \lambda\, (X - Y)^T \right)^p$$

Quadratic forms obtained for p = 1 are suitable for classification, while p = 1/2 refers to Mahalanobis-type distances, which allow rectangular matrices $\lambda \in \mathbb{R}^{d \times q}$ to be interpreted as data mapping operators for q-dimensional subspaces in which the standard Euclidean distance is applied. In a regression study, we examine a subspace of the RNA feature vectors in which the pairwise distances are adapted by quasi-Newton l-BFGS optimization to match the distances of the corresponding protein features:

$$\lambda^* = \arg\max_\lambda \; r(D^\lambda_{\mathrm{RNA}}, D_{\mathrm{Prot}})$$

Correlation r applies to upper triangular distance matrices of RNA and protein feature vectors. This correlative matrix mapping (CMM) provides a uni-directional alternative to partial least squares regression and yields robust solutions [4]. Our study of the benchmark data set RPI2241 discussed in [1] shows a strong structural difference between both spaces that cannot be easily resolved in low-dimensional subspaces. Still, posterior analysis of $\lambda^*$ allows for identification of critical 4-mers that play a role in the mediation from RNA to protein sequences.

In a classification task we redo the RNA-protein interaction prediction described in [1] using their 599-dimensional feature vectors from the reference data set RPI2241. Generalized matrix relevance learning vector quantization (GRLVQ) is used as the classification method in our study [5]. This provides a class representation based on interpretable data prototypes. Additionally, the classification-optimized distance measure provides an interpretable mapping of feature attributes into a lower-dimensional subspace. Using 100-fold randomized cross-validations based on 9:1 splits of training and test data, we improve the originally reported best test accuracy from 89.6% to 91.39% ± 1.45%, employing a 36-dimensional subspace and 15 prototypes separately assigned for on and off interaction classes. More detailed analysis shows that much interaction-specific information resides in the nucleotide 4-mers, as can be expected from the random selection of presumed non-binding RNA samples in addition to known positives. Yet, the best classification accuracy also depends on contributions from a small number of amino acid 3-mers.

CONCLUSION

K-mer statistics on sequence data contain valuable information to characterize basic RNA-protein interaction potential. Learning matrix metrics helps in constructing interpretable models pointing towards interesting pairs of k-mer candidates. CMM and GRLVQ software are available online [6]. This work is supported by the LOEWE Center for Synthetic Microbiology (SYNMIKRO).

REFERENCES

[1] U. Muppirala, V. Honavar, and D. Dobbs, "Predicting RNA-Protein Interactions Using Only Sequence Information," BMC Bioinformatics, vol. 12, 2011.
[2] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang, "Predicting protein-protein interactions based only on sequences information," Proc Natl Acad Sci, 104:4337-41, 2007.
[3] L. Yang, "Distance Metric Learning: A Comprehensive Survey," Tech. Rep., Michigan State University, 2006.
[4] A.J. Soto, G.E. Vazquez, M. Strickert, and I. Ponzoni, "Target-driven subspace mapping methods and their applicability domain estimation," Molecular Informatics, vol. 30, 779-789, 2011.
[5] P. Schneider, M. Biehl, and B. Hammer, "Distance learning in discriminative vector quantization," Neural Computation 21(10), 2942–2969, 2009.
[6] Mach. Learning Open Source Software: https://mloss.org/
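The adaptive distance and the correlation objective of the regression study can be written out as a small sketch. It is not the CMM software referenced in [6]: the function names and toy vectors are hypothetical, and the mapping matrix is stored here with q rows and d columns so that each row projects a d-dimensional difference vector onto one subspace axis (a storage convention assumed for this example). The l-BFGS optimization of the matrix is omitted; only the objective is evaluated.

```python
import math

def adaptive_distance(x, y, lam, p=1.0):
    """Project (x - y) with the rows of `lam`, then raise the squared norm to p."""
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(w * d for w, d in zip(row, diff)) for row in lam]
    return sum(v * v for v in projected) ** p

def pairwise_upper(vectors, dist):
    """Upper-triangular pairwise distances as a flat list."""
    n = len(vectors)
    return [dist(vectors[i], vectors[j]) for i in range(n) for j in range(i + 1, n)]

def pearson(u, v):
    """Pearson correlation between two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def cmm_objective(rna, prot, lam):
    """Correlation between adapted RNA distances and Euclidean protein distances."""
    d_rna = pairwise_upper(rna, lambda a, b: adaptive_distance(a, b, lam, p=0.5))
    d_prot = pairwise_upper(prot, math.dist)
    return pearson(d_rna, d_prot)

# Toy data (hypothetical): three RNA profiles (d = 3), three protein profiles.
rna = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 1.0, 1.0]]
prot = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
lam = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps to a q = 2 dimensional subspace
print(cmm_objective(rna, prot, lam))
```

An optimizer would adjust the entries of `lam` to increase the value returned by `cmm_objective`; inspecting the columns of the optimized matrix then indicates which input features (here, 4-mers) carry weight.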


EFFECTS OF RATE LIMITING STEPS IN TRANSCRIPTION INITIATION ON THE BEHAVIOR OF SMALL GENETIC MOTIFS
Huy Tran, Antti Häkkinen, Olli Yli-Harja and Andre S. Ribeiro
Tampere University of Technology, Tampere, Finland

Introduction: In organisms, genes form networks, whose building blocks are small motifs. These perform specific functions, such as signal buffering, synchronization, or noise shaping. The behavior of each motif is determined not only by the gene-gene interactions, but also by the expression pattern of each constituent gene. In vivo single molecule measurements indicate that the sequence of steps in transcription initiation plays a key role in the dynamics of mRNA and protein numbers. It is unknown to what extent these steps affect the behavior of the motifs. Results: We study the behavior of stochastic genetic motifs, while varying the kinetics of transcription initiation of the constituent genes. First, we simulate the biphasic amplitude filter, which generates an output signal only for a specific input level. We find that the strength of the fluctuations in the input signal affects both the maximum output level and the amplitudes that the filter responds to, and that the performance of the filter deteriorates for highly noisy inputs. Next, we model a three-gene repressilator, a circuit able to act as a genetic clock. We find the dynamics of initiation to affect the repressilator's period and robustness. Finally, we observe a modulated oscillator and find that it can exhibit steep low-pass frequency behavior, with a cutoff frequency that varies with the duration of the steps in initiation. Conclusions: The kinetics of the rate limiting steps in transcription initiation is shown to have a strong effect on the behavior of genetic motifs. Since these steps are determined by a promoter's sequence, they can be mutated, either by evolutionary pressure or synthetically, to attain sought characteristics of the motifs without affecting their products. The results aid in understanding the degree of plasticity of these motifs, and will aid in the engineering of synthetic motifs with specific characteristics.


MODELING THE CIS- AND TRANS-EFFECT OF DNA COPY NUMBER ABERRATIONS ON GENE EXPRESSION LEVELS IN A PATHWAY
Wessel van Wieringen, Mark van de Wiel
VUmc University Medical Center, Amsterdam, The Netherlands

Integration of genomic data from multiple sources is imperative for a mechanistic understanding of cancer. By putting together partial views of a complex process like tumorigenesis, we may obtain a more accurate and complete picture of the molecular mechanisms underlying it. We discuss the integration of DNA copy number and mRNA gene expression data from an observational integrative genomics study involving cancer patients. The two molecular levels involved are linked through the central dogma of molecular biology. DNA copy number aberrations abound in the cancer cell, and we investigate how these aberrations affect gene expression levels within a pathway. In particular, we aim to identify differential edges between the regulatory networks of two groups involving these molecular levels. We provide a taxonomy of the usages of the terms cis- and trans-effect relating DNA copy number aberrations and gene expression. Motivated by the rate equations, the interplay between DNA copy number aberrations and gene expression levels within a pathway is modeled by a simultaneous-equations model. This is introduced for the one-sample case. The model is fitted by means of penalized least squares using the lasso to achieve a sparse network topology. The simultaneous-equations model is then extended to the two-sample situation, in which the topology of the regulatory network is allowed to differ between the two groups, thus facilitating the discovery of differential interactions. In the estimation of this extended model, the fused lasso penalty is included to regulate the number of differential interactions. Extensive simulations of the one-sample case show that the inclusion of DNA copy number data benefits the discovery of gene-gene interactions. In addition, they reveal that cis-effects tend to be over-estimated in a univariate (single gene) analysis. In a two-sample setting, the indirect and direct trans-effect models are studied side by side, indicating that the former yields lower false positive and false negative rates for differential edge identification. Application to pathway data from integrative genomic studies puts forward a possible cancer systems biology principle: genes with larger cis-effects tend to have fewer incoming, but more outgoing edges. Finally, analysis of the TP53 signalling pathway between ER+ and ER- samples from an integrative genomics breast cancer study identifies, among others, differential regulation of the IGF complex. This is corroborated by existing literature: the IGF complex is known to crosstalk with ER. Moreover, this result is reproduced in a different integrative genomics breast cancer study.
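The lasso-penalized least squares fit of a single equation in such a model can be illustrated as follows. This is a generic sketch, not the authors' estimation procedure: the coordinate descent solver, the function names, the penalty value, and the toy data (one copy number column as the cis-effect, one other gene's expression as a trans-effect) are all assumptions.

```python
def soft_threshold(value, penalty):
    """Shrink `value` towards zero by `penalty`; small values become exactly zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, sweeps=100):
    """Minimize (1/2n)*||y - X*beta||^2 + penalty*||beta||_1 by coordinate descent."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    for _ in range(sweeps):
        for j in range(m):
            # Correlation of predictor j with the residual that excludes predictor j.
            rho = 0.0
            for i in range(n):
                fitted_without_j = sum(X[i][k] * beta[k] for k in range(m) if k != j)
                rho += X[i][j] * (y[i] - fitted_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale if scale > 0 else 0.0
    return beta

# Toy, centered data (hypothetical): column 0 is the gene's own DNA copy number
# (cis-effect), column 1 is the expression of another pathway gene (trans-effect).
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
beta = lasso_coordinate_descent(X, y, penalty=0.1)
print(beta)
```

In this toy fit the weak trans-effect is set exactly to zero while the cis-effect is kept but shrunk, which is how the lasso yields a sparse network topology when applied to every gene in the pathway.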


TICSP Series Editor

Jaakko Astola, Tampere University of Technology, Finland

Editorial Board

Moncef Gabbouj, Tampere University of Technology, Finland
Murat Kunt, Ecole Polytechnique Fédérale de Lausanne, Switzerland
Truong Nguyen, Boston University, USA

1 Egiazarian, Saramäki, Astola. Proceedings of Workshop on Transforms and Filter Banks.
2 Yaroslavsky. Target Location: Accuracy, Reliability and Optimal Adaptive Filters.
3 Astola. Contributions to Workshop on Trends and Important Challenges in Signal Processing.
4 Creutzburg, Astola. Proceedings of Second International Workshop on Transforms and Filter Banks.
5 Stankovic, Moraga, Astola. Readings in Fourier Analysis on Finite Non-Abelian Groups.
6 Yaroslavsky. Advanced Image Processing Lab.: An educational and research package for Matlab.
7 Klapuri. Contributions to Technical Seminar on Content Analysis of Music and Audio.
8 Stankovic, Stankovic, Astola, Egiazarian. Fibonacci Decision Diagrams.
9 Yaroslavsky, Egiazarian, Astola. Transform Domain Image Restoration Methods: Review, Comparison and Interpretation.
10 Creutzburg, Egiazarian. Proceedings of International Workshop on Spectral Techniques and Logic Design for Future Digital Systems, SPECLOG’2000.
11 Katkovnik. Adaptive Robust Array Signal Processing for Moving Sources and Impulse Noise Environment.
12 Danielian. Regularly Varying Functions, Part I, Criteria and Representations.
13 Egiazarian, Saramäki, Astola. Proceedings of the 2001 International Workshop on Spectral Methods and Multirate Signal Processing, SMMSP2001.
14 Stankovic, Sasao, Astola. Publications in the First Twenty Years of Switching Theory and Logic Design.
15 Saramäki, Yli-Kaakinen. Design of Digital Filters and Filter Banks by Optimization: Applications.
16 Danielian. Optimization of Functionals on Classes of Distributions with Moments’ Constraints, Part I, Linear Case.
17 Saramäki, Egiazarian, Astola. Proceedings of the 2002 International TICSP Workshop on Spectral Methods and Multirate Signal Processing, SMMSP2002.
18 Danielian. Optimization of Functionals on Classes of Distributions with Moments’ Constraints, Part II, Nonlinear Case.
19 Katkovnik, Egiazarian, Astola. Adaptive Varying Scale Methods in Image Processing, Part I Denoising and Deblurring.
20 Huttunen, Gotchev, Vasilache. Proceedings of the 2003 Finnish Signal Processing Symposium, Finsig'03.
21 Yli-Harja, Shmulevich, Aho. Proceedings of the 1st TICSP Workshop on Computational Systems Biology, WCSB 2003.
22 Saramäki, Egiazarian, Astola. Proceedings of the 2003 International TICSP Workshop on Spectral Methods and Multirate Signal Processing, SMMSP2003.
23 Sarukhanyan, Agaian, Egiazarian, Astola. Hadamard Transforms.
24 Aho, Lähdesmäki, Yli-Harja. Proceedings of the 2nd TICSP Workshop on Computational Systems Biology, WCSB 2004.
25 Astola, Egiazarian, Saramäki. Proceedings of the 2004 International TICSP Workshop on Spectral Methods and Multirate Signal Processing, SMMSP2004.
26 Yaroslavsky. Discrete Sinc Interpolation Methods and their Applications in Image Processing.
27 Astola, Danielian. Regularly Varying Skewed Distributions generated by Birth-Death Process.
28 Kulemin, Zelensky, Astola, Lukin, Egiazarian, Kurekin, Ponomarenko, Abramov, Tsymbal, Goroshko, Tarnavsky. Methods and Algorithms for Pre-processing and Classification of Multichannel Radar Remote Sensing Images.
29 Manninen, Linne, Yli-Harja. Proceedings of the 3rd TICSP Workshop on Computational Systems Biology, WCSB 2005.
30 Astola, Egiazarian, Saramäki. Proceedings of the 2005 International TICSP Workshop on Spectral Methods and Multirate Signal Processing, SMMSP2005.
31 Astola, Danielian. Frequency Distributions in Biomolecular Systems and Growing Networks.
32 Ruusuvuori, Manninen, Huttunen, Linne, Yli-Harja. Proceedings of the 4th TICSP Workshop on Computational Systems Biology, WCSB 2006.
33 Lugmayr. Proceedings of The TICSP Workshop on Ambient Media and Home Entertainment at the EuroITV 2006.
34 Astola, Egiazarian, Saramäki. Proceedings of the 2006 International TICSP Workshop on Spectral Methods and Multirate Signal Processing, SMMSP2006.
35 Lugmayr, Golebiowski. Interactive TV: A Shared Experience TICSP Adjunct Proceedings of EuroITV 2007.
36 Stankovic, Astola. Reprints from the Early Days of Information Sciences, 2007. E.I. Nečiporuk, "Network Synthesis by Using Linear Transformation of Variables".
37 Bregovic, Gotchev. Proceedings of the 2007 International TICSP Workshop on Spectral Methods and Multirate Signal Processing, SMMSP2007.
38 Grünwald, Myllymäki, Tabus, Weinberger, Yu. Festschrift in Honor of Jorma Rissanen on the Occasion of his 75th Birthday.
39 Stankovic, Astola. Gibbs Derivatives – the First Forty Years.
40 Stankovic, Astola. Reprints from the Early Days of Information Sciences. On the contributions of Akira Nakajima to the Switching Theory, 2008.
41 Ahdesmäki, Strimmer, Radde, Rahnenführer, Klemm, Lähdesmäki, Yli-Harja. Fifth International Workshop on Computational Systems Biology, WCSB2008.
42 Lugmayr, Kemper, Obrist, Mirlacher, Tscheligi. Changing Television Environments TICSP Adjunct Proceedings of the EuroITV 2008.
43 Heikkonen, Kontoyiannis, Liski, Myllymäki, Rissanen, Tabus. Proceedings of the First Workshop on Information Theoretic Methods in Science and Engineering, 2008.
44 Foi, Gotchev. The 2008 International Workshop on Local and Non-Local Approximation in Image Processing, LNLA2008.
45 Totsky, Lukin, Zelensky, Astola, Egiazarian, Khlopov, Morozov, Kurbatov, Molchanov, Roenko, Fevralev. Bispectrum-Based Methods and Algorithms for Radar, Telecommunication Signal Processing and Digital Image Reconstruction, 2008.
46 Stankovic, Astola. Reprints from the Early Days of Information Sciences. On the Contributions of P.S. Poreckij to the Switching Theory, 2008.
47 Egiazarian, Gabbouj, Tabus. Festschrift in Honor of Jaakko Astola on the Occasion of his 60th Birthday, 2009.
48 Manninen, Wiuf, Lähdesmäki, Grzegorczyk, Rahnenführer, Ahdesmäki, Linne, Yli-Harja (eds.). Sixth International Workshop on Computational Systems Biology, WCSB 2009.
49 Heikkonen, Kontoyiannis, Liski, Myllymäki, Rissanen, Tabus (eds.). Proceedings of the Second Workshop on Information Theoretic Methods in Science and Engineering 2009.
50 Stanković, Astola (eds.). Reprints from the Early Days of Information Sciences, On the Contributions of Arto Salomaa to Multiple-Valued Logic, 2009.
51 Nykter, Ruusuvuori, Carlberg, Yli-Harja (eds.). Seventh International Workshop on Computational Systems Biology, WCSB 2010.
52 Stankovic, Astola (eds.). Reprints from the Early Days of Information Sciences, Interview with Arto Salomaa. 2010.
53 Stankovic, Stankovic (eds.). Reprints from the Early Days of Information Sciences. Professor Heinz Zemanek – Reminiscences to the Work in Switching Algebra. 2010.
54 Stankovic, Astola (eds.). Reprints from the Early Days of Information Sciences. Paul Ehrenfest – Remarks on Algebra of Logic and Switching Theory. 2010.
55 Kenji Yamanishi, Ioannis Kontoyiannis, Erkki P. Liski, Petri Myllymäki, Jorma Rissanen & Ioan Tabus (Eds.). Proceedings of the Third Workshop on Information Theoretic Methods in Science and Engineering. 2010.
56 Ryabko, Jaakko Astola & Mikhail Malyutov. Compression-Based Methods of Prediction and Statistical Analysis of Time Series: Theory and Applications. 2010.
57 Jugoslava Acimovic, Juha Kesseli, Tuomo Mäki-Marttunen, Antti Larjo, Heinz Koeppl & Olli Yli-Harja (Eds.). Eighth International Workshop on Computational Systems Biology, WCSB 2011. 2011.
58 Radomir S. Stankovic, Jaakko T. Astola. Reprints from the Early Days of Information Sciences. Reminiscences of Early Work on Walsh Functions - Interviews with Franz Pichler, William R. Wade, and Ferenc Schipp, 2011.
59 Jaakko Astola, Radomir Stankovic (eds.). Reprints from the Early Days of Information Sciences - Early Work of Professor Aimo Tietäväinen in Number Theory and Coding Theory. 2012.
60 Jaakko Astola, Radomir Stankovic (eds.). Reprints from the Early Days of Information Sciences - Reminiscences of the Early Work in DCT, Interview with Professor K.R. Rao. 2012.

Tampereen Teknillinen Yliopisto
PL 553
33101 Tampere

Tampere University of Technology
Tampere International Center for Signal Processing
P.O.B. 553

ISBN 978-952-15-2853-8
ISSN 1456-2774