Inferring Adaptive Regulation Thresholds and ... - IEEE Xplore

2 downloads 0 Views 1MB Size Report
The method was tested on a Saccharomyces cerevisiae gene expression data set. ... genetic regulatory networks, machine learning, gene expression data, ...
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS,

VOL. 4,

NO. 4,

OCTOBER-DECEMBER 2007

1

Inferring Adaptive Regulation Thresholds and Association Rules from Gene Expression Data through Combinatorial Optimization Learning Ignacio Ponzoni, Francisco J. Azuaje, Juan Carlos Augusto, and David H. Glass Abstract—There is a need to design computational methods to support the prediction of gene regulatory networks (GRNs). Such models should offer both biologically meaningful and computationally accurate predictions which, in combination with other techniques, may improve large-scale integrative studies. This paper presents a new machine-learning method for the prediction of putative regulatory associations from expression data which exhibit properties never or only partially addressed by other techniques recently published. The method was tested on a Saccharomyces cerevisiae gene expression data set. The results were statistically validated and compared with the relationships inferred by two machine-learning approaches to GRN prediction. Furthermore, the resulting predictions were assessed using domain knowledge. The proposed algorithm may be able to accurately predict relevant biological associations between genes. One of the most relevant features of this new method is the prediction of adaptive regulation thresholds for the discretization of gene expression values, which is required prior to the rule association learning process. Moreover, an important advantage consists of its low computational cost to infer association rules. The proposed system may significantly support exploratory large-scale studies of automated identification of potentially relevant gene expression associations. Index Terms—Combinatorial optimization, genetic regulatory networks, machine learning, gene expression data, decision trees.

Ç 1

BACKGROUND

A

gene regulatory network (GRN) aims to represent high-level relationships that govern the rates at which genes in the network are transcribed into mRNA. In this way, genes can be viewed as nodes in this network whose expression levels (outputs) are controlled by other nodes (transcription factors). Nowadays, the inference, modeling, and simulation of GRNs is a fundamental topic in functional genomics [1], [2]. Over the past few years, several statistical and artificial intelligence techniques have been proposed to carry out the reverse engineering of GRNs from monitoring and analyzing large-scale gene expression data [3], [4]. Clustering algorithms represented one of the first approaches to support the large-scale identification of regulatory modules [5], [6]. Such an approach approximated regulatory networks by 1) identifying groups of coexpressed genes and 2) analyzing relationships between their regulatory regions and DNA binding motifs targeted by known transcription factors. A key limitation of this approach is that it assumes that coexpression is always equivalent to regulation. Moreover, this method implies

. I. Ponzoni is with the Department of Computer Science and Engineering, Universidad Nacional del Sur, Av. Alem 1253, Bahı´a Blanca, CP 8000, Argentina. E-mail: [email protected]. . F.J. Azuaje is with the Computer Science Research Institute and the School of Computing and Mathematics, University of Ulster at Jordanstown, BT37 0QB, Newtownabbey, Co. Antrim, UK. E-mail: [email protected]. . J.C. Augusto and D.H. Glass are with the School of Computing and Mathematics, University of Ulster at Jordanstown, BT37 0QB, Newtownabbey, Co. Antrim, UK. E-mail: {jc.augusto, dh.glass}@ulster.ac.uk. Manuscript received 14 Mar. 2006; revised 23 Aug. 2006; accepted 6 Nov. 2006; published online 22 Jan.2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0057-0306. Digital Object Identifier no. 10.1109/TCBB.2007.1049. 1545-5963/07/$25.00 ß 2007 IEEE

symmetric relationships between the genes, which may not always correspond to biological phenomena. Within the area of machine learning, Boolean Networks were one of the first models to be employed in GRNs inference [7], [8] and variations of this approach have been published recently [9]. These models basically aim at inferring logical rules from a discretization of gene expression time series. Even though these models can be easily applied, they depend on arbitrary discretizations of the gene expression values [10], which impose strong assumptions and restrictions about the biological system under study. Bayesian Networks have also provided the basis for several approaches to inferring GRNs [11], [12], [13]. These methods employ conditional probabilistic distributions for gene interactions modeling. Despite the strong theoretical rationale behind these approaches, the exponential explosion of the parameter space required for these models, together with the large quantity of data needed to make reliable inferences, reduces their capacity to infer complex GRNs by using gene expression data only. Since they are acyclic directed graphs, they cannot represent autoregulation or time-course regulation in a straightforward way [2]. From the area of evolutionary computing approaches [14], several methods were proposed. Ando et al. [15] presented an algorithm that combines genetic programming with the minimum least squares method. This technique infers a differential equation system that represents regulation interactions between genes. Although this method may be robust in statistical terms, the algorithm was only tested on small GRNs (10 genes) and the authors detected important scalability limitations when applied to more complex data. Iba and Mimura [16] proposed an iterative Published by the IEEE CS, CI, and EMB Societies & the ACM

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS,

VOL. 4,

NO. 4,

OCTOBER-DECEMBER 2007

inference approach based on a genetic algorithm (GA) whose learning process was guided by a molecular biologist. The main goal was to allow the expert to perform interactive analysis and validation of the results based on the introduction of new constraints until a GRN with a high level of predictive confidence was achieved. One of the most important drawbacks of this methodology is that it requires the biologist to have a good understanding of the dynamics of the GA in order to select optimum learning parameters. Recently, Hallinan and Wiles proposed an evolutionary algorithm [17], which predicts GRNs based on the Artificial Genome model presented by Reil [18]. Although this model is more biologically plausible than traditional machine-learning methods and presents potentially useful properties, the network dynamics rely on synchronous updating, which is biologically implausible. On the other hand, when a more realistic asynchronous updating scheme was used, the dynamic behavior collapsed at a single point attractor under almost all conditions [17]. Soinov et al. [10] approached the task of reconstructing GRNs as a classification problem. In summary, the authors proposed the application of decision trees to infer classifiers that may represent regulatory rules (relationships) between genes. They applied the C4.5 algorithm to infer the decision trees [19]. This method’s computational efficiency limitations are well known for classification problems with continuous-valued attributes [20], which is the case in the GRNs inference problem since the gene expression values are real numbers. Although this is a sound and conceptually interesting approach, it may exhibit significant predictive limitations when dealing with more complex GRNs (that is, networks that may consist of hundreds or thousands of genes). Another important category of predictive approaches includes several methods that are based on the detection of modules of genes significantly coexpressed under specific conditions [21], [22], [23]. Such modules allow both the approximation of higher level network representations and module-specific relationships. This divide-and-conquer approach is a useful option for achieving reliable predictions in the absence of larger amounts of expression samples. However, recent evidence suggests conflicting views about the meaning and nature of functional modules represented in GRNs [24]. For a more comprehensive review of GRN inference methods, the reader is referred to [2], [25], and [26].

regulation rules). Thus, a new machine-learning algorithm based on combinatorial optimization, from now on referred to as GRNCOP (abbreviated from Gene Regulatory Network inference by Combinatorial OPtimization), is assessed. This method infers association rules that represent interactions between genes, which are obtained from gene expression data sets. The discovered rules may be used to predict the gene expression states of a gene in terms of the gene expression values of other genes and, in this way, a putative GRN may then be reconstructed by applying and combining these rules. Our approach offers several advantages in relation to existing methods. First of all, it does not assume arbitrary and uniform gene expression value discretizations. Second, GRNCOP is not constrained by regulatory symmetry relationships that are shown by clustering-based network inference techniques. Third, the results can be easily interpreted since the association rules are derived from models that classify the different regulation states. Finally, the algorithm computes the potential interactions between genes with a low computational effort of Oðn2 Þ, where n is the number of genes in the GRN. Moreover, the new methodology may in principle be adapted to other modeling approaches, such as modular methods [21], [22], [23] and multisource-based prediction techniques [27]. GRNCOP aims at inferring different types of rules that capture relevant associations (that is, potential regulatory relationships) reflected in the expression values of the genes. In order to test this approach, GRNCOP was applied to the microarray data sets presented by Spellman et al. [28] to infer a GRN relevant to the yeast cell cycle. The results were statistically validated. Moreover, the rules generated by GRNCOP were compared to relationships inferred by two other published methods [10], [12]. Furthermore, biologically relevant predictions were verified and potentially novel predictions were assessed through literature searches and an analysis of curated functional annotations derived from the Saccharomyces cerevisiae Genome Database (SGD). The rest of this paper is organized as follows: Key definitions to interpret the regulatory rules and the combinatorial optimization problem are introduced in Section 2, the new algorithm is explained in Section 3, and experimental results obtained by GRNCOP are discussed in Section 4. A summary of contributions, future research, and conclusions are presented in Section 5.

1.1 Proposed Approach The method proposed in this paper addresses key limitations shown by data-driven whole-set GRNs prediction methods. The main objective is to provide a user-friendly, biologically meaningful, and computationally efficient algorithm to support the inference of complex putative GRNs. We do not claim that data-driven machine-learning approaches are sufficient to infer biologically meaningful networks. However, such tools may provide significant evidence necessary to aid scientists in detecting and validating biologically relevant associations. Moreover, the method proposed here neither makes strong statistical assumptions nor applies arbitrary expression discretization schemes (including adaptive thresholds for inferring

2 2.1

SYSTEMS

AND

METHODS

Gene Expression Association Rules and GRNS Inference The time series encoded in a gene expression data set may be represented by means of a gene expression data matrix, X , where the rows and columns represent genes and samples (experimental perturbations or conditions), respectively. In this way, each element x ij of X contains the expression value of gene i in the sample j. Although the gene expression values belong to a continuous range of the real numbers, it is possible to define a finite expression state set for each gene by means of a discretization procedure. Such a procedure is required in order to encode the inputs to any combinatorial optimization

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

PONZONI ET AL.: INFERRING ADAPTIVE REGULATION THRESHOLDS AND ASSOCIATION RULES FROM GENE EXPRESSION DATA...

process or other machine-learning methods. The results reported in this paper, as in previous representative studies, concentrate on two states for each gene: upregulated (when the gene is expressed with a value greater than its mean gene expression value) and downregulated (when the gene is expressed with a value less than or equal to its mean expression value). Nevertheless, the model can be generalized to any number of states in a straightforward way. In GRNCOP, the state of a gene i in a sample j is denoted as sij and i represents the mean expression value for this gene. Thus, sij ¼ 1 if xij > i ; otherwise, sij ¼ 1. On the other hand, the inference process also requires the definition of discretization thresholds in order to infer putative regulatory relationships between genes. These “regulation thresholds” have traditionally been estimated as unique static values for all of the genes under study. For example, ad hoc methods based on mean expression values have been applied. However, a more biologically meaningful scheme should model the fact that a gene may actually have distinct regulation thresholds in relation to different genes in the regulatory network. For example, regarding the regulatory network under study (see Section 4), the genes CLB2 and SWI5 are shown to be inhibited by gene CLB1, but their respective downregulation thresholds are different. CLB2 is downregulated (or inhibited) when the gene expression value of CLB1 is above 0.07, whereas SWI5 is downregulated when the gene expression of CLB1 is above -0.28. Therefore, a fundamental problem consists of estimating the regulation thresholds for each gene in relation to each potential target gene, which can more accurately reflect significant interactions between genes. At this point, our hypothesis is stated as follows: Association rules (that is, potential regulatory relationships) may be accurately inferred from expression data to reveal how the present and future state of a gene may be affected by the gene expression values of the other genes, taking into account their relative regulation thresholds. In this paper, we consider three types of association rules: simultaneous, time-delay, and change-based rules. The rule types are the same as those studied by Soinov et al. [10] and Bulashevska and Eils [12], but the rule syntax adopted here is slightly different. In particular, Soinov et al. [10] referred to the third group of associations as changes rules. In this paper, we refer to such relationships as change-based rules. Simultaneous rules represent the situation in which the state of a gene i in a sample j depends on the gene expression values of other genes in the same sample j. The syntax for these rules is < symbol >< gene >,< symbol >< gene > . The symbols þ and  on the left side of the rule indicate above and below some specific regulation threshold, respectively, whereas the symbols þ and  on the right side of the rule indicate upregulated and downregulated states, respectively. For example, the rule þCLB1 , þCLB2 denotes that, when CLB1 is above its regulation threshold in relation to CLB2 t CLB1;CLB2 in a sample, then CLB2 will be upregulated in the same sample. Time-delay rules represent the situation in which the state of a gene i in a sample j depends on the gene expression values of other genes in the previous sample (that is, previous experimental condition) j  1. The syntax for these rules is

3

TABLE 1 Summary of the Different Types of Association Rules Inferred by GRNCOP

The first column encodes the cases.

< symbol >< gene >!< symbol >< gene > . The symbols þ and  on the left side of the rule indicate above and below some specific regulation threshold, respectively, whereas the symbols þ and  on the right side of the rule indicate upregulated and downregulated states, respectively. For example, the rule þ=  CLB1 ! = þ MCM1 denotes that, if CLB1 is above its regulation threshold in relation to MCM1, tCLB1;MCM1 , in a sample, then MCM1 will be downregulated in the next sample and, if CLB1 is below or equal to tCLB1;MCM1 in a sample j, then MCM1 will be upregulated in the next sample j þ 1. Finally, change-based rules represent events of the transition-state machine corresponding to the GRN. The syntax for these rules is < symbol >< gene >)< symbol >< gene > . In both sides of the rule, the symbols þ and  indicate upregulated and downregulated states, respectively. For example, the rule þCLB1 ) þCLB2 denotes that, when the gene CLB1 changes its state from downregulated to upregulated, then the gene CLB2 will also change its state from downregulated to upregulated at the same experimental condition j. The six resulting regulation cases for the three types of rules are shown in Table 1. Note that two different types of discretization are defined in this paper. The first one is to set the state of each gene, which is computed using its mean expression value, and the second one is to evaluate the potential interaction between each pair of genes and it is calculated in an adaptive gene-pair-specific way. In this paper, we focus on the impact of adaptive regulation thresholds in the rule inference process. However, the study of adaptive thresholds for the definition of the gene’s states is another potential improvement of existing GRN inference methods. This task will be part of future research.

2.2

Combinatorial Optimization for Putative GRNs Inference GRNCOP infers the association rules described above by exploring the possible combinations of interactions between each pair of genes. In this sense, we assume six particular cases, which are represented by the nonnull integer numbers between 3 and 3 and a special case that indicates the absence of association, which is represented by the number 0. All of these cases are described in Table 1.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS,

VOL. 4,

NO. 4,

OCTOBER-DECEMBER 2007

Fig. 2. Schema of the phase 3 subroutine corresponding to the GRNCOP algorithm. The discretization step is shown outside of the loop. Fig. 1. General schema of the GRNCOP algorithm. Dotted arcs indicate the connection between the main program and the subroutine used for phases 1 and 2 (gray box).

In mathematical terms, the inference of the rules to reconstruct a GRN can be expressed as the following combinatorial optimization problem: n [ i ¼1

max  ð i ;  ðX X; i ÞÞ; i 2P

ð1Þ

subject to: n ¼ number of genes in the microarray data set, m ¼ number of samples in the microarray data set, X 2