A New Profile Alignment Method for Clustering Gene Expression Data Ataul Bari1 and Luis Rueda2 1
2
School of Computer Science, University of Windsor 401 Sunset Avenue, Windsor, ON, N9B 3P4, Canada
[email protected] Department of Computer Science, University of Concepci´ on Edmundo Larenas 215, Concepci´ on, Chile.
[email protected]
Abstract. We focus on clustering gene expression temporal profiles, and propose a novel, simple algorithm that is powerful enough to find an efficient distribution of genes over clusters. We also introduce a variant of a clustering index that can effectively decide upon the optimal number of clusters for a given dataset. The clustering method is based on a profilealignment approach, which minimizes the mean-square-error of the first order differentials, to hierarchically cluster microarray time-series data. The effectiveness of our algorithm has been tested on datasets drawn from standard experiments, showing that our approach can effectively cluster the datasets based on profile similarity.
1
Introduction
Grouping (or clustering) genes based on the similarity of their temporal profiles is important to researchers as genes with similar expression profiles are expected to be functionally related or co-regulated [3][4][6]. Many unsupervised methods for gene clustering based on similarity (or dissimilarity) of their microarray temporal profiles have been proposed in the past few years [1][6][12]. These methods have used different kinds of distance measures to group genes based on similarity (or dissimilarity) among the microarray time-series profiles. The most commonly used distance measures are the Euclidean distance, the Mahalanobis distance, the Manhattan distance and its generalization, the Minkowski distance [5]. One of the methods for clustering microarray time-series data is based on a hidden phase model (similar to a hidden Markov model) to define the parameters of a mixture of normal distributions in a Bayesian-like manner, which are estimated by using the expectation maximization algorithm [2]. This model has been recently introduced and, to the best of our knowledge, has only been tested on synthetic data. On the other hand, some authors have proposed linear-correlation methods for clustering genes using microarray time-series data [3][7]. The method proposed in [3] requires computing the mean expression levels of some candidate profiles using some pre-identified, arbitrarily selected profiles. The authors of [7]
10
8
8
6
6
6
4
2
0
4
2
0
0
1
2
3
Time in hours
(a)
4
4
2
0
Corr. Coff. = 0.9053
Corr. Coff = 0.8692
−2
−4
Expression level
10
8
Expression level
Expression level
10
−2
5
−4
Corr. Coff. = 0.8039
−2
0
1
2
3
Time in hours
(b)
4
5
−4
0
1
2
3
4
5
Time in hours
(c)
Fig. 1. (a) Two genes that are likely to be clustered as in [10]. (b) Two genes with different profiles that are likely to be clustered together by linear-correlation methods. (c) Two genes with similar profiles in terms of level of expression ratio changes between different time points.
proposed a method for clustering microarray time-series data employing a jackknife correlation coefficient with or without using the seeded candidate profiles. Clustering using the correlation distance may not always group genes that are closer in terms of their temporal profiles. For example, as shown in Fig. 1, using the correlation distance, genes in Fig. 1(b) are most likely to be clustered together (as it has the largest value for the correlation coefficient among all three, which is 0.9053), but if the prime interest is to cluster genes according to the variation of their expression level at different time points, then, genes from Fig. 1(c) would be better candidates to be clustered together than the genes in Fig. 1(a) and (b). However, the value of the correlation coefficient between the pairs of genes in Fig. 1(c) is the minimum (0.8039) among all three pairs of genes shown in Fig. 1. In [10], the authors proposed a method to select and cluster genes using the ideas of order-restricted inference, where estimation makes use of known inequalities among parameters. Although their method does not require arbitrarily selected genes to specify candidate profiles, it is necessary to specify the candidate profiles a priori. Also, as shown in Fig. 1 (a) and (c), by following this procedure, genes from Fig. 1(a) are more likely to be clustered together as they show similar profiles in terms of direction of the changes of expression ratios (e.g. up-up-up-down-down), even though, for one gene, the expression ratio increases sharply between the first three time points and decreases sharply between time points 3 and 4, whereas such increments/decrements are much softer for the second gene. However, in reality, it may be more desirable to cluster genes of Fig. 1(c) in the same group as they differ only a little amount of increase/decrease between the time points 2 and 3 and the time points 3 and 4. In this paper, we propose a mean-square-error profile alignment approach to cluster temporal gene expression data. We propose to use a profile-alignment algorithm, which minimizes the mean-square-error of the first order differentials, to hierarchically cluster microarray time-series data. The alignment algorithm
is also used to define a variant of a well-known clustering validity index that optimizes the number of clusters.
2
Minimum-Square-Error Profile Alignment
Consider a dataset D = {x1 , x2 , ..., xn }, where xi = [xi1 , xi2 , ..., xim ]t is an mdimensional feature vector that represents the expression level of gene i at m different time points. The aim is to partition D into k disjoint subsets D1 , D2 , ..., Dk , where D = D1 ∪ D2 ∪ ... ∪ Dk , and Di ∩ Dj = ∅, for ∀i, j, i 6= j, in such a way that a similarity (dissimilarity) cost function φ : {0, 1}n×k → < is maximized (minimized). We propose an efficient algorithm that takes two features vectors, and produces two new vectors in such a way that the mean-square-error difference between “aligned” vectors is minimized. Let x1 , x2 ∈ D be two feature vectors. The aim is to find two new vectors x1 and x02 = x2 − a (e.g. to find a scalar a) 2 such that, f (a) = kx2 − Pam− x1 k is minimized. The value of a that minimizes 1 f (a) is given by a = m i=1 (x2i − x1i ) (see [11]). Using the latter expression to obtain the value of a, i.e. the scalar that minimizes f (a), we align two vectors by following the procedure given in Algorithm Profile-Alignment. The algorithm takes two feature vectors from the original space as input (which are two temporal gene expression data in this case) and outputs two feature vectors in the transformed space after aligning them in such a way that the mean-square-error is minimized. Using the Profile-Alignment algorithm and a conventional distance function, we define a new distance function, namely dM SE (x1 , x2 ), which aligns x1 and x2 , and invokes a conventional metric [11]. In [11], it has been shown that the Profile-Alignment algorithm used in conjunction with any metric d is also a metric. Let d : 22 or k < 22. Therefore, we have taken the value of k = 22 as the optimal number of clusters and plotted the profiles, as clustered using the MSEHC method corresponding to k = 22. The plots are shown in Fig. 5. Then, similar to the first experiments, we have clustered the complete dataset into 22 clusters using the correlation
distance and the same hierarchical agglomerative clustering method, applied to the original data, resulting in 22 clusters as shown in Fig. 6. The comparison among the plots from Figs. 5 and 6, again, reveal the effectiveness of the MSEHC method. For example, the MSEHC method left clusters 1 to 6 containing a single gene each (IDs 328692, 470934, 361247, 147050, 323946 and 310406, respectively in Fig. 5). The correlation method, however, placed these genes in clusters 21, 2, 2, 13, 2 and 21, respectively (Fig. 6). Again, by visual inspection of all the temporal expression profiles, we noticed that these genes are differentially expressed and should be left alone in separate clusters, which is clearly done by the MSEHC method. Also, as shown in Fig. 5, the MSEHC method produced four clusters containing only two profiles each, clusters 7 (IDs 356635 and 429460 ), 9 (IDs 280768 and 416842), 14 (IDs 130476 and 130482) and 20 (IDs 26474 and 254436). The correlation method clustered these genes as follows: 356635 and 429460 in cluster 18, 280768 and 416842 in cluster 2, 130476 and 130482 in cluster 18, and 26474 and 254436 in cluster 6 (Figs. 5 and 6). Although the correlation method placed each pair of genes in the same cluster, it also placed some other genes with them. By looking at the plots of the profiles of the clusters produced by the correlation method (Fig. 6) and comparing them to the plots of the clusters of the corresponding genes produced by the MSEHC method (Fig. 5), it is clear that these pairs of genes are differentially expressed. Therefore, the gene profiles placed in the same cluster by the MSEHC method are more similar than those of the correlation method.
6
Conclusions
We have proposed a novel method to cluster gene expression temporal profile microarray data. On a sample and a complete real-life dataset, we have demonstrated that using hierarchical clustering with our method for similarity measure produces superior results when compared to that of the linear correlation similarity measure. We have introduced a variant of the I-index that can make a trade-off between minimizing the number of useful clusters and keeping the distinctness of individual clusters. The MSEHC method can be used for effective clustering of gene expression temporal profile microarray data. Although we have shown the effectiveness of the method in microarray time-series datasets, we are planning to investigate the effectiveness of the method as well in dose-response microarray datasets, and other time-series microarray data.
Acknowledgments: This research has been supported by the Natural Sciences and Engineering Research Council of Canada, the Canadian Foundation for Innovation, the Ontario Innovation Trust, and the Chilean National Council for Technological and Scientific Research, FONDECYT grant No. 1060904.
2.5
2
3
2.5
2
1.5
1.5
1
1
0.5
15
0
20
0 5
2
1
0
20
15
4
2.5
2
20
i=9 4
Expression ratio
3.5
3
2.5
2
1
0.5
0
5
10
15
0
20
5
5
4.5
4.5
i = 10
i = 11
4
4
3.5
3.5
3
2.5
2
3
2.5
2
1.5
2
1
1
0.5
0.5
15
0
20
0
5
10
15
0
20
0
5
10
Time in hours
Time in hours
20
1.5
1
10
15
2.5
0.5
5
10
3
1
0
5
i = 12
4
3.5
0.5
0
0
Time in hours
5
4.5
1.5
1.5
2
Time in hours
Expression ratio
4.5
3
2.5
1.5
Time in hours
5
20
3.5
3
0 10
15
i=8
0.5
5
10
4.5
1
0
5
Time in hours
1.5
Time in hours
0
5
4
0.5
15
0
20
3.5
2
1
15
i=7
2.5
1.5
10
10
5
3
1.5
5
5
4.5
Expression ratio
2.5
0
0.5
0
Time in hours
4
Expression ratio
Expression ratio
20
i=6
0.5
Expression ratio
15
3.5
488488
470008 3
0
10
4.5
359769
4
3.5
2
1
0.5
0
5
i=5
2.5
1.5
Time in hours
5
3
1
Time in hours
4.5
2
Expression ratio
10
2.5
Expression ratio
5
i=4
3.5
3
1.5
0.5
0
4.5
4
Expression ratio
320355 3
i=3
4
3.5
3.5
Expression ratio
Expression ratio
5
4.5
i=2 4
3.5
0
5
5
4.5
i=1
4
Expression ratio
5
4.5
15
0
20
0
5
10
15
20
Time in hours
Time in hours
5
4.5
i = 13
4
Expression ratio
3.5
3
2.5
2
1.5
1
0.5
0
0
5
10
15
20
Time in hours
Fig. 3. Different clusters obtained using the MSEHC on the 260 gene temporal expression profiles, where k = 13. 5
5
5
4.5
i=1
3.5
3.5
3.5
3
2.5
112179
2.5
40630 2
3
2.5
2
1.5
1.5
1.5
1
1
1
0.5
5
10
15
0
20
5
10
15
0
20
2.5
2
Expression ratio
488488
470008
4
4
3.5
3.5
3
2.5
2
3
2.5
2
1.5
1
1
0.5
5
10
15
0
20
5
15
0
20
Expression ratio
3
2.5
2
1.5
1
0
5
10
15
Time in hours
20
2
1
5
10
15
0
20
i = 10
4
3.5
2.5
2
3
2.5
2
1.5
2.5
2
1.5
1
1
1
0.5
0.5
15
0
20
20
3
0.5
10
15
i = 12
4
3.5
5
10
i = 11
4
3.5
0
5
Time in hours 5
4.5
0
0
Time in hours 5
4.5
1.5
0.5
3
5
3
20
i=8
2.5
4.5
Expression ratio
i=9 4
3.5
15
0.5
0
Time in hours
Time in hours 5
4.5
0
10
10
1.5
0.5
0
5
Time in hours
4
1.5
0
0
5
4.5
3.5
1
0.5
0
20
i=7
1.5
0
15
4.5
Expression ratio
i=5 359769
10
i=6
4
3
5
5
4.5
3.5
2
Time in hours
5
4.5
2.5
0.5
0
Time in hours
5
320355
1
0.5
0
Time in hours
Expression ratio
0
i=4
3
1.5
Expression ratio
2
3
Expression ratio
4
3.5
Expression ratio
4
0
Expression ratio
4.5
i=3
4
0.5
Expression ratio
5
4.5
i=2
4
Expression ratio
Expression ratio
4.5
0
5
10
15
Time in hours
Time in hours
20
0
0
5
10
15
20
Time in hours
5
4.5
i = 13 4
Expression ratio
3.5
3
2.5
2
1.5
1
0.5
0
0
5
10
15
20
Time in hours
Fig. 4. Different clusters obtained using the correlation distance on the 260 gene temporal expression profiles, where k = 13.
30
328692
15
10
15
10
5
5
5
10
15
5
10
15
20
30
15
323946 10
310406 15
10
0
0 10
15
5
Time in hours
15
416842
15
10
0
20
0
5
10
15
5
15
i = 12
20
15
10
0 5
10
15
20
0
5
Time in hours
10
15
20
Time in hours
30
30
i = 14
i = 15
25
20
5
0
20
30
i = 13
10
Time in hours
25
Time in hours
Time in hours
25
0
30
10
0 15
20
15
0 10
15
20
5
5
10
25
5
0
10
i = 11
5
30
15
0 5
30
Expression ratio
280768
10
i=8
20
Time in hours
20
20
5
0
25
Expression ratio
20
15
356635 429460
10
i = 10
i=9
15
25
15
20
30
25
10
Time in hours
20
Time in hours
30
Expression ratio
10
5
30
0 0
20
0
20
5
5
5
15
25
Expression ratio
Expression ratio
Expression ratio
20
5
10
i=7
i=6 25
10
0 5
30
i=5
20
15
Time in hours
30
25
147050
20
5
0
Time in hours
Time in hours
0
10
0 0
20
15
Expression ratio
0
20
5
0
0
25
361247
470934
20
i=4
i=3 25
Expression ratio
Expression ratio
Expression ratio
20
30
30
i=2
25
Expression ratio
i=1 25
Expression ratio
30
i = 16
25
25
10
130482 15
10
5
5
5
10
15
5
Time in hours
10
15
10
5
15
10
15
5
Time in hours
10
15
10
15
15
10
20
20
254436 15
26474 10
5
0 0
5
10
15
20
0
5
10
15
20
Time in hours
Time in hours
30
20
i = 20 25
20
Time in hours 30
i = 21
i = 22
25
25
20
20
Expression ratio
Expression ratio
5
Time in hours
0
0
20
0
30
5
0
0
20
25
20
5
10
15
i = 19
Expression ratio
Expression ratio
15
5
10
30
25
20
0
5
i = 18
25
10
Time in hours
30
i = 17
15
0 0
Time in hours
30
Expression ratio
20
20
5
0 0
20
10
Expression ratio
0
15
5
0
0
20
Expression ratio
15
20
Expression ratio
Expression ratio
Expression ratio
130476 20
15
10
15
10
5
5
0
0 0
5
10
15
Time in hours
20
0
5
10
15
20
Time in hours
Fig. 5. Different clusters obtained using the MSEHC on the 517 gene temporal expression profiles, where k = 22.
30
30
30
i=1
30
i=3
i=2
25
25
25
i=4
25
10
5
470934 323946
280768
15
416842
10
15
20
0
5
10
30
i=5
10
i=6
5
20
254436
15
26474 10
5
10
15
20
5
10
15
5
15
10
0 15
20
0
5
10
15
10
20
0
5
15
i = 11
20
i = 12
25
15
10
20
15
10
5
0 5
10
15
20
0
5
10
15
20
Time in hours
Time in hours 30
30
i = 15
25
25
10
Time in hours 30
0
i = 14
i = 13
15
20
20
30
25
15
0 10
Time in hours
30
20
5
5
0
0 10
Time in hours
20
i=8
5
5
5
10
25
Expression ratio
Expression ratio
Expression ratio
10
0
15
0
i = 10
20
15
25
30
25
10
i=7
Time in hours
i=9
15
5
Time in hours
20
20
30
20
0
30
Time in hours
25
20
0 0
Time in hours 30
15
5
0 0
10
25
5
0
10
0 5
30
Expression ratio
15
15
Time in hours
25
Expression ratio
20
20
5
0
20
30
25
Expression ratio
15
Time in hours
Expression ratio
10
Time in hours
Expression ratio
5
10
0
0 0
15
5
5
0
20
Expression ratio
15
20
Expression ratio
20
Expression ratio
Expression ratio
361247
i = 16
25
10
20
15
10
5
5
10
15
20
10
0
5
Time in hours
10
15
5
30
10
15
20
0
5
i = 18
25
10
15
20
Time in hours 30
30
i = 17
25
10
Time in hours
Time in hours
30
15
0 0
20
20
5
0
0
0
15
5
5
0
20
Expression ratio
15
Expression ratio
Expression ratio
Expression ratio
147050 20
i = 20
i = 19
25
25
10
356635
429460
130482
15
10
5
20
15
10
5
0
5
10
15
5
Time in hours
10
15
20
0
5
10
15
10
0
5
10
15
20
Time in hours
30
i = 22
i = 21 25
25
328692
20
Expression ratio
Expression ratio
20
Time in hours
Time in hours 30
15
0
0 0
20
20
5
5
0
0
Expression ratio
15
20
Expression ratio
Expression ratio
Expression ratio
130476 20
310406 15
10
5
20
15
10
5
0
0 0
5
10
Time in hours
15
20
0
5
10
15
20
Time in hours
Fig. 6. Different clusters obtained using the correlation distance on the 517 gene temporal expression profiles, where k = 22.
Table 1. k
Values of the IM SE index for 517 genes, where k = 16 to 27 and q = 0.3 to 1.0.
q=0.3
q=0.4
q=0.5
q=0.6
q=0.7
q=0.8
q=0.9
q=1.0
16 17344.58 13144.73 9961.84 7549.67 5721.58 4336.14 3286.18 2490.46 17 18791.87 14155.50 10663.02 8032.21 6050.49 4557.70 3433.21 2586.16 18 18685.75 13995.33 10482.28 7851.07 5880.32 4404.27 3298.73 2470.70 19 27333.00 20361.60 15168.29 11299.55 8417.55 6270.62 4671.27 3479.84 20 27169.35 20136.14 14923.59 11060.39 8197.23 6075.25 4502.58 3337.02 21 29057.98 21431.06 15805.99 11657.35 8597.61 6340.97 4676.64 3449.15 22 29577.07 21712.65 15939.35 11701.14 8589.86 6305.85 4629.15 3398.28 23 29554.55 21599.89 15786.24 11537.35 8432.05 6162.55 4503.89 3291.66 24 29493.69 21463.87 15620.21 11367.52 8272.65 6020.38 4381.29 3188.46 25 29981.57 21730.04 15749.49 11414.91 8273.29 5996.32 4346.01 3149.90 26 30081.89 21717.39 15678.71 11319.13 8171.76 5899.54 4259.13 3074.85 27 29959.90 21547.85 15497.71 11146.31 8016.69 5765.79 4146.89 2982.54
References 1. A. Brazma and J. Vilo. Gene expression data analysis. FEBS Lett., 480:17–24, 2000. 2. L. Br´eh´elin. Clustering Gene Expression Series with Prior Knowledge. In Lecture Notes in Computer Science, volume 3692, pages 27–38, October 2005. 3. S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. Brown, and I. Herskowitz. The transcriptional program of sporulation in budding yeast. Science, 282:699–705, 1998. 4. S. Dr˘ aghici. Data Analysis Tools for DNA Microarrays. Chapman & Hall, 2003. 5. R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, Inc., New York, NY, 2nd edition, 2000. 6. M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. In Proc. Natl Acad. Sci., volume 95, pages 14863–14868, USA, 1998. 7. L. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: identification and analysis of coexpressed genes. Genome Res., 9:1106–1115, 1999. 8. V. Iyer, M. Eisen, D. Ross, G. Schuler, T. Moore, J. Lee, J. Trent, L. Staudt, Jr. J. Hudson, and M. Boguski. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999. 9. U. Maulik and S. Bandyopadhyay. Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1650–1654, 2002. 10. S. Peddada, E. Lobenhofer, L. Li, C. Afshari, C. Weinberg, and D. Umbach. Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics, 19(7):834–841, 2003. 11. L. Rueda and A. Bari. Clustering Microarray Time-Series Data Using a MeanSquare-Error Profile Alignment Algorithm. Submitted for publication. Electronically available at http://www.inf.udec.cl/˜lrueda/papers/ProfileMSE-Jnl.pdf. 12. G. Sherlock. Analysis of large-scale gene expression data. Curr. Opin. Immunol., 12:201–205, 2000.