A New Profile Alignment Method for Clustering

3 downloads 0 Views 376KB Size Report
the Manhattan distance and its generalization, the Minkowski distance [5]. One of .... transformed space after aligning them in such a way that the mean-square-error ... Profile-Alignment algorithm used in conjunction with any metric d is also a .... corresponds to a value of q, and each cell shows the value of the index for the.
A New Profile Alignment Method for Clustering Gene Expression Data Ataul Bari1 and Luis Rueda2 1

2

School of Computer Science, University of Windsor 401 Sunset Avenue, Windsor, ON, N9B 3P4, Canada [email protected] Department of Computer Science, University of Concepci´ on Edmundo Larenas 215, Concepci´ on, Chile. [email protected]

Abstract. We focus on clustering gene expression temporal profiles, and propose a novel, simple algorithm that is powerful enough to find an efficient distribution of genes over clusters. We also introduce a variant of a clustering index that can effectively decide upon the optimal number of clusters for a given dataset. The clustering method is based on a profilealignment approach, which minimizes the mean-square-error of the first order differentials, to hierarchically cluster microarray time-series data. The effectiveness of our algorithm has been tested on datasets drawn from standard experiments, showing that our approach can effectively cluster the datasets based on profile similarity.

1

Introduction

Grouping (or clustering) genes based on the similarity of their temporal profiles is important to researchers as genes with similar expression profiles are expected to be functionally related or co-regulated [3][4][6]. Many unsupervised methods for gene clustering based on similarity (or dissimilarity) of their microarray temporal profiles have been proposed in the past few years [1][6][12]. These methods have used different kinds of distance measures to group genes based on similarity (or dissimilarity) among the microarray time-series profiles. The most commonly used distance measures are the Euclidean distance, the Mahalanobis distance, the Manhattan distance and its generalization, the Minkowski distance [5]. One of the methods for clustering microarray time-series data is based on a hidden phase model (similar to a hidden Markov model) to define the parameters of a mixture of normal distributions in a Bayesian-like manner, which are estimated by using the expectation maximization algorithm [2]. This model has been recently introduced and, to the best of our knowledge, has only been tested on synthetic data. On the other hand, some authors have proposed linear-correlation methods for clustering genes using microarray time-series data [3][7]. The method proposed in [3] requires computing the mean expression levels of some candidate profiles using some pre-identified, arbitrarily selected profiles. The authors of [7]

10

8

8

6

6

6

4

2

0

4

2

0

0

1

2

3

Time in hours

(a)

4

4

2

0

Corr. Coff. = 0.9053

Corr. Coff = 0.8692

−2

−4

Expression level

10

8

Expression level

Expression level

10

−2

5

−4

Corr. Coff. = 0.8039

−2

0

1

2

3

Time in hours

(b)

4

5

−4

0

1

2

3

4

5

Time in hours

(c)

Fig. 1. (a) Two genes that are likely to be clustered as in [10]. (b) Two genes with different profiles that are likely to be clustered together by linear-correlation methods. (c) Two genes with similar profiles in terms of level of expression ratio changes between different time points.

proposed a method for clustering microarray time-series data employing a jackknife correlation coefficient with or without using the seeded candidate profiles. Clustering using the correlation distance may not always group genes that are closer in terms of their temporal profiles. For example, as shown in Fig. 1, using the correlation distance, genes in Fig. 1(b) are most likely to be clustered together (as it has the largest value for the correlation coefficient among all three, which is 0.9053), but if the prime interest is to cluster genes according to the variation of their expression level at different time points, then, genes from Fig. 1(c) would be better candidates to be clustered together than the genes in Fig. 1(a) and (b). However, the value of the correlation coefficient between the pairs of genes in Fig. 1(c) is the minimum (0.8039) among all three pairs of genes shown in Fig. 1. In [10], the authors proposed a method to select and cluster genes using the ideas of order-restricted inference, where estimation makes use of known inequalities among parameters. Although their method does not require arbitrarily selected genes to specify candidate profiles, it is necessary to specify the candidate profiles a priori. Also, as shown in Fig. 1 (a) and (c), by following this procedure, genes from Fig. 1(a) are more likely to be clustered together as they show similar profiles in terms of direction of the changes of expression ratios (e.g. up-up-up-down-down), even though, for one gene, the expression ratio increases sharply between the first three time points and decreases sharply between time points 3 and 4, whereas such increments/decrements are much softer for the second gene. However, in reality, it may be more desirable to cluster genes of Fig. 1(c) in the same group as they differ only a little amount of increase/decrease between the time points 2 and 3 and the time points 3 and 4. In this paper, we propose a mean-square-error profile alignment approach to cluster temporal gene expression data. We propose to use a profile-alignment algorithm, which minimizes the mean-square-error of the first order differentials, to hierarchically cluster microarray time-series data. The alignment algorithm

is also used to define a variant of a well-known clustering validity index that optimizes the number of clusters.

2

Minimum-Square-Error Profile Alignment

Consider a dataset D = {x1 , x2 , ..., xn }, where xi = [xi1 , xi2 , ..., xim ]t is an mdimensional feature vector that represents the expression level of gene i at m different time points. The aim is to partition D into k disjoint subsets D1 , D2 , ..., Dk , where D = D1 ∪ D2 ∪ ... ∪ Dk , and Di ∩ Dj = ∅, for ∀i, j, i 6= j, in such a way that a similarity (dissimilarity) cost function φ : {0, 1}n×k → < is maximized (minimized). We propose an efficient algorithm that takes two features vectors, and produces two new vectors in such a way that the mean-square-error difference between “aligned” vectors is minimized. Let x1 , x2 ∈ D be two feature vectors. The aim is to find two new vectors x1 and x02 = x2 − a (e.g. to find a scalar a) 2 such that, f (a) = kx2 − Pam− x1 k is minimized. The value of a that minimizes 1 f (a) is given by a = m i=1 (x2i − x1i ) (see [11]). Using the latter expression to obtain the value of a, i.e. the scalar that minimizes f (a), we align two vectors by following the procedure given in Algorithm Profile-Alignment. The algorithm takes two feature vectors from the original space as input (which are two temporal gene expression data in this case) and outputs two feature vectors in the transformed space after aligning them in such a way that the mean-square-error is minimized. Using the Profile-Alignment algorithm and a conventional distance function, we define a new distance function, namely dM SE (x1 , x2 ), which aligns x1 and x2 , and invokes a conventional metric [11]. In [11], it has been shown that the Profile-Alignment algorithm used in conjunction with any metric d is also a metric. Let d : 22 or k < 22. Therefore, we have taken the value of k = 22 as the optimal number of clusters and plotted the profiles, as clustered using the MSEHC method corresponding to k = 22. The plots are shown in Fig. 5. Then, similar to the first experiments, we have clustered the complete dataset into 22 clusters using the correlation

distance and the same hierarchical agglomerative clustering method, applied to the original data, resulting in 22 clusters as shown in Fig. 6. The comparison among the plots from Figs. 5 and 6, again, reveal the effectiveness of the MSEHC method. For example, the MSEHC method left clusters 1 to 6 containing a single gene each (IDs 328692, 470934, 361247, 147050, 323946 and 310406, respectively in Fig. 5). The correlation method, however, placed these genes in clusters 21, 2, 2, 13, 2 and 21, respectively (Fig. 6). Again, by visual inspection of all the temporal expression profiles, we noticed that these genes are differentially expressed and should be left alone in separate clusters, which is clearly done by the MSEHC method. Also, as shown in Fig. 5, the MSEHC method produced four clusters containing only two profiles each, clusters 7 (IDs 356635 and 429460 ), 9 (IDs 280768 and 416842), 14 (IDs 130476 and 130482) and 20 (IDs 26474 and 254436). The correlation method clustered these genes as follows: 356635 and 429460 in cluster 18, 280768 and 416842 in cluster 2, 130476 and 130482 in cluster 18, and 26474 and 254436 in cluster 6 (Figs. 5 and 6). Although the correlation method placed each pair of genes in the same cluster, it also placed some other genes with them. By looking at the plots of the profiles of the clusters produced by the correlation method (Fig. 6) and comparing them to the plots of the clusters of the corresponding genes produced by the MSEHC method (Fig. 5), it is clear that these pairs of genes are differentially expressed. Therefore, the gene profiles placed in the same cluster by the MSEHC method are more similar than those of the correlation method.

6

Conclusions

We have proposed a novel method to cluster gene expression temporal profile microarray data. On a sample and a complete real-life dataset, we have demonstrated that using hierarchical clustering with our method for similarity measure produces superior results when compared to that of the linear correlation similarity measure. We have introduced a variant of the I-index that can make a trade-off between minimizing the number of useful clusters and keeping the distinctness of individual clusters. The MSEHC method can be used for effective clustering of gene expression temporal profile microarray data. Although we have shown the effectiveness of the method in microarray time-series datasets, we are planning to investigate the effectiveness of the method as well in dose-response microarray datasets, and other time-series microarray data.

Acknowledgments: This research has been supported by the Natural Sciences and Engineering Research Council of Canada, the Canadian Foundation for Innovation, the Ontario Innovation Trust, and the Chilean National Council for Technological and Scientific Research, FONDECYT grant No. 1060904.

2.5

2

3

2.5

2

1.5

1.5

1

1

0.5

15

0

20

0 5

2

1

0

20

15

4

2.5

2

20

i=9 4

Expression ratio

3.5

3

2.5

2

1

0.5

0

5

10

15

0

20

5

5

4.5

4.5

i = 10

i = 11

4

4

3.5

3.5

3

2.5

2

3

2.5

2

1.5

2

1

1

0.5

0.5

15

0

20

0

5

10

15

0

20

0

5

10

Time in hours

Time in hours

20

1.5

1

10

15

2.5

0.5

5

10

3

1

0

5

i = 12

4

3.5

0.5

0

0

Time in hours

5

4.5

1.5

1.5

2

Time in hours

Expression ratio

4.5

3

2.5

1.5

Time in hours

5

20

3.5

3

0 10

15

i=8

0.5

5

10

4.5

1

0

5

Time in hours

1.5

Time in hours

0

5

4

0.5

15

0

20

3.5

2

1

15

i=7

2.5

1.5

10

10

5

3

1.5

5

5

4.5

Expression ratio

2.5

0

0.5

0

Time in hours

4

Expression ratio

Expression ratio

20

i=6

0.5

Expression ratio

15

3.5

488488

470008 3

0

10

4.5

359769

4

3.5

2

1

0.5

0

5

i=5

2.5

1.5

Time in hours

5

3

1

Time in hours

4.5

2

Expression ratio

10

2.5

Expression ratio

5

i=4

3.5

3

1.5

0.5

0

4.5

4

Expression ratio

320355 3

i=3

4

3.5

3.5

Expression ratio

Expression ratio

5

4.5

i=2 4

3.5

0

5

5

4.5

i=1

4

Expression ratio

5

4.5

15

0

20

0

5

10

15

20

Time in hours

Time in hours

5

4.5

i = 13

4

Expression ratio

3.5

3

2.5

2

1.5

1

0.5

0

0

5

10

15

20

Time in hours

Fig. 3. Different clusters obtained using the MSEHC on the 260 gene temporal expression profiles, where k = 13. 5

5

5

4.5

i=1

3.5

3.5

3.5

3

2.5

112179

2.5

40630 2

3

2.5

2

1.5

1.5

1.5

1

1

1

0.5

5

10

15

0

20

5

10

15

0

20

2.5

2

Expression ratio

488488

470008

4

4

3.5

3.5

3

2.5

2

3

2.5

2

1.5

1

1

0.5

5

10

15

0

20

5

15

0

20

Expression ratio

3

2.5

2

1.5

1

0

5

10

15

Time in hours

20

2

1

5

10

15

0

20

i = 10

4

3.5

2.5

2

3

2.5

2

1.5

2.5

2

1.5

1

1

1

0.5

0.5

15

0

20

20

3

0.5

10

15

i = 12

4

3.5

5

10

i = 11

4

3.5

0

5

Time in hours 5

4.5

0

0

Time in hours 5

4.5

1.5

0.5

3

5

3

20

i=8

2.5

4.5

Expression ratio

i=9 4

3.5

15

0.5

0

Time in hours

Time in hours 5

4.5

0

10

10

1.5

0.5

0

5

Time in hours

4

1.5

0

0

5

4.5

3.5

1

0.5

0

20

i=7

1.5

0

15

4.5

Expression ratio

i=5 359769

10

i=6

4

3

5

5

4.5

3.5

2

Time in hours

5

4.5

2.5

0.5

0

Time in hours

5

320355

1

0.5

0

Time in hours

Expression ratio

0

i=4

3

1.5

Expression ratio

2

3

Expression ratio

4

3.5

Expression ratio

4

0

Expression ratio

4.5

i=3

4

0.5

Expression ratio

5

4.5

i=2

4

Expression ratio

Expression ratio

4.5

0

5

10

15

Time in hours

Time in hours

20

0

0

5

10

15

20

Time in hours

5

4.5

i = 13 4

Expression ratio

3.5

3

2.5

2

1.5

1

0.5

0

0

5

10

15

20

Time in hours

Fig. 4. Different clusters obtained using the correlation distance on the 260 gene temporal expression profiles, where k = 13.

30

328692

15

10

15

10

5

5

5

10

15

5

10

15

20

30

15

323946 10

310406 15

10

0

0 10

15

5

Time in hours

15

416842

15

10

0

20

0

5

10

15

5

15

i = 12

20

15

10

0 5

10

15

20

0

5

Time in hours

10

15

20

Time in hours

30

30

i = 14

i = 15

25

20

5

0

20

30

i = 13

10

Time in hours

25

Time in hours

Time in hours

25

0

30

10

0 15

20

15

0 10

15

20

5

5

10

25

5

0

10

i = 11

5

30

15

0 5

30

Expression ratio

280768

10

i=8

20

Time in hours

20

20

5

0

25

Expression ratio

20

15

356635 429460

10

i = 10

i=9

15

25

15

20

30

25

10

Time in hours

20

Time in hours

30

Expression ratio

10

5

30

0 0

20

0

20

5

5

5

15

25

Expression ratio

Expression ratio

Expression ratio

20

5

10

i=7

i=6 25

10

0 5

30

i=5

20

15

Time in hours

30

25

147050

20

5

0

Time in hours

Time in hours

0

10

0 0

20

15

Expression ratio

0

20

5

0

0

25

361247

470934

20

i=4

i=3 25

Expression ratio

Expression ratio

Expression ratio

20

30

30

i=2

25

Expression ratio

i=1 25

Expression ratio

30

i = 16

25

25

10

130482 15

10

5

5

5

10

15

5

Time in hours

10

15

10

5

15

10

15

5

Time in hours

10

15

10

15

15

10

20

20

254436 15

26474 10

5

0 0

5

10

15

20

0

5

10

15

20

Time in hours

Time in hours

30

20

i = 20 25

20

Time in hours 30

i = 21

i = 22

25

25

20

20

Expression ratio

Expression ratio

5

Time in hours

0

0

20

0

30

5

0

0

20

25

20

5

10

15

i = 19

Expression ratio

Expression ratio

15

5

10

30

25

20

0

5

i = 18

25

10

Time in hours

30

i = 17

15

0 0

Time in hours

30

Expression ratio

20

20

5

0 0

20

10

Expression ratio

0

15

5

0

0

20

Expression ratio

15

20

Expression ratio

Expression ratio

Expression ratio

130476 20

15

10

15

10

5

5

0

0 0

5

10

15

Time in hours

20

0

5

10

15

20

Time in hours

Fig. 5. Different clusters obtained using the MSEHC on the 517 gene temporal expression profiles, where k = 22.

30

30

30

i=1

30

i=3

i=2

25

25

25

i=4

25

10

5

470934 323946

280768

15

416842

10

15

20

0

5

10

30

i=5

10

i=6

5

20

254436

15

26474 10

5

10

15

20

5

10

15

5

15

10

0 15

20

0

5

10

15

10

20

0

5

15

i = 11

20

i = 12

25

15

10

20

15

10

5

0 5

10

15

20

0

5

10

15

20

Time in hours

Time in hours 30

30

i = 15

25

25

10

Time in hours 30

0

i = 14

i = 13

15

20

20

30

25

15

0 10

Time in hours

30

20

5

5

0

0 10

Time in hours

20

i=8

5

5

5

10

25

Expression ratio

Expression ratio

Expression ratio

10

0

15

0

i = 10

20

15

25

30

25

10

i=7

Time in hours

i=9

15

5

Time in hours

20

20

30

20

0

30

Time in hours

25

20

0 0

Time in hours 30

15

5

0 0

10

25

5

0

10

0 5

30

Expression ratio

15

15

Time in hours

25

Expression ratio

20

20

5

0

20

30

25

Expression ratio

15

Time in hours

Expression ratio

10

Time in hours

Expression ratio

5

10

0

0 0

15

5

5

0

20

Expression ratio

15

20

Expression ratio

20

Expression ratio

Expression ratio

361247

i = 16

25

10

20

15

10

5

5

10

15

20

10

0

5

Time in hours

10

15

5

30

10

15

20

0

5

i = 18

25

10

15

20

Time in hours 30

30

i = 17

25

10

Time in hours

Time in hours

30

15

0 0

20

20

5

0

0

0

15

5

5

0

20

Expression ratio

15

Expression ratio

Expression ratio

Expression ratio

147050 20

i = 20

i = 19

25

25

10

356635

429460

130482

15

10

5

20

15

10

5

0

5

10

15

5

Time in hours

10

15

20

0

5

10

15

10

0

5

10

15

20

Time in hours

30

i = 22

i = 21 25

25

328692

20

Expression ratio

Expression ratio

20

Time in hours

Time in hours 30

15

0

0 0

20

20

5

5

0

0

Expression ratio

15

20

Expression ratio

Expression ratio

Expression ratio

130476 20

310406 15

10

5

20

15

10

5

0

0 0

5

10

Time in hours

15

20

0

5

10

15

20

Time in hours

Fig. 6. Different clusters obtained using the correlation distance on the 517 gene temporal expression profiles, where k = 22.

Table 1. k

Values of the IM SE index for 517 genes, where k = 16 to 27 and q = 0.3 to 1.0.

q=0.3

q=0.4

q=0.5

q=0.6

q=0.7

q=0.8

q=0.9

q=1.0

16 17344.58 13144.73 9961.84 7549.67 5721.58 4336.14 3286.18 2490.46 17 18791.87 14155.50 10663.02 8032.21 6050.49 4557.70 3433.21 2586.16 18 18685.75 13995.33 10482.28 7851.07 5880.32 4404.27 3298.73 2470.70 19 27333.00 20361.60 15168.29 11299.55 8417.55 6270.62 4671.27 3479.84 20 27169.35 20136.14 14923.59 11060.39 8197.23 6075.25 4502.58 3337.02 21 29057.98 21431.06 15805.99 11657.35 8597.61 6340.97 4676.64 3449.15 22 29577.07 21712.65 15939.35 11701.14 8589.86 6305.85 4629.15 3398.28 23 29554.55 21599.89 15786.24 11537.35 8432.05 6162.55 4503.89 3291.66 24 29493.69 21463.87 15620.21 11367.52 8272.65 6020.38 4381.29 3188.46 25 29981.57 21730.04 15749.49 11414.91 8273.29 5996.32 4346.01 3149.90 26 30081.89 21717.39 15678.71 11319.13 8171.76 5899.54 4259.13 3074.85 27 29959.90 21547.85 15497.71 11146.31 8016.69 5765.79 4146.89 2982.54

References 1. A. Brazma and J. Vilo. Gene expression data analysis. FEBS Lett., 480:17–24, 2000. 2. L. Br´eh´elin. Clustering Gene Expression Series with Prior Knowledge. In Lecture Notes in Computer Science, volume 3692, pages 27–38, October 2005. 3. S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. Brown, and I. Herskowitz. The transcriptional program of sporulation in budding yeast. Science, 282:699–705, 1998. 4. S. Dr˘ aghici. Data Analysis Tools for DNA Microarrays. Chapman & Hall, 2003. 5. R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, Inc., New York, NY, 2nd edition, 2000. 6. M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. In Proc. Natl Acad. Sci., volume 95, pages 14863–14868, USA, 1998. 7. L. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: identification and analysis of coexpressed genes. Genome Res., 9:1106–1115, 1999. 8. V. Iyer, M. Eisen, D. Ross, G. Schuler, T. Moore, J. Lee, J. Trent, L. Staudt, Jr. J. Hudson, and M. Boguski. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999. 9. U. Maulik and S. Bandyopadhyay. Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1650–1654, 2002. 10. S. Peddada, E. Lobenhofer, L. Li, C. Afshari, C. Weinberg, and D. Umbach. Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics, 19(7):834–841, 2003. 11. L. Rueda and A. Bari. Clustering Microarray Time-Series Data Using a MeanSquare-Error Profile Alignment Algorithm. Submitted for publication. Electronically available at http://www.inf.udec.cl/˜lrueda/papers/ProfileMSE-Jnl.pdf. 12. G. Sherlock. Analysis of large-scale gene expression data. Curr. Opin. Immunol., 12:201–205, 2000.