
A New Method for Hierarchical Clustering Combination

Abdolreza Mirzaei (a), Mohammad Rahmati (a,*), and Majid Ahmadi (b)

(a) Computer Engineering Department, Amirkabir University of Technology, Tehran 15914, Iran
(b) Electrical and Computer Engineering Department, University of Windsor, Ontario, Canada
[email protected], [email protected], [email protected]

* Corresponding author. Address: Computer Engineering Department, Amirkabir University of Technology, Tehran 15914, Iran. Tel: +98 21 64542741, Fax: +98 21 6495521, Email: [email protected]

Abstract. In the field of pattern recognition, combining different classifiers into a robust classifier is a common approach for improving classification accuracy. Recently, this trend has also been used to improve clustering performance, especially for non-hierarchical clustering approaches. Hierarchical clustering is generally preferred over partitional clustering for applications in which the exact number of clusters is not known or in which the relations between clusters are of interest. To the best of our knowledge, the clustering combination methods proposed so far are based on partitional clustering, and hierarchical clustering has been ignored. In this paper, a new method for combining hierarchical clusterings is proposed. In the first step, the primary hierarchical clustering dendrograms are converted to matrices. These matrices, which describe the dendrograms, are then aggregated (using the matrix summation operator) into a final matrix from which the final clustering is formed. The effectiveness of several well-known dendrogram descriptors, and of the one proposed by us, for representing dendrograms is evaluated and compared. The results show that all these descriptors work well and that more accurate results (hierarchies of clusters) are obtained with hierarchical combination than with combination of partitional clusterings.

Keywords: Clustering, hierarchical clustering, cluster ensembles, clustering combination.

1. INTRODUCTION

Clustering algorithms are unsupervised methods for exploring and grouping similar patterns within a set of patterns. Clustering methods are applied to a set of patterns to partition the set into clusters such that all patterns within a cluster are similar to each other and different from the

members of other clusters. In other words, the goal is to have minimum variance within each cluster and maximum variance between different clusters.

In the field of pattern recognition, an effective method for solving complicated problems is to use multiple classifier systems. In [8] it is suggested that combining classifier systems improves recognition accuracy. Recently, clustering combination has been used to improve the quality and robustness of clustering, and many ensemble-based methods have been introduced [1], [15], [16], [17], [18], [47], [50], [52]. In real situations, ensemble-based methods are preferred to conventional clustering methods even though they demand higher computational resources. This is because conventional clustering methods are parameter sensitive: for example, the number of clusters and the choice of starting points affect the result of k-means, and in hierarchical clustering different proximity measures and merging criteria produce different partitionings of the data. Combining the outputs of different clustering algorithms leads to a new clustering method which is less sensitive to these parameters and is more robust.

The main problem in cluster combination is the mismatch between the labels of different clusterings. A variety of methods has been introduced for resolving this mismatch. The simplest way is to solve the correspondence problem between cluster labels by searching over the partitions of the ensemble. This problem has been solved with a greedy algorithm [50], [52] and with the Hungarian algorithm [47]. After the partitions are re-labeled, any classifier combination method may be applied to them to derive the final partition of the ensemble; for example, averaging [50] or different types of voting [52] have been used as consensus functions in the literature. There are other clustering combination methods which can be used to find the consensus


clustering without explicitly solving the label correspondence problem. The best-known method in this category is based on the formation of a so-called co-association matrix [1], [15], [16], [17], [18], [29], [33]. In this method, a co-association matrix is produced for each clustering, in which the (i, j)-th entry is set to one if objects i and j are placed in the same cluster by that clustering algorithm and to zero otherwise. The overall consensus matrix is derived by aggregating the co-association matrices of the clusterings. After deriving the consensus matrix, it is possible to join into one cluster all those points whose pairwise entries in the consensus matrix are greater than a predefined parameter θ and to form single-element clusters from the remaining points; this method is called the voting algorithm in [28]. Another possibility, called stacked clustering in [28], is to interpret the consensus matrix as a similarity between objects and then apply any clustering algorithm that operates directly on a similarity matrix. The result of such a clustering is the ensemble partition.

Some researchers have used graph algorithms for clustering combination without solving the correspondence problem [13], [42], [43]. To map clustering combination to a graph problem, one can consider the objects, the clusters [42], [43], or a combination of them [13] as graph vertices and the similarity between them as edge weights. Given this graph representation, graph partitioning algorithms can be used to find the consensus clustering. The feature-based approach [21] to deriving the consensus partition of an ensemble is to consider the output of the base clustering algorithms as an intermediate feature space on which running another clustering algorithm produces the final clustering. The authors in [44], [45] formulated the clustering combination problem as a categorical clustering of the partition results and used an Expectation-Maximization-based algorithm to derive the consensus clustering. The aforementioned consensus matrix resulting from aggregating co-association matrices can also be considered as an intermediate space. Here the consensus matrix is interpreted as data [29], i.e. the i-th object is represented by the i-th row (or column) of the consensus matrix; in this way each object is represented by its similarity to the other objects.

Another method for deriving the consensus clustering is to find the median partition. The median partition minimizes the sum of differences between the resultant partition and the input partitions, so that it is the best summary of the ensemble partitions [46]; equivalently, the best median partition is the one with maximum pairwise agreement with the ensemble partitions. Gionis et al. [20] proposed such an algorithm, which can handle missing data and does not require the number of clusters as an input parameter. The direct optimization method proposed in [42] can also be placed in this group. It uses the Normalized Mutual Information (NMI) criterion to measure the agreement between the current partition produced by the ensemble and the partitions of the ensemble, and it sequentially re-labels the points until a local optimum of the NMI criterion is reached.

Some researchers have applied the well-known Bagging [5] and Boosting [41] methods to clustering combination [11], [14], [16], [19], [31], [32], [33], [47], [48]. For example, the Bregman divergence has been used in clustering [2], [3], [34]. The authors in [34] proposed a theoretical framework to demonstrate the advantage of weighting instances in clustering and developed weighted versions of well-known clustering algorithms such as k-means, fuzzy c-means, Expectation Maximization (EM), and k-harmonic means. Experimental studies support the effectiveness of Bagging and Boosting in clustering [11], [14], [19], [31], [32], [48].

The aforementioned methods for clustering combination are all based on partitional clustering, and to our knowledge no research on hierarchical clustering combination has been


reported in the literature. In this paper we aim to show the utility of hierarchical clustering combination for improving clustering performance. The result of stacked clustering [28] is a hierarchy derived from the combination of several flat partitional clusterings. The main difference between that method and the method proposed in this paper is that the inputs to stacked clustering are flat partitional clusterings; if the input clusterings are hierarchical, they must first be converted into partitional clusterings before the combination algorithm is applied. For each hierarchical clustering, the level which leads to the best partitional clustering of the input data is selected based on a prespecified evaluation criterion, and the partition corresponding to this level is taken as the partitional clusterer of that hierarchy. The partitional clusterers are then combined using stacked clustering. Stacked clustering therefore uses the information of just one level of each primary hierarchical clustering, while the other levels contain useful information that could be used to improve the quality of the combination.

The task of combining hierarchical clusterings has a counterpart in supervised learning, namely the combination of decision trees (DT). Different decision tree ensembles such as Boosting of DTs [9] or random forests [6] have been proposed. Most DT ensembles combine the outputs of the base trees to produce the final output and do not create a single DT from the ensemble. The only exception, to the best of our knowledge, is the work of [26], [27], [35], which uses Fourier analysis to aggregate the trees in an ensemble into a single informative DT [35]. However, this method uses the object labels and therefore cannot be used in an unsupervised scenario. In this paper, we propose a new combination method which works directly on the input hierarchical clusterings to derive a consensus hierarchical clustering.

The rest of this paper is


organized as follows. A brief overview of hierarchical clustering algorithms is presented in Section 2. The proposed combination method is described in Section 3. Section 4 provides detailed experimental results, and Section 5 presents concluding remarks.

2. HIERARCHICAL CLUSTERING

Hierarchical clustering is a very useful tool in data mining applications. The output of a hierarchical clustering is usually represented as a dendrogram. Dendrograms offer a view of the data distribution at different abstraction levels, which makes hierarchical clustering an ideal choice for data exploration and visualization. Furthermore, in real applications each cluster often has a number of sub-clusters, and a hierarchy is a natural way to represent this relationship. Many hierarchical clustering methods have been proposed [10], [24], [25], [49]. Generally these methods can be categorized into two groups: top-down (divisive) and bottom-up (agglomerative). In the bottom-up methods, each individual pattern is first assigned to a cluster containing only that pattern; then the two nearest clusters are merged, and this process continues until a single cluster containing all patterns is reached [49]. Different definitions of the distance between clusters give rise to different bottom-up methods. In the top-down case, a cluster containing all patterns is created first and is then divided into two clusters according to the amount of separation between the resulting partitions; this process continues until clusters containing only one pattern are obtained. Partitional clustering methods can also be used to create a hierarchical clustering: the hierarchy is formed by successively splitting each cluster into smaller clusters with a partitional method [24]. Except for the way each cluster is split, this algorithm is similar


to the top-down clustering algorithms.
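As an illustration of the bottom-up procedure just described, the following Python sketch (our own illustration, not part of the original paper; SciPy's standard routines are assumed) builds an agglomerative hierarchy and cuts it at a chosen level:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 toy patterns in 2-D

D = pdist(X, metric="euclidean")      # condensed pairwise distance matrix
Z = linkage(D, method="single")       # bottom-up merging; also 'complete', 'average', ...

# Cutting the dendrogram yields a flat partition at one abstraction level.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)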

3. HIERARCHICAL CLUSTERING COMBINATION

The answers to the following questions are the necessary ingredients for constructing a clustering combination system:

- How do we build the base clusterers?
- How do we combine them into a single informative clusterer?

The answers to these questions are given in the following subsections.

3.1. Creating the base clusterers

In any clustering combination system it is desirable that the base clusterers be as accurate as possible and, at the same time, complementary to each other, i.e. that they make errors on different objects. In other words, diversity between the outputs of the different clusterings is the key to improving the quality of the final clustering [28]. If each clusterer produces an output similar to the outputs of the other clusterers, combining the outputs will not lead to a better result. Because most methods for creating a hierarchical clustering are predefined and deterministic, the following mechanisms can be used to create an ensemble of diverse clusterers:

1. Using random data subsets to generate the primary clusterers. Such clusterers only reveal the relationships between the patterns present in their training subset; however, if the pattern selection is a uniform random process, the ensemble as a whole will capture the relationships among all patterns.

2. Using different feature subsets. Different subsets of the original feature set produce different proximity matrices and consequently different hierarchical clusterings of the input data.


3. Using multiple hierarchical clustering methods. Different pattern proximity measures lead to different clusterings. Furthermore, a variety of definitions of the distance between clusters exist and can be used to create different clusterings. Top-down clusterers can also be used alongside bottom-up methods to benefit from the advantages of both.

4. Generating the needed diversity by building hierarchical clusterers from repeated runs of a partitional clustering method such as K-means, with random initial cluster centers for each run.

5. Using on-line methods to create hierarchical clusterers. These methods strongly decrease the computational burden of creating the base clusterers. They are sensitive to the presentation order of the patterns, so by changing the data order we can generate different hierarchical clusterings. Although every single run of an on-line hierarchical clusterer is sensitive to the order of the patterns, combining a diverse ensemble of them recovers the intrinsic structure of the data.

3.2. Hierarchical clustering combination

After the primary hierarchical clusterings have been created, they must be combined into a final clustering, and the result of this combination should fit the base clusterings as well as possible. In this paper we first create a description matrix for each hierarchical partitioning and then combine these matrices into a resultant description matrix; a hierarchical clustering algorithm forms the final dendrogram from this aggregated description matrix. Fig. 1 shows the Hierarchical Clustering Combination (HCC) algorithm.

*** figure 1 here ***
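A minimal Python sketch of this pipeline (our own illustration, not the authors' implementation) is given below; it uses the cophenetic-difference (CD) matrix defined in Section 3.2.1 as the dendrogram descriptor and random feature subsets (one of the options of Section 3.1) as the source of diversity:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def hcc(X, n_clusterers=10, base_method="single", final_method="average", seed=0):
    """Hierarchical Clustering Combination sketch: describe, aggregate, recover."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    consensus = np.zeros((n, n))
    for _ in range(n_clusterers):
        # a diverse base clusterer built on a random feature subset
        feats = rng.choice(d, size=max(1, d // 2), replace=False)
        Z = linkage(pdist(X[:, feats]), method=base_method)
        # describe this dendrogram by its cophenetic (CD) matrix and aggregate
        consensus += squareform(cophenet(Z))
    # recover the combined dendrogram from the aggregated description matrix,
    # interpreting it as a dissimilarity matrix
    return linkage(squareform(consensus, checks=False), method=final_method)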


In the HCC algorithm shown in Fig. 1, some issues remain unspecified:

- Which description matrices should be used to represent the dendrograms?
- How do we aggregate the description matrices into a final matrix?
- How do we recover the final hierarchy from the aggregated matrix?

These issues are discussed in the following subsections.

3.2.1. Dendrogram descriptors

Each dendrogram over N input objects can be represented by a matrix of size N × N. Different dendrogram descriptors have been proposed by different researchers (see for example [36] and the references therein). A dendrogram description matrix (descriptor) expresses the relative position of a given pair of terminal vertices in the dendrogram [36]. If $D^{(i)} = \{d^{(i)}_{jk}\}$ is the description matrix of hierarchy $H^{(i)}$, then $d^{(i)}_{jk}$ is the amount of difference between the two patterns j and k within dendrogram i.

Before providing further details about dendrogram descriptors, a property called the ultrametric inequality is described. A matrix D can be represented in the form of a dendrogram if it is ultrametric, i.e. if the following ultrametric inequality is satisfied [30]:

    $d_{jk} \le \max(d_{jl}, d_{lk})$        (1)

for all triplets of points j, k, and l. This condition implies that for every triplet of points the two largest distances are equal. Any dendrogram is characterized by a unique ultrametric (distance) matrix and can be recovered from its corresponding ultrametric matrix using hierarchical clustering. Hierarchical clustering algorithms fit an ultrametric distance matrix to the object distances. A distance matrix can be made ultrametric by increasing some of its entries (as is done in clustering algorithms such as Complete Linkage), or decreasing some of


them (for example in Single Linkage), or by simultaneously increasing some entries and decreasing others (for example in Average Linkage or UPGMA). Applying a hierarchical clustering algorithm to an ultrametric distance matrix should not change it. Therefore, if the original distance matrix is ultrametric, exactly one hierarchy fits it; if it is not ultrametric, the different approximation operators that can be used to make it ultrametric yield not one solution but a variety of dendrograms. Consequently, a matrix used as a dendrogram descriptor should preferably be ultrametric. Various description matrices are discussed in [36]; a short description is presented here.

- Cophenetic Difference (CD): the CD of a given pair of objects is the lowest level of the hierarchy at which the two objects are joined together.

- Partition Membership Divergence (PMD): the PMD of a given pair of objects j, k is the number of partitions implied by the dendrogram in which this pair of objects does not belong to the same cluster.

- Cluster Membership Divergence (CMD): the CMD of a pair j, k is the number of objects in the smallest cluster containing both j and k.

- Sub-tree Membership Divergence (SMD): the SMD of a pair j, k is the number of sub-trees in which j and k do not occur together.

- Path Difference (PD): for each pair j, k of labeled terminal vertices, the PD is the number of edges along the path between j and k in the tree. PD creates a description matrix which is not ultrametric, so hierarchical clustering algorithms cannot recover the dendrogram from the PD matrix. In order to produce an ultrametric matrix, we have defined a new descriptor, called the maximum edge distance, which is introduced next.

- Maximum Edge Distance (MED): this descriptor is basically similar to the PD and CD descriptors. The original levels of the hierarchy are discarded and an abstract level is associated with each cluster. For clusters containing only one object this level is zero. The abstract level of a cluster created by combining two clusters C1 and C2 is defined as

    $Level(C_1 \cup C_2) = \max(Level(C_1), Level(C_2)) + 1$        (2)

where $Level(C_i)$ is the abstract level associated with cluster $C_i$. Using these levels, the MED of j and k is defined (similarly to the CD descriptor) as the lowest abstract level of the hierarchy at which these objects are joined together. The CD is sensitive to the exact values of the levels in the hierarchy, i.e. a small change in the levels can make the CD descriptions of two topologically similar hierarchies differ significantly [36]; this problem is resolved in MED by the abstract levels. Furthermore, as shown below, the matrix obtained from this descriptor is ultrametric, and therefore the dendrogram topology can be reconstructed from it.

Proposition 1: MED is an ultrametric descriptor.

Proof: Suppose that, among the objects i, j, and k of the input hierarchy, objects i and j are joined first and object k joins them later (as in Fig. 2). Let the lowest cluster of the hierarchy in which i and j are joined have abstract level $L_{ij}$. Since each cluster merging increases the level of the resulting cluster, the level at which k joins the other two objects is at least $L_{ij} + 1$. On the other hand, because i and j are in the same cluster from level $L_{ij}$ on, they merge with k simultaneously. Hence $MED_{ij} \le MED_{ik} = MED_{jk}$, which confirms that MED is an ultrametric descriptor. □

All descriptors introduced above, except the PD descriptor, are ultrametric or quasi-ultrametric (matrices in which only the off-diagonal values maintain the ultrametric property). Therefore, any standard hierarchical clustering algorithm will completely recover the dendrogram. As an example, the dendrogram shown in Fig. 2 is used and its various matrix descriptions are presented in Table 1.
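The following Python sketch (our own illustration; it assumes the dendrogram is encoded as a SciPy linkage matrix) computes the MED descriptor via the abstract levels of Eq. (2) and checks the ultrametric inequality of Eq. (1):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def med_descriptor(Z, n):
    """Maximum Edge Distance matrix for a linkage matrix Z over n objects.
    Leaves get abstract level 0; a merged cluster gets
    max(level(left), level(right)) + 1 (Eq. 2); MED(j, k) is the level of the
    first cluster that contains both j and k."""
    level = {i: 0 for i in range(n)}        # abstract level per cluster id
    members = {i: [i] for i in range(n)}    # object members per cluster id
    D = np.zeros((n, n))
    for step, (a, b, _, _) in enumerate(Z):
        a, b = int(a), int(b)
        new = n + step
        level[new] = max(level[a], level[b]) + 1
        for j in members[a]:
            for k in members[b]:
                D[j, k] = D[k, j] = level[new]
        members[new] = members[a] + members[b]
    return D

def is_ultrametric(D, tol=1e-9):
    """Check d(j,k) <= max(d(j,l), d(l,k)) for every triplet (Eq. 1)."""
    n = len(D)
    return all(D[j, k] <= max(D[j, l], D[l, k]) + tol
               for j in range(n) for k in range(n) for l in range(n))

X = np.random.default_rng(1).normal(size=(8, 2))
Z = linkage(pdist(X), method="average")
assert is_ultrametric(med_descriptor(Z, len(X)))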

***Figure 2 here ***
***Table 1 here ***

3.2.2. Aggregating dendrogram descriptors

After the primary dissimilarity matrices $D^{(i)}$, $1 \le i \le L$, have been created, they must be aggregated into a new dissimilarity matrix D. In this paper the sum operator is used for this purpose. Using the mean operator leads to the same result, because multiplying or dividing all elements of a dissimilarity matrix by a constant does not change the resulting hierarchy. Note that if subsampling is used to create a diverse ensemble of clusterings, not all objects are included in every data set; the difference between patterns that are missing from a clustering tree is then undefined and the description matrices are not completely filled. To solve this problem, the missing entries of the description matrix are filled with don't-care values, and the aggregation operator skips don't-care terms. The idea of using don't-care values in the consensus matrix was introduced in [33], where subsampling was also used and unfilled entries in the resultant consensus matrix had to be handled with don't-care values.
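A small Python sketch of this aggregation step (our own illustration, assuming NaN is used as the don't-care value) is given below; each pair is averaged only over the clusterers that actually observed it:

import numpy as np

def aggregate(descriptors):
    """Aggregate a list of N x N descriptor matrices into one consensus matrix.

    np.nan marks a don't-care entry, i.e. a pair of objects that never
    appeared together in that base clusterer's subsample; nanmean skips
    these terms, so the aggregation ignores don't-care values.
    """
    return np.nanmean(np.stack(descriptors), axis=0)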


don’t care values. 3.2.3. Recovering the final hierarchy

The last step of hierarchical clustering combination is to use the dissimilarity matrix D to create the final hierarchy. In this paper, the final hierarchy is created by applying standard hierarchical clustering algorithms, such as Single Linkage (SL) or Mean Linkage (ML) [49], to the combined dissimilarity matrix; other hierarchical methods could obviously be applied as well.

To demonstrate how the HCC algorithm works, we present an example of combining three dendrograms, shown in Fig. 3. The input hierarchies are ((k,(l,m)),n), (((k,l),m),n), and (((k,l),n),m), corresponding to Fig. 3(a, b, c), respectively. The CD descriptors of these hierarchies are shown in Fig. 3(d, e, f), and Fig. 3(g) gives the consensus matrix derived from the summation of the three CD descriptors. The final step of the HCC method is to apply a hierarchical clustering algorithm to the obtained consensus matrix; in this example Single-Linkage clustering is used. The algorithm begins with four clusters, each consisting of one sample, and repeatedly merges the two nearest clusters. The smallest number in the matrix of Fig. 3(g) is 4, which is the distance between samples k and l, so the clusters {k} and {l} are merged. This yields matrix 3(h), which gives the distances between the clusters {k,l}, {m}, and {n}. In this matrix, the value 6, the distance between clusters {k,l} and {m}, is computed as follows. Matrix 3(g) shows that distance({k},{m}) = 7 and distance({l},{m}) = 6. In the SL algorithm the distance between clusters is the minimum of these values, 6, i.e.

    distance({k,l},{m}) = min(distance({k},{m}), distance({l},{m})).        (3)

The other value in the first row is computed in a similar way, and the remaining value, distance({m},{n}), is simply copied from matrix 3(g). Since the minimum value in matrix 3(h) is 6, the clusters {k,l} and {m} are merged. Matrix 3(i) then gives the distance between the clusters {k,l,m} and {n}, and the next step merges the two remaining clusters at a distance of 8. The resulting ultrametric matrix and dendrogram are shown in Fig. 3(j, k); the combined hierarchy in Fig. 3(k) is (((k,l),m),n). The result appears reasonable, because the output hierarchy contains the clusters (k,l), (k,l,m), and (k,l,m,n) at different levels, and each of these clusters is present in at least two of the three input dendrograms.
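In code, this recovery step amounts to running a standard linkage algorithm on the consensus matrix; a minimal Python sketch (our own illustration) is:

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def recover_hierarchy(consensus, method="single"):
    """Recover the combined dendrogram from a symmetric n x n consensus
    dissimilarity matrix; method='single' corresponds to SL and
    method='average' to ML."""
    return linkage(squareform(consensus, checks=False), method=method)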

***Figure 3 here ***

4. EXPERIMENTAL RESULTS

4.1. Clustering validation techniques

There are many clustering validation techniques available. In general they can be divided into external and internal validation techniques [22]. In external evaluation techniques, the set of class labels of the input patterns is assumed to be known, and the clustering quality is evaluated by comparing the class labels with the clustering result. Internal evaluation techniques do not use class labels; they estimate the clustering quality from information intrinsic to the data alone. Indices based on the contingency table of pairwise assignments of data items, such as the Rand index [37], the adjusted Rand index [23], and the Jaccard coefficient [4], or measures which evaluate the purity and completeness of clusters with respect to the ground truth, such as the F-measure [38], are examples of external evaluation techniques. If no class labeling is available, internal validation measures may be used instead; in these techniques the quality of a clustering is judged by objective functions such as compactness, connectedness, separation, or combinations of them. The Dunn


Index [12], the Davies–Bouldin Index [7], and the Silhouette Width [40] are examples of such multi-objective evaluation techniques. Some authors use the degree of agreement between the clustering result and the distance information in the original data to measure clustering performance; the most famous methods of this category are based on the formation of the cophenetic matrix [39]. The cophenetic matrix can be formed for both partitional and hierarchical clusterings, so these methods may be used for comparing hierarchical clusterings as well. Another possibility for assessing clustering quality is to measure the stability of the clustering [4]; in these methods the consistency of the clusters obtained under perturbation and resampling of the original dataset is used as an indicator of cluster quality.

The primary goal of hierarchical clustering combination is to discover the optimal hierarchical representation of the input data. Thus, the objective function used for comparing different combination methods must consider the entire hierarchy produced by a combination method, and not a specific level only. The FScore measure introduced in [51] has this property. Each hierarchy H produced by a combination algorithm contains a set of nested clusters, and the evaluation must take all of these partitions into account. For each class $C_r$, the FScore measure finds the cluster $S_i$ in H that represents $C_r$ better than any other cluster. Let $C_r$ be a class of size $n_r$, let $S_i$ be the i-th cluster, of size $n_i$, and let $n_{ri}$ denote the number of objects in cluster $S_i$ belonging to class $C_r$. Moreover, let c be the total number of target classes. The agreement between cluster $S_i$ and class $C_r$ is defined as

    $F(C_r, S_i) = \dfrac{2\,R(C_r, S_i)\,P(C_r, S_i)}{R(C_r, S_i) + P(C_r, S_i)}$,        (4)

where $R(C_r, S_i)$ and $P(C_r, S_i)$ are defined, respectively, as

    $R(C_r, S_i) = \dfrac{n_{ri}}{n_r}$        (5)

and

    $P(C_r, S_i) = \dfrac{n_{ri}}{n_i}$.        (6)

For each class $C_r$, its corresponding cluster in H is the cluster with the maximum agreement with $C_r$, and this agreement is the FScore of $C_r$:

    $F(C_r) = \max_{S_i \in H} F(C_r, S_i)$.        (7)

The quality of the cluster hierarchy H is then calculated as the sum of the individual class FScores weighted by class size,

    $FScore = \sum_{r=1}^{c} \dfrac{n_r}{N} F(C_r)$.        (8)

The FScore measure takes its maximum value of 1 if each class has a corresponding cluster in the tree that agrees with it completely.
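As an illustration, the following Python sketch (our own reading of Eqs. (4)-(8); it assumes the hierarchy is given as a SciPy linkage matrix and that every node of the tree, including the singletons, counts as a candidate cluster) computes the FScore of a hierarchy against known class labels:

import numpy as np
from collections import Counter

def tree_clusters(Z, n):
    """All nested clusters (sets of object indices) implied by linkage Z."""
    members = {i: {i} for i in range(n)}
    for step, (a, b, _, _) in enumerate(Z):
        members[n + step] = members[int(a)] | members[int(b)]
    return list(members.values())

def fscore(Z, labels):
    """Hierarchy FScore: for each class, take the best-matching cluster
    anywhere in the tree (Eq. 7), then weight by class size (Eq. 8)."""
    labels = np.asarray(labels)
    n = len(labels)
    clusters = tree_clusters(Z, n)
    total = 0.0
    for c, n_r in Counter(labels.tolist()).items():
        best = 0.0
        for S in clusters:
            n_ri = sum(labels[j] == c for j in S)
            if n_ri:
                r, p = n_ri / n_r, n_ri / len(S)
                best = max(best, 2 * r * p / (r + p))
        total += (n_r / n) * best
    return total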

4.2. Evaluation method and datasets

Our proposed algorithm is evaluated on well-known data sets that are commonly used for classifier evaluation [54], [55], [56]. Because each pattern in these data sets has a unique class label, clustering accuracy can be determined in a straightforward way. It is worth noting that the intrinsic clusters of the data may not correspond to the classes of the datasets, but since many articles have used this evaluation method, we use it as well. The specifications and sources of all data sets used in our experiments are shown in Table 2. We have used two groups of datasets, consisting of small and large datasets. Because of the randomness in the creation of the base clusterers, a single trial of a combination method is not sufficient for comparison, so for each combination type and each data set several ensembles were generated and the FScore values averaged. This is a very time-consuming process, which is why the datasets were divided into small and large ones: for the small datasets the FScore values of 100 trials are averaged, and for the large datasets the FScore values of 10 trials are averaged.

***Table 2 here ***

4.3. HCC performance analysis

For each of the datasets reported in Table 2, an ensemble of 10 Single Linkage (SL) hierarchical clusterers is generated (L = 10). A random subsample (drawn without replacement) of size 0.80 * N is used to create each base clusterer, where N is the number of instances in the original dataset. As stated before, if the base clusterers are hierarchical there are two ways to produce a consensus hierarchical clustering. The first, which we call flat combination, initially converts the input hierarchical clusterings to partitional clusterings and then combines them using stacked clustering. The second is to combine the clusterings' dendrograms directly (hierarchical combination). In hierarchical combination different dendrogram descriptors can be used; we used the CD, PMD, CMD, SMD, and MED descriptors. The flat and hierarchical combination methods thus differ in the way they represent the base clusterers. After aggregating the matrix representations of the base clusterers, both methods create a consensus matrix D from which the resultant clustering is derived. The consensus matrix can be interpreted either as a dissimilarity between objects or as data (an intermediate feature space); in the second case, the Euclidean distance is used to form a dissimilarity matrix between objects. The resulting dissimilarity matrix is used as input to Single Linkage (SL) or Mean Linkage (ML) clustering, whose output is taken as the resultant clustering. The choices involved in creating a clustering combination algorithm and the values used in our experiments are listed in Table 3; in total there are 24 combination methods, and Table 4 summarizes all the combination methods used in this experiment.
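A Python sketch of this ensemble-generation step (our own illustration of the setup described above) is:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def build_ensemble(X, L=10, frac=0.80, seed=0):
    """L single-linkage base clusterers, each on a random 80% subsample
    drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    for _ in range(L):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Z = linkage(pdist(X[idx]), method="single")
        ensemble.append((idx, Z))   # keep idx to map descriptors back to all n objects
    return ensemble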


***Table 3 here ***
***Table 4 here ***

The FScore results of the various combination methods on the small and large datasets are shown in Tables 5 and 6, respectively. The results are reported as the mean plus/minus standard deviation of the FScore values over 100 and 10 trials, respectively. The FScores of the different ensemble methods are averaged and shown in the 25th row of these tables. To allow a comparison with hierarchical clustering without an ensemble, the result of applying the SL algorithm to each dataset is also included in the 26th row. These tables reveal that the ensemble methods are significantly better than the non-ensemble method on almost all datasets. In these tables, the best results on each dataset are determined by the method proposed in [53]: on each dataset the algorithm with the maximum FScore is selected and the FScores of the other algorithms are compared with it using a t-test at the 0.01 significance level. The t-test divides the algorithms into two groups, (I) the algorithms significantly worse than the best method and (II) the algorithms comparable to the best method. In this way the significantly best methods on each dataset are determined; they are boldfaced in Tables 5 and 6. The fraction of datasets on which a method achieves the significantly best performance is called the winning frequency [53]. Figs. 4 and 5 present the winning frequencies of the different combination methods on the 12 small and 12 large datasets. These figures show that methods 16 and 24 achieve the highest winning frequency among all 24 combination methods. Note that the winning frequency only considers the significantly best methods; if it is to be used for comparing different combination methods, it would have to be modified to take the lower-ranked methods into account as well, not only the first rank.
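The per-dataset comparison just described can be sketched in Python as follows (our own illustration; fscore_trials is a hypothetical dictionary mapping each method to its FScore values over the repeated trials):

import numpy as np
from scipy.stats import ttest_ind

def significantly_best(fscore_trials, alpha=0.01):
    """Find the method with the highest mean FScore and every method whose
    FScore samples are not significantly different from it at level alpha."""
    means = {m: np.mean(v) for m, v in fscore_trials.items()}
    best = max(means, key=means.get)
    winners = {best}
    for m, v in fscore_trials.items():
        if m != best:
            _, p = ttest_ind(fscore_trials[best], v)
            if p >= alpha:
                winners.add(m)
    return winners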

***Table 5 here ***


***Table 6 here ***
***Fig. 4 here***
***Fig. 5 here***

To analyze the performance of the different combination methods, we summarize the results of Tables 5 and 6 in two ways. The first is to compare the robustness of the different algorithms [53]. For each dataset, the relative FScore of an algorithm (alg) is obtained by dividing its FScore by the largest FScore obtained for that dataset over all methods, that is,

    $r_{alg} = \dfrac{FScore_{alg}}{\max_{\alpha} FScore_{\alpha}}$.        (9)

The relative FScore takes its maximum value of one for the best algorithm, and $r_{alg} \le 1$ for the other algorithms. The sum of the $r_{alg}$ values, called the robustness in [53], represents the degree to which a particular method performs well over the various datasets. Tables 7 and 8 present the $r_{alg}$ values of the different combination algorithms on the small and large datasets. In Figs. 6 and 7 the relative performances of each algorithm on the different datasets are stacked together to give an indication of that algorithm's robustness. These figures show that the performance of the hierarchical combination methods is superior to that of the flat combination methods.
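A short Python sketch of Eq. (9) and of the robustness summary (our own illustration) is:

import numpy as np

def robustness(fscores):
    """fscores: 2-D array, rows = algorithms, columns = datasets."""
    F = np.asarray(fscores, dtype=float)
    r = F / F.max(axis=0, keepdims=True)   # relative FScore per dataset (Eq. 9)
    return r, r.sum(axis=1)                # relative scores and per-algorithm robustness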

***Table 7 here ***
***Table 8 here ***
***Fig. 6 here***
***Fig. 7 here***

As a second approach to summarizing the results of Tables 5 and 6, we use the method proposed in [29], in which the statistical significance of the difference between any two combination methods is examined. Two methods differ significantly if their 95% confidence intervals for the mean FScore do not intersect. We compare each pair of methods to see which method outperforms the other on most datasets. Let b(i, j) denote the number of datasets on which combination method i is significantly better than combination method j, let w(i, j) be the number of datasets on which i is significantly worse than j, and let s(i, j) be the number of datasets on which the difference between i and j is not significant. The following formula then defines a total index for each method:

    $t(i) = \sum_{j=1}^{Q} \big(b(i, j) - w(i, j)\big), \quad i = 1, \ldots, Q$,        (10)

where Q is the number of methods being compared (24 in our case). The higher the t(i) value, the better the combination solution. If P is the number of datasets (12 in each of the small and large groups), then t(i) equals −(Q − 1)P when method i is significantly worse than all other Q − 1 methods on all P datasets, and (Q − 1)P when method i is significantly better than all other Q − 1 methods on all P datasets; otherwise it takes a value in the interval [−(Q − 1)P, (Q − 1)P]. The value of t(i) can be normalized to the interval [0, 1] by

    $Nt(i) = \dfrac{t(i) + (Q - 1)P}{2(Q - 1)P}$.        (11)

Tables 9 and 10 present the t(i) value of each combination method, and the normalized version Nt(i) is shown in Figs. 8 and 9.
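A Python sketch of Eqs. (10) and (11) (our own illustration; the pairwise counts b and w are assumed to have been obtained from the confidence-interval comparison above) is:

import numpy as np

def total_index(b, w, P):
    """b[i, j] / w[i, j]: number of the P datasets on which method i is
    significantly better / worse than method j (Q x Q, zero diagonal)."""
    b, w = np.asarray(b), np.asarray(w)
    Q = b.shape[0]
    t = (b - w).sum(axis=1)                        # Eq. (10)
    Nt = (t + (Q - 1) * P) / (2 * (Q - 1) * P)     # Eq. (11)
    return t, Nt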

***Table 9 here ***
***Table 10 here ***


***Figure 8 here ***
***Figure 9 here ***

As illustrated in Figs. 8 and 9, all dendrogram descriptors produce acceptable results, and the performance of the hierarchical combination algorithms is significantly better than that of the flat combination methods. It is even noticeable that on the small datasets the worst hierarchical combination method (SMD/Data/ML) produces results almost identical to those of the best flat combination method (Flat/Dissimilarity/SL). Comparing the four implementations of the SMD, PMD, and CD descriptors on the small datasets reveals that the performance order is the same in all cases, as shown in Fig. 8: in descending order of accuracy, Dissimilarity/ML, Data/SL, Dissimilarity/SL, and Data/ML. The same ordering is also observed for CMD and MED, and CMD and MED outperform the other three descriptors. From Fig. 8 it can further be observed that:

- In flat combination, interpreting the consensus matrix as a dissimilarity gives better results, and it is preferable to use SL rather than ML.

- In hierarchical combination: (a) if the ML algorithm is applied, interpreting the consensus matrix as a dissimilarity gives better accuracy than interpreting it as data, and the situation is reversed if the SL algorithm is used; (b) if the consensus matrix is taken as a dissimilarity, applying ML is better than SL, whereas if it is taken as a data matrix, a more accurate clustering is obtained with SL.

Overall, the best case occurs when hierarchical combination is used, the consensus matrix is interpreted as a dissimilarity, and ML is chosen as the final hierarchical algorithm. As shown in Fig. 9, similar results are obtained for the large datasets and the performance order of the different methods is the same; the only noticeable difference is that the accuracy of the SMD/Data/ML method increases on the large datasets.

5. CONCLUSION

Hierarchical clustering has the advantage that the number of true clusters is not required by the algorithm, and an expert can analyze the resulting dendrogram to discover the best structure of the data. Because of the important role of hierarchical clustering in data mining tasks, we proposed a strategy for combining hierarchical clustering results. The proposed method combines an ensemble of hierarchical clusterings to derive a single informative hierarchical clustering. The base clusterings are created with the Single Linkage (SL) algorithm on random sub-samples of the input data; note, however, that because the combination algorithm works only on dendrograms, any other hierarchical clustering method can be used to create the base clusterers. After the base clusterers have been created, their dendrograms are represented in matrix form; these matrix representations are then aggregated and the final clustering is produced. The experimental results show a clear improvement when this combination method is used.

There are several directions for future work with this combination method. The most important is the definition of other dendrogram descriptors, or the extension of existing ones, for clustering combination; for example, MED could be extended to consider path length or link inconsistency. Another interesting direction follows from the fact that, although the dendrogram descriptors are ultrametric, their sum (or average) is not necessarily ultrametric, so the design of an aggregation function which imposes this property directly would remove the need for a final hierarchical clustering step.

REFERENCES
[1] H. Ayad, M. Kamel, Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors, in: T. Windeatt, F. Roli (Eds.), Proc. 4th International Workshop on Multiple Classifier Systems, vol. 2709 of Lecture Notes in Computer Science, Guildford, UK, Springer-Verlag, 2003, pp. 166-175.
[2] A. Banerjee, S. Merugu, I. Dhillon, J. Ghosh, Clustering with Bregman divergences, in: Proc. of the Fourth SIAM Int'l Conf. on Data Mining, 2004, pp. 234-245.
[3] A. Banerjee, S. Merugu, I. Dhillon, J. Ghosh, Clustering with Bregman divergences, J. Machine Learning Research, 6 (2005) 1705-1749.
[4] A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustered data, in: R.B. Altman (Ed.), Pacific Symposium on Biocomputing, World Scientific Publishing Co., New Jersey, 2002, pp. 6-17.
[5] L. Breiman, Bagging predictors, Machine Learning, 24 (2) (1996) 123-140.
[6] L. Breiman, Random forests, Machine Learning, 45 (1) (2001) 5-32.
[7] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Machine Intell., 1 (1979) 224-227.
[8] T. Dietterich, Ensemble methods in machine learning, in: J. Kittler, F. Roli (Eds.), Proc. of the First Int. Workshop on Multiple Classifier Systems, Cagliari, Italy, Springer-Verlag, 2000, pp. 1-15.
[9] H. Drucker, C. Cortes, Boosting decision trees, Advances in Neural Information Processing Systems, 8 (1996) 479-485.
[10] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2001.
[11] S. Dudoit, J. Fridlyand, Bagging to improve the accuracy of a clustering procedure, Bioinformatics, 19 (9) (2003) 1090-1099.
[12] J.C. Dunn, Well separated clusters and fuzzy partitions, J. Cybernet., 4 (1974) 95-104.
[13] X.Z. Fern, C.E. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, in: Proc. of the Twenty-First International Conference on Machine Learning, 2004, p. 36.
[14] B. Fischer, J.M. Buhmann, Bagging for path-based clustering, IEEE Trans. on Pattern Analysis and Machine Intelligence, 25 (11) (2003) 1411-1415.
[15] A.L.N. Fred, Finding consistent clusters in data partitions, in: F. Roli, J. Kittler (Eds.), Proc. 2nd International Workshop on Multiple Classifier Systems, vol. 2096 of Lecture Notes in Computer Science, Cambridge, UK, Springer-Verlag, 2001, pp. 309-318.
[16] A.L.N. Fred, A.K. Jain, Robust data clustering, in: Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, pp. 128-133.
[17] A.L.N. Fred, A.K. Jain, Data clustering using evidence accumulation, in: Proc. 16th International Conference on Pattern Recognition, ICPR, Canada, 2002, pp. 276-280.
[18] A.L.N. Fred, A.K. Jain, Combining multiple clusterings using evidence accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (6) (2006) 835-850.
[19] D. Frossyniotis, A. Likas, A. Stafylopatis, A clustering method based on boosting, Pattern Recognition Letters, 25 (6) (2004).
[20] A. Gionis, H. Mannila, P. Tsaparas, Clustering aggregation, in: Proc. of the Int. Conf. on Data Engineering (ICDE), Japan, 2005.
[21] S.T. Hadjitodorov, L.I. Kuncheva, L.P. Todorova, Moderate diversity for better cluster ensembles, Information Fusion, 7 (3) (2006) 264-275.
[22] M. Halkidi, et al., On clustering validation techniques, J. Intelligent Inform. Systems, 17 (2001) 107-145.
[23] L. Hubert, P. Arabie, Comparing partitions, J. Classification, 2 (1985) 193-218.
[24] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[25] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys, 31 (3) (1999) 264-323.
[26] H. Kargupta, B.H. Park, A Fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments, IEEE Trans. Knowledge and Data Engineering, 16 (2) (2002) 216-229.
[27] H. Kargupta, B.H. Park, H. Dutta, Orthogonal decision trees, IEEE Trans. Knowledge and Data Engineering, 18 (7) (2006) 1028-1042.
[28] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, 2004.
[29] L.I. Kuncheva, S.T. Hadjitodorov, L.P. Todorova, Experimental comparison of cluster ensemble methods, in: Proc. 9th International Conference on Information Fusion, ICIF '06, 2006, pp. 1-7.
[30] F.-J. Lapointe, P. Legendre, A statistical framework to test the consensus among additive trees (cladograms), Systematic Biology, 41 (2) (1992) 158-171.
[31] F. Leisch, K. Hornik, Stabilization of k-means with bagged clustering, in: Proc. of the Joint Statistical Meetings, 1999.
[32] B. Minaei, A. Topchy, W.F. Punch, Ensembles of partitions via data resampling, in: Proc. Intl. Conf. on Information Technology, ITCC 04, Las Vegas, 2004.
[33] S. Monti, P. Tamayo, J. Mesirov, T. Golub, Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Machine Learning, 52 (2003) 91-118.
[34] R. Nock, F. Nielsen, On weighting clustering, IEEE Trans. on Pattern Analysis and Machine Intelligence, 25 (11) (2006) 1223-1235.
[35] B.H. Park, Constructing simpler decision trees from ensemble models using Fourier analysis, in: Proc. of the Seventh Workshop on Research Issues in Data Mining and Knowledge Discovery, 2002, pp. 18-23.
[36] J. Podani, Simulation of random dendrograms and comparison tests: some comments, Journal of Classification, 17 (2004) 123-142.
[37] W. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc., 66 (1971) 846-850.
[38] C. van Rijsbergen, Information Retrieval, second edition, Butterworths, 1979.
[39] H.C. Romesburg, Cluster Analysis for Researchers, Belmont, CA, 1984.
[40] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math., 20 (1987) 53-65.
[41] R.E. Schapire, The strength of weak learnability, Machine Learning, 5 (2) (1990) 197-227.
[42] A. Strehl, J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining multiple partitions, The Journal of Machine Learning Research, 3 (2002) 583-617.
[43] A. Strehl, J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining partitions, in: Proc. of AAAI, AAAI/MIT Press, 2002.
[44] A. Topchy, A.K. Jain, W. Punch, A mixture model for clustering ensembles, in: Proc. of the SIAM Conference on Data Mining, 2004, pp. 379-390.
[45] A. Topchy, A.K. Jain, W. Punch, Clustering ensembles: models of consensus and weak partitions, IEEE Trans. on Pattern Analysis and Machine Intelligence, 27 (12) (2005).
[46] A. Topchy, A.K. Jain, W. Punch, Combining multiple weak clusterings, in: Proc. of the IEEE Int. Conf. on Data Mining, Melbourne, Australia, 2003, pp. 331-338.
[47] A. Topchy, B. Minaei, A.K. Jain, W. Punch, Adaptive clustering, in: Proc. of ICPR, Cambridge, UK, 2004, pp. 272-275.
[48] S.J. Verzi, G.L. Heileman, M. Georgiopoulos, Boosted ARTMAP: modifications to fuzzy ARTMAP motivated by boosting theory, Neural Networks, 19 (4) (2006) 446-468.
[49] A.R. Webb, Statistical Pattern Recognition, John Wiley & Sons, 2002.
[50] A. Weingessel, E. Dimitriadou, K. Hornik, An ensemble method for clustering, DSC Working Papers, 2003, available at http://www.ci.tuwien.ac.at/Conferences/DSC-2003
[51] Y. Zhao, G. Karypis, Evaluation of hierarchical clustering algorithms for document datasets, in: Proc. of the Eleventh International Conference on Information and Knowledge Management, McLean, Virginia, USA, 2002, pp. 515-524.
[52] Z.-H. Zhou, W. Tang, Clusterer ensemble, Knowledge-Based Systems, 19 (1) (2006) 77-83.
[53] Z.-H. Zhou, Y. Yu, Ensembling local learners through multimodal perturbation, IEEE Transactions on Systems, Man, and Cybernetics - Part B, 35 (2005) 725-735.
[54] Gunnar Raetsch's benchmark datasets, available at http://users.rsise.anu.edu.au/~raetsch/data/index
[55] UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.htm
[56] Real medical datasets used in [29], http://www.informatics.bangor.ac.uk/~kuncheva/activities/real_data.htm


Table 1. The result of applying the different dendrogram descriptors to the dendrogram of Fig. 2: the CD, PMD, CMD, SMD, PD, and MED description matrices over the objects i, j, k, l, m, n, o, p.

[matrix entries of Table 1]


Table 2. The characteristics and the source of the datasets used.

Small datasets:
 #   Data name           #instances  #features  #classes  Source
 1   Breast_cancer       263         9          2         Gunnar Raetsch's Benchmark [54]
 2   Flare_solar         144         9          2         Gunnar Raetsch's Benchmark [54]
 3   Titanic             24          3          2         Gunnar Raetsch's Benchmark [54]
 4   Ionosphere          351         34         2         UCI ML datasets [55]
 5   Wpbc                198         32         2         UCI ML datasets [55]
 6   Image_segmentation  210         19         7         UCI ML datasets [55]
 7   Liver_disorders     345         6          2         UCI ML datasets [55]
 8   Wine                178         13         3         UCI ML datasets [55]
 9   Laryngeal1          213         16         2         real data, Kuncheva [56]
 10  Laryngeal3          353         16         3         real data, Kuncheva [56]
 11  Weaning             302         17         2         real data, Kuncheva [56]
 12  Contractions        98          27         2         real data, Kuncheva [56]

Large datasets:
 #   Data name           #instances  #features  #classes  Source
 1   Diabetis            768         8          2         Gunnar Raetsch's Benchmark [54]
 2   German              1000        20         2         Gunnar Raetsch's Benchmark [54]
 3   balance_scale       625         4          3         UCI ML datasets [55]
 4   Vehicle             846         18         4         UCI ML datasets [55]
 5   Laryngeal2          692         16         2         real data, Kuncheva [56]
 6   Banana              5300        2          2         Gunnar Raetsch's Benchmark [54]
 7   Ringnorm            7400        20         2         Gunnar Raetsch's Benchmark [54]
 8   Splice              2991        60         2         Gunnar Raetsch's Benchmark [54]
 9   Twonorm             7400        20         2         Gunnar Raetsch's Benchmark [54]
 10  Waveform            5000        21         2         Gunnar Raetsch's Benchmark [54]
 11  page_blocks         5473        10         5         UCI ML datasets [55]
 12  waveform_noise      5000        40         3         UCI ML datasets [55]


Table 3. Different parameters used to create a variety of clustering combination algorithms.

 Design parameter                                                Algorithm/method
 Diversity creation                                              Resampling without replacement
 Primary clusterers                                              SL
 Dissimilarity matrix creation:
   if the consensus matrix is interpreted as data                Euclidean distance on the consensus matrix
   if the consensus matrix is interpreted as dissimilarity       The same as the consensus matrix
 Combination type                                                Flat / Hierarchical
 Dendrogram representation (only for hierarchical combination)   CD / PMD / CMD / SMD / MED
 Recovering the final hierarchy                                  SL / ML


Table 4. All achievable combination methods.

 Method  Combination type  Dendrogram descriptor  Consensus matrix interpretation  Hierarchy recovering algorithm
 1       Flat              -                      Data                             SL
 2       Flat              -                      Data                             ML
 3       Flat              -                      Dissimilarity                    SL
 4       Flat              -                      Dissimilarity                    ML
 5       Hierarchical      CD                     Data                             SL
 6       Hierarchical      CD                     Data                             ML
 7       Hierarchical      CD                     Dissimilarity                    SL
 8       Hierarchical      CD                     Dissimilarity                    ML
 9       Hierarchical      PMD                    Data                             SL
 10      Hierarchical      PMD                    Data                             ML
 11      Hierarchical      PMD                    Dissimilarity                    SL
 12      Hierarchical      PMD                    Dissimilarity                    ML
 13      Hierarchical      CMD                    Data                             SL
 14      Hierarchical      CMD                    Data                             ML
 15      Hierarchical      CMD                    Dissimilarity                    SL
 16      Hierarchical      CMD                    Dissimilarity                    ML
 17      Hierarchical      SMD                    Data                             SL
 18      Hierarchical      SMD                    Data                             ML
 19      Hierarchical      SMD                    Dissimilarity                    SL
 20      Hierarchical      SMD                    Dissimilarity                    ML
 21      Hierarchical      MED                    Data                             SL
 22      Hierarchical      MED                    Data                             ML
 23      Hierarchical      MED                    Dissimilarity                    SL
 24      Hierarchical      MED                    Dissimilarity                    ML

31 Table 5. The result of different executed algorithms for small datasets in the forms of average and variance of FScore . The ensemble contains L=10 hierarchical clusterers, the numbers are average and standard deviation of 100 trials. Dataset

1

2

3

4

5

6

7

8

9

10

11

12

Method 1

0.7209+ 0.6982+ 0.6964+ 0.6919+ 0.7533+ 0.2775+ 0.6760+ 0.5167+ 0.6863+ 0.6020+ 0.6682+ 0.6731+ 0.0015 0.0028 0.0141 0.0008 0.0018 0.0098 0.0016 0.0037 0.0018 0.0020 0.0013 0.0045

2

0.7201+ 0.6964+ 0.6934+ 0.6913+ 0.7524+ 0.2768+ 0.6749+ 0.5136+ 0.6851+ 0.6013+ 0.6669+ 0.6691+ 0.0011 0.0019 0.0114 0.0003 0.0018 0.0106 0.0014 0.0033 0.0018 0.0021 0.0005 0.0033

3

3   0.7219±0.0022  0.6993±0.0040  0.6965±0.0128  0.6927±0.0012  0.7548±0.0030  0.2956±0.0080  0.6771±0.0024  0.5225±0.0074  0.6876±0.0030  0.6109±0.0072  0.6693±0.0018  0.6749±0.0054
4   0.7216±0.0018  0.6990±0.0035  0.6964±0.0142  0.6925±0.0012  0.7548±0.0031  0.2745±0.0055  0.6768±0.0021  0.5189±0.0045  0.6869±0.0026  0.6034±0.0021  0.6690±0.0016  0.6742±0.0049
5   0.7221±0.0042  0.7372±0.0022  0.6806±0.0079  0.7855±0.0050  0.7585±0.0022  0.5097±0.0100  0.6770±0.0008  0.5692±0.0249  0.7208±0.0037  0.6106±0.0036  0.6844±0.0051  0.7376±0.0108
6   0.7209±0.0037  0.7330±0.0042  0.6776±0.0055  0.8305±0.0279  0.7555±0.0028  0.4821±0.0194  0.6754±0.0011  0.5634±0.0274  0.7149±0.0049  0.6119±0.0044  0.6830±0.0103  0.7100±0.0130
7   0.7234±0.0031  0.7196±0.0073  0.6837±0.0066  0.7735±0.0062  0.7565±0.0030  0.4047±0.0590  0.6770±0.0013  0.5370±0.0114  0.7170±0.0071  0.6186±0.0035  0.6935±0.0081  0.7116±0.0120
8   0.7229±0.0034  0.7358±0.0015  0.6817±0.0052  0.7904±0.0036  0.7573±0.0025  0.6807±0.0134  0.6773±0.0010  0.5971±0.0405  0.7248±0.0048  0.6266±0.0044  0.7078±0.0078  0.7316±0.0118
9   0.7200±0.0017  0.7094±0.0087  0.6851±0.0090  0.7705±0.0077  0.7577±0.0022  0.4879±0.0357  0.6764±0.0015  0.6753±0.0394  0.7239±0.0040  0.6209±0.0042  0.7530±0.0063  0.7477±0.0150
10  0.7203±0.0031  0.6970±0.0039  0.6832±0.0097  0.8002±0.0457  0.7555±0.0060  0.4678±0.0340  0.6743±0.0003  0.6363±0.0381  0.7139±0.0069  0.6364±0.0096  0.7582±0.0353  0.7243±0.0218
11  0.7230±0.0031  0.7104±0.0065  0.6857±0.0105  0.7379±0.0094  0.7554±0.0022  0.3569±0.0365  0.6766±0.0014  0.5606±0.0237  0.7151±0.0081  0.6296±0.0048  0.7248±0.0095  0.7194±0.0153
12  0.7224±0.0032  0.7150±0.0076  0.6827±0.0055  0.7693±0.0081  0.7567±0.0025  0.6575±0.0167  0.6767±0.0014  0.7475±0.0258  0.7223±0.0052  0.6485±0.0080  0.7513±0.0066  0.7482±0.0130
13  0.7192±0.0016  0.7138±0.0171  0.6860±0.0015  0.7347±0.0167  0.7555±0.0024  0.6376±0.0083  0.6764±0.0012  0.8327±0.0282  0.7278±0.0049  0.6305±0.0049  0.7530±0.0057  0.7648±0.0132
14  0.7288±0.0094  0.7340±0.0059  0.6864±0.0023  0.8090±0.0417  0.7552±0.0061  0.6093±0.0177  0.6743±0.0004  0.7977±0.0397  0.7146±0.0070  0.6468±0.0122  0.7546±0.0321  0.7650±0.0219
15  0.7227±0.0026  0.7280±0.0112  0.6823±0.0053  0.7586±0.0082  0.7561±0.0028  0.6605±0.0561  0.6766±0.0014  0.7163±0.0649  0.7164±0.0080  0.6503±0.0095  0.7200±0.0093  0.7320±0.0181
16  0.7232±0.0029  0.7343±0.0031  0.6860±0.0013  0.7767±0.0085  0.7575±0.0025  0.6979±0.0053  0.6768±0.0015  0.8367±0.0284  0.7275±0.0062  0.6584±0.0066  0.7537±0.0056  0.7716±0.0119
17  0.7195±0.0015  0.6976±0.0032  0.6979±0.0170  0.7707±0.0067  0.7580±0.0021  0.3112±0.0182  0.6763±0.0016  0.5288±0.0111  0.7208±0.0040  0.6122±0.0058  0.7425±0.0071  0.7294±0.0128
18  0.7191±0.0012  0.6952±0.0015  0.6921±0.0195  0.7839±0.0419  0.7538±0.0039  0.3202±0.0147  0.6745±0.0007  0.5275±0.0161  0.7061±0.0076  0.6224±0.0088  0.7391±0.0302  0.6948±0.0145
19  0.7223±0.0025  0.7046±0.0054  0.6996±0.0149  0.7289±0.0094  0.7555±0.0026  0.2877±0.0091  0.6766±0.0015  0.5312±0.0104  0.7066±0.0078  0.6124±0.0039  0.7119±0.0099  0.7023±0.0133
20  0.7218±0.0019  0.7026±0.0050  0.6986±0.0168  0.7574±0.0088  0.7564±0.0025  0.2841±0.0054  0.6768±0.0016  0.5335±0.0109  0.7181±0.0071  0.6212±0.0043  0.7417±0.0076  0.7165±0.0117
21  0.7209±0.0028  0.7318±0.0074  0.6865±0.0051  0.7678±0.0137  0.7579±0.0018  0.6035±0.0192  0.6764±0.0011  0.7861±0.0234  0.7291±0.0044  0.6296±0.0032  0.7570±0.0047  0.7574±0.0151
22  0.7246±0.0062  0.7332±0.0038  0.6873±0.0055  0.8467±0.0390  0.7560±0.0051  0.5598±0.0202  0.6742±0.0001  0.7449±0.0290  0.7164±0.0055  0.6410±0.0093  0.7670±0.0333  0.7418±0.0150
23  0.7216±0.0023  0.7173±0.0104  0.6797±0.0058  0.7600±0.0078  0.7567±0.0026  0.5630±0.0769  0.6767±0.0014  0.6764±0.0627  0.7192±0.0071  0.6457±0.0107  0.7306±0.0087  0.7244±0.0168
24  0.7243±0.0031  0.7355±0.0015  0.6858±0.0021  0.7866±0.0051  0.7583±0.0023  0.6954±0.0057  0.6767±0.0014  0.7943±0.0164  0.7285±0.0057  0.6570±0.0051  0.7577±0.0047  0.7613±0.0118

Average  0.72198  0.71576  0.68797  0.7628   0.75606  0.47508  0.67616  0.63601  0.71374  0.62701  0.72326  0.72429
NonEns   0.63799  0.6466   0.59623  0.70636  0.67049  0.61061  0.58747  0.69232  0.64623  0.56234  0.66926  0.68652

Table 6. The results of the different algorithms on the large datasets in terms of FScore. The ensemble contains L = 10 hierarchical clusterers; each entry is the average ± standard deviation over 10 trials. Rows correspond to Methods 1-24 and columns to Datasets 1-12.

Method
1   0.6946±0.0005  0.7150±0.0003  0.5950±0.0011  0.4028±0.0008  0.8992±0.0011  0.6911±0.0015  0.9370±0.0028  0.7027±0.0141  0.8075±0.0008  0.7103±0.0018  0.4714±0.0098  0.5035±0.0016
2   0.6945±0.0005  0.7150±0.0003  0.5943±0.0009  0.4023±0.0008  0.8992±0.0011  0.6904±0.0011  0.9351±0.0019  0.6998±0.0114  0.8069±0.0003  0.7094±0.0018  0.4708±0.0106  0.5023±0.0014
3   0.6950±0.0007  0.7151±0.0004  0.5952±0.0012  0.4042±0.0014  0.8985±0.0010  0.6921±0.0022  0.9381±0.0040  0.7029±0.0128  0.8083±0.0012  0.7118±0.0030  0.4895±0.0080  0.5045±0.0024
4   0.6950±0.0009  0.7151±0.0003  0.5950±0.0011  0.4042±0.0011  0.8987±0.0010  0.6918±0.0018  0.9377±0.0035  0.7028±0.0142  0.8081±0.0012  0.7118±0.0031  0.4684±0.0055  0.5043±0.0021
5   0.6946±0.0005  0.7168±0.0003  0.5943±0.0012  0.4037±0.0003  0.9026±0.0019  0.6924±0.0042  0.9760±0.0022  0.6869±0.0079  0.9010±0.0050  0.7155±0.0022  0.7037±0.0100  0.5044±0.0008
6   0.6940±0.0001  0.7154±0.0005  0.5939±0.0007  0.4030±0.0008  0.9029±0.0015  0.6911±0.0037  0.9718±0.0042  0.6839±0.0055  0.9460±0.0279  0.7125±0.0028  0.6760±0.0194  0.5028±0.0011
7   0.6974±0.0013  0.7157±0.0004  0.5950±0.0013  0.4092±0.0053  0.9035±0.0012  0.6936±0.0031  0.9583±0.0073  0.6900±0.0066  0.8891±0.0062  0.7135±0.0030  0.5986±0.0590  0.5045±0.0013
8   0.6951±0.0005  0.7162±0.0005  0.5947±0.0010  0.4131±0.0062  0.9058±0.0011  0.6931±0.0034  0.9746±0.0015  0.6881±0.0052  0.9060±0.0036  0.7143±0.0025  0.8746±0.0134  0.5047±0.0010
9   0.6948±0.0007  0.7158±0.0007  0.5949±0.0016  0.4039±0.0016  0.9053±0.0025  0.6903±0.0017  0.9481±0.0087  0.6914±0.0090  0.8861±0.0077  0.7147±0.0022  0.6819±0.0357  0.5038±0.0015
10  0.7018±0.0095  0.7150±0.0002  0.5946±0.0009  0.4256±0.0141  0.9072±0.0020  0.6905±0.0031  0.9358±0.0039  0.6895±0.0097  0.9158±0.0457  0.7124±0.0060  0.6618±0.0340  0.5018±0.0003
11  0.6979±0.0015  0.7157±0.0004  0.5953±0.0014  0.4264±0.0072  0.9064±0.0016  0.6932±0.0031  0.9492±0.0065  0.6920±0.0105  0.8534±0.0094  0.7124±0.0022  0.5509±0.0365  0.5041±0.0014
12  0.6966±0.0019  0.7155±0.0002  0.5952±0.0014  0.4606±0.0066  0.9131±0.0030  0.6927±0.0032  0.9537±0.0076  0.6890±0.0055  0.8848±0.0081  0.7137±0.0025  0.8514±0.0167  0.5041±0.0014
13  0.6946±0.0009  0.7176±0.0004  0.5952±0.0013  0.4850±0.0094  0.9060±0.0024  0.6895±0.0016  0.9525±0.0171  0.6924±0.0015  0.8503±0.0167  0.7125±0.0024  0.8315±0.0083  0.5039±0.0012
14  0.6995±0.0073  0.7151±0.0004  0.5943±0.0010  0.4796±0.0154  0.9076±0.0024  0.6990±0.0094  0.9728±0.0059  0.6927±0.0023  0.9246±0.0417  0.7122±0.0061  0.8032±0.0177  0.5018±0.0004
15  0.6959±0.0010  0.7153±0.0002  0.5953±0.0011  0.4202±0.0211  0.9137±0.0048  0.6929±0.0026  0.9668±0.0112  0.6886±0.0053  0.8742±0.0082  0.7131±0.0028  0.8544±0.0561  0.5040±0.0014
16  0.6960±0.0016  0.7155±0.0004  0.5951±0.0012  0.5065±0.0074  0.9173±0.0028  0.6934±0.0029  0.9731±0.0031  0.6924±0.0013  0.8923±0.0085  0.7145±0.0025  0.8918±0.0053  0.5042±0.0015
17  0.6949±0.0005  0.7156±0.0006  0.5949±0.0013  0.4022±0.0014  0.9033±0.0021  0.6897±0.0015  0.9363±0.0032  0.7042±0.0170  0.8863±0.0067  0.7150±0.0021  0.5051±0.0182  0.5037±0.0016
18  0.6942±0.0005  0.7150±0.0002  0.5946±0.0016  0.4007±0.0007  0.9048±0.0024  0.6893±0.0012  0.9340±0.0015  0.6985±0.0195  0.8995±0.0419  0.7108±0.0039  0.5141±0.0147  0.5019±0.0007
19  0.6958±0.0015  0.7158±0.0005  0.5953±0.0017  0.4088±0.0040  0.9022±0.0012  0.6925±0.0025  0.9434±0.0054  0.7059±0.0149  0.8444±0.0094  0.7125±0.0026  0.4816±0.0091  0.5040±0.0015
20  0.6958±0.0014  0.7157±0.0002  0.5950±0.0012  0.4067±0.0028  0.9048±0.0018  0.6921±0.0019  0.9413±0.0050  0.7049±0.0168  0.8729±0.0088  0.7134±0.0025  0.4780±0.0054  0.5042±0.0016
21  0.6944±0.0003  0.7176±0.0005  0.5961±0.0026  0.4716±0.0115  0.9062±0.0027  0.6912±0.0028  0.9705±0.0074  0.6929±0.0051  0.8834±0.0137  0.7149±0.0018  0.7975±0.0192  0.5038±0.0011
22  0.6957±0.0036  0.7149±0.0000  0.5950±0.0012  0.4547±0.0159  0.9078±0.0026  0.6948±0.0062  0.9719±0.0038  0.6937±0.0055  0.9622±0.0390  0.7130±0.0051  0.7537±0.0202  0.5017±0.0001
23  0.6959±0.0011  0.7153±0.0002  0.5951±0.0011  0.4164±0.0151  0.9139±0.0043  0.6918±0.0023  0.9561±0.0104  0.6860±0.0058  0.8755±0.0078  0.7137±0.0026  0.7569±0.0769  0.5042±0.0014
24  0.6957±0.0013  0.7156±0.0004  0.5954±0.0013  0.5044±0.0088  0.9169±0.0032  0.6945±0.0031  0.9742±0.0015  0.6922±0.0021  0.9022±0.0051  0.7153±0.0023  0.8893±0.0057  0.5042±0.0014

Average  0.69581  0.71564  0.59495  0.42982  0.90612  0.69221  0.95451  0.69431  0.87837  0.71305  0.669    0.5036
NonEns   0.65474  0.67671  0.58597  0.46696  0.87034  0.65109  0.89699  0.66273  0.76749  0.67026  0.8314   0.46346
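The FScore reported in Tables 5 and 6 compares the recovered clusters with the true class labels. The exact variant applied to the hierarchies is not restated at this point, so the following is only a minimal sketch of the commonly used class-weighted FScore for a flat labelling (the function name and this particular definition are our illustration, not necessarily the implementation used for these tables):

    import numpy as np

    def fscore(labels_true, labels_pred):
        """Class-weighted FScore of a predicted clustering against true class labels."""
        n = len(labels_true)
        total = 0.0
        for c in np.unique(labels_true):
            in_class = labels_true == c
            best = 0.0
            for k in np.unique(labels_pred):
                in_cluster = labels_pred == k
                n_ck = np.sum(in_class & in_cluster)
                if n_ck == 0:
                    continue
                precision = n_ck / np.sum(in_cluster)
                recall = n_ck / np.sum(in_class)
                best = max(best, 2 * precision * recall / (precision + recall))
            total += np.sum(in_class) / n * best   # weight each class by its size
        return total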


Table 7. The relative performance of the different combination algorithms on the small datasets. Rows correspond to Datasets 1-12; the 24 columns correspond to Methods 1-24.

Dataset 1   0.98913 0.98814 0.99049 0.99017 0.99088 0.98911 0.9926 0.99191 0.98799 0.98838 0.99206 0.99128 0.9869 1 0.99164 0.99231 0.98725 0.98667 0.99108 0.99048 0.98924 0.9942 0.99017 0.99385
Dataset 2   0.94709 0.94459 0.94854 0.94812 1 0.9943 0.97604 0.99809 0.96217 0.94544 0.96362 0.96976 0.96815 0.99565 0.9875 0.996 0.94619 0.94298 0.95576 0.95297 0.9926 0.99447 0.97297 0.99763
Dataset 3   0.99547 0.99124 0.99567 0.99555 0.97286 0.9686 0.9773 0.97453 0.9793 0.97658 0.98019 0.97584 0.98066 0.98112 0.97532 0.98068 0.99764 0.9894 1 0.99861 0.98137 0.9825 0.97159 0.98036
Dataset 4   0.81724 0.81654 0.81817 0.81791 0.92771 0.98087 0.9136 0.93359 0.9101 0.94517 0.87152 0.90859 0.86776 0.95556 0.89599 0.91739 0.91027 0.92593 0.86086 0.89454 0.90687 1 0.89762 0.92905
Dataset 5   0.9931 0.99191 0.99515 0.99511 1 0.99603 0.99733 0.99847 0.99901 0.99598 0.99595 0.99768 0.99605 0.99571 0.99687 0.99875 0.99931 0.99387 0.99601 0.99725 0.99917 0.9967 0.99769 0.99975
Dataset 6   0.39757 0.39667 0.42353 0.3933 0.73035 0.6907 0.57988 0.9753 0.69912 0.67032 0.51141 0.9421 0.91351 0.87301 0.94638 1 0.44589 0.45876 0.41221 0.40701 0.86476 0.80204 0.80665 0.99638
Dataset 7   0.99814 0.9965 0.99974 0.99935 0.99959 0.99722 0.99962 1 0.99866 0.99562 0.99905 0.99914 0.99873 0.99566 0.999 0.99927 0.99856 0.99585 0.99898 0.99929 0.99868 0.99552 0.99918 0.99921
Dataset 8   0.61754 0.61392 0.6245 0.62022 0.68029 0.67338 0.64185 0.71363 0.80716 0.76049 0.67004 0.89346 0.99522 0.9534 0.85614 1 0.63199 0.6305 0.63494 0.63768 0.93955 0.89035 0.8084 0.94932
Dataset 9   0.94132 0.93963 0.94306 0.94213 0.98862 0.98047 0.98348 0.99405 0.99288 0.9791 0.98086 0.99071 0.99829 0.98016 0.98265 0.9978 0.98856 0.96846 0.96918 0.98496 1 0.9826 0.98641 0.99916
Dataset 10  0.91428 0.91319 0.92786 0.91646 0.92741 0.92931 0.9395 0.95169 0.94296 0.96661 0.95621 0.98494 0.95764 0.98241 0.9877 1 0.92981 0.94535 0.93016 0.9435 0.9562 0.97357 0.98066 0.99782
Dataset 11  0.87128 0.86956 0.87271 0.87235 0.89237 0.89054 0.90419 0.92289 0.98179 0.98853 0.94504 0.97961 0.98183 0.98388 0.93884 0.98275 0.96806 0.96363 0.9282 0.96702 0.98701 1 0.95266 0.98796
Dataset 12  0.87231 0.86718 0.87459 0.87374 0.95587 0.92021 0.92216 0.94813 0.96906 0.93863 0.9323 0.96961 0.99117 0.9914 0.9487 1 0.94528 0.90043 0.91022 0.92858 0.98151 0.96131 0.93881 0.98669
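Tables 7 and 8 do not restate how the relative performance is computed; however, every dataset row contains exactly one entry equal to 1, which is consistent with dividing each method's average FScore by the best average FScore obtained on that dataset. A one-line sketch under that assumption (the normalisation rule is our reading of the tables, not a statement from the paper):

    import numpy as np

    def relative_performance(fscores):
        # fscores: (n_datasets, n_methods) array of average FScore values.
        # Dividing each row by its maximum makes the best method score 1 on that dataset.
        return fscores / fscores.max(axis=1, keepdims=True)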


Table 8. The relative performance of the different combination algorithms on the large datasets. Rows correspond to Datasets 1-12; the 24 columns correspond to Methods 1-24.

Dataset 1   0.98968 0.98953 0.99034 0.99027 0.98975 0.98889 0.99371 0.99042 0.98995 1 0.99443 0.99263 0.9897 0.99668 0.99157 0.99167 0.99012 0.98923 0.99149 0.99143 0.98939 0.99135 0.99157 0.99125
Dataset 2   0.99641 0.99641 0.99655 0.99649 0.99886 0.99688 0.99727 0.99808 0.99752 0.99642 0.99738 0.99711 0.99997 0.99646 0.99676 0.99705 0.99713 0.99641 0.99748 0.99729 1 0.99627 0.99675 0.99723
Dataset 3   0.99829 0.99699 0.99858 0.99824 0.99713 0.99646 0.99816 0.99773 0.99812 0.99752 0.9987 0.99849 0.99851 0.9971 0.99879 0.99839 0.99804 0.9976 0.99881 0.99816 1 0.99822 0.99845 0.99888
Dataset 4   0.79537 0.79434 0.79803 0.79804 0.797 0.79574 0.80798 0.81565 0.79752 0.84027 0.84194 0.90947 0.95772 0.94703 0.82975 1 0.79408 0.79121 0.80714 0.80296 0.93108 0.89777 0.82219 0.9959
Dataset 5   0.9802 0.98021 0.97951 0.97969 0.984 0.98429 0.98499 0.98747 0.98689 0.98899 0.98805 0.99537 0.98769 0.9894 0.99601 1 0.9847 0.98634 0.98354 0.9864 0.98791 0.9896 0.99628 0.99958
Dataset 6   0.98867 0.98764 0.99008 0.98975 0.99049 0.98865 0.99228 0.99156 0.98748 0.98789 0.99173 0.99091 0.98634 1 0.99128 0.99199 0.9867 0.98611 0.9907 0.99007 0.98878 0.99396 0.98975 0.99358
Dataset 7   0.96003 0.95814 0.96113 0.96081 1 0.99569 0.9819 0.99855 0.97142 0.95879 0.97252 0.97716 0.97594 0.99671 0.99056 0.99698 0.95935 0.95693 0.96658 0.96447 0.99441 0.99582 0.97958 0.99821
Dataset 8   0.99551 0.99132 0.99571 0.99559 0.9731 0.96888 0.9775 0.97476 0.97948 0.97679 0.98036 0.97605 0.98083 0.98129 0.97554 0.98085 0.99766 0.9895 1 0.99862 0.98154 0.98265 0.97185 0.98053
Dataset 9   0.83919 0.83857 0.84001 0.83978 0.9364 0.98317 0.92398 0.94157 0.9209 0.95176 0.88695 0.91957 0.88365 0.9609 0.90848 0.92731 0.92105 0.93483 0.87757 0.90721 0.91806 1 0.90992 0.93757
Dataset 10  0.99269 0.99142 0.99486 0.99482 1 0.99579 0.99717 0.99838 0.99895 0.99574 0.99571 0.99755 0.99581 0.99545 0.99669 0.99867 0.99927 0.9935 0.99577 0.99708 0.99912 0.9965 0.99756 0.99973
Dataset 11  0.52856 0.52786 0.54888 0.52522 0.78899 0.75795 0.67123 0.98067 0.76454 0.74201 0.61765 0.95469 0.93232 0.90062 0.95804 1 0.56637 0.57645 0.54002 0.53596 0.89417 0.84509 0.84869 0.99717
Dataset 12  0.99751 0.99531 0.99965 0.99913 0.99945 0.99627 0.99949 1 0.99821 0.99412 0.99873 0.99885 0.99829 0.99417 0.99866 0.99902 0.99807 0.99444 0.99863 0.99905 0.99822 0.99399 0.99889 0.99893


Table 9. The summarized results of Table 5. For each method, b, w and t are the number of times it was better, the number of times it was worse, and the total index, respectively.

Method    b    w     t
24      239   37   202
16      236   40   196
8       186   90    96
22      185   91    94
21      185   91    94
12      183   93    90
14      180   96    84
13      167  109    58
5       162  114    48
15      155  121    34
9       148  128    20
23      145  131    14
7       139  137     2
20      135  141    -6
17      125  151   -26
11      123  153   -30
10      111  165   -54
19      103  173   -70
6       102  174   -72
3        86  190  -104
18       79  197  -118
4        65  211  -146
1        45  231  -186
2        28  248  -220
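In Tables 9 and 10, b + w equals 276 = 12 datasets × 23 competing methods in almost every row (two rows of Table 10 sum to 275, presumably because of ties), and t = b - w. This is consistent with counting, per dataset, how many of the other 23 methods a given method outperforms. A sketch under that reading (the tie-handling rule is our assumption, since it is not specified here):

    import numpy as np

    def summarize(fscores):
        """fscores: (n_datasets, n_methods) average FScore; returns (b, w, t) per method."""
        n_datasets, n_methods = fscores.shape
        b = np.zeros(n_methods, dtype=int)
        w = np.zeros(n_methods, dtype=int)
        for d in range(n_datasets):
            row = fscores[d]
            for i in range(n_methods):
                b[i] += np.sum(row[i] > row)   # pairs where method i is strictly better
                w[i] += np.sum(row[i] < row)   # pairs where method i is strictly worse
        return b, w, b - w                     # t = b - w is the "total index"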


Table 10. The summarized results of Table 6. For each method, b, w and t are the number of times it was better, the number of times it was worse, and the total index, respectively.

Method    b    w     t
24      225   51   174
16      214   62   152
8       184   92    92
12      178   98    80
21      169  107    62
15      168  108    60
14      165  111    54
11      162  114    48
7       161  115    46
22      159  117    42
23      153  123    30
5       150  126    24
13      144  132    12
20      143  133    10
19      140  136     4
9       120  156   -36
3       113  163   -50
10      112  164   -52
17      109  167   -58
4        91  185   -94
6        88  188  -100
1        64  211  -147
18       61  215  -154
2        38  237  -199


Figure captions

Figure 1. Hierarchical clustering combination algorithm.
Figure 2. A dendrogram with the terminal vertices i-p.
Figure 3. Different steps of the HCC algorithm. (a,b,c) three dendrograms to be combined. (d,e,f) the CD description matrices of dendrograms a, b, c. (g) consensus matrix derived from the summation of d, e, and f. (h,i) the intermediate matrices of the single-linkage hierarchical clustering algorithm. (j) ultrametric matrix derived from g by the SL algorithm. (k) the dendrogram corresponding to the matrix shown in j.
Figure 4. Winning frequencies of different HCC algorithms on the 12 small datasets.
Figure 5. Winning frequencies of different HCC algorithms on the 12 large datasets.
Figure 6. Robustness of different HCC algorithms on the small datasets.
Figure 7. Robustness of different HCC algorithms on the large datasets.
Figure 8. Results of comparing different combination methods (flat, or hierarchical with different dendrogram descriptors) on the small datasets, in terms of the Nt(i) value.
Figure 9. Results of comparing different combination methods (flat, or hierarchical with different dendrogram descriptors) on the large datasets, in terms of the Nt(i) value.


Algorithm HCC
Input: Dataset Z = {z1, z2, ..., zN}; ensemble size L
1- Generate L hierarchical clusterings of Z, H^(i), 1 ≤ i ≤ L, each in the form of a dendrogram.
2- Create a descriptor matrix D^(i) for each hierarchy (dendrogram) H^(i):  D^(i) = f(H^(i)),  1 ≤ i ≤ L.
3- Aggregate the descriptor matrices into a final descriptor matrix D:  D = Aggregate(D^(1), D^(2), ..., D^(L)).
4- Create the final dendrogram from D (this is done using a standard hierarchical clustering algorithm):  H = f^(-1)(D).

Figure 1. Hierarchical clustering combination algorithm.
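For concreteness, the sketch below shows one way the four steps above can be realised with SciPy. It uses the cophenetic-distance matrix as the dendrogram descriptor, diversifies the ensemble by varying the linkage rule and bootstrapping the features, and recovers the final dendrogram with single linkage; these particular choices (and the helper name hcc) are illustrative assumptions, not the exact configuration used in the experiments.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    def hcc(Z, L=10, base_methods=("single", "complete", "average", "ward")):
        """Combine L hierarchical clusterings of the data matrix Z (N x d)."""
        descriptors = []
        for i in range(L):
            # Step 1: generate one hierarchical clustering (vary the linkage rule
            # and bootstrap the feature set to obtain a diverse ensemble).
            method = base_methods[i % len(base_methods)]
            cols = np.random.choice(Z.shape[1], Z.shape[1], replace=True)
            H_i = linkage(pdist(Z[:, cols]), method=method)
            # Step 2: describe the dendrogram by a matrix (here: cophenetic
            # distances, returned by SciPy in condensed form).
            descriptors.append(cophenet(H_i))
        # Step 3: aggregate the descriptor matrices by summation.
        D = np.sum(descriptors, axis=0)
        # Step 4: recover the final dendrogram from the consensus matrix.
        return linkage(D, method="single")

The returned linkage matrix plays the role of the combined hierarchy H; it can be cut at any level or drawn with scipy.cluster.hierarchy.dendrogram.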


Figure 2. A dendrogram with the terminal vertices i-p

(Figure 3: panels (a), (b), (c) and (k) are dendrogram drawings over the objects k, l, m, n; the three initial dendrograms have merge levels 1-3 and the final dendrogram (k) merges at the consensus levels 4, 6 and 8. The matrices shown in the remaining panels are reproduced below.)

(d) CD matrix of dendrogram (a), rows/columns ordered k, l, m, n:
    0 2 2 3
    2 0 1 3
    2 1 0 3
    3 3 3 0
(e) CD matrix of dendrogram (b):
    0 1 2 3
    1 0 2 3
    2 2 0 3
    3 3 3 0
(f) CD matrix of dendrogram (c):
    0 1 3 2
    1 0 3 2
    3 3 0 3
    2 2 3 0
(g) Consensus matrix = (d) + (e) + (f):
    0 4 7 8
    4 0 6 8
    7 6 0 9
    8 8 9 0
(h) After the first single-linkage merge, rows/columns {k,l}, m, n:
    0 6 8
    6 0 9
    8 9 0
(i) After the second merge, rows/columns {k,l,m}, n:
    0 8
    8 0
(j) Ultrametric matrix obtained from (g) by single linkage, rows/columns k, l, m, n:
    0 4 6 8
    4 0 6 8
    6 6 0 8
    8 8 8 0

Figure 3. Different steps of the HCC algorithm. (a,b,c) three dendrograms to be combined. (d,e,f) the CD description matrices of dendrograms a, b, c. (g) consensus matrix derived from the summation of d, e, and f. (h,i) the intermediate matrices of the single-linkage hierarchical clustering algorithm. (j) ultrametric matrix derived from g by the SL algorithm. (k) the dendrogram corresponding to the matrix shown in j.
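The toy example of Figure 3 can be verified numerically. The short script below (our own illustration, not code from the paper) sums the three CD matrices and applies single linkage to the consensus matrix, which reproduces the ultrametric matrix of panel (j):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import squareform

    D_a = np.array([[0, 2, 2, 3], [2, 0, 1, 3], [2, 1, 0, 3], [3, 3, 3, 0]])
    D_b = np.array([[0, 1, 2, 3], [1, 0, 2, 3], [2, 2, 0, 3], [3, 3, 3, 0]])
    D_c = np.array([[0, 1, 3, 2], [1, 0, 3, 2], [3, 3, 0, 3], [2, 2, 3, 0]])

    consensus = (D_a + D_b + D_c).astype(float)    # panel (g)
    H = linkage(squareform(consensus), method="single")
    ultrametric = squareform(cophenet(H))          # panel (j)
    print(ultrametric)
    # [[0. 4. 6. 8.]
    #  [4. 0. 6. 8.]
    #  [6. 6. 0. 8.]
    #  [8. 8. 8. 0.]]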

(Plot: winning frequency of each method; x-axis: Method No (1-24), y-axis: Winning Frequency.)

Figure 4. Winning frequencies of different HCC algorithms on the 12 small datasets.

42 0.25

Winning Frequency

0.2

0.15

0.1

0.05

0

1 2 3 4 5 6 7 8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Method No

Figure5. Winning frequencies of different HCC algorithms on 12 large datasets.

(Plot: robustness of error of each method; x-axis: Method No (1-24), y-axis: Robustness of error; one series per small dataset: breast cancer, flare solar, titanic, ionosphere, wpbc, image segmentation, liver disorders, wine, laryngeal1, laryngeal3, weaning, contractions.)

Figure 6. Robustness of different HCC algorithms on the small datasets.

(Plot: robustness of error of each method; x-axis: Method No (1-24), y-axis: Robustness of error; one series per large dataset: diabetis, german, balance scale, vehicle, laryngeal2, banana, ringnorm, splice, twonorm, waveform, page blocks, waveform noise.)

Figure 7. Robustness of different HCC algorithms on the large datasets.


Figure 8. Results of comparing different combination methods (flat, or hierarchical with different dendrogram descriptors) on the small datasets, in terms of the Nt(i) value.


Figure 9. Results of comparing different combination methods (flat, or hierarchical with different dendrogram descriptors) on the large datasets, in terms of the Nt(i) value.