Hybrid Manifold Clustering with Evolutionary Tuning

Oliver Kramer
Computational Intelligence Group, Department of Computer Science
University of Oldenburg, Germany
[email protected]

Abstract. Manifold clustering, also known as submanifold learning, is the task of embedding patterns in submanifolds with different characteristics. This paper proposes a hybrid approach of clustering the data set, computing a global map of the cluster centers, embedding each cluster, and then merging the scaled submanifolds with the global map. We introduce various instantiations of clustering and embedding algorithms based on hybridizations of k-means, principal component analysis, isometric mapping, and locally linear embedding. A (1+1)-ES is employed to tune the submanifolds by rotation and scaling. The submanifold learning algorithms are compared w.r.t. the nearest neighbor classification performance on various experimental data sets.

Key words: Manifold clustering, dimensionality reduction, evolutionary tuning

1 Introduction

In dimensionality reduction (DR), the task is to embed high-dimensional patterns y_1, ..., y_N ∈ R^d into low-dimensional latent spaces, either by learning an explicit mapping F : R^d → R^q or by finding low-dimensional counterparts x_1, ..., x_N ∈ R^q with q < d that conserve useful information of their high-dimensional counterparts. The DR problem has been studied intensively in the past decade, but still turns out to be comparatively difficult to solve. Methods like principal component analysis (PCA) [1], isometric mapping (ISOMAP) [2], and locally linear embedding (LLE) [3] learn a global map of all patterns. For some purposes, it might not be appropriate to embed all patterns within one manifold. This can have various reasons. In visualization, for example, a global map that puts all patterns into relation may not be required, but rather a map that separates the groups of patterns and provides a reasonable embedding within each group. Patterns may lie in submanifolds with different characteristics that require individual parameterizations for their optimal representation. Further, the runtime of many DR methods grows quadratically or cubically and does not scale well for large data sets. Embedding smaller groups of patterns accelerates the learning process and can additionally be parallelized on multicore machines.
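As a point of reference, such global embeddings can be computed with a few lines of scikit-learn. The following sketch maps all patterns into a two-dimensional latent space with PCA, ISOMAP, and LLE; the Digits data only serve as a stand-in for the patterns y_1, ..., y_N, and the neighborhood sizes are illustrative assumptions.

```python
# Minimal sketch: global two-dimensional embeddings of all patterns with PCA,
# ISOMAP, and LLE via scikit-learn. The Digits data only serve as a stand-in
# for the patterns y_1, ..., y_N; neighborhood sizes are illustrative.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

Y = load_digits().data  # N patterns in R^d, here d = 64

X_pca = PCA(n_components=2).fit_transform(Y)
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(Y)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(Y)

print(X_pca.shape, X_iso.shape, X_lle.shape)  # each (N, 2)
```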

In this paper, an easy-to-implement approach for manifold clustering is introduced. It first clusters the data to assign the patterns to submanifolds. Then, the cluster centers are embedded with a DR approach to establish a global map structure. The patterns within each submanifold are embedded separately. This makes it possible to employ different dimensionality reduction algorithms and different parameters within each submanifold. Finally, the separate embeddings are combined around their latent cluster centers into one global map. To avoid overlaps and to consider inter-cluster neighborhood relations, the submanifolds are rotated and scaled with evolution strategies (ES) [4]. This paper is structured as follows. In Section 2, we introduce the manifold clustering problem and present related work. The hybrid manifold clustering (HMC) algorithm is introduced in Section 3 and experimentally analyzed in Section 4 w.r.t. different submanifold measures. Conclusions are drawn in Section 5.

2 Manifold Clustering

In this section, we introduce the manifold clustering problem, present related work, and introduce the DR methods that form the basis of the HMC approach.

2.1 Problem Definition

Observed patterns Y = [y_i]_{i=1}^N ∈ R^d may lie in different submanifolds. Let k be the number of potential submanifolds {M_j}_{j=1}^k. Each submanifold may have its own intrinsic dimension, i.e., the number of features that is necessary to represent the main characteristics of a subset of patterns. The manifold clustering problem is the task of simultaneously assigning the patterns to clusters and solving the DR problem within each submanifold. The DR problem is to find a low-dimensional representation X = [x_i]_{i=1}^N ∈ R^q of the high-dimensional patterns with q ≪ d such that their most important characteristics like pattern distances and neighborhoods are maintained. Such characteristics can be measured with pattern similarities or, if labels are available, with classification accuracies. The problem of simultaneously learning submanifolds and their embeddings is difficult to solve. Clusters have to be identified, and low-dimensional representations of the patterns have to be learned. Further, different parameters can be chosen in each cluster. Vidal [5] summarizes the challenges of the manifold clustering problem. An essential characteristic of manifold clustering is the coupling between the clustering of patterns and model estimation. A known assignment of patterns to clusters simplifies the model estimation process, just as known models would allow determining the assignment to manifolds; in general, however, the assignment to clusters is unknown. A further problem in manifold clustering is the closeness of subspaces or their overlapping. Submanifolds may exhibit different characteristics and different intrinsic dimensionalities.

2.2 Related Work

Manifold clustering algorithms based on algebraic methods employ matrix factorization, e.g., by Costeira and Kanade [6] and by Gear [7], or employ polynomial algebra [8]. Kushnir et al. [9] introduce a submanifold learning variant based on density, shape, intrinsic dimensionality, and orientation. Iterative methods extend k-means, alternately fitting local PCA models to each submanifold and then assigning each pattern to its closest submanifold, e.g., k-planes [10] or k-subspaces [11]. Another iterative solution construction algorithm has been proposed by Kramer [12] based on unsupervised nearest neighbors [13] and an iterative k-means variant. For handling noise, statistical models like mixtures of probabilistic PCA [14] assume that the data within submanifolds are generated by independent Gaussian distributions and employ the maximum likelihood principle. Closely related to this work are evolutionary submanifold learning algorithms that choose the employed attributes with evolutionary algorithms [15, 16]. For example, Vahdat et al. [16] use evolutionary multi-objective search to balance intra-cluster distance and connectedness of clusters. A further introduction to submanifold learning can be found in Vidal [5] and von Luxburg [17].

3 Hybrid Manifold Clustering

In this section, we introduce the HMC approach, which is based on six main steps. The idea of HMC is to cluster the data set, embed the clusters into submanifolds, and then combine the separate embeddings on a global map. The embedding task in the submanifolds is easy to parallelize. Algorithm 1 shows the pseudocode of the proposed approach. The steps are explained in the following.

3.1 Clustering

The first step of the hybrid approach is the assignment of all patterns to clusters. Let y_1, ..., y_N be the set of patterns. A clustering algorithm (e.g., k-means) assigns the patterns to clusters M_1, ..., M_k that will be the basis of the submanifolds. The center of each submanifold is defined as

$$\mathbf{m}_j = \frac{1}{|\mathbf{M}_j|} \sum_{\mathbf{y} \in \mathbf{M}_j} \mathbf{y},$$

where |M_j| is the cardinality of the set of patterns belonging to cluster M_j. We employ k-means for clustering, which repeatedly assigns patterns to the closest intermediate center m_j and computes the new centers based on this assignment. Other kinds of clustering algorithms can be applied. The success of the clustering step is important for the assignment to submanifolds, so it is important to choose an algorithm appropriate for the data set. This might be a problem in particular in high-dimensional data spaces, where distances can become less meaningful than in low-dimensional data spaces.
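A minimal sketch of this clustering step, assuming scikit-learn's KMeans; any other clustering algorithm could be substituted, and the helper name is ours.

```python
# Minimal sketch of the clustering step (Sec. 3.1), assuming scikit-learn's
# KMeans; any other clustering algorithm could be substituted.
import numpy as np
from sklearn.cluster import KMeans

def cluster_patterns(Y, k, seed=0):
    """Assign the patterns to k clusters M_1, ..., M_k and return their centers m_j."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Y)
    clusters = [Y[km.labels_ == j] for j in range(k)]       # cluster sets M_j
    centers = np.array([M.mean(axis=0) for M in clusters])  # m_j = mean over M_j
    return km.labels_, clusters, centers
```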

3.2 Global Map Embedding

After the clustering process, the cluster centers m_1, ..., m_k ∈ R^d are embedded in a q-dimensional space R^q with a DR algorithm to learn an appropriate global structure. This results in embeddings m̂_1, ..., m̂_k ∈ R^q. In the experimental section, we will use neighborhood size K = k for LLE and ISOMAP, i.e., we use the number k of clusters as the largest possible value for K. A small number of clusters can be compensated for by sampling more patterns from each cluster for the global map embedding process. The latent positions¹ m̂_1, ..., m̂_k are the centers of the embeddings of the submanifolds that are computed in the next step.

¹ The notation M_j is used for cluster j in data space with center m_j, while M̂_j is the corresponding submanifold in latent space with center m̂_j.
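A sketch of the global map step under the assumptions above; ISOMAP is used here as one possible choice, and PCA or LLE would work analogously. Note that scikit-learn counts neighbors without the query point, so k − 1 is the largest admissible neighborhood size for k centers.

```python
# Sketch of the global map embedding (Sec. 3.2): the k cluster centers
# m_1, ..., m_k are embedded into R^q. ISOMAP is one possible choice; PCA or
# LLE would work analogously. scikit-learn counts neighbors without the query
# point, so k - 1 is the largest admissible neighborhood size for k centers.
from sklearn.manifold import Isomap

def embed_global_map(centers, q=2):
    k = len(centers)
    iso = Isomap(n_neighbors=k - 1, n_components=q)
    return iso.fit_transform(centers)  # latent centers in R^q
```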

Algorithm 1: HMC
Require: data set Y, ξ
1: cluster Y → clusters M_j with centers m_j
2: embed global map → m̂_j
3: embed clusters → M̂_j
4: compute scaling factor ξ
5: map fusion (Eq. 3) → X
6: tuning with (1+1)-ES
7: return X

Fig. 1. Pseudocode of the HMC approach.

3.3 Submanifold Embedding

For each cluster M_j with j = 1, ..., k, its patterns y_l with l = 1, ..., |M_j| are embedded with a DR method, resulting in latent positions M̂_j = {x_1, ..., x_{|M_j|}} with x_i ∈ R^q. As most embedding methods scale at least quadratically or cubically with the number of patterns, the approach is faster than embedding all patterns at the same time, i.e., O((N²/k) log(N/k)) < O(N² log N).
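The per-cluster embedding can be sketched as follows, assuming PCA for every submanifold; in general, each cluster may use its own DR method and parameters, and the k embeddings can be computed in parallel.

```python
# Sketch of the submanifold embedding step (Sec. 3.3), assuming PCA for every
# cluster; in general each submanifold may use its own DR method and
# parameters, and the k embeddings are independent (easy to parallelize).
from sklearn.decomposition import PCA

def embed_submanifolds(clusters, q=2):
    """Embed each cluster M_j separately into R^q."""
    return [PCA(n_components=q).fit_transform(M) for M in clusters]
```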

3.4 Map Fusion

In the next step, the submanifold embeddings are merged on the global map. To avoid overlaps, the maximum extension

$$d_{\max} = \max_{j=1,\ldots,k} \left\{ \|\mathbf{x}_i - \mathbf{x}_l\|_2 \,\big|\, \mathbf{x}_i, \mathbf{x}_l \in \hat{\mathbf{M}}_j \wedge i \neq l \right\} \qquad (1)$$

of all submanifolds is combined with the minimum distance

$$d_{\min} = \min \left\{ \|\hat{\mathbf{m}}_i - \hat{\mathbf{m}}_j\|_2 \,\big|\, i, j = 1,\ldots,k \wedge i \neq j \right\} \qquad (2)$$

between the embeddings on the global map to form the submanifold scaling factor ξ = d_min / d_max. With ξ, the final embedding is determined by placing the submanifolds at the positions of the embedded cluster centers

$$\mathbf{X} = \bigcup_{j=1,\ldots,k} \left\{ \hat{\mathbf{m}}_j + \xi \cdot \mathbf{x}_i \,\big|\, \mathbf{x}_i \in \hat{\mathbf{M}}_j \right\}. \qquad (3)$$

Before the map fusion, the low-dimensional patterns in M̂_j can also be normalized such that each center m̂_j is located at the origin m̂_j = (0, ..., 0)^T ∈ R^q. This HMC framework can be instantiated with various DR methods and separate settings for each submanifold. This will be investigated in more detail in the experimental section, where we concentrate on a comparison of PCA, ISOMAP, and LLE.
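Putting Eqs. (1)–(3) together, the fusion step could look like the following sketch, which also applies the optional centering of each submanifold at the origin; it builds on the hypothetical helpers sketched above.

```python
# Sketch of the map fusion step (Sec. 3.4, Eqs. 1-3): compute xi = d_min/d_max,
# optionally center each submanifold at the origin, and place the scaled
# submanifolds at their latent cluster centers. Builds on the hypothetical
# helpers sketched above.
import numpy as np
from scipy.spatial.distance import pdist

def fuse_maps(latent_centers, latent_clusters):
    # maximum extension d_max over all submanifolds (Eq. 1)
    d_max = max(pdist(X_j).max() for X_j in latent_clusters if len(X_j) > 1)
    # minimum distance d_min between latent cluster centers (Eq. 2)
    d_min = pdist(latent_centers).min()
    xi = d_min / d_max
    # scaled, centered submanifolds placed around their latent centers (Eq. 3)
    fused = [m_hat + xi * (X_j - X_j.mean(axis=0))
             for m_hat, X_j in zip(latent_centers, latent_clusters)]
    return np.vstack(fused), xi
```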

3.5 Evolutionary Tuning

The embedding processes in each submanifold are independent from each other. An adaptation of the submanifolds on the level of the global map by scaling and rotation allows improving the overall DR result. In the following, we propose to improve the submanifolds with an evolutionary post-optimization and tuning process. The objective is to minimize DR-oriented quality measures. For the labeled data sets, we demonstrate the tuning process with the k-nearest neighbor (kNN) classification error on the embeddings, i.e., the fitness function to be minimized is defined as the cross-validation (CV) error

$$E(\mathbf{X}) = \|f(\mathbf{x}_i) - \mathbf{y}_i\|^2 \qquad (4)$$

with i = 1, ..., N, where N is the number of patterns in the CV learning scheme employing the original labels y_i. As this error measure is not differentiable, we employ ES as blackbox optimizer. A solution z = (r_1, ..., r_k, s_1, ..., s_k) is a vector of rotation angles r_i ∈ [0, 360] and scaling factors s_i ∈ R+ for each submanifold M̂_1, ..., M̂_k. Hence, the optimization problem is a continuous 2k-dimensional problem with bound constraints 0 ≤ r_i ≤ 360 and s_i > 0. In the following, we employ a (1+1)-ES with Gaussian mutation z' = z + σ · N(0, 1) and Rechenberg's 1/5th mutation strength control [18]. The fitness E(·) is computed with neighborhood size k = 50 for the kNN classification error. The step sizes start with initial values σ_1^0 for mutation of the rotation angles and σ_2^0 for mutation of the scaling factors. For the Rechenberg mutation strength control, we choose the generation window G = 50 and mutation parameter τ = 1.1. The (1+1)-ES uses a death penalty for the bound constraints, i.e., candidate solutions are discarded if they are infeasible. As termination condition, the ES stops after T = 500 generations.
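A sketch of the tuning loop under these settings is given below. The fitness callable `evaluate` is a hypothetical stand-in that rotates and scales the submanifolds according to z, re-fuses the map, and returns the kNN cross-validation error E(X).

```python
# Sketch of the (1+1)-ES tuning step (Sec. 3.5): Gaussian mutation, Rechenberg's
# 1/5th rule with window G and factor tau, death penalty for the bound
# constraints, and T generations. `evaluate` is a hypothetical callable that
# rotates/scales the submanifolds according to z, fuses the map, and returns
# the kNN cross-validation error E(X).
import numpy as np

def one_plus_one_es(evaluate, k, sigma1=5.0, sigma2=0.001,
                    T=500, G=50, tau=1.1, seed=0):
    rng = np.random.default_rng(seed)
    z = np.concatenate([rng.uniform(0.0, 360.0, k),  # rotation angles r_1..r_k
                        np.ones(k)])                 # scaling factors s_1..s_k
    fit = evaluate(z)
    sigma = np.concatenate([np.full(k, sigma1), np.full(k, sigma2)])
    successes = 0
    for t in range(1, T + 1):
        child = z + sigma * rng.standard_normal(2 * k)
        feasible = (child[:k] >= 0).all() and (child[:k] <= 360).all() \
            and (child[k:] > 0).all()
        if feasible:  # death penalty: infeasible candidates are discarded
            child_fit = evaluate(child)
            if child_fit <= fit:
                z, fit, successes = child, child_fit, successes + 1
        if t % G == 0:  # 1/5th success rule per generation window
            sigma = sigma * tau if successes / G > 0.2 else sigma / tau
            successes = 0
    return z, fit
```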

4 Experimental Analysis

In the following, we evaluate various instantiations of HMC experimentally. Our HMC implementation is written in Python, while the algorithms k-means, PCA, ISOMAP, and LLE are built upon the scikit-learn [19] machine learning framework. We compare the embeddings w.r.t. the kNN classification and regression error E(X), using the latent representations x_1, ..., x_N as training patterns and the original labels y_i of the labeled training data sets. The kNN error gives information about the neighborhood characteristics of the employed data set. Low errors show that the low-dimensional patterns preserve the most important characteristics of their high-dimensional counterparts and are therefore appropriate representations.
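The kNN-based quality measure can be sketched as follows, assuming scikit-learn's cross-validation utilities; the neighborhood size and number of folds are illustrative, and a KNeighborsRegressor with MSE scoring would be used analogously for the regression data sets.

```python
# Sketch of the kNN-based quality measure: the latent positions x_1, ..., x_N
# serve as training patterns with the original labels y_i, and the
# cross-validation error is returned. For the regression data sets a
# KNeighborsRegressor with MSE scoring would be used analogously.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_cv_error(X_latent, labels, n_neighbors=5, folds=5):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=n_neighbors),
                             X_latent, labels, cv=folds)
    return 1.0 - scores.mean()  # kNN classification error E(X)
```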

4.1 Exemplary Embeddings

Figure 2 shows exemplary embeddings of the Digits data set with HMC employing evolutionary tuning. Figure 2(a) shows an embedding of the HMC variant k-means-PCA with the (1+1)-ES on the 3-class Digits data set with k = 50. The submanifolds have been rotated slightly, leading to a situation where similar instances of different digits are located in a closer neighborhood. Figure 2(b) shows the result of a post-optimization with larger initial step sizes, leading to a manifold with varying shape and higher kNN accuracies. Figure 3 shows the embeddings of the Iris and the Photos data sets with HMC and evolutionary tuning. Figure 3(a) shows the result of HMC and evolutionary post-optimization on the Iris data set. Figure 3(b) shows an embedding of Photos with HMC employing PCA. The three submanifolds are highlighted.

(a) HMC on Digits, low σ_{1,2}^0     (b) HMC on Digits, high σ_{1,2}^0

Fig. 2. Embeddings of Digits data with the HMC variant k-means-PCA and (1+1)-ES tuning. The colors correspond to the labels of the patterns, which are not used in the primary embedding process, but only for the tuning optimization with the (1+1)-ES.

Table 1 compares the kNN error of HMC with and without evolutionary tuning. For both Digits data sets, we choose σ_1^0 = 50.0 and σ_2^0 = 0.001. On the Iris data set, the setting σ_1^0 = 5.0 and σ_2^0 = 0.001 is chosen. The table shows that the (1+1)-ES improves the submanifold embeddings in all cases.

Table 1. Comparison of evolutionary tuning by rotating and scaling of submanifolds on Digits, Iris, and Gauss.

            Digits 1 (5)   Digits 2 (10)   Iris          Gauss
HMC         0.109          0.140           0.100         0.383
(1+1)-EA    0.056±0.05     0.131±0.03      0.067±0.06    0.375±0.02

(a) HMC on Iris     (b) HMC on Photos (submanifolds 1, 2, and 3 highlighted)

Fig. 3. Embedding of Iris with k-means-PCA HMC and embedding of the Photos data set with LLE HMC and (1+1)-ES tuning.

4.2 Benchmark Problems

Table 2 compares the classification and regression errors between native DR methods for preprocessing and HMC on the benchmark test set, see Appendix A. The target dimensionality is q = 2. For the classification data sets the kNN classification error is reported, for the regression data sets the MSE. The native embedding methods are compared to the HMC variants, which employ the corresponding DR method both for the global map and for the submanifolds. The evolutionary fine-tuning process runs with a (1+1)-EA for 500 generations. The evolutionary experiments are repeated 50 times, and the mean with standard deviation is shown.

Table 2. Comparison of basic DR for preprocessing to HMC on the benchmark problems w.r.t. the kNN classification measure.

             PCA                       ISOMAP                    LLE
problem      native   HMC             native   HMC              native   HMC
MakeClass    0.296    0.211 ± 0.009   0.296    0.192 ± 0.003    0.297    0.224 ± 0.004
Digits       0.330    0.093 ± 0.012   0.330    0.094 ± 0.015    0.316    0.097 ± 0.014
Faces        0.542    0.616 ± 0.001   0.542    0.616 ± 0.014    0.543    0.630 ± 0.003
Blobs        0.048    0.063 ± 0.009   0.048    0.059 ± 0.010    0.047    0.060 ± 0.013
Friedman 1   20.09    23.67 ± 0.623   20.09    24.35 ± 0.613    20.40    21.64 ± 1.370
Friedman 2   20.48    22.12 ± 0.764   20.48    23.73 ± 0.213    20.66    23.05 ± 1.309
Wind         11.50    7.646 ± 0.599   11.50    7.271 ± 0.714    12.12    8.283 ± 0.606
Fitness      0.291    0.970 ± 0.507   0.291    1.106 ± 0.382    0.253    0.982 ± 0.663

The results on the MakeClass data set show that each native variant is significantly outperformed by the HMC approaches. The lowest mean error has been achieved by the ISOMAP-HMC variant. Also on Digits, the HMC variants outperform the native algorithms. Embedding in ten submanifolds results in a clear separation between the different classes and achieves significantly lower kNN classification errors. While the standard deviation is moderate in all cases, the PCA-HMC variant is on average the best model. On the Faces data set, all native and all HMC variants achieve only a comparatively low accuracy. Here, the native variants even outperform the HMC variants. On Blobs, the mean results of HMC are slightly worse than those of the native variants. Also on the Friedman regression problems, the native DR methods perform better in terms of the kNN error than the HMC approaches. On the Wind data set, HMC is superior, while it is again outperformed on the Fitness regression data set.

4.3 Runtime

Last, we analyze the runtime of HMC. With growing data set sizes, the runtime increases. Figure 4(a) compares the runtime of ISOMAP and HMC on the MakeClass data set w.r.t. an increasing number of patterns, i.e., from N = 500 to N = 5500. HMC optimizes for T = 100 generations and employs k = 2 classes. Each experimental setting is repeated 30 times with newly generated random data sets. The curves show the mean runtimes as well as the minimum and maximum values. ISOMAP is slightly faster at the beginning, but is then overtaken by HMC. HMC is slower at the beginning because of the evolutionary tuning process. From N = 1500 on, ISOMAP is outperformed because of its quadratic runtime complexity², while HMC works on data sets of size N/k. Interestingly, the runtime variance is comparatively low.
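The native part of this runtime comparison can be reproduced in spirit with a simple timing loop such as the following sketch; the data set parameters are illustrative, and timing the HMC pipeline would proceed analogously with the helpers sketched in Section 3.

```python
# Sketch of the runtime measurement for the native baseline: time ISOMAP on
# randomly generated MakeClass data of growing size. Parameter values are
# illustrative; timing the HMC pipeline would proceed analogously with the
# helpers sketched in Section 3.
import time
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap

def time_native_isomap(sizes=(500, 1500, 3000, 5500), d=20, seed=0):
    for N in sizes:
        Y, _ = make_classification(n_samples=N, n_features=d, random_state=seed)
        start = time.perf_counter()
        Isomap(n_neighbors=10, n_components=2).fit_transform(Y)
        print(N, round(time.perf_counter() - start, 2))
```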

(a) MakeClass     (b) Blobs     (runtime over data set size)

Fig. 4. Runtime comparison of native ISOMAP and HMC based on ISOMAP on the (a) MakeClass and (b) Blobs data sets.

The corresponding results for ISOMAP on the Blobs data set are shown in Figure 4(b), which draws a similar picture. ISOMAP is only slightly faster for small data sets, i.e., up to N = 1500. However, the choice of method involves a trade-off: the gain in runtime might be a convincing argument for choosing HMC, although the accuracy in classification scenarios can be slightly worse. For visualization purposes, HMC is a reasonable choice if submanifolds are present.

² The runtime of ISOMAP is O(N² log N).

5 Conclusions

The hybrid approach of clustering and subsequently embedding the patterns into submanifolds introduced in this paper can be instantiated with numerous clustering and DR algorithms. The embedding of cluster centers on a global map as well as the scaling of the submanifolds before merging them at the global centers leads to improved DR results in comparison to standard methods. We analyzed different combinations of DR methods for learning the global map and for embedding into submanifolds. The most important result is that the kNN classification and regression error based on submanifold embeddings is often lower than the error of the standard methods. This indicates that HMC embeddings maintain more information about the data distribution in the submanifold latent spaces than classical approaches. In the future, we will analyze separate dimensionalities q_j for each manifold M̂_j, optimized w.r.t. different DR measures.

References

1. Jolliffe, I.: Principal Component Analysis. Springer Series in Statistics. Springer, New York (1986)
2. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000) 2319–2323
3. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290 (2000) 2323–2326
4. Beyer, H.G., Schwefel, H.P.: Evolution strategies – A comprehensive introduction. Natural Computing 1 (2002) 3–52
5. Vidal, R.: Subspace clustering. IEEE Signal Processing Magazine 28 (2011) 52–68
6. Costeira, J.P., Kanade, T.: A multibody factorization method for independently moving objects. International Journal of Computer Vision 29 (1998) 159–179
7. Gear, C.W.: Multibody grouping from motion images. International Journal of Computer Vision 29 (1998) 133–150
8. Vidal, R., Ma, Y., Sastry, S.: Generalized principal component analysis (GPCA). IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1945–1959
9. Kushnir, D., Galun, M., Brandt, A.: Fast multiscale clustering and manifold identification. Pattern Recognition 39 (2006) 1876–1891
10. Bradley, P.S., Mangasarian, O.L.: k-plane clustering. Journal of Global Optimization 16 (2000) 23–32
11. Tseng, P.: Nearest q-flat to m points. Journal of Optimization Theory and Applications 105 (2000) 249–252
12. Kramer, O.: Fast submanifold learning with unsupervised nearest neighbors. In: ICANNGA (2013) 317–325
13. Kramer, O.: Dimensionality reduction by unsupervised nearest neighbor regression. In: International Conference on Machine Learning and Applications (ICMLA), IEEE (2011) 275–278
14. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analysers. Neural Computation 11 (1999) 443–482
15. Nourashrafeddin, S., Arnold, D., Milios, E.E.: An evolutionary subspace clustering algorithm for high-dimensional data. In: Proceedings of the Annual Conference on Genetic and Evolutionary Computation (GECCO) (2012) 1497–1498
16. Vahdat, A., Heywood, M.I., Zincir-Heywood, A.N.: Bottom-up evolutionary subspace clustering. In: IEEE Congress on Evolutionary Computation (2010) 1–8
17. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17 (2007) 1–24
18. Rechenberg, I.: Cybernetic solution path of an experimental problem. Royal Aircraft Establishment, Ministry of Aviation, UK (1965)
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011) 2825–2830
20. Hull, J.: A database for handwritten text recognition research. IEEE PAMI 5 (1994) 550–554
21. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst (2007)
22. Friedman, J.H.: Multivariate adaptive regression splines. The Annals of Statistics 19 (1991) 1–67

A Benchmark Problems

The experimental analysis is based on the following benchmark problems. MakeClass is a classification data set generated with the scikit-learn [19] method make_classification with d dimensions and two centers. The parameter ∆ determines the ratio of informative features; the remaining features are redundant. The UCI Digits data set [20] comprises handwritten digits with d = 64. It is a frequent reference problem related to the recognition of handwritten characters and digits. The Faces data set is called Labeled Faces in the Wild [21] and has been introduced for studying the face recognition problem. The data set source is http://vis-www.cs.umass.edu/lfw/. It contains JPEG images of famous people collected from the internet. The faces are labeled with the name of the person pictured. The Gaussian Blobs data set is generated with the scikit-learn [19] method make_blobs with the following settings: two centers, i.e., two classes, are generated, each with a standard deviation of σ = 10.0 and variable d. Friedman 1 is a regression data set generated with the scikit-learn [19] method make_friedman1. The regression problem has been introduced in [22], where Friedman introduces multivariate adaptive regression splines. Friedman 2 is also a regression data set of scikit-learn [19] and can be generated with make_friedman2. The Wind data set is based on spatio-temporal time series data from the National Renewable Energy Laboratory (NREL) western wind data set. The whole data set comprises time series of 32,043 wind grid points, each holding ten 3 MW turbines, over a timespan of three years in a 10-minute resolution. The dimensionality is d = 22. Fitness is a data set based on an optimization run of a (15+100)-ES [4] on the Sphere function f(z) = z^T z with d = 20 dimensions and 21,000 fitness function evaluations. The patterns are the objective variable values of the best candidate solution in each generation; the labels are the corresponding fitness function values. The data set Photos contains thirty JPEG photos with resolution 320 × 214 taken with a SONY DSLR-A300. The photos show landscapes taken in the San Francisco bay area and the Dominican Republic. The Iris data set consists of 150 four-dimensional patterns of three different types of irises.
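Most of these data sets can be generated or loaded directly with scikit-learn; the following sketch shows plausible calls, where all parameter values are illustrative assumptions rather than the exact experimental settings.

```python
# Sketch of how the scikit-learn based benchmark data sets can be generated or
# loaded; all parameter values are illustrative assumptions, not the exact
# experimental settings.
from sklearn.datasets import (make_classification, make_blobs, make_friedman1,
                              make_friedman2, load_digits, load_iris,
                              fetch_lfw_people)

Y_class, y_class = make_classification(n_samples=1000, n_features=20,
                                       n_informative=10, n_classes=2)
Y_blobs, y_blobs = make_blobs(n_samples=1000, centers=2, cluster_std=10.0)
Y_f1, y_f1 = make_friedman1(n_samples=1000)
Y_f2, y_f2 = make_friedman2(n_samples=1000)
digits = load_digits()       # handwritten digits, d = 64
iris = load_iris()           # 150 patterns, d = 4, three classes
faces = fetch_lfw_people()   # Labeled Faces in the Wild (downloads on first use)
```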