A Stochastic Optimization Approach for Unsupervised Kernel Regression

Oliver Kramer
Institute of Structural Mechanics
Bauhaus-University Weimar
[email protected]

Fabian Gieseke
Institute of Structural Mechanics
Bauhaus-University Weimar
[email protected]

Abstract—Unsupervised kernel regression (UKR), the unsupervised counterpart of the Nadaraya-Watson estimator, is a dimension reduction technique for learning low-dimensional manifolds. It is based on optimizing representative low-dimensional latent variables with regard to the data space reconstruction error. Scaling initial local linear embedding solutions and optimizing in latent space is a continuous multi-modal optimization problem. In this paper we employ evolutionary approaches for the optimization of the UKR model: based on local linear embedding solutions, stochastic search techniques are used to optimize the latent variables. An experimental study compares the covariance matrix adaptation evolution strategy to an iterated local search evolution strategy.

I. INTRODUCTION

Unsupervised kernel regression (UKR) is the unsupervised counterpart of the Nadaraya-Watson estimator. It is based on optimizing latent variables w.r.t. the data space reconstruction error. The evolved latent variables define a UKR solution; projecting back into data space yields the UKR manifold. The optimization problem, however, has multiple local optima. In the following, we employ evolutionary optimization strategies, i.e., the CMA-ES and the Powell evolution strategy (Powell-ES) [11], to optimize UKR manifolds. Furthermore, we analyze the influence of local linear embedding (LLE) [18] as initialization procedure.

This paper is organized as follows. In Section II we give a brief overview of manifold learning techniques. In Section III we introduce the UKR problem by Meinicke et al. [13], [12]; in this section we also briefly present the standard UKR optimization approach that we are going to replace by an evolutionary framework. Section IV introduces the components of the evolutionary UKR approach. An experimental study in Section V gives insights into the evolutionary optimization process. Important results are summarized in Section VI.

II. RELATED WORK

High-dimensional data is usually difficult to interpret and visualize. But many datasets show correlations between their variables. For high-dimensional data, a low-dimensional simplified manifold can often be found that represents characteristics of the original data. The assumption is that the structurally relevant data lies on an embedded manifold of lower dimension. In the following, we give an overview of well-known manifold learning techniques, pointing out that the overview can only be a subjective depiction of a wide field of methods.

One of the most famous and widely used dimension reduction methods is principal component analysis (PCA), which assumes linearity of the manifold. Pearson [16] provided early work in this field: he fitted lines and planes to a given set of points. The standard PCA as most frequently known today can be found in the depiction of Jolliffe [8]. PCA computes the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the data samples. An approach for learning nonlinear manifolds is kernel PCA [19], which projects the data into a Hilbert space, similarly to the SVM and SVR principle. Hastie and Stuetzle [6] introduced principal curves, which are self-consistent smooth curves that pass through the middle of data clouds. Self-consistency means that the principal curve is the average of the data projected on it. Bad initializations can lead to bad local optima in the previous approaches. A solution to this problem is k-segments by Verbeek, Vlassis, and Kröse [21], which alternates fitting of unconnected local principal axes and connecting the segments to form a polygonal line. Self-organizing feature maps [10] proposed by Kohonen learn a topological mapping from data space to a map of neurons, i.e., they perform a mapping to discrete values based on neural (codebook) vectors in data space. During the training phase the neural vectors are pulled into the direction of the training data. Generative topographic mapping (GTM) by Bishop, Svensén, and Williams [3], [2] is similar to self-organizing feature maps, but assumes that the observed data has been generated by a parametric model, e.g., a Gaussian mixture model. It can be seen as a constrained mixture of Gaussians, while the SOM can be viewed as a constrained vector quantizer. Multi-dimensional scaling is a further class of dimension reduction methods and is based on the pointwise embedding of the dataset, i.e., for each high-dimensional point yi a low-dimensional point xi is found for which similarity or dissimilarity conditions hold. For example, the pairwise (Euclidean) distances of two low-dimensional points shall be consistent with their high-dimensional counterparts. Another famous dimension reduction method based on multi-dimensional scaling is Isomap, introduced by Tenenbaum, de Silva, and Langford [20]. It is based on three steps: first, a neighborhood graph of Y is computed; second, the shortest distances between its nodes yi and yj are computed (e.g., with Dijkstra's algorithm); third, multi-dimensional scaling is applied based on the distances along the neighborhood graph, which represent curvilinear rather than Euclidean distances. The optimal embedding can be computed by the solution of an eigendecomposition.

III. UNSUPERVISED KERNEL REGRESSION

In this section we introduce the UKR approach, regularization techniques, and the Klanke-Ritter optimization approach [9].

A. Kernel Functions

UKR is based on kernel density functions K : R^q → R. A typical kernel function is the (multivariate) Gaussian kernel

K_G(z) = \frac{1}{(2\pi)^{q/2} \det(H)} \exp\left( -\frac{1}{2} \left\| H^{-1} z \right\|^2 \right),   (1)

with bandwidth matrix H = diag(h_1, h_2, ..., h_q). Figure 1 illustrates that the UKR result significantly depends on the choice of an appropriate bandwidth. For a random data cloud the kernel density estimate is visualized; a too small bandwidth results in an overfitted model (left).

Fig. 1. Comparison of kernel density estimates with two bandwidths of a data cloud (left: bandwidth 0.01, right: bandwidth 0.5).
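A minimal NumPy sketch of the Gaussian kernel of Equation (1) with a diagonal bandwidth matrix could look as follows; the function and argument names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(z, h):
    """Multivariate Gaussian kernel of Eq. (1) with diagonal bandwidth H = diag(h).

    z : difference vector of shape (q,), e.g. x - x_i
    h : per-dimension bandwidths h_1, ..., h_q of shape (q,)
    """
    z = np.asarray(z, dtype=float)
    h = np.asarray(h, dtype=float)
    q = z.size
    norm = (2.0 * np.pi) ** (q / 2.0) * np.prod(h)  # det(H) of a diagonal H is prod(h)
    return np.exp(-0.5 * np.sum((z / h) ** 2)) / norm
```

Small bandwidths put a narrow bump on every sample and therefore overfit, as in the left panel of Figure 1; large bandwidths oversmooth.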

B. UKR Formulation

UKR has been introduced by Meinicke et al. [13], [12] within a general regression framework for the reversed problem of learning manifolds. The task of UKR is to find the input parameters of the regression function, which are the latent variables of the low-dimensional manifold. UKR reverses the Nadaraya-Watson estimator [14], [22], i.e., the latent variables X = (x_1, ..., x_N) ∈ R^{q×N} become parameters of the system; in particular, X becomes the lower-dimensional representation of the observed data Y = (y_1, ..., y_N) ∈ R^{d×N}. The UKR regression function can be written as follows:

f(x; X) = \sum_{i=1}^{N} y_i \frac{K(x - x_i)}{\sum_{j=1}^{N} K(x - x_j)},   (2)

which is the reversed Nadaraya-Watson estimator. The free parameters X define the manifold, i.e., the low-dimensional representation of the data Y. Parameter x is the location where the function is evaluated, and is based on the entries of X. For convenience, Klanke et al. [9] introduced a vector b(·) ∈ R^N of basis functions that define the ratios of the kernels:

b_i(x; X) = \frac{K(x - x_i)}{\sum_{j=1}^{N} K(x - x_j)}.   (3)

Each component i of the vector b(·) contains the relative kernel density of point x w.r.t. the i-th point of matrix X. Equation (2) can also be written in terms of these basis functions:

f(x; X) = \sum_{i=1}^{N} y_i b_i(x; X) = Y b(x; X).   (4)

The matrix Y of observed data is fixed, and the basis functions are tuned during the learning process. The basis functions b_i sum to one as they are normalized by the denominator. The quality of the principal manifold learning is evaluated with the data space reconstruction error, i.e., the Euclidean distance between the training data and its reconstruction,

R(X) = \frac{1}{N} \| Y - Y B(X) \|_F^2,   (5)

using the Frobenius norm and the matrix of basis functions B(X). The Frobenius norm of a matrix A is defined as follows:

\| A \|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 }.   (6)

To summarize, b_i ∈ R is a relative kernel density, b ∈ R^N is a vector of basis functions, and B ∈ R^{N×N} is a matrix whose columns consist of the vectors of basis functions,

B(X) = (b(x_1; X), ..., b(x_N; X)).   (7)

Hence, the product of Y ∈ R^{d×N} and B ∈ R^{N×N} (which is the Nadaraya-Watson estimator) results in a d × N matrix.
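The basis functions of Equation (3), the matrix B(X) of Equation (7), and the reconstruction error of Equation (5) translate into a few lines of NumPy; note that the normalization constant of the Gaussian kernel cancels in the ratio of Equation (3). This is an illustrative sketch with assumed names, not the authors' implementation.

```python
import numpy as np

def basis_matrix(X, h):
    """B(X) of Eq. (7): column j holds b(x_j; X), the normalized kernel ratios of Eq. (3).

    X : latent variables of shape (q, N)
    h : per-dimension Gaussian bandwidths of shape (q,)
    """
    diff = X[:, :, None] - X[:, None, :]                      # diff[:, i, j] = x_i - x_j
    K = np.exp(-0.5 * np.sum((diff / h[:, None, None]) ** 2, axis=0))
    return K / K.sum(axis=0, keepdims=True)                   # normalize every column

def reconstruction_error(X, Y, h):
    """Data space reconstruction error R(X) of Eq. (5); Y has shape (d, N)."""
    B = basis_matrix(X, h)
    N = Y.shape[1]
    return np.sum((Y - Y @ B) ** 2) / N                       # squared Frobenius norm / N
```

Column j of Y @ B is exactly f(x_j; X) of Equation (4), i.e., the reconstruction of y_j.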

Leave-one-out cross-validation (LOO-CV) can easily be employed by setting the diagonal entries of B to zero, normalizing the columns, and then applying Equation (5). Instead of applying LOO-CV, a UKR model can be regularized via penalizing extension in latent space (corresponding to penalizing small bandwidths in kernel regression) [9]:

R(X) = \frac{1}{N} \| Y - Y B(X) \|_F^2 + \lambda \| X \|_F^2.   (8)

The regularized approach will be used in the comparison between the CMA-ES and the Powell-ES in Section V-C.

C. Klanke-Ritter Optimization Scheme

Klanke and Ritter [9] have introduced an optimization scheme consisting of various steps. It uses PCA [8] and multiple LLE [18] solutions for initialization, see Section III-D. In particular, the optimization scheme consists of the following steps:
1) initialization of n + 1 candidate solutions: n solutions from LLE, one solution from PCA,
2) selection of the best initial solution w.r.t. CV-error,
3) search for optimal scale factors that scale the best LLE solution to a UKR solution w.r.t. CV-error,
4) selection of the most promising solution w.r.t. CV-error,
5) CV-error minimization:
• if the best solution stems from PCA: search for optimal regularization parameters η with the homotopy method,
• if the best solution stems from LLE: CV-error minimization with the homotopy method / resilient backpropagation (RPROP) [17],
6) final density threshold selection.
For a detailed and formal description of the steps, we refer to the depiction of Klanke and Ritter [9]. They discuss that spectral methods do not always yield an initial set of latent variables that is close to a sufficiently deep local minimum. We will employ evolution strategies to solve the global optimization problem, but also use LLE for initial solutions.
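The LOO-CV error with the latent-space penalty of Equation (8) can be sketched as follows: the diagonal of the kernel matrix is zeroed and the columns are renormalized before Equation (5) is applied. Names and the Gaussian kernel choice are illustrative assumptions.

```python
import numpy as np

def cv_error(X, Y, h, lam=0.0):
    """LOO-CV reconstruction error with the penalty term of Eq. (8).

    X : latent variables (q, N), Y : data (d, N), h : bandwidths (q,), lam : penalty weight.
    """
    N = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    K = np.exp(-0.5 * np.sum((diff / h[:, None, None]) ** 2, axis=0))
    np.fill_diagonal(K, 0.0)                                   # leave one out
    B = K / np.maximum(K.sum(axis=0, keepdims=True), 1e-12)    # renormalize the columns
    return np.sum((Y - Y @ B) ** 2) / N + lam * np.sum(X ** 2)
```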

D. Local Linear Embedding

For non-linear manifolds, LLE by Roweis and Saul [18] is often employed. LLE assumes local linearity of manifolds. It works as follows for mapping high-dimensional data points y ∈ Y to low-dimensional embedding vectors x ∈ X. LLE computes a neighborhood graph like the other nonlinear spectral embedding methods, see Section II. Then it computes the weights w_ij that best reconstruct each data point y_i from its neighbors, minimizing the cost function

R(w) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j} w_{ij} y_j \Big\|^2.   (9)

The resulting weights capture the geometric structure of the data, as they are invariant under rotation, scaling, and translation of the data vectors. Then, LLE computes the embedding vectors x_i best reconstructed by the weights w_ij, minimizing

R(X) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{N} w_{ij} x_j \Big\|^2.   (10)

For a detailed introduction to LLE we refer to [18], and to Chang and Yeung [4] for a variant robust against outliers. A free parameter of LLE is the number of local models, which can reach from 1 to N. LLE is employed as initialization routine: the best LLE solution (w.r.t. the CV-error of the UKR manifold) is used as basis for the subsequent stochastic CV-error minimization.
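The LLE initialization can, for instance, be obtained with scikit-learn's LocallyLinearEmbedding; the sketch below computes one candidate embedding per neighborhood size, and the candidate with the lowest UKR CV-error is then selected. The function and variable names are illustrative.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

def lle_candidates(Y, q, neighbor_sizes):
    """Candidate LLE embeddings of the data Y (shape (d, N)) as UKR initializations.

    Returns a list of latent matrices X of shape (q, N), one per neighborhood size k.
    """
    candidates = []
    for k in neighbor_sizes:
        lle = LocallyLinearEmbedding(n_neighbors=k, n_components=q)
        X = lle.fit_transform(Y.T).T        # scikit-learn expects samples in rows
        candidates.append(X)
    return candidates
```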

IV. EVOLUTIONARY UNSUPERVISED KERNEL REGRESSION

A. Evolutionary Optimization Scheme

We employ the CMA-ES and the Powell-ES to solve two steps of the UKR optimization framework. Our aim is to replace the rather complicated optimization scheme we briefly summarized in Section III-C. It consists of the following steps:
1) initialization of n candidate LLE solutions,
2) selection of the best initial solution X̂*_init w.r.t. CV-error,
3) search for optimal scale factors s,

s^* = \arg\min_s R_{CV}(\mathrm{diag}(s) \hat{X}^*_{init}),   (11)

with CMA-ES/Powell-ES, and
4) CV-error minimization with CMA-ES/Powell-ES.

We employ Huber's loss function [7], see Section IV-B, for the following experiments. The subsequent sections describe an experimental analysis of the evolutionary UKR approach. A side effect of the use of an evolutionary scheme is that arbitrary, also non-differentiable, kernel functions can be employed. LOO-CV can easily be implemented by setting the diagonal entries of B to zero, normalizing the columns, and then applying Equation (5). In the following, we compare two optimization approaches for optimizing the UKR model: (1) Powell's conjugate gradient ES [11], and (2) the CMA-ES [15].
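Steps 3 and 4 of this scheme, scaling according to Equation (11) followed by CV-error minimization over all latent variables, could be sketched with the pycma package as follows. The cv_error callable and all names are assumptions for illustration (e.g., the function sketched in Section III), and the CMA-ES scale step assumes q ≥ 2.

```python
import numpy as np
import cma  # pycma package

def optimize_ukr(cv_error, X_init, budget=4000):
    """Two-stage evolutionary UKR optimization (Section IV-A).

    cv_error : callable mapping a latent matrix X of shape (q, N) to the (regularized) CV-error
    X_init   : best LLE solution of shape (q, N)
    """
    q, N = X_init.shape

    # step 3: optimize the scale factors s of Eq. (11)
    def scale_obj(s):
        return cv_error(np.diag(s) @ X_init)
    res = cma.fmin(scale_obj, np.ones(q), 0.5,
                   options={'maxfevals': budget // 2, 'verbose': -9})
    X = np.diag(res[0]) @ X_init

    # step 4: minimize the CV-error over the latent variables themselves
    def latent_obj(x):
        return cv_error(x.reshape(q, N))
    res = cma.fmin(latent_obj, X.ravel(), 0.1,
                   options={'maxfevals': budget // 2, 'verbose': -9})
    return res[0].reshape(q, N)
```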

B. Huber's Loss

In regression, typically different loss functions are used that weight the residuals. In the best case the loss function is chosen according to the needs of the underlying data mining model. With the design of a loss function, the emphasis of outliers can be controlled. Let L : R^d × R^d → R be the loss function. In the univariate case d = 1 a loss function is defined as L = \sum_{i=1}^N L(y_i, f(x_i)). The L1 loss is defined as

L_1 = \sum_{i=1}^N \| y_i - f(x_i) \|,   (12)

and the L2 loss is defined as

L_2 = \sum_{i=1}^N (y_i - f(x_i))^2.   (13)

Huber's loss [7] is a differentiable alternative to the L1 loss and makes use of a trade-off point δ between the L1 and the L2 characteristic, i.e.,

L_H = \sum_{i=1}^N L_h(y_i - f(x_i)),   (14)

with

L_h(r) = \begin{cases} \frac{1}{2\delta} r^2, & |r| < \delta \\ |r| - \frac{\delta}{2}, & |r| \geq \delta. \end{cases}   (15)

Parameter δ allows a problem-specific adjustment to certain problem characteristics. In the experimental part we use the setting δ = 0.01.
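A direct NumPy sketch of Equations (14) and (15), with the experimental setting δ = 0.01 as default; the function name is illustrative.

```python
import numpy as np

def huber_loss(residuals, delta=0.01):
    """Huber's loss of Eqs. (14)-(15): quadratic for |r| < delta, linear beyond.

    residuals : array of scalar residuals r_i = y_i - f(x_i) (or residual norms for d > 1).
    """
    r = np.abs(np.asarray(residuals, dtype=float))
    quadratic = r ** 2 / (2.0 * delta)      # (1 / (2 delta)) r^2 for |r| < delta
    linear = r - delta / 2.0                # |r| - delta / 2 for |r| >= delta
    return np.sum(np.where(r < delta, quadratic, linear))
```

The two branches meet at |r| = δ with the common value δ/2 and matching slopes, so the loss is continuous and differentiable there.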

V. EXPERIMENTAL STUDY

A. Datasets and Fitness Measure

The experimental analysis is based on the following datasets:
• 2D-S: noisy "S", 100 data points, d = 2, noise magnitude σ = 0.1, and 1,000 test points without noise,
• 3D-S: noisy "S", 100 data points, d = 3, noise magnitude σ = 0.1, and 1,000 test points without noise,
• digits: 100 samples with d = 256 (16 × 16 greyscale values) of the digit 7 from the digits dataset [5], and 250 test samples.

The test error is computed by the projection of 1,000 test points uniformly generated in latent space and mapped to data space. The test error is the sum of distances between each test point x_t ∈ T and the closest projection, plus the sum of distances between each projected point x_p ∈ P and the closest test point:

R_t = \sum_{x_t \in T} \min_{x_p \in P} \| x_p - x_t \| + \sum_{x_p \in P} \min_{x_t \in T} \| x_p - x_t \|.   (16)

For training (validation) and test measurement of residuals Huber’s loss function is employed, see Section IV-B.
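The test error of Equation (16) amounts to two nearest-neighbor queries, which can be sketched, e.g., with SciPy's cKDTree; this sketch uses plain Euclidean distances as written in Equation (16), and the names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def test_error(P, T):
    """Test error R_t of Eq. (16) between projected points P and test points T.

    P, T : arrays of shape (n_p, d) and (n_t, d), one point per row.
    """
    d_t_to_p, _ = cKDTree(P).query(T)   # min over x_p of ||x_p - x_t|| for every test point
    d_p_to_t, _ = cKDTree(T).query(P)   # min over x_t of ||x_p - x_t|| for every projection
    return d_t_to_p.sum() + d_p_to_t.sum()
```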

TABLE I
EXPERIMENTAL ANALYSIS OF OPTIMIZATION STEPS INVESTED INTO SCALING OF THE LLE SOLUTIONS AND CV-ERROR MINIMIZATION (MINIMAL VALUES).

balance      0/4       1/3       2/2       3/1       4/0
LLE       7.23733   7.23733   7.23733   7.23733   7.23733
scaling   7.23733   3.02818   2.19235   2.10442   1.89369
CV        1.80625   1.95179   2.04789   2.07737   1.89369
test      0.00276   0.00225   0.00195   0.00313   0.00329

B. LLE Initialization and Scaling Balance

Initialization with LLE is part of the optimization scheme introduced by Klanke and Ritter [9]. The question arises how much optimization effort should be invested into the scaling in comparison to the CV-error minimization process. In the following, we analyze the balance of optimization effort for both optimization steps with a budget of 4,000 optimization steps. Table I shows the UKR CV-errors (1) of the best LLE model, (2) after scaling optimization with the CMA-ES, (3) after the final CV-error minimization, and (4) the error on the test set. We test five optimization balances; the first number indicates the number of steps for the LLE scaling optimization, the second number states the number of steps for the latent variable-based optimization. We test the following combinations: (0/4) meaning no scaling and 4,000 generations of final CV minimization, (1/3) meaning 1,000 scaling and 3,000 CV, (2/2) meaning 2,000 scaling and 2,000 CV, (3/1) meaning 3,000 scaling and 1,000 CV, and finally (4/0) meaning 4,000 scaling and no final CV optimization. The values shown in Table I present the best test error of 100 runs. The results show that the (2/2) variant achieves the lowest test error, while the (0/4) variant achieves the lowest training error.


C. CMA-ES and Powell-ES

In the following, we compare two optimization algorithms, i.e., the CMA-ES and the Powell-ES (also known as Powell-ILS, see [11]), as optimization approaches for the UKR learning problem. The Powell-ES is based on Powell's fast local search. To overcome local optima it makes use of a (µ + λ)-ES [11]. Each objective variable x ∈ R is mutated using Gaussian mutation [1] with a step-size (variance σ), and a Powell search is conducted until a local optimum is found. The step-sizes of the ES are adapted as follows: if in successive iterations the same local optimum is found, the step-sizes are increased to overcome local optima. In turn, if different local optima are found, the step-sizes are decreased. Table II compares the two optimizers on three artificial datasets, using the penalized UKR variant, see Equation (8) with λ = 0.1. The values show the test error, i.e., the distance between the original data and the projections of 1,000 samples after 4,000 fitness function evaluations, i.e., 2,000 steps of scale optimization and 2,000 steps of CV-error minimization. The experiments show that the Powell-ES achieves a lower training error in each of the experiments. But only on the 2D-S problem is this reflected in a lower test error. This means that on 3D-S and digits overfitting effects occurred that could not be prevented with the penalty regularization approach. A deeper analysis of the balancing parameter λ will be necessary.

TABLE II
EXPERIMENTAL COMPARISON OF CMA-ES AND POWELL-ES ON THE THREE DATASETS 2D-S, 3D-S, AND digits.

                   CMA-ES                        Powell-ES
data       train      test      max       train      test      max
2D-S      9.4408    0.0303   0.3845      8.7854    0.0205   0.2241
3D-S     17.7041    0.1351   1.0351     15.8019    0.2296   1.4088
digits   51.9710    0.0147   0.3991     48.3865    0.0168   0.5242
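The step-size adaptation of the Powell-ES described above can be illustrated with a simplified (1+1)-style iterated local search around SciPy's Powell minimizer; this is only a sketch under these assumptions, not the implementation of [11].

```python
import numpy as np
from scipy.optimize import minimize

def powell_ils(f, x0, sigma=1.0, restarts=20, tol=1e-8, rng=None):
    """Iterated local search with Powell's method and the step-size rule of Section V-C."""
    rng = np.random.default_rng() if rng is None else rng
    best = minimize(f, x0, method='Powell').x
    for _ in range(restarts):
        start = best + sigma * rng.standard_normal(best.size)  # Gaussian mutation
        local = minimize(f, start, method='Powell').x
        if np.linalg.norm(local - best) < tol:
            sigma *= 2.0    # same local optimum found again: increase the step-size
        else:
            sigma *= 0.5    # a different local optimum: decrease the step-size
            if f(local) < f(best):
                best = local
    return best
```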


Fig. 2. Evolutionary UKR on noisy S dataset. The figures show the results of various optimization efforts spent on scaling and final CV-error minimization, i.e., 0/4, 2/2, 3/1, and a regularized approach with λ = 0.25, see Equation (8).

The projection into data space is used for visualization. Figure 2 gives a visual impression of the evolved UKR manifolds. The original data is shown as (blue) dots; the manifold is drawn as a (red) line. With the help of the test set, we compute a test error, see Equation (16).

VI. CONCLUSIONS

The multi-modal optimization problem of UKR can be solved with the CMA-ES or the Powell-ES, leading to an easier optimization framework that is also capable of handling non-differentiable loss and kernel functions. However, initial LLE solutions still improve the optimization process. Overfitting effects might occur that have to be avoided by improved regularization approaches. For this sake, further research has to be invested into the parameter λ, e.g., by employing grid search.

In the future we plan to balance regularization with multi-objective optimization techniques.

REFERENCES

[1] H.-G. Beyer and H.-P. Schwefel. Evolution strategies - A comprehensive introduction. Natural Computing, 1:3-52, 2002.
[2] C. M. Bishop, M. Svensén, and C. K. I. Williams. Developments of the generative topographic mapping. Neurocomputing, 21(1-3):203-224, 1998.
[3] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.
[4] H. Chang and D. Yeung. Robust locally linear embedding. Pattern Recognition, 39:1053-1065, 2006.
[5] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, Berlin, 2009.
[6] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 85(406):502-516, 1989.
[7] P. J. Huber. Robust Statistics. Wiley, New York, 1981.
[8] I. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, New York, 1986.
[9] S. Klanke and H. Ritter. Variants of unsupervised kernel regression: General cost functions. Neurocomputing, 70(7-9):1289-1303, 2007.
[10] T. Kohonen. Self-Organizing Maps. Springer, 2001.
[11] O. Kramer. Fast blackbox optimization: Iterated local search and the strategy of Powell. In Proceedings of the 2009 International Conference on Genetic and Evolutionary Methods, 2009.
[12] P. Meinicke. Unsupervised Learning in a Generalized Regression Framework. PhD thesis, University of Bielefeld, 2000.
[13] P. Meinicke, S. Klanke, R. Memisevic, and H. Ritter. Principal surfaces from unsupervised kernel regression. IEEE Trans. Pattern Anal. Mach. Intell., 27(9):1379-1391, 2005.
[14] E. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 10:186-190, 1964.
[15] A. Ostermeier, A. Gawelczyk, and N. Hansen. A derandomized approach to self adaptation of evolution strategies. Evolutionary Computation, 2(4):369-380, 1994.
[16] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6):559-572, 1901.
[17] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, pages 586-591, 1993.
[18] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[19] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[20] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[21] J. Verbeek, N. Vlassis, and B. Kröse. A k-segments algorithm for finding principal curves. Pattern Recognition, 23:1009-1017, 2002.
[22] G. Watson. Smooth regression analysis. Sankhya Series A, 26:359-372, 1964.