When Are Nonconvex Problems Not Scary?


arXiv:1510.06096v1 [cs.IT] 21 Oct 2015

Ju Sun, Qing Qu, John Wright
Department of Electrical Engineering, Columbia University
{js4038, qq2105, jw2966}@columbia.edu

Abstract

In this paper, we focus on nonconvex optimization problems with no "spurious" local minimizers and whose saddle points are at most second-order. Concrete applications such as dictionary learning, phase retrieval, and tensor decomposition are known to induce such structures. We describe a second-order trust-region algorithm that provably converges to a local minimizer in polynomial time. Finally, we highlight alternatives and open problems in this direction.

1 Introduction

General nonconvex optimization problems are NP-hard [1, 2]. In applied disciplines, however, nonconvex problems abound, and heuristic algorithms are often surprisingly effective. The ability of nonconvex heuristics to find high-quality solutions remains largely mysterious. In this paper, we study a family of nonconvex problems that can be solved efficiently. This family includes examples from signal processing and machine learning applications, such as complete dictionary learning [3], generalized phase retrieval [4], and tensor decomposition [5]. These problems exhibit a characteristic structure. In each problem, the goal is to estimate or recover an object from observed data. Under certain technical hypotheses, every local minimizer of the objective function exactly recovers the object of interest. With this structure, the central issue is how to escape the saddle points. Fortunately, for these problems, all saddle points are second-order, i.e., at each such saddle point the Hessian matrix is indefinite, with both strictly positive and strictly negative eigenvalues.¹ Thus, eigenvector directions corresponding to a negative eigenvalue are local descent directions, which one can use to design algorithms that escape the saddle points. Indeed, consider the natural quadratic approximation to the objective f around a saddle point x:

$$\hat{f}(\delta; x) = f(x) + \frac{1}{2}\,\delta^* \nabla^2 f(x)\,\delta.$$

When δ is chosen to align with an eigenvector associated with a negative eigenvalue $\lambda_{\mathrm{neg}}[\nabla^2 f(x)]$, it holds that $\hat{f}(\delta; x) - f(x) \le -\frac{1}{2}|\lambda_{\mathrm{neg}}|\,\|\delta\|^2$. Thus, minimizing $\hat{f}(\delta; x)$ yields a direction $\delta_\star$ that tends to decrease the objective f, provided the local approximation of $\hat{f}$ to f is reasonably accurate.² Based on this intuition, we derive an algorithmic framework that exploits second-order information to escape saddle points and provably returns a local minimizer.
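As a minimal illustration (ours, not part of the paper's algorithm), consider the saddle of $f(x,y) = x^2 - y^2$ at the origin: the eigenvector of the most negative Hessian eigenvalue gives a direction along which the quadratic model strictly decreases.

```python
import numpy as np

# Saddle point of f(x, y) = x^2 - y^2 at the origin: the gradient vanishes
# and the Hessian is indefinite with eigenvalues {2, -2}.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)
lam_neg = eigvals[0]                      # most negative eigenvalue (-2)
v = eigvecs[:, 0]                         # associated unit eigenvector

# Quadratic model change: f_hat(delta; x) - f(x) = 0.5 * delta^T H delta
step = 0.1 * v
model_decrease = 0.5 * step @ H @ step    # = 0.5 * lam_neg * ||step||^2 < 0
print(lam_neg, model_decrease)            # -2.0, -0.01
```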

¹They are also called "strict saddle" points in the optimization literature; see, e.g., p. 38 of [6], and also [5].
²For higher-order saddles, which seem to demand higher-order approximations, the computation may quickly become intractable. For example, third-order saddle points seem to require studying spectral properties of three-way tensors, which entails NP-hard computational problems [7].


Figure 1: Not all saddle points are second-order (ridable!). Shown are the functions $f(x, y) = x^2 - y^2$ (left) and $g(x, y) = x^3 - y^3$ (right). For g, both the first- and second-order derivatives vanish at (0, 0), producing a higher-order saddle.
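As a quick check of the caption's claim (a small illustrative snippet, not from the paper), one can compare the Hessians of the two functions at the origin symbolically:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 - y**2
g = x**3 - y**3

# Hessian of f at (0, 0) is indefinite -> a second-order (ridable) saddle.
Hf = sp.hessian(f, (x, y)).subs({x: 0, y: 0})
print(Hf.eigenvals())     # {2: 1, -2: 1}

# Hessian of g at (0, 0) vanishes entirely -> a higher-order saddle.
Hg = sp.hessian(g, (x, y)).subs({x: 0, y: 0})
print(Hg.eigenvals())     # {0: 2}
```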

2 Nonconvex Optimization with Ridable Saddles

In this section, we present a more quantitative definition of the problem class we focus on and provide several concrete examples in this class. We are interested in optimization problems of the form

$$\text{minimize } f(x), \quad \text{subject to } x \in \mathcal{M}. \qquad (1)$$

Here we assume f is twice continuously differentiable, i.e., it has continuous first- and second-order derivatives, and $\mathcal{M}$ is a Riemannian manifold. Restricting f to $\mathcal{M}$ and (with abuse of notation) writing the restricted function also as f, one can effectively treat (1) as an unconstrained optimization over $\mathcal{M}$. We further use $\operatorname{grad} f(x)$ and $\operatorname{Hess} f(x)$ to denote the Riemannian gradient and Hessian of f at a point x,³ which one can think of as the Riemannian counterparts of the Euclidean gradient and Hessian, with the exception that $\operatorname{grad} f(x)[\cdot]$ and $\operatorname{Hess} f(x)[\cdot]$ act only on vectors in the tangent space of $\mathcal{M}$ at x.

Definition 2.1 ((α, β, γ, δ)-ridable function; see also the strict-saddle functions defined in [5]). A function f over a manifold $\mathcal{M}$ is (α, β, γ, δ)-ridable (with α, β, γ, δ strictly positive) if every point $x \in \mathcal{M}$ obeys at least one of the following ($T_x\mathcal{M}$ denotes the tangent space of $\mathcal{M}$ at x):

1) [Negative curvature] There exists $v \in T_x\mathcal{M}$ with $\|v\| = 1$ such that $\langle \operatorname{Hess} f(x)[v], v\rangle \le -\alpha$;

2) [Strong gradient] $\|\operatorname{grad} f(x)\| \ge \beta$;

3) [Strong convexity around minimizers] There exists a local minimizer $x_\star$ such that $\|x - x_\star\| \le \delta$, and for all $y \in \mathcal{M}$ in the $2\delta$ neighborhood of $x_\star$, $\langle \operatorname{Hess} f(y)[v], v\rangle \ge \gamma$ for any $v \in T_y\mathcal{M}$ with $\|v\| = 1$; i.e., the function f is γ-strongly convex in the $2\delta$ neighborhood of $x_\star$.

In this paper, we deal exclusively with minimizing ridable functions. These functions indeed appear in important practical problems, with the additional property that physically all local minimizers are equally "good" solutions.

Figure 2: (Left) Function landscape of learning a sparsifying complete basis via (2) in $\mathbb{R}^3$. (Right) Function landscape of phase retrieval via (3), assuming the target signal x is real and in $\mathbb{R}^2$. In each case, note the equivalent global minimizers and the ridable saddles.
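Before turning to the examples, here is a minimal sketch (our illustration, with the thresholds simply treated as given) of how the trichotomy in Definition 2.1 can be checked at a single point, once the Riemannian gradient and Hessian have been expressed in a tangent-space basis:

```python
import numpy as np

def classify_point(rgrad, hess_min_eig, alpha, beta):
    """Trichotomy of Definition 2.1 at one point.
    rgrad: Riemannian gradient expressed in a tangent-space basis;
    hess_min_eig: smallest eigenvalue of the Riemannian Hessian restricted
    to the tangent space; alpha, beta: the (problem-dependent) ridability
    parameters.  For an (alpha, beta, gamma, delta)-ridable f, a point
    failing cases 1) and 2) must fall in case 3)."""
    if hess_min_eig <= -alpha:
        return "negative curvature"       # case 1): an escape direction exists
    if np.linalg.norm(rgrad) >= beta:
        return "strong gradient"          # case 2): a gradient step suffices
    return "near a local minimizer"       # case 3): strongly convex region
```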

• Complete Dictionary Recovery [3]. Arising in signal processing and machine learning, dictionary learning tries to approximate a given data matrix $Y \in \mathbb{R}^{n \times p}$ as the product of a dictionary A and a sparse coefficient matrix X. In the recovery setting, assuming $Y = A_0 X_0$ with $A_0$ square and invertible, Y and $X_0$ have the same row space. Under an appropriate model on $X_0$, it makes sense to recover one row of $X_0$ at a time by finding the sparsest direction⁴ in $\mathrm{row}(Y)$, i.e., by solving the optimization problem

$$\text{minimize } \|q^\top Y\|_0, \quad \text{subject to } q \ne 0,$$

³A detailed introduction to these quantities can be found in [8]. We prefer to keep the discussion at an intuitive level and convey only the main ideas.
⁴The absolute scale is not recoverable.


which can be relaxed to

$$\text{minimize } f(q) \doteq \frac{1}{p}\sum_{k=1}^{p} h\big(q^\top \hat{y}_k\big), \quad \text{subject to } \|q\|_2 = 1 \ \big[\text{i.e., } q \in \mathbb{S}^{n-1}\big]. \qquad (2)$$

Here h(·) is a smooth approximation to the $|\cdot|$ function and $\hat{y}_k$ is the k-th column of $\hat{Y}$, a proxy of Y. The manifold $\mathcal{M}$ here is $\mathbb{S}^{n-1}$. [3] (Theorem 2.3 and Corollary 2.4) showed that when $h(\cdot) = \mu \log\cosh(\cdot/\mu)$ and p is reasonably large, the q's that help recover rows of $X_0$ are the only local minimizers of f over $\mathbb{S}^{n-1}$. Moreover, there exists a positive constant c such that f is $(c\theta,\, c\theta,\, c\theta/\mu,\, \sqrt{2}\mu/7)$-ridable over $\mathbb{S}^{n-1}$, where θ controls the sparsity level of $X_0$. (A numerical sketch of this objective appears at the end of this section.)

• (Generalized) Phase Retrieval [4]. For a complex signal $x \in \mathbb{C}^n$, generalized phase retrieval (PR) tries to recover x from nonlinear measurements of the form $y_k = |a_k^* x|^2$, for $k = 1, \dots, m$. This task occupies a central place in imaging systems for scientific discovery [9]. Assuming i.i.d. Gaussian measurement noise, a natural formulation for PR is

$$\underset{z \in \mathbb{C}^n}{\text{minimize}}\ f(z) \doteq \frac{1}{4m}\sum_{k=1}^{m}\big(y_k - |a_k^* z|^2\big)^2. \qquad (3)$$

The manifold M here is Cn . It is obvious that for all z, f (z) hasthe same value as f (zeiθ ) for any θ ∈ [0, 2π). [4] showed when m ≥ Ω(npolylog(n)), xeiθ are the only local minimizers, and also global minimizers (as f ≥ 0). Moreover, modulo the trivial equivalence discussed above, the function f is (c, c/(n log m), c, c/(n log m))-ridable for a certain absolute constant c, assuming kxk = 1. • Independent Component Analysis (ICA) and Tensor Decomposition [5]. Typical setting of ICA asks for a linear transformation A for a given data matrix Y , such that rows of AY achieve maximal statistical independence. Tensor decomposition generalizes (spectral) decomposition of matrices. Here we focus on orthogonal decomposable d-th order tensors T which can be represented as T =

r X

a⊗d i ,

i=1

n a> i aj = δij ∀ i, j, (ai ∈ R ∀ i)

where ⊗ generalizes the usual outer product of vectors. Tensor decomposition refers to finding (up to sign and permutation) the components $a_i$ given T. With appropriate preprocessing and up to a small perturbation, ICA is shown to be equivalent to the decomposition of 4-th order orthogonally decomposable tensors [10, 11]. Specifically, [5] showed (Section C.1)⁵ that the minimization problem

$$\text{minimize } f(u) \doteq -T(u, u, u, u) = -\sum_{i=1}^{r} (a_i^\top u)^4, \quad \text{subject to } \|u\|_2 = 1$$

has $\pm a_i$ as its only minimizers, and the function f is $(7/r,\, 1/\mathrm{poly}(r),\, 3,\, 1/\mathrm{poly}(r))$-ridable over $\mathbb{S}^{n-1}$ (see the numerical check at the end of this section). Once one of the components is obtained, one can apply deflation to obtain the others. One alternative that tends to make the process more stable to noise is to recover all the components in one shot. To this end, [5] proposed to solve

$$\text{minimize } g(u_1, \dots, u_r) \doteq \sum_{i \ne j} T(u_i, u_i, u_j, u_j) = \sum_{i \ne j}\sum_{k=1}^{r} (a_k^\top u_i)^2 (a_k^\top u_j)^2, \quad \text{subject to } \|u_i\| = 1\ \forall\, i \in [r].$$

The set $\{U \in \mathbb{R}^{n \times r} : \|u_i\| = 1\ \forall\, i\}$ is called the oblique manifold, which is a product of multiple spheres. [5] showed that all local minimizers of g are equivalent (i.e., signed and permuted) copies of $[a_1, \dots, a_r]$. Moreover, g is $(1/\mathrm{poly}(r),\, 1/\mathrm{poly}(r),\, 1,\, 1/\mathrm{poly}(r))$-ridable.

⁵[5] does not use the manifold language we adopt here, but resorts to Lagrange multipliers and optimality of the Lagrangian function. For the two decomposition formulations discussed here, one can verify that the gradient and Hessian they define are exactly the Riemannian gradient and Hessian on the respective manifolds.
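To make the dictionary-recovery example concrete, here is a minimal sketch (our illustration, not code from [3]) of the smoothed objective (2) with $h(t) = \mu\log\cosh(t/\mu)$ and of its Riemannian gradient on the sphere, obtained by projecting the Euclidean gradient onto the tangent space.

```python
import numpy as np

def f_dl(q, Yhat, mu=1e-2):
    """Objective (2): average of h(q^T yhat_k) over the columns of Yhat,
    with h(t) = mu * log cosh(t / mu) as a smooth surrogate for |t|."""
    t = Yhat.T @ q
    # log cosh(s) = logaddexp(s, -s) - log 2, computed in a stable way
    return np.mean(mu * (np.logaddexp(t / mu, -t / mu) - np.log(2.0)))

def grad_f_dl_sphere(q, Yhat, mu=1e-2):
    """Riemannian gradient of (2) on S^{n-1}: project the Euclidean gradient
    (1/p) * sum_k tanh(q^T yhat_k / mu) * yhat_k onto T_q S^{n-1}."""
    t = Yhat.T @ q
    egrad = Yhat @ np.tanh(t / mu) / Yhat.shape[1]
    return egrad - (q @ egrad) * q        # (I - q q^T) egrad
```

By construction the result of `grad_f_dl_sphere` is tangent to the sphere, i.e., `q @ grad_f_dl_sphere(q, Yhat)` is numerically zero.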
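For the phase retrieval example, the following toy check (again our sketch, with the rows of A standing in for $a_k^*$) verifies the properties used in the text: $f \ge 0$, $f(x) = 0$, and the global phase ambiguity $f(x e^{i\theta}) = f(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 200                                   # toy sizes; [4] needs m >= Omega(n polylog n)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)                          # target signal with ||x|| = 1
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
y = np.abs(A @ x) ** 2                          # phaseless measurements y_k

def f_pr(z):
    """Objective (3)."""
    return np.sum((y - np.abs(A @ z) ** 2) ** 2) / (4 * m)

z0 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
print(f_pr(x))                                  # ~0: x is a global minimizer
print(f_pr(x * np.exp(1.2j)))                   # ~0: so is x e^{i theta}
print(f_pr(z0))                                 # > 0 for a generic point
```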
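Finally, for the orthogonally decomposable tensor example, a small check (our sketch) that the components $\pm a_i$ attain the minimum value $-1$ of $f(u) = -\sum_i (a_i^\top u)^4$ over the sphere:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 4
A = np.linalg.qr(rng.standard_normal((n, r)))[0]    # orthonormal columns a_1, ..., a_r

def f_tensor(u):
    """Single-component objective: -T(u, u, u, u) = -sum_i (a_i^T u)^4."""
    return -np.sum((A.T @ u) ** 4)

u = rng.standard_normal(n)
u /= np.linalg.norm(u)                              # a generic point on S^{n-1}
print(f_tensor(A[:, 0]), f_tensor(-A[:, 2]))        # both -1.0: components are minimizers
print(f_tensor(u))                                  # >= -1 for any unit u
```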


3 Second-order Trust-region Method and Proof of Convergence

The intuition that second-order information can help escape ridable saddles suggests, from the very start, a second-order method. We describe a second-order trust-region algorithm on manifolds [8, 12] for this purpose. For the generic problem (1), we start from any feasible $x^{(0)} \in \mathcal{M}$ and form a sequence of iterates $x^{(1)}, x^{(2)}, \dots \in \mathcal{M}$ as follows. For the current iterate $x^{(k)}$, we consider the quadratic approximation

$$\hat{f}\big(\delta; x^{(k)}\big) \doteq f\big(x^{(k)}\big) + \big\langle \delta, \operatorname{grad} f(x^{(k)}) \big\rangle + \frac{1}{2}\big\langle \operatorname{Hess} f(x^{(k)})[\delta], \delta \big\rangle, \qquad (4)$$

which is defined for all $\delta \in T_{x^{(k)}}\mathcal{M}$. The next iterate is determined by minimizing the quadratic approximation within a small radius Δ (i.e., the trust region) of $x^{(k)}$, i.e.,

$$\delta^{(k+1)} \doteq \underset{\delta \in T_{x^{(k)}}\mathcal{M},\ \|\delta\|_2 \le \Delta}{\arg\min}\ \hat{f}\big(\delta; x^{(k)}\big), \qquad (5)$$

which is called the Riemannian trust-region subproblem. The vector $x^{(k)} + \delta^{(k+1)}$ is generally not a point on $\mathcal{M}$. One then performs a retraction step $R_{x^{(k)}}$ that pulls the vector back to the manifold, resulting in the update formula $x^{(k+1)} = R_{x^{(k)}}\big(x^{(k)} + \delta^{(k+1)}\big)$. Most manifolds of practical interest are embedded submanifolds of $\mathbb{R}^{m \times n}$, and the tangent space is then a subspace of $\mathbb{R}^{m \times n}$. For an $x^{(k)} \in \mathcal{M}$ and an orthonormal basis U of $T_{x^{(k)}}\mathcal{M}$, one can solve (5) by solving the recast Euclidean trust-region subproblem

$$\xi^{(k+1)} \doteq \underset{\|\xi\| \le \Delta}{\arg\min}\ \hat{f}\big(U\xi; x^{(k)}\big), \qquad (6)$$

for which efficient numerical algorithms exist [13–16]. The design choice of the retraction is often problem-specific, ranging from the classical exponential map to the Euclidean projection that works for many matrix manifolds [17]. To show that the trust-region algorithm converges to a local minimizer, we assume Δ is small enough that the approximation error of (4) to f is "negligible" locally. Each step around a negative-curvature or strong-gradient point decreases the objective by a certain amount; indeed, it is clear that there is always a direction of descent in such cases, so the trust-region step approximately follows one descent direction and decreases the function value. When the iterate sequence moves into a strongly convex region around a local minimizer, a step is either constrained, in which case it also decreases the objective by a certain amount, or unconstrained, which is a good indicator that the target minimizer is within a radius Δ. In the latter case, the algorithm behaves exactly as the classical Newton method, and quadratic convergence of the iterate sequence can be shown.
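To illustrate the mechanics, below is a simplified sketch of one such step, assuming the sphere $\mathbb{S}^{n-1}$ as the manifold, using SciPy's `null_space` to build a tangent basis, and ignoring the so-called hard case of the trust-region subproblem; this is our illustration, not the implementation used in [3, 4].

```python
import numpy as np
from scipy.linalg import null_space

def trust_region_step_sphere(q, egrad, ehess, Delta):
    """One step of the Riemannian trust-region scheme (4)-(6) for f restricted
    to S^{n-1}.  egrad (vector) and ehess (matrix) are the Euclidean gradient
    and Hessian of f at q; the Riemannian gradient and Hessian follow from the
    standard sphere formulas."""
    n = q.size
    P = np.eye(n) - np.outer(q, q)                    # projector onto T_q S^{n-1}
    rgrad = P @ egrad                                 # Riemannian gradient
    rhess = P @ ehess @ P - (q @ egrad) * P           # Riemannian Hessian on T_q

    U = null_space(q[None, :])                        # orthonormal basis of T_q S^{n-1}
    g, H = U.T @ rgrad, U.T @ rhess @ U               # subproblem (6) in coordinates xi

    # Solve min_{||xi|| <= Delta} g^T xi + 0.5 xi^T H xi by a Lagrange-multiplier
    # search (the "hard case" is ignored in this sketch).
    lam_min = np.linalg.eigvalsh(H)[0]
    step = lambda lam: np.linalg.solve(H + lam * np.eye(n - 1), -g)
    if lam_min > 0 and np.linalg.norm(step(0.0)) <= Delta:
        xi = step(0.0)                                # interior Newton step
    else:
        lo, hi = max(0.0, -lam_min) + 1e-10, max(0.0, -lam_min) + 1.0
        while np.linalg.norm(step(hi)) > Delta:       # bracket ||xi(lam)|| = Delta
            hi *= 2.0
        for _ in range(60):                           # bisection on the multiplier
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if np.linalg.norm(step(mid)) > Delta else (lo, mid)
        xi = step(hi)

    delta = U @ xi                                    # tangent step delta^{(k+1)}
    x_new = q + delta
    return x_new / np.linalg.norm(x_new)              # retraction: normalize back to the sphere
```

In practice one would use a dedicated trust-region subproblem solver [13–16] and a problem-specific retraction [17]; the sketch only mirrors the structure of (4)–(6).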

Figure 3: Illustration of the tangent space $T_q\mathbb{S}^{n-1}$ and the exponential map $\exp_q(\delta)$ defined on the sphere $\mathbb{S}^{n-1}$.
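For the sphere specifically, the exponential map in Figure 3 has a closed form; the sketch below (our illustration) follows the great circle from q in the tangent direction δ for arc length ‖δ‖.

```python
import numpy as np

def exp_map_sphere(q, delta):
    """Exponential map on S^{n-1}: move along the great circle from q in the
    tangent direction delta for arc length ||delta||."""
    t = np.linalg.norm(delta)
    if t < 1e-16:
        return q
    return np.cos(t) * q + np.sin(t) * (delta / t)

# Tiny check: delta must lie in T_q S^{n-1}, i.e., q^T delta = 0.
q = np.array([0.0, 0.0, 1.0])
delta = np.array([0.3, -0.1, 0.0])            # orthogonal to q
p = exp_map_sphere(q, delta)
print(np.linalg.norm(p))                      # 1.0: the result stays on the sphere
```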

A quantitative convergence proof demands knowledge of the ridability parameters, smoothness parameters of the objective, and elements of Riemannian geometry. We refer the reader to [3, 4] for practical examples of such convergence analyses.

4 Discussion

Recently, there has been a surge of interest in understanding nonconvex heuristics for practical problems [18–46]. The majority of these works start from clever initializations and then proceed with an analysis of local convergence. In comparison, for ridable functions, second-order trust-region algorithms with arbitrary initialization are guaranteed to retrieve a target minimizer. Identifying ridable functions has involved intensive technical work [3, 4]. It would be interesting to see whether streamlined toolkits can be developed, say via operational rules or unified potential functions. This would facilitate the study of other practical problems, such as deep networks, in which saddle points are believed to be prevalent and to constitute a significant computational bottleneck [47–49]. To match the heuristics computationally, algorithms more practical than solving second-order trust-region subproblems are needed. In fact, simulations on several practical problems suggest that gradient-style algorithms with random initializations succeed; [5] is a step in this direction.

References

[1] K. G. Murty and S. N. Kabadi, "Some NP-complete problems in quadratic and nonlinear programming," Mathematical Programming, vol. 39, no. 2, pp. 117–129, 1987.
[2] D. P. Bertsekas, Nonlinear Programming, 1999.
[3] J. Sun, Q. Qu, and J. Wright, "Complete dictionary recovery over the sphere," arXiv preprint arXiv:1504.06785, 2015.
[4] J. Sun, Q. Qu, and J. Wright, "A geometric analysis of phase retrieval," in preparation, 2015.
[5] R. Ge, F. Huang, C. Jin, and Y. Yuan, "Escaping from saddle points—online stochastic gradient for tensor decomposition," in Proceedings of The 28th Conference on Learning Theory, pp. 797–842, 2015.
[6] S. Reich and A. D. Ioffe, Nonlinear Analysis and Optimization: Optimization, vol. 2. American Mathematical Society, 2010.
[7] C. J. Hillar and L.-H. Lim, "Most tensor problems are NP-hard," Journal of the ACM, vol. 60, no. 6, p. 45, 2013.
[8] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[9] Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, and M. Segev, "Phase retrieval with application to optical imaging: A contemporary overview," IEEE Signal Processing Magazine, vol. 32, pp. 87–109, May 2015.
[10] A. Frieze, M. Jerrum, and R. Kannan, "Learning linear transformations," in Foundations of Computer Science (FOCS), p. 359, IEEE, 1996.
[11] S. Arora, R. Ge, A. Moitra, and S. Sachdeva, "Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders," in Advances in Neural Information Processing Systems, pp. 2375–2383, 2012.
[12] P.-A. Absil, C. G. Baker, and K. A. Gallivan, "Trust-region methods on Riemannian manifolds," Foundations of Computational Mathematics, vol. 7, no. 3, pp. 303–330, 2007.
[13] J. J. Moré and D. C. Sorensen, "Computing a trust region step," SIAM Journal on Scientific and Statistical Computing, vol. 4, no. 3, pp. 553–572, 1983.
[14] A. R. Conn, N. I. Gould, and P. L. Toint, Trust Region Methods, vol. 1. SIAM, 2000.
[15] C. Fortin and H. Wolkowicz, "The trust region subproblem and semidefinite programming," Optimization Methods and Software, vol. 19, no. 1, pp. 41–67, 2004.
[16] E. Hazan and T. Koren, "A linear-time algorithm for trust region problems," arXiv preprint arXiv:1401.6757, 2014.
[17] P.-A. Absil and J. Malick, "Projection-like retractions on matrix manifolds," SIAM Journal on Optimization, vol. 22, no. 1, pp. 135–158, 2012.
[18] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2980–2998, 2010.
[19] P. Jain, P. Netrapalli, and S. Sanghavi, "Low-rank matrix completion using alternating minimization," in Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 665–674, ACM, 2013.
[20] M. Hardt, "Understanding alternating minimization for matrix completion," in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 651–660, IEEE, 2014.
[21] M. Hardt and M. Wootters, "Fast matrix completion without the condition number," in Proceedings of The 27th Conference on Learning Theory, pp. 638–678, 2014.
[22] P. Netrapalli, U. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain, "Non-convex robust PCA," in Advances in Neural Information Processing Systems, pp. 1107–1115, 2014.
[23] P. Jain and P. Netrapalli, "Fast exact matrix completion with finite samples," arXiv preprint arXiv:1411.1087, 2014.
[24] R. Sun and Z.-Q. Luo, "Guaranteed matrix completion via non-convex factorization," arXiv preprint arXiv:1411.8003, 2014.
[25] Q. Zheng and J. Lafferty, "A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements," arXiv preprint arXiv:1506.06081, 2015.
[26] S. Tu, R. Boczar, M. Soltanolkotabi, and B. Recht, "Low-rank solutions of linear matrix equations via Procrustes flow," arXiv preprint arXiv:1507.03566, 2015.
[27] Y. Chen and M. J. Wainwright, "Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees," arXiv preprint arXiv:1509.03025, 2015.


[28] P. Netrapalli, P. Jain, and S. Sanghavi, "Phase retrieval using alternating minimization," in Advances in Neural Information Processing Systems, pp. 2796–2804, 2013.
[29] E. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms," IEEE Transactions on Information Theory, vol. 61, pp. 1985–2007, April 2015.
[30] Y. Chen and E. J. Candès, "Solving random quadratic systems of equations is nearly as easy as solving linear systems," arXiv preprint arXiv:1505.05114, 2015.
[31] C. D. White, R. Ward, and S. Sanghavi, "The local convexity of solving quadratic equations," arXiv preprint arXiv:1506.07868, 2015.
[32] P. Jain and S. Oh, "Provable tensor factorization with missing data," in Advances in Neural Information Processing Systems, pp. 1431–1439, 2014.
[33] A. Anandkumar, R. Ge, and M. Janzamin, "Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates," arXiv preprint arXiv:1402.5180, 2014.
[34] A. Anandkumar, R. Ge, and M. Janzamin, "Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models," arXiv preprint arXiv:1411.1488, 2014.
[35] A. Anandkumar, P. Jain, Y. Shi, and U. Niranjan, "Tensor vs matrix methods: Robust tensor decomposition under block sparse perturbations," arXiv preprint arXiv:1510.04747, 2015.
[36] X. Yi, C. Caramanis, and S. Sanghavi, "Alternating minimization for mixed linear regression," arXiv preprint arXiv:1310.3745, 2013.
[37] H. Sedghi and A. Anandkumar, "Provable tensor methods for learning mixtures of classifiers," arXiv preprint arXiv:1412.3046, 2014.
[38] K. Lee, Y. Wu, and Y. Bresler, "Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization," arXiv preprint arXiv:1312.0525, 2013.
[39] Q. Qu, J. Sun, and J. Wright, "Finding a sparse vector in a subspace: Linear sparsity using alternating directions," in Advances in Neural Information Processing Systems, pp. 3401–3409, 2014.
[40] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon, "Learning sparsely used overcomplete dictionaries via alternating minimization," arXiv preprint arXiv:1310.7991, 2013.
[41] A. Agarwal, A. Anandkumar, and P. Netrapalli, "Exact recovery of sparsely used overcomplete dictionaries," arXiv preprint arXiv:1309.1952, 2013.
[42] S. Arora, R. Ge, and A. Moitra, "New algorithms for learning incoherent and overcomplete dictionaries," arXiv preprint arXiv:1308.6273, 2013.
[43] S. Arora, R. Ge, T. Ma, and A. Moitra, "Simple, efficient, and neural algorithms for sparse coding," arXiv preprint arXiv:1503.00778, 2015.
[44] S. Arora, A. Bhaskara, R. Ge, and T. Ma, "More algorithms for provable dictionary learning," arXiv preprint arXiv:1401.0579, 2014.
[45] P. Jain, C. Jin, S. M. Kakade, and P. Netrapalli, "Computing matrix squareroot via non convex local search," arXiv preprint arXiv:1507.05854, 2015.
[46] S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi, "Dropping convexity for faster semi-definite optimization," arXiv preprint arXiv:1509.03917, 2015.
[47] R. Pascanu, Y. N. Dauphin, S. Ganguli, and Y. Bengio, "On the saddle point problem for non-convex optimization," arXiv preprint arXiv:1405.4604, 2014.
[48] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.
[49] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, "The loss surface of multilayer networks," arXiv preprint arXiv:1412.0233, 2014.
