Stochastic Optimization with Importance Sampling

arXiv:1401.2753v1 [stat.ML] 13 Jan 2014

Peilin Zhao
Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

Tong Zhang
Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

Abstract

Uniform sampling of training data has been commonly used in traditional stochastic optimization algorithms such as Proximal Stochastic Gradient Descent (prox-SGD) and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA). Although uniform sampling guarantees that the sampled stochastic quantity is an unbiased estimate of the corresponding true quantity, the resulting estimator may have a rather high variance, which negatively affects the convergence of the underlying optimization procedure. In this paper we study stochastic optimization with importance sampling, which improves the convergence rate by reducing the stochastic variance. Specifically, we study prox-SGD with importance sampling and prox-SDCA with importance sampling. For prox-SGD, instead of adopting uniform sampling throughout the training process, the proposed algorithm employs importance sampling to minimize the variance of the stochastic gradient. For prox-SDCA, the proposed importance sampling scheme aims to achieve a higher expected dual value at each coordinate ascent step. We provide extensive theoretical analysis to show that the convergence rates with the proposed importance sampling methods can be significantly improved under suitable conditions, both for prox-SGD and for prox-SDCA. Experiments are provided to verify the theoretical analysis.

1 Introduction

Stochastic optimization has been extensively studied in the machine learning community [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. In general, at every step, a traditional stochastic optimization method samples one training example or one dual coordinate uniformly at random from the training data, and then updates the model parameter using the sampled example or dual coordinate. Although uniform sampling simplifies the analysis, it is insufficient because it may introduce a very high variance of the sampled quantity, which negatively affects the convergence rate of the resulting optimization procedure. In this paper we study stochastic optimization with importance sampling, which reduces the stochastic variance to significantly improve the convergence rate. Specifically, this paper focuses on importance sampling techniques for Proximal Stochastic Gradient Descent (prox-SGD) [4] and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA) [13]. For prox-SGD, traditional algorithms such as Stochastic Gradient Descent (SGD) sample training examples uniformly at random during the entire learning process, so that the stochastic gradient is an unbiased estimate of the true gradient [1, 2, 3, 4]. However, the variance of the resulting stochastic gradient estimator may be very high, since the stochastic gradient can vary significantly over different examples. In order to improve convergence, this paper proposes a sampling distribution and a corresponding unbiased importance-weighted gradient estimator that achieves minimal variance. To this end, we analyze the relation between the variance of the stochastic gradient and the sampling distribution. We show that to minimize the variance, the optimal sampling distribution should be roughly proportional to the norm of the stochastic gradient. To simplify computation, we also consider the use of upper bounds on these norms. Our theoretical analysis shows that under certain conditions, the proposed sampling method can significantly improve the convergence rate, and our results include the existing theoretical results for uniformly sampled prox-SGD and SGD as special cases.


Similarly, for prox-SDCA, the traditional approach such as Stochastic Dual Coordinate Ascent (SDCA) [12] picks a coordinate to update by sampling the training data uniformly at random [5, 6, 7, 8, 9, 10, 11, 12, 13]. It was shown recently that the SDCA and prox-SDCA algorithms with uniform random sampling converge much faster than with a fixed cyclic ordering [12, 13]. However, this paper shows that convergence can be further improved with an appropriately defined importance sampling strategy. To find the optimal sampling distribution, we analyze the connection between the expected increase of the dual objective and the sampling distribution, and obtain the optimal solution, which depends on the smoothness parameters of the loss functions. Our analysis shows that under certain conditions, the proposed sampling method can significantly improve the convergence rate. In addition, our theoretical results include the existing results for uniformly sampled prox-SDCA and SDCA as special cases.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents some preliminaries. Section 4 studies stochastic optimization with importance sampling. Section 5 lists several applications of the proposed algorithms. Section 6 gives our empirical evaluations. Section 7 concludes the paper.

2 Related Work

We review some related work on Proximal Stochastic Gradient Descent (including Stochastic Gradient Descent) and Proximal Stochastic Dual Coordinate Ascent (including Stochastic Dual Coordinate Ascent). In recent years, Proximal Stochastic Gradient Descent has been extensively studied [4, 14]. As a special case of prox-SGD, Stochastic Gradient Descent has been extensively studied in stochastic approximation theory [15]; however, those results are often asymptotic, with no explicit bound in terms of $T$. Later on, finite-sample convergence rates of SGD for solving linear prediction problems were studied by a number of authors [1, 16]. In general, prox-SGD can achieve a convergence rate of $O(1/\sqrt{T})$ for convex loss functions and a convergence rate of $O(\log T/T)$ for strongly convex loss functions, where $T$ is the number of iterations of the algorithm. More recently, researchers have improved the previous bound to $O(1/T)$ by $\alpha$-suffix averaging [2]: instead of returning the average of the entire sequence of classifiers, the algorithm averages and returns just an $\alpha$-suffix, the average of the last $\alpha$ fraction of the whole sequence of classifiers. In practice it may be difficult for users to decide when to compute the $\alpha$-suffix. To solve this issue, a polynomial decay averaging strategy was proposed by [3], which decays the weights of old individual classifiers polynomially and also guarantees an $O(1/T)$ convergence bound.

For Proximal Stochastic Dual Coordinate Ascent [13], Shalev-Shwartz and Zhang recently proved that the algorithm achieves a convergence rate of $O(1/T)$ for Lipschitz loss functions and enjoys a linear convergence rate of $\exp(-O(T))$ for smooth loss functions. For structural SVM, a similar result was also obtained in [9]. Several other researchers [6, 7] have studied the convergence behavior of the related non-randomized DCA (dual coordinate ascent) algorithm for SVM, but could only obtain weaker convergence results. The related randomized coordinate descent method has been investigated by some other authors [8, 10, 17]. However, when applied to SDCA, that analysis only yields a convergence rate for the dual objective value, while in machine learning applications we are mainly interested in the convergence of the primal objective. Recently, Shai Shalev-Shwartz and Tong Zhang resolved this issue by providing a primal-dual analysis that showed a linear convergence rate $O(\exp(-\frac{\lambda T}{1+\lambda n}))$ of the duality gap for SDCA with smooth loss functions [12].

Although both prox-SGD and prox-SDCA have been extensively studied, most of the existing work only considered the uniform sampling scheme during the entire learning process. We shall mention that for coordinate descent, some researchers have recently considered non-uniform sampling strategies [18, 19], but their results cannot be directly applied to the proximal SDCA we are interested in here. In contrast, the primal-dual analysis of prox-SDCA in this paper is analogous to that of [12], which directly implies a convergence rate for the duality gap.


3 Preliminaries

Here, we briefly introduce some key definitions and propositions that are useful throughout the paper (for details, please refer to [20]). We consider vector functions $\phi : \mathbb{R}^d \to \mathbb{R}$.

Definition 1. A function $\phi : \mathbb{R}^d \to \mathbb{R}$ is $H$-strongly convex if, for all $u, v \in \mathbb{R}^d$, we have
$$\phi(u) \ge \phi(v) + \nabla\phi(v)^\top(u-v) + \frac{H}{2}\|u-v\|^2,$$
or equivalently, for all $s \in [0,1]$,
$$\phi(su+(1-s)v) \le s\phi(u) + (1-s)\phi(v) - \frac{Hs(1-s)}{2}\|u-v\|^2.$$

Definition 2. A function $\phi : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz if, for all $u, v \in \mathbb{R}^d$, we have $|\phi(u)-\phi(v)| \le L\|u-v\|$, where $\|\cdot\|$ is a norm.

Definition 3. A function $\phi : \mathbb{R}^d \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its gradient is $(1/\gamma)$-Lipschitz, or, equivalently, for all $u, v \in \mathbb{R}^d$,
$$\phi(u) \le \phi(v) + \nabla\phi(v)^\top(u-v) + \frac{1}{2\gamma}\|u-v\|^2.$$

Proposition 1. If $\phi$ is $(1/\gamma)$-smooth with respect to a norm $\|\cdot\|_P$, then its dual function $\phi^*$ is $\gamma$-strongly convex with respect to the dual norm $\|\cdot\|_D$, where
$$\phi^*(v) = \sup_w\big(v^\top w - \phi(w)\big), \qquad \|v\|_D = \sup_{\|w\|_P=1} v^\top w.$$

Throughout this paper, we denote by $\|\cdot\|$ the $\ell_2$-norm $\|\cdot\|_2$ for simplicity.

4 Stochastic Optimization with Importance Sampling

We consider the following generic optimization problem associated with regularized loss minimization of linear predictors. Let $\phi_1, \phi_2, \ldots, \phi_n$ be $n$ vector functions from $\mathbb{R}^d$ to $\mathbb{R}$. Our goal is to find an approximate solution of the following optimization problem:
$$\min_{w\in\mathbb{R}^d}\ P(w) := \underbrace{\frac{1}{n}\sum_{i=1}^n \phi_i(w)}_{f(w)} + \lambda r(w), \qquad (1)$$
where $\lambda > 0$ is a regularization parameter and $r$ is a regularizer. For example, given examples $(x_i, y_i)$ where $x_i\in\mathbb{R}^d$ and $y_i\in\{-1,+1\}$, the Support Vector Machine problem is obtained by setting $\phi_i(w) = [1-y_ix_i^\top w]_+$ with $[z]_+=\max(0,z)$ and $r(w)=\frac{1}{2}\|w\|^2$. Regression problems also fall into the above form: ridge regression is obtained by setting $\phi_i(w)=(y_i-x_i^\top w)^2$ and $r(w)=\frac{1}{2}\|w\|^2$, and the lasso is obtained by setting $\phi_i(w)=(y_i-x_i^\top w)^2$ and $r(w)=\|w\|_1$. Let $w^*$ be the optimum of (1). We say that a solution $w$ is $\epsilon_P$-sub-optimal if $P(w)-P(w^*)\le\epsilon_P$. We analyze the convergence rates of the proposed algorithms with respect to the number of iterations.
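As a concrete instance of (1), the following minimal sketch (ours, not part of the paper) evaluates the lasso objective above; the names `lasso_objective`, `X` (whose rows are the $x_i^\top$), and `y` are illustrative assumptions:

    import numpy as np

    def lasso_objective(w, X, y, lam):
        # P(w) = f(w) + lam * r(w), with phi_i(w) = (y_i - x_i^T w)^2
        # and r(w) = ||w||_1, the lasso instance of problem (1).
        f = np.mean((y - X @ w) ** 2)
        return f + lam * np.linalg.norm(w, ord=1)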

4.1 Proximal Stochastic Gradient Descent with Importance Sampling

In this subsection, we consider proximal stochastic gradient descent (prox-SGD) with importance sampling. If we directly apply full or stochastic gradient descent to the optimization problem (1), the solution may not satisfy some useful desirable property. For example, when $r(w)=\|w\|_1$, the optimal solution of problem (1) should be sparse, and we would like the approximate solution to be sparse as well. However, if we directly use stochastic (sub)gradient descent, the resulting solution will not achieve sparsity [4]. To effectively solve the optimization problem (1), a well-known method is proximal gradient descent, which can be described by the following update rule for $t=1,2,\ldots$:
$$w^{t+1} = \arg\min_w\Big\{\langle\nabla f(w^t), w\rangle + \lambda r(w) + \frac{1}{2\eta_t}\|w-w^t\|^2\Big\}, \qquad (2)$$
which produces the $(t+1)$-th iterate as
$$w^{t+1} = \mathrm{prox}_{\eta_t\lambda r}\big(w^t - \eta_t\nabla f(w^t)\big), \quad\text{where}\quad \mathrm{prox}_h(x) = \arg\min_w\Big\{h(w)+\frac{1}{2}\|w-x\|^2\Big\}.$$
We assume that the proximal mapping of $\eta_t\lambda r(w)$, i.e., $\mathrm{prox}_{\eta_t\lambda r}(x)$, is easy to compute. For example, when $r(w)=\|w\|_1$, the proximal mapping of $\lambda r(w)$ is the shrinkage operation $\mathrm{prox}_{\lambda r}(x) = \mathrm{sign}(x)\odot[|x|-\lambda]_+$, where $\odot$ is the element-wise product. In particular, when $r(w)=0$, $\mathrm{prox}_{\lambda r}(x)=x$ and $w^{t+1}=w^t-\eta_t\nabla f(w^t)$, which is the well-known gradient descent method. In addition, setting the derivative of the objective in (2) to zero, we obtain the following implicit update rule for the iterative solution:
$$w^{t+1} = w^t - \eta_t\nabla f(w^t) - \eta_t\lambda\,\partial r(w^{t+1}),$$
where $\partial r(w^{t+1})$ is a subgradient.

However, at each step proximal gradient descent requires the calculation of $n$ derivatives $\nabla\phi_i$, which results in a time complexity of $O(n)$. To reduce this cost, Proximal Stochastic Gradient Descent can be used: at each iteration $t=1,2,\ldots$, we draw $i_t$ uniformly at random from $\{1,2,\ldots,n\}$ and update according to the formula
$$w^{t+1} = \arg\min_w\Big\{\langle\nabla\phi_{i_t}(w^t), w\rangle + \lambda r(w) + \frac{1}{2\eta_t}\|w-w^t\|^2\Big\}. \qquad (3)$$
Since
$$\mathbb{E}[\nabla\phi_{i_t}(w^t)\,|\,w^t] = \frac{1}{n}\sum_{i=1}^n\nabla\phi_i(w^t) = \nabla f(w^t),$$
the optimization problem (3) is an unbiased estimate of that for proximal gradient descent, i.e., the optimization problem (2). Similarly, proximal stochastic gradient descent produces the $(t+1)$-th iterate as
$$w^{t+1} = \mathrm{prox}_{\eta_t\lambda r}\big(w^t - \eta_t\nabla\phi_{i_t}(w^t)\big),$$
or the following implicit solution:
$$w^{t+1} = w^t - \eta_t\nabla\phi_{i_t}(w^t) - \eta_t\lambda\,\partial r(w^{t+1}). \qquad (4)$$
It is easy to see that when $r(w)=0$, proximal stochastic gradient descent recovers the well-known stochastic gradient descent method, $w^{t+1} = w^t - \eta_t\nabla\phi_{i_t}(w^t)$.

The advantage of proximal stochastic gradient descent is that each step relies only on a single derivative $\nabla\phi_{i_t}(\cdot)$, so its computational cost is $1/n$ of that of standard proximal gradient descent. A disadvantage of the method, however, is that the randomness introduces variance: $\nabla\phi_{i_t}(w^t)$ equals the gradient $\nabla f(w^t)$ in expectation, but $\nabla\phi_i(w^t)$ varies with $i$. In particular, if the stochastic gradient has a large variance, then convergence becomes slow. For example, consider the case where each $\phi_i(w)$ is $(1/\gamma)$-smooth and $\eta_t\le\gamma$; then we have
$$f(w^{t+1}) \le f(w^t) - \eta_t\langle\nabla f(w^t),\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle + \frac{\eta_t^2}{2\gamma}\|\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\|^2$$
$$\le f(w^t) - \eta_t\langle\nabla f(w^t),\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle + \frac{\eta_t}{2}\|\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\|^2.$$
In addition, because $r$ is convex, we have
$$r(w^{t+1}) \le r(w^t) + \langle\partial r(w^{t+1}), w^{t+1}-w^t\rangle = r(w^t) - \eta_t\langle\partial r(w^{t+1}),\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle,$$
where the equality follows from (4). Combining the above two inequalities, we can get
$$\mathbb{E}P(w^{t+1}) = \mathbb{E}[f(w^{t+1})+\lambda r(w^{t+1})]$$
$$\le \mathbb{E}[f(w^t)+\lambda r(w^t)] - \eta_t\mathbb{E}\langle\nabla f(w^t),\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle + \mathbb{E}\frac{\eta_t}{2}\|\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\|^2 - \eta_t\mathbb{E}\langle\lambda\partial r(w^{t+1}),\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle$$
$$= \mathbb{E}P(w^t) - \eta_t\mathbb{E}\|\nabla f(w^t)\|^2 - \eta_t\lambda\mathbb{E}\langle\nabla f(w^t),\partial r(w^{t+1})\rangle + \frac{\eta_t}{2}\mathbb{E}\|\nabla\phi_{i_t}(w^t)\|^2 - \frac{\eta_t}{2}\mathbb{E}\|\lambda\partial r(w^{t+1})\|^2$$
$$= \mathbb{E}P(w^t) - \frac{\eta_t}{2}\mathbb{E}\|\nabla f(w^t)+\lambda\partial r(w^{t+1})\|^2 + \frac{\eta_t}{2}V(\nabla\phi_{i_t}(w^t)),$$
where the variance $V(\nabla\phi_{i_t}(w^t))$ is defined as $V(\nabla\phi_{i_t}(w^t)) = \mathbb{E}\|\nabla\phi_{i_t}(w^t)-\nabla f(w^t)\|^2$. From the above analysis, we observe that the smaller the variance, the larger the reduction of the objective function. In the next subsection, we study how to adopt importance sampling to reduce the variance; this observation will be made more rigorous below.

4.1.1 Algorithm

In this subsection, we study proximal SGD with importance sampling. The idea of importance sampling is, at the $t$-th step, to assign each $i\in\{1,\ldots,n\}$ a probability $p_i^t\ge0$ such that $\sum_{i=1}^np_i^t=1$, and to sample $i_t$ from $\{1,\ldots,n\}$ based on $p^t=(p_1^t,\ldots,p_n^t)^\top$. If we adopt this distribution, proximal SGD with importance sampling works as follows:
$$w^{t+1} = \arg\min_w\Big\{\langle(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t), w\rangle + \lambda r(w) + \frac{1}{2\eta_t}\|w-w^t\|^2\Big\},$$
which is another unbiased estimate of the previous optimization problem (2) for proximal gradient descent, because
$$\mathbb{E}[(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\,|\,w^t] = \sum_{i=1}^np_i^t(np_i^t)^{-1}\nabla\phi_i(w^t) = \nabla f(w^t).$$
Similarly, proximal SGD with importance sampling produces the $(t+1)$-th iterate as
$$w^{t+1} = \mathrm{prox}_{\eta_t\lambda r}\big(w^t - \eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\big),$$
or the following implicit solution:
$$w^{t+1} = w^t - \eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t) - \eta_t\lambda\,\partial r(w^{t+1}).$$
Under the previous assumptions, we have the following two inequalities:
$$f(w^{t+1}) \le f(w^t) - \eta_t\langle\nabla f(w^t),(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle + \frac{\eta_t}{2}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\|^2,$$
and
$$r(w^{t+1}) \le r(w^t) - \eta_t\langle\partial r(w^{t+1}),(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle.$$
Combining the above two inequalities, the reduction of the objective can be bounded as follows:
$$\mathbb{E}P(w^{t+1}) = \mathbb{E}[f(w^{t+1})+\lambda r(w^{t+1})]$$
$$\le \mathbb{E}[f(w^t)+\lambda r(w^t)] - \eta_t\mathbb{E}\langle\nabla f(w^t),(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle + \mathbb{E}\frac{\eta_t}{2}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\|^2 - \eta_t\mathbb{E}\langle\lambda\partial r(w^{t+1}),(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\rangle$$
$$= \mathbb{E}P(w^t) - \eta_t\mathbb{E}\|\nabla f(w^t)\|^2 - \eta_t\lambda\mathbb{E}\langle\nabla f(w^t),\partial r(w^{t+1})\rangle + \frac{\eta_t}{2}\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2 - \frac{\eta_t}{2}\mathbb{E}\|\lambda\partial r(w^{t+1})\|^2$$
$$= \mathbb{E}P(w^t) - \frac{\eta_t}{2}\mathbb{E}\|\nabla f(w^t)+\lambda\partial r(w^{t+1})\|^2 + \frac{\eta_t}{2}V\big((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\big),$$
where
$$V\big((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\big) = \sum_{i=1}^np_i^t\big\|(np_i^t)^{-1}\nabla\phi_i(w^t)-\nabla f(w^t)\big\|^2 = \frac{1}{n^2}\sum_{i=1}^n(p_i^t)^{-1}\|\nabla\phi_i(w^t)\|^2 - \|\nabla f(w^t)\|^2.$$
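As a quick numerical sanity check of this variance identity, the following sketch (ours, not from the paper; `grads` stands in for a matrix whose rows play the role of the per-example gradients $\nabla\phi_i(w^t)$) compares the defining sum with the closed form:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 5
    grads = rng.normal(size=(n, d))    # rows stand in for grad phi_i(w^t)
    p = rng.random(n); p /= p.sum()    # an arbitrary sampling distribution

    full_grad = grads.mean(axis=0)
    # Defining sum: V = sum_i p_i * || (n p_i)^{-1} g_i - full_grad ||^2
    V_direct = sum(p[i] * np.linalg.norm(grads[i] / (n * p[i]) - full_grad) ** 2
                   for i in range(n))
    # Closed form: (1/n^2) sum_i ||g_i||^2 / p_i - ||full_grad||^2
    V_closed = (np.sum(np.linalg.norm(grads, axis=1) ** 2 / p) / n ** 2
                - np.linalg.norm(full_grad) ** 2)
    assert np.isclose(V_direct, V_closed)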

According to the above analysis, to maximize the reduction of the objective value, we should choose $p^t$ as the solution of the following optimization:
$$\min_{p^t:\ p_i^t\in[0,1],\ \sum_{i=1}^np_i^t=1} V\big((np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\big) \ \Longleftrightarrow\ \min_{p^t:\ p_i^t\in[0,1],\ \sum_{i=1}^np_i^t=1}\ \frac{1}{n^2}\sum_{i=1}^n(p_i^t)^{-1}\|\nabla\phi_i(w^t)\|^2. \qquad (5)$$
It is easy to verify that the solution of the above optimization is
$$p_i^t = \frac{\|\nabla\phi_i(w^t)\|}{\sum_{j=1}^n\|\nabla\phi_j(w^t)\|}, \quad \forall i\in\{1,2,\ldots,n\}. \qquad (6)$$

Although this distribution minimizes the variance of the $t$-th stochastic gradient, it requires the calculation of $n$ derivatives at each step, which is clearly inefficient. To address this issue, we relax the previous optimization (5) as follows:
$$\min_{p^t:\ p_i^t\in[0,1],\ \sum_ip_i^t=1}\ \frac{1}{n^2}\sum_{i=1}^n(p_i^t)^{-1}\|\nabla\phi_i(w^t)\|^2 \ \le\ \min_{p^t:\ p_i^t\in[0,1],\ \sum_ip_i^t=1}\ \frac{1}{n^2}\sum_{i=1}^n(p_i^t)^{-1}G_i^2 \qquad (7)$$
by introducing constants
$$G_i \ge \|\nabla\phi_i(w)\|, \quad \forall w\in\mathbb{R}^d.$$
Then we can approximate the distribution in equation (6) by minimizing the right-hand side of inequality (7), which gives
$$p_i^t = \frac{G_i}{\sum_{j=1}^nG_j}, \quad \forall i\in\{1,2,\ldots,n\}.$$
Finally, we summarize the proposed proximal SGD with importance sampling in Algorithm 1.

Algorithm 1 Proximal Stochastic Gradient Descent with Importance Sampling (Iprox-SGD)
Input: $\lambda\ge0$.
Initialize: $w^1=0$, $p^1=(1/n,\ldots,1/n)^\top$.
for t = 1, . . . , T do
    Update $p^t$;
    Sample $i_t$ from $\{1,\ldots,n\}$ based on $p^t$;
    Update $w^{t+1} = \mathrm{prox}_{\eta_t\lambda r}\big(w^t - \eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\big)$;
end for
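A minimal Python sketch of Algorithm 1 may help make the update concrete (ours, not the authors' reference implementation). It instantiates the $\ell_1$-regularized case, fixes $p^t$ to the $G_i$-based distribution derived above, and assumes a user-supplied per-example gradient oracle `grad_phi` and gradient-norm bounds `G`:

    import numpy as np

    def prox_l1(x, tau):
        # Shrinkage operator: the proximal mapping of tau * ||.||_1.
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    def iprox_sgd(grad_phi, G, n, d, lam, T, eta, seed=0):
        # Iprox-SGD sketch: sample i with p_i = G_i / sum_j G_j and apply
        # the importance-weighted gradient (n p_i)^{-1} grad phi_i(w^t).
        rng = np.random.default_rng(seed)
        w = np.zeros(d)
        p = G / G.sum()
        for t in range(1, T + 1):
            i = rng.choice(n, p=p)
            g = grad_phi(i, w) / (n * p[i])   # unbiased estimate of grad f
            w = prox_l1(w - eta(t) * g, eta(t) * lam)
        return w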

4.1.2 Analysis

This section provides a convergence analysis of the proposed algorithm. Before presenting the results, we make two general assumptions:
$$r(0)=0 \quad\text{and}\quad r(w)\ge0 \ \text{ for all } w.$$
It is easy to see that these two assumptions are satisfied by all the well-known regularizers. Under the above assumptions, we begin our analysis with a technical lemma.

Lemma 1. Suppose $f(w)$ is $H$-strongly convex over $\mathbb{R}^d$. Then for any $t\ge1$ we have
$$\mathbb{E}\big[f(w^t)-f(w^*)+\lambda r(w^{t+1})-\lambda r(w^*)\big] \le \frac{1}{2\eta_t}\mathbb{E}[\|w^t-w^*\|^2-\|w^{t+1}-w^*\|^2] - \frac{H}{2}\mathbb{E}\|w^t-w^*\|^2 + \frac{\eta_t}{2}\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2.$$

Proof. First, it is easy to check that $w^{t+1} = w^t-\eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)-\eta_t\lambda\partial r(w^{t+1})$. We also have
$$\|w^t-w^*\|^2-\|w^{t+1}-w^*\|^2 = \|w^t-w^*\|^2-\|w^t-\eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)-\eta_t\lambda\partial r(w^{t+1})-w^*\|^2$$
$$= 2\eta_t\langle(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t), w^t-w^*\rangle + 2\eta_t\langle\lambda\partial r(w^{t+1}), w^t-w^*\rangle - \|\eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\eta_t\lambda\partial r(w^{t+1})\|^2.$$
Next, we provide a lower bound for the right-hand side. For the first term, we define
$$\delta_i^t = \langle\nabla\phi_i(w^t), w^t-w^*\rangle - \Big(\phi_i(w^t)-\phi_i(w^*)+\frac{H}{2}\|w^t-w^*\|^2\Big).$$
We have
$$\mathbb{E}\Big[\frac{1}{np_{i_t}^t}\delta_{i_t}^t\Big] = \langle\nabla f(w^t), w^t-w^*\rangle - \Big(f(w^t)-f(w^*)+\frac{H}{2}\|w^t-w^*\|^2\Big) \ge 0. \qquad (8)$$
For the second term, we use the convexity of $r$ to get
$$\langle\partial r(w^{t+1}), w^t-w^*\rangle = \langle\partial r(w^{t+1}), w^{t+1}-w^*\rangle + \langle\partial r(w^{t+1}), w^t-w^{t+1}\rangle \ge r(w^{t+1})-r(w^*) + \langle\partial r(w^{t+1}),\ \eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\eta_t\lambda\partial r(w^{t+1})\rangle.$$
Combining the above three relations, we have
$$2\eta_t\Big[(np_{i_t}^t)^{-1}\Big(\phi_{i_t}(w^t)-\phi_{i_t}(w^*)+\frac{H}{2}\|w^t-w^*\|^2+\delta_{i_t}^t\Big)+\lambda r(w^{t+1})-\lambda r(w^*)\Big]$$
$$\le \|w^t-w^*\|^2-\|w^{t+1}-w^*\|^2 + \eta_t^2\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)+\lambda\partial r(w^{t+1})\|^2 - 2\eta_t^2\lambda(np_{i_t}^t)^{-1}\langle\partial r(w^{t+1}),\nabla\phi_{i_t}(w^t)\rangle - 2\eta_t^2\lambda^2\|\partial r(w^{t+1})\|^2$$
$$= \|w^t-w^*\|^2-\|w^{t+1}-w^*\|^2 + \eta_t^2\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2 - \eta_t^2\|\lambda\partial r(w^{t+1})\|^2.$$
Taking the expectation of the above inequality and using (8), we obtain
$$2\eta_t\,\mathbb{E}\Big[f(w^t)-f(w^*)+\frac{H}{2}\|w^t-w^*\|^2+\lambda r(w^{t+1})-\lambda r(w^*)\Big] \le \mathbb{E}\big[\|w^t-w^*\|^2-\|w^{t+1}-w^*\|^2\big] + \eta_t^2\,\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2 - \eta_t^2\,\mathbb{E}\|\lambda\partial r(w^{t+1})\|^2.$$
Rearranging the terms (and dropping the last term, which is non-positive on the right-hand side) concludes the proof of this lemma.

Given the above lemma, we first prove a convergence result for proximal SGD with importance sampling when all $\phi_i(\cdot)$ are strongly convex.

Theorem 1. Suppose $f(w)$ is $H$-strongly convex over $\mathbb{R}^d$. If we set $\eta_t=1/(Ht)$, it holds for all $T>0$ that
$$\inf_{t\in\{1,\ldots,T\}}\mathbb{E}P(w^t)-P(w^*) \le \frac{1}{T}\sum_{t=1}^T\mathbb{E}P(w^t)-P(w^*) \le \frac{1}{T}\sum_{t=1}^T\mathbb{E}\,\frac{V_t}{2Ht},$$
where
$$V_t = \frac{1}{n^2}\sum_{i=1}^n\frac{1}{p_i^t}\|\nabla\phi_i(w^t)\|^2.$$
If we further assume $\|\nabla\phi_i(w)\|\le G_i$ and set $p_i^t = \frac{G_i}{\sum_{j=1}^nG_j}$, then
$$\inf_{t\in\{1,\ldots,T\}}\mathbb{E}P(w^t)-P(w^*) \le \frac{1}{T}\sum_{t=1}^T\mathbb{E}P(w^t)-P(w^*) \le \frac{\big(\sum_{i=1}^nG_i\big)^2\log T}{2Hn^2T}.$$

Proof. According to Lemma 1, we have
$$\mathbb{E}\big[f(w^t)-f(w^*)+\lambda r(w^{t+1})-\lambda r(w^*)\big] \le \frac{1}{2\eta_t}\mathbb{E}[\|w^t-w^*\|^2-\|w^{t+1}-w^*\|^2] - \frac{H}{2}\mathbb{E}\|w^t-w^*\|^2 + \frac{\eta_t}{2}\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2.$$
Summing the above inequality over $t=1,\ldots,T$ and using $\eta_t=1/(Ht)$, we get
$$\sum_{t=1}^T\mathbb{E}[f(w^t)+\lambda r(w^{t+1})] - \sum_{t=1}^T[f(w^*)+\lambda r(w^*)]$$
$$\le \sum_{t=1}^T\frac{Ht}{2}\big[\mathbb{E}\|w^t-w^*\|^2-\mathbb{E}\|w^{t+1}-w^*\|^2\big] - \frac{H}{2}\sum_{t=1}^T\mathbb{E}\|w^t-w^*\|^2 + \sum_{t=1}^T\frac{\eta_t}{2}\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2$$
$$= -\frac{HT}{2}\mathbb{E}\|w^{T+1}-w^*\|^2 + \sum_{t=1}^T\frac{1}{2Ht}\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2 \le \sum_{t=1}^T\frac{1}{2Ht}\,\mathbb{E}\,\frac{1}{n^2}\sum_{i=1}^n\frac{1}{p_i^t}\|\nabla\phi_i(w^t)\|^2.$$
Finally, combining $r(w^1)-r(w^{T+1})\le0$ with the above inequality concludes the first part of the theorem. To prove the second part, we use the fact $\|\nabla\phi_i(w)\|\le G_i$ and $p_i^t = G_i/\sum_{j=1}^nG_j$ to get
$$V_t \le \Big(\frac{\sum_{i=1}^nG_i}{n}\Big)^2.$$
Plugging the above inequality into the first part concludes the proof of this theorem.

Remark. If we replace all $G_i$ with $G_{\max} = \max\{G_1,\ldots,G_n\}$, then the theorem holds with the uniform sampling distribution and recovers the result in Corollary 10 of [4], i.e., $\frac{1}{T}\sum_{t=1}^T\mathbb{E}P(w^t)-P(w^*)\le\frac{G_{\max}^2\log T}{2HT}$. However,
$$\frac{G_{\max}^2}{\big(\sum_{i=1}^nG_i\big)^2/n^2} = \frac{n^2G_{\max}^2}{\big(\sum_{i=1}^nG_i\big)^2} \ge 1$$
implies that importance sampling does improve the convergence rate, especially when $\big(\sum_{i=1}^n\frac{G_i}{G_{\max}}\big)^2\ll n^2$.

We can also prove a convergence bound for the last predictor when $r(w)=0$ (in this case proximal SGD becomes SGD).

Theorem 2. Suppose $r(w)=0$, $f(w)$ is $H$-strongly convex over $\mathbb{R}^d$, and $\|\nabla\phi_i(w)\|\le G_i$. Then if we set $p_i^t = G_i/\sum_{j=1}^nG_j$ and $\eta_t=1/(Ht)$, it holds for any $T$ that
$$\mathbb{E}\|w^T-w^*\|^2 \le \frac{4\big(\sum_{i=1}^nG_i\big)^2}{H^2n^2T}.$$

Proof. Because $r(w)=0$ and $f(w)$ is $H$-strongly convex, $P(w)$ is also $H$-strongly convex and
$$\langle\nabla P(w), w-w^*\rangle \ge P(w)-P(w^*)+\frac{H}{2}\|w-w^*\|^2 \ge H\|w-w^*\|^2. \qquad (9)$$
Case $t=1$: According to the Cauchy-Schwarz inequality and (9), we have
$$\|\nabla P(w^1)\|\,\|w^1-w^*\| \ge \langle\nabla P(w^1), w^1-w^*\rangle \ge \frac{H}{2}\|w^1-w^*\|^2,$$
and
$$\mathbb{E}[\|(np_{i_1}^1)^{-1}\nabla\phi_{i_1}(w^1)\|^2] = \mathbb{E}\big[\|\nabla P(w^1)+\big((np_{i_1}^1)^{-1}\nabla\phi_{i_1}(w^1)-\nabla P(w^1)\big)\|^2\big]$$
$$= \mathbb{E}[\|\nabla P(w^1)\|^2] + \mathbb{E}[\|(np_{i_1}^1)^{-1}\nabla\phi_{i_1}(w^1)-\nabla P(w^1)\|^2] + 2\mathbb{E}[\langle(np_{i_1}^1)^{-1}\nabla\phi_{i_1}(w^1)-\nabla P(w^1),\nabla P(w^1)\rangle] \ge \mathbb{E}[\|\nabla P(w^1)\|^2].$$
By combining the above two inequalities, we get
$$\mathbb{E}[\|w^1-w^*\|^2] \le \frac{4}{H^2}\mathbb{E}[\|\nabla P(w^1)\|^2] \le \frac{4}{H^2}\mathbb{E}[\|(np_{i_1}^1)^{-1}\nabla\phi_{i_1}(w^1)\|^2] \le \frac{4\big(\sum_{i=1}^nG_i\big)^2}{H^2n^2}.$$
Case $t>1$: First, from the update rule,
$$\mathbb{E}[\|w^{t+1}-w^*\|^2] = \mathbb{E}[\|w^t-\eta_t(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)-w^*\|^2]$$
$$= \mathbb{E}[\|w^t-w^*\|^2] - 2\eta_t\mathbb{E}[\langle(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t), w^t-w^*\rangle] + \eta_t^2\mathbb{E}[\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2]$$
$$\le \mathbb{E}[\|w^t-w^*\|^2] - 2\eta_t\mathbb{E}[\langle\nabla P(w^t), w^t-w^*\rangle] + \eta_t^2\Big(\frac{\sum_{i=1}^nG_i}{n}\Big)^2.$$
By combining the above inequality with inequality (9) and plugging in $\eta_t=\frac{1}{Ht}$, we get
$$\mathbb{E}[\|w^{t+1}-w^*\|^2] \le \Big(1-\frac{2}{t}\Big)\mathbb{E}[\|w^t-w^*\|^2] + \frac{\big(\sum_{i=1}^nG_i\big)^2}{H^2n^2t^2}.$$
The above inequality implies $\mathbb{E}\|w^t-w^*\|^2 \le \frac{4(\sum_{i=1}^nG_i)^2}{H^2n^2t}$ for $t=2$. For $t\ge3$, the bound $\mathbb{E}\|w^t-w^*\|^2 \le \frac{4(\sum_{i=1}^nG_i)^2}{H^2n^2t}$ also follows easily by induction. This proves the desired bound.

Remark: When we set $G_i = G_{\max} = \max\{G_1,\ldots,G_n\}$, the distribution in the theorem becomes the uniform distribution, and the theorem implies Theorem 1 of [2]. Since
$$\frac{G_{\max}^2}{\big(\sum_{i=1}^nG_i\big)^2/n^2} = \frac{n^2G_{\max}^2}{\big(\sum_{i=1}^nG_i\big)^2} \ge 1,$$
importance sampling again improves the convergence rate, especially when $\big(\sum_{i=1}^n\frac{G_i}{G_{\max}}\big)^2\ll n^2$.

Now we analyze the theoretical performance of the proposed algorithm for the case where $f(\cdot)$ may not be strongly convex.

Theorem 3. Suppose all $\phi_i(\cdot)$ are convex over $\mathbb{R}^d$ ($i=1,\ldots,n$). If we set $\eta_t=\eta$, then for all $T$ we have
$$\inf_{t\in\{1,\ldots,T\}}\mathbb{E}P(w^t)-P(w^*) \le \frac{1}{T}\sum_{t=1}^T\mathbb{E}P(w^t)-P(w^*) \le \frac{1}{T}\Big[\frac{1}{2\eta}\|w^*\|^2+\frac{\eta}{2}\sum_{t=1}^T\mathbb{E}V_t\Big],$$
where $V_t = \frac{1}{n^2}\sum_{i=1}^n\frac{1}{p_i^t}\|\nabla\phi_i(w^t)\|^2$. If we further assume $\|\nabla\phi_i(w)\|\le G_i$ and set $p_i^t=\frac{G_i}{\sum_{j=1}^nG_j}$ and $\eta_t=\frac{n\|w^*\|}{\sqrt{T}\sum_{i=1}^nG_i}$, then for any $T$ we have
$$\inf_{t\in\{1,\ldots,T\}}\mathbb{E}P(w^t)-P(w^*) \le \frac{1}{T}\sum_{t=1}^T\mathbb{E}P(w^t)-P(w^*) \le \frac{1}{\sqrt{T}}\Big(\|w^*\|\,\frac{\sum_{i=1}^nG_i}{n}\Big).$$



Proof. According to Lemma 1 (applied with $H=0$), we have
$$\mathbb{E}\big[f(w^t)-f(w^*)+\lambda r(w^{t+1})-\lambda r(w^*)\big] \le \frac{1}{2\eta_t}\mathbb{E}[\|w^t-w^*\|^2-\|w^{t+1}-w^*\|^2] + \frac{\eta_t}{2}\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2.$$
Summing the above inequality over $t=1,\ldots,T$ and using $\eta_t=\eta$, we get
$$\sum_{t=1}^T\mathbb{E}[f(w^t)+\lambda r(w^{t+1})] - \sum_{t=1}^T[f(w^*)+\lambda r(w^*)]$$
$$\le \frac{1}{2\eta}\sum_{t=1}^T\big[\mathbb{E}\|w^t-w^*\|^2-\mathbb{E}\|w^{t+1}-w^*\|^2\big] + \frac{\eta}{2}\sum_{t=1}^T\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2$$
$$= \frac{1}{2\eta}\big[\mathbb{E}\|w^1-w^*\|^2-\mathbb{E}\|w^{T+1}-w^*\|^2\big] + \frac{\eta}{2}\sum_{t=1}^T\mathbb{E}\|(np_{i_t}^t)^{-1}\nabla\phi_{i_t}(w^t)\|^2$$
$$\le \frac{1}{2\eta}\|w^*\|^2 + \frac{\eta}{2}\sum_{t=1}^T\mathbb{E}\,\frac{1}{n^2}\sum_{i=1}^n\frac{1}{p_i^t}\|\nabla\phi_i(w^t)\|^2.$$
Combining $r(w^1)-r(w^{T+1})\le0$ with the above inequality concludes the first part of the theorem; the second part follows by plugging $\|\nabla\phi_i(w)\|\le G_i$, the specified $p_i^t$, and the specified $\eta$ into the first part.

Remark: If we replace $G_i$ with $G_{\max}=\max\{G_1,\ldots,G_n\}$, then the theorem obviously remains correct with the uniform sampling distribution. Moreover, the theorem then recovers the conclusion of Corollary 7 of [4], i.e., $\frac{1}{T}\sum_{t=1}^T\mathbb{E}P(w^t)-P(w^*)\le\frac{1}{\sqrt{T}}\|w^*\|G_{\max}$.

4.2 Proximal Stochastic Dual Coordinate Ascent with Importance Sampling

In this section, we study the Proximal Stochastic Dual Coordinate Ascent method (prox-SDCA) with importance sampling. Prox-SDCA works with the following dual problem of (1):
$$\max_\theta\ D(\theta) := \frac{1}{n}\sum_{i=1}^n-\phi_i^*(-\theta_i) - \lambda r^*\Big(\frac{1}{\lambda n}\sum_{i=1}^n\theta_i\Big). \qquad (10)$$
We assume that $r^*(\cdot)$ is continuously differentiable; the relationship between the primal variable $w$ and the dual variable $\theta$ is
$$w = \nabla r^*(v(\theta)), \qquad v(\theta) = \frac{1}{\lambda n}\sum_{i=1}^n\theta_i.$$
We also assume that $r(w)$ is 1-strongly convex with respect to a norm $\|\cdot\|_{P'}$, i.e.,
$$r(w+\Delta w) \ge r(w) + \nabla r(w)^\top\Delta w + \frac{1}{2}\|\Delta w\|_{P'}^2,$$
which means that $r^*(v)$ is 1-smooth with respect to the dual norm $\|\cdot\|_{D'}$; namely, $r^*(v+\Delta v)\le h(v;\Delta v)$, where
$$h(v;\Delta v) := r^*(v) + \nabla r^*(v)^\top\Delta v + \frac{1}{2}\|\Delta v\|_{D'}^2.$$
At the $t$-th step, traditional Proximal Stochastic Dual Coordinate Ascent (prox-SDCA) picks $i\in\{1,\ldots,n\}$ uniformly at random and updates the dual variable $\theta_i^{t-1}$ as $\theta_i^t = \theta_i^{t-1}+\Delta\theta_i^{t-1}$, where
$$\Delta\theta_i^{t-1} = \arg\max_{\Delta\theta_i}\Big\{-\frac{1}{n}\phi_i^*(-(\theta_i^{t-1}+\Delta\theta_i)) - \lambda\Big(\nabla r^*(v^{t-1})^\top\frac{\Delta\theta_i}{\lambda n} + \frac{1}{2}\Big\|\frac{\Delta\theta_i}{\lambda n}\Big\|_{D'}^2\Big)\Big\}$$
$$= \arg\max_{\Delta\theta_i}\Big\{-\phi_i^*(-(\theta_i^{t-1}+\Delta\theta_i)) - (w^{t-1})^\top\Delta\theta_i - \frac{1}{2\lambda n}\|\Delta\theta_i\|_{D'}^2\Big\}, \qquad (11)$$
where $v^{t-1} = \frac{1}{\lambda n}\sum_{i=1}^n\theta_i^{t-1}$. This is equivalent to maximizing a lower bound of the following problem:
$$\Delta\theta_i^{t-1} = \arg\max_{\Delta\theta_i}\Big\{-\frac{1}{n}\phi_i^*(-(\theta_i^{t-1}+\Delta\theta_i)) - \lambda r^*\Big(v^{t-1}+\frac{1}{\lambda n}\Delta\theta_i\Big)\Big\}.$$
However, the optimization (11) may not have a closed-form solution, and in prox-SDCA we may adopt other update rules $\Delta\theta_i = s(u-\theta_i^{t-1})$ for an appropriately chosen step size parameter $s>0$ and any vector $u\in\mathbb{R}^d$ such that $-u\in\partial\phi_i(w^{t-1})$. When $r(w)=\frac{1}{2}\|w\|^2$, the proximal SDCA method is known as SDCA.

Now we study prox-SDCA with importance sampling, which allows the algorithm to randomly pick $i$ according to probability $p_i$, the $i$-th element of $p\in\mathbb{R}_+^n$ with $\sum_ip_i=1$. Once coordinate $i$ is picked, $\theta_i$ is updated as in traditional prox-SDCA. The main question we are interested in here is which $p=(p_1,\ldots,p_n)^\top$ can optimally accelerate the convergence rate of prox-SDCA. To answer this question, we introduce a lemma that states the relationship between $p$ and the convergence rate of prox-SDCA with importance sampling.

Lemma 2. Given a distribution $p$, if $\phi_i$ is $(1/\gamma_i)$-smooth with respect to the norm $\|\cdot\|_P$, then for any iteration $t$ and any $s$ such that $s_i = s/(p_in)\in[0,1]$ for all $i$, we have
$$\mathbb{E}[D(\theta^t)-D(\theta^{t-1})] \ge \frac{s}{n}\mathbb{E}[P(w^{t-1})-D(\theta^{t-1})] - \frac{s}{2\lambda n^2}G^t, \qquad (12)$$
where
$$G^t = \frac{1}{n}\sum_{i=1}^n\big(s_iR^2-\gamma_i(1-s_i)\lambda n\big)\,\mathbb{E}\|u_i^{t-1}-\theta_i^{t-1}\|_D^2,$$
$R = \sup_{u\ne0}\|u\|_{D'}/\|u\|_D$, and $-u_i^{t-1}\in\partial\phi_i(w^{t-1})$.

Proof. Since only the $i$-th element of $\theta$ is updated, the improvement in the dual objective can be written as
$$n[D(\theta^t)-D(\theta^{t-1})] = \Big(-\phi_i^*(-\theta_i^t)-\lambda nr^*\Big(v^{t-1}+\frac{1}{\lambda n}\Delta\theta_i\Big)\Big) - \Big(-\phi_i^*(-\theta_i^{t-1})-\lambda nr^*(v^{t-1})\Big)$$
$$\ge \underbrace{\Big(-\phi_i^*(-\theta_i^t)-\lambda nh\Big(v^{t-1};\frac{1}{\lambda n}\Delta\theta_i\Big)\Big)}_{A_i} - \underbrace{\Big(-\phi_i^*(-\theta_i^{t-1})-\lambda nr^*(v^{t-1})\Big)}_{B_i}. \qquad (13)$$
By the definition of the update, for all $s_i\in[0,1]$ we have
$$A_i = \max_{\Delta\theta_i}\Big\{-\phi_i^*(-(\theta_i^{t-1}+\Delta\theta_i))-\lambda nh\Big(v^{t-1};\frac{1}{\lambda n}\Delta\theta_i\Big)\Big\} \ge -\phi_i^*\big(-(\theta_i^{t-1}+s_i(u_i^{t-1}-\theta_i^{t-1}))\big)-\lambda nh\Big(v^{t-1};\frac{s_i}{\lambda n}(u_i^{t-1}-\theta_i^{t-1})\Big). \qquad (14)$$
From now on, we drop the superscript $(t-1)$ to simplify the discussion. Because $\phi_i$ is $(1/\gamma_i)$-smooth with respect to $\|\cdot\|_P$, $\phi_i^*$ is $\gamma_i$-strongly convex with respect to $\|\cdot\|_D$, and we have
$$\phi_i^*(-(\theta_i+s_i(u_i-\theta_i))) = \phi_i^*(s_i(-u_i)+(1-s_i)(-\theta_i)) \le s_i\phi_i^*(-u_i)+(1-s_i)\phi_i^*(-\theta_i)-\frac{\gamma_i}{2}s_i(1-s_i)\|u_i-\theta_i\|_D^2.$$
Combining the above two inequalities, we obtain
$$A_i \ge -s_i\phi_i^*(-u_i)-(1-s_i)\phi_i^*(-\theta_i)+\frac{\gamma_i}{2}s_i(1-s_i)\|u_i-\theta_i\|_D^2-\lambda nh\Big(v;\frac{s_i}{\lambda n}(u_i-\theta_i)\Big)$$
$$= -s_i\phi_i^*(-u_i)-(1-s_i)\phi_i^*(-\theta_i)+\frac{\gamma_i}{2}s_i(1-s_i)\|u_i-\theta_i\|_D^2-\lambda nr^*(v)-s_iw^\top(u_i-\theta_i)-\frac{s_i^2}{2\lambda n}\|u_i-\theta_i\|_{D'}^2$$
$$\ge -s_i\phi_i^*(-u_i)-(1-s_i)\phi_i^*(-\theta_i)+\frac{\gamma_i}{2}s_i(1-s_i)\|u_i-\theta_i\|_D^2-\lambda nr^*(v)-s_iw^\top(u_i-\theta_i)-\frac{s_i^2R^2}{2\lambda n}\|u_i-\theta_i\|_D^2$$
$$= \underbrace{-s_i\big(\phi_i^*(-u_i)+u_i^\top w\big)}_{s_i\phi_i(w)}+\underbrace{\big(-\phi_i^*(-\theta_i)-\lambda nr^*(v)\big)}_{B_i}+\frac{s_i}{2}\Big(\gamma_i(1-s_i)-\frac{s_iR^2}{\lambda n}\Big)\|u_i-\theta_i\|_D^2+s_i\big(\phi_i^*(-\theta_i)+\theta_i^\top w\big),$$
where we used $-u_i^{t-1}\in\partial\phi_i(w^{t-1})$, which yields $\phi_i^*(-u_i) = -u_i^\top w-\phi_i(w)$. Therefore
$$A_i-B_i \ge s_i\Big[\phi_i(w)+\phi_i^*(-\theta_i)+\theta_i^\top w\Big]+\frac{s_i}{2}\Big(\gamma_i(1-s_i)-\frac{s_iR^2}{\lambda n}\Big)\|u_i-\theta_i\|_D^2. \qquad (15)$$
Therefore, if we take the expectation of inequality (15) with respect to the choice of $i$ and use the fact $s_i = s/(p_in)$, we obtain
$$\frac{1}{s}\mathbb{E}_t[A_i-B_i] = \frac{1}{n}\sum_{i=1}^n\frac{1}{s_i}[A_i-B_i] \ge \frac{1}{n}\sum_{i=1}^n\Big[\phi_i(w)+\phi_i^*(-\theta_i)+\theta_i^\top w+\frac{1}{2}\Big(\gamma_i(1-s_i)-\frac{s_iR^2}{\lambda n}\Big)\|u_i-\theta_i\|_D^2\Big].$$
Next note that with $w=\nabla r^*(v)$, we have $r(w)+r^*(v)=w^\top v$. Therefore
$$P(w)-D(\theta) = \frac{1}{n}\sum_{i=1}^n\phi_i(w)+\lambda r(w)-\Big(\frac{1}{n}\sum_{i=1}^n-\phi_i^*(-\theta_i)-\lambda r^*(v)\Big) = \frac{1}{n}\sum_{i=1}^n\phi_i(w)+\frac{1}{n}\sum_{i=1}^n\phi_i^*(-\theta_i)+\lambda w^\top v = \frac{1}{n}\sum_{i=1}^n\big(\phi_i(w)+\phi_i^*(-\theta_i)+\theta_i^\top w\big).$$
Taking the expectation of inequality (13) and plugging in the above two equations gives
$$\frac{n}{s}\mathbb{E}[D(\theta^t)-D(\theta^{t-1})] \ge \frac{1}{s}\mathbb{E}[A_i-B_i] \ge \mathbb{E}[P(w^{t-1})-D(\theta^{t-1})]-\frac{G^t}{2\lambda n}.$$
Multiplying both sides by $s/n$ concludes the proof.

4.2.1 Algorithm

According to Lemma 2, to maximize the dual ascent at the $t$-th update, we should choose $s$ and $p$ as the solution of the following optimization:
$$\max_{s/(p_in)\in[0,1],\ p\in\mathbb{R}_+^n,\ \sum_{i=1}^np_i=1}\ \frac{s}{n}\mathbb{E}[P(w^{t-1})-D(\theta^{t-1})] - \frac{s}{2\lambda n^2}G^t.$$
However, because this optimization problem is difficult to solve, we choose to relax it as follows:
$$\max_{s/(p_in)\in[0,1],\ p\in\mathbb{R}_+^n,\ \sum_ip_i=1}\ \frac{s}{n}\mathbb{E}[P(w^{t-1})-D(\theta^{t-1})]-\frac{s}{2\lambda n^2}G^t \ \ge\ \max_{s/(p_in)\in[0,\frac{\lambda n\gamma_i}{R^2+\lambda n\gamma_i}],\ p\in\mathbb{R}_+^n,\ \sum_ip_i=1}\ \frac{s}{n}\mathbb{E}[P(w^{t-1})-D(\theta^{t-1})]-\frac{s}{2\lambda n^2}G^t$$
$$\ge\ \max_{s/(p_in)\in[0,\frac{\lambda n\gamma_i}{R^2+\lambda n\gamma_i}],\ p\in\mathbb{R}_+^n,\ \sum_ip_i=1}\ \frac{s}{n}\mathbb{E}[P(w^{t-1})-D(\theta^{t-1})],$$
where the last inequality uses $G^t\le0$, which holds because $s_i\le\frac{\lambda n\gamma_i}{R^2+\lambda n\gamma_i}$ implies $s_iR^2-\gamma_i(1-s_i)\lambda n\le0$. To optimize the final relaxation, we have the following proposition.

Proposition 2. The solution of the optimization
$$\max_{s,p}\ s \quad\text{s.t.}\quad \frac{s}{p_in}\in\Big[0,\frac{\lambda n\gamma_i}{R^2+\lambda n\gamma_i}\Big],\ \ p\ge0,\ \ \sum_{i=1}^np_i=1,$$
is
$$s = \frac{n}{n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}}, \qquad p_i = \frac{1+\frac{R^2}{\lambda n\gamma_i}}{n+\sum_{j=1}^n\frac{R^2}{\lambda n\gamma_j}}. \qquad (16)$$
We omit the proof of this proposition since it is simple. Given that $\phi_i$ is $(1/\gamma_i)$-smooth for all $i\in\{1,\ldots,n\}$, the sampling distribution should be set as in (16). Although $s$ can be set as in (16), it can also be optimized by maximizing some terms in the analysis, such as $A_i$, the right-hand side of inequality (14), or inequality (15), all of which guarantee a dual ascent $\mathbb{E}[D(\theta^t)-D(\theta^{t-1})]$ no worse than the one obtained by setting $s$ as in (16).

When $\gamma_i=0$, the distribution in equation (16) is not valid. To solve this problem, we combine the facts
$$P(w^{t-1})-D(\theta^{t-1}) \ge D(\theta^*)-D(\theta^{t-1}) := \epsilon_D^{t-1},$$
where $\theta^*$ is the optimal solution of the dual problem $\max_\theta D(\theta)$, and
$$D(\theta^t)-D(\theta^{t-1}) = \epsilon_D^{t-1}-\epsilon_D^t,$$
with inequality (12) to obtain
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{s}{n}\Big)\mathbb{E}[\epsilon_D^{t-1}] + \frac{s}{2\lambda n^2}G^t. \qquad (17)$$

According to this inequality, even when every $\gamma_i=0$, if we further assume that every $\phi_i$ is $L_i$-Lipschitz, then
$$G^t = \frac{1}{n}\sum_{i=1}^n\big(s_iR^2-\gamma_i(1-s_i)\lambda n\big)\mathbb{E}\|u_i^{t-1}-\theta_i^{t-1}\|_D^2 \le \frac{4R^2s}{n^2}\sum_{i=1}^n\frac{1}{p_i}L_i^2, \qquad (18)$$
where we used $s_i = s/(np_i)$, $\|u_i^{t-1}\|\le L_i$, and $\|\theta_i^{t-1}\|\le L_i$, since $-u_i^{t-1},-\theta_i^{t-1}\in\partial\phi_i(w^{t-1})$. Combining the above two inequalities results in
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{s}{n}\Big)\mathbb{E}[\epsilon_D^{t-1}]+\frac{s}{2\lambda n^2}\cdot\frac{4R^2s}{n^2}\sum_{i=1}^n\frac{1}{p_i}L_i^2. \qquad (19)$$
According to the above inequality, to minimize the $t$-th duality gap we should choose the distribution that solves
$$\min_{p\in\mathbb{R}_+^n,\ \sum_{i=1}^np_i=1}\ \sum_{i=1}^n\frac{1}{p_i}L_i^2,$$
for which the optimal distribution is obviously
$$p_i = \frac{L_i}{\sum_{j=1}^nL_j}.$$
Because $s_i = s/(np_i)\in[0,1]$, this distribution furthermore requires
$$s\in\bigcap_{i=1}^n[0,np_i] = \Big[0,\frac{nL_{\min}}{\sum_{j=1}^nL_j}\Big] := [0,\rho],$$
where $L_{\min}=\min\{L_1,L_2,\ldots,L_n\}$ and $\rho\le1$. In summary, prox-SDCA with importance sampling is given in Algorithm 2.

Remark: It is easy to check that the first three options in Algorithm 2 do no worse than Option IV for Lipschitz losses and Option V for smooth losses. Specifically, Option I optimizes $A_i$ directly. Option II only requires choosing $\Delta\theta_{i_t} = s(u^{t-1}-\theta_{i_t}^{t-1})$ and then chooses $s$ to optimize a lower bound of $A_i$, i.e., the right-hand side of inequality (14). Option III is similar to Option II and chooses $s$ to optimize the bound on the right-hand side of (15). As a result, all of the first three options do no worse than choosing the optimal $s$ for Lemma 2. Option IV replaces $\|z\|_D^2$ with its upper bound $4L_{i_t}^2$, so that inequality (19) remains valid. Option V is similar to Options II and III and chooses $s = \frac{n}{n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}}$ as in Proposition 2, so that $G^t\le0$.
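As a small illustration (ours, not part of the paper), the two sampling distributions derived above, namely (16) for smooth losses and the Lipschitz-based one, can be computed from the per-example parameters as follows; `gamma` and `L` are assumed arrays of the smoothness and Lipschitz constants:

    import numpy as np

    def smooth_distribution(gamma, lam, R=1.0):
        # Distribution (16) for (1/gamma_i)-smooth losses, together with
        # the step-size parameter s from Proposition 2.
        n = gamma.shape[0]
        q = 1.0 + R ** 2 / (lam * n * gamma)
        return q / q.sum(), n / q.sum()

    def lipschitz_distribution(L):
        # p_i = L_i / sum_j L_j for L_i-Lipschitz losses; admissible step
        # sizes are s in [0, rho] with rho = n * L_min / sum_j L_j.
        rho = L.shape[0] * L.min() / L.sum()
        return L / L.sum(), rho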

4.2.2 Analysis

In this subsection we analyze the convergence behavior of the proposed algorithm. Before presenting the theoretical results, we make several assumptions without loss of generality:

• for the loss functions: $\phi_i(0)\le1$ and $\phi_i(w)\ge0$ for all $w$; and
• for the regularizer: $r(0)=0$ and $r(w)\ge0$ for all $w$.

Under the above assumptions, we have the following theorem for the expected duality gap $\mathbb{E}[P(w^T)-D(\theta^T)]$.

Theorem 4. Assume $\phi_i$ is $(1/\gamma_i)$-smooth for all $i\in\{1,\ldots,n\}$ and set $p_i = \big(1+\frac{R^2}{\lambda n\gamma_i}\big)\big/\big(n+\sum_{j=1}^n\frac{R^2}{\lambda n\gamma_j}\big)$ for all $i\in\{1,\ldots,n\}$. To obtain an expected duality gap $\mathbb{E}[P(w^T)-D(\theta^T)]\le\epsilon_P$ for the proposed proximal SDCA with importance sampling, it suffices to have a total number of iterations of
$$T \ge \Big(n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}\Big)\log\Big(\Big(n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}\Big)\frac{1}{\epsilon_P}\Big).$$

Algorithm 2 Proximal Stochastic Dual Coordinate Ascent with Importance Sampling (Iprox-SDCA)
Input: $\lambda>0$, $R=\sup_{u\ne0}\|u\|_{D'}/\|u\|_D$, norms $\|\cdot\|_D$, $\|\cdot\|_{D'}$, and $\gamma_1,\ldots,\gamma_n>0$ or $L_1,\ldots,L_n\ge0$.
Initialize: $\theta_i^0=0$, $w^0=\nabla r^*(0)$, and $p_i = \frac{1+\frac{R^2}{\lambda n\gamma_i}}{n+\sum_{j=1}^n\frac{R^2}{\lambda n\gamma_j}}$ or $p_i=\frac{L_i}{\sum_{j=1}^nL_j}$, $\forall i\in\{1,\ldots,n\}$.
for t = 1, . . . , T do
    Sample $i_t$ from $\{1,\ldots,n\}$ based on $p$;
    Calculate $\Delta\theta_{i_t}^{t-1}$ using any of the following options (or any option achieving a larger dual objective):
    Option I:
        $\Delta\theta_{i_t}^{t-1} = \arg\max_{\Delta\theta_{i_t}}\big\{-\phi_{i_t}^*(-(\theta_{i_t}^{t-1}+\Delta\theta_{i_t})) - (w^{t-1})^\top\Delta\theta_{i_t} - \frac{1}{2\lambda n}\|\Delta\theta_{i_t}\|_{D'}^2\big\}$;
    Option II:
        Let $u$ be such that $-u\in\partial\phi_{i_t}(w^{t-1})$, and let $z = u-\theta_{i_t}^{t-1}$;
        Let $s_{i_t} = \arg\max_{s\in[0,1]}\big\{-\phi_{i_t}^*(-(\theta_{i_t}^{t-1}+sz)) - s(w^{t-1})^\top z - \frac{s^2}{2\lambda n}\|z\|_{D'}^2\big\}$;
        Set $\Delta\theta_{i_t}^{t-1} = s_{i_t}z$;
    Option III: Same as Option II, but with $s_{i_t}$ defined as
        $s_{i_t} = \frac{\phi_{i_t}(w^{t-1})+\phi_{i_t}^*(-\theta_{i_t}^{t-1})+(w^{t-1})^\top\theta_{i_t}^{t-1}+\frac{\gamma_{i_t}}{2}\|z\|_D^2}{\|z\|_D^2\,\big(\gamma_{i_t}+R^2/(\lambda n)\big)}$;
    Option IV (only for Lipschitz losses): Same as Option III, but with $\|z\|_D^2$ replaced by $4L_{i_t}^2$;
    Option V (only for smooth losses):
        $\Delta\theta_{i_t}^{t-1} = \frac{n}{n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}}\big(-\nabla\phi_{i_t}(w^{t-1})-\theta_{i_t}^{t-1}\big)$;
    $\theta_{i_t}^t = \theta_{i_t}^{t-1}+\Delta\theta_{i_t}^{t-1}$;
    $v^t = v^{t-1}+\frac{1}{\lambda n}\Delta\theta_{i_t}^{t-1}$;
    $w^t = \nabla r^*(v^t)$;
end for
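For intuition, here is a skeletal driver for Algorithm 2 (a sketch of ours, not the reference implementation). It assumes $r(w)=\frac{1}{2}\|w\|^2$, so that $\nabla r^*(v)=v$ and $w=v$, and delegates the coordinate update to a callback `dual_step` implementing any of Options I-V (for instance, the closed-form steps given in Section 5):

    import numpy as np

    def iprox_sdca(dual_step, p, n, d, lam, T, seed=0):
        # Skeleton of Algorithm 2 for r(w) = 0.5 * ||w||^2 (so w = v).
        rng = np.random.default_rng(seed)
        theta = np.zeros((n, d))
        w = np.zeros(d)
        for t in range(T):
            i = rng.choice(n, p=p)
            delta = dual_step(i, w, theta[i])  # Delta theta_i, Options I-V
            theta[i] += delta
            w += delta / (lam * n)             # v^t = v^{t-1} + delta/(lam n)
        return w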

Proof. Given the distribution $p$ and step size $s$ in equation (16), since $\phi_i$ is $(1/\gamma_i)$-smooth we have $s_iR^2-\gamma_i(1-s_i)\lambda n\le0$, and hence $G^t\le0$ for all $t$. By Lemma 2, this yields
$$\mathbb{E}[D(\theta^t)-D(\theta^{t-1})] \ge \frac{s}{n}\mathbb{E}[P(w^{t-1})-D(\theta^{t-1})]. \qquad (20)$$
Furthermore, since $P(w^{t-1})-D(\theta^{t-1})\ge D(\theta^*)-D(\theta^{t-1}):=\epsilon_D^{t-1}$, where $\theta^*$ is the optimal solution of the dual problem, and $D(\theta^t)-D(\theta^{t-1})=\epsilon_D^{t-1}-\epsilon_D^t$, we obtain
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{s}{n}\Big)\mathbb{E}[\epsilon_D^{t-1}] \le \Big(1-\frac{s}{n}\Big)^t\mathbb{E}[\epsilon_D^0]. \qquad (21)$$
In addition, since $P(0)=\frac{1}{n}\sum_{i=1}^n\phi_i(0)+\lambda r(0)\le1$ and
$$D(0) = \frac{1}{n}\sum_{i=1}^n-\phi_i^*(0)-\lambda r^*(0) = \frac{1}{n}\sum_{i=1}^n-\max_{u_i}\big(0-\phi_i(u_i)\big)-\lambda\max_u\big(0-r(u)\big) = \frac{1}{n}\sum_{i=1}^n\min_{u_i}\phi_i(u_i)+\lambda\min_ur(u) \ge 0,$$
we have $\epsilon_D^0\le P(0)-D(0)\le1$. Combining this with inequality (21), we obtain
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{s}{n}\Big)^t \le \exp\Big(-\frac{st}{n}\Big) = \exp\Big(-\frac{t}{n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}}\Big).$$
According to the above inequality, by setting
$$t \ge \Big(n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}\Big)\log\Big(\frac{1}{\epsilon_D}\Big),$$
the proposed algorithm achieves $\mathbb{E}[\epsilon_D^t]\le\epsilon_D$. Furthermore, according to inequality (20),
$$\mathbb{E}[P(w^t)-D(\theta^t)] \le \frac{n}{s}\mathbb{E}[\epsilon_D^t-\epsilon_D^{t+1}] \le \frac{n}{s}\mathbb{E}[\epsilon_D^t],$$
so by setting
$$t \ge \Big(n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}\Big)\log\Big(\Big(n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}\Big)\frac{1}{\epsilon_P}\Big)$$
we obtain $\mathbb{E}[\epsilon_D^t]\le\frac{s}{n}\epsilon_P$ and hence $\mathbb{E}[P(w^t)-D(\theta^t)]\le\epsilon_P$.

Remark: If we adopt uniform sampling, i.e., $p_i=1/n$ for all $i$, then we have to use the same $\gamma$ for all $\phi_i$, which should be $\gamma_{\min}=\min\{\gamma_1,\ldots,\gamma_n\}$ according to the analysis. Replacing each $\gamma_i$ with $\gamma_{\min}$, this theorem recovers the conclusion of Theorem 1 of [13], i.e., $T\ge\big(n+\frac{R^2}{\lambda\gamma_{\min}}\big)\log\big(\big(n+\frac{R^2}{\lambda\gamma_{\min}}\big)\frac{1}{\epsilon_P}\big)$. However,
$$\frac{n+\frac{R^2}{\lambda\gamma_{\min}}}{n+\sum_{i=1}^n\frac{R^2}{\lambda n\gamma_i}} = \frac{n\lambda\gamma_{\min}+R^2}{n\lambda\gamma_{\min}+\frac{R^2}{n}\sum_{i=1}^n\frac{\gamma_{\min}}{\gamma_i}} \ge 1,$$
which implies that importance sampling does improve convergence, especially when $\sum_{i=1}^n\frac{\gamma_{\min}}{\gamma_i}\ll n$.

For non-smooth loss functions, the convergence rate of proximal SDCA with importance sampling is given below.

Theorem 5. Consider the proposed proximal SDCA with importance sampling. Assume that $\phi_i$ is $L_i$-Lipschitz and set $p_i = L_i/\sum_{j=1}^nL_j$ for all $i\in\{1,\ldots,n\}$. To obtain an expected duality gap of $\mathbb{E}[P(\bar w)-D(\bar\theta)]\le\epsilon_P$, where $\bar w = \frac{1}{T-T_0}\sum_{t=T_0+1}^Tw^{t-1}$ and $\bar\theta = \frac{1}{T-T_0}\sum_{t=T_0+1}^T\theta^{t-1}$, it suffices to have a total number of iterations of
$$T \ge \max\Big(0,\ 2\Big\lceil\frac{n}{\rho}\log\Big(\frac{\lambda n}{\rho\,2R^2(\sum_{i=1}^nL_i)^2/n^2}\Big)\Big\rceil\Big) - \frac{n}{\rho} + \frac{20R^2\big(\sum_{i=1}^nL_i\big)^2}{n^2\lambda\epsilon_P},$$
where $\rho = \frac{nL_{\min}}{\sum_{i=1}^nL_i}$. Moreover, when
$$t \ge \max\Big(0,\ 2\Big\lceil\frac{n}{\rho}\log\Big(\frac{\lambda n}{\rho\,2R^2(\sum_{i=1}^nL_i)^2/n^2}\Big)\Big\rceil\Big) - \frac{2n}{\rho} + \frac{16R^2\big(\sum_{i=1}^nL_i\big)^2}{n^2\lambda\epsilon_P},$$
we have the dual sub-optimality bound $\mathbb{E}[D(\theta^*)-D(\theta^t)]\le\epsilon_P/2$.

Proof. Since $p_i = L_i/\sum_{j=1}^nL_j$, inequality (18) implies $G^t\le G$ where $G = 4R^2\big(\sum_{i=1}^nL_i\big)^2/n^2$. Lemma 2, with $\gamma_i=0$, tells us that

$$\mathbb{E}[D(\theta^t)-D(\theta^{t-1})] \ge \frac{s}{n}\mathbb{E}[P(w^{t-1})-D(\theta^{t-1})] - \Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda}, \qquad (22)$$
for all $s\in[0,\rho]$, which further indicates
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{s}{n}\Big)\mathbb{E}[\epsilon_D^{t-1}]+\Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda}$$
for all $s\in[0,\rho]$. We next show that the above yields
$$\mathbb{E}[\epsilon_D^t] \le \frac{2G}{\lambda(2n/\rho+t-t_0)} \qquad (23)$$
for all $t\ge t_0 = \max\big(0,\big\lceil\frac{n}{\rho}\log\big(\frac{2\lambda n}{\rho G}\big)\big\rceil\big)$. Indeed, choosing $s=\rho$, at $t=t_0$ we have
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{\rho}{n}\Big)^t\epsilon_D^0 + \frac{\rho^2G}{2\lambda n^2}\cdot\frac{1}{1-(1-\rho/n)} \le \exp(-\rho t/n)+\frac{\rho G}{2\lambda n} \le \frac{G}{\lambda n/\rho}.$$
This implies that inequality (23) holds at $t=t_0$. For $t>t_0$ we use an inductive argument. Suppose the claim holds for $t-1$; then
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{s}{n}\Big)\mathbb{E}[\epsilon_D^{t-1}]+\Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda} \le \Big(1-\frac{s}{n}\Big)\frac{2G}{\lambda(2n/\rho+t-1-t_0)}+\Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda}.$$
Choosing $s = \frac{2n}{2n/\rho+t-1-t_0}\in[0,\rho]$ yields
$$\mathbb{E}[\epsilon_D^t] \le \Big(1-\frac{2}{2n/\rho+t-1-t_0}\Big)\frac{2G}{\lambda(2n/\rho+t-1-t_0)}+\Big(\frac{2}{2n/\rho+t-1-t_0}\Big)^2\frac{G}{2\lambda}$$
$$= \frac{2G}{\lambda(2n/\rho+t-1-t_0)}\Big(1-\frac{1}{2n/\rho+t-1-t_0}\Big) = \frac{2G}{\lambda(2n/\rho+t-1-t_0)}\cdot\frac{2n/\rho+t-1-t_0-1}{2n/\rho+t-1-t_0} \le \frac{2G}{\lambda(2n/\rho+t-t_0)}.$$
This provides the bound on the dual sub-optimality. We next turn to bounding the duality gap. Summing inequality (22) over $t = T_0+1,\ldots,T$ and rearranging terms, we obtain
$$\mathbb{E}\Big[\frac{1}{T-T_0}\sum_{t=T_0+1}^T\big(P(w^{t-1})-D(\theta^{t-1})\big)\Big] \le \frac{n}{s(T-T_0)}\mathbb{E}[D(\theta^T)-D(\theta^{T_0})]+\frac{sG}{2\lambda n}.$$
Now if we choose $\bar w,\bar\theta$ to be either the average vectors or vectors chosen uniformly at random over $t\in\{T_0+1,\ldots,T\}$, the above implies
$$\mathbb{E}[P(\bar w)-D(\bar\theta)] \le \frac{n}{s(T-T_0)}\mathbb{E}[D(\theta^T)-D(\theta^{T_0})]+\frac{sG}{2\lambda n}.$$
If $T\ge n/\rho+T_0$ and $T_0\ge t_0$, we can set $s = n/(T-T_0)\le\rho$, and combining with (23) we obtain
$$\mathbb{E}[P(\bar w)-D(\bar\theta)] \le \mathbb{E}[D(\theta^*)-D(\theta^{T_0})]+\frac{G}{2\lambda(T-T_0)} \le \frac{2G}{\lambda(2n/\rho+T_0-t_0)}+\frac{G}{2\lambda(T-T_0)}.$$
A sufficient condition for the above to be smaller than $\epsilon_P$ is $T_0\ge\frac{4G}{\lambda\epsilon_P}-\frac{2n}{\rho}+t_0$ and $T\ge T_0+\frac{G}{\lambda\epsilon_P}$; it also implies $\mathbb{E}[D(\theta^*)-D(\theta^{T_0})]\le\epsilon_P/2$. Since we also need $T_0\ge t_0$ and $T-T_0\ge n/\rho$, the overall number of required iterations satisfies
$$T_0 \ge \max\{t_0,\ 4G/(\lambda\epsilon_P)-2n/\rho+t_0\}, \qquad T-T_0 \ge \max\{n/\rho,\ G/(\lambda\epsilon_P)\}.$$
Using the fact that $a+b\ge\max(a,b)$ concludes the proof of this theorem.

Remark: When we replace every $L_i$ with $L_{\max}=\max\{L_1,\ldots,L_n\}$, the above theorem remains valid and the sampling distribution becomes the uniform distribution. In this case we recover Theorem 2 of [13], i.e., $T\ge\max\big(0,\ 2\big\lceil n\log\big(\frac{\lambda n}{2R^2L_{\max}^2}\big)\big\rceil\big) - n + \frac{20R^2L_{\max}^2}{\lambda\epsilon_P}$. However, the ratio between the leading terms is
$$\frac{L_{\max}^2}{\big(\sum_{i=1}^nL_i\big)^2/n^2} = \Big(\frac{n}{\sum_{i=1}^nL_i/L_{\max}}\Big)^2 \ge 1,$$
which again implies that the importance sampling strategy improves convergence, especially when $\big(\sum_{i=1}^n\frac{L_i}{L_{\max}}\big)^2\ll n^2$.

5 Applications

There are numerous possible applications of our proposed algorithms; here we list several popular ones.

5.1 Hinge Loss Based SVM with ℓ2 Regularization

Suppose our task is to solve the typical Support Vector Machine (SVM) problem:
$$\min_w\ \frac{1}{n}\sum_{i=1}^n\big[1-y_iw^\top x_i\big]_+ + \frac{\lambda}{2}\|w\|^2.$$
Assume that $X=\max_i\|x_i\|$ is not too large, such as for text categorization problems where each $x_i$ is a bag-of-words representation of some short document. To solve this problem we can use proximal SGD or proximal SDCA with importance sampling in two different ways, as follows.

5.1.1 Proximal SGD with Importance Sampling

Case 1: We can set the regularizer as $r(w)=\frac{1}{2}\|w\|^2$ and the loss function as $\phi_i(w)=[1-y_iw^\top x_i]_+$, so that
$$\mathrm{prox}_{\lambda r}(x) = \arg\min_w\Big\{\frac{\lambda}{2}\|w\|^2+\frac{1}{2}\|w-x\|^2\Big\} = \frac{x}{\lambda+1},$$
and $\nabla\phi_i(w) = -\mathrm{sign}([1-y_iw^\top x_i]_+)\,y_ix_i$. Because $\|\nabla\phi_i(w)\|\le\|x_i\|$, according to our analysis we can set
$$p_i = \frac{\|x_i\|}{\sum_{j=1}^n\|x_j\|}.$$

Case 2: We can set the regularizer as $r(w)=0$ and the loss function as $\phi_i(w)=[1-y_iw^\top x_i]_+ + \frac{\lambda}{2}\|w\|^2$, which is $\lambda$-strongly convex, so that
$$\mathrm{prox}_{\lambda r}(x) = \arg\min_w\Big\{0+\frac{1}{2}\|w-x\|^2\Big\} = x,$$
and $\nabla\phi_i(w) = -\mathrm{sign}([1-y_iw^\top x_i]_+)\,y_ix_i + \lambda w$. Using $P(w^*)=D(\theta^*)$, we find that the optimal solution of the SVM satisfies $\|w^*\|\le 1/\sqrt{\lambda}$, so we can project the iterative solutions onto $\{w\in\mathbb{R}^d\,|\,\|w\|\le1/\sqrt{\lambda}\}$ in Euclidean distance while the theoretical analysis remains valid. In this way, we have $\|\nabla\phi_i(w)\|\le\|x_i\|+\sqrt{\lambda}$, and according to our analysis we should set
$$p_i = \frac{\|x_i\|+\sqrt{\lambda}}{\sum_{j=1}^n\big(\|x_j\|+\sqrt{\lambda}\big)}.$$

5.1.2 Proximal SDCA with Importance Sampling

We can set $r(w)=\frac{1}{2}\|w\|^2$ and the loss function as $\phi_i(w)=[1-y_iw^\top x_i]_+$, which is $\|x_i\|$-Lipschitz. According to our analysis, the distribution should be
$$p_i = \frac{\|x_i\|}{\sum_{j=1}^n\|x_j\|}.$$
Furthermore, it is easy to get the dual function of $\phi_i$:
$$\phi_i^*(-\theta) = \begin{cases}-\alpha & \theta=\alpha y_ix_i,\ \alpha\in[0,1],\\ \infty & \text{otherwise.}\end{cases}$$
Given the dual function and writing $\theta_i = \alpha_iy_ix_i$, Option I of Algorithm 2 produces a closed-form solution:
$$\Delta\theta_i = \max\Big(-\alpha_i,\ \min\Big(1-\alpha_i,\ \frac{1-y_ix_i^\top w^{t-1}}{\|x_i\|^2/(\lambda n)}\Big)\Big)\,y_ix_i.$$
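In code, this closed-form step is a one-line clipped update on the scalar $\alpha_i$; the following sketch (ours, with $\theta_i=\alpha_iy_ix_i$ stored implicitly through the array `alpha`) could serve as the `dual_step` callback in the driver of Section 4.2.1:

    import numpy as np

    def hinge_dual_step(i, w, alpha, X, y, lam, n):
        # Closed-form Option I update for the hinge loss (Section 5.1.2).
        margin = 1.0 - y[i] * (X[i] @ w)
        step = margin / ((X[i] @ X[i]) / (lam * n))
        d_alpha = max(-alpha[i], min(1.0 - alpha[i], step))
        alpha[i] += d_alpha
        w += d_alpha * y[i] * X[i] / (lam * n)  # w = v for r(w)=||w||^2/2
        return w, alpha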

5.2 Squared Hinge Loss Based SVM with ℓ2 Regularization

Suppose our interest is the task of optimizing the squared hinge loss based Support Vector Machine (SVM) with $\ell_2$ regularization:
$$\min_w\ \frac{1}{n}\sum_{i=1}^n\big[1-y_iw^\top x_i\big]_+^2 + \frac{\lambda}{2}\|w\|^2.$$

5.2.1 Proximal SGD with Importance Sampling

First, using $P(w^*)=D(\theta^*)$, we can get $\|w^*\|\le1/\sqrt{\lambda}$, so we can project the iterative solutions onto $\{w\in\mathbb{R}^d\,|\,\|w\|\le1/\sqrt{\lambda}\}$ in Euclidean distance while the previous theoretical analysis remains valid.

Case 1: If we set $r(w)=\frac{1}{2}\|w\|^2$ and $\phi_i(w)=[1-y_iw^\top x_i]_+^2$, then $\mathrm{prox}_{\lambda r}(x)=x/(\lambda+1)$ and
$$\nabla\phi_i(w) = -2[1-y_iw^\top x_i]_+\,y_ix_i.$$
Because $\|\nabla\phi_i(w)\|\le 2(1+\|x_i\|/\sqrt{\lambda})\|x_i\|$, according to the previous analysis the optimal distribution for this case should be
$$p_i = \frac{(1+\|x_i\|/\sqrt{\lambda})\|x_i\|}{\sum_{j=1}^n(1+\|x_j\|/\sqrt{\lambda})\|x_j\|}.$$

Case 2: If we set $r(w)=0$ and $\phi_i(w)=[1-y_iw^\top x_i]_+^2+\frac{\lambda}{2}\|w\|^2$, then $\mathrm{prox}_{\lambda r}(x)=x$ and
$$\nabla\phi_i(w) = -2[1-y_iw^\top x_i]_+\,y_ix_i + \lambda w.$$
Because $\|\nabla\phi_i(w)\|\le 2(1+\|x_i\|/\sqrt{\lambda})\|x_i\|+\sqrt{\lambda}$, according to the previous analysis the optimal distribution for this case should be
$$p_i = \frac{2(1+\|x_i\|/\sqrt{\lambda})\|x_i\|+\sqrt{\lambda}}{\sum_{j=1}^n\big[2(1+\|x_j\|/\sqrt{\lambda})\|x_j\|+\sqrt{\lambda}\big]}.$$
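Since the experiments in Section 6 use Case 2, the following small helper (ours, an illustrative sketch) computes its sampling distribution directly from the data matrix `X`:

    import numpy as np

    def sq_hinge_case2_distribution(X, lam):
        # G_i = 2 (1 + ||x_i||/sqrt(lam)) ||x_i|| + sqrt(lam), the Case 2
        # gradient-norm bound; p_i = G_i / sum_j G_j.
        norms = np.linalg.norm(X, axis=1)
        G = 2.0 * (1.0 + norms / np.sqrt(lam)) * norms + np.sqrt(lam)
        return G / G.sum()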

5.2.2 Proximal SDCA with Importance Sampling

If we set $r(w)=\frac{1}{2}\|w\|^2$, which is 1-strongly convex with respect to $\|\cdot\|_{P'}=\|\cdot\|$, then $\phi_i(w)=[1-y_iw^\top x_i]_+^2$ is $(2\|x_i\|^2)$-smooth with respect to $\|\cdot\|_P=\|\cdot\|$. As a result, the optimal distribution for proximal SDCA with importance sampling should be
$$p_i = \Big(1+\frac{2\|x_i\|^2}{\lambda n}\Big)\Big/\Big(n+\sum_{j=1}^n\frac{2\|x_j\|^2}{\lambda n}\Big),$$
where we used the fact $R=\sup_{u\ne0}\|u\|_{D'}/\|u\|_D=1$. It can be derived that the dual function of $\phi_i(\cdot)$ is
$$\phi_i^*(-\theta) = \begin{cases}-\alpha+\alpha^2/4 & \theta=\alpha y_ix_i,\ \alpha\ge0,\\ \infty & \text{otherwise.}\end{cases}$$
Plugging the above equation into the update, we can observe that Option I of Algorithm 2 has the following closed-form solution:
$$\Delta\theta_i = \max\Big(\frac{1-y_iw^\top x_i-\alpha_i/2}{1/2+\|x_i\|^2/(\lambda n)},\ -\alpha_i\Big)\,y_ix_i.$$
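A sketch (ours) of this closed-form update, in the same style as the hinge-loss step above:

    import numpy as np

    def sq_hinge_dual_step(i, w, alpha, X, y, lam, n):
        # Closed-form Option I update for the squared hinge loss (Sec. 5.2.2).
        margin = 1.0 - y[i] * (X[i] @ w)
        d_alpha = max((margin - alpha[i] / 2.0)
                      / (0.5 + (X[i] @ X[i]) / (lam * n)),
                      -alpha[i])
        alpha[i] += d_alpha
        w += d_alpha * y[i] * X[i] / (lam * n)
        return w, alpha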

5.3 Squared Hinge Loss Based SVM with ℓ1 Regularization

Suppose our interest is the task of optimizing the squared hinge loss based Support Vector Machine (SVM) with $\ell_1$ regularization:
$$\min_w\ \frac{1}{n}\sum_{i=1}^n\big[1-y_iw^\top x_i\big]_+^2 + \lambda\|w\|_1,$$
where the $\ell_1$ regularization is introduced to make the optimal model sparse, which can alleviate the curse of dimensionality and improve model interpretability.

5.3.1 Proximal SGD with Importance Sampling

Using the inequality $P(w^*)\le P(0)$, we obtain $\|w^*\|\le\|w^*\|_1\le1/\lambda$, so we can project the iterative solutions onto $\{w\in\mathbb{R}^d\,|\,\|w\|\le1/\lambda\}$ in Euclidean distance while the previous analysis remains valid. If we set the regularizer as $r(w)=\|w\|_1$, then the loss function is $\phi_i(w)=[1-y_iw^\top x_i]_+^2$, the proximal mapping is $\mathrm{prox}_{\lambda r}(x)=\mathrm{sign}(x)\odot[|x|-\lambda]_+$, and
$$\nabla\phi_i(w) = -2[1-y_iw^\top x_i]_+\,y_ix_i.$$
Because $\|\nabla\phi_i(w)\|\le2(1+\|x_i\|/\lambda)\|x_i\|$, according to the previous analysis we should set
$$p_i = \frac{(1+\|x_i\|/\lambda)\|x_i\|}{\sum_{j=1}^n(1+\|x_j\|/\lambda)\|x_j\|}.$$

5.3.2 Proximal SDCA with Importance Sampling

Let $w^*$ be the optimal solution, which satisfies $\|w^*\|\le1/\lambda$. Choose $\delta=\lambda^2\epsilon$ and
$$r(w) = \frac{1}{2}\|w\|^2 + \frac{\lambda}{\delta}\|w\|_1,$$
which is 1-strongly convex with respect to $\|\cdot\|_{P'}=\|\cdot\|$, and consider the problem
$$\min_w\ \widehat{P}(w) = \frac{1}{n}\sum_{i=1}^n\phi_i(w) + \delta r(w).$$
If $w$ is an $\epsilon/2$-approximate solution of this problem, then
$$\frac{1}{n}\sum_{i=1}^n\phi_i(w)+\lambda\|w\|_1 \le \widehat{P}(w) \le \widehat{P}(w^*)+\frac{\epsilon}{2} \le \frac{1}{n}\sum_{i=1}^n\phi_i(w^*)+\lambda\|w^*\|_1+\epsilon,$$
which implies that $w$ is an $\epsilon$-approximate solution of the original optimization problem. When we adopt proximal SDCA with importance sampling to minimize $\widehat{P}(w)$, we have $\phi_i(w)=[1-y_iw^\top x_i]_+^2$, which is $(2\|x_i\|^2)$-smooth with respect to $\|\cdot\|_P=\|\cdot\|$. As a result, the optimal distribution for proximal SDCA with importance sampling should be
$$p_i = \Big(1+\frac{2\|x_i\|^2}{\lambda n}\Big)\Big/\Big(n+\sum_{j=1}^n\frac{2\|x_j\|^2}{\lambda n}\Big).$$
Plugging $\phi_i^*(-\theta)=-\alpha+\alpha^2/4$ for $\theta=\alpha y_ix_i$, $\alpha\ge0$ into the algorithm, we can observe that Option I again produces the closed-form solution
$$\Delta\theta_i = \max\Big(\frac{1-y_iw^\top x_i-\alpha_i/2}{1/2+\|x_i\|^2/(\lambda n)},\ -\alpha_i\Big)\,y_ix_i.$$
Finally, it is easy to verify that
$$\nabla r^*(v) = \mathrm{sign}(v)\odot\Big[|v|-\frac{\lambda}{\delta}\Big]_+.$$
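This map is again a soft-thresholding; a one-line sketch (ours) of the primal-from-dual map used in this setting:

    import numpy as np

    def grad_r_star(v, lam, delta):
        # grad r*(v) = sign(v) * [|v| - lam/delta]_+ for
        # r(w) = 0.5 ||w||^2 + (lam/delta) ||w||_1 (Section 5.3.2).
        return np.sign(v) * np.maximum(np.abs(v) - lam / delta, 0.0)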

6 Experimental Results

In this section, we evaluate the empirical performance of the proposed algorithms.

6.1 Experimental Testbed and Setup

To compare our algorithms with their traditional versions without importance sampling, we focus on the task of optimizing the squared hinge loss based SVM with $\ell_2$ regularization:
$$\min_w\ \frac{1}{n}\sum_{i=1}^n\big[1-y_iw^\top x_i\big]_+^2 + \frac{\lambda}{2}\|w\|^2.$$
We compared our Iprox-SGD with traditional prox-SGD (in fact, Pegasos [16]) and our Iprox-SDCA with prox-SDCA (in fact, SDCA [12]). For Iprox-SGD we adopt the method of Case 2 in Subsection 5.2.1, while for Iprox-SDCA we adopt the method of Subsection 5.2.2.

To evaluate the performance of our algorithms, the experiments were performed on several real-world datasets, chosen fairly randomly in order to cover various dataset characteristics. All the datasets can be downloaded from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/); their characteristics are provided in Table 1.

Table 1: Datasets used in the experiments.

    Dataset             Size        Features
    ijcnn1              49990       22
    kdd2010(algebra)    8407752     20216830
    w8a                 49749       300

To make a fair comparison, all algorithms adopt the same setup in our experiments. In particular, the regularization parameter $\lambda$ of the SVM is set to $10^{-4}$, $10^{-6}$, and $10^{-4}$ for ijcnn1, kdd2010(algebra), and w8a, respectively. For prox-SGD and Iprox-SGD, the step size is set to $\eta_t=1/(\lambda t)$ for all datasets. All experiments were conducted with 5 fixed random seeds per dataset, and all results are reported as averages over these 5 runs. We evaluate learning performance by convergence behavior.

6.2 Evaluation on Iprox-SGD

Figure 1 shows how the primal objective value varies over the learning process for prox-SGD and Iprox-SGD.

[Figure 1: Comparison between Pegasos and Iprox-SGD on several datasets: (a) primal objective value on ijcnn1; (b) primal objective value on kdd2010(algebra); (c) primal objective value on w8a. The epoch on the horizontal axis is the number of iterations divided by the dataset size.]

Several observations can be drawn from the experimental results. First, on the last two datasets the proposed Iprox-SGD algorithm achieved the fastest convergence, which implies that the proposed importance sampling does select more informative stochastic gradients during the learning process on these two datasets. Second, on the first dataset Iprox-SGD achieved a convergence rate comparable with traditional prox-SGD, which indicates that Iprox-SGD may degenerate to traditional prox-SGD when the variance across the training dataset is very small.

6.3 Evaluation on Iprox-SDCA

Figure 2 shows how the duality gap varies over the learning process for prox-SDCA and Iprox-SDCA.

[Figure 2: Comparison between SDCA and Iprox-SDCA on several datasets: (a) duality gap on ijcnn1; (b) duality gap on kdd2010(algebra); (c) duality gap on w8a. The epoch on the horizontal axis is the number of iterations divided by the dataset size.]

We have several observations from these empirical results. First, on all the datasets the proposed Iprox-SDCA algorithm converged much faster than standard SDCA, which indicates that the proposed importance sampling strategy makes duality gap minimization more efficient during the learning process. Second, on the dataset “w8a” SDCA achieved smaller duality gap values for the first several epochs, which, however, is likely due to statistical variance.

7 Conclusion

This paper studied stochastic optimization with importance sampling, which reduces the stochastic variance. Specifically, we considered importance sampling strategies for Proximal Stochastic Gradient Descent and for Proximal Stochastic Dual Coordinate Ascent. For prox-SGD with importance sampling, our analysis showed that in order to reduce the variance, the sampling distribution should depend on the norms of the gradients of the loss functions; for prox-SDCA with importance sampling, our analysis showed that the sampling distribution should rely on the smoothness parameters or Lipschitz constants of the loss functions. Compared with the traditional prox-SGD and prox-SDCA methods with uniform sampling, we showed that the proposed importance sampling methods can significantly improve the convergence rate of optimization under suitable conditions. Experiments confirm our theoretical analysis.

References

[1] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004.
[2] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[3] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 71-79, 2013.
[4] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. The Journal of Machine Learning Research, 10:2899-2934, 2009.
[5] Zhi-Quan Luo and Paul Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.
[6] Olvi L. Mangasarian and David R. Musicant. Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10(5):1032-1037, 1999.
[7] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408-415. ACM, 2008.
[8] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, 12:1865-1892, 2011.
[9] Simon Lacoste-Julien, Martin Jaggi, Mark W. Schmidt, and Patrick Pletscher. Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. CoRR, abs/1207.4747, 2012.
[10] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[11] Shai Shalev-Shwartz and Tong Zhang. Proximal stochastic dual coordinate ascent. CoRR, abs/1211.2717, 2012.
[12] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, pages 567-599, 2013.
[13] Shai Shalev-Shwartz and Tong Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint arXiv:1211.2717, 2012.
[14] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14-26, 2010.
[15] Harold J. Kushner and George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer, 2003.
[16] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pages 807-814, 2007.
[17] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1-38, 2012.
[18] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[19] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. arXiv preprint arXiv:1305.1922, 2013.
[20] Jonathan Borwein and Adrian Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples, volume 3. Springer, 2006.
