
Stochastic Subgradient Algorithms for Strongly Convex Optimization over Distributed Networks

arXiv:1409.8277v2 [cs.NA] 31 Aug 2015

N. Denizcan Vanli, Muhammed O. Sayin, and Suleyman S. Kozat

Abstract—We study diffusion and consensus based optimization of a sum of unknown convex objective functions over distributed networks. The only access to these functions is through stochastic gradient oracles, each of which is available at only one node, and a limited number of gradient oracle calls is allowed at each node. In this framework, we introduce a convex optimization algorithm based on stochastic gradient descent (SGD) updates. In particular, we use a carefully designed time-dependent weighted averaging of the SGD iterates, which yields a convergence rate of O(N√N/T) after T gradient updates for each node on a network of N nodes. We then show that after T gradient oracle calls, the average SGD iterate achieves a mean square deviation (MSD) of O(√N/T). This rate of convergence is optimal as it matches the performance lower bound up to constant terms. As with the SGD algorithm, the computational complexity of the proposed algorithm scales linearly with the dimensionality of the data, and its communication load is the same as that of the SGD algorithm. Thus, the proposed algorithm is highly efficient in terms of complexity and communication load. We illustrate the merits of the algorithm with respect to the state-of-the-art methods over benchmark real life data sets and widely studied network topologies.

Index Terms—Distributed processing, convex optimization, online learning, diffusion strategies, consensus strategies.

I. INTRODUCTION

The demand for large-scale networks consisting of multiple agents (i.e., nodes) with different objectives is steadily growing due to their increased efficiency and scalability compared to centralized structures [1]–[4]. A wide range of problems in the context of distributed and parallel processing can be considered as the minimization of a sum of objective functions, where each function (or information on each function) is available only to a single agent or node [5]–[7]. In such practical applications, it is essential to process the information in a decentralized manner, since transferring the objective functions as well as the entire resources (e.g., data) may not be feasible or possible [8]–[11]. For example, in a distributed data mining scenario, privacy considerations may prohibit sharing of the objective functions [5]–[7]. Similarly, in a distributed wireless network, energy considerations may limit the communication rate between agents [12]–[15]. In such settings, parallel or distributed processing algorithms, where each node performs its own processing and subsequently shares information, are preferable over centralized methods [16]–[19].

Here, we consider the minimization of a sum of unknown convex objective functions, where each agent (or node) observes only its particular objective function via a stochastic gradient oracle. In particular, we seek to minimize this sum of functions with a limited number of gradient oracle calls at each agent. In this framework, we introduce a distributed online convex optimization algorithm based on the SGD iterates that efficiently minimizes this cost function. Specifically, each agent uses a time-dependent weighted combination of the SGD iterates and achieves the presented performance guarantees, which match the lower bounds presented in [20] up to a relatively small excess term caused by the unknown network model. The proposed method is comprehensive, in that any communication strategy, such as the diffusion [1] and consensus [4] strategies, can be incorporated into our algorithm in a straightforward manner, as shown in the paper. We compare the performance of our algorithm with respect to the state-of-the-art methods [4], [9], [21] in the literature and present significant performance improvements for various well-known network topologies and benchmark data sets.

Distributed networks have been successfully used in wireless sensor networks [22]–[27], and have recently been used for convex optimization via projected subgradient techniques [4]–[9]. In [9], the authors illustrate the performance of the least mean squares (LMS) algorithm over distributed networks using different diffusion strategies.

This work is supported in part by the Outstanding Researcher Program of the Turkish Academy of Sciences and in part by TUBITAK under Contract 113E517. N. D. Vanli is with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139 (e-mail: [email protected]). M. O. Sayin is with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 (e-mail: [email protected]). S. S. Kozat is with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06800, Turkey (e-mail: [email protected]).
We emphasize that this problem can also be cast as a distributed convex optimization problem, hence our results can be applied to these problems in a straightforward manner. In [8], the authors consider the cooperative optimization of the cost function under convex inequality constraints. However, the problem formulation as well as the convergence results in this paper are significantly different from the ones in [8]. In [4], the authors present a deterministic analysis of the SGD iterates; our results build on theirs by illustrating a stronger convergence bound in expectation while also providing MSD analyses of the SGD iterates. In [5]–[7], the authors consider the distributed convex optimization problem and present probability-1 and mean square convergence results for the SGD iterates. In this paper, on the other hand, we provide the expected convergence rate of our algorithm and the MSD of the SGD iterates at any time instant.


Similar convergence analyses have recently been illustrated in the computational learning theory literature [20], [21], [28]–[30]. In [28], the authors provide deterministic bounds on the learning performance (i.e., regret) of the SGD algorithm. In [29], these analyses are extended and a regret-optimal learning algorithm is proposed. Along the same lines, in [21], the authors describe a method to make the SGD algorithm optimal for strongly convex optimization. However, these approaches rely on the smoothness of the optimization problem. In [30], a different method to achieve the optimal convergence rate is proposed and its performance is analyzed. In this paper, on the other hand, the convex optimization is performed over a network of localized learners, unlike [21], [28]–[30]. Our results illustrate the convergence rates over any unknown communication graph, and in this sense build upon the analyses of centralized learners. Furthermore, unlike [21], [29], our algorithm does not require the optimization problem to be sufficiently smooth.

Distributed convex optimization appears in a wide range of practical applications in wireless sensor networks and real-time control systems [1]–[3]. We introduce a comprehensive approach to this setup by proposing an online algorithm whose expected performance is asymptotically the same as the performance of the optimal centralized processor. Our results are generic for any probability distribution on the data, not necessarily Gaussian, unlike the conventional works in the literature [9], [10]. Furthermore, our performance bounds are optimal in a strong deterministic sense up to constant terms. Our experiments over different network topologies, various data sets, and cost functions illustrate the superiority and robustness of our approach with respect to the state-of-the-art methods in the literature. Our main contributions are as follows.
1) We introduce a distributed online convex optimization algorithm based on the SGD iterates that achieves an optimal convergence rate of O(N√N/T) after T gradient updates, for each and every node on the network. We emphasize that this convergence rate is optimal since it achieves the lower bounds presented in [20] up to constant terms.

2) We show that the average SGD iterate achieves an MSD of O(√N/T) after T gradient updates.

3) Our analyses can be extended to analyze the performances of the diffusion and consensus strategies in a straightforward manner, as illustrated in our paper.

4) We illustrate the highly significant performance gains of the introduced algorithm with respect to the state-of-the-art methods in the literature under various network topologies and benchmark data sets.

The organization of the paper is as follows. In Section II, we introduce the distributed convex optimization framework and provide the notation. We then introduce the main result of the paper, i.e., an SGD based convex optimization algorithm, in Section III and analyze the convergence rate of the introduced algorithm. In Section IV, we extend our results to the consensus strategy. In Section V, we demonstrate the performance of our algorithm with respect to the state-of-the-art methods through simulations and then conclude the paper with several

remarks in Section VI.

II. PROBLEM DESCRIPTION

A. Notation

Throughout the paper, all vectors are column vectors and are represented by boldface lowercase letters. Matrices are represented by boldface uppercase letters. For a matrix H, ||H||_F is the Frobenius norm. For a vector x, ||x|| = √(x^T x) is the ℓ2-norm. Here, 0 (and 1) denotes a vector of all zeros (and all ones), whose dimension can be understood from the context. For a matrix H, H_{ij} represents its entry at the ith row and jth column. For a convex set W, Π_W denotes the Euclidean projection onto W, i.e., Π_W(w') = arg min_{w ∈ W} ||w − w'||.

B. System Overview

We consider the problem of distributed strongly convex optimization over a network of N nodes. At each time t, each node i observes a pair of regressor vectors and data, i.e., u_{t,i} and d_{t,i}, where the pairs (u_{t,i}, d_{t,i}) are independent and identically distributed (i.i.d.) for all t ≥ 1, which is a common framework in the conventional studies in the literature [9], [10], [25]. Here, the distributions of the data pairs can differ from one node to another, and we do not have any information on the underlying distributions. In this framework, the aim of each node is to minimize a strongly convex cost function f_i(·) over a convex set W. However, the f_i(·)'s are not known, and each node accesses its f_i(·) only via a stochastic gradient oracle, which, given some w ∈ W, produces a random vector ĝ_i whose expectation E[ĝ_i] = g_i is a subgradient of f_i at w. Using these stochastic gradient oracles, each node estimates a parameter of interest w_{t,i} on a common convex set W and calculates an estimate of the output as d̂_{t,i} = w_{t,i}^T u_{t,i},

i.e., by a first order linear method. After observing the true data at time t, node i suffers a loss of f_{t,i}(w_{t,i}), where f_{t,i}(·) is a strongly convex loss function. As an example, our cost function can be defined as follows:

f_{t,i}(w_{t,i}) = ℓ(w_{t,i}; u_{t,i}, d_{t,i}) + (λ/2) ||w_{t,i}||²,   (1)

where ℓ(w_{t,i}; u_i, d_i) is a Lipschitz-continuous convex loss function with respect to the first variable; (1) is extensively studied in the literature [21], [28]–[30] as a strongly convex loss function involving a regularization term. Here, the aim of each node is to minimize its expected loss over the convex set W. To continue with our example in (1), the aim of each node is to minimize

E f_i(w_{t,i}) = E[ ℓ(w_{t,i}; u_i, d_i) + (λ/2) ||w_{t,i}||² ],   (2)

where we drop the notational dependency of f on the time index t since the data pairs are assumed to be independent over time. We emphasize that the formulation in (2) covers a wide range of practical loss functions. As an example, when ℓ(w_{t,i}; u_i, d_i) = (d_i − w_{t,i}^T u_i)², we consider the regularized squared error loss, and when ℓ(w_{t,i}; u_i, d_i) = max{0, 1 −


Algorithm 1 Time Variable Weighting (TVW)

Fig. 1: An example distributed network of 16 nodes, where each agent communicates with a set of nodes in its neighborhood.

d_i w_{t,i}^T u_i},

we consider the hinge loss. Since we make no assumptions on the loss function f_i(w_{t,i}) other than strong convexity, one can also use different loss functions with their corresponding gradients, and our results still hold. In this framework, at each time t, after observing the true data d_{t,i}, each node exchanges (i.e., diffuses) information with its neighbors in order to produce the next estimate of the parameter vector w_{t+1,i}. For example, in Fig. 1, we have a distributed network consisting of N = 16 nodes, where each node communicates with its neighbors. Although the aim of each node is to minimize its expected loss in (2), the ultimate goal of the distributed network is to minimize the total expected loss, i.e.,

E f(w_{t,i}) ≜ Σ_{j=1}^{N} E f_j(w_{t,i}),   (3)

for each node i, with at most t stochastic gradient oracle calls at each agent, where N is the number of nodes in the distributed network. Here, we emphasize that the expected total loss in (3) is defined with respect to the data statistics at every node, whereas the parameter vector is trained using the gradient oracle calls at the ith node and the information diffusion over the distributed network. Thus, our aim is to minimize a global, strongly convex cost function at each node over a distributed network, where each learner is allowed to use at most t calls to the gradient oracle until time t.

III. MAIN RESULTS

In this section, we present the main result of the paper, where we introduce an algorithm based on the SGD updates that achieves an expected regret upper bound of O(N√N/T) after T iterates. The proposed method uses time-dependent weighted averages of the SGD updates at each node together with the adapt-then-combine diffusion strategy [9] to achieve this performance. However, as later explained in the paper (i.e., in Section IV), our algorithm can be extended to the consensus strategy in a straightforward manner. The complete description of the algorithm can be found in Algorithm 1.

1: for t = 1 to T do
2:   for i = 1 to N do
3:     d̂_{t,i} = w̄_{t,i}^T u_{t,i}   % Estimation
4:     g_{t,i} = ∇_{w_{t,i}} f_i(w_{t,i})   % Gradient oracle call
5:     ψ_{t+1,i} = w_{t,i} − µ_t g_{t,i}   % SGD update
6:     φ_{t+1,i} = Π_W(ψ_{t+1,i})   % Projection
7:     w_{t+1,i} = Σ_{j=1}^{N} H_{ji} φ_{t+1,j}   % Diffusion
8:     w̄_{t+1,i} = (t/(t+2)) w̄_{t,i} + (2/(t+2)) w_{t+1,i}   % Weighting
9:   end for
10: end for
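For concreteness, the listing above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the gradient oracle, the choice of W as a Euclidean ball, and the helper name `tvw` are all stand-ins supplied by the caller.

```python
import numpy as np

def tvw(oracle, H, T, lam, dim, radius=1.0):
    """Sketch of Algorithm 1 (TVW). oracle(i, w) returns a stochastic
    subgradient of f_i at w; H is an N x N doubly stochastic matrix."""
    N = H.shape[0]
    w = np.zeros((N, dim))        # w_{t,i}, identically initialized (cf. Lemma 2)
    w_bar = np.zeros((N, dim))    # time-variable weighted averages bar{w}_{t,i}
    for t in range(1, T + 1):
        mu = 2.0 / (lam * (t + 1))                        # step size of Theorem 1
        g = np.stack([oracle(i, w[i]) for i in range(N)])  # gradient oracle calls
        psi = w - mu * g                                   # SGD update
        nrm = np.linalg.norm(psi, axis=1, keepdims=True)
        phi = psi * np.minimum(1.0, radius / np.maximum(nrm, 1e-12))  # projection
        w = H.T @ phi                                      # diffusion: sum_j H_ji phi_j
        w_bar = (t / (t + 2)) * w_bar + (2.0 / (t + 2)) * w  # weighting step
    return w_bar
```

With a regularized squared error oracle, as in the example following (2), every node's weighted average approaches the common regularized minimizer.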

To achieve the aforementioned result, we first introduce the following lemma, which provides an upper bound on the performance of the average parameter vector.

Lemma 1. Assume that (i) each f_i is λ-strongly convex over W, where W ⊆ R^p is a closed convex set; (ii) the norms of any two subgradients have a finite correlation, i.e., E[ ||g_i(w_1)|| ||g_j(w_2)|| ] ≤ G², for any w_1, w_2 ∈ W. Then, defining

w̄_t ≜ (1/N) Σ_{i=1}^{N} w_{t,i},

Algorithm 1 yields

E ||w̄_{t+1} − w*||² − (1 − λµ_t) E ||w̄_t − w*||²
  ≤ (2µ_t/N) E[ f(w*) − f(w̄_t) + G Σ_{i=1}^{N} ( 2||w̄_t − w_{t,i}|| + ||w̄_t − ψ_{t+1,i}|| ) ] + 4G²µ_t².   (4)

This lemma provides an upper bound on the regret and the squared deviation of the average parameter vector. It provides an intermediate step to relate the performance of the parameter vector at each agent to the best parameter vector. The assumptions in Lemma 1 are widely used to analyze the convergence of online algorithms, such as in [21], [28]–[30]. In particular, the first assumption in Lemma 1 is satisfied for many loss functions, such as the hinge loss, that are widely used in the literature. Even though this assumption may not hold for some loss functions such as the square loss, one can incorporate a small regularization term in order to make the loss strongly convex [21], [28]. The second assumption in Lemma 1 is practically a boundedness condition that is widely used to analyze the performance of SGD based algorithms [29], [30]. We emphasize that our algorithm does not need to know this upper bound; it is only used in our theoretical derivations.

Proof. In order to efficiently manage the recursions, we first consider the projection operation and let

x_{t,i} ≜ Π_W(ψ_{t+1,i}) − ψ_{t+1,i},   (5)

and g_{t,i} ≜ ∇_{w_{t,i}} f_i(w_{t,i}). Then, we can compactly represent the averaged estimation parameter w̄_t = (1/N) Σ_{i=1}^{N} w_{t,i} in a


recursive manner as follows [4]:

w̄_{t+1} = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} H_{ij} ( w_{t,i} − µ_t g_{t,i} + x_{t,i} )
        = w̄_t + (1/N) Σ_{i=1}^{N} ( x_{t,i} − µ_t g_{t,i} ),   (6)

where the last line follows since H is right stochastic, i.e., H1 = 1. Hence, the squared deviation of these average iterates with respect to w* can be obtained as follows:

||w̄_{t+1} − w*||² = || w̄_t − w* + (1/N) Σ_{i=1}^{N} ( x_{t,i} − µ_t g_{t,i} ) ||²
  = ||w̄_t − w*||² + || (1/N) Σ_{i=1}^{N} ( x_{t,i} − µ_t g_{t,i} ) ||²
    + (2/N) Σ_{i=1}^{N} ( x_{t,i} − µ_t g_{t,i} )^T ( w̄_t − w* ).   (7)

We first upper bound the second term on the right hand side (RHS) of (7) as follows:

|| (1/N) Σ_{i=1}^{N} ( x_{t,i} − µ_t g_{t,i} ) ||² ≤ (1/N²) ( Σ_{i=1}^{N} ( ||x_{t,i}|| + µ_t ||g_{t,i}|| ) )².   (8)

We then note that

||x_{t,i}|| = ||Π_W(ψ_{t+1,i}) − ψ_{t+1,i}|| ≤ ||w_{t,i} − ψ_{t+1,i}|| = µ_t ||g_{t,i}||,

where the inequality follows since w_{t,i} ∈ W and Π_W(ψ_{t+1,i}) = arg min_{φ ∈ W} ||φ − ψ_{t+1,i}||. Thus, we can rewrite (8) as follows:

E || (1/N) Σ_{i=1}^{N} ( x_{t,i} − µ_t g_{t,i} ) ||² ≤ (4µ_t²/N²) E ( Σ_{i=1}^{N} ||g_{t,i}|| )² ≤ 4G²µ_t²,   (9)

where the last inequality follows since E[ ||g_{t,i}|| ||g_{t,j}|| ] ≤ G² for any i, j ∈ {1, …, N} according to the assumption.

We next turn our attention to the −g_{t,i}^T ( w̄_t − w* ) term in (7) and upper bound it as follows:

−g_{t,i}^T ( w̄_t − w* ) = −g_{t,i}^T ( w̄_t − w_{t,i} + w_{t,i} − w* )
  ≤ −g_{t,i}^T ( w̄_t − w_{t,i} ) + f_i(w*) − f_i(w_{t,i}) − (λ/2) ||w* − w_{t,i}||²   (10)
  ≤ −g_{t,i}^T ( w̄_t − w_{t,i} ) + ḡ_{t,i}^T ( w̄_t − w_{t,i} ) + f_i(w*) − f_i(w̄_t)
    − (λ/2) ||w* − w_{t,i}||² − (λ/2) ||w_{t,i} − w̄_t||²   (11)
  ≤ f_i(w*) − f_i(w̄_t) + ( ||ḡ_{t,i}|| + ||g_{t,i}|| ) ||w̄_t − w_{t,i}||
    − (λ/2) ||w* − w_{t,i}||² − (λ/2) ||w_{t,i} − w̄_t||²,   (12)

where ḡ_{t,i} ≜ ∇_{w̄_t} f_i(w̄_t), (10) follows from the λ-strong convexity, i.e.,

f_i(w*) ≥ f_i(w_{t,i}) + g_{t,i}^T ( w* − w_{t,i} ) + (λ/2) ||w* − w_{t,i}||²,

(11) also follows from the λ-strong convexity, i.e.,

f_i(w_{t,i}) ≥ f_i(w̄_t) + ḡ_{t,i}^T ( w_{t,i} − w̄_t ) + (λ/2) ||w_{t,i} − w̄_t||²,

and (12) follows from the Cauchy-Schwarz inequality. Summing (12) from i = 1 to N and taking expectations of both sides, we obtain

−E Σ_{i=1}^{N} g_{t,i}^T ( w̄_t − w* )
  ≤ E[ 2G Σ_{i=1}^{N} ||w̄_t − w_{t,i}|| + f(w*) − f(w̄_t)
       − (λN/2) Σ_{i=1}^{N} (1/N) ( ||w* − w_{t,i}||² + ||w_{t,i} − w̄_t||² ) ]
  ≤ E[ f(w*) − f(w̄_t) + 2G Σ_{i=1}^{N} ||w̄_t − w_{t,i}|| − (λN/2) ||w̄_t − w*||² ],   (13)

where the first inequality follows due to the assumption and the last inequality follows from Jensen's inequality due to the convexity of the squared norm operator.

We finally turn our attention to the x_{t,i}^T ( w̄_t − w* ) term in (7) and write it as follows:

x_{t,i}^T ( w̄_t − w* ) = x_{t,i}^T ( w̄_t − ψ_{t+1,i} ) + x_{t,i}^T ( ψ_{t+1,i} − w* ) ≤ x_{t,i}^T ( w̄_t − ψ_{t+1,i} ),

where the inequality follows from the definition of the Euclidean projection. Taking expectations of both sides, we can upper bound this term as follows:

E[ x_{t,i}^T ( w̄_t − w* ) ] ≤ E[ ||x_{t,i}|| ||w̄_t − ψ_{t+1,i}|| ] ≤ Gµ_t E ||w̄_t − ψ_{t+1,i}||.   (14)


Putting (9), (13), and (14) back in (7), we obtain

E ||w̄_{t+1} − w*||² − (1 − λµ_t) E ||w̄_t − w*||²
  ≤ (2µ_t/N) E[ f(w*) − f(w̄_t) + G Σ_{i=1}^{N} ( 2||w̄_t − w_{t,i}|| + ||w̄_t − ψ_{t+1,i}|| ) ] + 4G²µ_t².   (15)

This concludes the proof of Lemma 1. ∎
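The only nontrivial step of the projection argument before (14) is the variational characterization of the Euclidean projection: the residual x_{t,i} = Π_W(ψ_{t+1,i}) − ψ_{t+1,i} satisfies x_{t,i}^T(ψ_{t+1,i} − w*) ≤ 0 for every w* ∈ W, so that term can be dropped. A quick randomized check of this fact, using a unit ball as an illustrative choice of W (our example, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(7)
for _ in range(1000):
    psi = 3.0 * rng.normal(size=3)                 # arbitrary point psi_{t+1,i}
    proj = psi / max(1.0, np.linalg.norm(psi))     # Pi_W(psi) for the unit ball W
    x = proj - psi                                 # x_{t,i} as defined in (5)
    w_star = rng.normal(size=3)
    w_star *= rng.uniform(0.0, 1.0) / np.linalg.norm(w_star)  # any point of W
    assert x @ (psi - w_star) <= 1e-9              # the dropped term is nonpositive
```

The inequality holds for any closed convex W, since x^T(psi − w*) = −||x||² − (proj − psi)^T(w* − proj) and the second term is nonnegative by the projection's variational inequality.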

Having obtained an upper bound on the performance of the average parameter vector, we next consider the MSD of the parameter vectors at each node from the average parameter vector. This lemma will then be used to relate the performance of each individual node to the performance of the fully connected distributed system.

Lemma 2. In addition to the assumptions in Lemma 1, assume that (i) the communication graph H forms a doubly stochastic matrix such that H is irreducible and aperiodic; (ii) the initial weights at each node are identically initialized to avoid any bias, i.e., w_{1,i} = w_{1,j}, ∀i, j ∈ {1, …, N}. Then Algorithm 1 yields

E ||w̄_t − w_{t,i}|| ≤ 2G√N Σ_{z=1}^{t−1} µ_{t−z} σ^z,   (16)

and

E ||w̄_t − ψ_{t+1,i}|| ≤ Gµ_t + 2G√N Σ_{z=1}^{t−1} µ_{t−z} σ^z,   (17)

where σ is the second largest singular value of the matrix H.

We emphasize that the assumptions in Lemma 2 are not restrictive, and previous analyses in the literature use similar assumptions [4]–[7]. The first assumption in Lemma 2 indicates that the convergence of the communication graph is geometric. This assumption holds when the communication graph is strongly connected, doubly stochastic, and each node gives a nonzero weight to the iterates of its neighbors, which is the case for many communication strategies such as the Metropolis rule [1]. The second assumption in Lemma 2 is basically an unbiasedness condition, which is reasonable since the objective weight w* is completely unknown to us. Even if the initial weights are not identical, our analyses still hold, albeit with small additional excess terms.

Proof. We first let W_t ≜ [w_{t,1}, …, w_{t,N}], G_t ≜ [g_{t,1}, …, g_{t,N}], and X_t ≜ [x_{t,1}, …, x_{t,N}]. Then, we obtain the recursion on W_t as follows:

W_t = W_1 H^{t−1} + Σ_{z=1}^{t−1} ( X_{t−z} − µ_{t−z} G_{t−z} ) H^z.   (18)

Letting e_i denote the basis vector for the ith dimension, i.e., only the ith entry of e_i is 1 whereas the rest are 0, we have

||w̄_t − w_{t,i}|| = || W_t ( (1/N) 1 − e_i ) ||
  ≤ ||w̄_1 − w_{1,i}|| + 2 Σ_{z=1}^{t−1} µ_{t−z} || G_{t−z} ( (1/N) 1 − H^z e_i ) ||
  ≤ 2 Σ_{z=1}^{t−1} µ_{t−z} ||G_{t−z}||_F || (1/N) 1 − H^z e_i ||,   (19)

where the second line follows due to the unbiased initialization assumption, i.e., w_{1,i} = w_{1,j}, ∀i, j ∈ {1, …, N}. We first consider the term || (1/N) 1 − H^z e_i || of (19) and define the matrix B ≜ (1/N) 1 1^T. Then, we can write

|| (1/N) 1 − H^z e_i || = || B e_i − H^z e_i || = || (B − H)^z e_i ||,   (20)

where the last equality follows since B^z = B, ∀z ≥ 1. Here, we let σ_1(A) ≥ σ_2(A) ≥ ⋯ ≥ σ_N(A) denote the singular values of a matrix A and λ_1(A) ≥ λ_2(A) ≥ ⋯ ≥ λ_N(A) denote the eigenvalues of a symmetric matrix A. Then, we can upper bound (20) as follows:

|| (B − H)^z e_i || ≤ σ_1(B − H) || (B − H)^{z−1} e_i ||, ∀z ≥ 1.

Therefore, applying this recursion z times to (20), we obtain

|| (B − H)^z e_i || ≤ σ_1^z(B − H) ||e_i|| = σ_1^z(B − H).   (21)

Here, we note that H is assumed to be a doubly stochastic matrix, thus λ_1(H) = 1. Therefore, the eigenspectra of B − H and (B − H)^T (B − H) are equal to the eigenspectra of H and H^T H, respectively, except for the largest eigenvalue λ_1(H) = λ_1(H^T H) = 1. Thus, we have

σ_1^z(B − H) = σ_2^z(H),   (22)

and combining (20), (21), and (22), we obtain

|| (1/N) 1 − H^z e_i || ≤ σ_2^z(H).   (23)

From here on, we denote σ ≜ σ_2(H) for notational simplicity. Putting (23) back in (19), we obtain

||w̄_t − w_{t,i}|| ≤ 2 Σ_{z=1}^{t−1} µ_{t−z} σ^z ||G_{t−z}||_F.   (24)

Taking the expectation of both sides and noting that

( E ||G_{t−z}||_F )² ≤ E ||G_{t−z}||_F² = E Σ_{i=1}^{N} ||g_{t−z,i}||² ≤ G²N,


we can rewrite (24) as follows:

E ||w̄_t − w_{t,i}|| ≤ 2G√N Σ_{z=1}^{t−1} µ_{t−z} σ^z.   (25)

An upper bound for the term ||w̄_t − ψ_{t+1,i}|| can be obtained as

||w̄_t − ψ_{t+1,i}|| = ||w̄_t − w_{t,i} + µ_t g_{t,i}|| ≤ ||w̄_t − w_{t,i}|| + µ_t ||g_{t,i}||,

where the last line follows from the triangle inequality. Taking the expectation of both sides, we obtain the following upper bound:

E ||w̄_t − ψ_{t+1,i}|| ≤ Gµ_t + 2G√N Σ_{z=1}^{t−1} µ_{t−z} σ^z.   (26)

This concludes the proof of Lemma 2. ∎

The results in Lemma 1 and Lemma 2 are combined in the following theorem to obtain a regret bound on the performance of the proposed algorithm. This theorem illustrates the convergence rate of our algorithm (i.e., Algorithm 1) over distributed networks. The O(N√N/T) upper bound on the regret results since the algorithm suffers a "diffusion regret" to sufficiently exchange (i.e., diffuse) the information among the nodes. This convergence rate matches the lower bounds presented in [20] up to constant terms, hence it is optimal in a minimax sense. The computational complexity of the introduced algorithm is on the order of the computational complexity of the SGD iterates up to constant terms. Furthermore, the communication load of the proposed method is the same as the communication load of the SGD algorithm. On the other hand, by using a time-dependent averaging of the SGD iterates, our algorithm achieves a significantly improved performance, as shown in Theorem 1 and illustrated through our simulations in Section V.

Theorem 1. Under the assumptions in Lemma 1 and Lemma 2, Algorithm 1 with learning rate µ_t = 2/(λ(t+1)) and weighted parameters w̄_{t,i}, when applied to any independent and identically distributed regressor vectors and data, i.e., (u_{t,i}, d_{t,i}) for all t ≥ 1 and i = 1, …, N, achieves the following convergence guarantee:

E[ f(w̄_{T,i}) − f(w*) ] ≤ (4NG²/(λ(T+1))) ( 3 + 8σ√N/(1−σ) ),   (27)

for all T ≥ 1, where σ is the second largest singular value of the matrix H.
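The constant σ entering (27) comes from the spectral argument (20)–(23) in the proof of Lemma 2, which is easy to verify numerically. The sketch below is our own illustration on an assumed 4-node ring combination matrix, not a network from the paper; it checks identity (22) and inequality (23).

```python
import numpy as np

# A doubly stochastic, irreducible, aperiodic combination matrix on a 4-node
# ring (hypothetical example; any matrix satisfying Lemma 2's assumptions works).
H = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
N = H.shape[0]
B = np.ones((N, N)) / N                # B = (1/N) 1 1^T, so B e_i = (1/N) 1

# (22): the largest singular value of B - H equals the second largest of H.
sigma = np.linalg.svd(H, compute_uv=False)[1]           # sigma_2(H)
assert abs(np.linalg.svd(B - H, compute_uv=False)[0] - sigma) < 1e-12

# (23): || (1/N) 1 - H^z e_i || <= sigma^z for every node i and power z.
for z in range(1, 8):
    Hz = np.linalg.matrix_power(H, z)
    for i in range(N):
        lhs = np.linalg.norm(np.ones(N) / N - Hz[:, i])  # Hz[:, i] = H^z e_i
        assert lhs <= sigma ** z + 1e-12
```

For this ring, σ = 0.5, so the deviation of any node from the network average is forgotten geometrically, which is exactly what (16) and (17) quantify.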

This theorem illustrates that although every node uses local gradient oracle calls to train its parameter vector, each node asymptotically achieves the performance of the centralized processor through the information diffusion over the network. This result shows that each node acquires the information contained in the gradient oracles at every other node and asymptotically suffers no regret as the number of gradient oracle calls at each node increases.

Proof. According to Lemma 1 and Lemma 2, we have

E[ f(w̄_t) − f(w*) ]
  ≤ (N/(2µ_t)) E[ (1 − λµ_t) ||w̄_t − w*||² − ||w̄_{t+1} − w*||² ]
    + 3NG²µ_t + 6N√N G² Σ_{z=1}^{t−1} µ_{t−z} σ^z.   (28)

From the convexity of the cost functions, we also have

E[ f_i(w̄_t) − f_i(w_{t,j}) ] ≥ E[ g_{t,i,j}^T ( w̄_t − w_{t,j} ) ],   (29)

∀i, j ∈ {1, …, N}, where g_{t,i,j} ≜ ∇_{w_{t,j}} f_i(w_{t,j}). Here, we can rewrite (29) as follows:

E[ f_i(w_{t,j}) − f_i(w̄_t) ] ≤ E[ g_{t,i,j}^T ( w_{t,j} − w̄_t ) ]
  ≤ E[ ||g_{t,i,j}|| ||w_{t,j} − w̄_t|| ]   (30)
  ≤ G E ||w_{t,j} − w̄_t||,

where the second line follows from the Cauchy-Schwarz inequality and the last line follows from the boundedness assumption. Summing (30) from i = 1 to N, we obtain

E[ f(w_{t,j}) − f(w̄_t) ] ≤ NG E ||w_{t,j} − w̄_t||.   (31)

Using Lemma 2 in (31), we get

E[ f(w_{t,j}) − f(w̄_t) ] ≤ 2N√N G² Σ_{z=1}^{t−1} µ_{t−z} σ^z.   (32)

We then add (28) and (32) to obtain

E[ f(w_{t,j}) − f(w*) ]
  ≤ (N/(2µ_t)) E[ (1 − λµ_t) ||w̄_t − w*||² − ||w̄_{t+1} − w*||² ]
    + 3NG²µ_t + 8N√N G² Σ_{z=1}^{t−1} µ_{t−z} σ^z.   (33)

Multiplying both sides of (33) by t and summing from t = 1 to T yields [30]

E Σ_{t=1}^{T} t [ f(w_{t,j}) − f(w*) ]
  ≤ (N(1 − λµ_1)/(2µ_1)) E ||w̄_1 − w*||² − (TN/(2µ_T)) E ||w̄_{T+1} − w*||²
    + (N/2) Σ_{t=2}^{T} ( t(1 − λµ_t)/µ_t − (t−1)/µ_{t−1} ) E ||w̄_t − w*||²
    + 3NG² Σ_{t=1}^{T} tµ_t + 8N√N G² Σ_{t=1}^{T} t Σ_{z=1}^{t−1} µ_{t−z} σ^z.   (34)

Here, we observe that

Σ_{t=1}^{T} Σ_{z=1}^{t−1} t µ_{t−z} σ^z ≤ Σ_{z=1}^{T} Σ_{t=1}^{T} t µ_{t−z} σ^z ≤ ( Σ_{z=1}^{T} σ^z ) Σ_{t=1}^{T} tµ_t ≤ (σ/(1−σ)) Σ_{t=1}^{T} tµ_t.   (35)


Putting (35) back in (34) and inserting µ_t = 2/(λ(t+1)), we obtain

E Σ_{t=1}^{T} t [ f(w_{t,j}) − f(w*) ]
  ≤ −(λNT(T+1)/4) E ||w̄_{T+1} − w*||²
    + ( 3NG² + 8N√N G² σ/(1−σ) ) Σ_{t=1}^{T} 2t/(λ(t+1))
  ≤ −(λNT(T+1)/4) E ||w̄_{T+1} − w*||² + (2NG²T/λ) ( 3 + 8√N σ/(1−σ) ),   (36)

where the last line follows since t/(t+1) ≤ 1. Dividing both sides of (36) by Σ_{t=1}^{T} t = T(T+1)/2, we obtain

E[ (2/(T(T+1))) Σ_{t=1}^{T} t ( f(w_{t,j}) − f(w*) ) ]
  ≤ −(λN/2) E ||w̄_{T+1} − w*||² + (4NG²/(λ(T+1))) ( 3 + 8√N σ/(1−σ) ).   (37)

Since the f_i's are convex for all i ∈ {1, …, N}, f is also convex. Thus, from Jensen's inequality, we can write

E[ f( (2/(T(T+1))) Σ_{t=1}^{T} t w_{t,j} ) − f(w*) ] ≤ E[ (2/(T(T+1))) Σ_{t=1}^{T} t ( f(w_{t,j}) − f(w*) ) ].   (38)

Combining (37) and (38), we obtain

E[ f( (2/(T(T+1))) Σ_{t=1}^{T} t w_{t,j} ) − f(w*) ]
  ≤ −(λN/2) E ||w̄_{T+1} − w*||² + (4NG²/(λ(T+1))) ( 3 + 8√N σ/(1−σ) ).   (39)

This concludes the proof of Theorem 1. ∎

Hence, using the weighted average

w̄_{T,j} ≜ (2/(T(T+1))) Σ_{t=1}^{T} t w_{t,j},   (40)

instead of the original SGD iterates, we can achieve a convergence rate of O(N√N/T). The denominator T of this regret bound follows since we use a time variable weighting of the SGD iterates. The linear dependency on the network size follows since we add N different cost functions, i.e., one corresponding to each node. Finally, the sublinear dependency on the network size results from the diffusion of the parameter vector over the distributed network.

In the following theorem, we consider the performance of the average SGD iterate instead of the time-variable weighted iterate in (40). In particular, we show that the average SGD iterate achieves an MSD of O(√N/T). This MSD follows due to the number of gradient oracle calls and the diffusion regret over the distributed network.

Theorem 2. Under the assumptions in Lemma 1 and Lemma 2, Algorithm 1 with learning rate µ_t = 2/(λ(t+1)) and weighted parameters w̄_{t,i}, when applied to any independent and identically distributed regressor vectors and data, i.e., (u_{t,i}, d_{t,i}) for all t ≥ 1 and i = 1, …, N, yields the following MSD guarantee:

E ||w̄_{T+1} − w*||² ≤ (24G²/(λ²(T+1))) ( 1 + 2√N σ/(1−σ) ),   (41)

for all T ≥ 1, where w̄_t = (1/N) Σ_{i=1}^{N} w_{t,i} and σ is the second largest singular value of the matrix H.

Proof. Following lines similar to the proof of Theorem 1, in particular the steps between (33) and (39), we get

E[ f( (2/(T(T+1))) Σ_{t=1}^{T} t w̄_t ) − f(w*) ]
  ≤ −(λN/2) E ||w̄_{T+1} − w*||² + (12NG²/(λ(T+1))) ( 1 + 2√N σ/(1−σ) ),   (42)

which yields

E ||w̄_{T+1} − w*||² ≤ (24G²/(λ²(T+1))) ( 1 + 2√N σ/(1−σ) ),   (43)

which is the desired result. This concludes the proof of Theorem 2. ∎

IV. EXTENSIONS TO THE CONSENSUS STRATEGY

Algorithm 1 can be generalized to the consensus strategy in a straightforward manner, while the performance guarantee in Theorem 1 still holds up to constant terms, i.e., we still have a convergence rate of O(N√N/T). For the consensus strategy, lines 5–7 of Algorithm 1 are replaced by the following update:

w_{t+1,i} = Π_W( Σ_{j=1}^{N} H_{ji} w_{t,j} − µ_t g_{t,i} ).   (44)

Hence, we have the following recursion on the parameter vectors:

W_t = W_1 H^{t−1} − Σ_{z=1}^{t−1} ( X_{t−z} − µ_{t−z} G_{t−z} ) H^{z−1},

instead of the one in (18). According to this modification, Lemma 2 can be updated as follows:

E ||w̄_t − w_{t,i}|| ≤ 2G√N Σ_{z=1}^{t−1} µ_{t−z} σ^{z−1}.   (45)
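The one-line difference between the two strategies can be made explicit. In the sketch below (illustrative helper names, not from the paper), the diffusion step adapts and projects before combining, as in lines 5–7 of Algorithm 1, while the consensus step of (44) combines the neighbors' iterates first and then adapts and projects.

```python
import numpy as np

def diffusion_step(w, g, H, mu, project):
    """Adapt-then-combine (lines 5-7 of Algorithm 1)."""
    phi = np.stack([project(w[i] - mu * g[i]) for i in range(w.shape[0])])
    return H.T @ phi                       # w_{t+1,i} = sum_j H_ji phi_{t+1,j}

def consensus_step(w, g, H, mu, project):
    """Consensus update (44): combine first, then adapt and project."""
    mixed = H.T @ w                        # sum_j H_ji w_{t,j}
    return np.stack([project(mixed[i] - mu * g[i]) for i in range(w.shape[0])])
```

With a doubly stochastic H and a projection that is inactive, both steps move the network average by exactly −µ times the average subgradient; they differ only in how quickly the disagreement between nodes contracts.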


SNRs of the Nodes

This loosens the upper bounds in (19) and (26) by a factor of 1/σ. Therefore, the diffusion strategies achieve a better convergence performance compared to the consensus strategy.

V. SIMULATIONS

In this section, we first examine the performance of the proposed algorithms for various distributed network topologies, namely the star, the circle, and a random network topology (see Fig. 3). In all cases, we have a network of N = 20 nodes, where at each node i = 1, . . . , N at time t we observe the data d_{t,i} = w_0^T u_{t,i} + v_{t,i}, where the regression vector u_{t,i} and the observation noise v_{t,i} are generated from i.i.d. zero-mean Gaussian processes for all t ≥ 1. The variance of the observation noise is σ_{v,i}^2 = 0.1 for all i = 1, . . . , N, whereas the auto-covariance matrix of the regression vector u_{t,i} ∈ R^5 is randomly chosen for each node i = 1, . . . , N such that the signal-to-noise ratio (SNR) over the network varies between −15 dB and 10 dB (see Fig. 2). The parameter of interest w_0 ∈ R^5 is randomly chosen from a zero-mean Gaussian process and normalized to have unit norm, i.e., ||w_0|| = 1. We use the well-known Metropolis combination rule [1] to set the combination weights as

    H_{ij} = 1 / max{n_i, n_j},              if j ∈ N_i \ {i},
    H_{ij} = 0,                              if j ∉ N_i,          (46)
    H_{ii} = 1 − Σ_{j ∈ N_i \ {i}} H_{ij},

where n_i is the number of neighboring nodes of node i. For this set of experiments, we consider the squared error loss, i.e., ℓ(w_{t,i}; u_{t,i}, d_{t,i}) = (d_{t,i} − w_{t,i}^T u_{t,i})², as our loss function.

Fig. 2: SNRs of the nodes in the distributed network.

In the figures, CSS represents the distributed constant step-size SGD algorithm of [9], VSS represents the distributed variable step-size SGD algorithm of [4], UW represents the distributed version of the uniform weighted SGD algorithm of [21], and TVW represents the distributed time-variant weighted SGD algorithm proposed in this paper. The step-sizes of the CSS-1, CSS-2, and CSS-3 algorithms are set to 0.05, 0.1, and 0.2, respectively, at each node; the learning rates of the VSS and UW algorithms are set to 1/(λt) as noted in [4], [21]; and the learning rate of the TVW algorithm is set to 2/(λ(t + 1)) as noted in Theorem 1, where λ = 0.01. These learning rates are chosen to guarantee a fair performance comparison among these algorithms according to the corresponding algorithm descriptions stated in this paper and in [4], [21]. In the left column of Fig. 3, we compare the normalized time-accumulated error performances of these algorithms under different network topologies in terms of the global normalized cumulative error (NCE) measure, i.e.,

    NCE(t) = (1/(Nt)) Σ_{i=1}^{N} Σ_{τ=1}^{t} (d_{τ,i} − w_{τ,i}^T u_{τ,i})².

Additionally, in the right column of Fig. 3, we compare the performance of the algorithms in terms of the global MSD measure, i.e.,

    MSD(t) = (1/N) Σ_{i=1}^{N} ||w_0 − w_{t,i}||².
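These two measures can be computed directly from the per-node errors and iterates; the following minimal sketch shows both (the function names `global_nce` and `global_msd` are ours, not from the paper):

```python
import numpy as np

def global_nce(errors):
    """Global NCE at time t: (1/(N t)) * sum_i sum_tau e_{tau,i}^2.
    errors: (t, N) array of per-node prediction errors d - w^T u."""
    t, N = errors.shape
    return (errors ** 2).sum() / (N * t)

def global_msd(W, w0):
    """Global MSD at time t: (1/N) * sum_i ||w0 - w_{t,i}||^2.
    W: (N, p) array holding each node's current iterate as a row."""
    return np.mean(np.sum((W - w0) ** 2, axis=1))

# Toy check with N = 2 nodes, t = 1, p = 2.
errs = np.array([[1.0, 3.0]])           # e_{1,1} = 1, e_{1,2} = 3
assert global_nce(errs) == 5.0          # (1 + 9) / (2 * 1)
w0 = np.zeros(2)
W = np.array([[1.0, 0.0], [0.0, 2.0]])
assert global_msd(W, w0) == 2.5         # (1 + 4) / 2
```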

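The Metropolis combination rule in (46) is likewise straightforward to implement. A minimal sketch, assuming the topology is given as a symmetric 0/1 adjacency matrix without self-loops (`metropolis_weights` is our name, not from the paper):

```python
import numpy as np

def metropolis_weights(adj):
    """Combination matrix H from the Metropolis rule (46).
    adj: symmetric 0/1 adjacency matrix, no self-loops."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)                 # n_i: number of neighbors of node i
    H = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j and adj[i, j]:
                H[i, j] = 1.0 / max(deg[i], deg[j])
        H[i, i] = 1.0 - H[i].sum()        # each row sums to one
    return H

# Example: a 4-node ring topology.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
H = metropolis_weights(adj)
assert np.allclose(H.sum(axis=1), 1.0)    # row-stochastic by construction
```

Since the rule is symmetric in i and j, H is also column-stochastic, i.e., doubly stochastic, which is the property the combination step relies on.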
In the figures, we plot the NCE and MSD performances of the proposed algorithms averaged over 200 independent trials to avoid any bias. As can be seen in Fig. 3, the proposed TVW algorithm significantly outperforms its competitors and achieves a smaller error. This superior performance is obtained thanks to the time-dependent weighting of the SGD iterates, which yields a faster convergence rate with respect to the rest of the algorithms. Hence, by using a certain time-varying weighting of the SGD iterates, we obtain a significantly improved convergence performance compared to the state-of-the-art approaches in the literature. Furthermore, the performance of our algorithm is robust against the network topology, whereas the competitor algorithms may not provide satisfactory performance under different network topologies. We next consider classification tasks over the benchmark data sets covertype¹ and quantum². For this set of experiments, we consider the squared hinge loss, i.e., ℓ(w_{t,i}; u_{t,i}, d_{t,i}) = max{0, 1 − d_{t,i} w_{t,i}^T u_{t,i}}², as our loss function. The regularization constant is set to λ = 1/T, and the step-sizes of the TVW, UW, and VSS algorithms are set as in the previous experiment. The step-sizes of the CSS-1, CSS-2, and CSS-3 algorithms are set to 0.02, 0.05, and 0.1 for the covertype data set, and to 0.01, 0.02, and 0.05 for the quantum data set. These learning rates are chosen to illustrate the tradeoff between the convergence speed and the steady-state performance of the constant step-size SGD methods. The network sizes are set to N = 20 and N = 50 for the covertype and quantum data sets, respectively. In Fig. 4 and Fig. 5, we illustrate the performances of the proposed algorithms for various training data lengths.
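To make the time-dependent weighting concrete, the following single-node sketch runs SGD on the squared error loss and weights the t-th iterate proportionally to t, so that after T steps the average is w̄_T = 2/(T(T+1)) Σ_t t·w_t. The projection onto the unit ball and the choice λ = 2 (the strong-convexity constant of the expected squared error loss here, since E[u_t u_t^T] = I) are our simplifications for a standalone demonstration; the paper's experiments instead use λ = 0.01 within the full distributed recursions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, lam, T = 5, 2.0, 5000

# Unknown parameter with unit norm, as in the experiment setup.
w0 = rng.standard_normal(p)
w0 /= np.linalg.norm(w0)

def project(w, radius=1.0):
    # Projection onto the ball ||w|| <= radius (our addition, to keep
    # the large early steps 2/(lam*(t+1)) bounded).
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

w = np.zeros(p)        # SGD iterate w_t
w_bar = np.zeros(p)    # time-varying weighted average of the iterates
for t in range(1, T + 1):
    u = rng.standard_normal(p)                          # regressor u_t
    d = w0 @ u + np.sqrt(0.1) * rng.standard_normal()   # observation d_t
    g = -2.0 * (d - w @ u) * u                  # gradient of (d_t - w^T u_t)^2
    w = project(w - (2.0 / (lam * (t + 1))) * g)
    # Weight the t-th iterate by t: w_bar = 2/(T(T+1)) * sum_t t * w_t.
    w_bar += (2.0 * t / (T * (T + 1))) * w
```

Because the weights 2t/(T(T+1)) sum to one, w̄_T is a convex combination of the (projected) iterates, and the late iterates, which are closest to w_0, dominate the average.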
¹ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
² http://osmot.cs.cornell.edu/kddcup/

Fig. 3: NCE (left column) and MSD (right column) performances of the proposed algorithms under the star (first row), the circle (second row), and a random (third row) network topologies, under the squared error loss function, averaged over 200 trials.

In particular, we train the parameter vectors at each node using a certain length of training data and test the performance of the final parameter vector over the entire data set. We provide averaged results over 250 and 100 independent trials for the covertype and quantum data sets, respectively, and present the mean and variance of the normalized accumulated hinge errors.
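For reference, the squared hinge loss used in these classification experiments and its gradient can be sketched as follows (a standalone illustration with labels d ∈ {−1, +1}; the function names are ours, not from the paper):

```python
import numpy as np

def squared_hinge(w, u, d):
    """l(w; u, d) = max(0, 1 - d * w^T u)^2 for a label d in {-1, +1}."""
    m = 1.0 - d * (w @ u)
    return max(0.0, m) ** 2

def squared_hinge_grad(w, u, d):
    """Gradient of the squared hinge loss; zero once the margin is met.
    Unlike the plain hinge loss, the square makes the loss differentiable."""
    m = 1.0 - d * (w @ u)
    if m <= 0.0:
        return np.zeros_like(w)
    return -2.0 * m * d * u

u = np.array([1.0, -2.0, 0.5])
# At w = 0 the margin is fully violated (m = 1), so the loss is 1
# and the gradient is -2 * d * u.
assert squared_hinge(np.zeros(3), u, +1.0) == 1.0
assert np.allclose(squared_hinge_grad(np.zeros(3), u, +1.0), -2.0 * u)
```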

Fig. 4: Normalized accumulated errors of the proposed algorithms versus training data length for the covertype data set, averaged over 250 trials for a network size of 20.

Fig. 5: Normalized accumulated errors of the proposed algorithms versus training data length for the quantum data set, averaged over 100 trials for a network size of 50.

These figures illustrate that the proposed TVW algorithm significantly outperforms its competitors. Although the performances of the UW and VSS algorithms are comparably robust over different iterations, the TVW algorithm provides a smaller accumulated loss. On the other hand, the variance of the constant step-size methods deteriorates considerably as the step-size increases. Although decreasing the step-size yields a more robust performance for these constant step-size algorithms, the TVW algorithm provides a significantly smaller steady-state cumulative error with respect to these methods.

VI. CONCLUSION

We study strongly convex optimization over distributed networks, where the aim is to minimize a sum of unknown convex objective functions. We introduce an algorithm that uses a limited number of gradient oracle calls to these objective functions and achieves the optimal convergence rate of O(N√N/T) after T gradient updates at each node. This performance is obtained by using a certain time-dependent weighting of the SGD iterates at each node. The computational complexity and the communication load of the proposed approach are the same as those of the state-of-the-art methods in the literature up to constant terms. We also prove that the average SGD iterate achieves a mean square deviation (MSD) of O(√N/T) after T gradient oracle calls. We illustrate the superior convergence rate of our algorithm with respect to the state-of-the-art methods in the literature.

REFERENCES

[1] A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, April 2014.
[2] A. H. Sayed, S.-Y. Tu, J. Chen, X. Zhao, and Z. J. Towfic, “Diffusion strategies for adaptation and learning over networks: an examination of distributed strategies and network behavior,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 155–171, May 2013.
[3] J.-J. Xiao, A. Ribeiro, Z.-Q. Luo, and G. B. Giannakis, “Distributed compression-estimation using wireless sensor networks,” IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 27–41, July 2006.
[4] F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, “Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 11, pp. 2483–2493, Nov 2013.
[5] S. Sundhar Ram, A. Nedic, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, Dec 2010.
[6] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan 2009.
[7] I. Lobel and A. Ozdaglar, “Distributed subgradient methods for convex optimization over random networks,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1291–1306, June 2011.
[8] Z. J. Towfic and A. H. Sayed, “Adaptive penalty-based distributed stochastic convex optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 15, pp. 3924–3938, Aug 2014.
[9] C. G. Lopes and A. H. Sayed, “Diffusion least-mean squares over adaptive networks: Formulation and performance analysis,” IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3122–3136, July 2008.
[10] F. S. Cattivelli and A. H. Sayed, “Diffusion LMS strategies for distributed estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1035–1048, March 2010.
[11] J. Chen, C. Richard, and A. H. Sayed, “Multitask diffusion adaptation over networks,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4129–4144, Aug 2014.
[12] S.-Y. Tu and A. H. Sayed, “Distributed decision-making over adaptive networks,” IEEE Transactions on Signal Processing, vol. 62, no. 5, pp. 1054–1069, March 2014.
[13] J. Chen, C. Richard, and A. H. Sayed, “Diffusion LMS over multitask networks,” IEEE Transactions on Signal Processing, vol. 63, no. 11, pp. 2733–2748, June 2015.
[14] X. Zhao and A. H. Sayed, “Performance limits for distributed estimation over LMS adaptive networks,” IEEE Transactions on Signal Processing, vol. 60, no. 10, pp. 5107–5124, Oct 2012.
[15] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, Aug 2012.
[16] S.-Y. Tu and A. H. Sayed, “Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks,” IEEE Transactions on Signal Processing, vol. 60, no. 12, pp. 6217–6234, Dec 2012.
[17] J. Chen and A. H. Sayed, “Distributed pareto optimization via diffusion strategies,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 205–220, April 2013.
[18] X. Zhao and A. H. Sayed, “Distributed clustering and learning over networks,” IEEE Transactions on Signal Processing, vol. 63, no. 13, pp. 3285–3300, July 2015.


[19] Z. J. Towfic and A. H. Sayed, “Stability and performance limits of adaptive primal-dual networks,” IEEE Transactions on Signal Processing, vol. 63, no. 11, pp. 2888–2903, June 2015.
[20] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, “Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3235–3249, May 2012.
[21] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 449–456.
[22] K. Slavakis, G. B. Giannakis, and G. Mateos, “Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge,” IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 18–31, Sept 2014.
[23] I. D. Schizas, G. Mateos, and G. B. Giannakis, “Distributed LMS for consensus-based in-network adaptive processing,” IEEE Transactions on Signal Processing, vol. 57, no. 6, pp. 2365–2382, June 2009.
[24] G. Mateos, I. Schizas, and G. Giannakis, “Distributed recursive least-squares for consensus-based in-network adaptive estimation,” IEEE Transactions on Signal Processing, vol. 57, no. 11, pp. 4583–4588, Nov 2009.
[25] G. Mateos and G. B. Giannakis, “Distributed recursive least-squares: Stability and performance analysis,” IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3740–3754, July 2012.
[26] G. Mateos, J. A. Bazerque, and G. B. Giannakis, “Distributed sparse linear regression,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5262–5276, Oct 2010.
[27] P. A. Forero, A. Cano, and G. B. Giannakis, “Distributed clustering using wireless sensor networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 707–724, Aug 2011.
[28] E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algorithms for online convex optimization,” Machine Learning, vol. 69, no. 2-3, pp. 169–192, Dec 2007.
[29] E. Hazan and S. Kale, “Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization,” Journal of Machine Learning Research, vol. 15, pp. 2489–2512, Jul 2014.
[30] S. Lacoste-Julien, M. W. Schmidt, and F. Bach, “A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method,” http://arxiv.org/pdf/1212.2002v2.pdf, Dec 2012.