
Online Convex Optimization with Stochastic Constraints

arXiv:1708.03741v1 [math.OC] 12 Aug 2017

Hao Yu, Michael J. Neely, Xiaohan Wei
Department of Electrical Engineering
University of Southern California

Abstract

This paper considers online convex optimization (OCO) with stochastic constraints, which generalizes Zinkevich's OCO over a known simple fixed set by introducing multiple stochastic functional constraints that are i.i.d. generated at each round and are disclosed to the decision maker only after the decision is made. This formulation arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. It also includes many important problems as special cases, such as OCO with long term constraints, stochastic constrained convex optimization, and deterministic constrained convex optimization. To solve this problem, this paper proposes a new algorithm that achieves $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations. Experiments on a real-world data center scheduling problem further verify the performance of the new algorithm.

I. INTRODUCTION

Online convex optimization (OCO) is a multi-round learning process with arbitrarily-varying convex loss functions where the decision maker has to choose a decision $x(t)\in\mathcal{X}$ before observing the corresponding loss function $f^t(\cdot)$. For a fixed time horizon $T$, define the regret of a learning algorithm with respect to the best fixed decision in hindsight (with full knowledge of all loss functions) as

$$\text{regret}(T) = \sum_{t=1}^{T} f^t(x(t)) - \min_{x\in\mathcal{X}}\sum_{t=1}^{T} f^t(x).$$

The goal of OCO is to develop dynamic learning algorithms such that the regret grows sub-linearly with respect to $T$. The setting of OCO was introduced in a series of works [1], [2], [3], [4] and formalized in [4]. OCO has recently gained a considerable amount of research interest, with applications such as online regression, prediction with expert advice, online ranking, online shortest paths, and portfolio selection. See [5], [6] for more applications and background. In [4], Zinkevich shows that the online gradient descent (OGD) update

$$x(t+1) = P_{\mathcal{X}}\big[x(t) - \gamma\nabla f^t(x(t))\big], \qquad (1)$$

where $\nabla f^t(\cdot)$ is a subgradient of $f^t(\cdot)$ and $P_{\mathcal{X}}[\cdot]$ is the projection onto the set $\mathcal{X}$, achieves $O(\sqrt{T})$ regret. Hazan et al. in [7] show that better regret is possible under the assumption that each loss function is strongly convex, but that $O(\sqrt{T})$ is the best possible if no additional assumption is imposed.

Zinkevich's OGD in (1) requires full knowledge of the set $\mathcal{X}$ and a low-complexity projection $P_{\mathcal{X}}[\cdot]$. In practice, however, the constraint set $\mathcal{X}$, which is often described by many functional inequality constraints, can be time-varying and may not be fully disclosed to the decision maker. In [8], Mannor et al. extend OCO by considering time-varying constraint functions $g^t(x)$ which can vary arbitrarily and are only disclosed to us after each $x(t)$ is chosen. In this setting, Mannor et al. [8] explore the possibility of designing learning algorithms such that the regret grows sub-linearly and $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} g^t(x(t)) \leq 0$, i.e., the (cumulative) constraint violation $\sum_{t=1}^{T} g^t(x(t))$ also grows sub-linearly. Unfortunately, Mannor et al. [8] prove that this is impossible even when both $f^t(\cdot)$ and $g^t(\cdot)$ are simple linear functions.

Given the impossibility results shown by Mannor et al. in [8], this paper considers OCO where the constraint functions $g^t(x)$ are not arbitrarily varying but are independently and identically distributed (i.i.d.), generated from an unknown probability model. More specifically, this paper considers OCO with stochastic constraints $\mathcal{X} = \{x\in\mathcal{X}_0 : E_\omega[g_k(x;\omega)]\leq 0, k\in\{1,2,\ldots,m\}\}$, where $\mathcal{X}_0$ is a known fixed set; the expressions of the stochastic constraints $E_\omega[g_k(x;\omega)]$ (involving expectations with respect to $\omega$ from an unknown distribution) are unknown; and the subscripts $k\in\{1,2,\ldots,m\}$ indicate the possibility of multiple functional constraints. In OCO with stochastic constraints, the decision maker receives loss function $f^t(x)$ and i.i.d. constraint function realizations $g_k^t(x) \triangleq g_k(x;\omega(t))$ at each round $t$. However, the expressions of $g_k^t(\cdot)$ and $f^t(\cdot)$ are disclosed to the decision maker only after the decision $x(t)\in\mathcal{X}_0$ is chosen. This setting arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. For example, in online routing (with link capacity constraints) in wireless networks [8], each link capacity is not a fixed constant (as in wireline networks) but an i.i.d. random variable, since wireless channels are stochastically time-varying by nature [9]. OCO with stochastic constraints also covers important special cases such as OCO with long term constraints [10], [11], [12], stochastic constrained convex optimization [13], and deterministic constrained convex optimization [14].

Let $x^* = \operatorname{argmin}_{\{x\in\mathcal{X}_0: E[g_k(x;\omega)]\leq 0, \forall k\in\{1,2,\ldots,m\}\}}\sum_{t=1}^{T} f^t(x)$ be the best fixed decision in hindsight (knowing all loss functions $f^t(x)$ and the distribution of the stochastic constraint functions $g_k(x;\omega)$). Thus, $x^*$ minimizes the $T$-round cumulative loss and satisfies all stochastic constraints in expectation, which also implies $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} g_k^t(x^*)\leq 0$ almost surely by the strong law of large numbers. Our goal is to develop dynamic learning algorithms that guarantee that both the regret $\sum_{t=1}^{T} f^t(x(t)) - \sum_{t=1}^{T} f^t(x^*)$ and the constraint violations $\sum_{t=1}^{T} g_k^t(x(t))$ grow sub-linearly. Note that Zinkevich's algorithm in (1) is not applicable to OCO with stochastic constraints since $\mathcal{X}$ is unknown and it can happen that $\mathcal{X}(t) = \{x\in\mathcal{X}_0: g_k(x;\omega(t))\leq 0, \forall k\in\{1,2,\ldots,m\}\} = \emptyset$ for certain realizations $\omega(t)$, so that the projections $P_{\mathcal{X}}[\cdot]$ or $P_{\mathcal{X}(t)}[\cdot]$ required in (1) are not even well-defined.

Our Contributions: This paper solves online convex optimization with stochastic constraints. In particular, we propose a new learning algorithm that is proven to achieve $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations. Along the way, we develop new techniques for stochastic analysis, e.g., Lemma 5, and improve upon state-of-the-art results in the following special cases.



• OCO with long term constraints: This is a special case where each $g_k^t(x)\equiv g_k(x)$ is known and does not depend on time. Note that $\mathcal{X} = \{x\in\mathcal{X}_0: g_k(x)\leq 0, \forall k\in\{1,2,\ldots,m\}\}$ can be complicated while $\mathcal{X}_0$ might be a simple hypercube. To avoid the high complexity of projecting onto $\mathcal{X}$ as in Zinkevich's algorithm, the works [10], [11], [12] develop low-complexity algorithms that use projections onto the simpler set $\mathcal{X}_0$, allowing $g_k(x(t)) > 0$ in certain rounds but ensuring $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} g_k(x(t))\leq 0$. The best existing performance is $O(T^{\max\{\beta,1-\beta\}})$ regret and $O(T^{1-\beta/2})$ constraint violations, where $\beta\in(0,1)$ is an algorithm parameter [12]. This gives $O(\sqrt{T})$ regret with worse $O(T^{3/4})$ constraint violations, or $O(\sqrt{T})$ constraint violations with worse $O(T)$ regret. In contrast, our algorithm, which only uses projections onto $\mathcal{X}_0$ as shown in Lemma 1, achieves $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violations simultaneously.¹

• Stochastic constrained convex optimization: This is a special case where each $f^t(x)$ is i.i.d. generated from an unknown distribution. This problem has many applications in operations research and machine learning, such as Neyman-Pearson classification and risk-mean portfolio optimization. The work [13] develops a (batch) offline algorithm that produces a solution with high probability performance guarantees only after sampling the problem sufficiently many times; during the sampling process there are no performance guarantees. The work [17] proposes a stochastic approximation based (batch) offline algorithm for stochastic convex optimization with one single stochastic functional inequality constraint. In contrast, our algorithm is an online algorithm with online performance guarantees.²

• Deterministic constrained convex optimization: This is a special case where each $f^t(x)\equiv f(x)$ and $g_k^t(x)\equiv g_k(x)$ are known and do not depend on time. In this case, the goal is to develop a fast algorithm that converges to a good solution (with small error) in a small number of iterations; our algorithm with $O(\sqrt{T})$ regret and constraint violations is equivalent to an iterative numerical algorithm with $O(1/\sqrt{T})$ convergence rate. Our algorithm is subgradient based and does not require smoothness or differentiability of the convex program. Recall that Nesterov [14] shows that $O(1/\sqrt{T})$ is the best possible convergence rate of any subgradient/gradient based algorithm for non-smooth convex programs. Thus, our algorithm is optimal. The primal-dual subgradient method considered in [18] has the same $O(1/\sqrt{T})$ convergence rate but requires an upper bound on the optimal Lagrange multipliers, which is typically unknown in practice. Our algorithm does not require such bounds to be known.

¹ By adapting the methodology presented in this paper, our other report [15] developed a different algorithm that can only solve the special case problem "OCO with long term constraints" but can achieve $O(\sqrt{T})$ regret and $O(1)$ constraint violations. The current paper also relaxes a deterministic Slater condition assumption required in our other technical report [16] for OCO with time-varying constraints, which requires the existence of a constant $\epsilon > 0$ and a fixed point $\hat{x}\in\mathcal{X}_0$ such that $g_k(\hat{x};\omega(t))\leq -\epsilon, \forall k\in\{1,2,\ldots,m\}$, for all $\omega(t)\in\Omega$. By relaxing the deterministic Slater condition to the stochastic Slater condition in Assumption 2, the current paper even allows the possibility that $g_k(x;\omega(t))\leq 0$ is infeasible for certain $\omega(t)\in\Omega$. However, under the deterministic Slater condition, our technical report [16] shows that if the regret is defined as the cumulative loss difference between our algorithm and the best fixed point from the set $\mathcal{A} = \{x\in\mathcal{X}: g_k(x;\omega)\leq 0, \forall k\in\{1,2,\ldots,m\}, \forall\omega\in\Omega\}$, which is called a common subset in [16], then our algorithm can achieve $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violations simultaneously even if the constraint functions $g_k^t(x)$ are arbitrarily time-varying (not necessarily i.i.d.). That is, by imposing the additional deterministic Slater condition and restricting the regret to be defined over the common subset $\mathcal{A}$, our algorithm can escape the impossibility result of Mannor et al. [8]. To the best of our knowledge, this is the first time that specific conditions have been proposed to enable sublinear regret and constraint violations simultaneously for OCO with arbitrarily time-varying constraint functions. Since the current paper focuses on OCO with stochastic constraints, we refer interested readers to Section IV in [16] for results on OCO with arbitrarily time-varying constraints.

² While the analysis of this paper assumes a Slater-type condition, note that our other work [16] shows that the Slater condition is not needed in the special case when both the objective and constraint functions vary i.i.d. over time. (This also includes the case of deterministic constrained convex optimization, since processes that do not vary with time are indeed i.i.d. processes.) In such scenarios, Section VI in [16] shows that our algorithm works more generally whenever a Lagrange multiplier vector attaining strong duality exists.

II. FORMULATION AND NEW ALGORITHM

Let $\mathcal{X}_0$ be a known fixed compact convex set. Let $g_k(x;\omega(t))$, $k\in\{1,2,\ldots,m\}$, be sequences of functions that are i.i.d. realizations of the stochastic constraint functions $\tilde{g}_k(x) \triangleq E_\omega[g_k(x;\omega)]$, where the random variable $\omega\in\Omega$ has an unknown distribution. That is, the $\omega(t)$ are i.i.d. samples of $\omega$. Let $f^t(x)$ be a sequence of convex functions that can vary arbitrarily as long as each $f^t(\cdot)$ is independent of all $\omega(\tau)$ with $\tau\geq t+1$, so that we are unable to predict future constraint functions based on knowledge of the current loss function. For example, each $f^t(\cdot)$ can even be chosen adversarially based on $\omega(\tau), \tau\in\{1,2,\ldots,t\}$ and actions $x(\tau), \tau\in\{1,2,\ldots,t\}$. For each $\omega\in\Omega$, we assume $g_k(x;\omega)$ is convex with respect to $x\in\mathcal{X}_0$. At the beginning of each round $t$, neither the loss function $f^t(x)$ nor the constraint function realizations $g_k(x;\omega(t))$ are known to the decision maker. However, the decision maker still needs to make a decision $x(t)\in\mathcal{X}_0$ for round $t$; after that, $f^t(x)$ and $g_k(x;\omega(t))$ are disclosed at the end of round $t$.

For convenience, we often suppress the dependence of each $g_k(x;\omega(t))$ on $\omega(t)$ and write $g_k^t(x) = g_k(x;\omega(t))$. Recall $\tilde{g}_k(x) = E_\omega[g_k(x;\omega)]$, where the expectation is with respect to $\omega$. Define $\mathcal{X} = \{x\in\mathcal{X}_0: \tilde{g}_k(x) = E[g_k(x;\omega)]\leq 0, \forall k\in\{1,2,\ldots,m\}\}$. We further define the stacked vector of the multiple functions $g_1^t(x),\ldots,g_m^t(x)$ as $\mathbf{g}^t(x) = [g_1^t(x),\ldots,g_m^t(x)]^T$ and define $\tilde{\mathbf{g}}(x) = [E_\omega[g_1(x;\omega)],\ldots,E_\omega[g_m(x;\omega)]]^T$. We use $\|\cdot\|$ to denote the Euclidean norm of a vector. Throughout this paper, we have the following assumptions:

Assumption 1 (Basic Assumptions).
• Loss functions $f^t(x)$ and constraint functions $g_k(x;\omega)$ have bounded subgradients on $\mathcal{X}_0$. That is, there exist $D_1 > 0$ and $D_2 > 0$ such that $\|\nabla f^t(x)\|\leq D_1$ for all $x\in\mathcal{X}_0$ and all $t\in\{0,1,\ldots\}$, and $\|\nabla g_k(x;\omega)\|\leq D_2$ for all $x\in\mathcal{X}_0$, all $\omega\in\Omega$ and all $k\in\{1,2,\ldots,m\}$.³
• There exists a constant $G > 0$ such that $\|\mathbf{g}(x;\omega)\|\leq G$ for all $x\in\mathcal{X}_0$ and all $\omega\in\Omega$.
• There exists a constant $R > 0$ such that $\|x-y\|\leq R$ for all $x,y\in\mathcal{X}_0$.

Assumption 2 (The Slater Condition). There exist $\epsilon > 0$ and $\hat{x}\in\mathcal{X}_0$ such that $\tilde{g}_k(\hat{x}) = E_\omega[g_k(\hat{x};\omega)]\leq -\epsilon$ for all $k\in\{1,2,\ldots,m\}$.

³ We use $\nabla h(x)$ to denote a subgradient of a convex function $h$ at the point $x$. If the gradient exists, then $\nabla h(x)$ is the gradient. Nothing in this paper requires gradients to exist: we only need the basic subgradient inequality $h(y)\geq h(x) + [\nabla h(x)]^T[y-x]$ for all $x,y\in\mathcal{X}_0$.

A. New Algorithm

Now consider the algorithm described in Algorithm 1. This algorithm chooses $x(t+1)$ as the decision for round $t+1$ based on $f^t(\cdot)$ and $\mathbf{g}^t(\cdot)$, without requiring $f^{t+1}(\cdot)$ or $\mathbf{g}^{t+1}(\cdot)$. For each stochastic constraint function $g_k(x;\omega)$, we introduce $Q_k(t)$ and call it a virtual queue since its dynamics resemble a queue dynamic. The next lemma shows that the $x(t+1)$ update in (2) can be implemented via a simple projection onto $\mathcal{X}_0$.


Algorithm 1. Let $V, \alpha$ be constant algorithm parameters. Choose $x(1)\in\mathcal{X}_0$ arbitrarily and let $Q_k(1) = 0, \forall k\in\{1,2,\ldots,m\}$. At the end of each round $t\in\{1,2,\ldots\}$, observe $f^t(\cdot)$ and $\mathbf{g}^t(\cdot)$ and do the following:

• Choose $x(t+1)$ that solves

$$\min_{x\in\mathcal{X}_0}\; V[\nabla f^t(x(t))]^T[x-x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^T[x-x(t)] + \alpha\|x-x(t)\|^2 \qquad (2)$$

as the decision for the next round $t+1$, where $\nabla f^t(x(t))$ is a subgradient of $f^t(x)$ at the point $x = x(t)$ and $\nabla g_k^t(x(t))$ is a subgradient of $g_k^t(x)$ at the point $x = x(t)$.

• Update each virtual queue $Q_k(t+1), \forall k\in\{1,2,\ldots,m\}$, via

$$Q_k(t+1) = \max\big\{Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)],\, 0\big\}, \qquad (3)$$

where $\max\{\cdot,\cdot\}$ takes the larger of its two arguments.

Lemma 1. The $x(t+1)$ update in (2) is given by

$$x(t+1) = P_{\mathcal{X}_0}\Big[x(t) - \frac{1}{2\alpha}d(t)\Big], \quad \text{where } d(t) = V\nabla f^t(x(t)) + \sum_{k=1}^{m} Q_k(t)\nabla g_k^t(x(t))$$

and $P_{\mathcal{X}_0}[\cdot]$ is the projection onto the convex set $\mathcal{X}_0$.

Proof. The projection by definition is $\min_{x\in\mathcal{X}_0}\|x - [x(t) - \frac{1}{2\alpha}d(t)]\|^2$, which is equivalent to (2).
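To make the update concrete, here is a minimal Python sketch of one round of Algorithm 1, written directly from Lemma 1 and the virtual queue update (3). The projection oracle project_X0 and the subgradient oracles are placeholders that a user would supply for their own $\mathcal{X}_0$, $f^t$, and $g_k^t$; all names are illustrative, not part of the paper.

```python
import numpy as np

def algorithm1_step(x, Q, grad_f, g_vals, grads_g, V, alpha, project_X0):
    """One round of Algorithm 1.

    x          : current decision x(t), shape (n,)
    Q          : virtual queues Q(t), shape (m,)
    grad_f     : subgradient of f^t at x(t), shape (n,)
    g_vals     : constraint values g_k^t(x(t)), shape (m,)
    grads_g    : subgradients of g_k^t at x(t), shape (m, n)
    project_X0 : Euclidean projection onto X0 (user supplied)
    """
    # Lemma 1: x(t+1) = P_X0[ x(t) - d(t)/(2*alpha) ],
    # with d(t) = V*grad_f + sum_k Q_k(t)*grad_g_k.
    d = V * grad_f + Q @ grads_g
    x_next = project_X0(x - d / (2.0 * alpha))
    # Virtual queue update (3):
    # Q_k(t+1) = max{ Q_k(t) + g_k^t(x(t)) + grad_g_k^T (x(t+1)-x(t)), 0 }.
    Q_next = np.maximum(Q + g_vals + grads_g @ (x_next - x), 0.0)
    return x_next, Q_next
```

For example, with $\mathcal{X}_0 = [0,1]^n$ one could take project_X0 = lambda y: np.clip(y, 0.0, 1.0), and Theorems 1 and 2 below suggest the parameter choice $V = \sqrt{T}$, $\alpha = T$ for a horizon $T$.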

B. Intuitions of Algorithm 1

Note that if there are no stochastic constraints $g_k^t(x)$, i.e., $\mathcal{X} = \mathcal{X}_0$, then Algorithm 1 has $Q_k(t)\equiv 0, \forall t$, and becomes Zinkevich's algorithm with $\gamma = \frac{V}{2\alpha}$ in (1), since

$$x(t+1) \overset{(a)}{=} \operatorname{argmin}_{x\in\mathcal{X}_0}\; \underbrace{V[\nabla f^t(x(t))]^T[x-x(t)]}_{\text{penalty}} + \alpha\|x-x(t)\|^2 \overset{(b)}{=} P_{\mathcal{X}_0}\Big[x(t) - \frac{V}{2\alpha}\nabla f^t(x(t))\Big], \qquad (4)$$

where (a) follows from (2); and (b) follows from Lemma 1 by noting that $d(t) = V\nabla f^t(x(t))$. Call the term marked by an underbrace in (4) the penalty. Thus, Zinkevich's algorithm minimizes the penalty term and is a special case of Algorithm 1 used to solve OCO over $\mathcal{X}_0$.

Let $\mathbf{Q}(t) = [Q_1(t),\ldots,Q_m(t)]^T$ be the vector of virtual queue backlogs. Let $L(t) = \frac{1}{2}\|\mathbf{Q}(t)\|^2$ be a Lyapunov function and define the Lyapunov drift

$$\Delta(t) = L(t+1) - L(t) = \frac{1}{2}\big[\|\mathbf{Q}(t+1)\|^2 - \|\mathbf{Q}(t)\|^2\big]. \qquad (5)$$

The intuition behind Algorithm 1 is to choose $x(t+1)$ to minimize an upper bound of the expression

$$\underbrace{\Delta(t)}_{\text{drift}} + \underbrace{V[\nabla f^t(x(t))]^T[x-x(t)]}_{\text{penalty}} + \alpha\|x-x(t)\|^2. \qquad (6)$$

The intention to minimize the penalty is natural, since Zinkevich's algorithm (for OCO without stochastic constraints) minimizes the penalty, while the intention to minimize the drift is motivated by observing that $g_k^t(x(t))$ is accumulated into the queue $Q_k(t+1)$ introduced in (3), so we intend to keep the queue backlogs small. The drift $\Delta(t)$ can be complicated and is in general non-convex. The next lemma provides a simple upper bound on $\Delta(t)$ and follows directly from (3).

Lemma 2. At each round $t\in\{1,2,\ldots\}$, Algorithm 1 guarantees

$$\Delta(t) \leq \sum_{k=1}^{m} Q_k(t)\big[g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\big] + \frac{1}{2}(G+\sqrt{m}D_2R)^2, \qquad (7)$$

where $m$ is the number of constraint functions; and $D_2$, $G$ and $R$ are defined in Assumption 1.


Proof. Recall that for any $b\in\mathbb{R}$, if $a = \max\{b,0\}$ then $a^2\leq b^2$. Fix $k\in\{1,2,\ldots,m\}$. The virtual queue update equation $Q_k(t+1) = \max\{Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)], 0\}$ implies that

$$\frac{1}{2}[Q_k(t+1)]^2 \leq \frac{1}{2}\big[Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\big]^2 \overset{(a)}{=} \frac{1}{2}[Q_k(t)]^2 + Q_k(t)\big[g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\big] + \frac{1}{2}[h_k]^2, \qquad (8)$$

where (a) follows by defining $h_k = g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]$. Define $s = [s_1,\ldots,s_m]^T$, where $s_k = [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)], \forall k\in\{1,2,\ldots,m\}$; and $h = [h_1,\ldots,h_m]^T = \mathbf{g}^t(x(t)) + s$. Then,

$$\|h\| \overset{(a)}{\leq} \|\mathbf{g}^t(x(t))\| + \|s\| \overset{(b)}{\leq} G + \sqrt{\sum_{k=1}^{m} D_2^2R^2} = G + \sqrt{m}D_2R, \qquad (9)$$

where (a) follows from the triangle inequality; and (b) follows from the definition of the Euclidean norm, the Cauchy-Schwarz inequality and Assumption 1. Summing (8) over $k\in\{1,2,\ldots,m\}$ yields

$$\frac{1}{2}\|\mathbf{Q}(t+1)\|^2 \leq \frac{1}{2}\|\mathbf{Q}(t)\|^2 + \sum_{k=1}^{m} Q_k(t)\big[g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\big] + \frac{1}{2}\|h\|^2 \overset{(a)}{\leq} \frac{1}{2}\|\mathbf{Q}(t)\|^2 + \sum_{k=1}^{m} Q_k(t)\big[g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\big] + \frac{1}{2}(G+\sqrt{m}D_2R)^2,$$

where (a) follows from (9). Rearranging the terms yields the desired result.

At the end of round $t$, $\sum_{k=1}^{m} Q_k(t)g_k^t(x(t)) + \frac{1}{2}[G+\sqrt{m}D_2R]^2$ is a given constant that is not affected by the decision $x(t+1)$. The algorithm decision in (2) is now transparent: $x(t+1)$ is chosen to minimize the drift-plus-penalty expression (6), where $\Delta(t)$ is approximated by the bound in (7).

C. Preliminary Analysis and More Intuitions of Algorithm 1

The next lemma relates constraint violations and virtual queue values and follows directly from (3).

Lemma 3. For any $T\geq 1$, Algorithm 1 guarantees $\sum_{t=1}^{T} g_k^t(x(t)) \leq \|\mathbf{Q}(T+1)\| + D_2\sum_{t=1}^{T}\|x(t+1)-x(t)\|, \forall k\in\{1,2,\ldots,m\}$, where $D_2$ is defined in Assumption 1.

Proof. Fix $k\in\{1,2,\ldots,m\}$ and $T\geq 1$. For any $t\in\{0,1,\ldots\}$, (3) in Algorithm 1 gives:

$$Q_k(t+1) = \max\{Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)], 0\} \geq Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \overset{(a)}{\geq} Q_k(t) + g_k^t(x(t)) - \|\nabla g_k^t(x(t))\|\|x(t+1)-x(t)\| \overset{(b)}{\geq} Q_k(t) + g_k^t(x(t)) - D_2\|x(t+1)-x(t)\|,$$

where (a) follows from the Cauchy-Schwarz inequality and (b) follows from Assumption 1. Rearranging terms yields $g_k^t(x(t)) \leq Q_k(t+1) - Q_k(t) + D_2\|x(t+1)-x(t)\|$.


Summing over $t\in\{1,\ldots,T\}$ yields

$$\sum_{t=1}^{T} g_k^t(x(t)) \leq Q_k(T+1) - Q_k(1) + D_2\sum_{t=1}^{T}\|x(t+1)-x(t)\| \overset{(a)}{=} Q_k(T+1) + D_2\sum_{t=1}^{T}\|x(t+1)-x(t)\| \leq \|\mathbf{Q}(T+1)\| + D_2\sum_{t=1}^{T}\|x(t+1)-x(t)\|,$$

where (a) follows from the fact that $Q_k(1) = 0$.

Recall that a function $h:\mathcal{X}_0\to\mathbb{R}$ is said to be $c$-strongly convex if $h(x) - \frac{c}{2}\|x\|^2$ is convex over $x\in\mathcal{X}_0$. By the definition of strongly convex functions, it is easy to see that if $\phi:\mathcal{X}_0\to\mathbb{R}$ is a convex function, then for any constant $c > 0$ and any constant vector $x_0$, the function $\phi(x) + \frac{c}{2}\|x-x_0\|^2$ is $c$-strongly convex. Further, it is known that if $h:\mathcal{X}_0\to\mathbb{R}$ is a $c$-strongly convex function that is minimized at a point $x^{\min}\in\mathcal{X}_0$, then (see, for example, Corollary 1 in [19]):

$$h(x^{\min}) \leq h(x) - \frac{c}{2}\|x-x^{\min}\|^2 \quad \forall x\in\mathcal{X}_0. \qquad (10)$$

Note that the expression minimized in (2) in Algorithm 1 is strongly convex with modulus $2\alpha$ and $x(t+1)$ is chosen to minimize it. Thus, the next lemma follows.

Lemma 4. Let $z\in\mathcal{X}_0$ be arbitrary. For all $t\geq 1$, Algorithm 1 guarantees

$$V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] + \alpha\|x(t+1)-x(t)\|^2 \leq V[\nabla f^t(x(t))]^T[z-x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^T[z-x(t)] + \alpha\|z-x(t)\|^2 - \alpha\|z-x(t+1)\|^2.$$

The next corollary follows by taking $z = x(t)$ in Lemma 4.

Corollary 1. For all $t\geq 1$, Algorithm 1 guarantees

$$\|x(t+1)-x(t)\| \leq \frac{VD_1}{2\alpha} + \frac{\sqrt{m}D_2}{2\alpha}\|\mathbf{Q}(t)\|.$$

Proof. Fix $t\geq 1$. Note that $x(t)\in\mathcal{X}_0$. Taking $z = x(t)$ in Lemma 4 yields

$$V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] + \alpha\|x(t+1)-x(t)\|^2 \leq -\alpha\|x(t)-x(t+1)\|^2.$$

Rearranging terms and cancelling common terms yields

$$2\alpha\|x(t+1)-x(t)\|^2 \leq -V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] - \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \overset{(a)}{\leq} V\|\nabla f^t(x(t))\|\|x(t+1)-x(t)\| + \|\mathbf{Q}(t)\|\sqrt{\sum_{k=1}^{m}\|\nabla g_k^t(x(t))\|^2}\,\|x(t+1)-x(t)\| \overset{(b)}{\leq} VD_1\|x(t+1)-x(t)\| + \sqrt{m}D_2\|\mathbf{Q}(t)\|\|x(t+1)-x(t)\|,$$

where (a) follows from the Cauchy-Schwarz inequality (note that the second term on the right side applies the Cauchy-Schwarz inequality twice); and (b) follows from Assumption 1. Thus, we have

$$\|x(t+1)-x(t)\| \leq \frac{VD_1}{2\alpha} + \frac{\sqrt{m}D_2}{2\alpha}\|\mathbf{Q}(t)\|.$$


The next corollary follows directly from Lemma 3 and Corollary 1 and shows that the constraint violations are ultimately bounded by the sequence $\|\mathbf{Q}(t)\|, t\in\{1,2,\ldots,T+1\}$.

Corollary 2. For any $T\geq 1$, Algorithm 1 guarantees

$$\sum_{t=1}^{T} g_k^t(x(t)) \leq \|\mathbf{Q}(T+1)\| + \frac{VTD_1D_2}{2\alpha} + \frac{\sqrt{m}D_2^2}{2\alpha}\sum_{t=1}^{T}\|\mathbf{Q}(t)\|, \quad \forall k\in\{1,2,\ldots,m\},$$

where $D_1$ and $D_2$ are defined in Assumption 1.

This corollary further justifies why Algorithm 1 intends to minimize the drift $\Delta(t)$. Recall that controlled drift can often lead to boundedness of a stochastic process, as illustrated in the next section. Thus, the intuition behind minimizing the drift $\Delta(t)$ is to obtain small bounds on $\|\mathbf{Q}(t)\|$.

III. EXPECTED PERFORMANCE ANALYSIS OF ALGORITHM 1

This section shows that if we choose $V = \sqrt{T}$ and $\alpha = T$ in Algorithm 1, then both the expected regret and the expected constraint violations are $O(\sqrt{T})$.

A. A Drift Lemma for Stochastic Processes

Let $\{Z(t), t\geq 0\}$ be a discrete time stochastic process adapted⁴ to a filtration $\{\mathcal{F}(t), t\geq 0\}$. For example, $Z(t)$ can be a random walk, a Markov chain or a martingale. Drift analysis is the method of deducing properties, e.g., recurrence, ergodicity, or boundedness, of $Z(t)$ from its drift $E[Z(t+1)-Z(t)|\mathcal{F}(t)]$. See [21], [22] for more discussion of and applications of drift analysis. This paper proposes a new drift analysis lemma for stochastic processes as follows:

Lemma 5. Let $\{Z(t), t\geq 1\}$ be a discrete time stochastic process adapted to a filtration $\{\mathcal{F}(t), t\geq 1\}$. Suppose there exist an integer $t_0 > 0$ and real constants $\theta\in\mathbb{R}$, $\delta_{\max} > 0$ and $0 < \zeta\leq\delta_{\max}$ such that

$$|Z(t+1)-Z(t)| \leq \delta_{\max}, \qquad (11)$$

$$E[Z(t+t_0)-Z(t)|\mathcal{F}(t)] \leq \begin{cases} t_0\delta_{\max}, & \text{if } Z(t) < \theta \\ -t_0\zeta, & \text{if } Z(t)\geq\theta \end{cases} \qquad (12)$$

hold for all $t\in\{1,2,\ldots\}$. Then, the following hold:

1) $E[Z(t)] \leq \theta + t_0\frac{4\delta_{\max}^2}{\zeta}\log\Big[1 + \frac{8\delta_{\max}^2}{\zeta^2}e^{\zeta/(4\delta_{\max})}\Big], \forall t\in\{1,2,\ldots\}$.

2) For any constant $0 < \mu < 1$, we have $\Pr(Z(t)\geq z)\leq\mu, \forall t\in\{1,2,\ldots\}$, where $z = \theta + t_0\frac{4\delta_{\max}^2}{\zeta}\log\Big[1 + \frac{8\delta_{\max}^2}{\zeta^2}e^{\zeta/(4\delta_{\max})}\Big] + t_0\frac{4\delta_{\max}^2}{\zeta}\log\big(\frac{1}{\mu}\big)$.

Proof. See Appendix A.

The above lemma provides both expected and high probability bounds for stochastic processes based on a drift condition. It will be used to establish upper bounds on the virtual queues $\|\mathbf{Q}(t)\|$, which further lead to expected and high probability constraint violation bounds for our algorithm. For a given stochastic process $Z(t)$, it is possible to show that the drift condition (12) holds for multiple $t_0$ with different $\zeta$ and $\theta$. In fact, we will show in Lemma 7 that $\|\mathbf{Q}(t)\|$ yielded by Algorithm 1 satisfies (12) for any integer $t_0 > 0$ by selecting $\zeta$ and $\theta$ according to $t_0$. One-step drift conditions, corresponding to the special case $t_0 = 1$ of Lemma 5, have been previously considered in [22], [23]. However, Lemma 5 (with general $t_0 > 0$) allows us to choose the best $t_0$ in the performance analysis such that sublinear regret and constraint violation bounds are possible.

⁴ A random variable $Y$ is said to be adapted to a $\sigma$-algebra $\mathcal{F}$ if $Y$ is $\mathcal{F}$-measurable, in which case we often write $Y\in\mathcal{F}$. Similarly, a random process $\{Z(t)\}$ is adapted to a filtration $\{\mathcal{F}(t)\}$ if $Z(t)\in\mathcal{F}(t), \forall t$. See e.g. [20].
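As a quick sanity check on Lemma 5 (not part of the paper's analysis; all constants below are illustrative), the following Python simulation runs a reflected random walk that can drift upward below a threshold $\theta$ and drifts downward above it, matching the shape of condition (12) with $t_0 = 1$. The empirical mean of $Z(t)$ stays bounded near $\theta$ regardless of the horizon, as part (1) of the lemma predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, delta_max, zeta = 10.0, 1.0, 0.5  # illustrative constants

T = 100_000
Z = np.zeros(T)
for t in range(T - 1):
    if Z[t] < theta:
        # Below theta: any step of size <= delta_max is allowed by (12).
        step = rng.uniform(0.0, delta_max)
    else:
        # At or above theta: mean step is -delta_max*zeta = -zeta here,
        # so E[Z(t+1) - Z(t)] <= -zeta as required by (12).
        step = rng.choice([-delta_max, delta_max],
                          p=[(1 + zeta) / 2, (1 - zeta) / 2])
    Z[t + 1] = max(Z[t] + step, 0.0)  # reflect at zero

print("empirical mean of Z(t):", Z.mean())  # stays O(theta), does not grow with T
print("empirical max  of Z(t):", Z.max())
```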


B. Expected Constraint Violation Analysis

Define the filtration $\{\mathcal{W}(t), t\geq 0\}$ with $\mathcal{W}(0) = \{\emptyset,\Omega\}$ and $\mathcal{W}(t) = \sigma(\omega(1),\ldots,\omega(t))$ being the $\sigma$-algebra generated by the random samples $\{\omega(1),\ldots,\omega(t)\}$ up to round $t$. From the update rule in Algorithm 1, we observe that $x(t+1)$ is a deterministic function of $f^t(\cdot)$, $\mathbf{g}(\cdot;\omega(t))$ and $\mathbf{Q}(t)$, where $\mathbf{Q}(t)$ is in turn a deterministic function of $\mathbf{Q}(t-1)$, $\mathbf{g}(\cdot;\omega(t-1))$, $x(t)$ and $x(t-1)$. By induction, it is easy to show that $\sigma(x(t))\subseteq\mathcal{W}(t-1)$ and $\sigma(\mathbf{Q}(t))\subseteq\mathcal{W}(t-1)$ for all $t\geq 1$, where $\sigma(Y)$ denotes the $\sigma$-algebra generated by the random variable $Y$. For fixed $t\geq 1$, since $\mathbf{Q}(t)$ is fully determined by $\omega(\tau), \tau\in\{1,2,\ldots,t-1\}$, and the $\omega(t)$ are i.i.d., we know that $\mathbf{g}^t(x)$ is independent of $\mathbf{Q}(t)$. This is formally summarized in the next lemma.

Lemma 6. If $x^*\in\mathcal{X}_0$ satisfies $\tilde{\mathbf{g}}(x^*) = E_\omega[\mathbf{g}(x^*;\omega)]\leq\mathbf{0}$, then Algorithm 1 guarantees:

$$E[Q_k(t)g_k^t(x^*)] \leq 0, \quad \forall k\in\{1,2,\ldots,m\}, \forall t\geq 1. \qquad (13)$$

Proof. Fix $k\in\{1,2,\ldots,m\}$ and $t\geq 1$. Since $g_k^t(x^*) = g_k(x^*;\omega(t))$ is independent of $Q_k(t)$, which is determined by $\{\omega(1),\ldots,\omega(t-1)\}$, it follows that $E[Q_k(t)g_k^t(x^*)] = E[Q_k(t)]E[g_k^t(x^*)] \overset{(a)}{\leq} 0$, where (a) follows from the facts that $E[g_k^t(x^*)]\leq 0$ and $Q_k(t)\geq 0$.

To establish a bound on the constraint violations, by Corollary 2, it suffices to derive upper bounds for $\|\mathbf{Q}(t)\|$. In this subsection, we derive such bounds by applying the drift lemma (Lemma 5) developed at the beginning of this section. The next lemma shows that the random process $Z(t) = \|\mathbf{Q}(t)\|$ satisfies the conditions in Lemma 5.

Lemma 7. Let $t_0 > 0$ be an arbitrary integer. At each round $t\in\{1,2,\ldots\}$ in Algorithm 1, the following hold:

$$\big|\|\mathbf{Q}(t+1)\| - \|\mathbf{Q}(t)\|\big| \leq G + \sqrt{m}D_2R,$$

$$E[\|\mathbf{Q}(t+t_0)\| - \|\mathbf{Q}(t)\|\,\big|\,\mathcal{W}(t-1)] \leq \begin{cases} t_0(G+\sqrt{m}D_2R), & \text{if } \|\mathbf{Q}(t)\| < \theta \\ -t_0\frac{\epsilon}{2}, & \text{if } \|\mathbf{Q}(t)\|\geq\theta \end{cases}$$

where $\theta = \frac{\epsilon}{2}t_0 + t_0(G+\sqrt{m}D_2R) + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}$; $m$ is the number of constraint functions; $D_1, D_2, G$ and $R$ are defined in Assumption 1; and $\epsilon$ is defined in Assumption 2. (Note that $\epsilon < G$ by the definition of $G$.)

Proof. See Appendix B.

Lemma 7 allows us to apply Lemma 5 to the random process $Z(t) = \|\mathbf{Q}(t)\|$ and obtain $E[\|\mathbf{Q}(t)\|] = O(\sqrt{T}), \forall t$, by taking $t_0 = \lceil\sqrt{T}\rceil$, $V = \sqrt{T}$ and $\alpha = T$, where $\lceil\sqrt{T}\rceil$ represents the smallest integer no less than $\sqrt{T}$. By Corollary 2, this further implies the expected constraint violation bound $E[\sum_{t=1}^{T} g_k^t(x(t))]\leq O(\sqrt{T})$, as summarized in the next theorem.

Theorem 1 (Expected Constraint Violation Bound). If $V = \sqrt{T}$ and $\alpha = T$ in Algorithm 1, then for all $T\geq 1$,

$$E\Big[\sum_{t=1}^{T} g_k^t(x(t))\Big] \leq O(\sqrt{T}), \quad \forall k\in\{1,2,\ldots,m\}, \qquad (14)$$

where the expectation is taken with respect to all $\omega(t)$.

Proof. Define the random process $Z(t) = \|\mathbf{Q}(t)\|$ and the filtration $\mathcal{F}(t) = \mathcal{W}(t-1)$. Note that $Z(t)$ is adapted to $\mathcal{F}(t)$. By Lemma 7, $Z(t)$ satisfies the conditions in Lemma 5 with $\delta_{\max} = G+\sqrt{m}D_2R$, $\zeta = \frac{\epsilon}{2}$ and $\theta = \frac{\epsilon}{2}t_0 + t_0(G+\sqrt{m}D_2R) + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}$. Thus, by part (1) of Lemma 5, for all $t\in\{1,2,\ldots\}$, we have

$$E[\|\mathbf{Q}(t)\|] \leq \frac{\epsilon}{2}t_0 + t_0(G+\sqrt{m}D_2R) + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon} + t_0B,$$

where $B = \frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big[1 + \frac{32(G+\sqrt{m}D_2R)^2}{\epsilon^2}e^{\epsilon/[8(G+\sqrt{m}D_2R)]}\Big]$ is an absolute constant that does not depend on the algorithm parameters. Taking $t_0 = \lceil\sqrt{T}\rceil$, $V = \sqrt{T}$ and $\alpha = T$, for all $t\in\{1,2,\ldots\}$, we have

$$E[\|\mathbf{Q}(t)\|] \leq \frac{\epsilon}{2}\lceil\sqrt{T}\rceil + \lceil\sqrt{T}\rceil(G+\sqrt{m}D_2R) + \frac{2TR^2}{\lceil\sqrt{T}\rceil\epsilon} + \frac{2\sqrt{T}D_1R + (G+\sqrt{m}D_2R)^2}{\epsilon} + \lceil\sqrt{T}\rceil B = O(\sqrt{T}).$$

Fix $T\geq 1$. By Corollary 2 (with $V = \sqrt{T}$ and $\alpha = T$), we have

$$\sum_{t=1}^{T} g_k^t(x(t)) \leq \|\mathbf{Q}(T+1)\| + \frac{\sqrt{T}D_1D_2}{2} + \frac{\sqrt{m}D_2^2}{2T}\sum_{t=1}^{T}\|\mathbf{Q}(t)\|, \quad \forall k\in\{1,2,\ldots,m\}.$$

Taking expectations on both sides and substituting $E[\|\mathbf{Q}(t)\|] = O(\sqrt{T}), \forall t$, yields

$$E\Big[\sum_{t=1}^{T} g_k^t(x(t))\Big] \leq O(\sqrt{T}).$$

C. Expected Regret Analysis

The next lemma refines Lemma 4 and is useful for analyzing the regret.

Lemma 8. Let $z\in\mathcal{X}_0$ be arbitrary. For all $T\geq 1$, Algorithm 1 guarantees

$$\sum_{t=1}^{T} f^t(x(t)) \leq \sum_{t=1}^{T} f^t(z) + \underbrace{\frac{\alpha}{V}R^2 + \frac{VD_1^2}{4\alpha}T + \frac{1}{2V}[G+\sqrt{m}D_2R]^2T}_{\text{(I)}} + \underbrace{\frac{1}{V}\sum_{t=1}^{T}\sum_{k=1}^{m} Q_k(t)g_k^t(z)}_{\text{(II)}} \qquad (15)$$

where $m$ is the number of constraint functions; and $D_1, D_2, G$ and $R$ are defined in Assumption 1.
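To see concretely why the choice $V = \sqrt{T}$, $\alpha = T$ makes term (I) sublinear (a direct substitution, added here for the reader's convenience), note that

$$\text{(I)} = \frac{T}{\sqrt{T}}R^2 + \frac{\sqrt{T}D_1^2}{4T}T + \frac{T}{2\sqrt{T}}[G+\sqrt{m}D_2R]^2 = \sqrt{T}\Big(R^2 + \frac{D_1^2}{4} + \frac{1}{2}[G+\sqrt{m}D_2R]^2\Big) = O(\sqrt{T}),$$

while the expectation of term (II) with $z = x^*$ is non-positive by Lemma 6.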

Proof. Fix $t\geq 1$. By Lemma 4, we have

$$V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] + \alpha\|x(t+1)-x(t)\|^2 \leq V[\nabla f^t(x(t))]^T[z-x(t)] + \sum_{k=1}^{m} Q_k(t)[\nabla g_k^t(x(t))]^T[z-x(t)] + \alpha[\|z-x(t)\|^2 - \|z-x(t+1)\|^2].$$

Adding the constant $Vf^t(x(t)) + \sum_{k=1}^{m} Q_k(t)g_k^t(x(t))$ to both sides, and noting that $f^t(x(t)) + [\nabla f^t(x(t))]^T[z-x(t)]\leq f^t(z)$ and $g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[z-x(t)]\leq g_k^t(z)$ by convexity, yields

$$Vf^t(x(t)) + V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] + \sum_{k=1}^{m} Q_k(t)\big[g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\big] + \alpha\|x(t+1)-x(t)\|^2 \leq Vf^t(z) + \sum_{k=1}^{m} Q_k(t)g_k^t(z) + \alpha[\|z-x(t)\|^2 - \|z-x(t+1)\|^2]. \qquad (16)$$

By Lemma 2, we have

$$\Delta(t) \leq \sum_{k=1}^{m} Q_k(t)\big[g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\big] + \frac{1}{2}[G+\sqrt{m}D_2R]^2. \qquad (17)$$

Summing (16) and (17), cancelling common terms and rearranging terms yields

$$Vf^t(x(t)) \leq Vf^t(z) - \Delta(t) + \sum_{k=1}^{m} Q_k(t)g_k^t(z) + \alpha[\|z-x(t)\|^2 - \|z-x(t+1)\|^2] - V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] - \alpha\|x(t+1)-x(t)\|^2 + \frac{1}{2}[G+\sqrt{m}D_2R]^2. \qquad (18)$$

Note that

$$-V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] - \alpha\|x(t+1)-x(t)\|^2 \overset{(a)}{\leq} V\|\nabla f^t(x(t))\|\|x(t+1)-x(t)\| - \alpha\|x(t+1)-x(t)\|^2 \overset{(b)}{\leq} VD_1\|x(t+1)-x(t)\| - \alpha\|x(t+1)-x(t)\|^2 = -\alpha\Big[\|x(t+1)-x(t)\| - \frac{VD_1}{2\alpha}\Big]^2 + \frac{V^2D_1^2}{4\alpha} \leq \frac{V^2D_1^2}{4\alpha}, \qquad (19)$$

where (a) follows from the Cauchy-Schwarz inequality; and (b) follows from Assumption 1. Substituting (19) into (18) yields

$$Vf^t(x(t)) \leq Vf^t(z) - \Delta(t) + \sum_{k=1}^{m} Q_k(t)g_k^t(z) + \alpha[\|z-x(t)\|^2 - \|z-x(t+1)\|^2] + \frac{V^2D_1^2}{4\alpha} + \frac{1}{2}[G+\sqrt{m}D_2R]^2.$$

Summing over $t\in\{1,2,\ldots,T\}$ yields

$$V\sum_{t=1}^{T} f^t(x(t)) \leq V\sum_{t=1}^{T} f^t(z) - \sum_{t=1}^{T}\Delta(t) + \alpha\sum_{t=1}^{T}[\|z-x(t)\|^2 - \|z-x(t+1)\|^2] + \frac{V^2D_1^2}{4\alpha}T + \frac{1}{2}[G+\sqrt{m}D_2R]^2T + \sum_{t=1}^{T}\sum_{k=1}^{m} Q_k(t)g_k^t(z)$$
$$\overset{(a)}{=} V\sum_{t=1}^{T} f^t(z) + L(1) - L(T+1) + \alpha\|z-x(1)\|^2 - \alpha\|z-x(T+1)\|^2 + \frac{V^2D_1^2}{4\alpha}T + \frac{1}{2}[G+\sqrt{m}D_2R]^2T + \sum_{t=1}^{T}\sum_{k=1}^{m} Q_k(t)g_k^t(z)$$
$$\overset{(b)}{\leq} V\sum_{t=1}^{T} f^t(z) + \alpha R^2 + \frac{V^2D_1^2}{4\alpha}T + \frac{1}{2}[G+\sqrt{m}D_2R]^2T + \sum_{t=1}^{T}\sum_{k=1}^{m} Q_k(t)g_k^t(z),$$

where (a) follows by recalling that $\Delta(t) = L(t+1) - L(t)$; and (b) follows because $\|z-x(1)\|\leq R$ by Assumption 1, $L(1) = \frac{1}{2}\|\mathbf{Q}(1)\|^2 = 0$ and $L(T+1) = \frac{1}{2}\|\mathbf{Q}(T+1)\|^2\geq 0$. Dividing both sides by $V$ yields the desired result.

Note that if we take $V = \sqrt{T}$ and $\alpha = T$, then term (I) in (15) is $O(\sqrt{T})$. Recall that the expectation of term (II) in (15) with $z = x^*$ is non-positive by Lemma 6. The expected regret bound of Algorithm 1 follows by taking expectations on both sides of (15) and is summarized in the next theorem.

Theorem 2 (Expected Regret Bound). Let $x^*\in\mathcal{X}_0$ be any fixed solution that satisfies $\tilde{\mathbf{g}}(x^*)\leq\mathbf{0}$, e.g., $x^* = \operatorname{argmin}_{x\in\mathcal{X}}\sum_{t=1}^{T} f^t(x)$. If $V = \sqrt{T}$ and $\alpha = T$ in Algorithm 1, then for all $T\geq 1$,

$$E\Big[\sum_{t=1}^{T} f^t(x(t))\Big] \leq E\Big[\sum_{t=1}^{T} f^t(x^*)\Big] + O(\sqrt{T}),$$


where the expectation is taken with respect to all $\omega(t)$.

Proof. Fix $T\geq 1$. Taking $z = x^*$ in Lemma 8 yields

$$\sum_{t=1}^{T} f^t(x(t)) \leq \sum_{t=1}^{T} f^t(x^*) + \frac{\alpha}{V}R^2 + \frac{VD_1^2}{4\alpha}T + \frac{1}{2V}[G+\sqrt{m}D_2R]^2T + \frac{1}{V}\sum_{t=1}^{T}\sum_{k=1}^{m} Q_k(t)g_k^t(x^*).$$

Taking expectations on both sides and using (13) yields

$$\sum_{t=1}^{T} E[f^t(x(t))] \leq \sum_{t=1}^{T} E[f^t(x^*)] + \frac{\alpha}{V}R^2 + \frac{D_1^2}{4}\frac{V}{\alpha}T + \frac{1}{2V}[G+\sqrt{m}D_2R]^2T.$$

Taking $V = \sqrt{T}$ and $\alpha = T$ yields

$$\sum_{t=1}^{T} E[f^t(x(t))] \leq \sum_{t=1}^{T} E[f^t(x^*)] + O(\sqrt{T}).$$

D. Special Case Performance Guarantees

Theorems 1 and 2 provide expected performance guarantees of Algorithm 1 for OCO with stochastic constraints. The results further imply performance guarantees in the following special cases (see also the sketch after this list):

• OCO with long term constraints: In this case, $g_k(x;\omega(t))\equiv g_k(x)$ and there is no randomness. Thus, the expectations in Theorems 1 and 2 disappear. For this problem, Algorithm 1 achieves $O(\sqrt{T})$ (deterministic) regret and $O(\sqrt{T})$ (deterministic) constraint violations.

• Stochastic constrained convex optimization: Note that i.i.d. time-varying $f(x;\omega(t))$ is a special case of the arbitrarily-varying $f^t(x)$ considered in our OCO setting. Thus, Theorems 1 and 2 still hold when Algorithm 1 is applied to stochastic constrained convex optimization. That is, $\sum_{t=1}^{T} E[f^t(x(t))]\leq\sum_{t=1}^{T} E[f^t(x^*)] + O(\sqrt{T})$ and $\sum_{t=1}^{T} E[g_k^t(x(t))]\leq O(\sqrt{T}), \forall k\in\{1,2,\ldots,m\}$. This online performance guarantee also implies that Algorithm 1 can be used as a (batch) offline algorithm with $O(1/\sqrt{T})$ convergence for stochastic constrained convex optimization. That is, after running Algorithm 1 for $T$ slots, if we use $\overline{x}(T) = \frac{1}{T}\sum_{t=1}^{T} x(t)$ as a fixed solution, then $E[f(\overline{x}(T);\omega)] = E[f^t(\overline{x}(T))]\leq E[f^t(x^*)] + O(\frac{1}{\sqrt{T}})$ and $E[g_k(\overline{x}(T);\omega)] = E[g_k^t(\overline{x}(T))]\leq O(\frac{1}{\sqrt{T}}), \forall k\in\{1,2,\ldots,m\}$, with $t\geq T+1$, by the i.i.d. property of each $f^t$ and $g^t$ and Jensen's inequality. Used as a (batch) offline algorithm, Algorithm 1 ties with the algorithm developed in [17], which is by design a (batch) offline algorithm and can only solve stochastic optimization with a single constraint function.

• Deterministic constrained convex optimization: Similarly to OCO with long term constraints, the expectations in Theorems 1 and 2 disappear in this case since $f^t(x)\equiv f(x)$ and $g_k(x;\omega(t))\equiv g_k(x)$. If we use $\overline{x}(T) = \frac{1}{T}\sum_{t=1}^{T} x(t)$ as the solution, then $f(\overline{x}(T))\leq f(x^*) + O(\frac{1}{\sqrt{T}})$ and $g_k(\overline{x}(T))\leq O(\frac{1}{\sqrt{T}})$, which follows by dividing the inequalities in Theorems 1 and 2 by $T$ on both sides and applying Jensen's inequality. Thus, Algorithm 1 solves deterministic constrained convex optimization with $O(\frac{1}{\sqrt{T}})$ convergence.
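The online-to-batch conversion used in the last two bullets is plain iterate averaging; a minimal sketch (assuming the iterates were collected while running the Algorithm 1 sketch from Section II) could look like this:

```python
import numpy as np

def online_to_batch(xs):
    """Given the list of iterates x(1), ..., x(T) produced by Algorithm 1,
    return x_bar(T) = (1/T) * sum_t x(t). By Jensen's inequality and
    Theorems 1-2, this averaged point is an O(1/sqrt(T))-accurate solution
    in the stochastic / deterministic special cases discussed above."""
    return np.mean(np.asarray(xs), axis=0)
```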

IV. HIGH PROBABILITY PERFORMANCE ANALYSIS

This section shows that if we choose $V = \sqrt{T}$ and $\alpha = T$ in Algorithm 1, then for any $0 < \lambda < 1$, with probability at least $1-\lambda$, the regret is $O(\sqrt{T}\log(T)\log^{1.5}(\frac{1}{\lambda}))$ and the constraint violations are $O(\sqrt{T}\log(T)\log(\frac{1}{\lambda}))$.

A. High Probability Constraint Violation Analysis

Similarly to the expected constraint violation analysis, we can use part (2) of the new drift lemma (Lemma 5) to obtain a high probability bound on $\|\mathbf{Q}(t)\|$, which together with Corollary 2 leads to the high probability constraint violation bound summarized in Theorem 3.

Theorem 3 (High Probability Constraint Violation Bound). Let $0 < \lambda < 1$ be arbitrary. If $V = \sqrt{T}$ and $\alpha = T$ in Algorithm 1, then for all $T\geq 1$ and all $k\in\{1,2,\ldots,m\}$, we have

$$\Pr\Big(\sum_{t=1}^{T} g_k^t(x(t)) \leq O\big(\sqrt{T}\log(T)\log(\tfrac{1}{\lambda})\big)\Big) \geq 1-\lambda.$$

Proof. Define the random process $Z(t) = \|\mathbf{Q}(t)\|, \forall t\in\{1,2,\ldots\}$. By Lemma 7, $Z(t)$ satisfies the conditions in Lemma 5 with $\delta_{\max} = G+\sqrt{m}D_2R$, $\zeta = \frac{\epsilon}{2}$ and $\theta = \frac{\epsilon}{2}t_0 + t_0(G+\sqrt{m}D_2R) + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + [G+\sqrt{m}D_2R]^2}{\epsilon}$. Fix $T\geq 1$ and $0 < \lambda < 1$. Taking $\mu = \lambda/(T+1)$ in part (2) of Lemma 5 yields

$$\Pr(\|\mathbf{Q}(t)\|\geq\gamma) \leq \frac{\lambda}{T+1}, \quad \forall t\in\{1,2,\ldots,T+1\},$$

where $\gamma = \theta + t_0B + t_0\frac{8[G+\sqrt{m}D_2R]^2}{\epsilon}\log\big(\frac{T+1}{\lambda}\big)$, with $B = \frac{8[G+\sqrt{m}D_2R]^2}{\epsilon}\log\Big[1 + \frac{32[G+\sqrt{m}D_2R]^2}{\epsilon^2}e^{\epsilon/[8(G+\sqrt{m}D_2R)]}\Big]$ being an absolute constant that does not depend on the algorithm parameters. By the union bound, we have

$$\Pr(\|\mathbf{Q}(t)\|\geq\gamma \text{ for some } t\in\{1,2,\ldots,T+1\}) \leq \lambda.$$

This implies

$$\Pr(\|\mathbf{Q}(t)\|\leq\gamma \text{ for all } t\in\{1,2,\ldots,T+1\}) \geq 1-\lambda. \qquad (20)$$

Taking $t_0 = \lceil\sqrt{T}\rceil$, $V = \sqrt{T}$ and $\alpha = T$ yields

$$\gamma = O(\sqrt{T}\log(T)) + O\big(\sqrt{T}\log(\tfrac{1}{\lambda})\big) = O\big(\sqrt{T}\log(T)\log(\tfrac{1}{\lambda})\big). \qquad (21)$$

Recall that by Corollary 2 (with $V = \sqrt{T}$ and $\alpha = T$), for all $k\in\{1,2,\ldots,m\}$, we have

$$\sum_{t=1}^{T} g_k^t(x(t)) \leq \|\mathbf{Q}(T+1)\| + \frac{\sqrt{T}D_1D_2}{2} + \frac{\sqrt{m}D_2^2}{2T}\sum_{t=1}^{T}\|\mathbf{Q}(t)\|. \qquad (22)$$

It follows from (20)-(22) that

$$\Pr\Big(\sum_{t=1}^{T} g_k^t(x(t)) \leq O\big(\sqrt{T}\log(T)\log(\tfrac{1}{\lambda})\big)\Big) \geq 1-\lambda.$$

B. High Probability Regret Analysis

To obtain a high probability regret bound from Lemma 8, it remains to derive a high probability bound on term (II) in (15) with $z = x^*$. The main challenge is that term (II) is a supermartingale with unbounded differences (due to the possibly unbounded virtual queues $Q_k(t)$). Most concentration inequalities, e.g., the Hoeffding-Azuma inequality, used in high probability performance analysis of online algorithms are restricted to martingales/supermartingales with bounded differences; see for example [24], [25], [10]. The following lemma considers supermartingales with unbounded differences. Its proof uses the truncation method to construct an auxiliary well-behaved supermartingale. Similar proof techniques were previously used in [26], [27] to prove different concentration inequalities for supermartingales/martingales with unbounded differences.

Lemma 9. Let $\{Z(t), t\geq 0\}$ be a supermartingale adapted to a filtration $\{\mathcal{F}(t), t\geq 0\}$ with $Z(0) = 0$ and $\mathcal{F}(0) = \{\emptyset,\Omega\}$, i.e., $E[Z(t+1)|\mathcal{F}(t)]\leq Z(t), \forall t\geq 0$. Suppose there exists a constant $c > 0$ such that $\{|Z(t+1)-Z(t)| > c\}\subseteq\{Y(t) > 0\}, \forall t\geq 0$, where each $Y(t)$ is adapted to $\mathcal{F}(t)$ and $\Pr(Y(t) > 0)\leq p(t), \forall t\geq 0$. Then, for all $z > 0$, we have

$$\Pr(Z(t)\geq z) \leq e^{-z^2/(2tc^2)} + \sum_{\tau=0}^{t-1} p(\tau), \quad \forall t\geq 1.$$

Proof. See Appendix C.

Note that if $p(t) = 0, \forall t\geq 0$, then $Z(t)$ is a supermartingale with differences bounded by $c$ and $\Pr(|Z(t+1)-Z(t)| > c) = 0, \forall t\geq 0$. In this case, Lemma 9 reduces to the conventional Hoeffding-Azuma inequality. The next theorem summarizes the high probability regret performance of Algorithm 1 and follows from Lemmas 5-9.

Theorem 4 (High Probability Regret Bound). Let $x^*\in\mathcal{X}_0$ satisfy $\tilde{\mathbf{g}}(x^*)\leq\mathbf{0}$, e.g., $x^* = \operatorname{argmin}_{x\in\mathcal{X}}\sum_{t=1}^{T} f^t(x)$. Let $0 < \lambda < 1$ be arbitrary. If $V = \sqrt{T}$ and $\alpha = T$ in Algorithm 1, then for all $T\geq 1$, we have

$$\Pr\Big(\sum_{t=1}^{T} f^t(x(t)) \leq \sum_{t=1}^{T} f^t(x^*) + O\big(\sqrt{T}\log(T)\log^{1.5}(\tfrac{1}{\lambda})\big)\Big) \geq 1-\lambda.$$

Proof. See Appendix D.

V. EXPERIMENT: ONLINE JOB SCHEDULING IN DISTRIBUTED DATA CENTERS

Consider a geo-distributed data center infrastructure consisting of one front-end job router and 100 geographically distributed servers, located at 10 different zones to form 10 clusters (10 servers per cluster). See Fig. 1(a) for an illustration. The front-end job router receives job tasks and schedules them to different servers to fulfill the service. To serve the assigned jobs, each server purchases power (within its capacity) from its zone market. Electricity market prices can vary significantly across time and zones; see Fig. 1(b) for a 5-minute average electricity price trace (between 05/01/2017 and 05/10/2017) at New York zone CENTRL [28]. The problem is to schedule jobs and control power levels at each server in real time such that all incoming jobs are served and the electricity cost is minimized.

In our experiment, each server's power is adjusted every 5 minutes, and we call this interval a slot. (In practice, server power cannot be adjusted too frequently due to hardware restrictions and configuration delay.) Let $x(t) = [x_1(t),\ldots,x_{100}(t)]$ be the power vector at slot $t$, where each $x_i(t)$ must be chosen from an interval $[x_i^{\min}, x_i^{\max}]$ restricted by the hardware, and the service rate at each server $i$ satisfies $\mu_i(t) = h_i(x_i(t))$, where $h_i(\cdot)$ is an increasing concave function. At each slot $t$, the job router schedules $\mu_i(t)$ amount of jobs to server $i$. The electricity cost at slot $t$ is $f^t(x(t)) = \sum_{i=1}^{100} c_i(t)x_i(t)$, where $c_i(t)$ is the electricity price in server $i$'s zone. We use $c_i(t)$ from real-world 5-minute average electricity price data at 10 different zones in New York between 05/01/2017 and 05/10/2017, obtained from NYISO [28]. At each slot $t$, the incoming job volume is given by $\omega(t)$ and follows a Poisson distribution. Note that the amount of incoming jobs and the electricity prices $c_i(t)$ are unknown to us at the beginning of each slot $t$ but can be observed at the end of each slot. This is an example of OCO with stochastic constraints, where we aim to minimize the electricity cost subject to the constraint that incoming jobs must be served in time. In particular, at each round $t$, we receive the loss function $f^t(x(t))$ and the constraint function $g^t(x(t)) = \omega(t) - \sum_{i=1}^{100} h_i(x_i(t))$.

We compare our proposed algorithm with 3 baselines: (1) the best fixed decision in hindsight; (2) react [29]; and (3) low-power [30]. Both "react" and "low-power" are popular power control strategies used in distributed data centers. See Supplement E for more details of these 2 baselines and our experiment. Fig. 1(c)(d) plot the performance of the 4 algorithms, where the running average is the time average up to the current slot. Fig. 1(c) compares electricity cost while Fig. 1(d) compares unserved jobs. (Unserved jobs accumulate if the service rate provided by an algorithm is less than the job arrival rate, i.e., the stochastic constraint is violated.) Fig. 1(c)(d) show that our proposed algorithm performs closely to the best fixed decision in hindsight over time, both in electricity cost and in constraint violations. "React" serves job arrivals well but yields larger electricity cost, while "low-power" has low electricity cost but fails to serve the job arrivals.
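For concreteness, the per-slot loss and constraint of this experiment can be wired into the Algorithm 1 sketch from Section II as follows. This is an illustrative toy instance, not the paper's exact experimental code: the concave rate function $h_i(x) = \log(1+x)$, the price range, and the Poisson rate are assumptions made here only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                      # servers
x_min, x_max = 0.0, 1.0      # per-server power limits (illustrative units)
project_X0 = lambda y: np.clip(y, x_min, x_max)
h = lambda x: np.log1p(x)    # increasing concave service rate (assumed form)

def oracle(t, x):
    """Observed at the end of slot t: subgradients of f^t and g^t at x(t)."""
    c = rng.uniform(20.0, 200.0, size=n)       # stand-in for NYISO prices c_i(t)
    omega = rng.poisson(lam=30.0)              # stand-in for job arrivals w(t)
    grad_f = c                                 # f^t(x) = sum_i c_i(t) x_i
    g_val = np.array([omega - h(x).sum()])     # g^t(x) = w(t) - sum_i h_i(x_i)
    grad_g = (-1.0 / (1.0 + x)).reshape(1, n)  # d/dx_i [ -log(1+x_i) ]
    return grad_f, g_val, grad_g
```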

Fig. 1: (a) Geo-distributed data center infrastructure; (b) Electricity market prices at zone CENTRL, New York; (c) Running average electricity cost; (d) Running average unserved jobs.

VI. CONCLUSION

This paper studies OCO with stochastic constraints and proposes a novel learning algorithm that guarantees $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations.

APPENDIX A
PROOF OF LEMMA 5

In this proof, we first establish an upper bound on $E[e^{rZ(t)}]$ for some constant $r > 0$. Part (1) of the lemma then follows by applying Jensen's inequality, since $e^{rx}$ is convex in $x$ when $r > 0$. Part (2) of the lemma follows directly from Markov's inequality. The following fact is useful in the proof.

Fact 1. $e^x \leq 1 + x + 2x^2$ for any $|x|\leq 1$.

Proof. By Taylor's expansion, we know that for any $x\in\mathbb{R}$ there exists a point $\hat{x}$ between $0$ and $x$ such that $e^x = 1 + x + e^{\hat{x}}\frac{x^2}{2}$. (Note that the value of $\hat{x}$ depends on $x$: if $x > 0$ then $\hat{x}\in(0,x)$; if $x < 0$ then $\hat{x}\in(x,0)$; and if $x = 0$ then $\hat{x} = x$.) Since $|x|\leq 1$, we have $e^{\hat{x}}\leq e\leq 4$. Thus, $e^x\leq 1 + x + 2x^2$ for any $|x|\leq 1$.

The next lemma provides an upper bound on $E[e^{rZ(t)}]$ with the constant $r = \frac{\zeta}{4t_0\delta_{\max}^2}$.


Lemma 10. For all $t\in\{0,1,\ldots\}$,

$$E[e^{rZ(t)}] \leq \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + \rho^{\lceil t/t_0\rceil},$$

where $r = \frac{\zeta}{4t_0\delta_{\max}^2}$, $\rho = 1 - \frac{\zeta^2}{8\delta_{\max}^2} = 1 - \frac{rt_0\zeta}{2}$, and $\lceil x\rceil$ denotes the smallest integer no less than $x$.

Proof. Since $0 < \zeta\leq\delta_{\max}$, we have $0 < \rho < 1 < e^{rt_0\delta_{\max}}$. Define $\eta(t) = Z(t+t_0) - Z(t)$. Note that $|\eta(t)|\leq t_0\delta_{\max}, \forall t\geq 0$, and $|r\eta(t)| \leq \frac{\zeta}{4t_0\delta_{\max}^2}t_0\delta_{\max} = \frac{\zeta}{4\delta_{\max}}\leq 1$. Then,

$$e^{rZ(t+t_0)} = e^{rZ(t)}e^{r\eta(t)} \qquad (23)$$
$$\overset{(a)}{\leq} e^{rZ(t)}\big[1 + r\eta(t) + 2r^2t_0^2\delta_{\max}^2\big] \overset{(b)}{=} e^{rZ(t)}\Big[1 + r\eta(t) + \frac{1}{2}rt_0\zeta\Big], \qquad (24)$$

where (a) follows from Fact 1 by noting that $|r\eta(t)|\leq 1$ and $|\eta(t)|\leq t_0\delta_{\max}$; and (b) follows by substituting $r = \frac{\zeta}{4t_0\delta_{\max}^2}$ into a single $r$ of the term $2r^2t_0^2\delta_{\max}^2$.

Next, consider the cases $Z(t)\geq\theta$ and $Z(t) < \theta$ separately.

• Case $Z(t)\geq\theta$: Taking conditional expectations on both sides of (24) yields:

$$E[e^{rZ(t+t_0)}|Z(t)] \leq E\Big[e^{rZ(t)}\Big(1 + r\eta(t) + \frac{1}{2}rt_0\zeta\Big)\Big|Z(t)\Big] \overset{(a)}{\leq} e^{rZ(t)}\Big[1 - rt_0\zeta + \frac{1}{2}rt_0\zeta\Big] = e^{rZ(t)}\Big[1 - \frac{rt_0\zeta}{2}\Big] \overset{(b)}{=} \rho e^{rZ(t)},$$

where (a) follows from the fact that $E[Z(t+t_0)-Z(t)|\mathcal{F}(t)]\leq -t_0\zeta$ when $Z(t)\geq\theta$; and (b) follows from the fact that $\rho = 1 - \frac{rt_0\zeta}{2}$.

• Case $Z(t) < \theta$: Taking conditional expectations on both sides of (23) yields:

$$E[e^{rZ(t+t_0)}|Z(t)] = e^{rZ(t)}E[e^{r\eta(t)}|Z(t)] \overset{(a)}{\leq} e^{rt_0\delta_{\max}}e^{rZ(t)},$$

where (a) follows from the fact that $\eta(t)\leq t_0\delta_{\max}$.

Putting the two cases together yields:

$$E[e^{rZ(t+t_0)}] \overset{(a)}{=} \Pr(Z(t)\geq\theta)E[e^{rZ(t+t_0)}|Z(t)\geq\theta] + \Pr(Z(t)<\theta)E[e^{rZ(t+t_0)}|Z(t)<\theta]$$
$$\overset{(b)}{\leq} \rho E[e^{rZ(t)}|Z(t)\geq\theta]\Pr(Z(t)\geq\theta) + e^{rt_0\delta_{\max}}E[e^{rZ(t)}|Z(t)<\theta]\Pr(Z(t)<\theta)$$
$$\overset{(c)}{=} \rho E[e^{rZ(t)}] + [e^{rt_0\delta_{\max}}-\rho]E[e^{rZ(t)}|Z(t)<\theta]\Pr(Z(t)<\theta) \overset{(d)}{\leq} \rho E[e^{rZ(t)}] + [e^{rt_0\delta_{\max}}-\rho]e^{r\theta} \leq \rho E[e^{rZ(t)}] + e^{rt_0\delta_{\max}}e^{r\theta}, \qquad (25)$$

where (a) follows from the definition of expectation; (b) follows from the results in the above two cases; (c) follows from the fact that $E[e^{rZ(t)}] = \Pr(Z(t)\geq\theta)E[e^{rZ(t)}|Z(t)\geq\theta] + \Pr(Z(t)<\theta)E[e^{rZ(t)}|Z(t)<\theta]$; and (d) follows from the fact that $e^{rt_0\delta_{\max}} > \rho$.

Now we prove $E[e^{rZ(t)}] \leq \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + \rho^{\lceil t/t_0\rceil}, \forall t\geq 0$, by induction. Since $Z(\tau)\leq\tau\delta_{\max}, \forall\tau\geq 0$, it follows that $E[e^{rZ(\tau)}]\leq e^{r\tau\delta_{\max}}\leq e^{rt_0\delta_{\max}}\leq\frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + \rho^{\lceil\tau/t_0\rceil}, \forall\tau\in\{1,\ldots,t_0\}$, where the last inequality follows because $\frac{e^{r\theta}}{1-\rho}\geq 1$ and $\rho^{\lceil\tau/t_0\rceil}\geq 0$.


Fix $\tau\in\{1,\ldots,t_0\}$ and assume $E[e^{rZ(nt_0+\tau)}]\leq\frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + \rho^{n+1}$ holds; the base case $n = 0$ was just proven above. Consider $E[e^{rZ((n+1)t_0+\tau)}]$. By (25), we have

$$E[e^{rZ((n+1)t_0+\tau)}] \leq \rho E[e^{rZ(nt_0+\tau)}] + e^{rt_0\delta_{\max}}e^{r\theta} \leq \rho\Big[\frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + \rho^{n+1}\Big] + e^{rt_0\delta_{\max}}e^{r\theta} \leq e^{rt_0\delta_{\max}}e^{r\theta}\Big[\frac{\rho}{1-\rho} + 1\Big] + \rho^{n+2} = \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + \rho^{n+2}.$$

Thus, the lemma follows by induction.

By this lemma, for all $t\in\{0,1,\ldots\}$, we have

$$E[e^{rZ(t)}] \leq \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + \rho^{\lceil t/t_0\rceil} \overset{(a)}{\leq} \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + 1 \overset{(b)}{\leq} \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + e^{r\theta} = e^{r\theta}\Big[\frac{e^{rt_0\delta_{\max}}}{1-\rho} + 1\Big], \qquad (26)$$

where (a) follows from the fact that $0 < \rho < 1$; and (b) follows from the facts that $r > 0$ and $\theta > 0$.

Proof of Part (1): Note that $e^{rx}$ is convex in $x$ when $r > 0$. By Jensen's inequality,

$$e^{rE[Z(t)]} \leq E[e^{rZ(t)}] \overset{(a)}{\leq} e^{r\theta}\Big[\frac{e^{rt_0\delta_{\max}}}{1-\rho} + 1\Big], \qquad (27)$$

where (a) follows from (26). Taking the logarithm on both sides and dividing by $r$ yields:

$$E[Z(t)] \leq \theta + \frac{1}{r}\log\Big[1 + \frac{e^{rt_0\delta_{\max}}}{1-\rho}\Big] \overset{(a)}{=} \theta + t_0\frac{4\delta_{\max}^2}{\zeta}\log\Big[1 + \frac{8\delta_{\max}^2}{\zeta^2}e^{\zeta/(4\delta_{\max})}\Big],$$

where (a) follows by recalling that $r = \frac{\zeta}{4t_0\delta_{\max}^2}$ and $\rho = 1 - \frac{\zeta^2}{8\delta_{\max}^2}$.

Proof of Part (2): Fix $z$. Note that

$$\Pr(Z(t)\geq z) = \Pr(e^{rZ(t)}\geq e^{rz}) \overset{(a)}{\leq} \frac{E[e^{rZ(t)}]}{e^{rz}} \overset{(b)}{\leq} e^{r\theta}e^{-rz}\Big[\frac{e^{rt_0\delta_{\max}}}{1-\rho} + 1\Big] \overset{(c)}{=} e^{\frac{\zeta}{4t_0\delta_{\max}^2}(\theta-z)}\Big[1 + \frac{8\delta_{\max}^2}{\zeta^2}e^{\zeta/(4\delta_{\max})}\Big], \qquad (28)$$

where (a) follows from Markov's inequality; (b) follows from (26); and (c) follows by recalling that $r = \frac{\zeta}{4t_0\delta_{\max}^2}$ and $\rho = 1 - \frac{\zeta^2}{8\delta_{\max}^2}$.

Define $\mu = e^{\frac{\zeta}{4t_0\delta_{\max}^2}(\theta-z)}\Big[1 + \frac{8\delta_{\max}^2}{\zeta^2}e^{\zeta/(4\delta_{\max})}\Big]$. It follows that if

$$z = \theta + t_0\frac{4\delta_{\max}^2}{\zeta}\log\Big[1 + \frac{8\delta_{\max}^2}{\zeta^2}e^{\zeta/(4\delta_{\max})}\Big] + t_0\frac{4\delta_{\max}^2}{\zeta}\log\Big(\frac{1}{\mu}\Big),$$

then we have $\Pr(Z(t)\geq z)\leq\mu$ by (28).


APPENDIX B
PROOF OF LEMMA 7

The next lemma will be useful in our proof.

Lemma 11. Let $\hat{x}\in\mathcal{X}_0$ be a Slater point defined in Assumption 2, i.e., $\tilde{g}_k(\hat{x}) = E_\omega[g_k(\hat{x};\omega)]\leq -\epsilon, \forall k\in\{1,2,\ldots,m\}$. Then

$$E\Big[\sum_{k=1}^{m} Q_k(t_1)g_k^{t_1}(\hat{x})\Big|\mathcal{W}(t_2)\Big] \leq -\epsilon E[\|\mathbf{Q}(t_1)\|\,|\mathcal{W}(t_2)], \quad \forall t_2\leq t_1-1,$$

where $\epsilon > 0$ is defined in Assumption 2.

Proof. To prove this lemma, we first show that $E[Q_k(t_1)g_k^{t_1}(\hat{x})|\mathcal{W}(t_2)]\leq -\epsilon E[Q_k(t_1)|\mathcal{W}(t_2)], \forall k\in\{1,2,\ldots,m\}, \forall t_2\leq t_1-1$. Fix $k\in\{1,2,\ldots,m\}$. Note that $\mathbf{Q}(t_1)\in\mathcal{W}(t_1-1)$ and $g_k^{t_1}(\hat{x})$ is independent of $\mathcal{W}(t_1-1)$. Further, if $t_2\leq t_1-1$, then $\mathcal{W}(t_2)\subseteq\mathcal{W}(t_1-1)$. Thus, we have

$$E[Q_k(t_1)g_k^{t_1}(\hat{x})|\mathcal{W}(t_2)] \overset{(a)}{=} E\big[E[Q_k(t_1)g_k^{t_1}(\hat{x})|\mathcal{W}(t_1-1)]\big|\mathcal{W}(t_2)\big] \overset{(b)}{=} E\big[Q_k(t_1)E[g_k^{t_1}(\hat{x})]\big|\mathcal{W}(t_2)\big] \overset{(c)}{=} E[g_k^{t_1}(\hat{x})]E[Q_k(t_1)|\mathcal{W}(t_2)] \overset{(d)}{\leq} -\epsilon E[Q_k(t_1)|\mathcal{W}(t_2)],$$

where (a) follows from iterated expectations; (b) follows because $g_k^{t_1}(\hat{x})$ is independent of $\mathcal{W}(t_1-1)$ and $Q_k(t_1)\in\mathcal{W}(t_1-1)$; (c) follows by extracting the constant $E[g_k^{t_1}(\hat{x})]$; and (d) follows from the assumption that $\hat{x}$ is a Slater point, the fact that the $g^t(\cdot)$ are i.i.d. across $t$, and the fact that $Q_k(t)\geq 0$.

Now, summing over $k\in\{1,2,\ldots,m\}$ yields

$$E\Big[\sum_{k=1}^{m} Q_k(t_1)g_k^{t_1}(\hat{x})\Big|\mathcal{W}(t_2)\Big] \leq -\epsilon E\Big[\sum_{k=1}^{m} Q_k(t_1)\Big|\mathcal{W}(t_2)\Big] \overset{(a)}{\leq} -\epsilon E[\|\mathbf{Q}(t_1)\|\,|\mathcal{W}(t_2)],$$

where (a) follows from the basic fact that $\sum_{k=1}^{m} a_k\geq\sqrt{\sum_{k=1}^{m} a_k^2}$ when $a_k\geq 0, \forall k\in\{1,2,\ldots,m\}$.

The bounded difference of $\|\mathbf{Q}(t+1)\| - \|\mathbf{Q}(t)\|$ follows directly from the virtual queue update equation (3) and is summarized in the next lemma.

Lemma 12. Let $\mathbf{Q}(t), t\in\{0,1,\ldots\}$, be the sequence generated by Algorithm 1. Then,

$$\|\mathbf{Q}(t)\| - G - \sqrt{m}D_2R \leq \|\mathbf{Q}(t+1)\| \leq \|\mathbf{Q}(t)\| + G, \quad \forall t\geq 0.$$

Proof.

• Proof of $\|\mathbf{Q}(t+1)\|\leq\|\mathbf{Q}(t)\| + G$: Fix $t\geq 0$ and $k\in\{1,2,\ldots,m\}$. The virtual queue update equation implies that

$$Q_k(t+1) = \max\{Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)], 0\} \overset{(a)}{\leq} \max\{Q_k(t) + g_k^t(x(t+1)), 0\},$$

where (a) follows from the convexity of $g_k^t(\cdot)$. Note that $Q_k(t+1)\geq 0$ and recall the fact that if $0\leq a\leq\max\{b,0\}$, then $a^2\leq b^2$ for all $a,b\in\mathbb{R}$. Then, we have $[Q_k(t+1)]^2\leq[Q_k(t) + g_k^t(x(t+1))]^2$. Summing over $k\in\{1,2,\ldots,m\}$ yields

$$\|\mathbf{Q}(t+1)\|^2 \leq \|\mathbf{Q}(t) + \mathbf{g}^t(x(t+1))\|^2.$$

Thus, $\|\mathbf{Q}(t+1)\|\leq\|\mathbf{Q}(t) + \mathbf{g}^t(x(t+1))\|\leq\|\mathbf{Q}(t)\| + \|\mathbf{g}^t(x(t+1))\|\leq\|\mathbf{Q}(t)\| + G$, where the last inequality follows from Assumption 1.

• Proof of $\|\mathbf{Q}(t+1)\|\geq\|\mathbf{Q}(t)\| - G - \sqrt{m}D_2R$: Since $Q_k(t)\geq 0$, it follows that $|Q_k(t+1) - Q_k(t)|\leq|g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]|$. (This can be shown by considering the cases $g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\geq 0$ and $< 0$ separately.) Thus, we have $\|\mathbf{Q}(t+1) - \mathbf{Q}(t)\|\leq G + \sqrt{m}D_2R$, which further implies $\|\mathbf{Q}(t+1)\|\geq\|\mathbf{Q}(t)\| - G - \sqrt{m}D_2R$ by the triangle inequality for norms.

Now we are ready to present the main proof of Lemma 7. Note that Lemma 12 gives $\big|\|\mathbf{Q}(t+1)\| - \|\mathbf{Q}(t)\|\big|\leq G + \sqrt{m}D_2R$, which further implies that $E[\|\mathbf{Q}(t+t_0)\| - \|\mathbf{Q}(t)\|\,|\mathcal{W}(t-1)]\leq t_0(G+\sqrt{m}D_2R)$ when $\|\mathbf{Q}(t)\| < \theta$. It remains to prove $E[\|\mathbf{Q}(t+t_0)\| - \|\mathbf{Q}(t)\|\,|\mathcal{W}(t-1)]\leq -\frac{\epsilon}{2}t_0$ when $\|\mathbf{Q}(t)\|\geq\theta$. Note that $\|\mathbf{Q}(0)\| = 0 < \theta$.

Fix $t\geq 1$ and consider $\|\mathbf{Q}(t)\|\geq\theta$. Let $\hat{x}\in\mathcal{X}_0$ and $\epsilon > 0$ be as defined in Assumption 2. Note that $E[g_k^t(\hat{x})]\leq -\epsilon, \forall k\in\{1,2,\ldots,m\}, \forall t\in\{1,2,\ldots\}$, since the $\omega(t)$ are i.i.d. samples from the distribution of $\omega$. Since $\hat{x}\in\mathcal{X}_0$, by Lemma 4, for all $\tau\in\{t,t+1,\ldots,t+t_0-1\}$ we have

$$V[\nabla f^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \sum_{k=1}^{m} Q_k(\tau)[\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \alpha\|x(\tau+1)-x(\tau)\|^2 \leq V[\nabla f^\tau(x(\tau))]^T[\hat{x}-x(\tau)] + \sum_{k=1}^{m} Q_k(\tau)[\nabla g_k^\tau(x(\tau))]^T[\hat{x}-x(\tau)] + \alpha[\|\hat{x}-x(\tau)\|^2 - \|\hat{x}-x(\tau+1)\|^2].$$

Adding $\sum_{k=1}^{m} Q_k(\tau)g_k^\tau(x(\tau))$ to both sides and noting that $g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[\hat{x}-x(\tau)]\leq g_k^\tau(\hat{x})$ by convexity yields

$$V[\nabla f^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \sum_{k=1}^{m} Q_k(\tau)\big[g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)]\big] + \alpha\|x(\tau+1)-x(\tau)\|^2 \leq V[\nabla f^\tau(x(\tau))]^T[\hat{x}-x(\tau)] + \sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x}) + \alpha[\|\hat{x}-x(\tau)\|^2 - \|\hat{x}-x(\tau+1)\|^2].$$

Rearranging terms yields

$$\sum_{k=1}^{m} Q_k(\tau)\big[g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)]\big] \leq V[\nabla f^\tau(x(\tau))]^T[\hat{x}-x(\tau)] - V[\nabla f^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \alpha[\|\hat{x}-x(\tau)\|^2 - \|\hat{x}-x(\tau+1)\|^2] - \alpha\|x(\tau+1)-x(\tau)\|^2 + \sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x})$$
$$\leq V[\nabla f^\tau(x(\tau))]^T[\hat{x}-x(\tau+1)] + \alpha[\|\hat{x}-x(\tau)\|^2 - \|\hat{x}-x(\tau+1)\|^2] + \sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x})$$
$$\overset{(a)}{\leq} V\|\nabla f^\tau(x(\tau))\|\|\hat{x}-x(\tau+1)\| + \alpha[\|\hat{x}-x(\tau)\|^2 - \|\hat{x}-x(\tau+1)\|^2] + \sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x})$$
$$\overset{(b)}{\leq} VD_1R + \alpha[\|\hat{x}-x(\tau)\|^2 - \|\hat{x}-x(\tau+1)\|^2] + \sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x}), \qquad (29)$$

where (a) follows from the Cauchy-Schwarz inequality and (b) follows from Assumption 1. (The second step combines the two $V[\nabla f^\tau(x(\tau))]^T[\cdot]$ terms and drops $-\alpha\|x(\tau+1)-x(\tau)\|^2\leq 0$.)

By Lemma 2, for all $\tau\in\{t,t+1,\ldots,t+t_0-1\}$, we have

$$\Delta(\tau) \leq \sum_{k=1}^{m} Q_k(\tau)\big[g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)]\big] + \frac{1}{2}(G+\sqrt{m}D_2R)^2 \overset{(a)}{\leq} VD_1R + \frac{1}{2}(G+\sqrt{m}D_2R)^2 + \alpha[\|\hat{x}-x(\tau)\|^2 - \|\hat{x}-x(\tau+1)\|^2] + \sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x}),$$

where (a) follows from (29). Summing the above inequality over $\tau\in\{t,t+1,\ldots,t+t_0-1\}$, taking expectations conditional on $\mathcal{W}(t-1)$ on both sides, and recalling that $\Delta(\tau) = \frac{1}{2}\|\mathbf{Q}(\tau+1)\|^2 - \frac{1}{2}\|\mathbf{Q}(\tau)\|^2$, yields

$$E[\|\mathbf{Q}(t+t_0)\|^2 - \|\mathbf{Q}(t)\|^2\,|\mathcal{W}(t-1)] \leq 2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha E[\|\hat{x}-x(t)\|^2 - \|\hat{x}-x(t+t_0)\|^2\,|\mathcal{W}(t-1)] + 2\sum_{\tau=t}^{t+t_0-1} E\Big[\sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x})\Big|\mathcal{W}(t-1)\Big]$$
$$\overset{(a)}{\leq} 2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon\sum_{\tau=t}^{t+t_0-1} E[\|\mathbf{Q}(\tau)\|\,|\mathcal{W}(t-1)]$$
$$\overset{(b)}{\leq} 2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon\sum_{\tau=0}^{t_0-1} E[\|\mathbf{Q}(t)\| - \tau(G+\sqrt{m}D_2R)\,|\mathcal{W}(t-1)]$$
$$= 2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon t_0\|\mathbf{Q}(t)\| + \epsilon t_0(t_0-1)(G+\sqrt{m}D_2R)$$
$$\leq 2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon t_0\|\mathbf{Q}(t)\| + \epsilon t_0^2(G+\sqrt{m}D_2R),$$

where (a) follows from $\|\hat{x}-x(t)\|^2 - \|\hat{x}-x(t+t_0)\|^2\leq R^2$ by Assumption 1 and from $E[\sum_{k=1}^{m} Q_k(\tau)g_k^\tau(\hat{x})|\mathcal{W}(t-1)]\leq -\epsilon E[\|\mathbf{Q}(\tau)\|\,|\mathcal{W}(t-1)], \forall\tau\in\{t,t+1,\ldots,t+t_0-1\}$, by Lemma 11; and (b) follows from $\|\mathbf{Q}(\tau)\|\geq\|\mathbf{Q}(t)\| - (\tau-t)(G+\sqrt{m}D_2R)$, which holds by Lemma 12 (note that $\|\mathbf{Q}(t)\|\in\mathcal{W}(t-1)$).

This inequality can be rewritten as

$$E[\|\mathbf{Q}(t+t_0)\|^2\,|\mathcal{W}(t-1)] \leq \|\mathbf{Q}(t)\|^2 - 2\epsilon t_0\|\mathbf{Q}(t)\| + 2VD_1Rt_0 + 2\alpha R^2 + t_0(G+\sqrt{m}D_2R)^2 + \epsilon t_0^2(G+\sqrt{m}D_2R)$$
$$\overset{(a)}{\leq} \|\mathbf{Q}(t)\|^2 - \epsilon t_0\|\mathbf{Q}(t)\| - \epsilon t_0\Big[\frac{\epsilon}{2}t_0 + t_0(G+\sqrt{m}D_2R) + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}\Big] + 2VD_1Rt_0 + 2\alpha R^2 + t_0(G+\sqrt{m}D_2R)^2 + \epsilon t_0^2(G+\sqrt{m}D_2R)$$
$$= \|\mathbf{Q}(t)\|^2 - \epsilon t_0\|\mathbf{Q}(t)\| - \frac{\epsilon^2t_0^2}{2} \leq \Big[\|\mathbf{Q}(t)\| - \frac{\epsilon}{2}t_0\Big]^2,$$

where (a) follows from the hypothesis that $\|\mathbf{Q}(t)\|\geq\theta = \frac{\epsilon}{2}t_0 + t_0(G+\sqrt{m}D_2R) + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}$. Taking the square root on both sides yields

$$\sqrt{E[\|\mathbf{Q}(t+t_0)\|^2\,|\mathcal{W}(t-1)]} \leq \|\mathbf{Q}(t)\| - \frac{\epsilon}{2}t_0.$$

By the concavity of the function $\sqrt{x}$ and Jensen's inequality, we have

$$E[\|\mathbf{Q}(t+t_0)\|\,|\mathcal{W}(t-1)] \leq \sqrt{E[\|\mathbf{Q}(t+t_0)\|^2\,|\mathcal{W}(t-1)]} \leq \|\mathbf{Q}(t)\| - \frac{\epsilon}{2}t_0.$$

APPENDIX C
PROOF OF LEMMA 9

Intuitively, the second term on the right side in the lemma bounds the probability that $|Z(\tau+1)-Z(\tau)| > c$ for some $\tau\in\{0,1,\ldots,t\}$, while the first term on the right side comes from the conventional Hoeffding-Azuma inequality. However, it is unclear whether $Z(t)$ is still a supermartingale conditional on the event that $|Z(\tau+1)-Z(\tau)|\leq c$ for all $\tau\in\{0,1,\ldots,t-1\}$. That is why it is important to have $\{|Z(t+1)-Z(t)| > c\}\subseteq\{Y(t) > 0\}$ with $Y(t)\in\mathcal{F}(t)$, which means the boundedness of $|Z(t+1)-Z(t)|$ can be inferred from another random variable $Y(t)$ that belongs to $\mathcal{F}(t)$. The proof of Lemma 9 uses the truncation method to construct an auxiliary supermartingale. Recall the definition of a stopping time:


Definition 1 ([20]). Let $\{\emptyset,\Omega\} = \mathcal{F}(0)\subseteq\mathcal{F}(1)\subseteq\mathcal{F}(2)\subseteq\cdots$ be a filtration. A discrete random variable $T$ is a stopping time (also known as an optional time) if for any integer $t < \infty$,

$$\{T = t\}\in\mathcal{F}(t),$$

i.e., the event that the stopping time occurs at time $t$ is contained in the information up to time $t$.

The next theorem states that a supermartingale truncated at a stopping time is still a supermartingale.

Theorem 5 (Theorem 5.2.6 in [20]). If the random variable $T$ is a stopping time and $Z(t)$ is a supermartingale, then $Z(t\wedge T)$ is also a supermartingale, where $a\wedge b\triangleq\min\{a,b\}$.

To prove Lemma 9, we first construct a new supermartingale by truncating the original supermartingale at a carefully chosen stopping time such that the new supermartingale has bounded differences. Define the integer random variable $T = \inf\{t\geq 0: Y(t) > 0\}$. That is, $T$ is the first time $t$ at which $Y(t) > 0$ happens. We now show that $T$ is a stopping time and that, defining $\tilde{Z}(t) = Z(t\wedge T)$, we have $\{\tilde{Z}(t)\neq Z(t)\}\subseteq\bigcup_{\tau=0}^{t-1}\{Y(\tau) > 0\}, \forall t\geq 1$, and $\tilde{Z}(t)$ is a supermartingale with differences bounded by $c$.

1) To show $T$ is a stopping time: Note that $\{T = 0\} = \{Y(0) > 0\}\in\mathcal{F}(0)$. Fix an integer $t' > 0$. We have

$$\{T = t'\} = \big\{\inf\{t\geq 0: Y(t) > 0\} = t'\big\} = \Big(\bigcap_{\tau=0}^{t'-1}\{Y(\tau)\leq 0\}\Big)\cap\{Y(t') > 0\} \overset{(a)}{\in} \mathcal{F}(t'),$$

where (a) follows because $\{Y(\tau)\leq 0\}\in\mathcal{F}(\tau)\subseteq\mathcal{F}(t')$ for all $\tau\in\{0,1,\ldots,t'-1\}$ and $\{Y(t') > 0\}\in\mathcal{F}(t')$. It follows that $T$ is a stopping time.

2) To show $\{\tilde{Z}(t)\neq Z(t)\}\subseteq\bigcup_{\tau=0}^{t-1}\{Y(\tau) > 0\}, \forall t\geq 1$: Fix $t = t'\geq 1$. Note that

$$\{\tilde{Z}(t')\neq Z(t')\} \overset{(a)}{\subseteq} \{T < t'\} = \big\{\inf\{t\geq 0: Y(t) > 0\} < t'\big\} \subseteq \bigcup_{\tau=0}^{t'-1}\{Y(\tau) > 0\},$$

where (a) follows by noting that if $T\geq t'$ then $\tilde{Z}(t') = Z(t'\wedge T) = Z(t')$.

3) To show $\tilde{Z}(t)$ is a supermartingale with differences bounded by $c$: Since the random variable $T$ has been shown to be a stopping time, $\tilde{Z}(t) = Z(t\wedge T)$ is a supermartingale by Theorem 5. It remains to show $|\tilde{Z}(t+1)-\tilde{Z}(t)|\leq c, \forall t\geq 0$. Fix an integer $t'\geq 0$. Note that

$$|\tilde{Z}(t'+1)-\tilde{Z}(t')| = |Z(T\wedge(t'+1)) - Z(T\wedge t')| = \big|\mathbf{1}_{\{T\geq t'+1\}}[Z(T\wedge(t'+1)) - Z(T\wedge t')] + \mathbf{1}_{\{T\leq t'\}}[Z(T\wedge(t'+1)) - Z(T\wedge t')]\big| = \big|\mathbf{1}_{\{T\geq t'+1\}}[Z(t'+1)-Z(t')] + \mathbf{1}_{\{T\leq t'\}}[Z(T)-Z(T)]\big| = \mathbf{1}_{\{T\geq t'+1\}}|Z(t'+1)-Z(t')|.$$

Now consider T ≤ t0 and T ≥ t0 + 1 separately. e 0 + 1) − Z(t e 0 )| = 1{T ≥t0 +1} |Z(t0 + 1) − Z(t0 )| = 0 ≤ c. • In the case when T ≤ t0 , it is straightforward that |Z(t  • Consider the case when T ≥ t0 + 1. By the definition of T , we know that {T ≥ t0 + 1} = inf{t ≥ 0 : T0 T0 Y (t) > 0} ≥ t0 + 1 ⊆ tτ =0 {Y (τ ) ≤ 0} ⊆ tτ =0 {|Z(τ + 1) − Z(τ )| ≤ c}, where the last inclusion follows from the fact that {|Z(τ + 1) − Z(τ )| > c} ⊆ {Y (τ ) > 0}. That is, when T ≥ t0 + 1, we must have |Z(τ + 1) − Z(τ )| ≤ c for all τ ∈ {1, . . . , t0 }, which further implies that |Z(t0 + 1) − Z(t0 )| ≤ c. Thus, e 0 + 1) − Z(t e 0 )| = 1{T ≥t0 +1} |Z(t0 + 1) − Z(t0 )| ≤ c. when T ≥ t0 + 1, |Z(t e 0 + 1) − Z(t e 0 )| ≤ c. Combining two cases together proves |Z(t e is a supermartingale with bounded differences c and Z(0) e Since Z(t) = Z(0) = 0, by the conventional HoeffdingAzuma inequality, for any z > 0, we have e ≥ z) ≤ e−z 2 /(2tc2 ) Pr(Z(t)

(30)

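Before completing the proof, here is a minimal numerical sketch (not from the paper; the step distribution, flag probability p, and all constants are illustrative) of the truncation device: rare "flagged" slots, i.e., {Y(t) > 0}, are exactly where the next difference could exceed c, and stopping at the first flag restores bounded differences so that (30) applies to the truncated process.

import math, random

random.seed(2)
c, p, t_end, z, runs = 1.0, 0.001, 100, 15.0, 50_000
exceed = flagged_runs = 0
for _ in range(runs):
    Z, hit = 0.0, False
    for t in range(t_end):
        if random.random() < p:      # event {Y(t) > 0}: next step could exceed c
            hit = True               # truncate: T = t, so Z~ freezes at Z(T)
            break
        Z += random.uniform(-c, c)   # ordinary bounded step, |dZ| <= c
    exceed += Z >= z
    flagged_runs += hit

azuma = math.exp(-z ** 2 / (2 * t_end * c ** 2))
print(f"Pr(Z~ >= {z}) ~= {exceed / runs:.5f}  vs Azuma bound {azuma:.5f}")
print(f"Pr(some Y(tau) > 0) ~= {flagged_runs / runs:.5f}, union bound = {t_end * p:.3f}")

With these values the empirical tail sits well below the Azuma bound, and the frequency of truncation is controlled by the union-bound term, mirroring the decomposition carried out next.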

Finally, we have

\[
\begin{aligned}
\Pr(Z(t) \ge z) &= \Pr(\tilde{Z}(t) = Z(t),\, Z(t) \ge z) + \Pr(\tilde{Z}(t) \ne Z(t),\, Z(t) \ge z) \\
&\le \Pr(\tilde{Z}(t) \ge z) + \Pr(\tilde{Z}(t) \ne Z(t)) \\
&\overset{(a)}{\le} e^{-z^2/(2tc^2)} + \Pr\Big(\bigcup_{\tau=0}^{t-1}\{Y(\tau) > 0\}\Big) \\
&\overset{(b)}{\le} e^{-z^2/(2tc^2)} + \sum_{\tau=0}^{t-1}p(\tau),
\end{aligned}
\]

where (a) follows from equation (30) and part 2) above; and (b) follows from the union bound and the hypothesis that Pr(Y(τ) > 0) ≤ p(τ), ∀τ.

APPENDIX D
PROOF OF THEOREM 4

Define Z(t) = Σ_{τ=1}^{t} Σ_{k=1}^{m} Qk(τ)gk^τ(x*). Recall W(t) = σ(ω(1), . . . , ω(t)). The next lemma shows that Z(t) satisfies Lemma 9 with F(t) = W(t) and Y(t) = ‖Q(t + 1)‖ − c/G.

Lemma 13. Let x* ∈ X0 be any fixed solution that satisfies g̃(x*) ≤ 0, e.g., x* = argmin_{x∈X} Σ_{t=1}^{T} f^t(x). Under Algorithm 1, if we define Z(0) = 0 and Z(t) = Σ_{τ=1}^{t} Σ_{k=1}^{m} Qk(τ)gk^τ(x*), ∀t ≥ 1, then {Z(t), t ≥ 0} is a supermartingale adapted to the filtration {W(t), t ≥ 0} such that {|Z(t + 1) − Z(t)| > c} ⊆ {Y(t) > 0}, ∀t ≥ 0, where Y(t) = ‖Q(t + 1)‖ − c/G is a random variable adapted to W(t).

Proof. It is easy to see that {Z(t), t ≥ 0} is adapted to {W(t), t ≥ 0}. It remains to show that {Z(t), t ≥ 0} is a supermartingale. Note that Z(t + 1) = Z(t) + Σ_{k=1}^{m} Qk(t + 1)gk^{t+1}(x*) and

\[
\begin{aligned}
\mathbb{E}[Z(t+1) \mid \mathcal{W}(t)] &= \mathbb{E}\Big[Z(t) + \sum_{k=1}^{m}Q_k(t+1)g_k^{t+1}(x^{*}) \,\Big|\, \mathcal{W}(t)\Big] \\
&\overset{(a)}{=} Z(t) + \sum_{k=1}^{m}Q_k(t+1)\,\mathbb{E}[g_k^{t+1}(x^{*})] \\
&\overset{(b)}{\le} Z(t),
\end{aligned}
\]

where (a) follows from the fact that Z(t) ∈ W(t), Q(t + 1) ∈ W(t) and g^{t+1}(x*) is independent of W(t); and (b) follows from E[gk^{t+1}(x*)] = g̃k(x*) ≤ 0, which in turn follows because the ω(t) are i.i.d. samples. Thus, {Z(t), t ≥ 0} is a supermartingale.

We further note that

\[
|Z(t+1) - Z(t)| = \Big|\sum_{k=1}^{m}Q_k(t+1)g_k^{t+1}(x^{*})\Big| \overset{(a)}{\le} \|Q(t+1)\|\,G,
\]

where (a) follows from the Cauchy-Schwarz inequality and the assumption that ‖g^t(x*)‖ ≤ G. This implies that if |Z(t + 1) − Z(t)| > c, then ‖Q(t + 1)‖ > c/G. Thus, {|Z(t + 1) − Z(t)| > c} ⊆ {‖Q(t + 1)‖ > c/G}. Since Q(t + 1) is adapted to W(t), it follows that Y(t) = ‖Q(t + 1)‖ − c/G is a random variable adapted to W(t).
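The Cauchy-Schwarz step above is elementary but load-bearing, since it is the only place the bound ‖g^t(x*)‖ ≤ G enters. A quick numerical check (illustrative values only, with ‖g‖ normalized to exactly G) follows.

import math, random

random.seed(1)
m, G = 5, 2.0
Q = [random.uniform(0.0, 10.0) for _ in range(m)]   # stand-in for Q(t+1)
g = [random.uniform(-1.0, 1.0) for _ in range(m)]   # stand-in for g^{t+1}(x*)
norm = math.sqrt(sum(v * v for v in g))
g = [v * G / norm for v in g]                       # scale so that ||g|| = G

dZ = abs(sum(q * v for q, v in zip(Q, g)))          # |Z(t+1) - Z(t)|
bound = math.sqrt(sum(q * q for q in Q)) * G        # ||Q(t+1)|| * G
print(dZ <= bound)                                  # Cauchy-Schwarz: always True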

By Lemma 13, Z(t) satisfies Lemma 9. Fix T ≥ 1. Lemma 9 implies that

\[
\Pr\Big(\sum_{t=1}^{T}\sum_{k=1}^{m}Q_k(t)g_k^{t}(x^{*}) \ge \gamma\Big) \le \underbrace{e^{-\gamma^2/(2Tc^2)}}_{(\mathrm{I})} + \underbrace{\sum_{t=0}^{T-1}\Pr\Big(\|Q(t+1)\| > \frac{c}{G}\Big)}_{(\mathrm{II})}. \tag{31}
\]

Fix 0 < λ < 1. In the following, we shall choose γ and c such that both term (I) and term (II) in (31) are no larger than λ/2.


e = kQ(t)k satisfies the conditions in Lemma 5. To guarantee term Recall that by Lemma 7, random process Z(t) λ (II) is no lareger than 2 , it suffices to choose c such that c λ Pr(kQ(t)k > ) < , ∀t ∈ {1, 2, . . . , T } G 2T √ λ By part (2) of Lemma√5 (with µ = 2T ), the above inequality holds if we choose c = t0 2 G + t0 (G + mD2 R)G + √ 2 2 2V D1 R+(G+ mD2 R) 2 R) 2αR2 G + t0 BG + t0 8(G+ mD log( 2T t0  G +   λ )G where √ √ 32(G + mD2 R)2 /[8(G+√mD2 R)] 8(G + mD2 R)2 e log[1 + ] B=  2 is an absolute constant irrelevant the algorithm parameters and t0 > 0 is an arbitrary integer. √ Once c is chosen, we further need to choose γ such that term (I) in (31) is λ2 . It follows that if γ√= T log0.5 ( λ1 )c = √ √ 2 √ 2 2V D1 R+(G+ mD2 R)2 2 R) T log0.5 ( λ1 )[ 2 t0 G + t0 (G + mD2 R)G + 2αR G + t0 BG + t0 8(G+ mD log( 2T t0  G +   λ )G], then T X m X Pr( Qk (t)gkt (x∗ ) ≥ γ) ≤ λ, t=1 k=1

or equivalently, m T X X Qk (t)gkt (x∗ ) < γ) ≥ 1 − λ. Pr(

(32)

t=1 k=1



√   0.5 1 1.5 1 Note that if we take t = d T e , V = T and α = T , then γ = O T log(T ) log ( ) + O T log ( ) = 0 λ λ  O T log(T ) log1.5 ( λ1 ) . √ By Lemma 8 (with z = x∗ , V = T and α = T ), we have T X t=1

f t (x(t)) ≤

T X

f t (x∗ ) +

t=1



T R2 +

m T √  √ D12 √ 1 1 XX Qk (t)gkt (x∗ ) T + (G + mD2 R)2 T + √ 4 2 T t=1 k=1

(33)

Substituting (32) into (33) yields

\[
\Pr\Big(\sum_{t=1}^{T}f^{t}(x(t)) \le \sum_{t=1}^{T}f^{t}(x^{*}) + O\big(\sqrt{T}\log(T)\log^{1.5}(\tfrac{1}{\lambda})\big)\Big) \ge 1 - \lambda.
\]
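As a sanity check on the orders claimed above, the following sketch evaluates c and γ using the expressions as reconstructed in this appendix; the problem constants G, m, D2, R, D1, ε and the confidence λ are arbitrary illustrative values. The regret contribution γ/√T indeed tracks √T log T.

import math

G, m, D2, R, D1, eps, lam = 2.0, 5, 1.0, 1.0, 1.0, 0.5, 0.01
delta = G + math.sqrt(m) * D2 * R
B = (8 * delta**2 / eps) * math.log(1 + (32 * delta**2 / eps**2)
                                    * math.exp(eps / (8 * delta)))

for T in [10**3, 10**4, 10**5]:
    t0, V, alpha = math.ceil(math.sqrt(T)), math.sqrt(T), T
    c = G * (eps * t0 / 2 + t0 * delta
             + (2 * V * D1 * R + delta**2) / eps
             + 2 * alpha * R**2 / (t0 * eps)
             + t0 * B
             + t0 * (8 * delta**2 / eps) * math.log(2 * T / lam))
    gamma = math.sqrt(2 * T) * math.sqrt(math.log(2 / lam)) * c
    print(f"T={T:>6}: gamma/sqrt(T) = {gamma / math.sqrt(T):.3e}  "
          f"(compare sqrt(T)*log(T) = {math.sqrt(T) * math.log(T):.3e})")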

APPENDIX E
MORE EXPERIMENT DETAILS

In the experiment, we assume the job arrivals ω(t) are Poisson distributed with mean 1000 jobs/slot. For simplicity, assume each server is restricted to choose power xi(t) ∈ [0, 30] at each round and the service rate satisfies hi(xi(t)) = 4 log(1 + 4xi(t)). (Note that our algorithm can easily deal with general concave functions hi(·), and each server in general can have a different hi(·).) The simulation duration is 2160 slots (corresponding to 10 days). The three baselines are further elaborated below; a hedged code sketch of the two heuristic baselines follows the list.
• Best fixed decision in hindsight: Assume all the electricity price traces and the job arrival distribution are known beforehand. The decision maker chooses a fixed power decision vector p* that is optimal based on the data from all 2160 slots.
• React algorithm: This algorithm is developed in [29]. The algorithm reacts to the current traffic and splits the load evenly among the servers to support the arrivals. Since the instantaneous job arrivals are unknown at the current slot, we use the average of job arrivals over the most recent 5 slots as an estimate. Since this algorithm is designed to meet the time-varying job arrivals but is unaware of electricity price variations, its electricity cost is high, as observed in our simulation results.
• Low-power algorithm: This algorithm is adapted from [30] and always schedules jobs to servers in the zones with the lowest electricity price. Since the instantaneous electricity prices are unknown at the current slot, we use the average of electricity prices over the most recent 5 slots at each server as an estimate. Since each server has a finite service capacity (xi(t) ∈ [0, 30]), this algorithm is not guaranteed to serve all job arrivals. Thus, the number of unserved jobs can eventually pile up.
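The sketch below shows one way the two heuristic baselines could be implemented; it is not the authors' code. It assumes simple list-based traces, where prices[t][i] is the electricity price at server i in slot t and arrivals[t] is the number of job arrivals in slot t, and the helper names (react_power, low_power, h_inv) are ours. The 5-slot moving-average estimates and the service curve h(x) = 4 log(1 + 4x) follow the description above.

import math

X_MAX = 30.0                                   # per-server power cap

def h(x):                                      # concave service-rate curve
    return 4.0 * math.log(1.0 + 4.0 * x)

def h_inv(rate):                               # power needed for a target rate
    return (math.exp(rate / 4.0) - 1.0) / 4.0

def react_power(arrivals, t, n_servers):
    # React baseline: estimate arrivals by a 5-slot average, split evenly.
    recent = arrivals[max(0, t - 5):t] or [0.0]
    est = sum(recent) / len(recent)
    x = min(h_inv(est / n_servers), X_MAX)
    return [x] * n_servers

def low_power(prices, t, n_servers, target_rate):
    # Low-power baseline: load the servers with the cheapest estimated price.
    est = [sum(p[i] for p in prices[max(0, t - 5):t]) / max(1, min(t, 5))
           for i in range(n_servers)]
    x = [0.0] * n_servers
    remaining = target_rate
    for i in sorted(range(n_servers), key=lambda i: est[i]):
        served = min(remaining, h(X_MAX))      # finite capacity per server
        x[i] = h_inv(served)
        remaining -= served
        if remaining <= 0:
            break
    return x                                   # if remaining > 0, jobs pile up

Either routine returns a power vector for the slot; in low_power, any load left over after all cheap capacity is exhausted goes unserved, matching the pile-up behavior noted in the last bullet.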


REFERENCES

[1] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth, "Worst-case quadratic loss bounds for prediction using linear functions and gradient descent," IEEE Transactions on Neural Networks, vol. 7, no. 3, pp. 604–619, 1996.
[2] J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Information and Computation, vol. 132, no. 1, pp. 1–63, 1997.
[3] G. J. Gordon, "Regret bounds for prediction problems," in Proceedings of Conference on Learning Theory (COLT), 1999.
[4] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proceedings of International Conference on Machine Learning (ICML), 2003.
[5] S. Shalev-Shwartz, "Online learning and online convex optimization," Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
[6] E. Hazan, "Introduction to online convex optimization," Foundations and Trends in Optimization, vol. 2, no. 3–4, pp. 157–325, 2016.
[7] E. Hazan, A. Agarwal, and S. Kale, "Logarithmic regret algorithms for online convex optimization," Machine Learning, vol. 69, pp. 169–192, 2007.
[8] S. Mannor, J. N. Tsitsiklis, and J. Y. Yu, "Online learning with sample path constraints," Journal of Machine Learning Research, vol. 10, pp. 569–590, March 2009.
[9] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005.
[10] M. Mahdavi, R. Jin, and T. Yang, "Trading regret for efficiency: online convex optimization with long term constraints," Journal of Machine Learning Research, vol. 13, no. 1, pp. 2503–2528, 2012.
[11] A. Cotter, M. Gupta, and J. Pfeifer, "A light touch for heavily constrained SGD," in Proceedings of Conference on Learning Theory (COLT), 2015.
[12] R. Jenatton, J. Huang, and C. Archambeau, "Adaptive algorithms for online convex optimization with long-term constraints," in Proceedings of International Conference on Machine Learning (ICML), 2016.
[13] M. Mahdavi, T. Yang, and R. Jin, "Stochastic convex optimization with multiple objectives," in Advances in Neural Information Processing Systems (NIPS), 2013.
[14] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2004.
[15] H. Yu and M. J. Neely, "A low complexity algorithm with O(√T) regret and finite constraint violations for online convex optimization with long term constraints," arXiv:1604.02218, 2016.
[16] M. J. Neely and H. Yu, "Online convex optimization with time-varying constraints," arXiv:1702.04783, 2017.
[17] G. Lan and Z. Zhou, "Algorithms for stochastic optimization with expectation constraints," arXiv:1604.03887, 2016.
[18] A. Nedić and A. Ozdaglar, "Subgradient methods for saddle-point problems," Journal of Optimization Theory and Applications, vol. 142, no. 1, pp. 205–228, 2009.
[19] H. Yu and M. J. Neely, "A simple parallel algorithm with an O(1/t) convergence rate for general convex programs," SIAM Journal on Optimization, vol. 27, no. 2, pp. 759–783, 2017.
[20] R. Durrett, Probability: Theory and Examples. Cambridge University Press, 2010.
[21] J. L. Doob, Stochastic Processes. Wiley, New York, 1953.
[22] B. Hajek, "Hitting-time and occupation-time bounds implied by drift analysis with applications," Advances in Applied Probability, vol. 14, no. 3, pp. 502–525, 1982.
[23] M. J. Neely, "Energy-aware wireless scheduling with near optimal backlog and convergence time tradeoffs," in Proceedings of IEEE International Conference on Computer Communications (INFOCOM), 2015.
[24] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[25] P. L. Bartlett, V. Dani, T. Hayes, S. Kakade, A. Rakhlin, and A. Tewari, "High-probability regret bounds for bandit online linear optimization," in Proceedings of Conference on Learning Theory (COLT), 2008.
[26] V. Vu, "Concentration of non-Lipschitz functions and applications," Random Structures & Algorithms, vol. 20, no. 3, pp. 262–316, 2002.
[27] T. Tao and V. Vu, "Random matrices: universality of local spectral statistics of non-Hermitian matrices," The Annals of Probability, vol. 43, no. 2, pp. 782–874, 2015.
[28] New York ISO open access pricing data. http://www.nyiso.com/.
[29] A. Gandhi, M. Harchol-Balter, and M. A. Kozuch, "Are sleep states effective in data centers?" in International Green Computing Conference (IGCC), 2012.
[30] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, "Cutting the electric bill for internet-scale systems," in ACM SIGCOMM, 2009.