A Stochastic Newton Method for Stochastic Quadratic Programs with Recourse


John R. Birge
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, USA

Xiaojun Chen, Liqun Qi and Zengxin Wei
School of Mathematics, University of New South Wales, Sydney NSW 2052, Australia

(September 22, 1995)

Abstract

In this paper we combine the inexact Newton method with the stochastic decomposition method and present a stochastic Newton method for solving two-stage stochastic programs. We prove that the new method is globally convergent and that, if in addition an integral approximation error bound condition holds, the convergence is superlinear (with probability one for random integration rules). The error bound depends only on the integration method at the current point.

Keywords: Stochastic Newton method, stochastic quadratic programming.

1. Introduction

Let $P \in \mathbb{R}^{n\times n}$ be symmetric positive semidefinite and $H \in \mathbb{R}^{m\times m}$ be symmetric positive definite. We consider the two-stage stochastic quadratic program with fixed recourse:
$$\min_{x\in\mathbb{R}^n}\ \phi(x)=\tfrac12 x^TPx+c^Tx+\Phi(x)\quad\mbox{subject to}\quad Ax\le b, \eqno(1.1)$$
where
$$\Phi(x)=\int_{\mathbb{R}^m}\psi(\omega-Tx)\,\rho(\omega)\,d\omega$$
and
$$\psi(\omega-Tx)=\max_{z\in\mathbb{R}^m}\ -\tfrac12 z^THz+z^T(\omega-Tx)\quad\mbox{subject to}\quad Wz\le q.$$

This work is supported by the National Science Foundation under Grant DDM-9215921 and the Australian Research Council.


Here $c\in\mathbb{R}^n$, $A\in\mathbb{R}^{r\times n}$, $b\in\mathbb{R}^r$, $T\in\mathbb{R}^{m\times n}$, $q\in\mathbb{R}^{m_1}$ and $W\in\mathbb{R}^{m_1\times m}$ are fixed matrices or vectors, $\omega\in\mathbb{R}^m$ is a random vector and $\rho:\mathbb{R}^m\to\mathbb{R}_+$ is a continuously differentiable probability density function. The variable $x$ is the first-stage decision vector, $\frac12 x^TPx+c^Tx$ is the quadratic cost of the decision $x$, $z$ is the second-stage shadow price vector, $z^T(\omega-Tx)$ represents the penalty cost of the discrepancy between the random demand $\omega$ and the actual supply $Tx$, $-\frac12 z^THz$ is the quadratic term caused by $z$, and $\Phi(x)$ is the expectation of the penalty cost. The task of the decision maker is to minimize the overall sum of the direct cost and the expectation of the penalty cost. The quadratic terms make the problem smoother and more amenable to solution than problems with linear terms alone. For the formulation of such a problem, see the papers [19], [20], [24] and [18].

Let $X=\{x\in\mathbb{R}^n \mid Ax\le b\}$ and $Z=\{z\in\mathbb{R}^m \mid Wz\le q\}$ be nonempty polyhedra. With this assumption, in Section 2 we discuss the differentiability properties of the stochastic quadratic program (1.1). We show that $\Phi$, and hence the objective function $\phi$, is twice differentiable if the probability density function is smooth. Hence, theoretically, we may apply the classical Newton method (the sequential quadratic programming method) to solve (1.1) with a quadratic convergence rate, i.e.,

$$\|y^{k+1}-y^*\|\le\beta\|y^k-y^*\|^2,$$

where $y^*=(x^*,\lambda^*)\in\mathbb{R}^{n+r}$, $\lambda^*$ is the Lagrange multiplier of (1.1), $x^*$ is the optimal solution of (1.1), and $y^k$ is the $k$th iteration point. However, this approach is unrealistic, since the calculations of $\Phi(x)$, its derivative and its Hessian involve high-dimensional multiple integrals, which cannot be evaluated exactly. In Section 3 we use approximate values of $\Phi(x)$, its derivative and its Hessian; this results in our inexact Newton method. We show that it converges superlinearly (in the case of random rules, with probability one (wp1)) under some mild conditions. Furthermore, we have the relation
$$\|x^{k+1}-x^*\|\le o(\|x^k-x^*\|)+O(h(N_k)),\ (wp1),$$
where $N_k$ is the number of points in an integration rule and $h(N_k)$ is the error bound of the integration rule. This approach may have some advantages over previous rules by allowing relatively easy error checks, but it requires solving an increasing number of quadratic programs at each step. In addition, a new line search procedure based on [6] is needed for global convergence, because of the cost of calculating function values by multiple integrations. In Section 4 we extend the stochastic decomposition method of Higle and Sen [7] for the stochastic linear program to the stochastic quadratic program. This method

has global convergence with probability one. To obtain a faster method, we combine the inexact Newton method with the stochastic decomposition method in Section 5 and establish its superlinear convergence with probability one under the condition $h(N_k)=o(\|d^k\|)$.

2. Classical Newton method

Since $H$ is symmetric positive definite, the objective function $\phi(x)=\frac12 x^TPx+c^Tx+\Phi(x)$ in (1.1) is convex and has a continuous first-order derivative in $\mathbb{R}^n$:
$$F(x):=\nabla\phi(x)=Px+c-T^T\int_{\mathbb{R}^m}z(\omega-Tx)\,\rho(\omega)\,d\omega,$$
with
$$z(\omega-Tx)=\arg\max\{-\tfrac12 z^THz+z^T(\omega-Tx)\mid z\in Z\}=H^{-\frac12}\Pi_S\big(H^{-\frac12}(\omega-Tx)\big),$$
where $\Pi_S(u)$ denotes the projection of a point $u\in\mathbb{R}^m$ onto the polyhedron
$$S=\{s\mid s=H^{\frac12}z,\ z\in Z\}=\{s\in\mathbb{R}^m\mid WH^{-\frac12}s\le q\}.$$
By Theorem 2.1 in [3], $z(\omega-Tx)$ is globally Lipschitz in $\mathbb{R}^m$.
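For illustration, the projection formula can be made fully explicit when $H$ is diagonal and $Z$ is a box (simplifying assumptions not made in the paper): then $S$ is also a box and $\Pi_S$ is a componentwise clip. A minimal sketch with toy data:

```python
import math

def second_stage_z(u, h, lo, hi):
    """Solve max_z -0.5 z^T H z + z^T u over the box lo <= z_i <= hi for
    diagonal H = diag(h), via z = H^{-1/2} Pi_S(H^{-1/2} u), where
    S = H^{1/2} Z is the box [sqrt(h_i)*lo, sqrt(h_i)*hi]."""
    z = []
    for ui, hii in zip(u, h):
        r = math.sqrt(hii)
        s = ui / r                              # H^{-1/2} u
        s = min(max(s, r * lo), r * hi)         # clip onto the box S
        z.append(s / r)                         # back-transform by H^{-1/2}
    return z

# For diagonal H this reduces to clip(u_i/h_i, lo, hi):
print(second_stage_z([4.0, 1.0, -3.0], [2.0, 2.0, 2.0], 0.0, 1.0))
```

The clip reproduces the closed-form maximizer clip$(u_i/h_i, lo, hi)$ of the separable concave quadratic, which is the point of the $H^{-1/2}\Pi_S H^{-1/2}$ factorization.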

Theorem 2.1. Suppose that the set $Z$ is bounded. Then
$$F(x)=Px+c-T^T\int_{\mathbb{R}^m}z(\omega-Tx)\,\rho(\omega)\,d\omega$$
is continuously differentiable in $\mathbb{R}^n$ and
$$\nabla F(x)=P-T^T\Big(\int_{\mathbb{R}^m}z(\eta)\,\nabla\rho(\eta+Tx)^T\,d\eta\Big)T,$$
where $\eta=\omega-Tx$.

Proof. Since $Z$ is bounded, $z:\mathbb{R}^m\to\mathbb{R}^m$ is bounded and uniformly continuous. The probability density function $\rho$ is a kernel function [21], because $\rho:\mathbb{R}^m\to\mathbb{R}_+$ satisfies
$$\int_{\mathbb{R}^m}\rho(\omega)\,d\omega=1.$$


Therefore, substituting $\eta=\omega-Tx$,
$$F(x)=Px+c-T^T\int_{\mathbb{R}^m}z(\omega-Tx)\,\rho(\omega)\,d\omega=Px+c-T^T\int_{\mathbb{R}^m}z(\eta)\,\rho(\eta+Tx)\,d\eta$$
is continuously differentiable and
$$\nabla F(x)=P-T^T\Big(\int_{\mathbb{R}^m}z(\eta)\,\nabla\rho(\eta+Tx)^T\,d\eta\Big)T.$$
The proof is complete.

Newton's method for (1.1) can be described as follows:

Algorithm 2.1 (Newton's method). Find a solution $d^k$ of the quadratic program
$$\min_d\ F(x^k)^Td+\tfrac12 d^T\nabla F(x^k)d\quad\mbox{subject to}\quad A(x^k+d)\le b,$$
and let $x^{k+1}=x^k+d^k$.

Theoretically, the Newton method converges to a solution $x^*$ of (1.1) quadratically if $\nabla F(x^*)$ is nonsingular at the solution $x^*$; see [12], [14], [16] and [17]. In general, however, $F(x^k)$ and $\nabla F(x^k)$ cannot be calculated exactly in (1.1) because of the multidimensional integral.
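When no constraint of $Ax\le b$ is active, the step above reduces to the classical Newton step $d^k=-\nabla F(x^k)^{-1}F(x^k)$. A one-dimensional sketch of that iteration (the smooth convex function below is a stand-in chosen for illustration, not the recourse objective):

```python
import math

def newton_1d(F, dF, x, iters=8):
    """Classical Newton iteration x <- x - F(x)/F'(x) for F(x*) = 0."""
    for _ in range(iters):
        x = x - F(x) / dF(x)
    return x

# Minimize phi(x) = x^2/2 + log(cosh x):
#   F = phi' = x + tanh x,  F' = 2 - tanh(x)^2,  minimizer x* = 0.
F = lambda x: x + math.tanh(x)
dF = lambda x: 2.0 - math.tanh(x) ** 2
x = newton_1d(F, dF, 1.0)
print(abs(x))  # quadratic convergence drives the error to essentially zero
```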

3. Inexact Newton method

In this section we construct a sequence of approximate solutions $x^k$, $k=0,1,\ldots$, using estimates $\phi_{N_k}(x)$, $F_{N_k}(x)$ and $V_{N_k}(x)$ of $\phi(x)$, $F(x)$ and $\nabla F(x)$, respectively. The functions $\phi_{N_k}$, $F_{N_k}$ and $V_{N_k}$ are, with probability one, close to $\phi$, $F$ and $\nabla F$ at $x^k$. Using $F_{N_k}$ and $V_{N_k}$, we construct an inexact Newton method. This algorithm attains superlinear local convergence which, in the case of random integration rules, holds with probability one.

We use a transformation function $\omega=p(\tilde\omega)$ to pass from an integral over $\mathbb{R}^m$ to an integral over the unit cube $I^m$ and rewrite $\Phi(x)$ as
$$\Phi(x)=\int_{\mathbb{R}^m}\psi(\omega-Tx)\,\rho(\omega)\,d\omega=\int_{I^m}\psi(p(\tilde\omega)-Tx)\,\rho(p(\tilde\omega))\,p'(\tilde\omega)\,d\tilde\omega=\int_{I^m}\psi(p(\tilde\omega)-Tx)\,\bar\rho(\tilde\omega)\,d\tilde\omega,$$
where
$$\bar\rho(\tilde\omega)=\rho(p(\tilde\omega))\,p'(\tilde\omega).$$

Such transformation functions can be found in [3] and [4]. Similarly, we can let $\hat\rho:I^m\to\mathbb{R}^m$ be a bounded vector function such that
$$\int_{\mathbb{R}^m}z(\eta)\,\nabla\rho(\eta+Tx)^T\,d\eta=\int_{I^m}z(p(\tilde\omega)-Tx)\,\hat\rho(\tilde\omega)^T\,d\tilde\omega.$$

To simplify notation, and without confusion, we use $\omega$ to denote $\tilde\omega$ in the following discussion. Let $\{N_k\}_{k=1}^\infty$ be an integer sequence satisfying $1\le N_1\le\cdots\le N_k\le N_{k+1}\le\cdots$ and $N_k\to\infty$ as $k\to\infty$. Generate observations $\{\omega^i,\ i=1,\ldots,N_k\}$ on the unit hypercube according to an integration rule. This choice may be random (Monte Carlo) or deterministic, as in quasi-random sequences (see [13]). Let
$$\phi_{N_k}(x)=\tfrac12 x^TPx+c^Tx+\frac1{N_k}\sum_{i=1}^{N_k}\psi(p(\omega^i)-Tx)\,\bar\rho(\omega^i),$$
$$F_{N_k}(x)=Px+c-\frac1{N_k}T^T\sum_{i=1}^{N_k}z(p(\omega^i)-Tx)\,\bar\rho(\omega^i)$$
and
$$V_{N_k}(x)=P-\frac1{N_k}T^T\sum_{i=1}^{N_k}z(p(\omega^i)-Tx)\,\hat\rho(\omega^i)^T\,T,$$
where
$$z(p(\omega^i)-Tx)=\arg\max\{-\tfrac12 z^THz+z^T(p(\omega^i)-Tx)\mid z\in Z\}.$$

Theorem 3.1. Suppose that the condition of Theorem 2.1 holds and let $N_k\to\infty$ as $k\to\infty$. Then, for any $x\in\mathbb{R}^n$,
$$\phi(x)=\lim_{k\to\infty}\phi_{N_k}(x),\ (wp1), \eqno(3.1)$$
$$F(x)=\lim_{k\to\infty}F_{N_k}(x),\ (wp1), \eqno(3.2)$$
and
$$\nabla F(x)=\lim_{k\to\infty}V_{N_k}(x),\ (wp1), \eqno(3.3)$$
where the qualifier (wp1) is not necessary for deterministic integration rules.

Proof. The equalities (3.1)-(3.3) follow from the fact that $\psi(\omega-Tx)$, $z(\omega-Tx)$ and $z(\eta)\nabla\rho(\eta+Tx)^T$ are continuous in $\mathbb{R}^m$ for any fixed $x\in\mathbb{R}^n$; see Theorem 2.13 in [13].

We may assume that for each $x\in X$ and each $\varepsilon>0$ there exists a positive integer $N(x,\varepsilon)$ such that, for all $N\ge N(x,\varepsilon)$,
$$|\phi(x)-\phi_N(x)|\le\varepsilon, \eqno(3.4)$$
with probability one in the Monte Carlo case. Some quasi-Monte Carlo methods, such as lattice methods, provide an error bound $|\phi(x)-\phi_{N(x,\varepsilon)}(x)|=O(N(x,\varepsilon)^{-2})$ (see [13] and [23]), so in theory we can find an exact value of $N(x,\varepsilon)$. In the following we assume a deterministic integration rule; all results in this section hold, however, for Monte Carlo integration with the additional wp1 qualification.
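For intuition, the sample-average estimate of $\phi$ can be checked on a one-dimensional toy instance where $\Phi$ has a closed form. Take $m=1$, $H=1$, $Z=[0,1]$, $\omega$ uniform on $I=[0,1]$ (so $p$ is the identity and $\bar\rho\equiv1$), $P=0$, $c=0$ and $Tx=0$; then $\psi(\omega)=\omega^2/2$ on $[0,1]$ and $\Phi(x)=1/6$. A sketch (all instance data are assumptions chosen for illustration):

```python
import random

def psi(t):
    """psi(t) = max_{0<=z<=1} -z^2/2 + z*t, in closed form."""
    z = min(max(t, 0.0), 1.0)     # unconstrained maximizer t, clipped to Z = [0,1]
    return -0.5 * z * z + z * t

def phi_N(N, Tx=0.0, seed=0):
    """Monte Carlo estimate of Phi(x) = int_0^1 psi(w - Tx) dw."""
    rng = random.Random(seed)
    return sum(psi(rng.random() - Tx) for _ in range(N)) / N

est = phi_N(20000)
print(est)  # close to the exact value 1/6 = 0.1666...
```

With a quasi-random rule in place of `random.Random`, the same estimator would converge at the faster rates quoted for lattice methods.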

Algorithm 3.1 (Inexact Newton method)

Step 0: Choose $\sigma\in(0,\frac12)$, $\beta\in(0,1)$, a sequence $\{\delta_k>0 : k=1,2,\ldots\}$ satisfying $\sum_{k=1}^\infty\delta_k<+\infty$, tolerances $\varepsilon_k\in(0,\frac12\delta_k)$ with $\varepsilon_{k+1}<\delta_k-\varepsilon_k$, and regularization parameters $\alpha_k>0$ with $\alpha_k\to0$. Set $k:=1$ and pick $x^1\in X$.

Step 1: Find the solution $d^k$ of the quadratic program $(QP_k)$:
$$\min_d\ F_{N(x^k,\varepsilon_k)}(x^k)^Td+\tfrac12 d^T\big(V_{N(x^k,\varepsilon_k)}(x^k)+\alpha_kI\big)d\quad\mbox{subject to}\quad A(x^k+d)\le b.$$

Step 2: Let $t_k=\beta^{j_k}$, where $j_k$ is the smallest nonnegative integer $j$ such that
$$\phi_{N(x^k+\beta^jd^k,\varepsilon_{k+1})}(x^k+\beta^jd^k)-\phi_{N(x^k,\varepsilon_k)}(x^k)\le\sigma\beta^jF_{N(x^k,\varepsilon_k)}(x^k)^Td^k+\delta_k.$$

Step 3: Set $x^{k+1}=x^k+t_kd^k$, $k:=k+1$, and go to Step 1.

Lemma 3.1. For each $k$, the line search in Step 2 of Algorithm 3.1 terminates with a finite $j_k$.

Proof. Suppose, on the contrary, that for some $k$ and every nonnegative integer $j$,
$$\phi_{N(x^k+\beta^jd^k,\varepsilon_{k+1})}(x^k+\beta^jd^k)-\phi_{N(x^k,\varepsilon_k)}(x^k)>\sigma\beta^jF_{N(x^k,\varepsilon_k)}(x^k)^Td^k+\delta_k.$$
Using the definition of $\phi_{N(x,\varepsilon)}(x)$, we have
$$\phi_{N(x^k+\beta^jd^k,\varepsilon_{k+1})}(x^k+\beta^jd^k)\le\phi(x^k+\beta^jd^k)+\varepsilon_{k+1}\quad\mbox{and}\quad \phi_{N(x^k,\varepsilon_k)}(x^k)\ge\phi(x^k)-\varepsilon_k.$$
Thus
$$\phi(x^k+\beta^jd^k)-\phi(x^k)>\sigma\beta^jF_{N(x^k,\varepsilon_k)}(x^k)^Td^k+(\delta_k-\varepsilon_k-\varepsilon_{k+1}).$$
Passing to the limit $j\to+\infty$ and using the continuity of $\phi$, we obtain $0\ge\delta_k-(\varepsilon_k+\varepsilon_{k+1})$, which contradicts the choice of $\varepsilon_{k+1}<\delta_k-\varepsilon_k$.

Lemma 3.1 indicates that Algorithm 3.1 is well defined. For the convergence of algorithms for stochastic programs, see [9] and [10]. In the following analysis, without loss of generality, we assume that for all $k$, $N(x^k,\varepsilon_k)\le N(x^{k+1},\varepsilon_{k+1})$ and $N(x^k,\varepsilon_k)\to\infty$ as $k\to\infty$.

Theorem 3.2. Suppose that the smallest eigenvalue of $V_{N(x^k,\varepsilon_k)}(x^k)+\alpha_kI$ is greater than $c_1>0$ for all $k$. Then every accumulation point $x^*$ of the sequence $\{x^k\}$ produced by Algorithm 3.1 is an optimal solution of (1.1).

Proof. Since for all $k$,
$$\phi_{N(x^{k+1},\varepsilon_{k+1})}(x^{k+1})-\phi_{N(x^k,\varepsilon_k)}(x^k)\le\sigma t_kF_{N(x^k,\varepsilon_k)}(x^k)^Td^k+\delta_k,$$
we have
$$\phi(x^{k+1})-\phi(x^k)\le\delta_k+\varepsilon_k+\varepsilon_{k+1}$$
by the definition of $\phi_{N(x,\varepsilon)}(x)$ and the fact that $F_{N(x^k,\varepsilon_k)}(x^k)^Td^k\le0$. From $\sum_{k=1}^\infty\delta_k<+\infty$ and $\varepsilon_k<\frac12\delta_k$, we can deduce that the sequence $\{\phi(x^k)\}$ converges. Since $\{x^k\}$ has a cluster point $x^*$, $\{\phi(x^k)\}$ is bounded from below and
$$\lim_{k\to\infty}\phi(x^k)=\phi(x^*).$$
The above limit, combined with
$$\phi(x^{k+1})-\phi(x^k)\le\sigma t_kF_{N(x^k,\varepsilon_k)}(x^k)^Td^k+\delta_k+\varepsilon_k+\varepsilon_{k+1},$$
yields
$$\lim_{k\to\infty}t_kF_{N(x^k,\varepsilon_k)}(x^k)^Td^k=0. \eqno(3.6)$$
Suppose that $K$ is a subsequence of $\{0,1,2,\ldots\}$ such that $\lim_{k\in K}x^k=x^*$. Then, by
$$\|F_{N(x^k,\varepsilon_k)}(x^k)-F(x^*)\|\le\|F_{N(x^k,\varepsilon_k)}(x^k)-F_{N(x^k,\varepsilon_k)}(x^*)\|+\|F_{N(x^k,\varepsilon_k)}(x^*)-F(x^*)\|$$
and Theorem 3.1,
$$\lim_{k\in K}F_{N(x^k,\varepsilon_k)}(x^k)=F(x^*). \eqno(3.7)$$
Equation (3.7) and $\phi\in C^1$ yield a constant $L>0$ such that $\|F_{N(x^k,\varepsilon_k)}(x^k)\|\le L$ for all $k\in K$. So, for all $k\in K$,
$$\|d^k\|\le\frac{2L}{c_1}$$
by (3.7),
$$F_{N(x^k,\varepsilon_k)}(x^k)^Td^k\le-\tfrac12(d^k)^T\big(V_{N(x^k,\varepsilon_k)}(x^k)+\alpha_kI\big)d^k \eqno(3.8)$$
and the assumption of this theorem. Hence we may assume that $\lim_{k\in K}d^k=d^*$. We first prove
$$d^*=0. \eqno(3.9)$$
We consider two cases, depending on whether the liminf of the sequence of stepsizes $\{t_k : k\in K\}$ is positive or zero. Suppose $\liminf_{k\in K}t_k>0$; then (3.6) and (3.8) yield (3.9). Now consider the other case, $\liminf_{k\in K}t_k=0$. Without loss of generality, we may assume $\lim_{k\in K}t_k=0$. By the definition of $t_k$, it follows that for each $k\in K$,
$$\phi_{N(x^k+\beta^{-1}t_kd^k,\varepsilon_{k+1})}(x^k+\beta^{-1}t_kd^k)-\phi_{N(x^k,\varepsilon_k)}(x^k)>\sigma\beta^{-1}t_kF_{N(x^k,\varepsilon_k)}(x^k)^Td^k+\delta_k.$$
Hence we have
$$\phi(x^k+\beta^{-1}t_kd^k)-\phi(x^k)>\sigma\beta^{-1}t_kF_{N(x^k,\varepsilon_k)}(x^k)^Td^k+\delta_k-\varepsilon_k-\varepsilon_{k+1}>\sigma\beta^{-1}t_kF_{N(x^k,\varepsilon_k)}(x^k)^Td^k.$$
Dividing both sides of the above inequality by $\beta^{-1}t_k$ and passing to the limit $k\to+\infty$, $k\in K$, we deduce, using (3.7), that
$$F(x^*)^Td^*\ge\sigma F(x^*)^Td^*.$$
Hence (3.9) holds by (3.8). Since, for all $k$ and all $x\in X$,
$$\big(F_{N(x^k,\varepsilon_k)}(x^k)+(V_{N(x^k,\varepsilon_k)}(x^k)+\alpha_kI)d^k\big)^T(x-x^k-d^k)\ge0,$$
we obtain, for all $x\in X$,
$$F(x^*)^T(x-x^*)\ge0 \eqno(3.10)$$
by taking the limit $k\to+\infty$, $k\in K$. Hence, from (3.10) and the convexity of (1.1), we deduce that $x^*$ is an optimal solution of (1.1).

Lemma 3.2. Assume that $x^*$ is a solution of (1.1), $A$ has full row rank, $\nabla F(x^*)$ is nonsingular and the strict complementarity condition holds at $x^*$. Then $x^*$ is the unique solution of (1.1).

See Theorem 2.3 in [12] for the proof of this lemma.

Theorem 3.3. Assume that $\{x^k\}$ is generated by Algorithm 3.1. If $\{x^k\}$ is bounded and the assumptions of Lemma 3.2 hold, then $x^k$ tends to the unique solution $x^*$ of (1.1). Furthermore, if
$$\lim_{k\to\infty}\frac{\|F(x^k)-F_{N(x^k,\varepsilon_k)}(x^k)\|}{\|d^k\|}=0, \eqno(3.11)$$
then
$$\|x^{k+1}-x^*\|=o(\|x^k-x^*\|). \eqno(3.12)$$

Proof. From Theorem 3.2 and Lemma 3.2, we can deduce that $x^k$ tends to the unique solution $x^*$ of (1.1). We first prove
$$\|x^k+d^k-x^*\|=o(\|x^k-x^*\|). \eqno(3.13)$$
For any given $x^k$, since $x^k+d^k$ is feasible for (1.1) and $x^*$ is optimal, we have
$$0\le(x^k+d^k-x^*)^TF(x^*)=(x^k+d^k-x^*)^T\Big(F(x^k)-\nabla F(x^k)(x^k+d^k-x^*)+\nabla F(x^k)d^k+\big(F(x^*)-F(x^k)-\nabla F(x^k)(x^*-x^k)\big)\Big),$$
i.e.,
$$(x^k+d^k-x^*)^T\nabla F(x^k)(x^k+d^k-x^*)\le(x^k+d^k-x^*)^T\big(F(x^k)+\nabla F(x^k)d^k\big)+(x^k+d^k-x^*)^T\big(F(x^*)-F(x^k)-\nabla F(x^k)(x^*-x^k)\big). \eqno(3.14)$$
Since, for all $k$, $x^*-x^k$ is a feasible point of $(QP_k)$, $d^k$ must satisfy
$$\big(F_{N(x^k,\varepsilon_k)}(x^k)+(V_{N(x^k,\varepsilon_k)}(x^k)+\alpha_kI)d^k\big)^T\big(x^*-(x^k+d^k)\big)\ge0. \eqno(3.15)$$
From the nonsingularity of $\nabla F(x^*)$ and $x^k$ tending to $x^*$, there exist $M\in(0,+\infty)$ and $k_1$ such that for $k\ge k_1$ and $d\in\mathbb{R}^n$,
$$d^T\nabla F(x^k)d\ge Md^Td. \eqno(3.16)$$
Inequalities (3.14), (3.15) and (3.16) imply
$$M\|x^k+d^k-x^*\|\le\|F(x^k)-F_{N(x^k,\varepsilon_k)}(x^k)\|+\|(\nabla F(x^k)-V_{N(x^k,\varepsilon_k)}(x^k))d^k\|+\|F(x^*)-F(x^k)-\nabla F(x^k)(x^*-x^k)\|+\alpha_k\|d^k\|. \eqno(3.17)$$
Since $F\in C^1$, we have
$$\|F(x^k)-F(x^*)-\nabla F(x^k)(x^k-x^*)\|=o(\|x^k-x^*\|). \eqno(3.18)$$
If (3.11) holds, then from (3.17), (3.18) and $\|\nabla F(x^k)-V_{N(x^k,\varepsilon_k)}(x^k)\|\to0$, we have
$$M\|x^k+d^k-x^*\|\le o(\|d^k\|)+o(\|d^k\|)+o(\|x^k-x^*\|)\le o(\|x^k+d^k-x^*\|+\|x^k-x^*\|)+o(\|x^k-x^*\|).$$
Hence
$$\|x^k+d^k-x^*\|=o(\|x^k-x^*\|),$$
completing the proof of (3.13).

We now prove the main conclusion (3.12). Equation (3.13) implies that $\|d^k\|=O(\|x^k-x^*\|)$; hence $d^k\to0$. Therefore, if we let
$$q_k=\phi_{N(x^k+d^k,\varepsilon_{k+1})}(x^k+d^k)-\phi_{N(x^k,\varepsilon_k)}(x^k)-\sigma F_{N(x^k,\varepsilon_k)}(x^k)^Td^k-\delta_k,$$
then
$$q_k\le\phi(x^k+d^k)-\phi(x^k)+\varepsilon_k+\varepsilon_{k+1}-\sigma F_{N(x^k,\varepsilon_k)}(x^k)^Td^k-\delta_k\le\tfrac12\big(F(x^k)-F_{N(x^k,\varepsilon_k)}(x^k)\big)^Td^k+\big(\tfrac12-\sigma\big)F_{N(x^k,\varepsilon_k)}(x^k)^Td^k+o(\|d^k\|^2)\le-\tfrac12(1-2\sigma)(d^k)^T\big(V_{N(x^k,\varepsilon_k)}(x^k)+\alpha_kI\big)d^k+o(\|d^k\|^2).$$
The nonsingularity of $\nabla F(x^*)$ and Theorem 3.1 imply that there exists $k_2\ge k_1$ such that $q_k<0$ for $k\ge k_2$, i.e., we can take $t_k=1$. Since $x^{k+1}=x^k+d^k$ for $k\ge k_2$, (3.12) follows from (3.13).

Definition 3.1. Let $h(N(x^k,\varepsilon_k))$ be an error bound such that $h(N(x^k,\varepsilon_k))\to0$ as $k\to\infty$. We say that $\{x^k\}$ is almost superlinearly convergent to $x^*$ with (in the random case, probabilistic) error bound $O(h(N(x^k,\varepsilon_k)))$ if
$$\|x^{k+1}-x^*\|=o(\|x^k-x^*\|)+O(h(N(x^k,\varepsilon_k))),\ (wp1),$$
where wp1 refers to Monte Carlo integration.

From (3.17), (3.18) and $\|\nabla F(x^k)-V_{N(x^k,\varepsilon_k)}(x^k)\|\to0$ (wp1 in the random case), we have that for $k\ge k_1$,
$$\|x^k+d^k-x^*\|=o(\|x^k-x^*\|)+O(\|F(x^k)-F_{N(x^k,\varepsilon_k)}(x^k)\|),$$
again wp1 in the random case. Let $h(N(x^k,\varepsilon_k))=\|F(x^k)-F_{N(x^k,\varepsilon_k)}(x^k)\|$. Then, for the original inexact Newton method (take $t_k=1$ for all $k$ in Algorithm 3.1), if the initial point $x^1$ is sufficiently close to the unique solution $x^*$, then $\{x^k\}$ is almost superlinearly convergent to $x^*$ with the (probabilistic) error bound $h(N(x^k,\varepsilon_k))$. The error bound $O(h(N(x^k,\varepsilon_k)))$ depends on the method for generating $\{\omega^i\}$. For instance, Monte Carlo methods provide an error bound $h(N(x^k,\varepsilon_k))=O(N(x^k,\varepsilon_k)^{-1/2})$ with probability one. Some quasi-Monte Carlo methods, such as lattice methods, provide an error bound $h(N(x^k,\varepsilon_k))=O(N(x^k,\varepsilon_k)^{-2})$; see [13], [11], [22].

Remark 3.2. We can compare this method with other approaches, such as stochastic quasigradient methods (see, e.g., [5]) and variations on stochastic approximation (see, e.g., [15]). These methods use a fixed sample for the relevant gradient or Hessian approximation and then achieve convergence through the use of decreasing step sizes and, in [15], averaging. The optimal convergence result for $t$ iterations [15] is that
$$\sqrt{t}\,(x^t-x^*)\ \xrightarrow{\ D\ }\ N(0,V),$$
where the convergence is in distribution, $N$ is a multivariate normal distribution, and $V$ depends on the objective Hessian at $x^*$ and the random error in the gradient estimates. By choosing an integration rule with $h(N_k)=O(N_k^{-2})$ and letting $N_k\sim O(k)$, our inexact Newton method would make $t=O(k^2)$ random objective evaluations up to iteration $k$ and achieve an asymptotic error $O(\frac1t)$. In particular, we can also better estimate the error in our approach, since it depends on the integration at the current point and not on the unknown point $x^*$. The following example shows that condition (3.11) is necessary.
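The operation count in the remark can be checked directly: with $N_k=k$ samples at iteration $k$, the total number of second-stage evaluations up to iteration $k$ is $t=\sum_{j\le k}j=k(k+1)/2=O(k^2)$, while a rule with $h(N_k)=O(N_k^{-2})$ leaves an error of roughly $k^{-2}\approx O(1/t)$. A quick arithmetic sketch:

```python
def evaluations_and_error(k):
    """Total evaluations t = sum_{j=1}^k N_j with N_j = j, and the
    integration error bound h(N_k) = N_k^{-2} at iteration k."""
    t = k * (k + 1) // 2
    err = 1.0 / (k * k)
    return t, err

t, err = evaluations_and_error(1000)
print(t, err, err * t)  # err * t is about 1/2, i.e. err = O(1/t)
```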

Example. Let $B\in\mathbb{R}^{n\times n}$ be a symmetric positive definite matrix, $\phi(x)=\frac12 x^TBx$, and $X=\mathbb{R}^n$. We choose $\phi_{N_k}(x)=\phi(x)$, $F_{N_k}(x^k)=cBx^k$ and $V_{N_k}(x)\equiv B$, where $c$ is a scalar in $(0,1)$. Then $d^k=-cx^k$. It is clear that (3.11) fails; in fact,
$$\lim_{k\to\infty}\frac{\|F_{N_k}(x^k)-F(x^k)\|}{\|d^k\|}=\lim_{k\to\infty}\frac{(1-c)\|Bx^k\|}{\|cx^k\|}\ge\frac{1-c}{c}\,\lambda_{\min}(B)>0,$$
where $\lambda_{\min}(B)$ is the smallest eigenvalue of the matrix $B$. Since $x^*=0$ is the unique solution of this problem, we deduce
$$\frac{\|x^{k+1}-x^*\|}{\|x^k-x^*\|}=\frac{\|x^{k+1}\|}{\|x^k\|}=1-t_kc.$$
Since $t_k\in(0,1]$, we have
$$\lim_{k\to\infty}\frac{\|x^{k+1}-x^*\|}{\|x^k-x^*\|}\ge(1-c)>0.$$
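The failure in this example can be reproduced numerically: with the biased gradient estimate $cBx^k$, the step $d^k=-cx^k$ makes the contraction ratio stick at $1-c$ even with full steps $t_k=1$, so convergence is only linear. A sketch ($B=I$ and $n=1$ are simplifying assumptions):

```python
def biased_newton_ratios(c, x0=1.0, iters=5):
    """Iterate x <- x + d with d = -c*x (the biased Newton step of the
    example, with B = I and t_k = 1) and record ||x^{k+1}|| / ||x^k||."""
    x, ratios = x0, []
    for _ in range(iters):
        x_next = x + (-c * x)
        ratios.append(abs(x_next) / abs(x))
        x = x_next
    return ratios

print(biased_newton_ratios(0.5))  # every ratio equals 1 - c = 0.5
```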

The main tradeoff is that we always increase the number of observations at each iteration, while this number is fixed in the stochastic approximation methods. The advantage of our approach, however, is that we can take larger step sizes and use alternative integration criteria to regain an overall advantage in function evaluations. This conclusion has potential benefit because the approximations $\phi_{N(x^k,\varepsilon_k)}$ and $F_{N(x^k,\varepsilon_k)}$ are very close to $\phi$ and $F$ at $x^k$ and take their exact values in the limit. The next sections show that we also do not need to resolve completely all of the samples at each iteration. We do, however, lose the ability to rely on deterministic integration rules.

4. Stochastic decomposition algorithm

Our approach follows [7], [8] and [1], where stochastic decomposition algorithms were applied to two-stage stochastic linear programs with recourse. We note that these methods use an argument that depends on randomization; hence our deterministic procedures no longer apply. We next describe the stochastic decomposition method for solving two-stage stochastic quadratic programs with recourse (1.1).

Algorithm 4.1 (Stochastic Decomposition Algorithm)

1. Set $k=1$, $D_0=\emptyset$, $x^1\in X$.

2. Randomly generate an observation $\omega^k$ over $I^m$. (Note that $\{\omega^i\}_{i=1}^k$ are generated independently.) Let
$$z^{k,k}=\arg\max\{-\tfrac12 z^THz+z^T(p(\omega^k)-Tx^k)\mid z\in Z\},$$
$$D_k=D_{k-1}\cup\{z^{k,k}\}$$
and
$$z^{k,i}\in\arg\max\{-\tfrac12 z^THz+z^T(p(\omega^i)-Tx^k)\mid z\in D_k\},\quad i=1,2,\ldots,k-1.$$

3. Construct $\hat\phi_k(x)$. Let
$$\alpha_k^k+(\beta_k^k)^Tx=\frac1k\sum_{i=1}^k\Big(-\tfrac12(z^{k,i})^THz^{k,i}+(z^{k,i})^T(p(\omega^i)-Tx)\Big)\bar\rho(\omega^i),$$
$$\alpha_i^k=\frac{k-1}{k}\,\alpha_i^{k-1},\qquad\beta_i^k=\frac{k-1}{k}\,\beta_i^{k-1},\qquad i=1,\ldots,k-1,$$
and
$$\hat\phi_k(x)=\max\{\tfrac12 x^TPx+\alpha_i^k+(\beta_i^k+c)^Tx\mid i=1,2,\ldots,k\}.$$

4. Solve the $k$th problem
$$x^{k+1}\in\arg\min_{x\in X}\hat\phi_k(x).$$
Return to Step 2 with $k$ replaced by $k+1$.
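The mechanism of reusing previously computed second-stage solutions $z\in D_k$ can be illustrated in one dimension: the approximation $g_k(u)=\max_{z\in D_k}(-\frac12 z^2+zu)$ grows monotonically as solutions are added and never exceeds the exact value $\psi(u)=\max_{z\in Z}(-\frac12 z^2+zu)$. A sketch with $H=1$ and $Z=[0,1]$ (toy assumptions; the cut points below are illustrative, not generated by the algorithm):

```python
def psi(u):
    """Exact second-stage value over Z = [0,1], H = 1."""
    z = min(max(u, 0.0), 1.0)
    return -0.5 * z * z + z * u

def g(u, cuts):
    """Lower approximation from a finite set of stored solutions D_k."""
    return max(-0.5 * z * z + z * u for z in cuts)

cuts = [0.0]                      # initial set; z = 0 in Z gives value 0
u = 0.7
vals = []
for z_new in [0.2, 0.5, 0.7]:     # solutions accumulated over iterations
    cuts.append(z_new)
    vals.append(g(u, cuts))

print(vals, psi(u))  # values increase monotonically up to psi(0.7)
```

Once the exact maximizer for $u$ enters the set, $g_k(u)$ attains $\psi(u)$; this is the finite analogue of the uniform convergence $g_k\to g$ used in the analysis below.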

This algorithm is an extension of the "basic method" presented in Section 2 of [7]. At Step 2 we solve a strongly convex quadratic program for $z^{k,k}$, which is unique. It is easy to find $z^{k,i}$ since $D_k$ is finite, though the solutions may not be unique. At Step 4 we minimize a function which is the maximum of a finite number of quadratic functions; a special subroutine is needed to solve this subproblem, and the solutions may not be unique.

We now check Algorithm 4.1's convergence behaviour following the arguments of [7], and make the following assumptions similar to those made in [7]:

A1. $X$, $Z$ and $\Omega$ are nonempty and compact.

A2. For all $x\in X$, $\psi(p(\omega)-Tx)\ge0$ with probability one.

The arguments are essentially similar to those of [7], but some special care is needed where distinctions arise between this algorithm and the algorithm of Higle and Sen. For example, the limit set of $\{D_k\}$ is not finite here; special attention is needed for this. We now outline our convergence analysis of this algorithm. Since $z^{i,j}\in D_i\subseteq Z$ for all positive integers $i$ and $j=1,\ldots,i$, we have
$$-\tfrac12(z^{i,j})^THz^{i,j}+(z^{i,j})^T(p(\omega^j)-Tx)\le\psi(p(\omega^j)-Tx).$$
Hence, for all positive integers $k$ and $i=1,\ldots,k$, we have
$$\tfrac12 x^TPx+\alpha_i^k+(\beta_i^k+c)^Tx=\tfrac12 x^TPx+c^Tx+\frac1k\sum_{j=1}^i\Big(-\tfrac12(z^{i,j})^THz^{i,j}+(z^{i,j})^T(p(\omega^j)-Tx)\Big)\bar\rho(\omega^j)\le\tfrac12 x^TPx+c^Tx+\frac1k\sum_{j=1}^i\psi(p(\omega^j)-Tx)\,\bar\rho(\omega^j)\le\tfrac12 x^TPx+c^Tx+\frac1k\sum_{j=1}^k\psi(p(\omega^j)-Tx)\,\bar\rho(\omega^j),\ (wp1),$$
where the last inequality is due to Assumption A2. Therefore, $\hat\phi_k(x)$ provides a statistically valid lower bound estimate of the objective function $\phi(x)$. Define
$$g_k(p(\omega)-Tx)=\max_{z\in D_k}\ -\tfrac12 z^THz+z^T(p(\omega)-Tx).$$

Let $D$ be the limit set of $\{D_k\}$, i.e., the set of all limiting points of $\{D_k\}$. Then $D$ is a compact set by Assumption A1. As in the proof of Lemma 1 of [7], we may show that the sequence of continuous functions $\{g_k\}_{k=1}^\infty$ increases monotonically and converges uniformly to a continuous function $g$ on $X\times\Omega$. As noted, the set $D$ may not be finite, in contrast to [7]. By compactness of $D$, the following derivation is, however, still valid:
$$g(p(\omega)-Tx)=\lim_{k\to\infty}g_k(p(\omega)-Tx)=\lim_{k\to\infty}\max\{-\tfrac12 z^THz+z^T(p(\omega)-Tx)\mid z\in D_k\}=\max\{-\tfrac12 z^THz+z^T(p(\omega)-Tx)\mid z\in D\}.$$
This guarantees that the arguments of Lemma 1 of [7] still hold here. Following exactly the proof of Theorem 2 of [7], we may show that, with probability one, $g(p(\omega^i)-T\hat x)=\psi(p(\omega^i)-T\hat x)$ for any limiting point $\hat x$ of $\{x^k\}$ and any positive integer $i$, except for a finite number of observations. Following further the proofs of Theorems 3 and 4 of [7], we may show that there exists an infinite subsequence $\{x^k : k\in K\}$ of $\{x^k\}$ such that every limiting point of $\{x^k : k\in K\}$ is an optimal solution of (1.1), with probability one. Algorithm 4.1 may be improved using the concept of an "incumbent solution" in Section 3 of [7]. Since all parts of these arguments have no substantial differences from the corresponding parts of [7], we do not go into details.

5. Stochastic Newton method

The stochastic decomposition method iteratively generates a sequence of piecewise linear approximations of the objective function. The method converges globally, but slowly in terms of convergence rate, and it requires solving a minimax subproblem at every iteration. We expect a faster algorithm, with a superlinear convergence rate holding with probability one, whose subproblem is a quadratic program. This leads us to consider a stochastic Newton method. Motivated by the stochastic decomposition method and the inexact Newton method, we present a stochastic Newton method. At the $k$th step of our algorithm we use the Newton method to find an approximate solution $x^{k+1}$ of the problem
$$\min_{x\in\mathbb{R}^n}\ \phi_{N_k}(x)\quad\mbox{subject to}\quad Ax\le b. \eqno(5.1)$$
Without confusion, we use $N_k$ to denote $N(x^k,\varepsilon_k)$ in the following discussion.

Algorithm 5.1 (Stochastic Newton method)

Step 0: Let $k=1$, $D_0=\emptyset$, $x^1\in X$, $N_0=0$ and $N_1=N(x^1,\varepsilon_1)$.

Step 1: Randomly generate observations $\{\omega^j,\ j=N_{k-1}+1,\ldots,N_k\}$ on $I^m$. Let
$$z^{k,j}=\arg\max\{-\tfrac12 z^THz+z^T(p(\omega^j)-Tx^k)\mid z\in Z\},\quad j=N_{k-1}+1,\ldots,N_k,$$
$$D_k=D_{k-1}\cup\bigcup_{j=N_{k-1}+1}^{N_k}\{z^{k,j}\}$$
and
$$z^{k,i}\in\arg\max\{-\tfrac12 z^THz+z^T(p(\omega^i)-Tx^k)\mid z\in D_k\},\quad i=1,2,\ldots,N_{k-1}.$$
Construct
$$F_{N_k}(x^k)=Px^k+c-\frac1{N_k}T^T\sum_{i=1}^{N_k}z^{k,i}\,\bar\rho(\omega^i)$$
and
$$V_{N_k}(x^k)=P-\frac1{N_k}T^T\sum_{i=1}^{N_k}z^{k,i}\,\hat\rho(\omega^i)^T\,T.$$
Let $d^k$ be the solution of the quadratic program
$$\min_d\ F_{N_k}(x^k)^Td+\tfrac12 d^T\big(V_{N_k}(x^k)+\alpha_kI\big)d\quad\mbox{subject to}\quad A(x^k+d)\le b. \eqno(5.2)$$

Choose $\varepsilon_{k+1}\in(0,\tfrac12\delta_{k+1})$ such that $\varepsilon_{k+1}<\delta_k-\varepsilon_k$.

Step 2: Same as Step 2 and Step 3 of Algorithm 3.1.

We assume (A1) and (A2) in the following discussion.

Lemma 5.1. Let $\{x^{k_j}\}_{j=0}^\infty$ be an infinite subsequence of $\{x^k\}_{k=0}^\infty$. If $x^{k_j}\to\bar x$, then, with probability one,
$$F_{N_{k_j}}(x^{k_j})\to F(\bar x). \eqno(5.3)$$

Proof. Let $\tilde F_{N_k}$ denote the estimate of Section 3, computed with the exact second-stage solutions $z(p(\omega^i)-Tx^k)$. By Theorem 3.1 we have, with probability one,
$$\lim_{k\to\infty}\|\tilde F_{N_k}(x^k)-F(\bar x)\|\le\lim_{k\to\infty}\|\tilde F_{N_k}(x^k)-\tilde F_{N_k}(\bar x)\|+\lim_{k\to\infty}\|\tilde F_{N_k}(\bar x)-F(\bar x)\|=0.$$
By the definitions of $F_{N_k}$ and $\tilde F_{N_k}$, we have
$$F_{N_k}(x^k)-\tilde F_{N_k}(x^k)=\frac1{N_k}T^T\sum_{i=1}^{N_k}\big(z(p(\omega^i)-Tx^k)-z^{k,i}\big)\bar\rho(\omega^i).$$
Thus it is sufficient to show that, with probability one,
$$\lim_{j\to\infty}z^{k_j,i}=z(p(\omega^i)-T\bar x), \eqno(5.4)$$
using $\int_{I^m}\bar\rho(\omega)\,d\omega=1$. Let $\omega^i$ be given and suppose that for every $\epsilon>0$,
$$P(\|\omega-\omega^i\|<\epsilon)>0. \eqno(5.5)$$
Then for every $\epsilon>0$, $\|p(\omega^{k_j})-p(\omega^i)\|<\epsilon$ infinitely often, (wp1). Since $z(p(\omega^{k_j})-Tx^{k_j})=z^{k_j,k_j}\in D_{k_j}$, we have
$$-\tfrac12(z^{k_j,k_j})^THz^{k_j,k_j}+(z^{k_j,k_j})^T(p(\omega^i)-Tx^{k_j})\le-\tfrac12(z^{k_j,i})^THz^{k_j,i}+(z^{k_j,i})^T(p(\omega^i)-Tx^{k_j})\le-\tfrac12\big(z(p(\omega^i)-Tx^{k_j})\big)^THz(p(\omega^i)-Tx^{k_j})+\big(z(p(\omega^i)-Tx^{k_j})\big)^T(p(\omega^i)-Tx^{k_j}).$$
Because $Z$ is bounded, there is a large positive number $M$ such that
$$|(z^{k_j,k_j})^T(p(\omega^i)-p(\omega^{k_j}))|\le M\|p(\omega^i)-p(\omega^{k_j})\|.$$
Let $E_{i,k_j}(z)=-\tfrac12 z^THz+z^T(p(\omega^i)-Tx^{k_j})$. Since projection is nonexpansive, we have
$$\|z(p(\omega^i)-Tx^{k_j})-z(p(\omega^{k_j})-Tx^{k_j})\|\le\|H^{-\frac12}\|\,\|p(\omega^i)-p(\omega^{k_j})\|,$$
so there is a constant $c_2>0$ (depending on $M$ and $\|H^{-\frac12}\|$) such that
$$0\le E_{i,k_j}(z^{k_j,i})-E_{i,k_j}(z^{k_j,k_j})\le c_2\|p(\omega^i)-p(\omega^{k_j})\|.$$
Since $H$ is positive definite, $-E_{i,k_j}$ is a strongly convex function, i.e., there is a positive constant $\gamma>0$ such that
$$-tE_{i,k_j}(z)-(1-t)E_{i,k_j}(\hat z)+E_{i,k_j}(tz+(1-t)\hat z)\ge\gamma t(1-t)\|z-\hat z\|^2,\quad t\in(0,1),\ z,\hat z\in\mathbb{R}^m.$$
Hence, letting $t=\tfrac12$, $z=z^{k_j,i}$ and $\hat z=z^{k_j,k_j}$, we have
$$\tfrac{\gamma}{4}\|z^{k_j,i}-z^{k_j,k_j}\|^2\le c_2\|p(\omega^i)-p(\omega^{k_j})\|.$$
Therefore, by the uniform continuity of $z(p(\omega)-Tx)=H^{-\frac12}\Pi_S\big(H^{-\frac12}(p(\omega)-Tx)\big)$ and the convergence of $x^{k_j}$ to $\bar x$, for every $\epsilon>0$ there exist $\delta>0$ and $N>1$ such that if $j>N$ and $\|\omega^{k_j}-\omega^i\|<\delta$, then
$$\|z(p(\omega^i)-T\bar x)-z(p(\omega^{k_j})-T\bar x)\|\le\frac\epsilon3,\qquad\|z(p(\omega^{k_j})-Tx^{k_j})-z(p(\omega^{k_j})-T\bar x)\|\le\frac\epsilon3,\qquad\|z^{k_j,k_j}-z^{k_j,i}\|\le\frac\epsilon3,\ (wp1).$$
Hence
$$\|z(p(\omega^i)-T\bar x)-z^{k_j,i}\|\le\epsilon,\ (wp1).$$
By the uniqueness of $z(p(\omega^i)-T\bar x)$, $z^{k_j,i}\to z(p(\omega^i)-T\bar x)$, (wp1), as $x^{k_j}\to\bar x$.

Finally, since $I^m$ is a compact set, (5.5) holds in $I^m$ with probability one. Hence we have (5.3).

Since $X$ is compact, $\{x^k\}$ has at least one accumulation point. From Lemma 5.1, using an argument similar to the proof of Theorem 3.2, we have the following.

Theorem 5.1. Suppose that the smallest eigenvalue of $V_{N_k}(x^k)+\alpha_kI$ is greater than $c_1>0$ for all $k$. Then, with probability one, every accumulation point $x^*$ of the sequence $\{x^k\}$ produced by Algorithm 5.1 is an optimal solution of (1.1).

Theorem 5.2. Suppose that (1.1) has an optimal solution $x^*$. If $A$ has full row rank and $\nabla F$ is nonsingular at $x^*$, then $x^*$ is unique and, with probability one, the sequence $\{x^k\}$ produced by Algorithm 5.1 converges to the solution $x^*$. Furthermore, if
$$\lim_{k\to\infty}\frac{\|F_{N_k}(x^k)-F(x^k)\|}{\|d^k\|}=0,\ (wp1), \eqno(5.6)$$
then, with probability one,
$$\|x^{k+1}-x^*\|=o(\|x^k-x^*\|). \eqno(5.7)$$

Proof. By Theorem 5.1, Lemma 3.2 and Theorem 3.3, we can deduce that, with probability one, $x^k$ tends to the unique solution $x^*$. Similarly to the proof of Lemma 5.1, we have
$$\lim_{k\to\infty}z^{k,i}=z(p(\omega^i)-Tx^*),\ (wp1).$$
This implies that
$$\lim_{k\to\infty}V_{N_k}(x^k)=\lim_{k\to\infty}\tilde V_{N_k}(x^k),\ (wp1),$$
where $\tilde V_{N_k}$ denotes the estimate of Section 3 computed with the exact second-stage solutions. This and Theorem 3.1 yield
$$\lim_{k\to\infty}V_{N_k}(x^k)=\nabla F(x^*),\ (wp1). \eqno(5.8)$$
From (5.6) and (5.8), and arguing as in the proof of Theorem 3.3, we obtain (5.7).

References

[1] K. T. Au, J. L. Higle and S. Sen, Inexact subgradient methods with applications in stochastic programming, Math. Programming 63 (1994), 65-82.
[2] J. R. Birge and L. Qi, Subdifferential convergence in stochastic programs, SIAM J. Optimiz. 5 (1995), 436-453.
[3] X. Chen, L. Qi and R. S. Womersley, Newton's method for quadratic stochastic programs with recourse, to appear in: J. Comp. Applied Math. 60 (1995).
[4] P. J. Davis and P. Rabinowitz, Methods of Numerical Integration, Academic Press, Inc., 1984.
[5] Yu. M. Ermoliev, Stochastic Programming Methods, Nauka, Moscow, 1976.
[6] M. Fukushima and L. Qi, A global and superlinearly convergent algorithm for nonsmooth convex minimization, to appear in: SIAM Journal on Optimization.
[7] J. L. Higle and S. Sen, Stochastic decomposition: An algorithm for two-stage linear programs with recourse, Math. Oper. Res. 16 (1991), 650-669.
[8] J. L. Higle and S. Sen, Statistical verification of optimality conditions for stochastic programs with recourse, Annals of Oper. Res. 30 (1991), 215-240.
[9] J. L. Higle and S. Sen, On the convergence of algorithms with implications for stochastic and nondifferentiable optimization, Math. Oper. Res. 17 (1992), 112-131.
[10] J. L. Higle and S. Sen, Epigraphical nesting: a unifying theory for the convergence of algorithms, to appear in: J. Optimization Theory and Applications.
[11] S. Joe and I. H. Sloan, Imbedded lattice rules for multidimensional integration, SIAM J. Numerical Analysis 29 (1992), 1119-1135.
[12] M. Minoux, Mathematical Programming: Theory and Algorithms, translated by S. Vajda, John Wiley & Sons Ltd, 1986.
[13] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1992.
[14] J. S. Pang, Newton's method for B-differentiable equations, Math. Oper. Res. 15 (1990), 311-341.
[15] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control and Optimization 30 (1992), 838-855.
[16] L. Qi, Convergence analysis of some algorithms for solving nonsmooth equations, Math. Oper. Res. 18 (1993), 227-244.
[17] L. Qi, Superlinearly convergent approximate Newton methods for LC^1 optimization problems, Math. Programming 64 (1994), 277-294.
[18] L. Qi and R. S. Womersley, An SQP algorithm for extended linear-quadratic problems in stochastic programming, to appear in: Ann. Oper. Res. (1995).
[19] R. T. Rockafellar and R. J.-B. Wets, A Lagrangian finite-generation technique for solving linear-quadratic problems in stochastic programming, Math. Prog. Study 28 (1986), 63-93.
[20] R. T. Rockafellar and R. J.-B. Wets, Linear-quadratic problems with stochastic penalties: the finite generation algorithm, in: V. I. Arkin, A. Shiraev and R. J.-B. Wets, eds., Stochastic Optimization (Lecture Notes in Control and Information Sciences 81, Springer-Verlag, Berlin, 1987), pp. 545-560.
[21] H. S. Shapiro, Smoothing and Approximation of Functions, Van Nostrand Reinhold Company, New York, 1969.
[22] I. H. Sloan and S. Joe, Lattice Methods for Multiple Integration, to be published by Oxford University Press.
[23] J. Spanier and E. H. Maize, Quasi-random methods for estimating integrals using relatively small samples, SIAM Review 36 (1994), 18-44.
[24] C. Zhu and R. T. Rockafellar, Primal-dual projected gradient algorithms for extended linear-quadratic programming, SIAM J. Optimiz. 3 (1993), 751-783.
