GENERAL DYNAMIC PROGRAMMING ALGORITHMS APPLIED TO POLLING SYSTEMS

Eungab Kim, Mark P. Van Oyen, and Maria Rieders

To Appear in Communications in Statistics: Stochastic Models Technical Report No. TR96-06 Department of Industrial Engineering and Management Sciences Northwestern University Evanston, IL 60208-3119 http://www.iems.nwu.edu/vanoyen/

June 26, 1996 Revised September 30, 1997 Revised August 12, 1998

GENERAL DYNAMIC PROGRAMMING ALGORITHMS APPLIED TO POLLING SYSTEMS

Eungab Kim
Manufacturing Development, Samsung SDS, World Tower, 7-25 Shinchun-Dong, Songpa-Gu, Seoul, Korea, 138-731

Mark P. Van Oyen and Maria Rieders1

Department of Industrial Engineering and Management Sciences Northwestern University, Evanston, IL 60208-3119 [email protected], [email protected], [email protected] http://www.iems.nwu.edu/vanoyen/

ABSTRACT

We formulate the problem of scheduling a single server in a multi-class queueing system as a Markov decision process under the discounted cost and the average cost criteria. We develop a new implementation of the modified policy iteration (MPI) dynamic programming algorithm to efficiently solve problems with large state spaces and small action spaces. This implementation has an enhanced policy evaluation (PE) step and an adaptive termination test. To numerically evaluate various solution approaches, we implemented value iteration and forms of modified policy iteration, and we further developed and implemented aggregation-disaggregation based (replacement process decomposition and group-scaling) algorithms appropriate to controlled queueing system models. Tests provide evidence that MPI outperforms the other algorithms for both the discounted cost and the average cost optimal polling problems. In light of the complexity of implementation for the aggregation-disaggregation based algorithms and their performance, we cannot recommend them for problems with a sparse transition probability matrix structure. Our new implementation of MPI is effective in discounted problems and better than previous implementations under the average cost per stage criterion.

Keywords: Markov decision process algorithms, dynamic programming, value iteration, modified policy iteration, polling systems, control of queues

1. The majority of Maria Rieders' work was performed while with Northwestern University.

Her present address is 2506 Rushland Road, Jamison, PA 18929.


1. INTRODUCTION

In this paper, we focus on the development and comparison of general algorithms for solving stochastic dynamic programming (DP) problems that fall within the class of Markov decision process (MDP) formulations. We pursue our investigations within the context of scheduling a single server in a multi-class queue with set-ups. The numerical solution of MDP's is an important class of problems. In addition to the application of numerical methods to particular problems of interest, numerical analyses provide insight into the structure of an optimal policy. This becomes difficult as the problem size and dimensionality grow (as with multi-dimensional queueing problems) and provides an incentive to pursue methods such as we develop here to cope with large state spaces. Another motive is the use of numerical methods to benchmark the performance of heuristic control policies or system design candidates. The algorithms developed here have been used in these ways in our other work such as Kim and Van Oyen [6], [7], and [8].

The principal DP algorithms for solving finite state, infinite horizon MDP's are value iteration (VI, also known as successive approximation) and policy iteration (PI), which are treated in standard texts such as Bertsekas [1] and Puterman [14]. In addition to the standard formulations for these methods, variations on these algorithms have been suggested for computational effectiveness. Thomas et al. [24] provides a computational comparison of various VI schemes including Gauss-Seidel and successive over-relaxation using randomly generated problems. Schweitzer [21] suggests a discounted VI algorithm that is intermittently supplemented by a group-scaling step, where states are aggregated into groups and the value functions of all states in the same group are scaled by a common scale factor that is the solution of the aggregated system. Schweitzer et al. [20] gives a replacement process decomposition algorithm in which the value function is refined by the solution of an aggregated system that is constructed by partitioning states into groups. Ohno and Ichiki [13], Puterman [14], and Puterman and Shin [16] present modified policy iteration (MPI) algorithms for the discounted MDP, where the solution of the policy evaluation equation is approximated by either a fixed or an adaptive number of successive substitution iterations. In this paper, our emphasis is on the development and evaluation of a new type of adaptive MPI, which applies a policy evaluation step at stages for which VI does not produce a sufficiently large change in the policy estimate. We call it the value iteration with policy evaluation (VIPE) implementation of MPI.

We also place emphasis on developing efficient versions and implementations of two aggregation-disaggregation based algorithms suggested in Schweitzer et al. [20] (replacement process decomposition) and Schweitzer [21] (group scaling) for discounted problems only. In the group-scaling algorithm, we suggest branching conditions dictating when the algorithm branches from a VI step to a group-scaling step. For both the replacement process decomposition and group-scaling algorithms, we implement the dynamic decomposition of the state space scheme proposed in Bertsekas and Castañon [2]. We also implemented VI, MPI, and PI as efficiently as possible (as general purpose algorithms) to offer an unbiased evaluation of VIPE and the aggregation based methods.

As a context and motivation for this work, we consider the optimal scheduling of a polling system, an interesting and fundamental problem in the control of queues. A single server attends a multi-class M/M/1 queue with heterogeneous holding costs and service rates. Holding costs are incurred at rate c_n for each unit of time that jobs of type n wait in the system, and an exponentially distributed switching (set-up) time is required for the server to switch from serving one type of job to another. Job service times for type n are exponential with rate μ_n. We study computational issues for dynamic scheduling policies under the criteria of both expected discounted cost and average cost. Our model represents an MDP formulation (with the addition of holding costs) of a classical polling system, which is representative of many queueing control problems, because it exhibits a large state space that grows exponentially in the number of job classes and thereby possesses a sparse controlled transition probability matrix (t.p.m.). Large state spaces are found in many applications and pose a major difficulty for numerical solution approaches.

Polling systems are resource allocation models that have been applied to a wide variety of settings in which a common resource or medium is shared by many users or classes of jobs (see Levy and Sidi [10] and Takagi [23]). In these models, a common resource (server) sequentially provides service (polling) to a number of distinct classes of queued jobs, according to some policy (i.e., control law). The literature typically provides the performance analysis of an ad hoc policy. Instead, we use an MDP formulation to include the scheduling decision faced by the server (network, node, or workstation) serving distinct classes of competing demands and to simultaneously determine an optimal policy as well as its performance. The focus of this work is to suggest and compare alternative algorithms for numerical computation, whereas other work focuses on the investigation of the structure of optimal policies and on the development of heuristic control policies that are easily implemented, yet effective (see for example Duenyas and Van Oyen [3], Kim and Van Oyen [8], and Markowitz and Wein [12]).

Liu et al. [11] characterizes optimal policies for homogeneous systems under a variety of information patterns. Strong characterizations are lacking, however, for systems that are heterogeneous with respect to service, set-up, and interarrival distributions.

The rest of this paper is organized as follows. In Section 2, we present the implemented algorithms in a general setting. Having presented the algorithms to be tested, we focus on a discrete-time (uniformized) polling system optimization problem in Section 3. Section 4 discusses implementation issues and provides numerical results. Finally, in Section 5 we state our conclusions for effective algorithms for controlled queueing systems such as this.

2. ALGORITHMS

We begin by presenting algorithms for solving discrete-time MDP's in a general context. We assume a controlled Markov chain with a finite scalar state space S, an admissible action set U(i) for each state i ∈ S, and a one-stage cost function g(i, a) ∈ ℝ+ for state i and action a. Let the discount factor be β ∈ (0, 1) and let X_k denote the scalar state random variable under policy μ at time k. Denote by μ(i) the action chosen in state i ∈ S under a stationary policy μ. The expected β-discounted cost over an infinite time horizon and the average cost per stage, respectively, under a stationary policy μ conditioned on initial state i_0 are given by

J_μ(i_0) = lim_{M→∞} Σ_{k=0}^{M} Σ_{i∈S} β^k g(i, μ(i)) P(X_k = i | X_0 = i_0),    (1)

J̄_μ(i_0) = limsup_{M→∞} (1/M) Σ_{k=0}^{M−1} Σ_{i∈S} g(i, μ(i)) P(X_k = i | X_0 = i_0).    (2)

A policy μ* is optimal for the discounted MDP (average cost MDP) if J_{μ*}(i_0) = inf_μ J_μ(i_0) (J̄_{μ*}(i_0) = inf_μ J̄_μ(i_0)). For the discounted MDP, we define a VI step and a successive substitution (SS) step, respectively, using the operators T and T_μ:

TJ(i) = min_{a∈U(i)} { g(i, a) + β Σ_{j∈S} P^a_{ij} J(j) },    (3)

T_μ J(i) = g(i, μ(i)) + β Σ_{j∈S} P^{μ(i)}_{ij} J(j),    (4)

where P^a_{ij} (P^{μ(i)}_{ij}) is the transition probability from state i to j when action a (μ(i)) is selected in state i.

Note that (3) updates both the value function and the policy, whereas (4) updates only the value function. In performing (3), we obtain an updated policy μ' such that μ'(i) is the value of a achieving the minimum. The VI and SS steps for the average cost MDP can be defined in a similar way, as we indicate in the next section. We first present a new DP algorithm that solves our scheduling problem efficiently and then introduce the replacement process decomposition and group-scaling algorithms.
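These two operators are the computational kernel of every algorithm considered below. The following is a minimal sketch of how (3) and (4) can be evaluated over the whole state space at once; the Python/NumPy formulation, the assumption of a common action set for all states, and all variable names are illustrative choices of this presentation rather than the paper's Matlab implementation. Storing one (possibly sparse) transition matrix and one cost vector per action is what keeps an SS sweep cheap relative to an exact policy evaluation.

```python
import numpy as np

def vi_step(J, P, g, beta):
    """One VI step, i.e., the operator T in (3).

    P[a] is the |S| x |S| transition matrix under action a (dense ndarray or
    scipy.sparse matrix), g[a] is the |S|-vector of one-stage costs under action a,
    and beta is the discount factor (use beta = 1 for the average cost operator).
    Returns TJ and a greedy policy mu with T_mu J = TJ.
    """
    Q = np.column_stack([g[a] + beta * P[a].dot(J) for a in range(len(P))])
    mu = Q.argmin(axis=1)               # minimizing action in each state
    return Q.min(axis=1), mu

def ss_step(J, P, g, beta, mu):
    """One successive-substitution step, i.e., the operator T_mu in (4), for a fixed policy mu."""
    TJ = np.empty_like(J)
    for a in range(len(P)):
        idx = np.where(mu == a)[0]      # states in which policy mu selects action a
        if idx.size:
            TJ[idx] = g[a][idx] + beta * P[a][idx, :].dot(J)
    return TJ
```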

2.1. Value iteration with policy evaluation recursion algorithm

We are interested in problems with two features: (1) large state spaces, and (2) sparse t.p.m.'s. This focus is motivated by problems in the control of queues, an example of which will be precisely formulated in Section 3. The large state space makes the PI algorithm computationally unattractive. If most communication among states occurs between adjacent states, we expect VI to work poorly because the residuals spread slowly among states. We proceed to develop a particular variant of the MPI algorithm.

Before detailing this algorithm, we first provide two observations obtained in testing our scheduling problem with VI. First, in using the span criterion later presented as CT-DC and CT-AC for the discounted and average cost formulations, respectively, we have observed that a policy μ converges to the optimal policy μ* faster than the value function estimate J (J̄ under the average cost criterion) converges to the optimal value function J* (J̄*). That is, even though VI finds an optimal policy in relatively few iterations, many iterations may be required for convergence to the optimal value function J* (J̄*). Second, we have noticed that it is typical in the progress of VI for the same policy μ (or nearly so) to be applied at a number of consecutive VI steps as the algorithm refines its approximation of the value function.

In light of this, we revise VI as follows. Let μ_k (μ_{k+1}) be the policy obtained at iteration k (k+1) of VI. Let 1{μ_{k+1}(i) ≠ μ_k(i)} denote the indicator function with value one if and only if the action generated for state i by the VI step at iteration k+1 differs from that for iteration k. Then |μ_{k+1} − μ_k| = Σ_{i∈S} 1{μ_{k+1}(i) ≠ μ_k(i)} equals the total number of states in which the action has changed from iteration k to k+1 of VI. If policy μ_{k+1} has changed in p or fewer states during consecutive VI steps, rather than take another consecutive VI step, we insert a policy evaluation (PE) step that performs a number of SS steps using the current estimate of the optimal policy. After this PE step, we resume a series of one or more VI steps. The result is that VI is periodically interrupted by a PE step.

The PE branching condition reflects the fact that the policy estimate may be much more accurate than the value function estimate. Note that if p = |S|, a PE step follows every VI step, which is equivalent to a basic MPI implementation with an approximate PE step at every stage. More formally, suppose we stop VI at iteration k+1 because |μ_{k+1} − μ_k| ≤ p. The new algorithm requires the implementation of m SS iterations with TJ^k as an initial value, where m is some positive integer that is adaptively selected based on the progress of the algorithm. Then, we restart VI with J^{k+1} = T^m_{μ_{k+1}} TJ^k instead of J^{k+1} = TJ^k. For this reason, we call this algorithm the Value Iteration with Policy Evaluation (VIPE) implementation of MPI.

The following algorithm provides the VIPE algorithm for both the discounted cost case and, as designated by steps in parentheses, the average cost case. In the average cost case, the operators T and T_μ are defined as in (3), (4) with β = 1. Let e be a column vector of ones with dimension |S|. Also, upon termination at some stage k_0, we approximate J̄* by Th^{k_0}(s), where s is preselected as an arbitrary reference state.

1. Initialization: Set k = 0; specify p, m, ε, and other parameters as needed. Pick an initial guess J^0 such that componentwise J^0 ≤ TJ^0 (an initial h^0 in the average cost case).

2. Value iteration step: Implement a VI step on the current estimate J^k (h^k) and find μ_{k+1} satisfying

T_{μ_{k+1}} J^k = TJ^k    (T_{μ_{k+1}} h^k = Th^k).

3. Termination test: Perform a valid convergence test (for example, see CT-DC and CT-AC below). If convergence has not been reached, go to step 4; otherwise stop.

4. Branch test: If |μ_{k+1} − μ_k| ≤ p, go to step 5; otherwise, set

J^{k+1} = TJ^k    (h^{k+1} = Th^k − Th^k(s)e),

increase k by 1, and go to step 2.

5. Policy evaluation step: Set the (k+1)th estimate via the following m SS iterations:

J^{k+1} = T^m_{μ_{k+1}} TJ^k    (h^{k+1} = H^{m+1}_{k+1}, where H^{j+1}_{k+1} = T_{μ_{k+1}} H^j_{k+1} − T_{μ_{k+1}} H^j_{k+1}(s)e for j = 1, ..., m, and H^1_{k+1} = Th^k − Th^k(s)e).

Increase k by 1 and go to step 2.
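To make the control flow of steps 1-5 concrete, here is a minimal sketch of a discounted-cost VIPE loop built on the vi_step and ss_step helpers above. The branch on the number of changed actions and the periodic PE step follow the description in the text; the span-based stopping rule on the one-step residual and the fixed number m of SS iterations per PE step are simplifying assumptions of this sketch (adaptive choices of m are discussed below).

```python
def vipe_discounted(P, g, beta, p, m, eps, max_iter=10000):
    """Value Iteration with Policy Evaluation for the discounted MDP (Section 2.1 sketch)."""
    J = np.zeros(g[0].shape[0])              # with g >= 0 this gives J^0 <= T J^0 componentwise
    mu_prev = None
    for _ in range(max_iter):
        TJ, mu = vi_step(J, P, g, beta)      # step 2: VI step; mu attains T J^k
        resid = TJ - J                       # step 3: span bound on the one-step residual
        if (beta / (1.0 - beta)) * (resid.max() - resid.min()) < eps:
            return TJ, mu
        if mu_prev is not None and np.count_nonzero(mu != mu_prev) <= p:
            for _ in range(m):               # steps 4-5: PE step, m SS iterations with policy mu
                TJ = ss_step(TJ, P, g, beta, mu)
        J, mu_prev = TJ, mu                  # J^{k+1} = T_mu^m T J^k (or T J^k), k <- k + 1
    return J, mu_prev
```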


Although any valid termination test can be used, we implemented the following error bounds and convergence test (see Proposition 4 in Chapter 5 and Proposition 7 in Chapter 7 of [1]). For the discounted MDP, we label the convergence test in the discounted cost case as CT-DC:

b_k = (β/(1−β)) min_{i∈S} { TJ^k(i) − TJ^{k−1}(i) },    (5)

b̄_k = (β/(1−β)) max_{i∈S} { TJ^k(i) − TJ^{k−1}(i) }.    (6)

For the average cost MDP, we label the convergence test in the average cost case as CT-AC:

b_k = min_{i∈S} { Th^k(i) − Th^{k−1}(i) },    (7)

b̄_k = max_{i∈S} { Th^k(i) − Th^{k−1}(i) }.    (8)

In either CT-DC or CT-AC, the convergence test checks the span norm as follows: if (b̄_k − b_k) < ε, the algorithm is stopped.

Puterman [14] presents on page 187 a general family of MPI algorithms broadly defined to allow m to depend on the iteration k as m_k. It is suggested that the sequence {m_k} can be (a) m_k = m for all k, (b) prespecified, or (c) determined adaptively. The VIPE implementation of MPI differs from implementation (a) of MPI (with a PE step at every iteration), because the number of SS steps at any stage may be one (VI step only) or m+1 (VI with PE), according to our particular branching condition. Thinking in terms of implementation (c), we can construe VIPE as a special case of MPI: an adaptive MPI algorithm with m_k = 1 for VI steps and m_k = m+1 for PE steps. VIPE is novel in that branching to a PE step is based on changes in the policy at k+1 rather than on the residual vector (J^{k+1} − J^k) at stage k+1.

Concerning how to choose a value of m, one approach is to empirically find an effective fixed value. It has been our quest to get the best performance from every algorithm tested, and we found that adaptive rules for m (implementation (c) of MPI) worked better than fixed m. For the discounted MDP, we use a basic method suggested for the adaptive implementation (c) in Puterman [14] (which we refer to as PU) and the third method in Ohno and Ichiki [13], which we denote OI. Ohno and Ichiki [13] treats only discounted problems, so we develop for the case of average cost a new method (KVR) that is similar in spirit to method OI. We also test PU in the average cost case.

PU has a pre-determined stopping tolerance, δ, and solves the PE step to within δ. As an additional restriction, we select the parameter M to prevent the number of SS iterations on any given PE step from exceeding M. This limit M will also be enforced for the remaining methods, OI and KVR. In the OI method, the stopping tolerance of the PE step is reduced as the algorithm progresses: the stopping tolerance δ_{k+1} at the (k+1)th PE step shrinks as a function of δ_k and of m_k, the number of SS iterations at the kth PE step, with δ_0 = δ; we have added a floor at ε so that the PE solution is never required to be more precise than the VI step. As in PU, the number of SS iterations cannot exceed M. For the average cost MDP, we propose a method, KVR (Kim, Van Oyen, and Rieders), in which δ_{k+1} is limited to δ_k divided by the square root of the number of SS iterations at the kth PE step (i.e., δ_{k+1} = max{δ_k/√(m_k), ε}), and m_k ≤ M is enforced. By using √(m_k), as opposed to simply m_k, the rate at which the tolerance level is decreased is slowed and is desensitized to variations in m_k as a function of k.

We turn our attention to the convergence of VIPE in the context of the class of MDP's with finite state and action spaces. Because we have identified VIPE as a particular variant of adaptive MPI, the convergence of VIPE for the discounted MDP has already been established in Theorem 6.5.5 of Puterman [14]. Moreover, we note that at iteration k, VIPE is not further from convergence to the optimal value function J* than is VI. Letting the sequence of estimates generated by VI be denoted {v^k}, the proof of Theorem 6.5.5 of Puterman [14] justifies the following result.

Theorem 1: With J^0 = v^0 ≤ TJ^0, the sequence {J^k} generated by VIPE converges monotonically and uniformly to J*, with v^k ≤ J^k ≤ J* for all k.

For the average cost MDP, we prove that the maximum error bounds of VIPE are monotonically nonincreasing. Note that the minimum error bounds do not possess this property, as a result of the policy evaluation step. The convergence of VIPE, therefore, remains an open question in the average cost setting. Let {h^k} be the sequence of differential cost vectors generated by VIPE. That is,

h^k = Th^{k−1} − Th^{k−1}(s)e,    if |μ_k − μ_{k−1}| > p,    (9)

h^k = H^{m+1}_k,    if |μ_k − μ_{k−1}| ≤ p,    (10)

where H^{j+1}_k(i) = T_{μ_k} H^j_k(i) − T_{μ_k} H^j_k(s) for j = 1, ..., m, and H^1_k(i) = Th^{k−1}(i) − Th^{k−1}(s).

Theorem 2: For the average cost MDP, VIPE generates a monotonically nonincreasing sequence {b̄_k}, where b̄_k = max_{i∈S} {Th^k(i) − h^k(i)}.

Proof: When h^k is given by (9), b̄_k ≤ b̄_{k−1} (see Proposition 7, Chapter 7 of Bertsekas [1]). Suppose h^k is given by (10). Since Th^k(i) ≤ T_{μ_k} h^k(i), it follows that

Th^k(i) − h^k(i) ≤ T_{μ_k} h^k(i) − h^k(i) = Σ_{j∈S} P^{μ_k(i)}_{ij} {T_{μ_k} H^m_k(j) − H^m_k(j)} ≤ max_{i∈S} {T_{μ_k} H^m_k(i) − H^m_k(i)}.

Hence, it is sufficient to show that max_{i∈S} {T_{μ_k} H^m_k(i) − H^m_k(i)} ≤ max_{i∈S} {Th^{k−1}(i) − h^{k−1}(i)}. First, we have max_{i∈S} {T_{μ_k} H^1_k(i) − H^1_k(i)} ≤ max_{i∈S} {Th^{k−1}(i) − h^{k−1}(i)}, because

T_{μ_k} H^1_k(i) − H^1_k(i) = Σ_{j∈S} P^{μ_k(i)}_{ij} {Th^{k−1}(j) − h^{k−1}(j)} ≤ max_{i∈S} {Th^{k−1}(i) − h^{k−1}(i)}.

An inductive application of the above argument shows that max_{i∈S} {T_{μ_k} H^{j+1}_k(i) − H^{j+1}_k(i)} ≤ max_{i∈S} {T_{μ_k} H^j_k(i) − H^j_k(i)} for j ≥ 1. It follows that max_{i∈S} {T_{μ_k} H^m_k(i) − H^m_k(i)} ≤ max_{i∈S} {Th^{k−1}(i) − h^{k−1}(i)}. □
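As a complement, the following is a minimal sketch of the PE-step bookkeeping for the average cost case under the PU and KVR termination rules described above: SS iterations stop once the change in the differential cost falls below the current tolerance δ or after M iterations, and KVR then divides δ by the square root of the number of SS iterations just used, never letting it drop below ε. The sup-norm stopping check, the reference-state argument s, and all names are our own illustrative choices, not the authors' code.

```python
def pe_step_adaptive(h, P, g, mu, delta, eps, M, s=0, rule="KVR"):
    """Approximate PE step for the average cost case with an adaptive tolerance.

    Returns the new differential cost estimate, the number of SS iterations used,
    and the tolerance to apply at the next PE step (reuses ss_step defined earlier).
    """
    m_used = 0
    while m_used < M:
        h_new = ss_step(h, P, g, 1.0, mu)    # relative SS: beta = 1 and subtract the
        h_new = h_new - h_new[s]             # value at the reference state s
        m_used += 1
        done = np.max(np.abs(h_new - h)) < delta
        h = h_new
        if done:
            break
    if rule == "KVR":                        # delta_{k+1} = max(delta_k / sqrt(m_k), eps)
        next_delta = max(delta / np.sqrt(m_used), eps)
    else:                                    # PU keeps a fixed tolerance
        next_delta = delta
    return h, m_used, next_delta
```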

2.2. Replacement process decomposition algorithm


Schweitzer et al. [20] proposes several types of replacement process decomposition (RPD) algorithms for the discounted MDP. Because this approach is not developed for the average cost criterion, we consider only the discounted case. We selected one of the variants that is attractive for large-scale problems, and we further develop here an effective implementation scheme. The idea behind this algorithm is as follows. Suppose states are aggregated into M groups, that is, S = ∪_{n=1}^M S(n). Each group has a representative state whose value function (say I) is the maximum of the value functions over that group. A scale vector, y, relates all the states of a group: for state i, y_i equals the ratio of the value function for i over its group representative value function. After constructing a (lower dimensional) aggregated MDP, the RPD algorithm approximately solves this aggregated system, updates the group representative values, and disaggregates them into the original system to revise the value functions of all states. This procedure is repeated until convergence is achieved. The overall algorithm is as follows, presented here in the form appropriate for maximization problems.

1. Initialization: Set k = 0 and specify ε. Pick J^0(i) > 0 for all i ∈ S.

2. Termination test: Let J^k be the current estimate of J*. Perform one VI step with J^k as a starting value function and compute error bounds. If convergence is achieved, stop; otherwise, go to step 3. For example, Schweitzer et al. [20] suggests computing lower and upper bounds as

L_k = min_{i∈S} max_{a∈U(i)} [ g(i, a) + β Σ_{j∈S} P^a_{ij} J^k(j) − J^k(i) ] / [ 1 − β Σ_{j∈S} P^a_{ij} ],    (11)

U_k = max_{i∈S} max_{a∈U(i)} [ g(i, a) + β Σ_{j∈S} P^a_{ij} J^k(j) − J^k(i) ] / [ 1 − β Σ_{j∈S} P^a_{ij} ],    (12)

and if U_k − L_k ≤ ε, then stop.

3. Let I(n) = max_{j∈S(n)} {J^k(j)}. Estimate y via

y_i = J^k(i)/I(n),    i ∈ S(n), 1 ≤ n ≤ M.

4. Aggregation step: Use the VI algorithm (initialized with the zero vector) to calculate I^{k+1}, the estimate of the exact solution of the aggregated system, I*, which satisfies

I*(n) = max_{(i,a)∈C(n)} { g(i, a) + β Σ_{t=1}^M ( Σ_{j∈S(t)} P^a_{ij} y_j ) I*(t) },    (13)

where C(n) = {(i, a) | i ∈ S(n), a ∈ U(i)}. Use CT-DC to ensure that I^{k+1} is within ε of I*.

5. Disaggregation step: For i ∈ S, re-estimate J* via

J^{k+1}(i) = max_{a∈U(i)} { g(i, a) + β Σ_{t=1}^M ( Σ_{j∈S(t)} P^a_{ij} y_j ) I^{k+1}(t) }.    (14)

Increase k by 1 and go to step 2.

(Note: In Schweitzer et al. [20] and Schweitzer [21], RPD and the group-scaling algorithm are formulated for maximization problems with positive one-stage cost, g, and value function, J. For this reason, in implementing these algorithms, we multiplied g by −1 and then added a sufficiently positive constant to g to yield a maximization problem with positive one-stage costs and value functions.)

The performance of RPD is sensitive to the method for decomposing states into groups.

We have implemented a dynamic decomposition of the state space suggested in Bertsekas and Castañon [2], which groups states with similar one-step error magnitude, (TJ^k − J^k)(i). With e^k_i = (TJ^k − J^k)(i), let η^k_1 = min_{i∈S} e^k_i, η^k_2 = max_{i∈S} e^k_i, and L = (η^k_2 − η^k_1)/M, where M is an integer and M ≥ 2. The state space is partitioned at the beginning of an iteration as follows:

S(m) = { i : (m−1)·L + η^k_1 ≤ e^k_i < m·L + η^k_1 },    1 ≤ m ≤ M−1,

S(M) = { i : (M−1)·L + η^k_1 ≤ e^k_i ≤ η^k_2 }.

That is, we divide the interval [η^k_1, η^k_2] into M sub-intervals of length L and place states into the aggregated groups S(m), m = 1, ..., M, according to the value of their one-step residual errors. We note that RPD does not guarantee convergence. RPD has a computational advantage over VI only when its number of iterations is significantly lower than that of VI, for the following reasons. First, the computation of the bounds in the termination test requires a VI step. Second, even though the aggregated system has a reduced dimensionality, it is still computationally burdensome to solve this system because the number of actions is increased (see (13)). For these reasons, we developed two variants of RPD that work well for problems with a large state space. The first variant, which we call RPDF, performs the dynamic decomposition of the state space only when the progress of the algorithm satisfies the following conditions A1 and A2 simultaneously:

A1: W_f ≥ W_0, where W_f denotes the number of iterations performed since the last dynamic regrouping of states and W_0 is a positive integer threshold.

A2: (η^k_2 − η^k_1)/(η^{k−1}_2 − η^{k−1}_1) ≥ ω_0, where ω_0 is a positive real threshold.

Conditions A1 and A2 reflect the following empirical observations. A dynamic decomposition performed before every iteration results in poor performance. Even though the iteration immediately after a dynamic decomposition often resulted in an error inflation, the error over the next several iterations decreased rapidly, as is noted in Bertsekas and Castañon [2] and Rieders and Hoyer [18]. The second variant, called RPDSA, is a VI-based algorithm that is intermittently supplemented by an RPD step, which occurs only when A2 is satisfied and W_sa ≥ W_0, where W_sa is the number of VI steps performed since the last RPD step. In RPDSA, the termination test is a byproduct of the VI step, which is a computational advantage over RPD and RPDF.
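The dynamic decomposition used by RPDF, RPDSA, and (in the next subsection) GrS can be sketched in a few lines: states are binned into M groups of equal width according to their one-step residuals e^k_i = (TJ^k − J^k)(i). The function below illustrates that rule with our own names; it is not the authors' implementation.

```python
import numpy as np

def dynamic_groups(J, TJ, n_groups):
    """Partition states into n_groups groups of similar one-step residual (TJ - J)(i)."""
    e = TJ - J                                    # residuals e^k_i
    lo, hi = e.min(), e.max()
    if hi == lo:                                  # all residuals equal: a single group
        return np.zeros(e.shape[0], dtype=int)
    width = (hi - lo) / n_groups
    group = np.floor((e - lo) / width).astype(int)
    return np.minimum(group, n_groups - 1)        # states with e = hi go into the last group
```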

2.3. Group-scaling algorithm

Schweitzer [21] presents the group-scaling (GrS) method for discounted problems only. The GrS method uses VI augmented with a group-scaling step in which states are aggregated into M groups. The value functions of all states in the same group are scaled by a common scale factor, B(n) for group n. The scale factors solve an aggregated system equation that we later define in (15). Unlike RPD, the GrS algorithm guarantees convergence provided branching to the VI step occurs sufficiently often to ensure geometric convergence. Our approach is to employ VI enhanced by occasional GrS steps that use a dynamic decomposition of the state space. We suggest the following three branching criteria under which the algorithm branches to a GrS step:

C1: W_gs ≥ W_0, where W_gs is the number of VI steps performed since the last GrS step.

C2: (η^k_2 − η^k_1)/(η^{k−1}_2 − η^{k−1}_1) ≥ ω_0 (the same as A2).

C3: For some small value of θ_0 > 0, min_{i∈S} LJ^k(i)/J^k(i) < 1 − θ_0, where LJ^k(i) = TJ^k(i) + (β/(1−β)) min_{j∈S} e^k_j is a lower bound for J* at the kth iteration.

Condition C3 reflects the observation that group scaling has a minor effect at the end of the algorithm, as is suggested in Schweitzer [21]. Whenever the algorithm branches to a GrS step, we implement the dynamic decomposition scheme based on the one-step residual errors, e^k_i, i ∈ S. Recalling that C(n) was defined in (13), the overall group-scaling algorithm is:

1. Initialization: Set k = 0, W = 0, and specify ε. Pick J^0(i) > 0, for all i ∈ S, satisfying J^0 ≤ TJ^0 (componentwise).

2. Termination test: Perform the convergence test CT-DC. If convergence is achieved, stop. Otherwise, if the branching criteria C1, C2, and C3 are satisfied with W_gs = W, set W = 0 and go to step 4; otherwise, set W ← W + 1 and go to step 3.

3. Value iteration step: Estimate J* via J^{k+1} = TJ^k. Increase k by 1 and go to step 2.

4. Group-scaling step:

(1) Aggregation step: Calculate B^{k+1}. That is, with B^{k+1} initialized as the zero vector, use VI to approximately solve, to within ε, the following system for 1 ≤ n ≤ M:

B(n) = max_{(i,a)∈C(n)} { g̃(i, a) + β Σ_{t=1}^M P̃_{nt}(i, a) B(t) },    (15)

where

g̃(i, a) = g(i, a)/J^k(i),    (i, a) ∈ C(n),

P̃_{nt}(i, a) = [ Σ_{j∈S(t)} P^a_{ij} J^k(j) ] / J^k(i),    (i, a) ∈ C(n).

(2) Disaggregation step: Estimate J* via

J^{k+1}(i) = J^k(i) · B^{k+1}(n),    i ∈ S(n), 1 ≤ n ≤ M.

Increase k by 1 and go to step 2.
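To show how the aggregation step (15) and the subsequent rescaling can be organized, the following sketch forms the scaled costs g̃ and the aggregated weights for every state-action pair, runs VI on the scale factors B, and then disaggregates. It assumes the maximization form with J^k(i) > 0 (see the note in Section 2.2) and a common action set; it illustrates the bookkeeping rather than the Matlab implementation used in the paper.

```python
import numpy as np

def grs_step(J, P, g, beta, group, n_groups, eps, max_sweeps=1000):
    """One group-scaling step: approximately solve (15) for B and return J^{k+1} = J^k * B(group)."""
    n_states, n_actions = J.shape[0], len(P)
    # Y[j, t] = J(j) if state j lies in group t, else 0, so P[a] @ Y sums P^a_{ij} J(j) over each group
    Y = np.zeros((n_states, n_groups))
    Y[np.arange(n_states), group] = J
    g_tilde = [g[a] / J for a in range(n_actions)]                 # g~(i, a) = g(i, a) / J^k(i)
    W = [np.asarray(P[a].dot(Y)) / J[:, None] for a in range(n_actions)]

    B = np.zeros(n_groups)
    for _ in range(max_sweeps):                                    # VI on the aggregated system (15)
        q = np.column_stack([g_tilde[a] + beta * W[a].dot(B) for a in range(n_actions)])
        best = q.max(axis=1)                                       # best (i, a) value for each state i
        B_new = np.array([best[group == n].max() if np.any(group == n) else B[n]
                          for n in range(n_groups)])
        if np.max(np.abs(B_new - B)) < eps:
            B = B_new
            break
        B = B_new
    return J * B[group]                                            # disaggregation step
```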

3. PROBLEM FORMULATION

A single server is to be allocated to jobs in a system of finite parallel queues labeled according to N = {1, 2, ..., N} and fed by Poisson arrivals. The maximum number of jobs in queue n is Q_n < ∞, for n ∈ N. By parallel queues, we mean that a job served in any queue directly exits the system. Each queue (equivalently, node) n possesses an exponential service distribution with mean μ_n^{-1} (0 < μ_n^{-1} < ∞). Successive services in queue n are independent and identically distributed (i.i.d.). Jobs arrive to queue n according to a Poisson process with strictly positive rate λ_n and join queue n if there are fewer than Q_n jobs of type n already waiting; otherwise, they are lost. A switching or set-up time is incurred at each instant (including time 0) that the server switches from queue i to queue n for i ≠ n. The switching time, D_n, represents a period of time required to prepare the server for processing jobs in a queue different from the current one. We assume that the set-up times at queue n form a sequence of i.i.d. exponential random variables with rate d_n, where 0 < d_n < ∞. We assume that all arrival processes, service times, and set-up times are mutually independent. We include a linear holding cost rate of c_n per job per unit time in queue n.

The state of the system includes the vector of all queue lengths, the location of the server, and whether the server is performing a set-up or a job service (to enforce the non-preemptive constraint). Based on perfect observations of the state of the system (a controlled Markov chain) and past control actions, a policy specifies, at each decision epoch, that the server either remains working in the present queue, idles in the present queue, or sets up another queue for service.

We take the class of admissible strategies, U, to be non-anticipative policies (with respect to future unrealized events) that are based on perfect observations of the system state. Because we have a controlled Markov chain with a finite state space and bounded costs, without loss of optimality we may restrict our attention to the class of stationary, non-randomized policies (see Bertsekas [1] or Kumar and Varaiya [9]). In addition, for the sake of practical application, U is restricted to the class of greedy and non-preemptive policies, which never idle in a nonempty queue and never interrupt the job service or set-up of queue n for another queue n' ≠ n. Because of the non-preemptive assumption, the set of decision epochs reduces to the set of all arrival epochs to an empty system, service completion epochs, and set-up completion epochs.

The optimal scheduling problem considered here can be formulated as a discrete-time MDP by applying the standard technique of uniformization (see Section 6.7 of Bertsekas [1]). With Λ = Σ_n λ_n, this uniformized version has a constant system transition rate Γ = max{max_n(Λ + μ_n), max_n(Λ + d_n)} for all states, which is an upper bound on the transition rates that apply for any decision epoch and any control action. Denote by x_n(k) the queue length of the nth queue at stage k, 1 ≤ n ≤ N, k = 0, 1, .... We describe the state of the system under policy μ by the vector x(k) = (x_1(k), ..., x_N(k), ν(k), n(k)), where n(k) denotes the location of the server at stage k and ν(k) is as follows: we set ν(k) = 0 if the set-up of queue n(k) is not completed, ν(k) = 1 if the set-up of queue n(k) is completed but the service of the current job at n(k) is not completed, and ν(k) = 2 if both the set-up and service of queue n(k) are completed. Let the action space be A = {1, ..., N}. Suppose at a decision epoch, k, the state is x(k). Action n ∈ A, where n ≠ n(k), causes the server to set up queue n. Action n(k) results in the service of a job in n(k) if x_{n(k)}(k) > 0; otherwise, the server idles in queue n(k) until the next arrival to the system. No other actions are possible. Thus, the state space is S = { ×_{n=1}^N {0, 1, ..., Q_n} } × {0, 1, 2} × {1, ..., N}.

The average cost VIPE algorithm (or VI, MPI, PI) of Section 2 applies with g(i, a) set equal to the cost Σ_{n=1}^N c_n x_n(k), where the scalar state i maps to a vector state that has queue lengths (x_1, ..., x_N). The discounted version is slightly more complex. For completeness, let us assume a discounting of e^{−γt} is applied at time t ∈ ℝ+, where the discount rate is γ > 0. Then β = Γ/(γ + Γ) is the discrete-time discount factor and the resulting one-stage cost is Σ_{n=1}^N c_n x_n(k) (γ + Γ)^{−1}.
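A minimal sketch of how this uniformized model can be set up numerically: enumerate the state space S, compute the total arrival rate Λ, the uniform rate Γ, the discrete-time discount factor, and the one-stage holding costs. The enumeration order and all names are illustrative choices of this presentation; building the sparse t.p.m. for each action proceeds state by state in the same fashion and is omitted for brevity.

```python
import itertools
import numpy as np

def build_polling_states(Q, lam, mu, d, c, gamma):
    """State space and uniformization constants for the polling MDP of Section 3.

    Q, lam, mu, d, c are length-N sequences of buffer sizes, arrival rates, service
    rates, set-up rates, and holding cost rates; gamma > 0 is the continuous-time
    discount rate.
    """
    N = len(Q)
    Lam = sum(lam)                                            # Lambda = sum_n lambda_n
    Gamma = max(max(Lam + m for m in mu), max(Lam + dn for dn in d))
    beta = Gamma / (gamma + Gamma)                            # discrete-time discount factor
    # a state is (x_1, ..., x_N, nu, n): queue lengths, set-up/service phase, server location
    states = [lengths + (nu, n)
              for lengths in itertools.product(*[range(q + 1) for q in Q])
              for nu in (0, 1, 2)
              for n in range(1, N + 1)]
    # holding cost per stage; divide by (gamma + Gamma) for the discounted formulation
    cost = np.array([sum(c[i] * s[i] for i in range(N)) for s in states])
    return states, cost, beta
```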

We now justify the existence of optimal policies for our scheduling model. Since the expected one-stage costs are bounded for our scheduling problem, by Proposition 2 in Chapter 5 of Bertsekas [1], there exists an optimal policy μ* for the discounted cost problem. For the average cost problem, we see that for any two states i_0 and j, we can construct a stationary policy μ such that for some k, P(x(k) = j | x(0) = i_0) > 0 (e.g., an exhaustive cyclic policy). Therefore, from Proposition 4, Chapter 7 of Bertsekas [1], there exists an average cost optimal policy μ* and the optimal average cost J̄* is independent of the initial state i_0.

Although the formulations (1) and (2) are natural when Q_n = ∞ for all n, some readers may wish to include job rejection penalties for the finite buffer problems. This point is not crucial to our intentions here, since the inclusion of job rejection penalties does not significantly affect computational speed or the rate of convergence. Even with the simplifying assumption of infinite buffers, the optimality of a threshold policy has not yet been proven, despite numerous attempts (see Duenyas and Van Oyen [3]). Problems with finite buffers have a more complex structure than those with infinite buffers. Optimal policies for finite systems can be roughly approximated by multidimensional threshold policies. Suppose queue i is such that c_i μ_i ≥ c_j μ_j for j ≠ i, j ∈ {1, 2, ..., N}. We call such a queue a top-priority queue. Once queue i has been set up, an optimal policy tends to keep the server working at i as long as jobs are present (i.e., exhaustive service). If a queue, say j, is not a top-priority queue, then it is typically possible to switch from j to another queue, k, while j still has available jobs. Specifically, if queue k is sufficiently large (i.e., x_k exceeds a threshold function in the other state components), then the server should switch to queue k. Moreover, such switches typically occur only when c_k μ_k ≥ c_j μ_j, which captures a type of "priority" ordering across job types. Optimal policies for systems with finite buffers (whether or not job loss/rejection penalties are charged) frequently do not possess a threshold-type structure, because finite buffers perturb the optimal actions in regions of the state space in which at least one of the queues is relatively full. For a detailed treatment of the surprisingly intricate structural properties of optimal policies of systems with or without rejection penalties, we refer the reader to Kim and Van Oyen [7] for models without set-ups and to Kim and Van Oyen [8] for a detailed investigation and heuristics for models with set-ups and with or without rejection penalties.

4. COMPUTATIONAL RESULTS

We tested the VI, RPD, GrS, VIPE, and MPI algorithms under the discounted cost criterion and VI, VIPE, and MPI under the average cost criterion for our scheduling problem.

The algorithms were coded in Matlab (using sparse matrix computations) and run on a SPARC 20 Workstation with 32 Mbytes of RAM. These codes are general versions of the VI, RPD, GrS, VIPE, and MPI algorithms. That is, these codes did not use any structural properties of our scheduling problem. For the sake of speed, 'CMEX' codes (i.e., C code subroutines) were written to avoid slow Matlab operations. Sparse matrix computations are performed in Matlab as a standard feature; thus none of our DP routines require problem-specific coding once the t.p.m.'s and one-stage cost vectors (one for each admissible action) have been stored in memory.

For discussion of computational results, we introduce the following notation:

- NI = number of iterations taken until convergence.
- NIA = total number of iterations taken to iteratively solve an aggregation step until convergence.
- NR = number of dynamic regroupings of the state space.

The quantity NI is used to report numerical test results for all of the algorithms, while NIA and NR are used specifically for RPD, RPDF, RPDSA, and GrS. Of the many problems tested, 8 examples are summarized in Tables 1 and 2. Examples 1-4 have two queues and Examples 5-8 have three queues. For Examples 1-2 and 3-4, we set the queue/buffer sizes to 15 and 30, respectively (|S| = 1504 and 5704, respectively). For Examples 5-8, the buffer size is set to 9 for each queue (|S| = 8700). Problems with up to 100,000 states with 4 queues and 4 actions per state were run with VI, MPI, and VIPE. Because we did not find significant sensitivity with respect to problem size in the relative performance of the algorithms, the examples presented are for problems of modest size that can be implemented for all the algorithms. Moreover, general implementations are essential to our methodology; however, this does dramatically reduce the size of the problems that we can solve. Aggregation-disaggregation methods are hindered by the fact that they require much more memory even in problem-specific implementations, because dynamic state regrouping forces one to store in memory the t.p.m. for each aggregated system.

In tests of the discounted cost examples, the discount factor is set as β = 0.9 and the convergence tolerance, ε, to 0.1, except in Table 6, in which ε = 10^{-5}. The aggregated systems of RPD, RPDF, RPDSA, and GrS are approximately solved by VI, and ε = 0.1 is also used as the stopping rule. Under the average cost criterion, ε = 10^{-2}. For RPDF, RPDSA, and GrS, ω_0 is set to 0.75 and W_0 to 4. For the branching condition C3 of GrS, θ_0 is set to 10^{-3}.

In implementing the PU, OI, and KVR variants of VIPE and MPI, the maximum number of SS iterations, M, is set to 50 for the discounted MDP and to 300 for the average cost MDP, respectively. Numerical investigation shows that the choice of M affects the performance of VIPE and MPI. We observe that having M too much lower or higher than the above values degrades the CPU time performance of both VIPE and MPI on the whole. When M is set too low, the computational burden of an increased number of PE steps outweighs the CPU time savings resulting from a reduced number of SS iterations per PE step. The reverse is true with M too large. In particular, in the case of the average cost MDP, with M much smaller or larger than 300, the CPU time performance of VIPE and MPI is worse than what is presented for most of the examples in Tables 6 and 7. Although the stopping test for PU and OI is implemented at each SS iteration in [13], we obtained a small run-time savings by implementing them once every W_0 SS iterations. For the discounted (average cost) MDP, W_0 is set to 5 (20, respectively). For the branching condition from a VI step to the PE step of VIPE, we set p = 0.005·|S|, which is proportional to the number of states. Our experience suggests this is a good rule of thumb. It is not easy to identify an optimal choice of p. In some examples, a near-optimal policy is not identified until the final iterations of VIPE, in which case p should be made large to permit PE steps to occur in the early iterations (and not only in the final ones). On the other hand, when a near-optimal policy is identified in the early iterations of VIPE, p = 0 usually works very well.

Table 3 provides computational results for the RPD algorithm and its variants, RPDF and RPDSA. The numerical results given in Tables 3-7 record CPU time in seconds. The CPU time performance of RPD is the worst among the three. Except in one case (see Example 1), the NI performance of RPD is the worst. The overall CPU time and NI performance of RPDSA is the best among the three algorithms. These results indicate that a dynamic decomposition performed before every iteration resulted in poor CPU time performance and did not reduce the number of iterations. In Table 3, note that NI of the RPD algorithm equals the number of dynamic regroupings of the state space. In RPDSA, the number of dynamic regroupings of the state space equals the number of branches to the RPD step from VI. The numerical results suggest that RPDSA works better than RPD and RPDF for our scheduling problem.

Table 4 shows how the branching criteria C1, C2, and C3 of Section 2.3 affect the performance of GrS. The four cases all employ C3, but C1 and C2 are varied. First, we compared the case without C1 and C2 and the case with C1 and C2.

We observed that GrS without C1 and C2 has poor performance with respect to NI and CPU time, compared to GrS with C1 and C2. In other words, frequent branches to the group-scaling step did not significantly reduce the value of NI for GrS, except in Example 1. In Table 4, NR equals the number of branches to the GrS step from VI. Second, for GrS with C1 and C2, we varied the value of W_0. Note that when W_0 = 0, the branching condition C1 is inactive. We can easily see that, as W_0 increases (fewer group-scaling steps), the CPU time decreases but the number of iterations increases. This phenomenon is clear because the group-scaling step is expensive compared to the VI step. For other classes of problems in which the GrS approach is effective, we would expect the selection of W_0 to involve an interesting trade-off between CPU time and the number of iterations.

Table 5 presents the numerical comparison of VI, GrS, and RPDSA. In terms of CPU time performance, VI is the best and RPDSA is the worst. However, in terms of NI performance, RPDSA is the best and VI is the worst. This result reflects the fact that even though GrS and RPDSA reduce the number of iterations compared to VI, the computational effort per iteration is equally important. In terms of complexity, solving one iteration of the aggregated system is about as expensive as solving one iteration of the original system by VI. Computational results show that the total number of iterations required for solving the aggregated system is much larger than that required for solving the original system by VI (compare Tables 3 and 5).

Tables 6 and 7 provide numerical results for VI, VIPE, and MPI under the discounted and average cost criteria, respectively, for Examples 1-8 of Tables 1 and 2. To conserve space, we do not label these examples, but they are presented in order. The third row of each table presents the various values of δ that were tested. After that, the examples follow, with the first row giving CPU time performance and the second row presenting NI performance. Note that in the case of VIPE, the NI row gives two entries per column: the number of VI and PE steps, respectively. Table 6 suggests that the best implementations of VIPE and MPI are roughly twice as fast as VI. It also suggests that VIPE works slightly better than MPI (see Example 5 for the best of the cases), but does not dominate MPI for all examples (Example 4 is an exception). For both VIPE and MPI, OI performs better than PU. That is, having a less precise tolerance, δ, in the initial stages of the algorithm and a more precise tolerance in the final stages takes less CPU time than having a constant tolerance. An interesting and promising feature of VIPE is that the OI method is nearly insensitive to δ, whereas OI with MPI works significantly better for large tolerances than for small tolerances.

For both VIPE and MPI, PU works better as the tolerance becomes smaller. This phenomenon results from the fact that at each PE step, the number of SS iterations cannot exceed M. Because of this, VIPE and MPI can avoid spending CPU time to solve the PE equation to within δ in the initial stages. In the final stages, it is more efficient to have a precise tolerance because the value functions have been well refined during the progress of the algorithm. Thus, it is not surprising to see that the performance of PU with δ = 10^{-5} is almost the same as that of OI with δ = 10^{-2}.

Table 7 presents the numerical results of VI, VIPE, and MPI under the average cost criterion. The results show that VIPE and MPI dominate VI (an 86% savings in Example 8). VIPE works worse (-7.3%) than MPI for Example 4, shows almost the same performance for Example 7, and performs better than MPI for the other examples (up to 67% for Example 6). Similar to the role of OI in the discounted MDP cases, KVR works much better than PU for both VIPE and MPI. Unlike OI in the discounted MDP cases, however, KVR with VIPE performs better for some examples but worse for others as the tolerance, δ, increases. KVR with MPI works better when δ is large than when δ is small. Except for Example 6, the best performance of MPI occurs when δ = 100.

Because the aggregation methods performed poorly for this particular scheduling problem, we are interested in identifying classes of problems for which aggregation based algorithms are efficient. In Figure 1, we present a problem with 7 states, which we refer to as the toy problem. In this example, the t.p.m. depends on φ ∈ [0, 1/2] (φ = 0 yields a reducible chain). Setting φ = 0.1 yielded the best performance for the aggregation methods. We tested variations on this problem structure, but none yielded results more interesting than those we present here. A control action u ∈ {0, 1} is allowed only for state 1. Suppose the one-stage cost is given by

g(i, u) = 1 + 1/φ,    i = 4, 5,    (16)

g(i, u) = 1,    i = 1, 2, 3, 6, 7.    (17)

Table 8 presents a numerical comparison of VI, GrS, RPDSA, PI, and VIPE for this example. For these tests, φ and ε are set to 0.1 and 10^{-3}, respectively. For GrS and RPDSA, the initial number of groups is set to 2. For β = 0.9 and 0.95, VI and GrS show equal CPU time performance and RPDSA shows about 20% worse performance, while for β = 0.99, RPDSA works about 80% (10%) faster than VI (GrS). Compared with the cases β = 0.9 and 0.95, RPDSA reduces the number of iterations considerably when β = 0.99. In spite of their good performance, we see that GrS and RPDSA do not dominate VIPE for this particular problem.

The excellent performance of VIPE occurs despite the fact that we implemented a very simple version of it. We set p = 0 and used a fixed number of SS iterations in each PE step (m = 30).

Figure 1: Toy problem. (The figure shows a seven-state chain whose arc labels, as far as they survive extraction, are u(φ+1/2), 1 − u(φ+1/2), 2φ, 1 − 2φ, φ, 1 − φ, and several transitions of probability one; the control u applies only in state 1.)

5. CONCLUSIONS

In this paper we developed, implemented, and compared general scalar state-space implementations of dynamic programming algorithms. This generality dramatically reduces the size of problems that we can solve, but allows for a fair comparison of the algorithms. We developed a new implementation of modified policy iteration (MPI) called value iteration with policy evaluation (VIPE). We compared it with several dynamic programming algorithms in the context of scheduling a single server in a multi-class queueing system under the criteria of both discounted cost and average cost.

We took care to identify efficient implementations of two aggregation-disaggregation type algorithms, RPD and GrS. We developed two variants of RPD that implement a dynamic state decomposition intermittently based upon the progress of the algorithm. Similarly, computational results for the GrS algorithm indicate that our branching conditions significantly improve the performance. Our tests revealed that the RPD and GrS methods sometimes work as well as value iteration (VI), but sometimes require twice as much time as VI. For this reason, we constructed a very simple problem in which the aggregation based algorithms RPDSA and GrS worked considerably better than either VI or policy iteration (PI), particularly at high discount factors. This provides a basis for hope that these aggregation based algorithms will be found appropriate in problems with a special structure of dense communication among states. For problems in the control of queues or other problems with a sparse controlled t.p.m. structure, we cannot recommend RPD or GrS due to their relatively poor performance and high complexity of implementation.

Unlike VI, VIPE, and MPI, aggregation-disaggregation methods require storing the t.p.m.'s in memory (as opposed to directly hard-coding them). Although aggregation-disaggregation methods can be highly efficient at solving certain Markov chains by exploiting special structure in the t.p.m., control problems make this difficult, if not impossible, because the optimal policy (and hence the resulting t.p.m. structure) generally cannot be known in advance.

We also developed and tested algorithms based on VI and PI. The latter does not work well for our scheduling problem, because the large state space yields an expensive policy evaluation step. On the other hand, we implemented algorithms from the MPI family and found that they often require less than half the CPU time of VI, particularly with carefully chosen methods for adaptively terminating the policy evaluation step. VIPE was developed and implemented as a value iteration algorithm that is periodically interrupted by a PE step based on the amount of change in the policy estimate. The scheduling problem considered here has a very large number of states and a transition probability matrix (t.p.m.) that is extremely sparse, with state transitions resembling a multi-dimensional birth-death process. These features are exploited by the approximate policy evaluation steps found in the MPI family of algorithms. Our computational tests suggest that for controlled polling systems, VIPE is as effective an algorithm as any of which we are aware. Both VIPE and another implementation of MPI dominated the other algorithms tested here. Our results suggest that the VIPE algorithm is promising as a general-purpose DP algorithm, especially in problems with a large, sparse t.p.m. In general, we recommend that an algorithm from the family of MPI algorithms be used, because MPI algorithms are highly efficient and only somewhat more complex than VI to implement.

ACKNOWLEDGMENTS

The work of Eungab Kim and Mark Van Oyen was supported by the National Science Foundation under Grant No. DMI-9522795. The work of Maria Rieders was supported by the National Science Foundation under Grant No. DDM-9211043. The authors thank Sigrun Andradottir, Dimitri Bertsekas, Marcel Neuts, Martin Puterman, and anonymous referees for many helpful comments which have greatly improved this work.


Ex.   c1    c2    μ1    μ2    λ1    λ2    d1    d2
1     1     1     0.6   0.6   0.2   0.2   1     1
2     1     1     2     2     0.3   0.7   1     0.25
3     1.5   1     2     1.5   0.3   0.7   1     0.25
4     5     1     2     0.5   0.3   0.4   5     5

Table 1: Input Data for Examples 1-4.

Ex.   c1   c2   c3   μ1    μ2    μ3    λ1    λ2    λ3     d1    d2   d3
5     1    1    1    0.5   1     1     1/4   1/4   1/4    1     1    1
6     10   1    1    1     1     0.1   1/3   1/3   1/60   1     1    1
7     5    1    1    2     1     1     0.2   0.1   0.6    10    1    0.5
8     1    5    9    4     0.2   0.25  0.8   0.04  0.1    1/3   10   10

Table 2: Input Data for Examples 5-8.

      RPD                   RPDF                        RPDSA
Ex.   CPU     NI   NIA      CPU     NI   NR   NIA       CPU     NI   NR   NIA
1     12.10   24   245      8.72    25   5    203       4.02    23   4    81
2     24.45   49   429      15.43   40   8    345       6.37    43   8    94
3     127.92  58   586      93.28   53   11   557       35.30   48   9    165
4     144.38  52   794      101.75  45   9    671       34.75   37   7    193

Table 3: Comparison of RPD, RPDF, and RPDSA.

      without C1 and C2     with branching criteria C1, C2, and C3
                            W0 = 0               W0 = 4               W0 = 7
Ex.   CPU     NI   NR       CPU     NI   NR      CPU     NI   NR      CPU     NI   NR
1     4.03    37   8        3.07    40   2       3.22    46   2       3.10    50   1
2     6.40    44   16       5.93    44   14      3.45    46   2       3.27    46   2
3     37.57   63   23       36.05   61   22      19.98   61   4       17.68   62   2
4     64.83   54   53       38.40   61   18      19.82   56   3       18.92   59   2

Table 4: Numerical performance of GrS with varying W_0.

      VI               GrS (W0 = 7)               RPDSA
Ex.   CPU     NI       CPU     NI   NR   NIA      CPU     NI   NR
1     3.13    66       3.10    50   1    20       4.02    23   4
2     3.12    62       3.27    46   2    21       6.37    43   8
3     14.45   74       17.68   62   2    33       35.30   48   9
4     16.13   83       18.92   59   2    47       34.75   37   7

Table 5: Comparison of VI, GrS, and RPDSA.

                   |  VIPE                                      |  MPI
             VI    |  PU              OI                        |  PU              OI
 δ                 |  10^-5   10^-2   10^-2   10      100       |  10^-5   10^-2   10^-2   10      100     200
 Ex.1  CPU   4.93  |  2.10    2.40    2.08*   2.32    2.32      |  3.43    4.18    3.45    3.35    2.53    2.45+
       NI    141   |  8,3     22,17   8,3     9,4     12,7      |  11      38      11      11      8       10
 Ex.2  CPU   4.70  |  2.23    2.82    2.22    2.47    2.10*     |  3.40    4.10    3.43    3.40    2.50    2.35+
       NI    127   |  8,3     27,22   8,3     9,4     11,6      |  9       34      9       9       6       9
 Ex.3  CPU   22.1  |  11.9    14.3    11.9    10.5    9.7*      |  13.9    16.7    13.9    13.8    10.4    10.2+
       NI    155   |  13,9    35,31   13,9    8,4     11,7      |  13      39      13      13      7       10
 Ex.4  CPU   22.0  |  10.6    12.3    10.5    10.0    9.7*      |  10.1    11.6    10.1    10.0    9.5+    10.2
       NI    159   |  9,4     31,26   9,4     9,4     12,7      |  4       25      4       4       6       10
 Ex.5  CPU   38.6  |  15.6    17.6    15.6    13.5*   15.3      |  31.5    35.7    31.4    30.8    24.1    16.7+
       NI    129   |  9,3     19,13   9,3     9,3     12,6      |  10      32      10      10      10      8
 Ex.6  CPU   49.8  |  17.3    24.0    17.3    17.1*   18.5      |  17.4    23.3    17.3+   17.3+   18.1    17.8
       NI    170   |  9,4     36,31   9,4     9,4     13,8      |  4       31      4       4       7       10
 Ex.7  CPU   47.8  |  16.9*   22.5    16.9*   18.4    18.3      |  20.7    28.9    20.4    19.7    17.2+   19.2
       NI    156   |  8,3     29,24   8,3     9,4     12,7      |  7       39      7       5       6       10
 Ex.8  CPU   50.92 |  19.0    25.5    19.0    17.7*   19.9      |  20.4    27.4    20.2    19.3    19.0    18.7+
       NI    169   |  10,5    37,32   10,5    9,4     13,8      |  8       35      8       6       7       10

Table 6: Comparison of VI, VIPE, and MPI for Examples 1-8 under the discounted cost criterion.
* The best CPU time performance of VIPE.  + The best CPU time performance of MPI.

                   |  VIPE                                        |  MPI
             VI    |  PU                 KVR                      |  PU              KVR
 δ                 |  1        10        1       10      100      |  1       10      1       10      100     200
 Ex.1  CPU   11.07 |  4.82     8.22      2.52*   2.63    2.85     |  8.32    8.32    8.18    7.18    2.87+   2.92
       NI    403   |  71,66    154,149   8,3     9,4     15,9     |  94      136     7       6       9       12
 Ex.2  CPU   5.85  |  4.30     4.12      5.33    3.38    2.03*    |  6.58    4.77    8.97    8.72    1.88+   2.00
       NI    209   |  60,28    57,44     38,6    15,4    15,10    |  36      55      10      10      9       12
 Ex.3  CPU   61.4  |  52.0     55.0      39.3    34.1    21.1*    |  51.2    52.0    46.4    45.3    27.7+   28.9
       NI    580   |  163,153  233,223   16,7    16,6    15,9     |  128     221     10      10      9       11
 Ex.4  CPU   200   |  83.5     121       35.7    35.7    34.5*    |  87.9    108     40.6    39.4    32.2+   38.1
       NI    1858  |  342,324  548,528   25,7    25,7    24,7     |  354     494     8       8       8       10
 Ex.5  CPU   62.78 |  38.2     60.7      12.4*   13.4    16.1     |  43.7    33.0    49.8    48.6    14.4+   15.2
       NI    288   |  74,68    127,121   9,3     12,5    18,10    |  39      53      11      11      10      12
 Ex.6  CPU   135   |  63.9     94.2      46.7    44.5*   46.6     |  77.5    86.4    75.3    74.2+   76.7    74.6
       NI    626   |  124,103  218,196   28,7    28,7    32,8     |  117     186     10      10      11      12
 Ex.7  CPU   245   |  146      168       51.5    46.6*   47.7     |  194     179     94.6    93.0    47.8+   48.0
       NI    1101  |  366,360  461,455   11,5    11,5    14,8     |  408     477     9       9       8       10
 Ex.8  CPU   434   |  174      223       56.4    55.6*   62.3     |  178     244     72.2    72.4    61.3+   62.4
       NI    1985  |  423,407  565,550   23,7    23,7    27,9     |  399     598     11      11      13      15

Table 7: Comparison of VI, VIPE, and MPI for Examples 1-8 under the average cost criterion.
* The best CPU time performance of VIPE.  + The best CPU time performance of MPI.

        VI              GrS              RPDSA            PI             VIPE
 β      CPU    NI       CPU    NI        CPU    NI        CPU    NI     CPU    NI
 0.9    0.15   105      0.15   64(4)     0.20   48(8)     0.18   2      0.07   6(4)
 0.95   0.35   228      0.35   178(5)    0.42   79(13)    0.42   2      0.13   10(8)
 0.99   2.01   1320     1.32   668(13)   1.22   192(32)   2.37   2      0.97   97(95)

Table 8: Comparison of VI, GrS, RPDSA, PI, and VIPE for the toy problem.


REFERENCES

[1] D.P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs (1987).
[2] D.P. Bertsekas and D.A. Castañon, "Adaptive aggregation methods for infinite horizon dynamic programming," IEEE Transactions on Automatic Control, 34 (1989) 589-598.
[3] I. Duenyas and M.P. Van Oyen, "Heuristic scheduling of parallel heterogeneous queues with set-ups," Management Science, 42:6 (1996) 814-829.
[4] A. Federgruen and Z. Katalan, "The stochastic economic lot scheduling problem: Cyclical base-stock policies with idle times," Management Science, 42:6 (1996) 783-796.
[5] M. Herzberg and U. Yechiali, "Accelerating procedures of the value iteration algorithm for discounted Markov decision processes, based on a one-step lookahead analysis," Operations Research, 42 (1994) 940-946.
[6] E. Kim and M.P. Van Oyen, "Dynamic scheduling to minimize lost sales subject to set-up costs," forthcoming in Queueing Systems (QUESTA) (1998).
[7] E. Kim and M.P. Van Oyen, "Beyond the cμ rule: Dynamic scheduling of a two-class loss queue," forthcoming in Mathematical Methods of Operations Research, 48:1 (1998).
[8] E. Kim and M.P. Van Oyen, "Finite-capacity multi-class production scheduling with setup times," working paper.
[9] P.R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control, Prentice-Hall, Englewood Cliffs (1986).
[10] H. Levy and M. Sidi, "Polling systems: applications, modelling, and optimization," IEEE Transactions on Communications, 38 (1990) 1750-1760.
[11] Z. Liu, P. Nain, and D. Towsley, "On optimal polling policies," Queueing Systems (QUESTA), 11 (1992) 59-83.
[12] D.M. Markowitz and L. Wein, "Heavy traffic analysis of dynamic cyclic policies: A unified treatment of the single machine scheduling problem," Working Paper 3925-96-MSA, MIT Sloan School of Management, Cambridge (1996).
[13] K. Ohno and K. Ichiki, "Computing optimal policies for controlled tandem queueing systems," Operations Research, 35:1 (1987) 121-126.
[14] M.L. Puterman, Markov Decision Processes, J. Wiley, New York (1994).
[15] M.L. Puterman and S.L. Brumelle, "On the convergence of policy iteration in stationary dynamic programming," Mathematics of Operations Research, 4 (1979) 60-69.
[16] M.L. Puterman and M. Shin, "Modified policy iteration algorithms for discounted Markov decision problems," Management Science, 24 (1978) 1127-1137.
[17] M.L. Puterman and M. Shin, "Action elimination procedures for modified policy iteration algorithms," Operations Research, 30 (1982) 301-318.
[18] M. Rieders and K.K. Hoyer, "Dynamic state space decomposition in aggregation algorithms for Markov chains," Technical Report No. 93-07, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208 (1994).
[19] S.M. Ross, Introduction to Stochastic Dynamic Programming, Academic Press, New York (1983).
[20] P.J. Schweitzer, U. Sumita, and K. Ohno, "Replacement process decomposition for discounted Markov renewal programming," Annals of Operations Research, 29 (1991) 631-646.
[21] P.J. Schweitzer, "Block-scaling of value iteration for discounted Markov renewal programming," Annals of Operations Research, 29 (1991) 603-630.
[22] P.J. Schweitzer and K.W. Kindle, "Iterative aggregation for solving undiscounted semi-Markovian reward processes," Communications in Statistics - Stochastic Models, 2 (1986) 1-41.
[23] H. Takagi, "Queueing analysis of polling models: Progress in 1990-1993," in J.H. Dshalalow, ed., Frontiers in Queueing: Models, Methods and Problems, CRC Press (1994).
[24] L.C. Thomas, R. Hartley, and A.C. Lavercombe, "Computational comparison of value iteration algorithms for discounted Markov decision processes," Operations Research Letters, 2 (1983) 72-76.