Computers Ops Res. Vol. 20, No. 8, pp. 857-877, 1993
Printed in Great Britain. All rights reserved

0305-0548/93 $6.00 + 0.00
Copyright © 1993 Pergamon Press Ltd

ESTIMATING EQUILIBRIUM PROBABILITIES FOR BAND DIAGONAL MARKOV CHAINS USING AGGREGATION AND DISAGGREGATION TECHNIQUES

DAVID F. ROGERS1† and ROBERT D. PLANTE2‡

1Department of Quantitative Analysis and Information Systems, College of Business Administration, University of Cincinnati, Carl H. Lindner Hall, Cincinnati, OH 45221-0130 and 2Krannert Graduate School of Management, Purdue University, West Lafayette, IN 47907, U.S.A.

(Received April 1991; in revised form November 1992)

Scope and Purpose—The effective control of large-scale programming problems often involves the use of Aggregation/Disaggregation (A/D) methodologies. A/D consists of the development of aggregate (reduced) models that are analyzed, with the results extrapolated to estimate solutions for the original model, usually resulting in suboptimal solutions. A/D techniques have gained wide acceptance as a valuable methodology for solving large operations research problems, and there is a voluminous literature on A/D techniques for Markov chains in applied probability and operations research. In this paper an abbreviated survey of A/D for Markov chains is presented. Experimental test results are then presented for several different A/D methods performed upon general band diagonal Markov chains of the type that may arise in certain inventory, queuing, and manpower planning problems.

Abstract—The purpose of this paper is twofold. First, Aggregation/Disaggregation (A/D) techniques for Markov chains, including the notions of (1) lumpability, (2) near decomposability, and (3) iterative A/D, are surveyed. Next, several design issues for A/D were tested upon several variants of a general class of band diagonal Markov chains. Different (1) cluster entities, (2) methods of combination, (3) methods of dissection, and (4) clustering algorithms were applied to ten levels of seven different types of simulated band diagonal Markov chains of the type that may arise in inventory, queuing, and manpower planning, and the results are presented.

1. INTRODUCTION

1.1. Discussion

Effective solution techniques for discrete time Markov chains are crucially important for large-scale problems that are commonly encountered in, for example, queuing applications, inventory problems, manpower planning models, maintenance models, and many other practical settings. Typical Markov chains that may result for these applications may be unwieldy because of the state space dimensionality. This is especially true for Markovian models that have states represented as vectors. Mendelssohn [1] noted an example of a Markov chain embedded in a Markov decision process for fisheries management that required a seven dimensional state space with five grid points per dimension. The number of states for this problem is 5^7 = 78,125, which yields a system larger than many present-day computers can handle efficiently. Even given that a large-scale Markov process may be stored, computer processing time may become extremely burdensome, especially for sequential decision making in Markov decision processes.

†David F. Rogers is Associate Professor of Quantitative Analysis in the Department of Quantitative Analysis and Information Systems, College of Business Administration, University of Cincinnati, Cincinnati, Ohio. He received a B.S. in Mathematics and an MBA (Quantitative Methods) from Murray State University and his Ph.D. from the Krannert Graduate School of Management at Purdue University. His primary research interests include aggregation/disaggregation techniques applied to large-scale programming problems and cellular manufacturing methodology. He has publications in Operations Research, Journal of the Operational Research Society, Decision Sciences, Cost Analysis Applications of Economics and Operations Research, and Medical Decision Making.

‡Robert D. Plante is Professor of Management at the Krannert Graduate School of Management at Purdue University, West Lafayette, Indiana. He received a B.S. degree in Physics from Worcester Polytechnic Institute and his Ph.D. in Business Administration (Management Science) from the University of Georgia. His research interests are the design and application of operations research models in quality assurance and production/inventory problems. He has consulted for over a dozen firms and his publications may be found in Operations Research, European Journal of Operational Research, Naval Research Logistics, Management Science, Decision Sciences, IIE Transactions, and Journal of the American Statistical Association.

[Figure 1 appears here as a flow diagram linking the elements of aggregation/disaggregation: aggregation analysis (cluster entity, similarity criteria, cluster procedure, levels, methods of combination), error analysis (a priori and a posteriori), and disaggregation analysis (methods of dissection).]

Fig. 1. Elements of aggregation/disaggregation.

Schweitzer [2] noted that Markov chain models are continually getting larger and more complex. Aggregation and Disaggregation (A/D) techniques have proven to be very effective tools for manipulating systems to overcome hardware and software constraints. They may be used to obtain acceptable approximations for an original model or be embedded in an algorithm to assist in completely solving a problem. A/D methodologies for estimating equilibrium (or steady state or limiting) probabilities for large-scale Markov chains will be the focus of this article.

1.2. The aggregation/disaggregation process

There are many issues that must be considered for an A/D process for large-scale programming, and they are summarized in Fig. 1. General issues for most aggregation analyses include selecting the cluster entity, the data from the original model for which a lower dimensionality is desired. Similarity (or distance) criteria, such as the family of Minkowski metrics, may then be calculated for each pair of elements of the cluster entity and stored in a lower triangular matrix. Cluster procedures are then selected and used to determine the groups of elements of the cluster entity that are closest according to the similarity criteria. There is a wide variety of different cluster procedures available, classified as either hierarchical or nonhierarchical. The level to cluster elements, i.e. the number of clusters, is a major design issue of A/D and involves the tradeoff of model tractability versus information loss. After cluster membership has been determined, a method of combination is used to derive data for the reduced (aggregate) model from the original data. The error introduced by employing A/D may be bounded before solving the aggregate model (a priori error bounds) or after solving the aggregate model (a posteriori error bounds). Disaggregation analysis issues include the method of dissection for deriving solutions for the original model from the aggregate solutions. Sometimes the interest may be in disaggregate solutions at levels other than the level of the original model.

Anderberg [3], Hartigan [4], Spath [5,6], Gaul and Schader [7], Bock [8], and Milligan and Cooper [9] provide excellent reading for researchers interested in cluster analysis. Ijiri [10] and Rogers et al. [11] contain frameworks and surveys that encompass general A/D methodology. Rogers et al. [11] additionally contains a discussion of A/D techniques specifically designed for optimization problems. The remainder of this article is organized as follows.
Sections 2 and 3 are tutorial in nature and are intended to be a thorough survey of A/D techniques in Markov chains. Section 2 is a review of A/D techniques for Markov chains and includes discussions of (1) total consistency between an aggregate Markov chain and an original Markov chain, (2) control of Markov chains that are nearly completely decomposable, and (3) iterative A/D approaches for general Markov chains. In Section 3 the A/D elements for approximating equilibrium probabilities of an original Markov chain with equilibrium results from an aggregate Markov chain are further scrutinized. Several portions of the A/D process for Markov chains were then tested; Section 4 describes the simulation study, the experimental design, and the results for several of the elements of A/D applied to several variants of band diagonal Markov chains. Section 5 summarizes A/D in Markov programming and gives directions for future research.

2. FACETS OF AGGREGATION AND DISAGGREGATION IN MARKOV CHAINS

2.1. Totally consistent aggregation/disaggregation for Markov chains

Most of the early work for A/D in Markov chains involved finding results for total consistency between an aggregate (or reduced) Markov chain and an original Markov chain. This property is known as lumpability (Kemeny and Snell [12]) or mergeability (Howard [13]). A lumpable or mergeable Markov chain is one for which a clustering process has been applied and, for any initial probability vector, the resulting transition probability matrix for the aggregate problem retains the Markovian property. In addition, the steady state vector for the aggregate problem of a lumpable Markov chain will have elements equal to the sum of the steady state probabilities for the states in each cluster. The concept of lumpability is due to Burke and Rosenblatt [14] and Rosenblatt [15]. As will now be shown, the conditions necessary to assure lumpability are quite stringent.

Let $P = (p_{ij})$, $i, j = 1, \ldots, n$, be the one-step transition probability matrix for an $n$-state ergodic Markov chain. States that are in some respect considered to be very similar are typically candidates to be clustered. A clustered transition probability matrix for a partition of the set of states $A = \{A_1, A_2, \ldots, A_t\}$ is defined as

$$\hat{P} = (\hat{p}_{kl}) = UPV, \qquad (1)$$

where $\hat{P}$ is a $t \times t$ matrix of one-step transition probabilities for the partitioned states, $t < n$, $k, l = 1, \ldots, t$. Matrix $U$ is a $t \times n$ matrix of weights for the method of combination such that the $k$th row contains a probability vector having positive components for states in $A_k$ and 0 for the remaining states. Matrix $V$ is an $n \times t$ matrix used to identify the members of each $A_l$ to be combined for macroscopic analysis: each column of $V$ is a vector consisting of 1's in the components corresponding to states in $A_l$ and 0's otherwise. A necessary and sufficient condition for $P$ to be lumpable with respect to the partition $A$ is that

$$\sum_{j \in A_l} p_{ij} = \sum_{j \in A_l} p_{mj} \quad \text{for all } i, m \in A_k \qquad (2)$$

holds for every pair of clusters $A_k$ and $A_l$. The probability of a transition from each state in a cluster to every other cluster must therefore be identical. It may be shown that $P$ is lumpable if and only if the columns of $PV$ are fixed vectors of $VU$, i.e.

$$VUPV = PV. \qquad (3)$$
Much of the prior literature in this area has been for the determination of a $U$ and $V$ such that total consistency holds. Burke and Rosenblatt [14] and Rosenblatt [15] also considered lumpability conditions for reversible Markov chains as a special case. Hachigian and Rosenblatt [16] extended these results for reversible Markov chains by showing that if the reduced Markov chain satisfies the Chapman-Kolmogorov equations then the reduced Markov chain is indeed Markovian. Hachigian [17] discussed particular Markov chains that have clustered chains that satisfy the Chapman-Kolmogorov equations yet are not Markovian. Rosenblatt [18] provided a somewhat simpler summary and discussion of these topics solely with respect to a discrete time and space Markov chain. Gilbert [19], Dharmadhikari [20,21,22], Heller [23], Carlyle [24], Leysieffer [25], and Guardabassi and Rinaldi [26] all studied related issues. Serfozo [27] considered the analogous problem for semi-Markov processes. Abdel-Moneim and Leysieffer [28] examined lumpability conditions for non-irreducible Markov chains. Sumita and Rieders [29] reinterpreted lumpability in terms of first passage times. Barr and Thomas [30] showed that the eigenvalues of a clustered Markov chain $\hat{P}$ that is lumpable with respect to some partition are also eigenvalues of the original Markov chain. It is also shown that a left eigenvector $y$ of the original chain $P$ corresponding to such an eigenvalue $\delta$ satisfies, for a lumpable Markov chain,

$$yV\hat{P} = \delta\, yV, \qquad (4)$$

so that $yV$ (when nonzero) is a left eigenvector of $\hat{P}$.

This relation is exploited to assist in constructing candidate clusters of states. Courtois [31] noted that the conditions for perfect aggregation are rarely met. An approximate test of Markov chain lumpability is provided by Thomas and Barr [32], who implemented a chi-squared statistic to test whether a given set of clusters implied "near" lumpability at some level of significance. Abdel-Moneim and Leysieffer [33] and Rubino and Sericola [34] provided and analyzed conditions for determining weakly lumpable Markov chains, i.e. Markov chains that are lumpable for only a certain set of possible initial probability vectors. Takahashi [35] employed weak D-Markov chain theory for aggregated Markov chains that do not satisfy the Markov property. A weak D-Markov chain is one for which each of the row vectors of an original Markov chain belongs to a finite set of stochastic vectors. Weak D-Markov chain theory can be applied to an aggregated Markov chain if appropriate bounds for transition probabilities can be formed. Upper and lower bounds for steady state probabilities, absorbing probabilities, and several other characteristic quantities for queuing models were presented. Finally, there also exists literature describing situations in which only aggregate states are observed. Research in this area includes the works of Taylor [36], Shaikh [37,38], Fredkin et al. [39], and Fredkin and Rice [40]. In this work, properties of the underlying Markov process with respect to the observed aggregate process are often the matter of interest. Our emphasis in this paper is on processes that are completely observable at the unaggregated level.

2.2. Nearly completely decomposable Markov chains

Approaches to solving large complex systems often consider the study of subsystems and the relationships between subsystems. Such systems usually exhibit stronger and/or more frequent interactions within subsystems than between subsystems, allowing for a natural decomposition.
Courtois [31] presented a good qualitative elucidation of these general concepts. The notion of near decomposability may be applied to Markov processes as proposed by Simon and Ando [41]. Consider the following completely decomposable ergodic Markov chain of order $n$ consisting of $N$ subsystems:

$$P^* = \operatorname{diag}(P_1^*, P_2^*, \ldots, P_N^*). \qquad (5)$$

The principal stochastic matrices $P_I^*$, $I = 1, \ldots, N$, are of order $n(I)$, respectively, with

$$\sum_{I=1}^{N} n(I) = n. \qquad (6)$$

The steady state probabilities for $P^*$ may be found by solving for the steady state probabilities of the individual $P_I^*$. A Nearly Completely Decomposable (NCD) Markov chain $P$ is defined as

$$P = P^* + \varepsilon C, \qquad (7)$$

where $\varepsilon$ is a small positive real number defined as the degree of coupling and $C$ is an arbitrary matrix defined such that $P$ remains stochastic. Let $i_I$ denote the $i$th row of the $I$th principal stochastic matrix and $j_J$ the $j$th column of the $J$th principal stochastic matrix. With $P$ and $P^*$ stochastic, a necessary and sufficient condition on $C$ is

$$\sum_{J=1}^{N} \sum_{j_J} c_{i_I j_J} = 0 \qquad (8)$$

for all $i_I$. The matrix $C$ thus consists of elements that are at most equal to unity in absolute value and has row summations equal to zero. The value for $\varepsilon$ is chosen such that

$$\varepsilon = \max_{i_I} \sum_{J \neq I} \sum_{j_J} p_{i_I j_J}, \qquad (9)$$

where $p_{i_I j_J}$ is the probability of a transition from state $i$ of submatrix $I$ to state $j$ of submatrix $J$.

Simon and Ando [41] showed that the eigensystems of NCD matrices are well suited for approximation by functions of the eigensystems of the individual $P_I^*$. Meyer [42] applied the concept of stochastic complementation to generalize theory for an NCD system in a unified, simple manner. Courtois [43,44] and Courtois and Louchard [45] extended the above results by showing that estimates of the left eigenvalues of an original NCD stochastic matrix may be approximated to within $\varepsilon^2$ by using left eigenvalues from an aggregate matrix. The procedures implemented are a form of optimal combination and optimal dissection. For each stochastic subsystem $P_I^*$, the left eigenvector associated with the eigenvalue of unity (the steady state probabilities) is used as weights for defining the elements of the aggregate matrix. The steady state probabilities for the $N \times N$ aggregate matrix are then dissected to the dimensionality of the original stochastic matrix. The weighting scheme for dissection is identical to that used for combination. The resolution of the linear system of order $n$ is thus reduced to the resolution of a set of $N$ small $n(I) \times n(I)$ independent subsystems and an $N \times N$ macro-variable system. The difference between each steady state probability of the original stochastic matrix and the corresponding aggregate estimate is shown to be within $\varepsilon^2$ using these procedures. Courtois and Semal [46,47] extended these results to include any nonnegative matrices, and Courtois [48] presented a discussion of several issues in error minimization for NCD stochastic models. Haviv and Ritov [49] and Haviv [50] suggested approximating the stationary distribution of an NCD Markov chain via a probabilistic approach, rather than solving subsystems, by employing the probability of leaving a subset of states.
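The combination/dissection scheme just described can be sketched in a few lines. The block structure, probability values, and function names below are illustrative assumptions of this sketch, not the authors' implementation; the point is the three steps: solve each block, build the N × N aggregate with the block stationary vectors as weights, and dissect with those same weights.

```python
import numpy as np

def stationary(P):
    """Stationary vector: left eigenvector of P for eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def ncd_approx(P, blocks):
    """Courtois-style combination/dissection for an NCD chain."""
    # Combination weights: stationary vector of each diagonal block,
    # after renormalizing its rows to be stochastic.
    weights = []
    for b in blocks:
        sub = P[np.ix_(b, b)]
        weights.append(stationary(sub / sub.sum(axis=1, keepdims=True)))
    # N x N aggregate matrix: block-to-block transition mass combined
    # with the subsystem stationary weights.
    N = len(blocks)
    P_agg = np.zeros((N, N))
    for I, bI in enumerate(blocks):
        for J, bJ in enumerate(blocks):
            P_agg[I, J] = weights[I] @ P[np.ix_(bI, bJ)].sum(axis=1)
    # Dissection: spread each aggregate probability over its block
    # using the same weights used for combination.
    xi = stationary(P_agg)
    approx = np.zeros(P.shape[0])
    for I, bI in enumerate(blocks):
        approx[bI] = xi[I] * weights[I]
    return approx

# Illustrative NCD chain: two 2-state blocks, degree of coupling
# epsilon = 0.01 (values invented for this sketch).
P_ncd = np.array([[0.5900, 0.4000, 0.0050, 0.0050],
                  [0.2950, 0.7000, 0.0025, 0.0025],
                  [0.0050, 0.0050, 0.4900, 0.5000],
                  [0.0025, 0.0025, 0.6000, 0.3950]])
approx = ncd_approx(P_ncd, [[0, 1], [2, 3]])
```

With the degree of coupling ε = 0.01 used here, the dissected estimate should agree with the exact stationary vector to roughly ε² accuracy, consistent with the bound cited above.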
Haviv and Korach [51] are currently performing research on the problem of determining whether a given Markov chain is nearly decomposable with respect to a given ε and finding the corresponding partition. Phillips and Kokotovic [52] and Coderch et al. [53,54] studied methods for constructing increasingly aggregated models for Markov processes corresponding to different time scales and combining these models to produce an asymptotic approximation of the original process. Delebecque and Quadrat [55], Delebecque [56], and Rohlicek and Willsky [57,58] addressed similar topics, including the ramifications of transience in the Markov process. Stewart [59,60] proposed a different method for computing error bounds for NCD Markov chains. His method involved several similarity transformations of the matrix P; determination of the matrix C, which may have an infinite number of possible forms, is not necessary for this method. Also included was a discussion of implementation issues for matrices of various sizes and forms, as well as a discussion of numerical problems such as poor convergence and matrices with very sensitive eigenvectors. Comparison of the results and complexity of the error bounds proposed by Courtois [43,44] with those proposed by Stewart [59,60] would be an interesting study. Markovian queuing models that exhibit near decomposability have received much attention. Balsamo [61] discussed conditions under which queuing networks with general service times possess the NCD property; these conditions are based upon the maximum value of the second largest eigenvalue of the P_I^*. Vantilborgh [62] investigated lumpability conditions with respect to large exponential queuing networks with nearly completely decomposable matrices. Improved error bounds for NCD Markov chains were obtained by Vantilborgh [63] with respect to a queuing model of a computer system.
Rather than considering the P_I^* in isolation, interactions between the states of each P_I^* and the remainder of the Markov chain are represented by a single fictitious state. Brandwajn [64] presented a unified framework for aggregation and decomposition in queuing networks. Conway and Georganas [65] developed decomposition and aggregation procedures for multiple class closed queuing networks by solving a series of single class networks. Green [66] studied Markovian queuing systems for auxiliary servers and Green [67] examined Markovian queuing systems for general-use and limited-use servers. Both of these applications exhibit near complete decomposability, but this property was not exploited in quite the same manner. In both cases it was shown that the original five-dimensional state space may be reduced to a state space of two dimensions and retain the Markov property. Approximations of the steady state vector, the mean delay, and the probability of an idle system for the original queuing system were derived from the two-dimensional queuing system.

2.3. Iterative aggregation/disaggregation for Markov chains

Iterative A/D (IAD) schemes have been developed and successfully implemented to solve for the equilibrium probabilities of Markov chains. These methods involve alternating between the calculation of a macroscopic eigensystem and several microscopic eigensystems, exchanging information between the macroscopic and microscopic levels. Successive improvements of the estimates for eigensystems at the microscopic level are sought until some prespecified convergence criterion is satisfied. For more general Markov chains for which the probabilities from one P_I^* to another are not small, the procedures suggested for NCD systems might not provide sufficiently accurate approximations and IAD routines are necessary. IAD schemes may also be employed for NCD Markov chains if the approximations in Section 2.2 are unsatisfactory. Dudkin, Rabinovich, and Vakhutinsky [68] reviewed general IAD methodology and presented IAD algorithms for solving systems of equations and extremum problems. Several IAD procedures have been developed to solve large-scale problems associated with Markov processes. Schweitzer [69] and Chatelin [70] surveyed the properties, behavior, and applications of an IAD algorithm, proposed by Takahashi [71], for finding stationary distributions of Markov chains. Koury et al. [72] combined A/D with point and block iterative techniques for determining equilibrium probabilities. Cao and Stewart [73] compared these methods and an iterative version of Vantilborgh [63] and noted that they have the same asymptotic rate of convergence. Sumita and Rieders [74] noted that the IAD algorithm of Takahashi [71] may be modified for lumpable Markov chains by eliminating the aggregation step. Schweitzer et al. [75] and Schweitzer and Kindle [76] developed and tested IAD algorithms for, respectively, discounted and undiscounted semi-Markov processes. Feinberg and Chiu [77] presented promising results for an IAD algorithm for Markov chains that have been decomposed such that, for any cluster of states, all transitions into the cluster enter through one particular state. Miranker and Pan [78] and Chatelin and Miranker [79] considered the acceleration of successive approximation methods by interjecting an occasional A/D step. Haviv [80] analyzed six variations of this approach and identified cases where they appeared promising. Haviv [81] analyzed a Rayleigh-Ritz refinement technique to accelerate the convergence of iterative procedures for computing equilibrium probabilities of an NCD stochastic matrix. Mandel and Sekerka [82] proved local convergence for a class of general IAD procedures. Bertsekas [83] and Bertsekas and Castanon [84] accelerated convergence in difficult problems involving multiple ergodic classes by adaptively changing the aggregate groups of states from iteration to iteration.
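One concrete IAD loop, in the spirit of the algorithms surveyed above but deliberately simplified, might look as follows. The smoothing step here is a single power iteration rather than the point or block iterative sweeps used in the literature, and the partition-handling details are assumptions of this sketch.

```python
import numpy as np

def iad_stationary(P, blocks, iters=200, tol=1e-14):
    """Simplified IAD scheme: alternate (a) solving a small coupling
    matrix built from the current estimate with (b) a disaggregation
    step followed by one power-iteration smoothing sweep."""
    n, N = P.shape[0], len(blocks)
    x = np.full(n, 1.0 / n)  # initial estimate of the stationary vector
    for _ in range(iters):
        # (a) Aggregation: combine with the current conditional
        # within-block probabilities as weights.
        A = np.zeros((N, N))
        for I, bI in enumerate(blocks):
            w = x[bI] / x[bI].sum()
            for J, bJ in enumerate(blocks):
                A[I, J] = w @ P[np.ix_(bI, bJ)].sum(axis=1)
        wA, vA = np.linalg.eig(A.T)
        xi = np.real(vA[:, np.argmax(np.real(wA))])
        xi = xi / xi.sum()
        # (b) Disaggregation: rescale each block to the new aggregate
        # mass, then smooth with one power step x <- xP.
        for I, bI in enumerate(blocks):
            x[bI] = xi[I] * x[bI] / x[bI].sum()
        x_new = x @ P
        if np.abs(x_new - x).max() < tol:
            x = x_new
            break
        x = x_new
    return x / x.sum()
```

The fixed point of the loop is the exact stationary vector: if x is already stationary, the coupling matrix reproduces the correct block masses and the power step leaves x unchanged. The aggregation step is what repairs the slow inter-block mode that plain power iteration handles poorly on NCD chains.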
Feinberg and Chiu [773 presented promising results for an IAD algorithm for Markov chains that have been decomposed such that for any cluster of states, all transitions into the cluster enter through one particular state. Miranker and Pan [ 781 and Chatelin and Miranker [ 791 considered the acceleration of successive approximation methods by interjecting an occasional A/D step. Haviv [SO] analyzed six variations of this approach and identified cases where they appeared promising. Haviv [ 811 analyzed a Rayleigh-Ritz refinement technique to accelerate the convergence of iterative procedures for computing equilibrium probabilities of an NCD stochastic matrix. Mandel and Sekerka [82] proved local convergence for a class of general IAD procedures. Bertsekas [ 831 and Bertsekas and Castanon [84] accelerated convergence in difficult problems involving multiple ergodic classes by adaptively changing the aggregate groups of states from iteration to iteration. 2.4. Other aggregation jdisaggregation

methodology

for Markov chains

Schweitzer [85] considered A/D in finite Markov chains with macroscopic relationships being of interest. The steady state vectors for the microscopic Markov chains were used in the method of combination to form macroscopic matrices that were not necessarily Markovian. Equilibrium probabilities for recurrent macrostates and the mean times until absorption for transient macrostates were considered. It was suggested that microstates within a macrostate should exhibit little variability for the most accurate macroscopic results. Several researchers in this area were influenced by this approach of deemphasizing the role of exact aggregation by employing heuristic procedures to define aggregate transition probabilities and holding times. Kim and Smith [86] developed a single pass A/D algorithm for mandatory set decomposable Markov chains. These Markov chains are such that the states can be partitioned and all transitions into a set of states must pass through some state contained in a fixed subset before exiting the set.


Whitt [87,88] presented a general procedure for constructing and analyzing approximations of dynamic programming models including Markov decision processes. Tight error bounds were obtained for the optimal return function by replacing the original state and action spaces by subsets. Gross et al. [89] derived steady-state probabilities for multi-echelon repairable item inventory systems modeled as a non-Jacksonian Markovian network. This method differs from the methodology presented for NCD systems in that the network is decomposed into several overlapping subnetworks and calculations iterate between subsystems until convergence is obtained.

3. DESIGN ISSUES FOR A/D TECHNIQUES FOR GENERAL MARKOV CHAINS

Most of the previous results for A/D techniques in Markov chains have centered around the NCD concept. For articles in the literature that do not assume any degree of decomposability, analogous techniques have been employed. Often overlooked, however, is the fact that additional issues deserve attention for A/D techniques in Markov chains that are not NCD. These issues include the determination of clustering algorithms, weighting procedures for the method of combination and the method of dissection, the entity to consider for clustering (e.g. rows or columns), and the level at which clustering should be terminated (i.e. the number of clusters). Weighting procedures for the method of combination and the method of dissection have been considered for NCD Markov chains. Steady state solutions for the subsystems defined by the clusters are typically suggested for use as the weighting vectors. There is no guarantee that these weights are optimal and we shall consider this in this paper. In this section we address these design issues for A/D in general Markov chains that do not possess the NCD property. The objective throughout shall be to approximate the equilibrium (or steady state) probabilities of the original Markov chain that is ergodic and thus the steady state probabilities exist. Simon and Ando [41] in their seminal paper on decomposability of stochastic systems mentioned that the NCD property may not exist for certain Markov chains. Problems such as generalized random walks for which transitions are only possible to a finite number of states adjacent to the current state typically may not possess any detectable degree of decomposability. General Markov chains with bands of elements along the main diagonal (band diagonal Markov chains) as well as the most general Markov chains with virtually all elements pij > 0 likewise may not possess a significant degree of near complete decomposability. 
Problems such as these may often arise in queuing (Bhat [90], Ross [91]), inventory (Bemelmans [92], Ross [91]) and manpower management (Bartholomew and Forbes [93]). The methods to be presented in this paper may also be applicable to problems that do possess the NCD property. For example, consider an NCD Markov chain P (with n = 1000 states) that contains four principal submatrices, P_I^*, I = 1, 2, 3, 4, with n(I) = 250, the number of states in submatrix I. If a software or hardware constraint requires that n < 200, and assuming that the individual P_I^* do not possess a significant degree of decomposability, then the procedures presented in this research may be useful for forming aggregate principal submatrices. Computational costs of clustering are highly variable. In general, the computational costs of clustering are minimal in comparison to typical analyses performed in operations research, such as solving systems of simultaneous equations. The clustering methodologies employed in this paper are attractive because they are relatively efficient. The complexity of the clustering problems is less than the complexity of many algorithms because there is no need to invert matrices. Only simple similarity or distance measures must be calculated and stored for typical cluster procedures. It is the in-core storage for the similarity matrix that is the limiting feature of hierarchical cluster analyses. For whatever entity is being clustered, e.g. rows of a Markov chain, the number of columns will not affect the dimensionality of the similarity matrix. The reader is referred to Anderberg [3] if more information about the computational aspects of clustering is desired. Several design issues for forming aggregate submatrices of general Markov chains shall now be presented and discussed. These design issues shall be considered with respect to best approximating the steady state vector of the original Markov chain.
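A band diagonal chain of the kind discussed above is easy to simulate. The generator below is purely illustrative (random row-normalized bands, like a generalized random walk); it does not reproduce the seven simulated matrix types of the experiments in Section 4.

```python
import numpy as np

def band_diagonal_chain(n, bandwidth, seed=None):
    """Random band diagonal transition matrix: from state i only the
    states within `bandwidth` positions of i are reachable, and each
    row is normalized to a probability vector."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - bandwidth), min(n, i + bandwidth + 1)
        row = rng.random(hi - lo)        # positive weights on the band
        P[i, lo:hi] = row / row.sum()    # normalize the row
    return P

P_band = band_diagonal_chain(100, bandwidth=3, seed=42)
```

Such a matrix has no block structure at all, so its degree of coupling in the sense of equation (9) is not small for any partition into contiguous blocks, which is exactly the setting of this section.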
The broadest categorization of these issues is aggregation analysis and its antithesis, disaggregation analysis. Aggregation analysis design issues considered below include (1) similarity measures for determining the proximity of the entities to be clustered, (2) the entities to cluster, e.g. rows or columns, (3) the clustering procedure to incorporate, i.e. the criterion to use for defining clusters, (4) the level of clustering, which defines the dimensionality of the clustered matrix, and (5) the method of combination, i.e. the weighting vectors to incorporate for defining clustered data. Because we are only interested in approximating the steady state vector of the original Markov chain, the only design issue to be considered for disaggregation analysis is the method of dissection, i.e. the weighting vectors to incorporate for defining an approximate solution to a disaggregated matrix from the solution of the aggregated matrix. The "reversal" of the other aggregation analysis issues need not be considered unless there is interest in approximating the left eigenvector of a matrix with dimensions smaller than those of the original Markov chain yet larger than the dimensions of the clustered problem.

3.1. Aggregation analysis design issues

3.1.1. Similarity measures.

For clustering data units a measure of the proximity of the individual data units must first be determined. This measure is usually expressed as a distance given by the family of Minkowski metrics. For clustering rows of a Markov chain, the family of Minkowski metrics between row $i$ and row $j$ shall be employed and is given by

$$d_{ij}(p) = \left( \sum_{l=1}^{n} |x_{il} - x_{jl}|^p \right)^{1/p}, \qquad (10)$$

where $l$ represents the column indices, $p$ the particular metric used, $x_{il}$ and $x_{jl}$ the $l$th data elements of rows $i$ and $j$ respectively, and $d_{ij}(p)$ the distance between row $i$ and row $j$. A similar form may be described for the columns of a Markov chain. A lower triangular matrix may be formed for storing the similarity measures between all possible pairwise combinations of states. The most commonly used metrics are the Euclidean metric obtained with $p = 2$ and the squared Euclidean metric. The "city block" metric with $p = 1$ is occasionally encountered. Other metrics are seldom of more than theoretical interest. Klastorin [94] has shown that the squared Euclidean metric performs best for classifying randomly generated multivariate data to an appropriate distribution for several multivariate populations. In this paper, the squared Euclidean distance shall be used throughout.

3.1.2. Cluster entities.

For any given set of data, the issue of what to cluster and the data to involve in making this decision must be resolved. Sometimes the entity to cluster is inherent to the nature of the desired results. For example, cluster analysis would be performed on columns of a linear programming problem if it is only desirable to reduce the number of decision variables; the number of rows in the clustered problem would remain intact. Likewise, clustering rows would be appropriate if it is only desirable to reduce the number of constraints; the number of columns in the clustered problem would remain intact. If a reduction in the number of both decision variables and constraints is desirable, cluster analysis would be performed upon both columns and rows. Whether cluster analysis should first be performed on rows or columns is another issue that must be resolved, since different clusters of constraints and decision variables may be formed by changing the order of clustering.
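Equation (10) and the lower triangular storage scheme translate directly into code. The sketch below (function names are ours, not the paper's) defaults to the squared Euclidean case p = 2 adopted for the experiments.

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Equation (10): Minkowski distance of order p between two rows."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def similarity_matrix(P, p=2, squared=True):
    """Lower triangular matrix of pairwise distances between the rows
    of P; with p=2 and squared=True these are squared Euclidean
    distances. Only entries below the diagonal are filled, matching
    the storage scheme described in the text."""
    n = P.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            d = minkowski_distance(P[i], P[j], p)
            D[i, j] = d * d if squared else d
    return D
```

Note that only n(n-1)/2 entries are computed and stored, which is the in-core storage limit for hierarchical clustering mentioned above.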
The Markov chain provides a somewhat unique setting for applying aggregation techniques. Aggregating states implies that both the rows and the columns of the clustered states be combined in order to retain a square stochastic matrix. An important issue is whether clustering procedures for determining that two states are in some respect similar should be implemented with respect to data from the rows, the columns, both rows and columns, or some variation of both rows and columns of the Markov chain. The general nature of Markov chains may help in determining the cluster entity. Because the rows of a Markov chain are required to sum to unity while the columns do not necessarily possess this property, the power of a clustering procedure may be greater for the columns than for the rows: the column information may exhibit more of the variation and/or difference in the distance measure for two states than the row information. Thus clustering procedures that utilize column information initially appear more attractive, because a greater expected distance measure may exist between the columns, and this increases the power of the clustering procedures to detect differences between the communication of any two states. An alternative to row or column clustering is to take the transpose of a column vector and


augment it to the respective row vector for each state; similarity measures may then be computed for these 2n-vectors. The advantage of this method is that the entire communication between two states is considered; the slight disadvantage is that calculating the similarity measures requires twice as much computation. Another possible scheme is to consider the row vector augmented by the transpose of the column vector less the intersectional elements of the states being considered. This allows similarity to be measured only with respect to communication with states other than the two states under consideration. The final approach to be considered here is identical to the one just described except that the similarity between the intersectional elements of the vectors under consideration, x_ii and x_jj as well as x_ij and x_ji, is also included in the similarity measure between states i and j. This allows for an alternative method of considering the behavior of the subsystem defined by the intersection of states i and j. Both of these latter two approaches would be less useful for matrices for which the intersectional elements are relatively small.

3.1.3. Clustering procedures. The procedure to use for defining clusters must also be determined. Such procedures include linkage methods, centroid methods, and error sum of squares or variance methods, and, with input data from the similarity matrix, they are used to find relationships among the entities. The distinguishing features of these procedures are (1) the criterion for determining the most similar pair of states and (2) the method used for updating the similarity matrix after merging two states. The most commonly used methods are agglomerative hierarchical clustering methods, which begin with each state in a separate cluster and proceed until all states are in one cluster; divisive methods begin with all states in one cluster and proceed until each state is in a separate cluster.
Because they are more frequently used and readily available, hierarchical clustering methods shall be the only methods further considered. An updated similarity matrix is needed at the beginning of each iteration of a hierarchical clustering procedure. An alternative procedure may be employed that forgoes the calculation of a new similarity matrix at each iteration: the next most similar pair of states from the current similarity matrix is clustered, and this step is repeated until all pairs of states have been clustered, at which point it becomes necessary to update the similarity matrix. The computational savings could be considerable. For example, consider a Markov chain with n = 100 states and define the level as the number of states in the clustered problem. The initial similarity matrix may be employed for levels 50-99, and the next similarity matrix for levels 25-49; a clustered matrix with 25 states may thus be obtained with only two similarity matrices, whereas the usual hierarchical clustering scheme would have required the formation of 75 different similarity matrices. The clusters formed may not yield disaggregated solutions that approximate the actual steady states as well as clustering procedures that update the similarity matrix at each iteration, but the computational reduction could be substantial.

3.1.4. Levels of clustering. Deciding upon the number of clusters to employ for forming a clustered matrix is a substantial practical problem. The number of clusters employed in the aggregate matrix shall be referred to as the level of clustering. The basic tradeoff in determining the level of clustering usually involves weighing the burden of computing the left eigenvector of an aggregate matrix against the quality of the solution obtained, i.e. how well the disaggregated solution approximates the steady states of the original Markov chain. The best level at which to terminate clustering is dependent upon the problem situation.
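The pairing shortcut described above might be sketched as follows (an illustrative greedy pass; the paper applies the idea to specific clustering criteria rather than this bare routine):

```python
def pair_once(d, n):
    """One pass of the pairing shortcut: scan a single similarity matrix in
    order of increasing distance and merge disjoint nearest pairs, with no
    matrix updates between merges. `d` maps (i, j), j < i, to the distance
    between states i and j; `n` is the number of states."""
    used, pairs = set(), []
    for (i, j) in sorted(d, key=d.get):
        if i not in used and j not in used:
            pairs.append((j, i))
            used.update((i, j))
    # with an odd number of states one singleton is left over
    pairs += [(k,) for k in range(n) if k not in used]
    return pairs

# Four states whose distances make 0-1 and 2-3 the natural pairs.
d = {(1, 0): 1.0, (2, 0): 5.0, (2, 1): 4.0,
     (3, 0): 6.0, (3, 1): 7.0, (3, 2): 2.0}
clusters = pair_once(d, 4)
```

One such pass halves the number of entities, which is why a 100-state chain needs only two similarity matrices to reach level 25.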
For example, consider a Markov chain with n = 1000 states, and suppose for the moment that we are not concerned with computation time. If a software and/or hardware constraint only allows for the solution of Markov chains with n' or fewer states, n' < n, then we may desire to cluster until we have formed a reduced Markov chain with exactly n' states. Sometimes computation times are deemed unacceptable even though the particular problem is computationally tractable. The best methods to employ are then those that explicitly consider the tradeoff between reduced computation time and the closeness of aggregate steady state probability estimates to those of the original Markov chain. An acceptable tradeoff is dependent upon a decision maker's qualitative judgment for a given problem situation.
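The left-eigenvector computation whose burden motivates this tradeoff can be sketched, for instance, by power iteration (a generic solver for illustration, not the method used in the paper):

```python
def left_eigenvector(P, tol=1e-12, max_iter=100000):
    """Equilibrium (left) eigenvector of a stochastic matrix by power
    iteration. Each iteration costs on the order of n^2 multiplications,
    which is the burden that clustering to a lower level reduces."""
    n = len(P)
    y = [1.0 / n] * n
    for _ in range(max_iter):
        y_new = [sum(y[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(y_new, y)) < tol:
            return y_new
        y = y_new
    return y

P = [[0.9, 0.1],
     [0.2, 0.8]]
y = left_eigenvector(P)    # equilibrium probabilities of a 2-state chain
```

For this 2-state chain the balance equations give y = (2/3, 1/3), which the iteration recovers.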


3.1.5. Methods of combination. After a clustering procedure has been implemented, a method of deriving data for the clustered problem must be chosen. This method is called the method of combination and may entail combination by dominance or fixed-weight combination. Only fixed-weight combination shall be considered here because it has been shown to be non-dominated by combination-by-dominance methods for linear systems. Fixed-weight combination involves deriving a vector for each cluster that weights the relative importance of the original data in forming the clustered data. Determination of the method of combination is another important design issue for aggregation in Markov chains. Courtois [43, 44] proposed weighting clustered states by their respective equilibrium probabilities for the standardized subsystem defined by the intersection of the clustered elements in the original Markov chain. This method of combination is used to derive an error bound on the optimal equilibrium probabilities from the solution of a clustered problem. However, there is no guarantee that this method will produce the best approximation of the equilibrium probabilities. The nature of clustering procedures is to find and group data that are most similar. Since similar states would be expected to have approximately equal equilibrium probabilities, using equal weighting for all members of a cluster as the fixed-weight method of combination is a logical alternative to using subsystem solutions. In Section 4 we show that using equal weighting as the fixed-weight method of combination may yield as good an approximation of the equilibrium probabilities of band diagonal Markov chains as using weights derived from solving the subsystems.
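Fixed-weight combination with equal weights can be sketched as follows (the function name and example chain are illustrative; this is one of the two combination options discussed, not the paper's code):

```python
def aggregate_equal_weights(P, clusters):
    """Fixed-weight combination with equal weights: the aggregate transition
    probability from cluster I to cluster J is the average, over the members
    of I, of their total probability of moving into J."""
    return [[sum(sum(P[i][j] for j in cj) for i in ci) / len(ci)
             for cj in clusters]
            for ci in clusters]

P = [[0.6, 0.3, 0.1, 0.0],
     [0.2, 0.5, 0.2, 0.1],
     [0.0, 0.3, 0.4, 0.3],
     [0.0, 0.1, 0.4, 0.5]]
A = aggregate_equal_weights(P, [[0, 1], [2, 3]])
```

Because each weighting vector sums to one within its cluster, the aggregate matrix A is again row stochastic; subsystem-solution weighting would simply replace the equal weights 1/|I| by the standardized subsystem equilibrium probabilities.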

3.2. Disaggregation analysis design issues

We shall now consider disaggregation analysis design issues in Markov chains. Disaggregation analysis is the "reversal" of aggregation analysis and involves deriving approximate solutions to the original disaggregate problem from solutions of the corresponding aggregate problem. Because we are only exploring A/D with respect to recovering approximate steady state solutions for the original Markov chain, the design issues for disaggregation are naturally fewer than those for clustering. In particular, procedures for "reversing" the cluster analysis facets of (1) similarity measures, (2) cluster entities, (3) clustering procedures, and (4) levels of clustering are not needed unless interest also lies in approximating the left eigenvector of a matrix whose dimensionality is greater than that of the clustered matrix yet smaller than that of the original Markov chain. Since this is typically not the case, the only disaggregation analysis design issue that needs consideration is the method of dissection. The weights of relative importance of the states in a cluster are reflected by the method of dissection. The choice of the dissection method involves determining the weighting vectors to use for approximating the steady state vector of the original Markov chain from the left eigenvector derived from a clustered matrix. The only dissection method considered will be fixed-weight dissection. Equal weightings and weightings derived from subsystems defined by the states in a cluster may likewise be employed for the dissection method. However, it is not necessarily true that the weighting vectors used for the dissection method should be the same as the weighting vectors used for the method of combination in order to best approximate the steady state vector for the original Markov chain.

4. SIMULATION TESTS AND RESULTS FOR BAND DIAGONAL MARKOV CHAINS

4.1. Test problem generation

The nature of the randomly generated test problems and the general linear model used for testing the aggregation techniques shall now be described (Rogers [95]). Several variations of Markov chains, all with bands of elements about the main diagonal and all having 100 states, were considered. The beta distribution, B(α, β), defined over the open interval (0, 1), was used for the random number generator because the density function may assume widely different shapes for various values of the two parameters of the distribution, α and β. This approach allowed for the formation of several differently-shaped densities for the state transition probabilities of the states of the Markov chains, and thus different types of general band diagonal Markov chains could be examined.


Three basic densities were considered for generating state probabilities in the test problems. The B(1, 1) density is simply the uniform density over (0, 1) with its mean, median, and mode at 0.5. The B(2, 2) density is symmetric and mesokurtic (i.e. has a moderate peak) with its mean, median, and mode also at 0.5. The B(1.5851, 1) density is left skewed with the mean at approximately 0.61, the median at approximately 0.65, the mode at approximately 0.71, and 2/3 of the area under the density to the right of 0.5. Test problems with the symmetric densities B(1, 1) and B(2, 2) were generated in the following manner. Twenty-one random numbers were generated from the selected density and placed in the matrix shown in Fig. 2, after row standardization to unity, for defining a singly-stochastic matrix. Three different schemes were employed for placing the randomly generated elements in the matrices. All three schemes begin by placing the largest randomly generated number on the main diagonal, i.e. cell (i, i). The next largest element is placed in either cell (i, i+1) or cell (i, i-1). If the cell selected is (i, i+1), then the remaining numbers are distributed, in descending order, to cells (i, i-1), (i, i+2), (i, i-2), ..., (i, i+10), (i, i-10). If the cell selected is (i, i-1), then the remaining numbers are distributed to cells (i, i+1), (i, i-2), (i, i+2), ..., (i, i-10), (i, i+10). When nonexistent cells are encountered the random number is discarded. This procedure creates matrices that have the largest elements down the main diagonal (diagonal dominance) and is representative of many real-world applications. The three schemes differ as follows:

(1) B(α, β) PAIRS-Place the next largest randomly generated element in cell (i, i+1) for odd numbered states and in cell (i, i-1) for even numbered states. This causes a natural "pairing" of states, i.e. sets of states (1, 2), (3, 4), (5, 6), ... should tend to be closer in distance than sets of states (2, 3), (4, 5), (6, 7), ...

(2) B(α, β) SLIGHT SKEW-Place the next largest randomly generated element in cell (i, i+1) for all states. This creates a very slight skew to the right because the 2nd, 4th, 6th, ..., 20th largest numbers are to the right of the main diagonal and the 3rd, 5th, 7th, ..., 21st largest numbers are to the left of the main diagonal.

(3) B(α, β) RANDOM-Select an additional random number from a uniform density on (0, 1) and place the next largest randomly generated number in cell (i, i+1) if this additional random number is in (0.5, 1) and in cell (i, i-1) otherwise. This is the most general matrix formed because there is no natural pairing or expected skew.
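A sketch of the PAIRS construction, under our reading of the placement rules (the indexing details, parameter names, and seed are ours, not the paper's):

```python
import random

def band_matrix_pairs(n, band=10, alpha=1.0, beta=1.0, seed=0):
    """One B(alpha, beta) PAIRS band diagonal chain: 2*band + 1 beta variates
    per row, the largest on the diagonal, the next largest to the right for
    odd numbered states (states are numbered 1..n) and to the left for even
    numbered states, then alternating outward; out-of-range cells are
    discarded and each row is standardized to sum to one."""
    rng = random.Random(seed)
    P = [[0.0] * n for _ in range(n)]
    for row in range(n):
        state = row + 1                       # 1-based state number
        x = sorted((rng.betavariate(alpha, beta) for _ in range(2 * band + 1)),
                   reverse=True)              # descending order
        sign = 1 if state % 2 == 1 else -1
        offsets = [0]
        for k in range(1, band + 1):
            offsets += [sign * k, -sign * k]
        for off, v in zip(offsets, x):
            if 0 <= row + off < n:            # nonexistent cells: discard
                P[row][row + off] = v
        s = sum(P[row])
        P[row] = [v / s for v in P[row]]
    return P

P = band_matrix_pairs(20)
```

Row standardization after placement preserves the ordering within a row, so the diagonal element remains the largest in each row and the matrix is diagonally dominant in the sense described above.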

[Fig. 2 appears here: a 25 x 25 excerpt marking the positions of the positive elements in each row, a band of 21 elements extending 10 columns to either side of the main diagonal.]

Fig. 2. Distribution of positive elements for randomly generated B(1, 1) and B(2, 2) Markov chains.


[Fig. 3 appears here: a 25 x 25 excerpt marking the positions of the positive elements in each row, a band of 21 elements extending 6 columns to the left and 14 columns to the right of the main diagonal.]

Fig. 3. Distribution of positive elements for randomly generated B(1.5851, 1) Markov chains.

Test problems generated using the skewed density B(1.5851, 1) were utilized to fill the 100 state matrices of the form shown in Fig. 3. Twenty-one randomly generated numbers were placed in descending order in cells (i, i), (i, i+1), (i, i+2), (i, i-1), (i, i+3), (i, i+4), (i, i-2), ..., (i, i-6), (i, i+13), and (i, i+14), respectively. When nonexistent cells were encountered, the number was discarded. Note that, because of the selected band width and method of distributing the random numbers, a good estimate of the percentage of communication from state i to other states is 70% to the right and 30% to the left for states 7-86, since 70% of the numbers are placed to the right of the main diagonal.

4.2. The experimental design
The four basic cluster procedures considered were Ward's method, the Mean Within Group Squared Deviation (MWGSD) method, the Within Group Sum of Squares (WGSS) method, and the Centroid method. Ward's method clusters at each stage such that the minimum increase in the total within group error sum of squares is achieved. The MWGSD method clusters at each stage to minimize the average contribution to the error sum of squares for each member of the new cluster; this method typically produces clusters of approximately equal variance. The WGSS method merges clusters at each stage such that the error sum of squares in the new cluster is minimal; the growth of relatively large clusters is typically less frequent with this procedure than for other methods. Finally, the Centroid method merges, at each stage, the clusters that have the most similar mean vectors (centroids). Many other basic clustering procedures, both hierarchical and nonhierarchical, could have been considered, for example, variations of linkage methods. These four basic clustering procedures were chosen for the following reasons. Ward's method was chosen because it has often been shown to yield the minimum possible error sum of squares over all possible sets of clusters, even though it is hierarchical in nature and finding the true minimum error sum of squares is not guaranteed. Furthermore, it was selected because of the results of Klastorin [91], where it was shown that Ward's method best detected membership to a known multivariate population among three hierarchical clustering procedures, one nonhierarchical clustering procedure, and the P-median model-based procedure. The MWGSD method and the WGSS method were chosen because of their similarity to Ward's method, in order to compare different variations of the error sum of squares criterion. The Centroid method was selected because it is often appealing to use the mean as the basic summary statistic.
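Ward's criterion can be sketched with a naive agglomerative loop (an O(n^3) illustration of the merge rule, not the implementation used in the study):

```python
def ward_cluster(X, target):
    """Agglomerative clustering under Ward's criterion: repeatedly merge the
    two clusters whose union gives the minimum increase in the total
    within-group error sum of squares, until `target` clusters remain.
    X holds one data vector per state."""
    def mean(c):
        return [sum(X[i][k] for i in c) / len(c) for k in range(len(X[0]))]

    clusters = [[i] for i in range(len(X))]
    while len(clusters) > target:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ma, mb = mean(clusters[a]), mean(clusters[b])
                na, nb = len(clusters[a]), len(clusters[b])
                # Ward: SSE increase from merging clusters a and b
                inc = na * nb / (na + nb) * sum((u - v) ** 2
                                                for u, v in zip(ma, mb))
                if best is None or inc < best[0]:
                    best = (inc, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Two well-separated groups of "rows" should be recovered at level 2.
X = [[0.0, 1.0], [0.1, 0.9], [0.9, 0.1], [1.0, 0.0]]
groups = ward_cluster(X, 2)
```

The MWGSD and WGSS variants differ only in the merge criterion inside the double loop, and the Centroid method would instead minimize the distance between the cluster means themselves.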
The scheme described in the previous section for using the first similarity matrix for clustering


all elements into clusters containing two members, the next similarity matrix for clustering all elements into clusters of four members, and so forth, is also considered for each of the four basic clustering procedures described above. These procedures shall be referred to as Ward's Pairs, MWGSD Pairs, WGSS Pairs, and Centroid Pairs, respectively. The final method considered shall be called Dummy Pairs. This method does not require a similarity matrix and merely consists of clustering pairs of elements in ascending order until all elements are in a cluster; then all clusters consisting of two elements are clustered into groups of four elements in ascending order, and so on until all states are in one cluster. The Dummy Pairs method was included to provide a "benchmark" with which to compare the best of the eight clustering procedures. The aggregation/disaggregation techniques summarized in Table 1 were all applied to five replications each of the randomly generated problems. The following Performance Measure (PM) was calculated for all analyses and used as the dependent variable. Define Y*(i) as the equilibrium probability for state i and Ŷ(i) as an approximation for Y*(i) derived from A/D. PM is defined as

PM = { Σ_{i=1}^n [Y*(i) - Ŷ(i)]^2 } / { Σ_{i=1}^n [Y*(i)]^2 }.    (11)
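The full A/D pipeline and the performance measure can be sketched end to end (equal-weight combination and dissection on a hypothetical 4-state chain; all names and the example chain are ours):

```python
def steady_state(P, tol=1e-13):
    """Equilibrium probabilities of a stochastic matrix by power iteration
    (a generic solver, used here only to obtain Y* and the aggregate vector)."""
    n = len(P)
    y = [1.0 / n] * n
    while True:
        y_new = [sum(y[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(y_new, y)) < tol:
            return y_new
        y = y_new

def pm(y_star, y_hat):
    """Performance measure (11): normalized sum of squared deviations of the
    disaggregated estimate from the true equilibrium probabilities."""
    return (sum((a - b) ** 2 for a, b in zip(y_star, y_hat))
            / sum(a ** 2 for a in y_star))

# End-to-end run: equal-weight combination into 2 clusters, solve the
# aggregate, equal-weight dissection back to 4 states. This P is symmetric,
# so its equilibrium distribution is uniform and the equal-weight A/D
# recovers it exactly (PM = 0); perturbing P would give PM > 0.
P = [[0.7, 0.2, 0.1, 0.0],
     [0.2, 0.6, 0.1, 0.1],
     [0.1, 0.1, 0.6, 0.2],
     [0.0, 0.1, 0.2, 0.7]]
clusters = [[0, 1], [2, 3]]          # contiguous clusters, in state order
A = [[sum(sum(P[i][j] for j in cj) for i in ci) / len(ci) for cj in clusters]
     for ci in clusters]
agg = steady_state(A)
y_hat = [agg[k] / len(c) for k, c in enumerate(clusters) for _ in c]
score = pm(steady_state(P), y_hat)
```

Here the dissection step simply spreads each aggregate probability equally over the members of its cluster; subsystem-solution dissection would use unequal weights, and, as noted above, the dissection weights need not match the combination weights.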

PM is a standardized variance measure; comparison of PM values when different procedures are applied provides an indication of how much the estimated steady state probabilities differ from the actual steady state probabilities. An unbalanced factorial experimental design was performed because the Ward's, MWGSD, and WGSS methods are not applicable to the fourth cluster entity, as the sums of squares are not appropriately defined during intermediate stages of these hierarchical clustering procedures. Thus a general linear model to test for differences among means of PM for the different factors was employed. The basic general linear model for the analyses is:

PM = M + CO + PR + CL + DC + LE + CO*PR + CO*CL + CO*DC + CO*LE + PR*CL + PR*DC + PR*LE + CL*DC + CL*LE + DC*LE + E    (12)

for

M-the population grand mean;
CO-the method of combination;

Table 1. The design issues considered for this simulation study, with abbreviations and codes

Method of Combination (CO) (2 options):
  (0) Equal Weighting
  (1) Subsystem Solution Weighting

Cluster Entity (CL) (5 options):
  (1) Rows
  (2) Columns
  (3) Rows and Columns
  (4) Rows and Columns less Intersection
  (5) Rows and Columns with Subsystem Cross Interaction

Cluster Procedure (PR) (9 options):
  (1) Ward's Method
  (2) Mean Within Group Squared Deviation (MWGSD)
  (3) Within Group Sum of Squares (WGSS)
  (4) Centroid Method
  (5) Ward's Pairs
  (6) MWGSD Pairs
  (7) WGSS Pairs
  (8) Centroid Pairs
  (9) Dummy Pairs

Cluster Level (LE) (9 options): 90, 80, 70, 60, 50, 40, 30, 20, and 10

Method of Dissection (DC) (2 options):
  (0) Equal Weighting
  (1) Subsystem Solution Weighting


PR-the cluster procedure;
CL-the cluster entity;
DC-the method of dissection;
LE-the cluster level;
E-the error term, the random score component that cannot be accounted for by the condition to which PM is exposed.

The terms such as PR*CL are two-way interaction terms. The above model, excluding all terms containing LE, was also employed to examine differences among means of PM at each level. Five repetitions of each test were performed for the repeated measures design to strengthen the results. For the discussion to follow of Tables 2-9 of results for the general linear model in (12), the R² value is the proportion of the variation in PM explained by the independent terms. The fractions provided in the columns for the independent variables and the interaction terms are their significance levels. An independent variable is referred to as statistically significant at the classical levels of 0.05 or 0.01 if the relationship between the dependent variable and the independent variable has only a 0.05 or 0.01 probability of being a random occurrence. The numbers below these fractions are the codes for the statistically equivalent best levels of the independent variables according to Duncan's multiple range test, which compares three or more treatments; the null hypothesis is that the means are all equal and the alternative hypothesis is that they are not all equal.

4.3. Results

4.3.1. Discussion. A summary of the results of all tests is provided in Table 2. Probably the easiest result to predict was that LE was consistently the most significant variable, and in each case the value of PM was much greater for more-clustered problems. Tables 3-9 give results for each matrix type at each of the nine cluster levels, and we will begin by examining these outcomes. The results for the matrices were very similar regardless of the different parameter values, and thus the Pairs, Slight Skew, and Random matrices will be presented in turn. Quite often, significant interaction effects were seemingly encountered. Each time this occurred, the interactions were examined to see whether their significance was truly due to an interaction of the two factors rather than the strength of the individual main effects. This was done by graphing the significantly interacting factors for each general linear model. Only very infrequently and

Table 2. Significance levels of the factors and interactions with treatments belonging to the statistically equivalent best group for all distribution types

[The body of Table 2 was scrambled in extraction. For each of the seven matrix types (B(1,1) Pairs, B(2,2) Pairs, B(1,1) Slight Skew, B(2,2) Slight Skew, B(1,1) Random, B(2,2) Random, and B(1.5851,1) Skew) it reports the model R² and the significance levels of CO, PR, CL, DC, LE and their two-way interactions, with the codes of the statistically equivalent best treatment group beneath each significant factor. The R² values that can be matched reliably run from .63 for B(1.5851,1) Skew to .89 for B(2,2) Random; the remaining cell-by-cell entries could not be recovered.]

* - .05 significance level; # - .01 significance level.

Table 3. Significance levels of the factors and interactions with treatments belonging to the statistically equivalent best group for B(1,1) Pairs

Level  R²   CO      PR      CL      DC      CO*PR   CO*CL   CO*DC   PR*CL   PR*DC   CL*DC
 90   .99  .1747   .9672   .6204   .0138*  .9993   .0316*  .2764   .8614   .9840   .1284
 80   .99  .0902   .8564   .5330   .0192*  .2608   .1024   .5772   .0203*  .9833   .3051
 70   .99  .2275   .6020   .7447   .0250*  .0012#  .0014#  .1429   .9440   .8174   .4376
 60   .98  .5481   .0066#  .4184   .0426*  .0001#  .0017#  .1037   .0154*  .8178   .8346
 50   .97  .7182   .0001#  .1483   .0190*  .0034#  .0060#  .0423*  .0001#  .9987   .6395
 40   .98  .2408   .0026#  .0766   .0088#  .4812   .2910   .0778   .0001#  .8035   .6563
 30   .98  .0972   .0003#  .0020#  .0040#  .2272   .7431   .2107   .0001#  .2942   .0001#
 20   .97  .4267   .1994   .0001#  .0001#  .0513   .8447   .0307*  .0001#  .2463   .0015#
 10   .98  .1269   .0001#  .3766   .0001#  .0110*  .1728   .0012#  .0213*  .0001#  .0001#

* - .05 significance level; # - .01 significance level. [The codes for the statistically equivalent best treatment groups printed beneath individual entries did not survive extraction and are omitted.]

[A further table in the same format follows; its caption and lower rows are truncated in this excerpt. The rows that survive are reproduced below; the final entry of the level-60 row and everything beneath it are lost.]

Level  R²   CO      PR      CL      DC      CO*PR   CO*CL   CO*DC   PR*CL   PR*DC   CL*DC
 90   .99  .0535   .3031   .8801   .0274*  .0080#  .0055#  .4724   .2061   .9975   .8314
 80   .99  .0463*  .0001#  .2074   .0053#  .0355*  .2437   .5863   .0001#  .8949   .2835
 70   .99  .0240*  .2143   .0957   .0020#  .0146*  .1780   .4192   .1887   .1130   .1881
 60   .98  .0860   .5141   .0069#  .0008#  .0491*  .0775   .2581   .8873   .2993   ...

* - .05 significance level; # - .01 significance level.