IEICE TRANS. INF. & SYST., VOL.E82–D, NO.12 DECEMBER 1999


PAPER

Design of Optimal Array Processors for Two-Step Division-Free Gaussian Elimination

Shietung PENG† and Stanislav G. SEDUKHIN†, Nonmembers

† The authors are with the University of Aizu, Aizu-Wakamatsu-shi, 965–8580 Japan. Manuscript received July 15, 1998; revised June 18, 1999.

SUMMARY The design of array processors for solving linear systems using the two-step division-free Gaussian elimination method is considered. The two-step method can be used to improve systems based on the one-step method in terms of numerical stability as well as high-precision requirements. In spite of the rather complicated computations needed at each iteration of the two-step method, we develop an innovative parallel algorithm whose data dependency graph meets the requirements of regularity and locality. We then derive two-dimensional array processors by adopting a systematic approach to investigate the set of all admissible solutions, and obtain the optimal array processors under linear time-space scheduling. The array processors are optimal in terms of the number of processing elements used.
key words: linear system, parallel algorithm, parallel architecture, systolic array processors

1. Introduction

In the area of high-performance computing, the design of division-free array processors has recently attracted research attention [5], [6], [8], [10]. The main reason is that the division unit in high-performance special-purpose processors is both time and space consuming. Besides, as far as numerical stability is concerned, the cumulative effect of round-off error in division operations also makes division-free algorithms much more appealing. The one-step division-free Gaussian elimination method was used in [8], [10] to develop optimal array processors. This method may drive the system to high-precision requirements and instability, as described in [1], [3]. The problem arises from the fact that the absolute values of the elements of the updated matrix increase rapidly, which eventually reduces the numerical stability of the algorithm for large input matrices. To increase numerical stability, Bareiss proposed a multistep division-free Gaussian elimination method and showed that the multi-step method gives better numerical stability than the one-step method [1]. Because the formulation of the multi-step method is irregular and complicated, it is a nontrivial task to design an efficient parallel algorithm and a corresponding array processor based on this method. In this paper, we design a parallel algorithm for the two-step method through proper partitioning and

re-indexing. In the design process, we show how to circumvent the irregularity of the original algorithm step-by-step. Then, highly parallel array processors based on this parallel algorithm are systematically designed and analyzed. Two 2-D array processors that are optimal in terms of the number of PEs are presented in this paper. The key features of these array processors are (1) they are division-free, and (2) their numerical stability is enhanced by the two-step method compared with counterparts based on the one-step method.

The rest of the paper is organized into five sections. Section 2 gives some background on the two-step division-free Gaussian elimination method. A new parallel algorithm based on the two-step method is presented in Sect. 3. In Sect. 4, the 3-D data dependency graph of the new algorithm and its analysis are given. In Sect. 5, two optimal 2-D array processors based on the new algorithm and a systematic approach [9] are described, and their performance is investigated. Finally, in the last section, concluding remarks and further research directions are discussed.

2. Two-Step Division-Free Gaussian Elimination

Let a generalized linear system of equations be given by $AX = B$, where $A = [a_{ij}]$, $1 \le i, j \le n$, $B = [b_{ij}]$, $1 \le i \le n$, $n+1 \le j \le m$, and $X = [x_{ij}]$, $1 \le i \le n$, $1 \le j \le m-n$. To solve $AX = B$, the matrix $A$ should be reduced to diagonal form, or to triangular form with subsequent back substitution. In general, systolization of an algorithm that reduces $A$ to diagonal form is more difficult than of one that reduces it to triangular form. In this paper, we consider the algorithm that reduces the matrix $A$ to diagonal form using the two-step Gaussian elimination method.

The simplest (one-step) division-free algorithm for diagonalization is given by the following recurrence formula:

$$a_{ij}^{(0)} = a_{ij}, \quad 1 \le i, j \le n; \qquad a_{ij}^{(0)} = b_{ij}, \quad 1 \le i \le n,\ n+1 \le j \le m;$$

$$a_{ij}^{(k)} = \begin{cases} a_{kj}^{(k-1)}, & \text{if } i = k;\\[4pt] \begin{vmatrix} a_{kk}^{(k-1)} & a_{kj}^{(k-1)}\\ a_{ik}^{(k-1)} & a_{ij}^{(k-1)} \end{vmatrix}, & \text{otherwise;} \end{cases} \qquad 1 \le k \le n,\ 1 \le i \le n,\ k \le j \le m;$$

$$a_{ii}^{(k)} = a_{kk}^{(k-1)} \cdot a_{ii}^{(k-1)}, \quad 1 \le k \le n,\ 1 \le i \le k-1, \tag{1}$$

where $|A|$ denotes the determinant of a matrix $A$. Notice that $a_{ij}^{(k)} = 0$ for $i, j < k$ and $i \ne j$. The advantage of this formula is the absence of division operations. Hence, the division-free algorithm avoids division round-off error and is thus more numerically stable than the classical Gaussian elimination algorithm. However, the one-step recurrence (1) suffers from a rapid increase of the absolute values $|a_{ij}^{(k)}|$ of the updated matrix, and eventually requires high-precision computation for large input matrices.
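To make the recurrence concrete, here is a minimal sketch (assuming NumPy; the function name and the small test system are illustrative, not from the paper) that diagonalizes the augmented matrix $[A \mid B]$ with Eq. (1); the only divisions occur in the final read-out of the solution:

```python
import numpy as np

def one_step_df(A, B):
    """Diagonalize [A | B] with the one-step division-free recurrence (1)."""
    M = np.hstack([np.asarray(A, float), np.asarray(B, float)])
    n = len(A)
    for k in range(n):                        # pivot row k (0-based)
        P = M.copy()
        for i in range(n):
            if i != k:                        # a_ij <- a_kk*a_ij - a_ik*a_kj
                P[i, :] = M[k, k] * M[i, :] - M[i, k] * M[k, :]
        M = P                                 # row k itself stays unchanged
    return M                                  # the A-part is now diagonal

A = np.array([[2., 1.], [1., 3.]])
b = np.array([[3.], [5.]])
M = one_step_df(A, b)
x = M[:, 2] / np.diag(M)                      # division only at the very end
assert np.allclose(A @ x, b.ravel())
```

Running this on larger random matrices makes the paper's point visible: the magnitude of the entries grows roughly multiplicatively with each step, which is exactly the high-precision problem the two-step method alleviates.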

To circumvent this problem, a multistep approach has been proposed by Bareiss [1]. The equations for the two-step division-free Gaussian elimination are given as follows:

$$a_{ij}^{(0)} = a_{ij}, \quad 1 \le i \le n,\ 1 \le j \le m;$$

$$a_{ij}^{(k)} = \begin{cases} \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)} & a_{k-1,j}^{(k-2)}\\ a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)} & a_{kj}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} & a_{ij}^{(k-2)} \end{vmatrix}, & \text{if } i \ne k,\ k-1;\\[14pt] a_{kj}^{(k-1)} = \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,j}^{(k-2)}\\ a_{k,k-1}^{(k-2)} & a_{kj}^{(k-2)} \end{vmatrix}, & \text{if } i = k;\\[14pt] \begin{vmatrix} a_{kk}^{(k-1)} & a_{kj}^{(k-1)}\\ a_{k-1,k}^{(k-2)} & a_{k-1,j}^{(k-2)} \end{vmatrix}, & \text{if } i = k-1; \end{cases}$$
$$k = 2, 4, \ldots, 2\lfloor n/2 \rfloor,\quad 1 \le i \le n,\quad k-1 \le j \le m;$$

$$a_{ii}^{(k)} = a_{ii}^{(k-2)} \cdot \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)}\\ a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)} \end{vmatrix}, \quad k = 2, 4, \ldots, 2\lfloor n/2 \rfloor,\quad 1 \le i \le k-2;$$

$$a_{ij}^{(n)} = \begin{vmatrix} a_{nn}^{(n-1)} & a_{nj}^{(n-1)}\\ a_{in}^{(n-1)} & a_{ij}^{(n-1)} \end{vmatrix} \quad \text{if } n \text{ is odd},\quad 1 \le i \le n-1,\ n+1 \le j \le m. \tag{2}$$

It is instructive to obtain Eq. (2) directly from (1) by applying the one-step equation twice and simplifying the result. The simplified result can be expressed as follows:

$$a_{ij}^{(k)} = a_{k-1,k-1}^{(k-2)} \cdot \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)} & a_{k-1,j}^{(k-2)}\\ a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)} & a_{kj}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} & a_{ij}^{(k-2)} \end{vmatrix}. \tag{3}$$

Disregarding the factor $a_{k-1,k-1}^{(k-2)}$ in Eq. (3) yields (2). Therefore, the coefficients $a_{ij}^{(k)}$ of (2) are smaller by a factor $a_{k-1,k-1}^{(k-2)}$ and, in addition, can be obtained more efficiently than those of (1), because some terms of $a_{ij}^{(k-2)}$ cancel and need not be calculated.

The effect of the above transformation on the matrix $[a_{ij}^{(k-2)}]$ can be described as follows. The first equation of (2) for $a_{ij}^{(k)}$, when formally extended to $j = k-1$ and $j = k$, reduces the elements $a_{ij}^{(k-2)}$, $i > k$ and $i < k-1$, to zero for these two columns, and leaves the elements $a_{k-1,j}^{(k-2)}$ and $a_{kj}^{(k-2)}$ unchanged. It remains to transform $a_{k,k-1}^{(k-2)}$ and $a_{k-1,k}^{(k-2)}$ to zero. However, once the elements $a_{ij}^{(k-2)}$, $i > k$, have been determined, the element $a_{k,k-1}^{(k-2)}$ can be transformed to zero by (1) when updating $a_{kj}^{(k-1)}$, as shown in (2). Similarly, the element $a_{k-1,k}^{(k-2)}$ can be transformed to zero by (1) when updating $a_{k-1,j}^{(k)}$. Notice that if $n$ is odd, then additional computations are needed to transform the elements $a_{i,n}^{(n-1)}$, $i \ne n$, to zero by (1) for $a_{ij}^{(n)}$, $i \ne n$, $n \le j \le m$.
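As a quick numeric sanity check of the relation between (1), (2), and (3) (a sketch with hypothetical values, not from the paper), one can verify that applying the one-step rule twice reproduces the $3 \times 3$ determinant of (2) up to the disregarded factor $a_{k-1,k-1}^{(k-2)}$:

```python
import numpy as np

# T holds rows (k-1, k, i) and columns (k-1, k, j) of the working matrix.
rng = np.random.default_rng(0)
T = rng.integers(-5, 6, (3, 3)).astype(float)

def one_step(M, piv):
    """One application of recurrence (1) with pivot row `piv`."""
    P = M.copy()
    for i in range(3):
        if i != piv:
            P[i, :] = M[piv, piv] * M[i, :] - M[i, piv] * M[piv, :]
    return P

U = one_step(one_step(T, 0), 1)       # pivot on row k-1, then on row k
# the twice-updated entry is a_{k-1,k-1} times the 3x3 determinant, Eq. (3)
assert np.isclose(U[2, 2], T[0, 0] * np.linalg.det(T))
```

The identity holds for any values of `T`, which is precisely why the factor can be dropped without changing the elimination pattern.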

3. A Parallel Algorithm

Regularity is one of the most important factors in designing parallel algorithms and architectures. In [3], it was shown that irregularity makes parallelism hard to expose. In [7], Kung stated the importance of regularity and modularity for designing systolic arrays. In this paper, we define explicitly the regularity of a parallel algorithm. Intuitively, a regular parallel algorithm is one that can be used to construct array processors systematically. The definition of the regularity of an indexed algorithm is based on the data dependency graph (DDG) of the algorithm, defined below (see also [8]). The index space of an indexed algorithm is the set of all index points, $p = (i, j, k)^T$ in the 3-D case, where each index point is associated with a single computation. A data dependency vector (DDV) is the difference between the index point where a variable is used as an input variable and the index point where that variable is generated as an output variable. The DDG of an indexed algorithm is a directed graph with index points as vertices and DDVs as edges. A parallel algorithm is regular if its DDG satisfies the following two conditions: 1) there do not exist any opposite edges in any dimension; 2) there does not exist any cycle. The main concern in the design of a regular parallel algorithm using the two-step algorithm is the computation for $i = k$ or $k-1$ in Eq. (2), since it involves quite complicated computation patterns. We will show how to reformulate this part of the computation step-by-step by partitioning and re-indexing techniques.

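Before diving into the reformulation, note that the two regularity conditions can be checked mechanically once the DDVs are known: if some schedule vector $\lambda$ satisfies $\lambda \cdot d > 0$ for every DDV $d$, then neither a cycle nor a pair of opposite edges can exist. The sketch below assumes NumPy, and the DDVs in it are illustrative placeholders, not the actual DDVs of Algorithm 2SDF:

```python
import numpy as np

def admits_linear_schedule(ddvs, lam=(1, 1, 1)):
    """True if lam . d > 0 for every DDV d (hence the DDG is regular)."""
    lam = np.asarray(lam)
    return all(lam @ np.asarray(d) > 0 for d in ddvs)

print(admits_linear_schedule([(1, 0, 0), (0, 1, 0), (0, 0, 1)]))  # True
print(admits_linear_schedule([(0, 1, 0), (0, -1, 0)]))            # False: opposite edges
```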

First, to simplify the computation and to reduce the number of multiplications, three indexed variables $c_{kk}^{(k-2)}$, $c_{i,k-1}^{(k-2)}$, $c_{ik}^{(k-2)}$ are introduced to hold the intermediate results of the computations in the two-step method. We assume that an indexed variable $v_{ij}^{k}$ is held at index point $p = (i, j, k)^T$.

$$c_{kk}^{(k-2)} = \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)}\\ a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)} \end{vmatrix}, \qquad c_{i,k-1}^{(k-2)} = \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} \end{vmatrix}, \qquad c_{ik}^{(k-2)} = \begin{vmatrix} a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} \end{vmatrix},$$

$$a_{ij}^{(k)} = a_{ij}^{(k-2)} \cdot c_{kk}^{(k-2)} - a_{kj}^{(k-2)} \cdot c_{i,k-1}^{(k-2)} + a_{k-1,j}^{(k-2)} \cdot c_{ik}^{(k-2)},$$
$$1 \le i \le n,\quad i \ne k-1,\quad i \ne k,\quad k+1 \le j \le m. \tag{4}$$
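A short numeric check (hypothetical values) confirms that the split in Eq. (4) is just the cofactor expansion of the $3 \times 3$ determinant of Eq. (2) along its last column:

```python
import numpy as np

# T holds the 3x3 block of Eq. (2): rows (k-1, k, i), columns (k-1, k, j).
rng = np.random.default_rng(1)
T = rng.integers(-5, 6, (3, 3)).astype(float)

c_kk  = np.linalg.det(T[:2, :2])          # rows k-1 and k of columns k-1, k
c_ik1 = np.linalg.det(T[[0, 2]][:, :2])   # rows k-1 and i: this is c_{i,k-1}
c_ik  = np.linalg.det(T[[1, 2]][:, :2])   # rows k   and i: this is c_{ik}
a_new = T[2, 2] * c_kk - T[1, 2] * c_ik1 + T[0, 2] * c_ik
assert np.isclose(a_new, np.linalg.det(T))
```

Sharing the three $c$ values across a whole row of $j$ indices is what cuts the multiplication count; after the two-layer partitioning described next, each index point needs at most two multiplications and one addition/subtraction.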

Consider the computations of $c_{i,k-1}^{(k-2)}$ and $c_{ik}^{(k-2)}$ in (4) for $k = 2$. Since the data item $a_{i1}^{(0)}$ is needed for computing $c_{i2}^{(0)}$, and the data item $a_{i2}^{(0)}$ is needed for computing $c_{i1}^{(0)}$, there are bidirectional edges between the index points $p_i = (i, 1, 0)^T$ and $q_i = (i, 2, 0)^T$, $1 \le i \le n$, $i \ne 1$, $i \ne 2$. That is, there are cycles in the corresponding DDG. To solve this problem, we divide the computations into two layers, and put the variables $c_{ik}^{(k-2)}$ and $c_{i,k-1}^{(k-2)}$ in the points $(i, k, k-1)^T$ and $(i, k, k)^T$, respectively, to eliminate the cycles. Therefore, the first step of our algorithm is to re-index the variables $c_{kk}^{(k-2)}$, $c_{i,k-1}^{(k-2)}$, and $c_{ik}^{(k-2)}$ into $c_{kk}^{(k-1)}$, $c_{ik}^{(k)}$, and $c_{ik}^{(k-1)}$, respectively, and to partition the computations at each iteration into two layers, as follows, to guarantee that the corresponding DDG is acyclic. Notice that, after this partitioning, the computation at each index point $p = (i, j, k)^T$ of the algorithm involves up to two multiplications and one addition/subtraction.

$$c_{kk}^{(k-1)} = \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)}\\ a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)} \end{vmatrix}, \qquad c_{ik}^{(k-1)} = \begin{vmatrix} a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} \end{vmatrix},$$
$$a_{ij}^{(k-1)} = a_{ij}^{(k-2)} \cdot c_{kk}^{(k-1)} + a_{k-1,j}^{(k-2)} \cdot c_{ik}^{(k-1)},$$
$$c_{ik}^{(k)} = \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} \end{vmatrix},$$
$$a_{ij}^{(k)} = a_{ij}^{(k-1)} - a_{kj}^{(k-1)} \cdot c_{ik}^{(k)}, \quad \text{where } a_{kj}^{(k-1)} = a_{kj}^{(k-2)};$$
$$1 \le i \le n,\quad i \ne k-1,\quad i \ne k,\quad k+1 \le j \le m. \tag{5}$$
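The same kind of numeric check (hypothetical values again) confirms that the two temporal layers of Eq. (5) compute exactly the value produced by the one-shot update of Eq. (4), so the re-indexing changes only where and when the partial results live:

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.integers(-5, 6, (3, 3)).astype(float)  # rows (k-1, k, i), cols (k-1, k, j)

c_kk_1 = np.linalg.det(T[:2, :2])              # c_kk^(k-1)
c_ik_1 = np.linalg.det(T[[1, 2]][:, :2])       # c_ik^(k-1), the former c_ik
c_ik_2 = np.linalg.det(T[[0, 2]][:, :2])       # c_ik^(k),   the former c_{i,k-1}

a_layer1 = T[2, 2] * c_kk_1 + T[0, 2] * c_ik_1     # computed at layer k-1
a_layer2 = a_layer1 - T[1, 2] * c_ik_2             # finished at layer k
assert np.isclose(a_layer2, np.linalg.det(T))      # same value as Eq. (4)
```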

Next, to eliminate the opposite edges in the $i$ and $j$ directions in the DDG of Eq. (5), we adopt a technique similar to that used in [8]. We shift the $(k-1)$th and $k$th rows to the $(n+k-1)$th and $(n+k)$th rows in the $(k-1)$th and $k$th layers, respectively. The resulting algorithm is shown in Eq. (6).

$$c_{kk}^{(k-1)} = \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)}\\ a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)} \end{vmatrix},$$
$$c_{ik}^{(k-1)} = \begin{vmatrix} a_{k,k-1}^{(k-2)} & a_{kk}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} \end{vmatrix}, \quad k+1 \le i \le n+k-2;$$
$$a_{ij}^{(k-1)} = a_{ij}^{(k-2)} \cdot c_{kk}^{(k-1)} + a_{k-1,j}^{(k-2)} \cdot c_{ik}^{(k-1)}, \quad k+1 \le i \le n+k-2,\ j > k;$$
$$c_{ik}^{(k)} = \begin{vmatrix} a_{k-1,k-1}^{(k-2)} & a_{k-1,k}^{(k-2)}\\ a_{i,k-1}^{(k-2)} & a_{ik}^{(k-2)} \end{vmatrix}, \quad k+1 \le i \le n+k-2;$$
$$a_{ij}^{(k)} = a_{ij}^{(k-1)} - a_{kj}^{(k-1)} \cdot c_{ik}^{(k)}, \quad \text{where } a_{kj}^{(k-1)} = a_{kj}^{(k-2)}. \tag{6}$$

Then we properly distribute the other one-step computations in Eq. (2) into the $(k-1)$th and/or $k$th layers. If $n$ is odd, then $2\lfloor n/2 \rfloor = n-1$, and in this case the $n$th layer is constructed using the one-step algorithm. Finally, to place the diagonal elements $a_{ii}^{(n)}$ next to the elements $a_{i,n+1}^{(n)}$ for effective data transmission and computation at the last step of the algorithm, we rearrange the initial data to reserve the $(n+1)$th column as free space, and shift the $(k-1)$th and $k$th diagonal elements into this column in the $(k-1)$th and $k$th layers, respectively. After diagonalization of the matrix $A$, one more step (a division step) is used to get the solution of the generalized linear system. A complete description of the regular two-step division-free (2SDF) algorithm for solving generalized linear systems $AX = B$ is shown below.

Algorithm 2SDF
begin
  /* Initialization */
  forall $1 \le i \le n$, $1 \le j \le n$ do
    $a_{ij}^{(0)} \leftarrow a_{ij}$;
  /* Reserve the (n+1)th column as free space */
  forall $1 \le i \le n$, $n+2 \le j \le m+1$ do
    $a_{ij}^{(0)} \leftarrow a_{i,j-1}$;
  /* Internal computations */
  for $k = 2, 4, \ldots, 2\lfloor n/2 \rfloor$ do
  begin
    /* Computation at the (k-1)th layer */
    $c_{kk}^{(k-1)} \leftarrow a_{k-1,k-1}^{(k-2)} \cdot a_{kk}^{(k-2)} - a_{k,k-1}^{(k-2)} \cdot a_{k-1,k}^{(k-2)}$;  /* type (a) */
    forall $k+1 \le i \le n+k-2$ do
      $c_{ik}^{(k-1)} \leftarrow a_{k,k-1}^{(k-2)} \cdot a_{ik}^{(k-2)} - a_{i,k-1}^{(k-2)} \cdot a_{kk}^{(k-2)}$;  /* type (b) */
    $a_{n+k-1,n+1}^{(k-1)} \leftarrow a_{k-1,k-1}^{(k-2)}$;
    forall $k-1 \le i \le n+k-2$, $j = k-1$ or $k$ do
      $a_{ij}^{(k-1)} \leftarrow a_{ij}^{(k-2)}$;
    forall $k+1 \le i \le n+k-2$, $k+1 \le j \le m+1$, $j \ne n+1$ do
      $a_{ij}^{(k-1)} \leftarrow a_{ij}^{(k-2)} \cdot c_{kk}^{(k-1)} + a_{k-1,j}^{(k-2)} \cdot c_{ik}^{(k-1)}$;  /* type (c) */
    forall $k+1 \le j \le m+1$, $j \ne n+1$ do
    begin
      $a_{kj}^{(k-1)} \leftarrow a_{k-1,k-1}^{(k-1)} \cdot a_{kj}^{(k-1)} - a_{k,k-1}^{(k-1)} \cdot a_{k-1,j}^{(k-1)}$;  /* type (d) */
      $a_{n+k-1,j}^{(k-1)} \leftarrow a_{kk}^{(k-2)} \cdot a_{k-1,j}^{(k-2)}$;  /* type (e) */
    end
    /* Computation at the kth layer */
    forall $k+1 \le i \le n+k-2$ do
      $c_{ik}^{(k)} \leftarrow a_{k-1,k-1}^{(k-1)} \cdot a_{ik}^{(k-1)} - a_{k-1,k}^{(k-1)} \cdot a_{i,k-1}^{(k-1)}$;  /* type (f) */
    $a_{n+k,n+1}^{(k)} \leftarrow c_{kk}^{(k-1)}$;
    forall $k+1 \le j \le m+1$, $j \ne n+1$ do
      $a_{n+k,j}^{(k)} \leftarrow a_{kj}^{(k)}$;
    forall $k+1 \le i \le n+k-2$, $k+1 \le j \le m+1$, $j \ne n+1$ do
      $a_{ij}^{(k)} \leftarrow a_{ij}^{(k-1)} - c_{ik}^{(k)} \cdot a_{kj}^{(k-1)}$;  /* type (g) */
    forall $n+1 \le i \le n+k-2$ do
      /* Computation for $a_{ii}^{(k)} = c_{kk}^{(k-1)} \cdot a_{ii}^{(k-1)}$ */
      $a_{i,n+1}^{(k)} \leftarrow c_{kk}^{(k-1)} \cdot a_{i,n+1}^{(k-1)}$;  /* type (h) */
    forall $k \le j \le m+1$, $j \ne n+1$ do
      $a_{n+k-1,j}^{(k)} \leftarrow a_{ij}^{(k-1)} - a_{k,j}^{(k-2)} \cdot a_{k-1,k}^{(k-2)}$;  /* type (i) */
  end
  /* Computation at the nth layer when n is an odd integer */
  if $n$ is odd then
    forall $n < i \le 2n-1$, $n+1 < j \le m+1$ do
    begin
      $a_{ij}^{(n)} \leftarrow a_{nn}^{(n-1)} \cdot a_{ij}^{(n-1)} - a_{nj}^{(n-1)} \cdot a_{in}^{(n-1)}$;
      /* Computation for $a_{ii}^{(n)} = a_{nn}^{(n-1)} \cdot a_{ii}^{(n-1)}$ */
      $a_{i,n+1}^{(n)} \leftarrow a_{nn}^{(n-1)} \cdot a_{i,n+1}^{(n-1)}$;
    end
  /* Output computations */
  forall $n+1 \le i \le 2n$, $n+2 \le j \le m+1$ do
    $x_{i-n,j-n-1}^{(n+1)} \leftarrow a_{i,j}^{(n)} / a_{i,n+1}^{(n)}$;
end
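For readers who want to experiment with the underlying mathematics, the following sketch solves $AX = B$ directly from the recurrences of Eq. (2). It is a plain sequential rendering only: the row shifts, the reserved column, and the layer structure of Algorithm 2SDF are not modeled, no pivoting is performed (so the leading minors of $A$ are assumed nonzero), and the function name and test data are hypothetical:

```python
import numpy as np

def two_step_df_solve(A, B):
    """Solve AX = B via the two-step division-free recurrences of Eq. (2)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n = A.shape[0]
    M = np.hstack([A, B])                      # working matrix [A | B]
    for q in range(1, n, 2):                   # 0-based pivot pair (p, q) = rows k-1, k
        p = q - 1
        P = M.copy()
        for i in range(n):                     # 3x3-determinant update for i != p, q
            if i in (p, q):
                continue
            for j in range(M.shape[1]):
                P[i, j] = np.linalg.det(np.array(
                    [[M[p, p], M[p, q], M[p, j]],
                     [M[q, p], M[q, q], M[q, j]],
                     [M[i, p], M[i, q], M[i, j]]]))
        # one-step updates (1) that zero out a_{k,k-1} and a_{k-1,k}
        row_q = M[p, p] * M[q, :] - M[q, p] * M[p, :]    # a_kj^(k-1)
        row_p = row_q[q] * M[p, :] - M[p, q] * row_q     # a_{k-1,j}^(k)
        P[q, :], P[p, :] = row_q, row_p
        M = P
    if n % 2 == 1:                             # extra one-step stage for odd n
        P = M.copy()
        for i in range(n - 1):
            P[i, :] = M[n - 1, n - 1] * M[i, :] - M[i, n - 1] * M[n - 1, :]
        M = P
    return M[:, n:] / np.diag(M)[:, None]      # final division step yields X

# usage: a diagonally dominant A keeps all leading minors nonzero (no pivoting)
rng = np.random.default_rng(0)
n, m = 5, 7
A = rng.integers(-2, 3, (n, n)).astype(float) + 10.0 * np.eye(n)
B = rng.integers(-2, 3, (n, m - n)).astype(float)
X = two_step_df_solve(A, B)
assert np.allclose(A @ X, B)
```

On the $n = 5$, $m = 7$ case used later in Sect. 5, the only divisions executed are the $n(m-n)$ output divisions in the last line, matching the single division step of the algorithm.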

4. Data Dependency Graph of Algorithm 2SDF

In this section, we derive a localized DDG of Algorithm 2SDF and then analyze it. First, we need a strategy of data pipelining for shared variables so that the computations in Algorithm 2SDF involve only local data transfers. The strategy is straightforward, and the resulting localized DDGs are shown in Figs. 1 and 2 for the $(k-1)$th and $k$th layers, respectively. If $n$ is an odd integer, then one extra layer (the $n$th layer) is needed; this extra layer of the localized DDG for odd $n$ is shown in Fig. 3. Finally, the $(n+1)$th layer of the DDG for the output computations is shown in Fig. 4.

[Fig. 1: The (k − 1)th layer of the DDG.]
[Fig. 2: The kth layer of the DDG.]

From Algorithm 2SDF, it is easy to see that nine different types of nodes are needed for the computations in the $(k-1)$th and $k$th layers.

The function and the location of each type of node in the DDG are listed as follows (see Algorithm 2SDF and Fig. 5).

• Type (a) node: computes $c_{kk}^{(k-1)}$ at index position $(k, k, k-1)^T$;
• Type (b) nodes: compute $c_{ik}^{(k-1)}$ at index positions $(i, k, k-1)^T$, $k+1 \le i \le n+k-2$;
• Type (c) nodes: compute $a_{ij}^{(k-1)}$ at index positions $(i, j, k-1)^T$, $k+1 \le i \le n+k-2$, $k+1 \le j \le m+1$, $j \ne n+1$;
• Type (d) nodes: compute $a_{kj}^{(k-1)}$ at index positions $(k, j, k-1)^T$, $k+1 \le j \le m+1$, $j \ne n+1$;
• Type (e) nodes: compute $a_{n+k-1,j}^{(k-1)}$ at index positions $(n+k-1, j, k-1)^T$, $k+1 \le j \le m+1$, $j \ne n+1$;
• Type (f) nodes: compute $c_{ik}^{(k)}$ at index positions $(i, k, k)^T$, $k+1 \le i \le n+k-2$;
• Type (g) nodes: compute $a_{ij}^{(k)}$ at index positions $(i, j, k)^T$, $k+1 \le i \le n+k-2$, $k+1 \le j \le m+1$, $j \ne n+1$;
• Type (h) nodes: compute $a_{i,n+1}^{(k)}$ at index positions $(i, n+1, k)^T$, $k+1 \le i \le n+k-2$;
• Type (i) nodes: compute $a_{n+k-1,j}^{(k)}$ at index positions $(n+k-1, j, k)^T$, $k+1 \le j \le m+1$, $j \ne n+1$.

Notice that "empty" nodes are included in the DDG for proper local transmission of shared data.

[Fig. 3: The nth layer of the DDG if n is odd.]
[Fig. 4: The (n + 1)th layer of the DDG for the output computation.]
[Fig. 5: The nine types of nodes at the (k − 1)th and kth layers of the DDG.]

Let the index space of the algorithm be P. Then

P can be expressed as $P_{in} \cup P_{int} \cup P_{out}$, where $P_{in}$, $P_{int}$, $P_{out}$ are the index sets of the input, internal, and output computations, respectively. From Algorithm 2SDF and the localized DDG, we have

$$P_{in} = \{(i, j, 0)^T \mid 1 \le i \le n,\ 1 \le j \le m+1,\ j \ne n+1\} \subseteq \mathbb{Z}^2 \times \{0\};$$

$$P_{int} = \begin{cases} \bigcup_{k'} (P_{k-1} \cup P_k), & 1 \le k' \le n/2,\ k = 2k', \ \text{if } n \text{ is even;}\\[6pt] \bigcup_{k'} (P_{k-1} \cup P_k) \cup P_{odd}, & 1 \le k' \le \lfloor n/2 \rfloor,\ k = 2k', \ \text{otherwise,} \end{cases}$$

where $P_{k-1} = \{(i, j, k-1)^T \mid k-1 \le i \le n+k-1,\ k-1 \le j \le m+1\}$, $P_k = \{(i, j, k)^T \mid k \le i \le n+k,\ k-1 \le j \le m+1\}$, and $P_{odd} = \{(i, j, n)^T \mid n \le i \le 2n,\ n \le j \le m+1\}$;

$$P_{out} = \{(i, j, n+1)^T \mid n+1 \le i \le 2n,\ n+1 \le j \le m+1\} \subseteq \mathbb{Z}^2 \times \{n+1\}.$$

Next, we show that the number of multiplications in Algorithm 2SDF is $N_\times = 3n^2(2m-n)/4 + O(n^2 + mn)$. It is easy to see that the number of multiplications in Algorithm 2SDF is dominated by the computations of types (c) and (g); that is, the number of multiplications in the computations of all other types is $O(n^2 + mn)$. The computations of types (c) and (g) have the same number of iterations for each fixed $k$, namely $(n-2)(m-k)$, with three multiplications in total per iteration. Therefore, the total number of multiplications for the computations of types (c) and (g) is

$$\sum_{l=1}^{n/2} 3(n-2)(m-2l) = 3(n-2)nm/2 - 6(n-2)\sum_{l=1}^{n/2} l = 3n^2 m/2 - 3n^3/4 + O(mn + n^2).$$

This completes the proof.

The longest path in the DDG is $\rho = p_{min} \rightarrow p_{max}$, where $p_{min} = (1, 1, 1)^T$ and $p_{max} = (2n, m+1, n+1)^T$. Therefore, we have $|\rho| = 3n + m - 1$, where $|\rho|$, the length of the path $\rho$, is the "Manhattan" distance between $p_{min}$ and $p_{max}$.

A timing (step) function $step(p): P_{int} \rightarrow \mathbb{Z}^+$, which assigns a computational time step to each index point $p \in P_{int}$, is defined as follows:

$$step(p) = i + j + k - 3.$$

The step function can also be specified in the linear form $step(p) = \lambda \cdot p + \gamma$, where $\lambda = (1, 1, 1)$ and $\gamma = -3$. This function defines a set of hyperplanes orthogonal to the schedule vector $\lambda$ on the index space of the algorithm. Equal values of the timing function are shown by dashed lines in Figs. 1–4. The minimal computation time of the algorithm is $T(m, n) = T_{min}(DDG) = step(p_{max}) = 3n + m - 1$, assuming $step(p_{min}) = 0$. From the discussion above, we have $T(m, n) = |\rho|$.

The allocation function $place(p): \mathbb{Z}^3 \rightarrow \mathbb{Z}^2$ is defined in the linear form

$$place(p) = \Lambda_\eta \cdot p,$$

where $\Lambda_\eta$ is a $(2 \times 3)$ matrix of the linear transformation corresponding to a projection vector $\eta \in \ker \Lambda_\eta$. Notice that to obtain the correct input/output data flows in 2-D array processors, we have to find new positions of the input/output matrix elements in the 3-D index space, i.e., redefine the $P_{in}$ and $P_{out}$ domains [9].
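A small sketch of this linear time-space mapping (assuming NumPy; $\lambda$ and $\gamma$ are from the text, while the concrete $\Lambda_\eta$ below is one standard choice whose kernel is spanned by $\eta_1 = (0, 1, 0)^T$, used here only for illustration):

```python
import numpy as np

lam, gamma = np.array([1, 1, 1]), -3

def step(p):
    return lam @ np.asarray(p) + gamma     # step(p) = i + j + k - 3

Lambda_eta = np.array([[1, 0, 0],
                       [0, 0, 1]])         # place(p) = (i, k): j is projected out

eta1 = np.array([0, 1, 0])
assert lam @ eta1 != 0                     # admissible projection vector
assert not np.any(Lambda_eta @ eta1)       # eta1 spans ker(Lambda_eta)

n, m = 5, 7
p_min, p_max = (1, 1, 1), (2 * n, m + 1, n + 1)
assert step(p_min) == 0
assert step(p_max) == 3 * n + m - 1        # T(m, n) = 3n + m - 1
```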

5. Design of Optimal Array Processors

Many 2-D array processors can be derived by projecting the localized 3-D DDG along different admissible projection vectors. A projection vector $\eta$ is admissible if and only if $\lambda \cdot \eta \ne 0$. This condition guarantees that each PE executes at most one computation (one index point in the DDG) at any given time step. In order to find an optimal design of a 2-D array processor from the given localized DDG, we need to select from the space of all admissible solutions based on some integrated criteria, such as the number of PEs, the data pipelining period, the computation time, the array topology, the number of I/O ports, etc. In this paper, for the given algorithm (and its DDG), we say that an array processor is optimal if (1) it uses the minimum number of PEs among all array processors obtained by admissible projections, and (2) its total computation time equals the length of the longest path in the DDG ($3n + m - 1$ in this case).

It can be shown that, for the localized DDG we adopted, there are 13 admissible projections out of 17 possible ones [9]. After obtaining and testing all admissible array processors, it can be shown that the two solutions generated by projecting the localized DDG along the $i$ axis and the $j$ axis have the minimal number of PEs. Each of the two array processors has the minimal number of PEs in a certain range of $m$ and $n$; we discuss this in the rest of the section.

The 2-D array processor S(0,1,0), generated by mapping the localized DDG along the direction $\eta_1 = (0, 1, 0)^T$, is shown in Fig. 6 for the case of $n = 5$ and $m = 7$. This array processor is an array of PEs and programmable register-latches (PRLs).

[Fig. 6: The optimal array processor S(0,1,0) and its PE types.]

The number of PEs in this array processor is

$$N_P^{(0,1,0)} = \begin{cases} n^2 + \dfrac{n}{2}, & \text{if } n \text{ is even;}\\[6pt] \left\lceil \dfrac{n}{2} \right\rceil (2n-1), & \text{otherwise,} \end{cases}$$

which is independent of $m$. It is easy to verify from the DDG and the selected projection that there are eight types of PEs and four types of PRLs. The eight types of PEs are depicted in Figs. 6(a)–(h). A functional description of each type of PE can be obtained easily from Figs. 5(a)–(i), as well as from the projection of Figs. 1–4 along the direction $\eta_1$. The rhombic array processor S(0,1,0) simulates the 3-D DDG without time extension, i.e., it solves a single task in time $T_{(0,1,0)}(1, n, m) = T_{min}(DDG) = 3n + m - 1$. The data pipelining period, defined as the time interval separating neighboring items of input or output data, is $\alpha = |\lambda \cdot \eta_1| = 1$, and the block pipelining period, defined as the time interval between the initiations of two successive task instances, is $\beta = m + 1$, i.e., the next task can be pushed into this array processor after $m + 1$ time steps. The number of I/O ports is $2n$.

Also, it is evident that for $l$ tasks, $T_{(0,1,0)}(l, n, m) = T_{(0,1,0)}(1, n, m) + (l-1)\beta = 3n + l(m+1) - 2$.

Another optimal design can be obtained by mapping the DDG along the direction $\eta_2 = (1, 0, 0)^T$. The corresponding array processor, its PE types, and the input/output data flows are shown in Fig. 7 for the case of $n = 5$ and $m = 7$. The number of PEs of this array is

$$N_P^{(1,0,0)} = \begin{cases} mn - \dfrac{n^2}{2} + m - \dfrac{n}{2}, & \text{if } n \text{ is even;}\\[6pt] \left\lceil \dfrac{n}{2} \right\rceil (2m-n), & \text{otherwise.} \end{cases}$$

There are eight types of PEs (see Figs. 7(a)–(h)) and only one type of register-latch. It is not difficult to show that

$$N_P^{(1,0,0)} \le N_P^{(0,1,0)} \quad \text{if} \quad m \le \min\left\{ \frac{3n-1}{2},\ \frac{3n^2+2n}{2(n+1)} \right\} = \frac{3n-1}{2}.$$

This means that the array processor S(1,0,0) is optimal in terms of the number of PEs if $m \le (3n-1)/2$.

[Fig. 7: The optimal array processor S(1,0,0) and its PE types.]

The functions carried out in each of the eight types of PEs are relatively simple, considering the complexity of Algorithm 2SDF. The computations carried out at each time step are no more than two multiplications and one addition/subtraction. The total computation time of this array is $T_{(1,0,0)}(1, n, m) = T_{min}(DDG) = 3n + m - 1$. The data pipelining period $\alpha$ equals $|\lambda \cdot \eta_2| = 1$. The block pipelining period $\beta$ equals $m$, i.e., for $l$ tasks, $T_{(1,0,0)}(l, n, m) = 3n + lm - 1$.
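The two latency formulas are easy to compare directly (a sketch; the function names are made up):

```python
def t_010(l, n, m):
    return 3 * n + l * (m + 1) - 2    # = (3n + m - 1) + (l - 1)(m + 1)

def t_100(l, n, m):
    return 3 * n + l * m - 1          # = (3n + m - 1) + (l - 1) * m

n, m = 5, 7
assert t_010(1, n, m) == t_100(1, n, m) == 3 * n + m - 1
assert t_100(10, n, m) < t_010(10, n, m)  # the smaller beta wins for many tasks
```

Both arrays meet the single-task lower bound $3n + m - 1$; they differ only in the block pipelining period $\beta$, which governs throughput when many task instances are streamed through the array.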

6. Concluding Remarks

The design of the new array processors in this paper shows that a numerical algorithm involving rather high computational irregularity can be arranged to perform pipelined computations in an elegant way. The array processors designed in this paper are a step toward designing efficient special-purpose array processors based on an irregular algorithm, namely the two-step division-free Gaussian elimination method. The techniques used in our design should be applicable to other numerical algorithms with rather complicated structure. For example, with minor modification, the techniques used here can also be used to develop array processors for two-step fraction-free (integer-preserving) solution of linear systems with integer coefficients. Another direction for further research is to consider asynchronous array models. Asynchronous arrays, e.g., wavefront array processors [7], are good alternatives for efficiency and flexibility in massively parallel processing. Finally, mapping the proposed parallel algorithm effectively onto existing massively parallel computers using proper partitioning techniques is also worth further research [2].

Acknowledgment

We would like to thank the reviewers for their constructive comments, which improved the quality of this paper.


References

[1] E.H. Bareiss, "Sylvester's identity and multistep integer-preserving Gaussian elimination," Mathematics of Computation, vol.22, pp.565–578, 1968.
[2] X. Chen and G.M. Megson, "A general methodology of partitioning and mapping for given regular arrays," IEEE Trans. Parallel & Distributed Systems, vol.6, no.10, pp.1100–1107, 1995.
[3] G.C. Fox, R.D. Williams, and P.C. Messina, Parallel Computing Works, Morgan Kaufmann Publishers, San Francisco, 1994.
[4] L. Fox, An Introduction to Numerical Linear Algebra, Clarendon Press, Oxford, 1964.
[5] E.N. Frantzeskakis and K.J.R. Liu, "A class of square root and division-free algorithms and architectures for QRD-based adaptive signal processing," IEEE Trans. Signal Processing, vol.42, no.9, pp.2455–2469, 1994.
[6] J. Götze and U. Schwiegelshohn, "A square root and division-free Givens rotation for solving least squares problems on systolic arrays," SIAM J. Sci. Statist. Comput., vol.12, pp.800–807, July 1991.
[7] S.Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[8] S. Peng and S.G. Sedukhin, "Array processors design for division-free linear system solving," The Computer Journal, vol.39, no.8, pp.713–722, 1996.
[9] S.G. Sedukhin and I.S. Sedukhin, "Systematic approach and software tool for systolic design," Lecture Notes in Computer Science, vol.854, pp.172–183, 1994.
[10] S.G. Sedukhin, "An algorithm and array processors for solving the systems of linear equations," Proc. Intern. Conf. Parallel and Distributed Processing Techniques and Applications (PDPTA '95), pp.307–316, Athens, Georgia, Nov. 1995.

Shietung Peng received his M.S. and Ph.D. degrees in Computer Science from the University of Texas in 1984 and 1986, respectively. He was with the University of Maryland from 1986 to 1993. He is currently with the University of Aizu. His research interests include parallel and distributed processing, parallel algorithms, parallel architectures, and high-performance computing. He is a member of the ACM and the IEEE Computer Society.

Stanislav G. Sedukhin is a professor of computer science at the University of Aizu. His research interests are in distributed and parallel high-performance computing and communications, parallel algorithms, and the architectural synthesis of application-specific VLSI processors. He received his Candidate of Sciences (Ph.D.) and Doctor of Physical & Mathematical Sciences (Dr.Sci.) degrees in Computer Science from the Russian (former USSR) Academy of Sciences in 1982 and 1993, respectively. Prof. Sedukhin is a member of the IEEE Computer Society, ACM, SIAM, and AMS.