

SAND92-2765
Unlimited Release
Printed March 1993
Distribution Category UC-405

An Efficient Parallel Algorithm for Matrix-Vector Multiplication

Bruce Hendrickson, Robert Leland and Steve Plimpton
Sandia National Laboratories
Albuquerque, NM 87185

Abstract. The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if we are to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.

Key words. matrix-vector multiplication, parallel computing, hypercube, conjugate gradient method

AMS(MOS) subject classification. 65Y05, 65F10

Abbreviated title. Parallel Matrix-Vector Multiplication

This work was supported by the Applied Mathematical Sciences program, U.S. Department of Energy, Office of Energy Research, and was performed at Sandia National Laboratories, operated for the U.S. Department of Energy under contract No. DE-AC04-76DP00789.

1. Introduction. The multiplication of a vector by a matrix is the kernel computation in many linear algebra algorithms, including, for example, the popular Krylov methods for solving linear and eigen systems. Recent improvements in such methods, coupled with the increasing use of massively parallel computers, require the development of efficient parallel algorithms for matrix-vector multiplication. This paper describes such an algorithm. Although the method works on all parallel architectures, it is particularly well suited to machines with hypercube interconnection topology, for example the Intel iPSC/860 and the nCUBE 2.

The algorithm described here was developed independently in connection with research on efficient methods of organizing parallel many-body calculations (see [5]). We subsequently learned that our algorithm is very similar in structure to a parallel matrix-vector multiplication algorithm described in [4]. We have, nevertheless, chosen to present our algorithm because it improves upon that in [4] in several ways. First, we specify how to overlap communication and computation and thereby reduce the overall run time. Second, we show how to map the blocks of the matrix to processors in a novel way which improves the performance of a critical communication operation on current hypercube architectures. And third, we consider the actual use of the algorithm within the iterative conjugate gradient solution method and show how in this context a small amount of redundant computation can be used to further reduce the communication requirements. By integrating these improvements we have been able to achieve significantly better performance on a well known benchmark than has been previously possible with a massively parallel machine.

A very attractive property of the new algorithm is that its communication operations are independent of the sparsity pattern of the matrix, making it applicable to all matrices. For an n x n matrix on p processors, the cost of the communication is O(n/√p + log(p)). However, many sparse matrices exhibit structure which allows for other algorithms with even lower communication requirements. Typically this structure arises from the physical problem being modeled by the matrix equation and manifests itself as the ability to reorder the rows and columns to obtain a nearly block-diagonal matrix, where the p diagonal blocks are about equally sized and the number of matrix elements not in the blocks is small. This structure can also be expressed in terms of the size of the separator of the graph describing the nonzero structure of the matrix. Our algorithm is clearly not optimal for such matrices, but there are many contexts where the matrix structure is not helpful (e.g. dense matrices, random matrices), or the effort required to identify the structure is too large to justify. It is these settings in which our algorithm is most appropriate and provides high performance.

This paper is structured as follows. In the next section we describe the algorithm and its communication primitives. In §3 we present refinements and improvements to the basic algorithm, and develop a performance model. In §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5.

2. A parallel matrix-vector multiplication algorithm. Iterative solution methods for linear and eigen systems are one of the mainstays of scientific computation. These methods involve repeated matrix-vector products or matvecs of the form yi = A xi, where the new iterate, xi+1, is generally some simple function of the product vector yi. To sustain the iteration on a parallel computer, it is necessary that xi+1 be distributed among processors in the same fashion as the previous iterate xi. Hence, a good matvec routine will return a yi with the same distribution as xi so that xi+1 can be constructed with a minimum of data movement. Our algorithm respects this distribution requirement.

We will simplify notation and consider the parallel matrix-vector product y = Ax where A is an n x n matrix and x and y are n-vectors. The number of processors in the parallel machine is denoted by p, and we assume for ease of exposition that n is evenly divisible by p and that p is an even power of 2. It is fairly straightforward to relax these restrictions.

Let A be decomposed into square blocks of size (n/√p) x (n/√p), each of which is assigned to one of the p processors, as illustrated by Fig. 1. We introduce the Greek subscripts α and β running from 0 to √p - 1 to index the row and column ordering of the blocks. The (α, β) block of A is denoted by A_αβ and owned by processor P_αβ. The input vector x and product vector y are also conceptually divided into √p pieces indexed by β and α respectively. Given this block decomposition, processor P_αβ must know x_β in order to compute its contribution to y_α. This contribution is a vector of length n/√p which we denote by z_αβ. Thus z_αβ = A_αβ x_β, and y_α = Σ_β z_αβ, where the sum is over all the processors sharing row block α of the matrix.

Fig. 1. Structure of matrix product y = Ax.

2.1. Communication primitives. Our algorithm requires three distinct patterns of communication. The first of these is an efficient method for summing elements of vectors owned by different processors, and is called a fold operation in [4]. We will use this operation to combine contributions to y owned by the processors that hold a block row of A. The fold operation is sketched in Fig. 2 for communication among processors with the same block row index α. Each processor begins the fold operation with a vector z_αβ of length n/√p. The operation requires log2(√p) stages, halving the length of the vectors involved at each stage. Within each stage, a processor first divides its vector z into two equal sized subvectors, z1 and z2, as denoted by (z1 | z2). One of these subvectors is sent to another processor, while the other processor sends back its contribution to the subvector which remained. The received subvector is summed element-by-element with the retained subvector to finish the stage. At the conclusion of the fold, each processor has a unique, length n/p portion of the fully summed vector. We denote this subvector with Greek superscripts, hence P_αβ owns portion y_α^β. The fold operation requires no redundant floating point operations, and the total number of values sent and received by each processor is n/√p - n/p.

    Processor P_αβ knows z_αβ ∈ R^(n/√p)
    z := z_αβ
    For i = 0, ..., log2(√p) - 1
        (z1 | z2) = z
        P_αβ' := P_αβ with i-th bit of β flipped
        If bit i of β is 1 Then
            Send z1 to processor P_αβ'
            Receive w2 from processor P_αβ'
            z := z2 + w2
        Else
            Send z2 to processor P_αβ'
            Receive w1 from processor P_αβ'
            z := z1 + w1
    y_α^β := z
    Processor P_αβ now owns y_α^β ∈ R^(n/p)

Fig. 2. The fold operation for processor P_αβ as part of block row α.
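The fold of Fig. 2 can be rendered with pairwise exchanges over a communicator spanning one block row. The following C sketch uses MPI, which postdates this report; the function name fold, the arguments, and the use of MPI_Sendrecv are illustrative assumptions rather than the authors' nCUBE/iPSC code, but the logic follows Fig. 2.

    /* Sketch of the fold (recursive halving) of Fig. 2 using MPI.
       Assumptions: row_comm spans the sqrt(p) processors of one block row,
       beta is this processor's rank in row_comm, len = n/sqrt(p) is a
       multiple of row_size, and z holds this processor's z_alpha_beta.
       On return the first len/row_size entries of z hold y_alpha^beta. */
    #include <mpi.h>
    #include <stdlib.h>

    void fold(double *z, int len, int beta, int row_size, MPI_Comm row_comm)
    {
        double *w = malloc((len / 2) * sizeof(double));
        for (int i = 0; (1 << i) < row_size; i++) {
            int half = len / 2;
            int partner = beta ^ (1 << i);            /* flip bit i of beta */
            /* keep the half selected by bit i of our own index, send the other */
            double *keep = ((beta >> i) & 1) ? z + half : z;
            double *send = ((beta >> i) & 1) ? z : z + half;
            MPI_Sendrecv(send, half, MPI_DOUBLE, partner, 0,
                         w,    half, MPI_DOUBLE, partner, 0,
                         row_comm, MPI_STATUS_IGNORE);
            for (int k = 0; k < half; k++)            /* element-wise sum */
                keep[k] += w[k];
            if (keep != z)                            /* slide kept half down */
                for (int k = 0; k < half; k++) z[k] = keep[k];
            len = half;
        }
        free(w);
    }

Each processor of a block row calls this routine with its local contribution; the surviving prefix of z is the y_α^β segment that the transpose step then exchanges.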

In the second communication operation, called expand in [4], each processor knows some information that must be shared among all the processors in a column. The expand uses essentially the inverse communication pattern of the fold operation, and is outlined in Fig. 3 for communication between processors with the same column index β. Each processor begins with a subvector of length n/p, and when the operation finishes all processors in the column know all n/√p values in the union of their subvectors. At each step in the operation a processor sends all the values it knows to another processor and receives that processor's values. These two subvectors are concatenated, as indicated by the "|" notation. As with the fold operation, only a logarithmic number of stages are required, and the total number of values sent and received by each processor is n/√p - n/p.

    Processor P_αβ knows y_β^α ∈ R^(n/p)
    z := y_β^α
    For i = log2(√p) - 1, ..., 0
        P_α'β := P_αβ with i-th bit of α flipped
        Send z to processor P_α'β
        Receive w from processor P_α'β
        If bit i of α is 1 Then
            z := w | z
        Else
            z := z | w
    y_β := z
    Processor P_αβ now knows y_β ∈ R^(n/√p)

Fig. 3. The expand operation for processor P_αβ as part of block column β.
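A recursive doubling rendering of Fig. 3 is sketched below in C with MPI. As with the fold sketch, the routine name expand, its arguments, and the MPI calls are illustrative assumptions, not the original implementation; the concatenation order mirrors the pseudocode so the gathered vector matches the ordering produced by the fold.

    /* Sketch of the expand (recursive doubling) of Fig. 3 using MPI.
       Assumptions: col_comm spans the sqrt(p) processors of one block column,
       alpha is this processor's rank in col_comm, piece_len = n/p, and
       z (capacity piece_len * col_size) initially holds y_beta^alpha.
       On return z holds all n/sqrt(p) entries of y_beta. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void expand(double *z, int piece_len, int alpha, int col_size,
                MPI_Comm col_comm)
    {
        int len = piece_len;                     /* current length of z */
        int stages = 0;
        while ((1 << stages) < col_size) stages++;
        double *w = malloc(piece_len * (col_size / 2) * sizeof(double));

        for (int i = stages - 1; i >= 0; i--) {
            int partner = alpha ^ (1 << i);      /* flip bit i of alpha */
            MPI_Sendrecv(z, len, MPI_DOUBLE, partner, 0,
                         w, len, MPI_DOUBLE, partner, 0,
                         col_comm, MPI_STATUS_IGNORE);
            if ((alpha >> i) & 1) {              /* z := w | z */
                memmove(z + len, z, len * sizeof(double));
                memcpy(z, w, len * sizeof(double));
            } else {                             /* z := z | w */
                memcpy(z + len, w, len * sizeof(double));
            }
            len *= 2;
        }
        free(w);
    }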

The optimal implementation of the fold and expand operations depends on the machine topology and various hardware considerations, e.g. the availability of multiport communication. There are, however, efficient implementations on most architectures [9]. On hypercubes, for example, these operations can be implemented using only nearest neighbor communication if the blocks in each row and column of the matrix are owned by a subcube with √p processors. On meshes, efficient implementations are possible if the blocks of the matrix are mapped in the natural way to a square grid of processors.

The third communication operation in our matvec algorithm requires each processor to send a message to the processor owning the transposed portion of the matrix, i.e. P_αβ sends to P_βα. Since we want row and column communication to be efficient for the fold and expand primitives, this transpose operation can be difficult to implement without message congestion: a large number of messages must travel between architecturally distant processors, so the potential for message congestion is great. We have devised an optimal, congestion-free transpose algorithm for hypercubes, which is discussed in §3.1; implementations tailored to other architectures may benefit from similar considerations. Even if congestion is unavoidable, the length of the messages in the transpose is a factor of √p less than in the fold and expand operations, so the transpose messages can be delayed by congestion without changing the overall scaling of the algorithm.

2.2. Our matvec algorithm. We can now present our algorithm for computing y = Ax in Fig. 4. Further details and enhancements are presented in the following section. All the numerical operations in the algorithm are performed in steps (1) and (2). The values computed by processor P_αβ in step (1) are just the local matrix-vector multiplication z_αβ = A_αβ x_β, and these values are summed within rows in step (2) using the fold operation. When the fold finishes, each processor owns n/p of the values of y. Unfortunately, the processors owning the pieces of y_α are those in row block α of the matrix, whereas to perform a subsequent matvec the processors in column block β of A must know all of x_β. The values of y must therefore be redistributed, which is accomplished in steps (3) and (4). In step (3), each processor exchanges its subsegment of y with the processor owning the transposed block of the matrix. After the transposition, the values owned by the processors in column block β are the pieces of y_β, and the expand in step (4) gives each of them all of y_β. The result is thus distributed among these processors as required for the next matvec.

    Processor P_αβ owns A_αβ and x_β
    (1) Compute z_αβ = A_αβ x_β
    (2) Fold z_αβ within rows to form y_α^β
    (3) Transpose the y_α^β, i.e.
        a) Send y_α^β to P_βα
        b) Receive y_β^α from P_βα
    (4) Expand y_β^α within columns to form y_β

Fig. 4. Parallel matrix-vector multiplication algorithm.

We note that at this level of detail the algorithm is identical to the one described in [4] for dense matrices, but as we discuss in the next section, the details of steps (1), (2) and (3) are different and result in a more efficient overall algorithm.

3. Algorithmic refinements.

3.1. Transposition on a parallel computer. The fold and expand operations are most efficient when the rows and columns of the matrix are mapped to subsets of processors that allow for fast communication. On a hypercube such a subset is a subcube, while on a 2-D mesh rows, columns or submeshes of processors are appropriate. The transpose operation is more problematic because it requires communication between processors which are architecturally distant. Modern parallel computers use cut-through routing, so that a single message can be transmitted between non-adjacent processors at nearly the same speed as if it were sent between adjacent processors. Unfortunately, if multiple messages are simultaneously trying to use the same wire, all but one of them must be delayed, so machines with cut-through routing can still suffer from serious congestion.

The usual scheme for routing on a hypercube is to compare the bit addresses of the sending and receiving processors and flip the bits in a fixed order (and transmit the message along the corresponding channel) until the two addresses agree. On the nCUBE 2 and Intel iPSC/860 the order of comparisons is from lowest bit to highest, a procedure known as dimension order routing; a similar fixed-order scheme is used on mesh architectures. Thus a message from processor 1001 to processor 0100 will route from 1001 to 1000 to 1100 to 0100. The usual scheme of assigning matrix blocks to processors uses the low order bits of the processor address to encode the column number and the high order bits to encode the row number. With this natural mapping, dimension order routing moves a transpose message within a row before moving it within a column, so the messages from all the √p processors in a row route through the processor owning the diagonal block, and that processor becomes a bottleneck; the transpose can be delayed by a factor of √p. Fortunately, since the transpose messages are shorter than those in the fold and expand operations by a factor of √p, even if this congestion delays the transpose, the overall communication scaling of the algorithm will not be affected.
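The routing behaviour described above can be made concrete with a short sketch. The C function below, an illustration rather than anything from the report, enumerates the intermediate nodes of a dimension order (lowest-bit-first) route between two hypercube addresses; applied to the example in the text, print_route(0x9, 0x4, 4) prints 1001 -> 1000 -> 1100 -> 0100.

    /* Illustrative sketch of dimension order (e-cube) routing on a d-cube:
       starting at src, flip the bits in which src and dst differ from
       lowest to highest; each flip traverses one wire of the hypercube. */
    #include <stdio.h>

    static void print_bits(unsigned a, int d)
    {
        for (int i = d - 1; i >= 0; i--)
            putchar(((a >> i) & 1) ? '1' : '0');
    }

    void print_route(unsigned src, unsigned dst, int d)
    {
        unsigned node = src;
        print_bits(node, d);
        for (int i = 0; i < d; i++) {
            if (((src ^ dst) >> i) & 1) {
                node ^= 1u << i;              /* move along dimension i */
                printf(" -> ");
                print_bits(node, d);
            }
        }
        putchar('\n');
    }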

On a hypercube, a different mapping of matrix blocks to processors can avoid transpose congestion altogether. With this mapping we still have nearest neighbor communication in the fold and expand operations, but now the transpose operation is as fast as sending and receiving a single message of length n/p.

Consider a d-dimensional hypercube where the address of each processor is a d-bit string; for simplicity we assume that d is even. The block row number α is encoded in d/2 bits of the processor address, and the other d/2 bits encode the block column number β. For fast fold and expand operations, we require that the processors in each row and each column form a subcube. This is assured if any set of d/2 bits in the d-bit processor address encodes the block row number and the remaining bits encode the block column number. Now consider a mapping in which the bits of the block row and block column indices of the matrix are interleaved alternately. For a 64-processor hypercube (with 8x8 blocks of the matrix) this means the 6-bit processor address would be r2 c2 r1 c1 r0 c0, where the original 3-bit block row index is encoded by r2 r1 r0 and the 3-bit block column index by c2 c1 c0. Note that in this mapping each row of blocks and each column of blocks of the matrix is still owned by a subcube of the hypercube, so the expand and fold operations can be performed optimally. However, the transpose operation is now contention-free, as demonstrated by the following theorem. Although the proof assumes dimension order routing, a similar contention-free mapping is possible for any fixed routing scheme in which bits are flipped in order, as long as the row and column bits are interleaved.

THEOREM 3.1. Consider a hypercube using dimension order routing, and map processors to elements of an array in such a way that the bit-representations of a processor's row number and column number in the array are interleaved alternately in the processor's bit-address. Then the wires used when each processor sends a message to the processor in the transpose location of the array are disjoint.

Proof. Consider a processor P with bit-address r_b c_b r_{b-1} c_{b-1} ... r_0 c_0, where the row number is encoded with r_b ... r_0 and the column number with c_b ... c_0. The processor P^T in the transpose location of the array will have bit-address c_b r_b c_{b-1} r_{b-1} ... c_0 r_0. Under dimension order routing, where bits are flipped in order from lowest to highest, a message is transmitted in as many stages as there are bits, flipping bits in order from right to left to generate a sequence of intermediate processor bit patterns. After each stage the message resides at the processor denoted by the current intermediate bit pattern. After 2k stages, a message in the transpose operation from P to P^T will have the pattern r_b c_b ... r_k c_k c_{k-1} r_{k-1} ... c_0 r_0. The bits of this intermediate processor are a simple permutation of the bits of P in which the lowest k pairs of bits have been swapped. Also, after 2k - 1 stages, the values in the bit positions 2k and 2k - 1 of the intermediate pattern are equal.

Now consider another processor P' ≠ P, and assume that the message being routed from P' uses the same wire employed in stage i of the transmission from P to P^T. Denote the two processors connected by this wire by P1 and P2. The wires used in routing a message from P to P^T are those that connect processors whose patterns occur consecutively in the sequence of intermediate patterns. Since P1 and P2 differ in bit position i, they can only be encountered consecutively in the transition between stages i - 1 and i of the routing. However, either i - 1 or i is even, so a simple permutation of pairs of bits of P must generate either P1 or P2; say P1. Similarly, the same permutation applied to P' must also yield either P1 or P2. If it yields P1, then P = P', which is a contradiction. Otherwise, both P1 and P2 must appear after an odd number of stages in one of the routing sequences. If i is odd then bits i and i + 1 of P must be equal, and if i is even then bits i and i - 1 of P are equal. In either case, P1 = P2, which again implies the contradiction P = P'. □
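The interleaved mapping is easy to construct in code. The sketch below is illustrative only (the function name and argument conventions are assumptions); it places each column bit in the lower position of a pair and each row bit in the upper position, reproducing the r2 c2 r1 c1 r0 c0 layout described above for half_d = 3.

    /* Illustrative sketch of the interleaved block-to-processor mapping.
       With this mapping the transpose partner of the processor owning
       block (row, col) is simply the address of block (col, row), since
       swapping row and col swaps the two bits within every pair. */
    unsigned interleaved_address(unsigned row, unsigned col, int half_d)
    {
        unsigned addr = 0;
        for (int i = 0; i < half_d; i++) {
            addr |= ((col >> i) & 1u) << (2 * i);      /* c_i -> bit 2i     */
            addr |= ((row >> i) & 1u) << (2 * i + 1);  /* r_i -> bit 2i + 1 */
        }
        return addr;
    }

Fixing the odd-numbered address bits selects the subcube owning one block row, and fixing the even-numbered bits selects the subcube owning one block column, which is what the fold and expand require.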

3.2. Overlapping communication and computation. If a processor is able to both compute and communicate simultaneously, then the algorithm in Fig. 4 has the shortcoming that once a processor has sent a message in the fold or expand operations, it is idle until the message from its neighbor arrives. This can be alleviated in the fold by interleaving the communication of step (2) with the computation of step (1). Rather than computing all the elements of z_αβ before beginning the fold, we compute just those that are about to be sent. Whichever values will be sent in the next pass through the fold loop get computed between the send and receive operations of the current pass, and in the final pass the values that the processor keeps are computed. In this way the total transmission time is overlapped with computation, and the run time is reduced on each pass through the fold loop by the minimum of the message transmission time and the time to compute the next set of elements of z_αβ. We make no assumptions about the data structure used on each processor for the local matvecs; this allows for the implementation of whatever scheme works best on the particular hardware.
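One stage of this overlap can be sketched with nonblocking message passing. The following C fragment uses MPI nonblocking calls as a modern stand-in for the asynchronous sends and receives available on the machines discussed in the report; the callback compute_half is a placeholder for computing the next set of entries of z_αβ and is an assumption of this sketch, not a routine from the paper.

    /* Sketch of overlapping one fold stage with computation (MPI).
       send_half must already be computed; the values for the next stage
       (or the kept half on the last stage) are computed while the
       exchange is in flight. */
    #include <mpi.h>

    void fold_stage_overlapped(double *send_half, double *recv_half,
                               double *keep_half, int half, int partner,
                               MPI_Comm comm,
                               void (*compute_half)(double *, int))
    {
        MPI_Request req[2];
        MPI_Irecv(recv_half, half, MPI_DOUBLE, partner, 0, comm, &req[0]);
        MPI_Isend(send_half, half, MPI_DOUBLE, partner, 0, comm, &req[1]);

        compute_half(keep_half, half);     /* hide the transmission time */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        for (int k = 0; k < half; k++)     /* finish the fold summation */
            keep_half[k] += recv_half[k];
    }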

3.3. Balancing the computational load. The low communication requirements of our algorithm are only useful if the computational load is well balanced across the processors. The discussion above has concentrated on the communication structure; we now consider the computation within each local matvec. If the region of the matrix owned by a processor has m' nonzeros, the number of floating point operations (flops) required for its local matvec is 2m' - n/√p. Some of these flops occur during the fold summations, and zero values encountered when summing vectors in the fold operation must be handled efficiently. The computation will be balanced if m' ≈ m/p for each processor, where m is the total number of nonzero elements in the matrix. For dense matrices or random matrices in which m >> n, the load is likely to be balanced. However, for matrices with some structure it may not be. For these problems, Ogielski and Aiello have shown that randomly permuting the rows and columns of the matrix gives good balance with high probability [8]. A random permutation has the additional advantage that it tends to distribute the nonzero elements evenly among the processors.

Most matrices used in real applications have nonzero diagonal elements. We have found that when this is the case, it may be advantageous to force an even distribution of the diagonal elements among the processors. This can be accomplished by first applying a random symmetric permutation to the matrix and then mapping the diagonal elements evenly to the processors while the remaining off-diagonal elements are mapped randomly, which preserves the randomness of the distribution. The contribution of the diagonal elements each processor owns can then be computed in between the send and receive of the transpose, saving either the time for the transpose send and receive operations or the time for the diagonal computation, whichever is smaller.

3.4. Complexity model. A matrix-vector product requires 2m - n flops to perform, where m is the number of nonzeros in the matrix. If we assume that the computational load is balanced by using the techniques described in §3.3, the time to execute the local matvecs is (2m - n)T_flop/p, where T_flop is the time required for a single floating point operation.

The algorithm requires log2(p) + 1 send and receive pairs for each processor, and a total communication volume of n(2√p - 1)/p floating point numbers. Unless the matrix is very sparse, the computational parallelism will be sufficient to hide the transmission time in the fold operation, as discussed in §3.2; we will assume that this is the case. Furthermore, overlapping the transpose with the computations involving the matrix diagonal, as described in §3.3, hides the transpose transmission as well, so the effective communication volume should be very nearly n(√p - 1)/p. The total run time, T_total, can now be expressed as

(1)    T_total = (2m - n) T_flop / p + (log2(p) + 1)(T_send + T_receive) + n(√p - 1) T_transmit / p,

where T_flop is the time to execute a floating point operation, T_send and T_receive are the times required to initiate a send and a receive operation respectively, and T_transmit is the transmission time per floating point value. This model will be most accurate for hypercubes, because it assumes that message contention is insignificant, as it is with the mapping described in §3.1.

4. Application to a parallel conjugate gradient solver. To examine the efficiency of our matrix-vector multiplication algorithm, we used it as the kernel of a parallel conjugate gradient (CG) solver. A version of the CG algorithm for solving the linear system Ax = b is depicted in Fig. 5. There are a number of variants of the basic CG method; the one we present is essentially that given in the NAS benchmark [1, 3] discussed later.

    x := 0
    r := b
    p := b
    ρ := r^T r
    For i = 1, ...
        y := A p
        γ := p^T y
        α := ρ/γ
        x := x + α p
        r := r - α y
        ρ' := r^T r
        β := ρ'/ρ
        ρ := ρ'
        p := r + β p

Fig. 5. The conjugate gradient algorithm.

An efficient parallel implementation of the CG algorithm should divide the workload evenly among the processors while keeping the cost of communication small. In addition to the matrix-vector multiplication (forming y), the inner loop of the CG algorithm requires vector updates (of x, r and p) as well as two inner products (forming γ and ρ'). The vector updates require no communication and divide the workload evenly if the vectors are all distributed among the processors in the same fashion. Unfortunately, these goals are in conflict with the inner products, which are global operations requiring communication among all the processors. If the algorithm in Fig. 5 is implemented in parallel, each processor must know the value of α before it can update x and r, and the value of β, and hence ρ', before it can update p, so two separate global operations are required in each iteration. Each global sum can be accomplished with a binary exchange among the processors, but these global operations are still very costly.

One way to reduce the communication load of the algorithm is to modify it as shown in Fig. 6. The new algorithm is a slightly modified version of the basic CG method; it is algebraically equivalent to the original, but instead of updating r and then calculating ρ' = r^T r, it exploits the identity r_{i+1}^T r_{i+1} = (r_i - αy)^T (r_i - αy) = r_i^T r_i - 2α y^T r_i + α² y^T y, as suggested by Van Rosendale [10]. The three values γ = p^T y, φ = y^T r and ψ = y^T y can be summed simultaneously with a single global communication, halving the number of global operations in each iteration, and ρ' can then be computed from the identity without further communication. In exchange for this communication reduction, there is a net increase of one inner product calculation, since φ and ψ must now be computed but ρ' = r^T r need not be calculated explicitly. Since the vectors are distributed, the additional inner product requires 2n/p floating point operations on each processor; whether this is a net gain depends on the relative cost of communication and computation on a particular machine.

Our parallel implementation of this modified CG algorithm, with the matvec of §2 as its kernel, is shown in Fig. 7. In addition to the fold, transpose and expand of the matvec, the only communication required in each iteration is the single global sum forming γ, φ and ψ.

    x := 0
    r := b
    p := b
    ψ~ := r^T r  (local contribution)
    Sum ψ~ over all processors to form ρ
    Expand p within columns to form p_β
    For i = 1, ...
        Compute z_αβ := A_αβ p_β
        Fold z_αβ within rows to form y_α^β
        Transpose the y_α^β, i.e. send y_α^β to P_βα and receive y_β^α from P_βα
        γ~ := p^T y,  φ~ := y^T r,  ψ~ := y^T y  (local contributions)
        Sum γ~, φ~ and ψ~ over all processors to form γ, φ and ψ
        α := ρ/γ
        ρ' := ρ - 2αφ + α²ψ
        β := ρ'/ρ
        ρ := ρ'
        x := x + α p
        r := r - α y
        p := r + β p
        Expand p within columns to form p_β

Fig. 7. A parallel CG algorithm.
Fig. C code achieves assembly 5.

about

language

aa n/@.

either

difficult

250 Mflops, which is about

We have presented

cost of this algorithm Consequently, or impossible

serve aa an efficient black-box in a sparse

without

is independent

to exploit.

matrix

achievable

by running

algorithm within

for matrix–vector

matrices

the conjugate

of the zero/nonzero

is most appropriate

gradient

structure

for matrices

in many contexts.

for prototyping

library

sparse

matrix

where few assumptions 12

pure

multiplication,

The

of the matrix

and

in which structure

For example, linear algebra

about

matrix

and

algorithm.

This is clearly the case for dense and random

for sparse routine

F’PV.

any communication.

a parallel

the algorithm

for processor

12% of the raw speed

can be used very effectively

it is also true more generally

be embedded

to form p.

7. A parallel CG algorithm

shown how this algorithm

scales

columns

BLAS on each processor

Conclusions.

communication

p within

matrices,

our algorithm algorithms

structure

is and

could

or could

can be made.

On the NAS conjugate more than 40’%0faster The particular mapping

ensures

cut–through mapping

gradient

than any other mapping

that

routing

we employ

the transpose

haa already

reported

an nCUBE algorithm

of the matrix

operation

2 implementation

running

for hypercubes

rows and columns

of this algorithm

on any massively

parallel

is likely to be of independent are owned entirely

can be performed

proved useful for parallel

to other linear algebra

many–body

by subcubes,

without

calculations

message

runs

machine.

interest.

This

and that

with

contention.

[5], and is probably

This

applicable

algorithms.

Acknowledgements. percube

benchmark,

transposition

We are indebted

algorithm

to David Greenberg

for assistance

in developing

the hy-

in $3.1.

REFERENCES [1] D. H. BAILEY, E. BARSZCZ, J. T. BARTON; D. S. BROWNING, R. L. CARTER, L. DAGUM, R. A. FATOOHI, P. O. FREDERICKSON, T. A. LASINSKI, R. S. SCHREIBER, , H. D. V.

VENKATAKRISHNAN,

Supercomputing

AND

S.

Applications,

K.

The NAS

WEERATUNGA,

parallel

benchmarks,

Supercomputing

’92, IEEE

Computer

Society

[3] D. H. BAILEY, J. T. BARTON, T. A. LASINSKI, benchmarks,

Intl. J.

5 (1991), pp. 63-73.

[2] D. H. BAILEY, E. BARSZCZ, L. DAGUM, AND H. D. SIMON, NAS parallel benchmark Proc.

SIMON,

Tech. Rep. RNR-91-02,

Press, H. D.

AND

results, in

1992, pp. 386-393. SIMON,

NASA Ames Research

The NAS parallel

EDITORS,

Center,

Moffett Field, CA, January

1991. [4] G. C. Fox,

M. A. JOHNSON, G. A. LYZENGA, S. W. Solving problems

WALKER,

on concurrent

processors:

OTTO,

J. K. SALMON,

Volume

1, Prentice

D. W.

AND

Hall, Englewood

Cliffs, NJ, 1988. [5] B. HENDRICKSON munication, December [6] R. W.

S.

AND

Tech.

Rep.

SAND

92-2766,

[7] R. W.

Systems,

LELAND

PhD thesis, J. S.

AND

Numerical

methods

University

Press,

[8] A. T. OGIELSKI J. Sci. Stat. [9] R. VAN

DE

puting

many-body

Sandia

calculations

National

without

Laboratories,

all-to-all

Albuquerque,

comNM,

1992.

LELAND, The Effectiveness

Linear

Parallel

PLIMPTON,

AND

University

of Oxford,

Evaluation

ROLLETT,

in fluid dynamics

Algorithms Oxford,

for Solution

England,

of a parallel

conjugate

III, K. W. Morton

of Large Sparse

October

1989.

gradient

and M. J. Baines,

algorithm, eds.,

in

Oxford

1988, pp. 478-483. W.

Comput.,

GEIJN,

of Parallel Iterative

14 (1993).

Eficient

Conf., IEEE

Computer

computations

on parallel processor

Society

operations,

Press,

in Proc.

on parallel

pp. 44-46. 13

6th Distributed

Memory

Com-

1991, pp. 291-294.

inner product data dependencies

conference

arrays, SIAM

To appear.

global combine

[10] J. VAN ROSENDALE, Minimizing in 1983 International

Sparse matriz

AIELLO,

processing,

in conjugate

gradient

iteration,

H. J. Siegel et al., eds., IEEE,

1983,

EXTERNAL DISTRIBUTION: R W. Alewine

DARPA/RMO 1400 Wilmn Blvd. Arlington, VA 222o9 F~ Altabdi US Air Force Weapono Lab Nuclear Technology Branch Kirtfand+

AFB,

Falcon

AFB, CO

80912-51MfJ

Raymond A. Bair Molecuhu Science Reach. Cntr. Pacific NW hborato~ Richland, WA 99362

Bud Brewster 410South Pierce Wheaton, IL 80187

R. E. Bak Dept. of Mathematical Univ. of CA at San Diego La Jolla, CA 92093

Carl N. Brooks Brc&a Associates 414 Falls Road Chagrin Fafls, OH 44022

Ken Bannister

John Brunet 245 Fmt St. Cambridge, MA 02142

NM 87117-6008

Mad w. Allen The Waif Street Joumaf 1233 Regal hW Daflas, TX 75247 M. Alme Alme and Associates 6219 Bright Plume C&unbii, MD 21044 ChaAa E. Anderson SOuthweat ~ Institute PO Drawer 28610 San Antonio, TX 78284 Dan Anderson Ford Motor Co., Suite 1100 Vie Pke 22400 Michigan Ave. Dearbcrn, MI 48124 Andy Arenth

National Security Agency SlwageRoad Ft. Meade, MD 207ss

Attn: C6 Greg Astfdk Convex Computer (hp. 3(N)0 Waterview Parkwqy PO Box 833851 Richrmdao% TX 750s3-3851 Susan R. Atlas TMC, MS B258

US Army Bdfistic Res. Lab Attrx SLCBIVIB-M Aberdeen Prov. Gmd., MD 21005-5086 Edward Barragy Dept. ASE/EM University of Texas Austin, TX 78712

John Bnmo Cntr for Comp. Sci and Engr College of Engineering Univemity of California Santa BarbM~ CA 931-5110

E. Bamm NAS Applied Rewarch Branch NASA Ames Ikearch cent. Moffett Field, CA 94035

H. L. Buchanan DARPA/DSO 1400 Wifson Blvd. Arlington, VA 22209

W. Beck AT&T Pixel Machinea, 4J-214

D. A. Bud Supercornputing Reach. Cntr,

Crawfords Ccuner Rd. Homdel, NJ 07733-1%8

17100 Saence Dr. Bowie, MD 20715

David J. Dept. of Univ. of La Jolla,

B. L. Buzbee Scientific Computing NCAR PO Box 3000 Boulder, CO 80307

fkuuq AMES FLO1l California at San Diego CA 92093

Mylea R. Berg Lockheed, 0/62-30, B/150 1111 Lockheed Way %nqyvale, CA 9408%3.504

Stephan Bilyk US Army Baflist. Reach. Lab SLCBKTB-AMB Aberdeen Proving Ground, MD 21OOWIO66

Dept.

G. F. Carey Dept. of Engineering Mechanics TICOM ASEEM WRW 305 University of Texan at Austin Austin, TX 78712 Art Carfmn NOSC Code 41 New London, CT 06320

Center for Nonlinear Studies Los Alarnos Nationaf Labs Los Ahunos, NM 87545

Rob Bisseling Shelf Research B.V. Postbus 3003 1003 AA Amsterdam

L. Audander DARPA/DSO 1400 Wifson Blvd.

The Netherlands

CENDI, Information Iut’1 PO Box 4141 Oak Ridge, TN 37831

Matt Blumrich Dept. of Comp. Saence Princeton University Princeto~ NJ 0s644

Charfm T. Casale Aberdeen Group, Inc. 92 State Street Boston, MA 02109

B. W. Boehm DARPA/ISTO 1400 Wilson Blvd. Arlington, VA 222o9

J. M. Cavdlini US Department of Energy OSC, ER-30, GTN Wasbingto~ DC 20585

R. R. Borchcr L-889 Lawrence Lhennore Nat’1 Labs PO Box 808 Livermore, CA 94550

John Champine Clay Rea. Inc., Software Div. 655F Lone Oak Dr. Eagan, MN 55121

Arfington, VA 222o9 D. M. Austin Army High Per Comp. Rea. Cntr. University of Minnesota 1100 S. Second St. Mirmeapolie, MN 5s41s

Bonnie C. Carrolf Sec. Dir.

Scott Badm Univemity of Cdiforni% Stm Diego Dept. of Computer Saence 95OOGifman Drive Engirmxing0114 La Jofla, CA 92091-CX)14 F. R. Bailey MS2Cxl-4 NASA Ames Research Center Moffett Field, CA 94o35 D. H. Bailey NAS Applied Research Branch NASA bResearch Center Moffett F]eld, CA 94o35

Dan BOWhIS Mail Code 8123 Attm D. Bowlus & G. l..diecq Naval Underwater Sys. Cntr. Newport, RI 02841-5047

Tony Chan Department of Computer Science The Chinese Univemity of Hong Kong Shatin, NT Hong Kong J. Chandra Army Research Office PO Box 12211 Resch Triangle Park, NC 27709

Dr. Donald Brand MS - N8930 Geodynamica 14

Siddmtba Chatterjee RIAcs NASA Ames Reaeara Mail Stop T04~l MofTett Field, CA 94~l(W

IBM Corporation 472 Wheelers FMilford, CT 06460

Ibad

C.ter

Wamm Chemock Scientific Advisor DP.1 US Department of Energy Forentd Bldg. 4A.045 Washington, DC 20585 R.C.Y. Chin L-321 La wre.nce Liverrmwe Nat’1 Lab PO Box 808 Livermore, CA 9455o Mark Christen L122 Lawreme Llvennore Nat’1 Lab PO %X 808 Livemnme, CA 9455o M. Cirnent Adv. Sci. Comp/ Div. RM 417 National Science Foundation Washington, DC 20ss0 Richard Cfaytor US Department of Energy Def. Prog. DE1 Foreatd Bldg. 4A-014 Washington, DC 20585 Andy Cleary Centm for Information Research Austrafia National University GPA Box 4 CanberrA ACT 26o1 Australia T. Cole MS180-500 Jet Prop. Lab 4800 Oak Grove Dr., P.amdena. CA 911OP Monte ColeUS Army Bd. Reach Lab SLCBR-SEA (Bldg. 394/216) Aberdeen Prov. Gmd., MD 21005-5006 Tom Coleman Dept. of Computer Science Upeon Hdl Comelf University Ithaca, NY 14853

J. K. Crdlum Thorruu J. Watmn Resch. Center Po Box 218 Yorktown Heights, NY 10598

H. Elman Computer Science Dept. University of Marylaud College Park, MD 20842

Leo Dagmn Computer Sciences Corp. NASA Ames Research Center Moffett Field, CA 94CL3S

M. A. Efmer DARPA/RMO 1400 Wilson Blvd. Arfington, VA 222o9

Kenneth I. Daugherty Chief Scientist HQ DMA (5A), MS -A-16

J. N. Entzminger DARPA/TTO 1400 When Blvd.

8613 Lee Highway Fairfax, VA 22031-2138

Arlington, VA 222o9

L. Davis Cray Research Inc. 1168 Industnd Blvd. Chippawa Fafls, WI s472s Mr. Frank R. Deis Martin Marietta Falcon AFB, CO 80912-50t13 R. A. DeWlllo Comp. & Comput. Reach., Rm. 304 , National Science Foundation Washington DC 20550 L. Deng Applied Mathematics Dept. SUNY at Stony Brook

A. M. Erisrnan MS 7L21 Boeing Computer Services PO Box 24346 Seattle, WA 98124-0346 R. E. Ewing Mathernatica Dept. University of W yoming PO Box 3036 Univemity Station Lammie, WY 82071 El Dabaghi Fadi Charge of Research Institute Nat’1 De Recherche en Informatique et en Automatique Dornaine de Voluceau Romuencourt BP 105 78153 Le Chemay Cedex (France)

Stony Brook, NY 11794-3&lI A. Trent DePersia Prog. Mgr. DARPA/ASTO 1400 Wifscm Blvd. Arlington, VA 22209-2308 Sean Dolan nCUBE 919 E. Hilbdafe Blvd. Foster City, CA 944o4

H. D. F& Institute for Adv. Tech. 4032-2 W. Braker Lane Austin. TX 78759 Kurt D. Fickie US Army Ballistic Resch. Lab ATTN: SLCBR-SE Aberdeen Proving Ground, MD 21OWA4M6 Tom Finnegan

Jack Dongama Department of Computer Science Univemity of Termeasee Knoxville, TN 37996 L. Dowdy Computer Science Department Vanderbilt University Nashville, TN 37235

S. Coney NCUBE 19A Davis Drive Belmont, CA 94oO2

Joanne Downey-Burke 8030 Sangor Dr. Colorado Springs, CO 80920

J. Corones Ames Laboratory 236 Wilhelm Hdl Iowa State University Ames, IA 50011-3020

L S. DutT CSS Division Harwelf Laboratory Oxfordshire,OX11 ORA United Kingdom

Steve COogrOve E6220 KliOb Atomic power Lab PO Box 1072 Schenectady, NY 12301-1072

Alan Edelman University of California, Berkeley Dept. of Mathematics Berkefey, CA 94720

C. L. Crothera

Yale University PO Box 2158 New Haven, CT 06520

S. C. Euenstat Computer Sdence Dept. 15

NORAD/USSPACECOM J2XS STOP 35 PETERSON AFB, CO 80914 J. E. Flaherty Computer Science Dept. Remselaer Polytedr Inst. Troy, NY 12181 L. D. Foedick University of Colorado Computer Science Department Campus Box 43o Boulder, CO 8030P G. C. Fox Northeast PamUef Archit. Cntr. 111 College Place Syracuse, NY 13244 R. F. Freund NRaD- Code 423 San D,ego, CA 991525000 Sveme Fmyen Solar Energy Research Inst. 1617 Cole Blvd.

Golden, CO 80401 David Gale Intel Corp 600 S. Cherry St. Denver, CO 802221~1 D. B. Grmnon Computer %ience Dept. Indhma University Bloomington, IN 47401 C. W. Gear NEC Rcseadr Institute 4 Independence Way Princeton, NJ 08548 J. A. George Needles Half University of Water400 Waterloo, Ont., Can. N2L 3G1 Shomit Ghose nCUBE 919 E. HilhuLale Blvd. Foster City, CA 94404 Clarence Giese 8135 Table Mesa Way Colorado Springs, CO 80919 Dr. Horst Gietl nCUBE GmbH Hammer Strame 85 8000 Mrmich 50 Germany John Glbert Xerox PARC 3333 Coyote Hill Road palo Alto, CA 94304 Micbnel E. Giltrud DNA HQ DNA/SPSD 6801 Telegraph Rd. Alexan&la, VA 2231CL3398

AktairM. Glass AT&T Bell Labe Rm IA-164 6(KI Mountain Avenue Murray Hill, NJ 07974 J. G. Glirnm Dept. of App Math. & Stat. State U. of NY at Stony Brook Stony Brook, NY 11794 Dr. Raphael Gluck TRW-DSSG, R4/1408 One Space Par-k Redondo Bea+ CA 90278 G. H. Golub Computer =Ience Dept. Stanford University Stanford, CA 943o5 Marcia Grabow AT&T Bell Labe ID-153 6CCIMountain Ave. PO Box 636 Murray Hill, NJ 07974-0636 Nancy Grady MS 6240 Oak Ridge Nat’] Lab

Boa 2008 Oak Ridge, TN 37831 Anne Grcenbaum New York University Courant Institute 251 Mercer Street New York, NY 10012-1185 Satya Gupta Intel SSD Bldg. C)&O9, Zone 8 14924 NW Greenbriar,Pwky Beaverton, OR 97cE36 J. Guetafson Computer Sdence Department 236 Wilhelm Hall Iowa State University Ames, IA 50011 R. Graysnrr Hall USDOE/HQ 11XH3 Independence Ave, SW Washington, DC 20585 Cuah Handen Minnesota Supercomputer Cntr. 1200 Washington Ave. So. Minneapolis, MN 55415 Steve Hammond , NCAR PO Box 3000 Borddm, CO 80307

United Kingdom W. D. Hillis Thinking Machinea, Inc. 245 First St. Cambridge, MA 02139 Dan Hitchmck US Department of Energy SCS, ER-30 GTN Washington, DC 20585 LTC Richard Hochbewrg SDIO/SDA The Pentagon Waahingto~ DC 20301-7100 C. J. Holland Director Math and Information Sciences AFOSR/NM, Boiling AFB Washingto~ DC 20332-6448 Dr. Albert C. Holt Oft. of Munitions Ofc of Sec. of St.-ODDRE/TWP Pentagon, Room 3B106o Washington, DC 203301-311Xl Mr. Daniel Holtzman vanguard R?SearCh Inc. 10306 Eaton P]., Suite 450 Fairfax, VA 22030-2201

CDR. D. R. Hamon Chief, Space Integration Div. ussPAcEcoM/J5sI Peterson AFB, CO 809145003

David A. Hopkins US Army Ballistic Resch. Lab. Attention: SLCBfVIB-M Aber-decn Prov. Gmd., MD 21W5-5066

Dr. James P. Hardy NTBIC/GEODYNAMICS MS N 893o Falcon AFB$ CO 60912-LYYJo

Graimm Horton Univemitat Erlangen-Nurnberg IMMD III Martenmrtrase 3 8520 Erlangen Germany

Doug Harless NCUBE 2221 E~t Lamar Blvd., Suite 36o Arlington, TX 76oo6

Fred Howes US Department of Energy OSC, ER-30, GTN Washington, DC 20585

Mike Heath University of Illinois 4157 Beckman Institute 405 N. Mathews Ave. Urbana, IL 61801

Chua-Huang Hu~ Assist. Prof. Dept. Comp. & Info Sci Ohio State Uuiv. 228 Bolz Hall-2036 Neil Ave. Columbus, OH 43210-1277

Greg Heileman EECE Department Univemity of New Mexico Albuquerque, NM 87131

R. E. Huddleston L61 Lawrence Llvermore Nat’1 Lab PO Box 808 Liverrnore. CA 9455o

Brent Henrich Mobile R &D Laboratory 13777 Midway Rd. PO Box 819047 DaU=, TX 75244-4312

Zdenek Johan Thinking Machines Corp. 245 First Street Cambridge, MA 02142-1264

Michael Heroux Cray Research Park 655F Lone Oak Drive Eagrm, MN 55121

Gary Johnson US Department of Energy SCS, ER30 GTN Washington, DC 20585

A.J. Hey University of Southampton Dept. of Electronic and Computer Science Mountbatten Bldg., Highfield Southampton, S095NH

S. Lenmwt Johnsson Thinking Machines Corp. 245 First Street Cambridge, MA 02142-1264

16

G. S. Jones Teeh Program Support Cntr. DARPA/AVSTO 1515 Wifson BIvd. Arlington, VA 222o9 T. H. Jordan Dept of Earth, Atmoa & Pla. Sci. MIT Cambridge, MA 02139 M. H. Kdoa Cornell Theory Center 514A Eng. and Theory Center Hoy Road, Cornell Univemity Ithaca, NY 14853 H. G. Kaper Math. and Comp.

%]. Division

Argonne National Laboratory Argonne, IL 60439 S. Karin SuperComputing Department 9.5MIGilman Drive University of CA at San Diego La Jofla, CA 92093 Herb Keller Applied Math 217-50 Cd Tech Paaadena, CA 91125 M. J. Kelley DARPA/DMO 1400 Witson Blvd. Arlington, VA 222C9 K. W. Kennedy Computer Science Department Rice University PO Box 1892 Houston. TX 77251 Aram K. Kevorkian Codje 73L14 Naval Ocean Systems Center 271 Catalina Blvd. San Diego, CA 92152-5000 John E. Killougb University of Houston Dept. of Chem. Engineering Houston, TX 77204-4792 D. R. Kincaid Cntr. for Num. Andy., RLM l&150 Univemity of Texaa at Austin Austin, TX 78712 T. A. Kitchens US Department of Energy OSC, ER-30, GTN Washington, DC 20565 Thomas Klemas 394 Briar Lane Newark, DE 19711 Dr. Peter L. Knepell NTBIC/GEODYNAMICS MS N 8930 Falcon AFB, CO 80912-50W Max Koontz DOE/OAC/DP Forestal Bldg

5.1

lIXIO Independence Ave. Washington, DC 20585

Kelly AFB San Antonio, TX 782435~

Ann Krause HQ AFOTEC/OAN Kirtl.and AFB, NM 87117-7(XI1

H. Mair Naval Surface Warfare Center 10901 New Hampshire Ave. Silver Springs, MD 2090&5000

V. Kumar Computer Science D~ment Univemity of Minnesota Minneapolis, MN 55455 J. Larmutti MS B-166 Director, SC. Reach. Institute Florida State Univemity Tallahassee, FL 32306

Henry Makowitz MS - 2211-INEL EG&E Idaho Incorporated Idaho F.dlE, ID 83415 David ManddI MS F663 Hydrodynamic App. Grp. X-3 IAM Alamos Nat’] Labs Los Alanms, NM 87545

P. D. Lax New York University-Courant 251 Mercer St.

T. A. ManteufTel Department of Mathematics

New York, NY 10012

University of Co. at Denver Denver, CO 80202

Lawrence A. Lee NC Supercomputing Center PO Box 12689 3021 Comwdlis Rd. Research Triangle Park, NC 27709 Dr. H.R. Leland Calspan Corporation PO Box 400, Buff.do, NY 14225

William S. Mark Lockheed - Org, 96-01 Bldg. 254E 3251 Hanover Street Palo Alto, CA 943031191 Kapit Mathur Thinking Machines Corporation 245 Fimt Street Cambridge, MA 0214>1214

David Levine Math & Comp. S&me Argonne National Laboratay 9700 Cam Avenue South Argonne, IL 60439

John May Kanutn Sciences Corporation 1500 Garden of the Gods Road Colorado Springn, CO 60933

Peter Llttlewood Theoret. Phy. Dept. AT&T Belt Labe Ran lD-?35 Murray Hill, NJ 07974

William McColt Oxford Univ. Computing Lab 6-11 Keble Road Oxford, OX1 3QD United Kingdom

Peter Lomdafd T-II, MS B262 Los Alamoa Nat’] Lab Los AhllllOS, NM 87545

S. F. McCormick

L-As S. Leme SDIO/TNI The Pentagon Washington, DC 20301.7tO0 Col. Gordon A. Long Deptuty Director for Adv. Comp. HQ USSPACECOM/JOSDEPS Peterson AFB, CO 6(X)145(X)3 John LOll 3258 Caminito Ameca La Jolla, CA 92037 Daniel Loyem Koninklijke/ShelL Laboratorium Postbus 3003 1C02 AA Amsterdam The Nethertamb Robert E. Lynch Dept. of CS Purdue University West Lafayete, IN 479o7

Computer Mathematical Group Univemity of CO at Denver 1200 Larimer St. Denver, CO 80204 J. R. McGraw L-316 Lawrence Livermore Nat’1 Lab PO Box 808 Livermore, CA 9455o Jill Mesirov Thinking Machines Corporation 245 F]mt Street Cambridge, MA 0214Z1214 P. C. Messina 158-79 Mathematics & Comp Sci. Dept. Caltech Pasadena, CA 91125 Prof. Ralph Metmlfe Dept. of Mech. Engr. University of Houston 4600 Calhoun Road Houston, TX 772044792 G. A. Michael L306 Lawrence Livermore Nat Lab

Kathy MacLeod AFEWC/SAT 17

PO Box 808 Liverrnore, CA 94550 Lam K. Miller Goodyear Tire & Rubber PO Box 3531 Akron, OH 4430P-3531 Robert E. Milletein TMC 245 Fint Street Cambridge, MA 02142 G. Mohnkem NOSC - Code 73 San Diego, CA 92152-50W C. Moler Tbe Mathworke 24 prime Pti Way Nati~, MA 01760

3000 Waternew

Parkw~

PO Box 733851 Ricbrdaon, TX 75083-3851

Anthony C. Parmee Co-nor and Attde

(Def.)

British Embsssy 31W MaLw. Ave, NW WashingtorL DC 20008

s. v. Parter Department of Mathemati= Van Vleck Hall University of Wisconsin Madison, WI 537o6 Dr. Nieheeth Patel US Army Ballistic Ftesch. Lab. AMXBR-LFD Aberdeen Prov. Gmd., MD 21005-5066

R. J. Plemmone Dept. of Math. & Comp Sci. Wake Forest University PO Box 7311 Wimtort Salem, NC 27109 Alex Pothen Computer Science Department Univemity of Waterloo Waterloo, Ontario N2L 301

Canada John K. Prentice Amparo Corporation 37OORio Grande NW, Suite 5 Albuq., NM 87107-3042 Peter P. F. Radkoweki PO Box 1121 LOU Alamoe, NM 87644

Gery Montry Southwcat software

A. T. Patera Mechanical Engineering Dept. 77 Maseachueetto Ave.

J. Rattncr Intel Scientific Computere

11812 Pemimrrton, NE Albuquerque, NM 87111

MIT Cambridge, MA 02139

15201 NW Greenbriar Pkwy, Beaverton, OR 97oo6

N. R. Morse C-DO, MS B260 Comp. & Comm .Division Loe Alamoe National Lab Loe Alamoe, NM 87546

A. Patrinoe Atmoe. and Cfirn. Resch. Div OSice of Engy ResclL ER-74 US Deprutment of Energy WaAingtoaL D~ 20545

J. P. Retelle Org. 94-90 Lakheed - Bldg. 254G 32S1 Hanover Street Palo Alto, CA 94304

J. R. Medic IBM Thomas J. Wateon Raech Cntr. PO Box 704 Yorktown Heighte, NY 10698

R. F. Peierfs Math. Saenca Group, Bldg. S1S Brookhawrt National Lab upton, NY 11973

C. E. Rhoadea L298 Computational

D. B. Neleon US Department of Energy OSC, E&30, GTN Weehington, DC 20s65 Jeff Newmeyer Org. 81-04, Bldg. 157 1111 Lockheed Way Sunnyvale, CA 9406$3504 D. M. Noeenchuck Mech. and Aero. Engr. Dept. D302 E Quad Princeton University Princeton, NJ 08544 C. E. Oliver Offc of Lab Comp Bldg. 4500N, Oak Ridge Nat’1 Laboratory PO Box 2006 Oak Ridge, TN 37631-6259 Dennis L. Orphal Cdif Reach& Technology Inc. 5117 Jolmeon Dr. Pleasanton, CA 94586 J. M. Ortega Applied Math Department Univemity of Virginia Charlottesville, VA 22903 John Palmer TMC 245 Fimt St. Cambridge, MA 02142 Robert J . Paluck Convex Computer Corp.

Donna Perez NOSC - MCAC

Phyeia Div. PO Box 808 Lawrence Livermore Nat’1Lab Livennore, CA 84550

Reeource Cntr.

Code 912 San Diego, CA 92152-5LXKI K. Perko Supercomputing Solutions, Inc. 6175 Mancy Ridge Dr. San Diego, CA 92121 John Petresky Ballistic Research Lab SLCBRLF-C Aberdeen Prov. Gmd., MD 21MM-5006 Linda Petzold L-316 Lawrence Livennore Natl . Lab. Livermore, CA 9455o Wayne Pfeiffer San Diego SC Center PO Box 856o8 San Diego, CA 92136 Frank X. Pfenneberger Martin Marietta MS-N33104 National Test Bed Falcon AFB, CO 80912-~ Dr. Leslie Pierre SDIO/ENA The Pentagon Washington, DC 20301-7100 Paul Plaesman Math and Computer Science Division Argonne National Lab Argonne, IL 6043S 18

J. R. Rice Computer Science Dept. Purdue University West Lafayette, IN 479o7 John Richardson Thinking Machines Corporation 245 Fhat Street Cambridge, MA 02142-1214 Lt. Col. George E. Richie Chief, ADv. Tech Plans JOSDEPS Peterson AFB, CO 80914-5fK13 John Rollett Oxford University Computing Laboratory %11 Keble Road Oxford, OX1 3QD United Kingdom R. Z. h!?kk Physics and Astronomy Dept. 100 Allen Hall University of Pittsburg Pittsburg, PA 15206 Diane Rnver Michigan State Univemity Dept. of Electrical Engineering 260 Engineering Bldg. East Lansing, MI 48824 Y. sad University of Minnesota 4-192 EE/CSci Bldg. 2LMUnion St. Minneapolis, MN 5545&O159

P. wappm Dept. of Conlp. & Info SCiena Ohio State Univ.-228 Bolz HaU 2036 Neil Ave. Colurnbu, OH 43210-1277 Jod Sdtz Computer Science Department A.V. Williams Building university of Maryland College Park, MD 20742

L. SDirector, Supercbmputer Apps. 152 sup —puter Applications Bldg. 605 E. Sphgfield Chrunpaign, IL 618431 Wk R. Somaky Ballistic Remueb Laboratory SLCBR-SEA, Bldg. 394 Aberdeen Proving Ground, MD 21005

A. H. Sarneh

CSRD 305 Tdbot Laboratory University of Illinois 104 s. wright St. Urbana, IL 61801 P. E. Saylor Dept. of Comp. Saence 222 Digital Computation Univunity of Illinois Urbana, IL 61801

Anthony Skjellum Lawrence Lkmrnore National Laboratol “Y 7000 E-ad Ave., Mail Code L316 Llvermore, CA 94550

D. C. Sorenson Department of Math

LCDR Robert J. sCbOppG Chief, Operatiom Rqmts USSPACECOM/JOSDEPS Peterson AFB, CO 80914

s.

(Stop 35)

M. H. Sclodtz Department of Computer Science Y.de Univen3ity PO Box 2158 New Haven, CT 06520 Da= Schwartz NOSC, Code 733 San Diego, CA 92152A5C4Kl

Waahingto~

Allan Ton-es

Loe Alamoa National Lab PO Box 1666, MS F663 Los Ahrnos, NM 87545

Squires

N. Srinivasan AMOCO Corp PO 87703 Chicago, IL 60680-0703 Thomas Stegmann Digital Equipment Corporation 8085 S. Cbeskr Street Engiewood, CO 80112 D. E. Stein AT&T 100 South Jefferson Rd. Whippally, NJ 07981 M. Steuerwdt Division of Math Sciences National Science Foundation Washington, DC 20550 Mike Stevens nCUBE 919 E. Hillsdale Blvd. Foster City, CA 94404

A. H. Sh— Sa. Computing Amoc. Inc. Suite 307, 246 Church Street New Haven, CT 06510

G. W. Stewart Computer Stience Department University of Maryland Colfege Park, MD 20742

Horst Simon

O. StOrasdi MS-244 NASA Langley Research Cntr. Hampton, VA 23665

NAS Systems Divti]on NASA Amen Research Center Mail Stop T045-1 MofTett Field, CA 94035

c. Richard Sincovec Oak Ridge National Laboratory

DC 20550

Harold ‘lYeMe

Mark Seager LLNL, L80 PO box 803 Livermore, CA 94550

Aehok Singlud CFD Reach. Center 3325 ‘IMana Blvd. Huntsville, AL 35805

Math Sciences National ScienceFoun&Ion

Randy ‘human Mechanical Engineering Dept. University of NM Albuquerque,

Mail Stop T04$1 Moffett Field, CA 9403$1OW

Vineet Singb HP Labs, Bldg. lU, MS 14 1501 Page Mill RcA P.do Alto, CA 94304

A. Tludcr Division of

Scien-

DARPA/ISTO 1400 Wilson Blvd. Arlington, VA 11109

Rob SCbreibcr RIAcs NASA Am- ReeearcbCemter

P.O. Box 2008, Bldg 6012 Oak Ridge, TN 37831-6367

H. Teuteberg Cray Research, Suite 830 6565 American Pkway, NE Albuquerque, NM 87110

125 Lincoln Ave., Suite 400 Santa Fe, NM 87501

Rice University PO BOX1892 Houston, TX 77261 Lab

Gligor A. Taddcovicb PO Box 2% Pound Ridge, NY 1057& cY296

Stwt

NM 87131

Ray Tuminaro CERFACS 42 Ave Gustaw Coriolis 31057 Toulouse Cedex l%nce Mark Urmtia Intel Corp., CO1-03 5200 NE Elam Youns f%vy. Hilleboro, ORE 971246497 Mike Uttormark UW-Madiaon lsw Johnson Dr. Madison, WI 537o6 R. VanDeGeijn Computer Science Department University of Texas Austin, TX 78712 George Vandergrift Dist. Mgr. Convex Computer Corp. 3916 Juan Tabo, NE, Suite 38 Albuquerque, NM 87111 H. VanDerVomt Delft University of Technology Facufty of Mathematics POB 356 26oO AJ Ddft The Netherlands

c. vanLaan

DARPA/TTO 1400 Wilson Blvd. Arlington, VA 22209

Department of Computer Science Cornell University, Rm. 5146 Ithaca, NY 14853

LTC Jarnea Sweeder SDIO/SDA The Pentagon Washington, DC 20301-7100

John VanRosendde ICASE, NASA Langley Researrh Center MS 132C Hampton, VA 23665

R. A. Tapia Mathematical =1. Dept Rice University PO Box 1892 Houston, TX 77251

Steve Vavasis Dept. of Computer Science Upson Hall COmeU Univemity Ithaca, NY 14853

19

R. G. Voigt MS 132-C NASA Langley Ikcb Hampton, VA 36665

Cntr, ICASE

Phuong Vu Cray ReaeadL Inc. 19607 pram Road Houston, TX 77084 David Walker Bldg 6012 Oak Ridge National Lab PO Box 2008 Oak Ridge, TN 37B31 Stevm J. Wallach Convex Computer Corp. 3000 WaterView ParkwW PO Box 833851 Ricbardno~ TX 75083-3651 R. C. Ward Bid. 9207-A Mathernatimf %iencen Oak Ridge National Lab PO Box 4141 Oak Ridge, TN 37831-8083 Thomas A. Weber National Science Found 18Lx3G. Street, NW Washington, DC 20550 G. W. Weigand DARPA/CSTO 37o1 N. Fairfax Ave. Arfington, VA 22203-1714 M. F. Wheeler Math Sciences Dept Rice university PO Box 1892 Houston. TX 77251 A. B. White MS-265 k Ahmms National Lab PO Box 16433 Los Alamm, NM 87544 B. Wilcox DARPA/DSO 1400 Wilson Blvd. Arlington, VA 222o9 Roy Wifliams California Institute of Technology 20&49 Pasadena, CA 91104 C. W. Wif.90n MS MI02L3/Bll Digitaf Equipment Corp. 146 Main Street Maynar~ MA IM175

K. G. Wilson Physics Dept. Ohio State University Columbus, OH 43210 Leonad T. Wifncm NSWC Code G22 Dahlgmn, VA 22448 Peter WOlochOw

fntef Corp., CO1-03 52OONE Elam Young Pkwy. Hiflsboro, OR 97124-6497 P. R. Woodward University of Minnesota Department of Astronomy 116 Chumh Street, SE Minneapolis, MN 55455 M. Wunderli& Math. Sciencm Program National Security Agency Ft. George G. Mead, MD 20755 Hishashi Yasumori KTEC-Kawasaki Steel Techrwrenearch Corporation Hiblya Kokusai Bldg. 2-3 Uchisaiwaicho ‘khrome Chiyoddm, Tokyo lCII David Young Center for Numerical Analysis RLM 13.150 The University of Texax Austin, TX 78713-8510 Robert Young Alcoa Laboratories Alcoa Centw, PA 15069 Attm R. Youn8 & J. McMichaef Wilfiam Zierke Applied Research LabPenn PO Box 30 State CoUege, PA 168C14 INTERNAL DISTRIBUTION: Pauf Fleury Ed Bamis Sudip Dosanjh Bilf Camp Doug Cline David Gardner Grant Heffelfinger Scott Hutchinson Martin Lewitt Steve Plimpton Mark Sears John Sbadid Julie Swisshelrn Dick Allen Bruce Hendrickson (25) David Womble Ernie Brickell Kevin McCurley Robert Benner Carl Diegert Art Hale Rob Leland (25) Courtenay Vaughan Steve Attaway Johnny Bif3e Mark Blanford Jim Schutt Michael McGlaun Allen Robinson Pauf Barrington David Martinez Dons Crawford William Mason Technical Library (5) Technical Publication Document Processing for DOE/OSTI (10) Centraf Technical Fite Charles Tong 20

State.

1400 1402 1421 1421 1421 1421 1421 1421 1421 1421 1421 1421 1422 1422 1422 1423 1423 1424 1424 1424 1424 1424 1425 1425 1425 1425 1431 1431 1432 1434 lWO 1952 7141 7151 7613-2 8523-2 8117