SANDIA REPORT
SAND92-2765 . UC-405
Unlimited Release
Printed March 1993

An Efficient Parallel Algorithm for Matrix-Vector Multiplication

Bruce Hendrickson, Robert Leland, Steve Plimpton

Prepared by Sandia National Laboratories, Albuquerque, New Mexico 87185 and Livermore, California 94550, for the United States Department of Energy under Contract DE-AC04-76DP00789.
SAND92-2765 Unlimited Release Printed March 1993
Distribution Category UC-405
An Efficient Parallel Algorithm for Matrix–Vector Multiplication Bruce Hendrickson, Robert Leland and Steve Plimpton Sandia National Laboratories Albuquerque, NM 87185
Abstract. The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if we are to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.

Key words. matrix-vector multiplication, parallel computing, hypercube, conjugate gradient method

AMS(MOS) subject classification. 65Y05, 65F10

Abbreviated title. Parallel Matrix-Vector Multiplication

This work was supported by the Applied Mathematical Sciences program, U.S. Department of Energy, Office of Energy Research, and was performed at Sandia National Laboratories, operated for the U.S. Department of Energy under contract No. DE-AC04-76DP00789.
1. Introduction. The multiplication of a vector by a matrix is the kernel computation in many linear algebra algorithms, including, for example, the popular Krylov methods for solving linear and eigen systems. Recent improvements in such methods, coupled with the increasing use of massively parallel computers, require the development of efficient parallel algorithms for matrix-vector multiplication. This paper describes such an algorithm. Although the method works on all parallel architectures, it is particularly well suited to machines with hypercube interconnection topology, for example the Intel iPSC/860 and the nCUBE 2.

The algorithm described here was developed independently in connection with research on efficient methods of organizing parallel many-body calculations (see [5]). We subsequently learned that our algorithm is very similar in structure to a parallel matrix-vector multiplication algorithm described in [4]. We have, nevertheless, chosen to present our algorithm because it improves upon that in [4] in several ways. First, we specify how to overlap communication and computation and thereby reduce the overall run time. Second, we show how to map the blocks of the matrix to processors in a novel way which improves the performance of a critical communication operation on current hypercube architectures. And third, we consider the actual use of the algorithm within the iterative conjugate gradient solution method and show how in this context a small amount of redundant computation can be used to further reduce the communication requirements. By integrating these improvements we have been able to achieve significantly better performance on a well known benchmark than has been previously possible with a massively parallel machine.

A very attractive property of the new algorithm is that its communication operations are independent of the sparsity pattern of the matrix, making it applicable to all matrices. For an n x n matrix on p processors, the cost of the communication is O(n/√p + log(p)). However, many sparse matrices exhibit structure which allows for other algorithms with even lower communication requirements. Typically this structure arises from the physical problem being modeled by the matrix equation and manifests itself as the ability to reorder the rows and columns to obtain a nearly block-diagonal matrix, where the p diagonal blocks are about equally sized and the number of matrix elements not in the blocks is small. This structure can also be expressed in terms of the size of the separator of the graph describing the nonzero structure of the matrix. Our algorithm is clearly not optimal for such matrices, but there are many contexts where the matrix structure is not helpful (e.g. dense matrices, random matrices), or the effort required to identify the structure is too large to justify. It is these settings in which our algorithm is most appropriate and provides high performance.

This paper is structured as follows. In the next section we describe the algorithm and its communication primitives. In §3 we present refinements and improvements to the basic algorithm, and develop a performance model. In §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5.

2. A parallel matrix-vector multiplication algorithm.
Iterative solution methods for linear and eigen systems are one of the mainstays of scientific computation. These methods involve repeated matrix-vector products or matvecs of the form yi = A xi, where the new iterate, xi+1, is generally some simple function of the product vector yi. To sustain the iteration on a parallel computer, it is necessary that xi+1 be distributed among processors in the same fashion as the previous iterate xi. Hence, a good matvec routine will return a yi with the same distribution as xi so that xi+1 can be constructed with a minimum of data movement. Our algorithm respects this distribution requirement.

We will simplify notation and consider the parallel matrix-vector product y = Ax, where A is an n x n matrix and x and y are n-vectors. The number of processors in the parallel machine is denoted by p, and we assume for ease of exposition that n is evenly divisible by p and that p is an even power of 2. It is fairly straightforward to relax these restrictions.

Let A be decomposed into square blocks of size (n/√p) x (n/√p), each of which is assigned to one of the p processors, as illustrated by Fig. 1. We introduce the Greek subscripts α and β, running from 0 to √p - 1, to index the row and column ordering of the blocks. The (α, β) block of A is denoted by Aαβ and owned by processor Pαβ. The input vector x and product vector y are also conceptually divided into √p pieces indexed by β and α respectively. Given this block decomposition, processor Pαβ must know xβ in order to compute its contribution to yα. This contribution is a vector of length n/√p which we denote by zαβ. Thus zαβ = Aαβ xβ, and yα = Σβ zαβ, where the sum is over all the processors sharing row block α of the matrix.

[Fig. 1. Structure of matrix product y = Ax.]

2.1. Communication primitives. Our algorithm requires three distinct patterns of communication. The first of these is an efficient method for summing elements of vectors owned by different processors, and is called a fold operation in [4]. We will use this operation to combine contributions to y owned by the processors that hold a block row of A. The fold operation is sketched in Fig. 2 for communication among processors with the same block row index α. Each processor begins the fold operation with a vector zαβ of length n/√p. The operation requires log2(√p) stages, halving the length of the vectors involved at each stage. Within each stage, a processor first divides its vector z into two equal sized subvectors, z1 and z2, as denoted by (z1|z2). One of these subvectors is sent to another processor, while the other processor sends back its contribution to the subvector which remained. The received subvector is summed element-by-element with the retained subvector to finish the stage. At the conclusion of the fold, each processor has a unique, fully summed portion of the vector of length n/p. We denote this subvector with Greek superscripts, hence Pαβ owns portion y^αβ. The fold operation requires no redundant floating point operations, and the total number of values sent and received by each processor is n/√p - n/p.

    Processor Pαβ knows zαβ ∈ R^(n/√p)
    z := zαβ
    For i = 0, ..., log2(√p) - 1
        (z1|z2) := z
        Pαβ' := Pαβ with the ith bit of β flipped
        If bit i of β is 1 Then
            Send z1 to processor Pαβ'
            Receive w2 from processor Pαβ'
            z := z2 + w2
        Else
            Send z2 to processor Pαβ'
            Receive w1 from processor Pαβ'
            z := z1 + w1
    y^αβ := z
    Processor Pαβ now owns y^αβ ∈ R^(n/p)

    Fig. 2. The fold operation for processor Pαβ as part of block row α.
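To make the fold concrete, the following sketch implements the loop of Fig. 2 in C. It is an illustration only, not the report's implementation: it assumes MPI-style message passing (MPI_Sendrecv) rather than the nCUBE/iPSC native calls, and it assumes a communicator row_comm containing the √p processors of one block row, ranked by their column index β.

    #include <mpi.h>
    #include <stdlib.h>

    /* Fold within one block row: on entry each of the sqrt_p processors holds a
     * length-(n/sqrt_p) partial sum in z; on exit z[0..n/p) holds this processor's
     * distinct, fully summed piece.  Sketch only: row_comm is assumed to hold the
     * sqrt_p processors of one block row, ranked by their column index beta. */
    void fold(double *z, int len, int beta, int sqrt_p, MPI_Comm row_comm)
    {
        double *w = malloc((len / 2) * sizeof(double));
        int stages = 0;
        for (int q = sqrt_p; q > 1; q >>= 1) stages++;      /* log2(sqrt_p) */

        for (int i = 0; i < stages; i++) {
            int half = len / 2;
            int partner = beta ^ (1 << i);                  /* flip bit i of beta */
            double *send, *keep;
            if (beta & (1 << i)) { send = z;        keep = z + half; } /* send z1, keep z2 */
            else                 { send = z + half; keep = z;        } /* send z2, keep z1 */

            MPI_Sendrecv(send, half, MPI_DOUBLE, partner, i,
                         w,    half, MPI_DOUBLE, partner, i,
                         row_comm, MPI_STATUS_IGNORE);

            for (int k = 0; k < half; k++)                  /* element-by-element sum */
                keep[k] += w[k];
            if (keep != z)                                  /* retained half becomes the new z */
                for (int k = 0; k < half; k++) z[k] = keep[k];
            len = half;
        }
        free(w);
    }

Each pass sends and receives half of the current working vector, so the totals sum to the n/√p - n/p volume quoted above.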
In the second communication operation, called an expand in [4], each processor in a column begins with a subvector of length n/p, and when the operation finishes all the processors in the column must know all n/√p values in the union of their subvectors. We use a simple algorithm that is essentially the inverse of the fold operation. The expand operation is outlined in Fig. 3 for communication between processors with the same column index β. At each step in the operation a processor sends all the values it knows to another processor in the column and receives that processor's values. These two subvectors are concatenated, as indicated by the "|" notation. As with the fold operation, only a logarithmic number of stages are required, and the total number of values sent and received by each processor is n/√p - n/p.

The optimal implementation of the fold and expand operations depends on the machine topology and various hardware considerations, e.g. the availability of multiport communication. There are, however, efficient implementations on most architectures. On hypercubes, for example, these operations can be implemented using only nearest neighbor communication if the blocks in each row and column of the matrix are owned by a subcube with √p processors. On meshes, if the blocks of the matrix are mapped in the natural way to a square grid of processors, the fold and expand can also be implemented efficiently [9].

    Processor Pαβ knows y^βα ∈ R^(n/p)
    z := y^βα
    For i = log2(√p) - 1, ..., 0
        Pα'β := Pαβ with the ith bit of α flipped
        Send z to processor Pα'β
        Receive w from processor Pα'β
        If bit i of α is 1 Then
            z := w|z
        Else
            z := z|w
    yβ := z
    Processor Pαβ now knows yβ ∈ R^(n/√p)

    Fig. 3. The expand operation for processor Pαβ as part of block column β.
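A matching sketch of the expand of Fig. 3 follows, under the same assumptions: MPI calls, a communicator col_comm holding the √p processors of one block column ranked by the row index α, and a buffer with room for the final n/√p values. The loop runs over the bits in the opposite order from the fold, which compensates for the order in which the fold distributed its pieces, so after the transpose the concatenated values come out in their natural order.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Expand within one block column (inverse of the fold): on entry buf[0..len)
     * holds this processor's length-(n/p) piece; on exit buf[0..len*sqrt_p) holds
     * the full block-column subvector on every processor.  Sketch only: col_comm
     * is assumed to hold the sqrt_p processors of one block column, ranked by
     * alpha, and buf is assumed large enough for the final length. */
    void expand(double *buf, int len, int alpha, int sqrt_p, MPI_Comm col_comm)
    {
        int stages = 0;
        for (int q = sqrt_p; q > 1; q >>= 1) stages++;      /* log2(sqrt_p) */
        double *w = malloc((len << stages) * sizeof(double));

        for (int i = stages - 1; i >= 0; i--) {
            int partner = alpha ^ (1 << i);                 /* flip bit i of alpha */
            MPI_Sendrecv(buf, len, MPI_DOUBLE, partner, i,
                         w,   len, MPI_DOUBLE, partner, i,
                         col_comm, MPI_STATUS_IGNORE);
            if (alpha & (1 << i)) {                         /* z := w | z */
                memmove(buf + len, buf, len * sizeof(double));
                memcpy(buf, w, len * sizeof(double));
            } else {                                        /* z := z | w */
                memcpy(buf + len, w, len * sizeof(double));
            }
            len *= 2;                                       /* working vector doubles */
        }
        free(w);
    }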
The third communication operation in our matvec algorithm is a transpose of the matrix of processors, i.e. Pαβ sends a subvector to Pβα. Since we want row and column communication to be efficient for the fold and expand operations, this transpose can be difficult to implement without congestion, because each processor must send a message to a processor which is architecturally distant. On hypercubes a large number of messages may be forced to share wires, so the potential for message congestion is great. However, even if congestion is unavoidable, the length of the message each processor sends in the transpose is about √p less than the length of the messages in the fold and expand operations, so a transpose message can be delayed by a factor of √p without changing the overall scaling of the algorithm. We have devised a mapping of the blocks of the matrix to processors for which the transpose on a hypercube is congestion free, which is discussed in §3.1. Implementations tailored to other architectures may benefit from a similarly optimal transpose.

2.2. The matrix-vector multiplication algorithm. We can now present our matvec algorithm for computing y = Ax in Fig. 4. Further details and enhancements are presented in the following section. All the numerical operations in the algorithm are performed in steps (1) and (2). In step (1), each processor performs the local matrix-vector multiplication involving the portion of the matrix it owns. These values are summed within rows in step (2) using the fold operation sketched in Fig. 2, after which processor Pαβ owns the subsegment y^αβ of yα, and each processor owns n/p of the values of y. Unfortunately, Pαβ owns a portion of yα, whereas to perform the next matvec it must know all of yβ, the values owned by the processors in column block β of A. Steps (3) and (4) fix this up. In step (3), each processor exchanges its subsegment of y with the processor owning the transpose block of the matrix. After the transposition, the values of yβ are distributed among the processors in column block β of A. The expand in step (4) gives each of them all of yβ, so the result is distributed as required for a subsequent matvec.

We note that at this level of detail our algorithm is essentially identical to the one described in [4] for dense matrices, but as we discuss in the next section, the details of steps (1), (2) and (3) are different and result in a more efficient overall algorithm.

    Processor Pαβ owns Aαβ and xβ
    (1) Compute zαβ = Aαβ xβ
    (2) Fold zαβ within rows to form y^αβ
    (3) Transpose the y^αβ, i.e.
        a) Send y^αβ to Pβα
        b) Receive y^βα from Pβα
    (4) Expand y^βα within columns to form yβ

    Fig. 4. Parallel matrix-vector multiplication algorithm for processor Pαβ.
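The four steps of Fig. 4 can be strung together as in the schematic driver below. This is not the report's code: the local block is stored densely purely for brevity, processors are assumed to be numbered α√p + β in MPI_COMM_WORLD (not the congestion-free mapping of §3.1), and fold and expand refer to the earlier sketches.

    #include <mpi.h>

    /* From the earlier sketches (assumed available). */
    void fold(double *z, int len, int beta, int sqrt_p, MPI_Comm row_comm);
    void expand(double *buf, int len, int alpha, int sqrt_p, MPI_Comm col_comm);

    /* One parallel matvec y = A*x for processor P(alpha,beta), following Fig. 4.
     * A_local is the nb x nb block A_{alpha,beta} stored by rows (dense only for
     * brevity), x_local holds x_beta of length nb = n/sqrt_p, work is scratch of
     * length nb, and y_local (length nb) returns y_beta.  World ranks are assumed
     * to be alpha*sqrt_p + beta. */
    void matvec(const double *A_local, const double *x_local, double *y_local,
                double *work, int nb, int alpha, int beta, int sqrt_p,
                MPI_Comm world, MPI_Comm row_comm, MPI_Comm col_comm)
    {
        /* (1) Local block product: work = A_{alpha,beta} * x_beta. */
        for (int i = 0; i < nb; i++) {
            double s = 0.0;
            for (int j = 0; j < nb; j++)
                s += A_local[i * nb + j] * x_local[j];
            work[i] = s;
        }

        /* (2) Fold within the block row: work[0..nb/sqrt_p) becomes y^{alpha,beta}. */
        fold(work, nb, beta, sqrt_p, row_comm);
        int piece = nb / sqrt_p;

        /* (3) Transpose: exchange pieces with P(beta,alpha). */
        int partner = beta * sqrt_p + alpha;     /* rank of P(beta,alpha) under the assumed numbering */
        MPI_Sendrecv(work,    piece, MPI_DOUBLE, partner, 0,
                     y_local, piece, MPI_DOUBLE, partner, 0,
                     world, MPI_STATUS_IGNORE);

        /* (4) Expand within the block column: y_local[0..nb) becomes y_beta. */
        expand(y_local, piece, alpha, sqrt_p, col_comm);
    }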
3. Algorithmic details and refinements.

3.1. Transposition on hypercubes. The transpose operation in our matvec algorithm is potentially inefficient since it requires communication between processors that are architecturally distant. On a hypercube computer a single message can be transmitted between non-adjacent processors at nearly the same speed as if it were sent between adjacent processors, because modern parallel computers use cut-through routing. Nevertheless, if multiple messages are simultaneously trying to use the same wire, all but one of them must be delayed, so even machines with cut-through routing can still suffer from serious congestion.

Unfortunately, the usual scheme of assigning blocks to processors induces such a bottleneck in the transpose operation. The natural mapping on a hypercube uses the low order bits of the processor address to encode the column number of the block and the high order bits to encode the row number, so that rows and columns of the matrix are owned by subcubes and the fold and expand primitives allow for fast communication; a similar scheme is used on mesh architectures with submeshes. The routing procedure on hypercubes known as dimension order routing compares the bit addresses of the sending and receiving processors and flips the bits in a fixed order (transmitting the message along the corresponding channel) until the two addresses agree. On the nCUBE 2 and Intel iPSC/860 hypercubes the order of comparisons is from lowest bit to highest, so with the natural mapping a message in the transpose first moves within a row before moving within a column. Thus a message from processor 1001 to processor 0100 will route from 1001 to 1000 to 1100 to 0100. With this mapping, congestion occurs in the transpose because the messages from all the √p processors in a row route through the diagonal processor, which delays the transpose by a factor of √p. Fortunately, since the messages in the transpose are shorter than those in the fold and expand operations, the overall communication scaling of the algorithm will not be affected even with this congestion.

On a hypercube, a different mapping of matrix blocks to processors can avoid transpose congestion altogether. With this mapping we still have nearest neighbor communication in the fold and expand operations, but now the transpose operation is as fast as sending and receiving a single message of length n/p. Consider a d-dimensional hypercube, where the address of each processor is a d-bit string; for simplicity we assume that d is even. The block row number α is a d/2-bit string, as is the block column number β. For fast fold and expand operations, we require that the processors in each row and column of blocks form a subcube. This is assured if any set of d/2 bits in the d-bit processor address encode the block row number and the other d/2 bits encode the block column number. Now consider a mapping in which the bits of the block row and block column indices of the matrix are interleaved alternately in the processor address. For a 64-processor hypercube (with 8x8 blocks of the matrix) this means the 6-bit processor address would be r2c2r1c1r0c0, where the three bits r2r1r0 encode the block row index and c2c1c0 encodes the block column index. Note that in this mapping each row of blocks and column of blocks of the matrix is still owned by a subcube of the hypercube, so the expand and fold operations can still be performed using only nearest neighbor communication.
the proof assumes contention
THEOREM elements
scheme
and fold, operations
3.1.
Consider
a hypercube
in the processor’s
to the processor
Proof.
Consider
using
bit-address
in the transpose
a processor
location
will have with bit-address
transmitted
in as many stages
a sequence
of intermediate processor
as there
patterns.
denoted
of intermediate
by the following theorem.
Although
Now consider
connected
row number
to
and column
are disjoint.
rbQr& lcb- 1... roco, where
are bits, flipping After
and map processors
each stage,
in the transpose
order routing,
is
array
a message
is
bits in order
from right to left to generate
the message
will have been
intermediate
bit pattern.
two processors
After 2k stages,
dimension
PT
the row number
to the
The wires used in routing
whose patterns
the intermediate processor
routed
occur consecutively
processor
the
in the
will have the pattern
are a simple permutation
of the
Also, after 2k – 1 stages,
the
2k and 2k – 1 are equal.
another
usea the same
processor
wire employed
P’ # P, and assume
that
the message
i of the transmission
in step
being
from P to PT.
by this wire by P1 and Pz. Since they differ in bit position
consecutively
in the transition
i – 1 or i is even, so a simple permutation
P.. Similarly,
bits are
id. Then ihe wires used when each processor sends
bits of P in which the lowest k pairs of bits have been swapped.
be encountered
a similar
as long as row and column
with cb . . . co. The processor
by the current
patterns.
values in the bit positions
Either
However,
of a processor’s
. . . coro. The bits of this intermediate
rbcb . e .rkckck_~rk_~
processors
on
optimally.
order routing,
cbrbcb- 1r& 1 . . coro. Under
message from P to P= are those that connect
to Pm
scheme
dimension
P with bit-address
with rb . . . r., and the column number
sequence
still resides
in order from loweat to highest,
location in the array
encoded
intermediate
index.
as demonstrated
for any tlxed routing
bits ~rl r.
alternately.
are interleaved
a message
for the 8x8 blocks of
can be performed
of an array in such a way thai the bit-repwsentations
number
a mapping
would be ~c2r1c1 roco where the three
where bits are flipped
is possible
encode the
in the processor
each row of blocks and column of blocks of the matrix
so the expand
a routing
row and column
the block column
is now contention-free
free mapping
forced to change
original
3-bit
in each row and
Now consider
are interleaved
For a 64–proceaaor
w is the column
address
where the bits of the block row and block column indices of the matrix
of
is a d-bit string.
the processors
if any set of d/2 bits in the d-bit
d/2 bits encode
in the fold and expand
the same permutation
then P = P’ which is a contradiction.
applied
between
stages
Denote
the two
i, PI and Pz can only
of pairs of bits of P must generate
either
algorithm.
P1 or P2; say
PI or Pz; say Pi. If P. = P!
both P1 and P2 must appear 7
from P’
i - 1 and i of the routing
to P’ must also yield either
Otherwise,
routed
after an odd number
of
stages
in one of the routing
even then bits i and ithat
If i is odd then bits i and i + 1 of P must be equal, and if i is
sequences.
1 of P are equal.
case, P1 = P2 which again implies the contradiction
In either
P = P’. Cl 3.2.
3.2. Overlapping communication and computation. The algorithm in Fig. 4 has the shortcoming that once a processor has sent a message in the fold or expand operations, it is idle until the message from its neighbor arrives. If a processor is able to both compute and communicate simultaneously, then the algorithm can be implemented to hide much of this idle time. This can be accomplished by interleaving the communication in the fold operation with the computation from step (1). Rather than computing all the elements of zαβ before beginning the fold operation, a processor should compute just those elements that are about to be sent in the current pass. Then, while the message is in transit, the elements that will be sent in the next pass through the fold loop get computed. In the final pass, the values that the processor will keep are computed. The implementation must also ensure that zero values are used for any elements of zαβ that have not yet been computed when a received subvector is summed in the fold operation. In this way, the run time is reduced on each pass through the fold loop by the lesser of the message transmission time and the time to compute the next set of elements.
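The interleaving can be organized as below. This is a schematic of the idea rather than the report's implementation: accumulate_rows stands for the local (sparse) matvec restricted to a row range, adding into z so that untouched entries remain zero as required above, and nonblocking MPI calls stand in for the asynchronous sends and receives of the original machines.

    #include <mpi.h>
    #include <string.h>

    /* Assumed elsewhere: z[r] += (A_local * x)[r] for lo <= r < hi. */
    void accumulate_rows(double *z, int lo, int hi);

    /* Fold within a block row with computation overlapped, per Section 3.2.
     * z has length len = n/sqrt_p, w is scratch of at least len/2 doubles. */
    void fold_overlapped(double *z, int len, int beta, int sqrt_p,
                         MPI_Comm row_comm, double *w)
    {
        int stages = 0;
        for (int q = sqrt_p; q > 1; q >>= 1) stages++;

        memset(z, 0, len * sizeof(double));              /* uncomputed entries stay zero */

        int base = 0, half = len / 2;
        int send_off = (beta & 1) ? base : base + half;  /* rows shipped at stage 0 */
        accumulate_rows(z, send_off, send_off + half);   /* must be ready before the first send */

        for (int i = 0; i < stages; i++) {
            int keep_off = (send_off == base) ? base + half : base;
            int partner  = beta ^ (1 << i);
            MPI_Request sreq, rreq;
            MPI_Isend(z + send_off, half, MPI_DOUBLE, partner, i, row_comm, &sreq);
            MPI_Irecv(w,            half, MPI_DOUBLE, partner, i, row_comm, &rreq);

            /* Overlap: compute the rows to be shipped in the next pass, or on the
             * last pass the rows this processor finally keeps. */
            int nbase = keep_off, nhalf = half / 2, nsend, ncount;
            if (i + 1 < stages) {
                nsend  = (beta & (1 << (i + 1))) ? nbase : nbase + nhalf;
                ncount = nhalf;
            } else {
                nsend  = nbase;
                ncount = half;
            }
            accumulate_rows(z, nsend, nsend + ncount);

            MPI_Wait(&rreq, MPI_STATUS_IGNORE);          /* fold in the partner's half */
            for (int k = 0; k < half; k++)
                z[keep_off + k] += w[k];
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);

            base = keep_off;  send_off = nsend;  half = nhalf;
        }
        /* z[base .. base + len/sqrt_p) now holds this processor's fully summed piece. */
    }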
3.3. Load balancing. The discussion above has concentrated on the communication requirements of our algorithm, but an efficient algorithm must also ensure that the computational load is well balanced across the processors. Most of the computation occurs within each local matvec, with some of the floating point operations occurring during the fold summations. If the region of the matrix owned by a processor has m' nonzeros, the number of floating point operations (flops) the processor performs is 2m' - n/√p. These flops will be balanced if m' ≈ m/p for each processor, where m is the total number of nonzero elements in the matrix. For dense matrices or random matrices in which m >> n, the load is likely to be balanced. However, for matrices with some structure it may not be. For these problems, Ogielski and Aiello have shown that randomly permuting the rows and columns gives good balance with high probability [8].

Most matrices used in real applications have nonzero diagonal elements, and we have found that it can be advantageous to force an even distribution of these among the processors. This can be accomplished by first applying a random symmetric permutation to the rows and columns of the matrix, which preserves the diagonal, and then mapping the diagonal elements to processors explicitly while the remaining off-diagonal elements are mapped randomly. Distributing the diagonal in this way has the additional advantage that the contribution of the diagonal elements can be computed in between the send and receive operations of the transpose and added to the received y subsegment, saving either this computation or part of the communication time, whichever is smaller.
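A minimal sketch of the random symmetric permutation follows; the coordinate-format storage and helper names are assumptions made for illustration.

    #include <stdlib.h>

    /* Randomly permute the rows and columns of a matrix with the SAME permutation,
     * so that symmetry and the diagonal are preserved, as suggested in Section 3.3
     * for balancing the nonzeros among processors [8]. */
    typedef struct { int row, col; double val; } entry_t;

    void random_symmetric_permutation(entry_t *a, long nnz, int n, unsigned seed)
    {
        int *perm = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) perm[i] = i;

        srand(seed);                                  /* Fisher-Yates shuffle */
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }

        for (long k = 0; k < nnz; k++) {              /* A'(perm[i],perm[j]) = A(i,j) */
            a[k].row = perm[a[k].row];
            a[k].col = perm[a[k].col];
        }
        free(perm);
    }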
3.4. Complexity model. A matrix-vector multiplication using our algorithm can be implemented to require 2m - n floating point operations, where m is the number of nonzeros in the matrix; most of these occur in the local matvecs and the rest in the fold summations. We make no assumptions about the data structure used on each processor for the local matvecs, which allows for the implementation of whatever local multiplication kernel works best on the particular hardware. If we assume that the computational load is balanced by using the techniques described in §3.3, the time to execute these floating point operations is (2m - n) Tflop / p, where Tflop is the time required for a single floating point operation.

The algorithm requires log2(p) + 1 send and receive pairs and a total communication volume of n(2√p - 1)/p values for each processor. Unless the matrix is very sparse, the transmission time in the fold operation can be hidden with computations as discussed in §3.2, and the computation involving the matrix diagonal can similarly hide the transpose, as described in §3.3; we will assume that this is the case, which reduces the effective communication volume to n(√p - 1)/p. Accounting for the communication, the total run time, Ttotal, can now be expressed as

    (1)    Ttotal = ((2m - n)/p) Tflop + (log2(p) + 1)(Tsend + Treceive) + (n(√p - 1)/p) Ttransmit,

where Tsend and Treceive are the times required to initiate a send and a receive operation respectively, and Ttransmit is the transmission time per floating point value. This model will be most accurate if message contention is insignificant, as it is with the mapping for hypercubes described in §3.1.
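As a quick illustration of how the model behaves, the routine below evaluates equation (1) as written above. The matrix size and machine constants are placeholders chosen only to show the trend with p; they are not measurements from the report.

    #include <math.h>
    #include <stdio.h>

    /* Equation (1): predicted matvec time for an n x n matrix with m nonzeros
     * on p processors.  All machine constants are illustrative placeholders. */
    double predicted_time(double n, double m, double p,
                          double t_flop, double t_send, double t_recv, double t_transmit)
    {
        double sqrt_p = sqrt(p);
        return (2.0 * m - n) / p * t_flop
             + (log2(p) + 1.0) * (t_send + t_recv)
             + n * (sqrt_p - 1.0) / p * t_transmit;
    }

    int main(void)
    {
        double n = 10000.0, m = 1.0e6;                 /* illustrative problem size */
        for (double p = 64.0; p <= 1024.0; p *= 4.0)
            printf("p = %4.0f   T = %.4f s\n", p,
                   predicted_time(n, m, p, 5e-7, 1e-4, 1e-4, 2e-6));
        return 0;
    }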
4. Application to a parallel conjugate gradient solver. To examine the efficiency of our matrix-vector multiplication algorithm, we used it as the kernel of a parallel conjugate gradient (CG) solver. A version of the basic CG algorithm for solving the linear system Ax = b is depicted in Fig. 5. There are a number of variants of the CG algorithm which are algebraically equivalent; the one we present is essentially the version given in the NAS benchmark [1, 3] discussed later.

An efficient parallel implementation of the CG algorithm requires more than a fast matvec routine. In addition to the matrix-vector multiplication (forming y := Ap), the inner loop of the CG algorithm requires three vector updates (of x, r and p), as well as two inner products (γ and ρ'). The matvec should divide the workload evenly among the processors, and the vector updates can be accomplished without communication if the vectors are distributed among processors appropriately. Unfortunately, these goals are in conflict with the inner product calculations, because the inner products require communication among all the processors: if the algorithm in Fig. 5 is implemented in parallel, each processor must know the value of α before it can update x and r, and hence must know γ and ρ'. Each of these global operations can be accomplished with a single global communication, but while the communication volume is small, the time required to initiate the communication is not, so the global operations are still very costly.

One way to reduce the cost of this global communication is to modify the algorithm as shown in Fig. 6. The new algorithm is a slightly modified version of the basic CG method which is algebraically equivalent to the original. Instead of updating r and then calculating ρ' = rTr, it exploits the identity r(i+1)T r(i+1) = (ri - αy)T (ri - αy) = riT ri - 2α yT ri + α² yT y, as suggested by Van Rosendale [10]. In addition to γ = pTy, the modified algorithm computes φ = yTr and ψ = yTy. The values of γ, φ and ψ can be summed simultaneously with a single global communication, and ρ' can then be computed outside the global operation as ρ - 2αφ + α²ψ, halving the time spent in global communication per iteration. In exchange for this communication reduction, there is a net increase of one inner product calculation, since φ = yTr and ψ = yTy must now be computed but ρ' = rTr need not be calculated explicitly.
    x := 0;  r := b;  p := b
    ρ := rTr
    For i = 1, ...
        y := Ap
        γ := pTy
        α := ρ/γ
        x := x + αp
        r := r - αy
        ρ' := rTr
        β := ρ'/ρ
        ρ := ρ'
        p := r + βp

    Fig. 5. The basic conjugate gradient algorithm.

    x := 0;  r := b;  p := b
    ρ := rTr
    For i = 1, ...
        y := Ap
        γ := pTy;  φ := yTr;  ψ := yTy
        α := ρ/γ
        ρ' := ρ - 2αφ + α²ψ
        β := ρ'/ρ
        ρ := ρ'
        x := x + αp
        r := r - αy
        p := r + βp

    Fig. 6. A modified conjugate gradient algorithm requiring fewer global operations.

Since the vectors are distributed, computing the extra inner product requires an additional 2n/p floating point operations per processor. Whether the exchange is a net gain depends on the relative cost of communication and floating point operations on a particular machine.
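The payoff of Fig. 6 in parallel is that the three inner products can share one global combine. The sketch below uses MPI_Allreduce for that combine and assumes each processor holds matching, non-overlapping length-nlocal pieces of p, r and y; in Fig. 7 the sums must likewise be arranged so that each vector entry is counted exactly once.

    #include <mpi.h>

    /* Accumulate gamma = p.y, phi = y.r and psi = y.y locally and combine them in
     * a single global sum, as in the modified CG loop of Fig. 6. */
    void combined_inner_products(const double *p, const double *r, const double *y,
                                 int nlocal, MPI_Comm comm,
                                 double *gamma, double *phi, double *psi)
    {
        double local[3] = {0.0, 0.0, 0.0}, global[3];
        for (int k = 0; k < nlocal; k++) {
            local[0] += p[k] * y[k];      /* gamma = p^T y */
            local[1] += y[k] * r[k];      /* phi   = y^T r */
            local[2] += y[k] * y[k];      /* psi   = y^T y */
        }
        MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, comm);
        *gamma = global[0];  *phi = global[1];  *psi = global[2];
    }

    /* The caller then forms, per Fig. 6:
     *   alpha   = rho / gamma;
     *   rho_new = rho - 2.0*alpha*phi + alpha*alpha*psi;   // = r_{i+1}^T r_{i+1}
     *   beta    = rho_new / rho;                            */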
The complete parallel CG algorithm, with the matvec of Fig. 4 embedded and the modified inner product calculation of Fig. 6, is shown in Fig. 7 for processor Pαβ.

    x := 0;  r := b;  p := b
    ρ := rTr, summed over all processors
    Expand p within columns to form pβ
    For i = 1, ...
        Compute zαβ = Aαβ pβ
        Fold zαβ within rows to form y^αβ
        Transpose the y^αβ, i.e.
            Send y^αβ to Pβα
            Receive y^βα from Pβα
        γ := pTy;  φ := yTr;  ψ := yTy
        Sum γ, φ and ψ over all processors to form γ, φ and ψ
        α := ρ/γ
        ρ' := ρ - 2αφ + α²ψ
        β := ρ'/ρ
        ρ := ρ'
        x := x + αp
        r := r - αy
        p := r + βp
        Expand p within columns to form pβ

    Fig. 7. A parallel CG algorithm for processor Pαβ.
We used this algorithm to implement the NAS conjugate gradient benchmark [1, 3] on the nCUBE 2, obtaining the run time of 6.09 seconds quoted in the abstract. Our C code achieves about 250 Mflops, which is about 12% of the raw speed achievable by running pure assembly language BLAS on each processor without any communication.

5. Conclusions. We have presented a parallel algorithm for matrix-vector multiplication, and shown how this algorithm can be embedded within the conjugate gradient algorithm. The communication cost of this algorithm is independent of the zero/nonzero structure of the matrix and scales as n/√p. Consequently, the algorithm can serve as an efficient black-box matrix-vector multiplication routine for matrices in which structure is either difficult or impossible to exploit. This is clearly the case for dense and random sparse matrices, but it is also true more generally in many contexts. For example, the algorithm could be embedded in a parallel linear algebra library, or could be used very effectively for prototyping sparse matrix algorithms where few assumptions about matrix structure can be made.

On the NAS conjugate gradient benchmark, an nCUBE 2 implementation of this algorithm runs more than 40% faster than any other implementation reported on a massively parallel machine. The particular mapping of the matrix blocks to processors that we employ for hypercubes with cut-through routing ensures that rows and columns of the matrix are owned entirely by subcubes, and that the transpose operation can be performed without message contention. This mapping is likely to be of independent interest; it has already proved useful for parallel many-body calculations [5], and is probably applicable to other linear algebra algorithms.

Acknowledgements. We are indebted to David Greenberg for assistance in developing the hypercube transposition algorithm described in §3.1.
REFERENCES

[1] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, The NAS parallel benchmarks, Intl. J. Supercomputing Applications, 5 (1991), pp. 63-73.

[2] D. H. Bailey, E. Barszcz, L. Dagum, and H. D. Simon, NAS parallel benchmark results, in Proc. Supercomputing '92, IEEE Computer Society Press, 1992, pp. 386-393.

[3] D. H. Bailey, J. T. Barton, T. A. Lasinski, and H. D. Simon, editors, The NAS parallel benchmarks, Tech. Rep. RNR-91-02, NASA Ames Research Center, Moffett Field, CA, January 1991.

[4] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving problems on concurrent processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.

[5] B. Hendrickson and S. Plimpton, Parallel many-body calculations without all-to-all communication, Tech. Rep. SAND92-2766, Sandia National Laboratories, Albuquerque, NM, December 1992.

[6] R. W. Leland, The Effectiveness of Parallel Iterative Algorithms for Solution of Large Sparse Linear Systems, PhD thesis, University of Oxford, Oxford, England, October 1989.

[7] R. W. Leland and J. S. Rollett, Evaluation of a parallel conjugate gradient algorithm, in Numerical Methods in Fluid Dynamics III, K. W. Morton and M. J. Baines, eds., Oxford University Press, 1988, pp. 478-483.

[8] A. T. Ogielski and W. Aiello, Sparse matrix computations on parallel processor arrays, SIAM J. Sci. Stat. Comput., 14 (1993). To appear.

[9] R. van de Geijn, Efficient global combine operations, in Proc. 6th Distributed Memory Computing Conf., IEEE Computer Society Press, 1991, pp. 291-294.

[10] J. Van Rosendale, Minimizing inner product data dependencies in conjugate gradient iteration, in Proc. 1983 International Conference on Parallel Processing, H. J. Siegel et al., eds., IEEE, 1983, pp. 44-46.