duces a graph that is used in a compiler embedded loop iteration partitioner. Programmers use Fortran extensions to specify which loops and which distributed.
NASA
Contractor
ICASE
Report
Report No.
187635
91-73
ICASE o,,
DISTRIBUTED MEMORY COMPILER METHODS FOR IRREGULAR PROBLEMS -- DATA COPY REUSE AND RUNTIME PART_IONING
,-4 I N
co t,'3 ..-:0 U-_"
Z
_0
r-4
,O
Q
17, O
Raja Das Ravi Ponnusamy Joel Saltz Dimitri Mavriplis
_0
=E LU _- ,.J a_r_ QO LU ta E
L'_ ..J ZU b,_ V! ZU O I.I-a_ ck
LU ,.J
Contract
No.
September
u.J ::3.. Ir
NAS1-18605 {I:LU
1991
Z I..-H
Institute
for Computer
NASA
Langley
Hampton, Operated
Applications
Research
Virginia
in Science
and Engineering
Center
23665-5225
by the Universities
Space
Research
Association
r¢ ,,-_ b..t ¢_" IdJ QO tL Z'_ ,_tJ tr_ UJ,._ t_Q V1 ,O3: I'-- I.--- LU .1._ L 00ud O I >" Ca. C3a_: tjuJ I -J U P
i...- E ZO National Space
Aeronautic_
and
(::{
.,_
C:) u.,
Administralion
I.angley
Resesrch
Center
Hamplon,
Virginia
23665-5225
f
7
DISTRIBUTED FOR
MEMORY
IRREGULAR
REUSE
Raja _ICASE,
MS 132C,
bDepartment
PROBLEMS
AND
Das _, Ravi
-
RUNTIME
Ponnusamy NASA
of Computer
COMPILER
b, Joel
Langley
Science,
DATA
COPY
PARTITIONING
1
Saltz _ and
Research
Syracuse
METHODS
Dimitri
Center,
University,
Mavriplis
Hampton,
_
VA 23665
Syracuse,
NY
13244-4100
ABSTRACT
This
paper
distributed how
outlines
memory
to link
grammers processors.
We data.
that also
carry
out
In many
programs,
long as it can
be verified
experimental the
we show data
usefulness
of our
from and
sparse
and
loops
the values
we can
and
loop
and Engineering Saltz was provided
an
unstructured
important
are
explicitly
role
problems.
compilers.
iterations
to deal
in any
We describe
In our
scheme,
pro-
to be distributed
between
with
complex
potentially
partitioning. for access
tracking
and
the
off-processor
same
assigned
effectively
a 3-D unstructured
reuse Euler
reusing
to off-processor stored
solver
run
copies
of off-processor
memory memory
locations.
locations
off-processor
data.
on an iPSC/860
As
remain
We
present
to demonstrate
methods.
IThis work was supported by the National Aeronautics No. NAS1-18605
will play
memory
having
data
mechanism
several
that
from
how data
work
that
we believe
to handle
users
a viable
which
to distributed
specify
insulates
describe
unmodified,
able
partitioners
implicitly
This
methods
compiler
runtime
can
algorithms
two
while the authors were in residence (ICASE),
NASA Langley Research
by NSF from NSF Grant
and Space Administration
at the Institute Center, Hampton,
ASC-8819374.
for Computer
under NASA Contract Applications
VA 23665. In addition,
in Science support
for
1
Introduction
Over
the
memory tured
past
few years,
code
for a large
problems,
runtime.
In these
by a runtime and
needed
compiler
cases,
to carry
the
out
to reduce
off-processor A compiler-linked compose
data
at runtime one
structures
or more
embedded duces use
loop data
Fortran
data
and
iterations
widely
in
In this
that
In the
are
interest
extensions context
out data
structure
preprocessing
between
number
allowing
of a pre-existing
further
typically
paper
of casting
data
processors.
accessed
uses
to partition
loop
using
of communications
known
only
is made work,
memories
at
possible
to map
data
of processors.
by a distributed
methods. memory
In this
paper,
compilers,
and
requirements
dependency
which
The
memory
we describe:
by eliminating
scheme
compiler
redundant
vote
and
between
of the software
and
language,
Fortran
D
and
iteration
to generate In sparse indirection.
out
code
that
arises
from
at runtime
Programmers should specify
be used how data
of developing years
and
was
outlined
language
some
development
a set of
(see for instance
the
to realize
pro-
exten-
of these
ideas.
of languages
we present
our
work
and in the
[17].
unstructured
to carry
to de-
to a compiler
to compilers
machines,
partitioning
Runtime
idea
support
in the
SIMD
communication and
The
required
that
implicitly
partitioners
runtime
passed
arrays
by G. Fox for many such
that
partitioner.
distributed
processors.
pursued
graph
code
programmers
for standardization
loop
generates
which
MIMD
calls
also
loops
memory
produces
is then
loop iteration
the
information
compiler
embedded
some
us to develop
The
representation
for linking
we describe
dependency
of the dependency
Consequently,
has been
data
iterations.
graph
The
partitions.
our
dynamic
representation
in a compiler
a general
for distributed
Once
partitioner
partitioner.
partitioners
[16]), and
sions
compilation
are to be distributed
[15] and
unstruc-
[38].
communication
to specify
structure
applicable
[32].
This
is used
extensions
and
values
to partition
be generated
to distributed
a standardized nests.
that
to derive loop
and
structure
a graph
runtime
In sparse
architectures
the
distributed
accesses.
runtime
generates
between can
efficient
by variable
is used
compilation
partitioners
problems.
memory
of data
preprocessing
interprocessor data
of distributed
runtime
to generate
unstructured is determined
movement
two new
needed
This preprocessing
runtime
we call
presents
use
methods
and
structure
phase.
how to link runtime how
of sparse
effective
to schedule
paper
developed
dependency
in a process
This
class
preprocessing
structures code
the
we have
have calls
needed
determined, to efficiently
computations,
preprocessing the
been
required
distributed
is used data
we carry transport arrays
to generate
transport.
In many
are
a small cases,
severalloops accessthe sameoff-processormemorylocations. As long as it is known that the valuesassignedto off-processormemory locations remain unmodified, it is possibleto reuse stored off-processordata. A mixture of compile-time and run-time analysiscan be used to recognizethesesituations. Compiler analysisdetermineswhen it is safeto assumethat the off-processordata copy remains valid. Softwareprimitives generatecommunications calls which selectively fetch only those off-processordata, which are not available locally. We will call a communications pattern that eliminates redundant off-processor data accesses an incremental
schedule.
[22] and
that
data
an overview mental
the
and
of our
the
2
Overview
2.1
Overview
We will present is a version
builds
on the
executors,
work
described
in
partitioning. of our
[6],
specified
requires
that
have
explored
extensively
on this work
will be presented
could
D can
be used
array
In Fortran
elements.
D, one
attributes the
in the
extensions
Fortran tributed
the
array
two declarations.
problem (e.g. context
first
which
extensions
in
generate
we use
incre-
to control
performance
of Fortran
data
decomposition
found
how
data
in this
data
FortranD
has
same
optimizations
of languages
and
of such
a distribution
distribution processors.
is decomposition.
used
fixes the
and
our
analogous
compilers.
an example
2
Many
D, the
1, we present
declaration
D as
[33], [7, 27, 26, 28]) While
range
between
Fortran
[11],
inter-processor
The
of
and
an irregular
array.
description
[35] and
of Fortran
called
a
book.
decompositions,
D
specifications,
is to be distributed.
specify
a template
D. Fortran
[17], a less detailed
al.,
define
[25],
context
of data
et.
of specifying
for a wide
is to be partitioned The
set
be found
explicitly
[45],
to explicitly
of a distributed
used
4 we will present
5 we will present
in the
a rich
may
users
In Figure
declares
methods
by Hiranandani
be used
primitives
methods.
with
extensions article
primi-
D
77 enhanced
language
language
in Section
the
In Section
transformations
the
Finally,
our runtime-compilation
of Fortran
the
describe
3.1, we will describe
3.2 we will describe
to compilers.
We describe and
performance
2. In Section
In Section
partitioners
effort.
in the
language
which
compiler
D is given
researchers
cant
here
in Section
schedules.
of Fortran
of the
currently
work
described
work
iteration
runtime
to characterize
definition
loop
and
compiler-linked
of the
incremental
inspectors
Fortran
context
produce
to couple
drawn
preprocessing
[45].
We will set tives
The
partition
a Fortran
D declaration.
to characterize size,
A distribution Decomposition
of dis-
dimension
the
signifi-
and
way
is produced fixes the
name,
in
using di-
.,..
$1
REAL*8
x(N),y(N)
$2
INTEGER
map(N)
S3 DECOMPOSITION $4
DISTRIBUTE
$5
ALIGN
$6
... set values
$7
DISTRIBUTE
$8
ALIGN
reg(N),irreg(i) reg(block)
map
with
reg
of map
array
Distribute
addition,
size of the
is an executable
processors.
Fortran
can explicitly
A specific
array
is associated
statement
$3, Figure
$4, decomposition
used
cessors.
The
current
decompositions. and loop
reg
compiler iterations
array
statement
and
the
user
specify with
..
distribution
i of the Fortran
equal
are to be distributed
regular
Fortran
processors.
are
onto
defined.
a block reg.
align.
In
to each
map will be between
map(i)
pro-
is set equal
p.
to explicitly sections,
processors.
assigned Array
when
in
In statement
is to be partitioned array;
onto
distributions,
D statement
with
to processor
for programmers
is distribute.
is to be mapped
distribution
an integer
programmers
between
blocks
irreg
in the following
it possible
the
with
is assigned
requires
a template
distributions
sized
declaration
is to be mapped
using
using
second
of several
a distribution
distribution
£rreg
illustrate
how
a choice
map is aligned how
The
dimensional
into
D syntax
make
with
is specified
distribution
As we shall techniques
$8)
template.
a distribution
$5, array
Distribution
specifies
how
is partitioned
(in statement
An irregular
to p, element
method
D Irregular
distributed
1, two size N, one
In statement
to specify
h Fortran
D provides
a user
processor.
mapping
irreg
Figure
and
some
irreg(map)
x,y with
mensionality
using
our
define
irregular
new language
to implicitly
specify
data
extensions
how data
and
2.2
Overview
In this section, in previous
we will give an overview
publications
a program's
In the
been
the
rest
of the
latency
collected
messages.
will be
must
such
and
inspector
be
calculates once
These
the
directly
primitives
[6]; they
distributed
carry
data-sets
a set of schedules, a obtain
copies
deal
namely,
inspector
loop
examines
data
needs
off-processor The
actual
executor
computation.
are
named
out
the
distribution
over
the
numerous
specify
the
of data
PARTI
stored
and local
elements
the
and
data
references
uses
the
communication
needed
off-processor
executor. made
where
a suite
by. a
that
data
from
the
of primitives
pairs.
Runtime
Toolkit
indexed
memories.
calls
the
original
information
developed
due
at compile the
inspector
of globally
processor
in specified
to predict
and
matrix
arises
of information
the
Automated
sparse
typically
inspector/executor
retrieval
be
can be reduced
and
This
possible
lack
We have
(Parallel
by
should
data.
then
to generate
used
to be transmitted
to be fetched
loop
strategy
meshes
input
this
a computation
receive.
it is not
with
define
subsequent
communica-
array and
two constructs
by programmers
which
case,
of the
during
a non-trivial
on unstructured
In this
or input
of the
is typically
on the
described
phase.
of fetching
PDEs
To
it is received.
to implement
can be used
code.
the
what
elements
to send
depends
prefetched.
execution,
cost
nature that
information
needs
as solving
into
there
reasons,
processor
in the
the
Vital
primitives
produced
structures
follows.
The
pattern
is transformed
program
stored
[12],
each
communication
loop
processor,
that
problems,
data
architectures,
large data
data
by this preprocessing
relatively what
algorithms,
the
phase
MIMD
of the PARTI
in determining
when
For efficiency
data
During
role
are determined
level of indirection
sequential
a large
cost.
the
what
In many
or startup
In irregular
time
[38]).
approach,
memory
by precomputing
to some
of the functionality
a preprocessing
algorithm
into
algorithms,
[6],
plays
PARTI
initialized,
In distributed tions
( [45],
initialization
computation. have
of PARTI
Each
at
ICASE)
but
irregularly
inspector
produces
to either:
memory
locations
(i.e.
gather)
or,
b modify
the
contents
c accumulate (i.e. Schedulers only
(e.g.
of specified add
off-processor
or multiply)
values
memory
locations
to specified
(i.e.
off-processor
scatter), memory
or locations,
accumulate). use hash
a single
copy
tables
to generate
of each
executor
by PARTI
primitives
memory
locations.
In this
communication
off-processor to gather, paper,
the
calls that,
datum
[22],
scatter
and
idea
[45].
The
accumulate
of eliminating
for each loop nest, schedules data
duplicates
are
to/from has
been
transmit
used
in the
off-processor taken
a step
further. If severalloops require different but overlappingdata referenceswe can now avoid communicatingredundant data (SeeSection 3,1 and Section4.1.3). In distributed memory machines,large data arrays needto be partitioned betweenlocal memoriesof processors.These partitioned data arrays are called distributed arrays. Long term
storage
tributed
of distributed
machine.
manner. numbered
frequently
the mesh.
When
to each and
the
thus
build
way
does
not
Each
for another
processor
therefore
must
is accomplished on
the
required
to access
of the
primitives
This
used
The
scheduling
primitives
but
putations. the
new
the
couple
incorporate
the
its local
a given
address
array
element,
elements
table
the
first
of the
array,
we look
up
in the
(m/N)
minimizes
array
of the
elements processor,
array
processor's host
we must
memory.
processor
also contains of the
N elements,
processors.
This
the
N/P
second
in the
If we are distributed
• P + 1th processor. translation
We
address.
of processors.
its address
of distributed translation
of
processor,
number
distributed
pattern
memories
P is the
initialization
connectivity
the
where
be found
are
to a particular
lists
local on
mesh
element
the translation the
computational
arbitrary
in this
dis-
in an irregular
in a way that
is assigned
...,
can
primitives with
partitioners are
related
a number
These
to assign
in the
tables,
and
One other
tables.
Primitives
along
that
etc
handles
describes
primitives
to be able
over
N/P
element
to access
processors,
first
we know
PARTI
section
between
which
primitives
are
the
mth
a problem
of N elements,
arrays
to the
in such
array
locations
distributed
of an irregular
to access
for each
processor,
the
table,
PARTI
3
second
and
be distributed
by putting
elements
translation
be itself
need
to be able
table which,
memory
correspondence
structures
it resides,
array
nodes
in a distributed
processor
a translation
the
we may
in which
to specific
to partition
a useful
the data
element
For a one-dimensional and
in which have
we partition
processor.
know
the
is assigned
advantageous
communication,
in order
data
It is frequently
For instance,
interprocessor
array
of new
primitives
differ
which
the
schedule
primitives
that
to compilers to the
then
couple
primitives
we have
in a number
had
of ways
carry
out
partitioners
are entirely
PARTI
insights
and
new. described
about from
sparse
movement to compilers.
The
of data The
data
movement
and
earlier
( [6] and
[45])
and
unstructured
those
described
are virtually
identical
earlier
comin that
primitives:
eliminate make original
redundant
it simple
off-processor
to produce
sequential
references
parallelized
loops
loops.
5
and that
in form
to the
real*8 x(N),y(N) C
Loop
L1
do i=l,n_edge nl
over
edges
involving
x, y
= edge_list(i)
n2 = edge_list(n_edge+i) $1
y(nl)=
y(nl)
+ ...x(nl)...
x(n2)
$2
y(n2)=
y(n2)
+ ...x(nl)...
x(n2)
end
do
C
Loop
L2
do i= 1,n_face ml
over
Boundary
faces
involving
x, y
= faceAist(i)
m2 = faceJist(n_face+i) m3 = faceAist(2*n_face $3
y(ml)
$4
y(m2)= end
= y(ml) y(m2)
+ i )
+ ...x(ml) + ...x(ml)...
... x(m2) x(m2)..,
... x(m3) x(m3)
do
Figure
2: Sequential
Code
To explain how the primitives work, we will use an examplewhich is similar to loops found in unstructured computational fluid dynamics (CFD) codes. In most unstructured CFD codes,a meshis constructedwhich describesan object andthe physicalregion in which a fluid interacts with the object. Loops in fluid flow solverssweepover this meshstructure. The two loops shownin Figure 2 representa sweepoverthe edgesof an unstructured mesh followed by a sweepover faces that define the boundary of the object. Since the mesh is unstructured, an indirection array has to be usedto accessthe vertices during a loop over the edgesor the boundary faces. In loop L1, a sweepis carried out over the edgesof the mesh and the referencepattern is specifiedby integer array edgeAist. Loop L2 represents a sweepover boundary faces,and the referencepattern is specifiedby faceAist. The array x only appearsin the right hand side of expressionsin Figure 2, (statements $1 through $4), so the valuesof x are not modified by these loops. In Figure 2, array y is both read and written to. Thesereferencesall involve accumulationsin which computed quantities are addedto specifiedelementsof y (statements$1, $2, $3 and $4). 3.1
Primitives
In this section time
support
our earlier compiler new
we need suite
of primitives
is that
distributed single
are virtually
memory
derived
redundant
described
in
preprocessing identical
machine
2 in order
to present As was
[6], this runtime
support
can be used
memory make
codes
the
same
be produced
manually
the
to produce
sequential
quality
object
by the
sequential
the
either
by a Our
parallelized
loops.
The
importance
code
on the
nodes
program
run-
case with
by programmers.
it straightforward
to the original
to generate as could
Figure
references.
that
in form
from
off-processor
into distributed
it will be possible
primitives
of situations
times.
out
Scheduling
running
of of the on a
node.
Our ber
carry
example
to eliminate
or can be embedded
that
this
Communications
we use a running
primitives
loops
for
In such
distributed
make
use
in which situations,
array
a single
tables
[22] to allow
off-processor
the primitives
us to recognize
distributed
only fetch
a single
embedded
fortran
array copy
and
reference
of each
unique
exploit
a num-
is used
several
off-processor
reference.
3.1.1
PARTI
Figure
3 depicts
gather,
dfscatter_add
Executor the
ezecutor and
preprocessing
phase,
icantly
non-incremental
when
of hash
code
with
dfscatter_addnc.
to be described schedules
Before
in Section are
this 3.1.2.
employed.
callable
code This An
PARTI
is run, executor example
we have code of the
procedures to carry changes executor
dfmout signifcode
a
when the preprocessingis done without using incremental schedulesis given in [40]. The arrays x and y are partitioned betweenprocessors,eachprocessoris responsiblefor the long term storageof specifiedelementsof eachof thesearrays. The way in which x and y are to be partitioned betweenprocessorsis determinedby the inspector. In this example,elements of x and y are partitioned betweenprocessorsin exactly the sameway. Each processoris responsiblefor n_on_proc elements of x and y. It
should
Figure
be noted
that
3 is identical
named
x and
to that
y; in Figure
of a distributed
memory than
responsible.
We will store
The
PARTI
precomputed single
locations
data
ure
that
ter_addnc begins
data
obtain
with
indices
data
be described
in Section
off-processor
location
PARTI
In this
section,
arguments
a useful mesh
in Figure
by the
in which
correspondence
mesh
declared
for which
beginning
with
P is
local
array
data
between
pattern
uses
which
x defined
data
on each
with and
processors
is specified
communication
by either
schedules The
specify
In Figure
Copies
a
to fetch
schedules
is to be obtained. processor.
using" a
3, off-
of the off-processor
x(n_on_proc+l). dfscatter_addnc,
memory
in statement
locations.
to off processor data
Both
dfscatter_add
locations
is accumulated
distinctions
between
In Figure
3, several
from
and
may
be
$3 Fig-
and
a buffer
to locations
dfscatter_add data
82 and
dfscat-
area
that
of y between
dfscatter_addnc
accumulated
will
to a given
L1 or in loop L2.
we will outline
in a way that
arbitrary
elements
arrays
processor
y are
elements
in
Inspector
needed way
in loop
of array
loops
use
a single
x and
by loop L1 or by loop L2.
Off-processor
3.1.3.
on
P, arrays
3 move
dfmgather
from
to be accumulated
The
number
of the
3, we again
defined
communication
either
to off-processor
n_on_proc.
3.1.2
The
beginning
arrays
array
structure
In Figure
processor
in Figure
dfscatter_add
y(n_on_proc+l).
1 and
The
area
procedures
3, accumulate
depicted
array
the
c+ 1).
of schedules,
from
the
of off-processor
memory
in a buffer
PARTI
to store
will be needed
2.
represent
On each
pattern.
calls,
in Figure
y (n_on_pro
calls
is obtained
placed
The
c+ 1) and
in distributed
data
are
copies
procedure
y now
be needed
or by an array data
the
loops
3, x and
subroutine
schedule
processor
of the
communication
off-processor the
would
x (n_on_pro
for
multiprocessor.
to be larger
elements
except
4) allows
code
the
we carry
in Figure
nodes to the
minimizes
points
how
connectivity
us to map
processor. a globally
the
3. This
of an irregular
interprocessor
to each
out
preprocessing
preprocessing mesh
pattern
are of the
is depicted
numbered mesh.
communication, The
PARTI
indexed
needed
distributed
in Figure
frequently When
we may need
procedure
to generate 4.
do not
have
we partition
such
to be able
to assign
ifbuild_translation_table array
the
onto
processors
a
(S1 in an
real*8 x(n_on_proc+n_off_proc) real*8 y(n_on_proc+n_off_proc) S1 dfmgather(sched_array,2,x (n_on_proc+1),x) C Loop over edgesinvolving x, y L1 do i=l,local_n_edge nl
-- local_edgeAist(i)
n2 = local_edgeAist(local._n_edge+i) $1
y(nl)
= y(nl)
+ ...x(nl)...
x(n2)
$2
y(n2)
= y(n2)
+ ...x(nl)...
x(n2)
end $2
do
dfscatter_add(edge_sched,y(n_on-proc+ C Loop L2
over
Boundary
faces
1),y) involving
x, y
do i=l,localm_face ml
= local_faceAist(i)
m2 = local_faceAist(localm_face+i) m3 = local_faceAist(2*local_nA'ace $3
y(ml)=
y(ml)
+ ...x(ml)...
x(m2)..,
x(m3)
$4
y(m2)=
y(m2)
+ ...x(ml)...
x(m2)..,
x(m3)
end $3
+ i )
do
dfscatter_ddnc(face_sched,y(n_on_proc+
1),
buffer_mapping,y)
Figure
3: Parallelized
Code
9
for Each
Processor
S1 translation_table= ifbuild_translation_table(1,myvals,n_on_proc) S2 call flocalize(translation_table,edge_sched,part_edgeJist, local_edgeJist,2*n_edge,n_off_proc) S3 sched_array(1)= $4
edge_sched
call fmlocalize(translation_table,face_sched, increment
al_face_sched,
4*n__face,
n_off_procface,
n_new_off_proc_face, $5
sched_array(2)
par t_face__list,local_facedist,
buffer_mapping,
= incremental_face_sched
Figure arbitrary
fashion.
Each
the
elements
for which
array
processor specific the
needs
The needed
PARTI to produce
($2 in Figure (i) (ii)
4).
a pointer
number
Flocalize (i)
flocalize
executor
On each
a list of globally
that
processor
code
(edgedist
of globally
fmlocalize
depicted
to a particular the
distributed
carry
in Figure
Figure
4).
global
If a given
index
translation
i for
table
out 3.
the
bulk
of the
We will first
preprocessing
describe
flocalize,
(translation_table
array
references
in $2),
for which
processor
(2*n_edge
in $2).
P will be
and
distributed
array
references
returns:
a schedule
an integer code
(iii)
a
to find
that
can
be used
array
that
in PARTI
gather
and
scatter
procedures
(edge_sched
S2), (ii)
of
is passed:
table
distributed
indexed
in S1,
a list
memory.
P, flocalize
in $2),
ifbuild_translation_table
(myvals
corresponds
translation
indexed
Processor
procedure
can consult
and
processor
to a distributed
responsible, (iii)
the
the
in distributed
procedures
for Each
it will be responsible
the
datum
Code
passes
a datum
array,
of that
4: Inspector
processor
to obtain
distributed
location
1,sched_array)
number
(local_edge_.list of distinct
can be used in $2),
off-processor
to specify
the
pattern
of indirection
in the
executor
and references 10
found
in edgeAist
(n_off_proc
in $2).
in
part_edge_list
Iocal_edgelist
Flocalize off
buffer
processor
references
references
gather
into bottom of data
array
i i
local
data
buffer_____
off processor
Figure
5: Flocalize
11
Mechanism
data
A sketch of shown
how the procedure
in Figure
on each
2 is partitioned
processor
in Figure
part_edge_list elements are
to index
of arrays
generated
array
following
x starts
loops
simple
In Figure
by loops
gather
executor
Figure
to bring
achieved is first
changes
execution
the
to point
using
of the edge
the
references
array
the
is placed
buffer
for data
part_edge__list
to the
schedule
buffer
to
addresses.
returned
loop using
use
indexed
valid
data
For example,
changed
into the buffer
($2
PARTI
4) and
data
to x are copy
need
by flocal-
the loeal_edgelist
to be accessed
carried
of every
out.
distinct
fmlocalize
fmlocalize
makes
set of pre-existing
3 obtains
Figure
the same
procedure
by a given
in Figure
representation
off-processor
data
by multiple
In the
beginning
off-processor ($4
in Figure
schedules.
The
only
it
those
procedure
two schedules;
produced
of x
4) makes
to obtain
using
of
value
it possible
incremental_face_sched
of the incremental
in the off-processor
data
the formation the
duplicates
by using hashed
face_loop
a hash
using
is hashed.
remove
all the
results
showing
To review
duplicates the the
(i) a pointer
by the
table.
The hash
point
and
usefulness work
of globally
to bring shaded
the
out
indexed
Next
dfm-
edge_sched
by fmlocalize
that
($4
distributed
data
exists
edge schedule data
and
edge
to be accessed table
In Section
is formed
of duplicates
by the
hash
schedule
for the face_loop
6. Removal
in the
schedule.
6. The
is
schedule
during allows
the us to
5 we will present
schedule. we will summarize
procedure.
distributed
in Figure
to be accessed
all the
incremental
translation
by the
in Figure
data
by fmlocalize,
of this PARTI
indexed
region
function.
is given
in the off-processor
off-processor
of incremental
carried
is given
the information
form
to a distributed
a list of globally number
shown
At this
schedule
for the edge_loop
of the schedule
a simple
one of the arguments
(iii)
are
references,
not requested
pictorial
we remove
(ii)
flocalize
a single
The
duplicate
by flocalize
During
but
when
to globally
so that each
to flocalize
2. We cannot
refers
for
edge_list
4).
The
first.
Hence,
array
passed
in Figure
buffer array.
assignments
can gather
these
data
produced
The
for that
in which
2, no
L1 or L2.
off-processor in the
depicted
data
references
of situations
processor
to remove
part_edge_list
this part_edge_list
is executed.
is collected
5. The
data.
2).
3, each
referenced
loop
in Figure
as part_edge_list
changes
in a way such that
are a variety
(Figure
Figure
data
is stored
the correct
There
edge
off-processor
the off processor
accesses
Flocalize
The
of edge_list
on a processor
the on-processor
the
ize, the data
y.
is shown
processors.
4 is a subset
at x(n_on_proeh-1).
loeal_edgellst, When
the
works
between
an array
x and
when
immediately
flocalize
On each
table array
array 12
processor,
(translation_table references. references
the
significance
fmlocalize
in $4),
(faceJist (4*n_face
in $4), in $4),
of all
is passed:
INCREMENTAL
OFF PROCESSOR IN SWEEP
SCHEDULE
FETCfIES
OVER
EDGES
o_._o_o_f/_s_S
INCREMENTAL °o
SCHEDULE
DUPLICATES
/ EDGE SCHEDULE
Figure
6: Incremental
13
Schedule
(iv)
number
of pre-existing
duplicates (v)
(1 in $4),
an array
Fmlocalize (i)
that
to pre-existing
take
can any
an incremental the
pre-existing
code
be used
number
(v)
number
that
3.1.3
A Return already the
incremental element.
loops
L1 and
procedures
to
the
discussed
edge_sched
dfmgather between
accesses
in loop
Consequently
procedure
+
manner.
buffer_mapping
pattern
3.1.3.
in Section
This
schedule
in $4),
accesses
not
included
of indirection
in the
(n_off_proc_face
in $4).
encountered
1).
Figure
in
any
separate
When
in
executor
other
3, our
can
schedule
to dfscatter_addnc
use
value
be employed;
locations in S3, Figure
non-contiguous) 14
after
accumulation
in the this
accessed
is to be uses
locations
access
with
is specified
3. (dfscatter_addnc
buffer
buffer
accu-
schedule
locations locations.
locations by
to
beginning
of y to buffer
associated must
buffer
buffer
to handle
(edge_sched)
procedure
elements be
locations,
a schedule
consecutive
(face_sched)
use of
distributed
our off-processor
then
already
we make
so
of a buffer.
off-processor
may
anything
accumulations
of y to buffer
can
locations
schedule of buffer
situation,
said
off-processor
off-processor
consecutive
we assign copies
We
not When
to each
elements
3.
we have
dfscatter_addnc.
elements
each
but
location
in this
dfscatter_add
pattern
and
out
below,
memory
3.1.1
buffer
to off-processor
Figure
:The passed
the
in Section
off-processor
off-processor
in S3,
procedures.
in S4),
in face_list
consecutive
L1,
in distributed
of the
data
not
we carry
reference
to accumulate
some
off-processor
references
a single
example,
of distinct
y(n_on_proc
(face_sched
dfscatter_add
we assign
no longer
PARTI
account
only
scatter
Executor
copies
where
mulated.
ter_add
- to be discussed
In our
may
into
references
L2. As we will describe
off-processor
irregular
removing
in $4).
and
to specify
off-processor
distinction
We assign
in L2,
when
in $4).
schedules,
array
with
account
gather
(incremental.t'ace..sched
off-processor
of distinct
buffer__mapping
specify
into
in $4),
of distinct
(vi)
about
(sched_array
includes
can be used
(n_new_off_proc_face
far
schedules
schedules
that
schedules
(local._face..list
(iv)
have
to be taken
in PARTI
pre-existing
schedule
a list of integers
We
need
returns:
does not
(iii)
that
and
of pointers
a schedule
(ii)
schedules
integer
stands
in an array
for dfscat-
3.2
Mapper
In irregular
Coupler
problems,
by assigning
all computations
We consequently first partition loop
distributed
arrays
that
processor
tioners
array
are
partitioning
must
a new
structure
at
developed
notion
run
time,
primitives
approach
specified
the
by users
We now outline a specific
loop
first
consider
and
are
which
what
has been loops
carried
in which
dependencies
the
of a loop statement
S and the
of S. As the
implies,
name
statement
BRDG
we merge
l links
make
of undirected
use
statement runtime loop
conform, dependence
RDG A loop
is the RDG
using
the
we associate
graph sum
is constructed
of
statement
graphs.
loop
weights
the
loop
by adding
When
all
BRDG
RDG. of the
merged
The
two collapsed edge
15
this
< i,j
a generic purpose,
for the loops
partitioners.
in the
later
4.1.
the
left hand
directed
graph.
> between
side
We merge
the
data
arrays
When
partitioners
appearing graph,
with
side
hand
a loop BRDG.
a undirected
BRDG
without
to represent
on the right
associates
in size
We
on the
Most
We
section).
elements
distributed
4.1.
to loops
BRDG)
to form
in which
conform
in this
defined
problem
in Section loop
data
we have
a program
ourselves
vertex.
into
weight
we
in a given
S in a loop
parti-
Here,
described
element
the
currently
partitioners.
with
(statement
in the
statement.
in Section
slightly
out
is
the
partitioner
is a bipartite
an implicit
coupling
format from
array
to pro-
manual
For
We also restrict
l with
[9] but This
input
graph
of each
by producing
appearing
array
BRDG
each
collapse
of the
set
to minimize
makes
side [5]
problems.
will be relaxed
a weight
or the
same
will be carried
programs
of all distributed
connectivity we can
the
iterations
partitioning
extensions
arrays
dependency
loop
to be discussed
language
of a distributed
with
is to
partition
so as to attempt
is generated
manner.
statement
assigned
fashion.
to link a data
restriction
indices
associated
[6].
approach
cases,
of parallelized
of the
extensions
to be done
runtime
use
structure
language
index
partitionings,
[41], [15],
with
all distributed
bipartite
between
data
(this
array
left hand
a standardized
in an identical
dependencies
a statement
partitioners
generate
specified
Our
computation
on the
to make
is independent
needs
to be distributed
all)
in a manual
we wish
certain
yet
to data
available,
programs
can
processor
iterations.
iterations
approach
methods
standardized
using
Our
appearing
which
to processors
to a single
as in many
not
loop
variable
of linking
loop
on distributed
we have
necessarily
when
and
approach
not
to user
troublesome
introduce
define
the
be coupled
particularly
arrays,
(although
iteration
work
loops.
references.
with
many
based
computational
loop
arrays
we will partition
associated
There
a given
then,
by many
that
most
to allocate
to be a practical
distributed
distributed
assumption
loop
used
We do assume
non-local
and
appears
we partition
cessors.
In our
arrays
are
involve
both distributed
distributed This
desirable
that
partition
iterations.
When
it is frequently
each
the
in a loop
edge
of the
j either
when
edges. nodes
i and
a referenceto array index i j appears
on the
a reference
to array
to i appears Each
time
< i,j type
generation
process
munication.
The
data
side,
index
on the
edges
of such
side of an expression
a reference
and
related
a counter
associated
is encountered,
output
structure
on the
j appears
on the right
edge
Accumulation
this
right
appears
i, i >
do not
represented
ignored
induce
Compressed
as follows.
We assume
Sparse
< i,j
in the
data Row
com-
structure
(CSR)
>.
graph
inter-processor
by a distributed
to Saad's
with
[30],
format
(see
[37]). Data (i)
partitioning
At
compile
RDG (ii)
time
The
loop into
The
want
we have
edge
vertices
is associated
each
array
with
is carried
out
This
procedure
assigned
code
that
to each
is a distributed
of the
(If the
produces
a loop
partitions
subgraph
translation
identically
our
decision
distributed
array
To partition
left
hand
distributed
the
loop
correspond
to
table. arrays
and
between
array
motivation
associated
we to assign
the
This
trans-
referenced
element,
in
of assigning
work
the loop
to partition
RDG
loop
correspond
to either to the
distributed as an input
iterations
variable
side then
we would
to reducing Each
the RDG
a unidirectional
processor arrays
hand
manner,
communication.
correspond
convention
S's left
in this
would
would
we partition
One
a replicated
work
cost of interprocessor
work. with
of S references
S in a way that
Instead
for using
to attempt
side
partitions
communication.
computational
processor
Were
for statement
a boundary
partition
S in the
in all processors).
imbalance
distributed
we must
statement
the RDG
cost of load
Our
data,
element.
S's left hand
RIG
RDG
procedure
data
The
partitioning
of the partitioning
bidirectional
from
code is generated.
distribution.
a program
to cross
separately.
to a data The
partitioned
to partition
combined
coupling
P processors.
loop.
distributed work
array
table
is to compute
the
is passed
P subgraphs.
output
lation
Once
a dependency
RDG
a distributed
the
out
at runtime,
RDG
(iii)
is carried
associated
and
to a data
loop
partitioner
so as to minimize
or with
iterations comes
off-processor
references. loop iterations,
associates
with
each
we use
a graph
loop iteration
called
the
i, all indices 16
runtime of each
iteration
graph
or RIG.
distributed
array
accessed
during iteration distributed loop
i. The
array.
iteration,
The
the
We partition
The
RIG
(ii)
The
processor
using
If the
distributed
distributed
are
many
section
of integer of this
for for each distributed
array
distributed,
table
an
are referenced. reference
this
(Section
iteration
strategies
be used
2.2).
appearing
information
The
in
is obtained
processor
partitioning
the
can be used
to partition
to the
in the
that
loop
processor
assignments
procedure
which
makes
to partition
iterations.
associated
data,
there
We currently
with
the
largest
are
employ number
of
RIG.
Mapping:
we outline
each
primitives
compiler-linked
indirection
initial
initial
situations
arrays
Runtime employed
distribution
been
Support
to carry
out.
when
we begin
will handle
either
regular
is defined
RDG
includes
Procedure along
as the only
codes), Thus
out compiler-linked
data
and loop
eIiminate_dup_edges with
a count
in a loop have all local
indirection
our derivation
distributed
been
loop RDG
of the recorded, graphs
needed
to support
of the array uses
loop iteration loop
RDG
elements
number
table
of times
edges_to_RDG to form
the
with
edge
loop RDG.
17
The
The
object
In many
cases, In some
mappings
may have
been
irregularly
Our runtime Iinit.
processor.
The The
support local loop local
loop
Ii,_it.
has
generates
and
distribution.
already
distributions
to store
each
array
mapping.
to a single
associated
a hash
default
may have
of a compiler-linked initial
references.
for mapping.
irregular
arrays
of loop iterations
array
Ii,_it, will be a simple
integer
or irregular
distribution
distributed
information
preprocessing
restriction
an initial
to determine
of loop iterations
distributed
RDG
with
is to extract
adaptive carried
mapping
needed
preprocessing
(e.g.
already
merges
processor.
partitioning.
We begin
edges
could
references
In this
edges,
for each
graph.
loop iterations
Compiler-linked
the
arrays
using
possible
that
3.3
iteration
each
lists,
graph.
assign
array
RIPA
with
distributed
translation
one irregularly
manner:
is irregularly
are partitioned
strategies that
the
at least
graph or RIPA
associated
loop in which
array
references
assignment
references
following
distributed
RIPA
as there
strategies
in the
is found
to generate
use of the
data
for each
loop that
processor
assignment
iterations
also many
of distinct
iterations
the array's
Loop
Just
number
for every
iteration
is generated
are used (iii)
is generated
runtime
loop
(i)
a RIG.
RIG
unique
directed
been
encountered.
the local data
dependency
loop RDG
structures
that
Once
all
and then describe
partition loop iterations betweenprocessorsin blocks partition integer indirection array edge_list sothat if iteration i P, edge_list(i)
and
preprocessing
edge_list(n_edgeq-i)
are described
in
are on P (methods
is assigned
needed
to processor
to carry
out this
[45]).
do i= 1,n_edge pass
dependency
end
loop
loop
RDG
RDG
Figure
returns that
RDG
graph
partitioners
can
are
consider
using
the
The
the
RIG.
RIPA
4
eliminate_dup_edges
is partitioned
how
outline
compiler
arrays
that
Data
Distributions
describes
RDG_partitioner.
the
the data,
in Figure
new
mapping.
Note
the only constraint
time,
y are
is that
how
that
the
to be partitioned of calls
the
user
has
based
to a set
primitives specified
the
loop
of mapper
L1
coupler
7. by two
transformed
accesses
distributed
iteration
assume
a sequence
is supported
distributed
2 to illustrate
We
x and
in Figure
the
to a distributed
primitives,
by a compiler. translation
array
The
tables
reference,
partitioning
deref_rig primitive
returns
patti:
deter_rig
to find the
deter_rig
procedure,
and
processor
the
RIP.&.
iter_partition.
Compiler we first
specify
using
is returned
RDG_partitioner.
to partition
by code
each
edges_to_RDG
mapping.
Irregular
programs.
iterations
with
new
procedure
sequence.
with
primitive
using
A pointer
table
depicted
compile
is generated
associated
In this section
calling
as shown
of loop
table
partitioner
translation
that
At
hash
for Deriving
code
extensions
This
PARTI
the
to a data
correct
manner.
RIG
assignments
passed
partitioners
partitioning The
describes
Support
sequential
are embedded
tion_rig. inputs
the
to link
in a conforming
from
can use any heuristic
the
language
primitives
to procedure
to RDG_partitioner.
to a distributed
have
be used
(n2,nl)
structure
which
7: Runtime
a pointer
We
The
table
RDG_partitioner.
the
data
is passed
translation
loop
(nl,n2),
do
obtain
the
edges
data
and
describe loop
language
iterations
transformations
used
extensions are
which
to be partitioned
to carry
18
out
this
allow
a programmer
between implicitly
to implicitly
processors. defined
work
We
then
and
data
mapping. The compiler transformations generatecode which embedsthe mapper coupler primitives describedin Section3.2. In addition weoutline compiler transformations needed to take advantageof the incrementalschedulingprimitives describedin Section3.1. 4.1
Compiler-Linked
4.1.1
Problem
Mapping
Overview
The
current
define
Fortran
irregular
data
In Figure The
array
D syntax
map is used
(statement
real
arrays
to specify
reg (statement $6).
irreg is determined
by values
assigned
indicates
x(100)
both
difficulty
would pattern
to the
are
program
available
problems.
not
rich
enough
Our
coupler Figure
explicitly
coupler,
on the
dependency
all arrays
listed
loop in question
such
to the
linear
a nest
systems
of loops
From
the
loop
gives
is 10, this
the
how one
distribution The
Fortran-
of the
map array
of partitioning from
scratch
the partitioners
typically
heuristics can represent
and
operate
(e.g.
solvers,
L that
map(100)
generation
a wealth
programmer
blocks
of decomposition
a partitioner.
partitioners
literature
processors
it is not obvious
which
the
between
the
with
the
on data
meshes
different structures
in finite
difference
etc).
involves
each
L, we produce
at
irregularly compile
distributed
time
a mapper
from the
the
sequential
code
code
in Figure
2.
in Figure
2. The
To simplify
code
presentation,
in Figure only
8 contains
L1 is depicted
8.
statement
mapper
in sparse
are
$5).
10.
array
to couple
(statement
3.2)
L2 from
in Figure
We use
is known
to partition.
8 is derived
L1 and
explicitly
map is aligned
value
1 is that
map
there
in the
if the
by running
interface
described
is to identify
(see Section
The
user
While
is no standard
matrices
we will need
the
process.
interpretation
approach
array.
array
distribution
to processor
separately
for
the
in Figure
[41], [15], [5]), coding
There
sparse
to
irreg
by among
For example,
are assigned
distribute
Integer
$7 is that
depicted
to be generated
partitioners
physical
programmers
decomposition
reg is distributed
to map.
y(100)
the
of irreg.
statement
has
effort.
then
declarations
compilation
The
equations,
loops
the
(see for instance
a significant
whose
and
the irregularly
of irreg
D constructs
array
with
partition
2.1 requires
y with
distribution
$4) and of the
The
The
x and
the
meaning
that
in Section
decompositions.
1, we align
decomposition
outlined
$4 to designate
implicitmap(x,y) relations
L1 as the
indicates
between
in an implicitmap parallelizes,
loop
except
that
distributed statement for possible
19
that
an RDG arrays
are
loop
x and
will be used
graph
is to be generated
type
distributed output
and
a
based
y in loop L1. We assume
to be identically accumulation
to generate
that
dependencies
that the (If
,°,°
real*8
x(N),y(N)
decomposition
coupling(N)
S1 if(remap.eq.yes) $2
then
distribute
coupling(implielt
using
edges)
endif $3
align
x,y with
coupling
,°°,
$4
implicitmap(x,y)
C
Loop
L1
do i=l,n_edge
over
edges
edges involving
x, y
nl = edge_list(i) n2 = edge_list(n_edge+i) S1 y(nl) $2
= y(nl)
y(n2)-end
y(n2)
+ ...x(nl)...
x(n2)
+ ...x(nX)...
x(n2)
do
,,.,
L2
Loop
over
faces
involving
Figure
x, y
8: Example
of Implicit
2O
Mapping
the compiler
cannot
In many the
determine
codes
RDG
used
over
the
original
mesh
topology.
It is easy implicit based
code. the
using
will be used implicit
the
all relevant
in the
In our same
V that
procedure
there
using
any
implicit
the
information calls
pointer
to this RDG
Loop
hash
RIPA.
array.
RIG. The
table
are
distributed the
Section
members
RIPA
in turn
an
is generated
case,
how
to any
the
arrays
of the
flow
variables
in the same
between
a transformed
loop
L' is generated.
partitioner, at runtime
whenever
3.2,
is partitioned
iter_partition.
21
cases,
the
and
case,
indexing
L' contains loop
a hash
procedure
RDG
(see
table.
A
produces
RDG_partitioner.
to each
in Section
local
produces This
determine
as L. In this
L is identified
edges_to_RDG.
distribute
In many
When
eliminate_dup_edges
determine
can
analysis.
the
using
all variables
must
analysis
that
is located
identify
procedure
to generate
arrays
implicit
statement
V.
...
to determine
in L and
data
loop
distribute
...
be killed
code,
in this
distributed
to be able
must
in the
in the
statement
compiler
standard
embedded
pattern
distribution
of V could
to procedure
Corresponding
As described
The
are
distribute
implicit
will be needed
to a data
partitioned
when
loop.
3.2 that
is passed
executable
of distributed
to L is obtained,
from
the
we specify
loop RDG
indirection
We need
not be placed
that
so that
is encountered
to anticipate
8, the
made
might
The
for the
of interprocedural
is passed
iterations
any
been
L1 represents
$4 recaptures
primitives
using
be predicted
L. In this
have
results
Recall
which
generates the
loop
so that
loops.
user.
encountered.
specified
that
here
a multiple
compiler
be able
functions
to elirninate_dup_edges 3.2).
ularly
chance
pertaining
Section
loop
user
statement
the
implicit
(Figure
subscript
assignments
we will require
...
we must
example
and
distribution
the
by the
in L can
as the
is any
implicit
whether
the
simple
the
how
In order
L is next
patterns
determine
whether ...
when
reference
executes.
sense,
In this case,
of loops
8, loop
statement
described
from
a nest
in Figure from
is reported).
Primitives
L specified RDG.
instance,
obtained
arising
distribute
to make
in L will be indexed,
one loop.
an error
we can specify
extensions
8 to show
the
For
RDG
Coupler
loop
are valid,
problems,
mesh.
patterns
statement
to generate
using
based
The
than
in Figure
locates
assumptions
language
Mapper
the
compiler
the
dependency
example
When
mesh
original
more
Embedding
these
of a mesh.
to generalize
on merged
We use the
the
edges
mapping
4.1.2
to solve
will represent
a sweep
that
such
a loop loop
the
RIG
using
the
accesses
L is generated
at least a loop
one
irreg-
L" which
is passed
to deter_rig
to produce
iteration
partitioning
procedure,
a
4.1.3
Inspector/Executor
Generation for Incremental
Scheduling
Inspectors and executorsmust be generatedfor loops in which distributed arrays are accessedvia indirection. Inspectors and executorsare also neededin most loops that access irregularly distributed arrays. In this section we outline what must be done to generate distributed memory programs which makeeffectiveuseof incrementaland non-incremental schedules.Most of what wedescribeis asyet unimplemented,although we haveconstructed and benchmarkeda simple compiler capableof carrying out local transformations to embed non-incrementalschedules.This work is describedin [45]. We first outline what must be done to generate an inspector and an executor for a program loop L. We assumethat dependencyanalysishas determined that L either has no loop carried dependencies,or has only the simple accumulation type output dependencies.of the sort exemplifiedin Figure 2. It shouldbe notedthat the calling sequences of the compilerembeddablePARTI primitives differ somewhat from the primitives describedin Section3. The functionality describedin primitives flocalize and fmlocalize are each implemented as a larger
set of simpler
We scan
primitives.
through
the
distributed
or are
for a given
distributed
reference the
the
along
carried
out
functions
within
loop.
carried
can be hoisted
pattern.
in which
a loop,
For instance,
indexing
code consider
y(n2)
z(n2) do
22
are
array's
a compiler
n2 = nde(2*i-1)
end
the
subscript
We must
patterns
can
check
produces the
as the
loop:
irregularly a schedule
function
methods
pattern
of the sure
that
described
by computations is not modified
preprocessing
a representation
following
are
to make
modified
indexing generate
,At that
to generate
of ,4 are loop invariant
nl = nde(2*i)
......
needed from
distribution.
As long as a distributed
out within
arrays
Information
do i=l,n
.. = x(nl)..,
of distributed
can be produced
of an array's
cases
set
indirection.
out of L. This preprocessing
indexing
the
of all members
address
the
find
reference,
knowledge
do not
and
using
array
with
by computations
array's
indexed
subscript
in this paper
loop
code
that
of the distributed
The subscript function of y and z (using notation from the Fortran 90 array extensions) is nde(2:2*n:2), and the subscript function of x is nde(l:2*n-l:2). Recall from Section2.2, that schedulesspecify communication patterns and are not bound to a specific distributed array.
We can
avoid
communication and
pattern
z in the
a single
having
above
schedule
will reoccur
loop
are
to bring
Optimizations
to compute
that
reduce
impact
on storage
requirements.
should
be reasonably
The
out
distributed
schedules
have
can
of previously
processor dependencies. that
are Thus
off-processor
L. The first some
full
the
mation
possible have
computed that
been
modified
Analysis duplicate about
the
must data
the
program
subscript
statement
data
the
be
last
call
carried
behavior. S for which
we must
that
loop
iteration
ensure
L had
to a single
entering
of incremental
in two
In order
off-
carried ensures
to be valid
passes.
schedules.
schedules
the
processor
incremental
which
that
no loop
L will continue out
to make
the
in
A compiler second
pass,
to replace will have
a full already
L.
executors
for loop
When
L is called
time
L requires
us to obtain
multiple
times
we attempt
L is called,
we need
to determine
distributions
in the
set of distributed
inforto reuse
whether
it is
arrays
.A
to L.
out
if we are
to use
loops.
We need
between Consider
a
retransmission
In order
During
or loop
a favorable elimination
use
full schedules.
and
has
to avoid
5, proper
elements,
to know
time required
[45] we describe
for L with
within
Each
In
it possible
be carried
we need
functions
since
can
if y
to compute
subexpression
on communication.
loop
same
manner.
in Section
before
only
also
schedules.
we assumed
the
instance,
the preprocessing schedules
make
that
For
we need
of common
spent
each
with
as a whole.
communications
program
executor
schedule,
schedules.
also
that
schedules
inspectors
reduce
array
to assign
of off-processor
a program
time
immediately
and
of efficient
about
on the
will be replaced
storage
previously
obtained
an incremental
Generation
show
Recall
decision
an inspector
with
caused
our
As we wiIl
we know
z.
in a rudimentary
of distributed
valid.
manner,
redundant
3.1),
of incremental
schedules
schedule
still
data
generation
generates
copies
place
of y and
(Section
effect
in a loop.
modifications
optimization
a marked
one
of redundant
in identifying
schedules,
stored
copies
elimination
array.
when
of schedules
Minor
this
use of incremental
of unchanged
use
the
effective
carries
elements
the number
Obviously,
than
schedules
in a conforming
in off-processor
inspector.
that
in more
partitioned
by the
compiler
redundant
a right
we would
hand
incremental rather
side reference
schedules
comprehensive to distributed
like to use an incremental
schedule.
to eliminate information array
x in
We will need
to know when
off-processor
data
copies
of values
of x become
and
23
invalidated
by new
assignments,
which communicationsscheduleswill havealready been invokedby the time we reach X_
Methods
exist which
both
of these
(e.g.
[13],
and
objectives [10])
control
program Each
have
is a directed
predicates
time
is used,
the
off-processor
data
type
in a program
of edge
reuse
edge.
program
Using
that
In ongoing of slicing results
joint
5
work
methods
with
which
work
the
PARTI
Euler
of dependence
and
graph.
represent
We will call
[43], [23] we can find
values
of the
Kennedy's
distributed
group
will allow
elements
as part
kind
use
of the
schedules this
as a specific
of dependence and
reference
edge
predicates
a
of a
to x in statement developing
of incremental
Fortran
available.
We can view
dependence
we are currently the
dependence.
which
data.
all statements
array
at Rice,
us to automate
this
among
become
to know
this
statements
or a data
off-processor
graph
dependences
dependence array
reusable
assignment
represent
for x at S, we need
of potentially
will be implemented
S.
a variant
schedules.
D compiler
being
The
developed
a given
Results
[31].
mesh.
Two
the primitive
flocalize
of varying
mesh sizes.
and
unstructured Table
run
is relatively
The
meshes
2 were
until
of the
Euler
solver
incremental
solver
smallest
timings
obtained
because
mesh Figure
had
9 depicts
using
incremental
access
state
one version
(Section
a sequence
used
3.1),
the
view of the
described
in
the
(Section
vertex
that
in unstructured
3-
meshes
mesh.
210K The
[41].
single the
only
The
had
communication The
used
3.1.2).
mesh
on
primitive
similar
largest 210K
mesh
solution
the other
of structurally
vertices,
We conjecture
unstructured
a steady
non-incremental
patterns
a 3-D
schedules
schedules.
24
to port
tested,
3.6K
method
Solver
computed
a surface
by the
obtained
data
were
using
used
4 Megaflops. the
it has
schedules
was tested
using
used
only non-incremental
partitioned
at approximately poor
iterates
Euler
Equation
3 were
code
generated
edges.
Euler
in Section
and
were the
the
to generate
1.2 million
1 shows
Table
Euler
versions
fmIocalize
D unstructured
from
described
The
flocalizeand
vertices
Results
procedures
solver
meshes
edges
a control
schedule
the
dependence
[21].
Timing
The
The
job of achieving
A program
represent
Copies of off-processor
methods,
affect
to do a reasonable
codes.
vertices
either
dependence
Experimental
5.1
and
new
as a type
slicing
might
of this
at Rice
reuse
whose
represents
storage
scientific
in a program.
an incremental
caused
us to be able
irregular
occur
An edge
to generate
to allow
graph
that
a schedule
already
likely
for many
components.
In order
appear
node
single mesh
scheduling code
node
for these
performance
computations
are
Size Mesh
Number of Processors 1 2 8 16
64
4.1 6.0 12.0 14.4
-
4.6
Mflops 3.6K
Wime/iter(s) comm/iter(s)
26K
210K
3.1
1.5
1.3
-
0.5
0.9
0.9
-
Mflops
-
19.2
29.9
Time/iter(s)
-
7.1
4.5
comm/iter(s)
-
2.3
2.0
Mflops
-
-
-
118.6
Time/iter(s)
-
-
-
8.4
-
-
3.7
comm/iter(s)
Table
1:
Timings
Incremental
highly
the
Both
and
when
employed
when
the not
code
less than
In this section,
per
floating
for the
Intel
iteration
impact
cost
incremental
cost
per
a significant
communications
incremental
the
Mesh
point
80860
Using
Non-
operation
is very
architecture
on communications
per iteration
schedules. schedules.
dropped
the
sequential
code
parallelization
The On
from
on
and
a single In the
to keep
costs.
on 16 processors
communications
the
210K
3.7 seconds
requires
at least
calls the
Euler
execution
For
was
cost
mesh
2.0
dropped
on 64 processors
to 2.3 seconds
Results
using
we present
data
that
the
code any
to the
is virtually
new
primitives.
sequential
code
and
codes,
total
cost
times
100 iterations execution
parallelized
to introduce
with
parallelized
to the
the
by the
node
compared
parallel
and
process
preprocessing
3 % of the
Timing
the
employ
degradation.
typically
references
it difficult
had
we used
running
insignificant
program were
by
make
scheduling
of the
expect
performance was
Irregular
when
we
schedules.
form
exacted
parallel
Unstructured,
data.
mesh,
incremental
Since
5.2
26K
communications
those
with
we did not
to 1.1 seconds
we did
:
of memory
characteristics
supplied
on the
seconds
iPSC/860
number
use of incremental
instance,
the
the
of these
processor The
Intel
Schedule
irregular
high.
from
the
required
to converge,
to solve and
the
identical,
inefficiencies
beyond
We
compared
found
only
the a 2 %
of all preprocessing the
problems.
preprocessing
The times
times.
Mapper
indicates
25
that
Coupler the costs
incurred
by the
mapper
cou-
Size
Number of
Mesh
1
3.6K
16
4.1
7.1
16.9
17.4
Time/iter(s)
4.6
2.6
1.1
1.1
0.3
0.5
0.7
-
-
23.8
38.8
Time/iter(s)
-
-
5.6
3.4
comm/iter(s)
-
-
1.1
1.1
Mflops 210K
....
Time/iter(s)
Intel
iPSC/860
64
144.3 -
comm/iter(s)
from
8
Mflops 26K
2: Timings
2
Mflops
comm/iter(s)
Table
Processors
-
-
....
7.1 2.3
: Unstructured,
Irregular
Mesh
Using
Incremental
Schedule
Table
3: Mapper
Coupler
Timings Number
Number
2
of Vertices graph 3.6K
generation
mapper iter
(secs)
partitioner
comp/iter graph 9.4K
iter
(secs)
(secs)
generation
mapper
(secs.)
O.34
54K
generation
mapper iter
of Processors 4
8
16
32
0.20
15.92
11.50
12.11
14.92
0.94
0.57
0.42
0.34
2.4
1.31
0.6
0.34
0.86
0.69
0.53
0.35
70.96
62.3
65.2
89.7
1.19
0.82
0.60
0.43
4.83
2.35
1.1
0.67
(secs.)
(secs)
1.50
(secs.)
544.81
(secs)
partitioner
iPSC/860
0.21
comp/iter(secs) graph
Intel
0.24
(secs)
partitioner
from
(secs)
comp/iter(secs)
26
64
0.94 673.14
3.30
3.03
6.06
3.81
Figure 9: SurfaceView of Unstructured MeshEmployedfor Computing Flow over ONERA M6 Wing , Number of nodes = 210K pier primitives wereroughly on the order of the costof a singleiteration of our unstructured mesh code. We also show that the mapper coupler costsare quite small comparedto the cost of partitioning the data. In Table 3, graph generation depicts the time required by the mapper interface to generate the
runtime
loop over time
dependence edges
includes
the
edges_to_RDG
number tioner
that
time
shown table
problem
effort
structure
equivalent to call
depicts
of Simon's
is relatively
noted
required
data
(Section
3.2.
These
to loop L1 in Figure
eliminate_dup_edges
and
timings
2. The the
graph
time
involve
a
generation
required
to call
3.3)
of subgraphs
only a modest
The
time
3, mapper
version
(RDG)
is functionally
(Section
In Table allelized
that
graph
to the
both
3 gives
also includes
the
the time
to partition
partitioner
needed
needed
could
sizes.
27
We
high parallel
be used
to partition
for a single
the
RDG
iteration
using
the
The
and
implementation.
of the
a
parti-
because
It should
The
Euler
into
of the
count
iterations
a par-
RDG
cost
operation
as a mapper. loop
using
partitioned
employed.
partitioner's
an efficient
partitioner time
[41].
of processors
of the
to produce
graph
in Table
needed
number
because
was made
any parallelized
time
eigenvalue
equal high
the
be
iter partitioner
among
processors.
code
for different
6
Conclusions
Programs
designed
iterative
methods
of such
programs
Several
are
classes
environment memory
have
[46],
[36],
compilation Kali
which
project
was
[25] and arrays
the
the
first
ARF
projects
[14],
our
in three
was
provides
first
towards
using
that
designed discrete
[25],
dy-
facilities
multiprocessors
[21],
Fortran
[45], and
support
domains.
[7,27,26,28]
[39],
towards
to provide
memory
the
distributed
targeted
at distributed
compiler
a programming
facilities
inspector/executor
the
are targeted
environment
projects;
[32],
and
examples
meshes
environment
of these
project
to implement which
triangular
[24],
direct
Some
[44] describes
of 2-D or 3-D
[20],
that
Williams
targeted
[19],
PARTI
compiler compiler
environments
environment
sparse
[18].
a programming
decompositions
employed
[25], and
in this paper.
[44] and
[9] is an interactive
[42],
are
described
unstructured
developed
of compiler
methods
[4],
programming
or manual
[1],
including
problems.
with
[3] has
DecTool
[34],
computations
programming
or adaptive
This
are a variety
[8],
[2], [29],
developed
Baden
automatic
There
in
for calculations
load balancing.
of irregular
of the optimizations
described
(DIME)
for either
[45]. Runtime
D project
[38]. The
type
Kali
runtime
to support
[21],
compiler
preprocessing
irregularly
distributed
[45].
This
paper
has
required
of carrying
reduce
interprocessor
data
accesses.
form
of PARTI
outlined We bedding
the
presented
runtime
capable
runtime
by the such modest
order
results had
of the
described that
unstructured
and
demonstrated
that
our
a significant
impact
cost
to the
PARTI
primitives and than
costs
of a single cost
of the
to Intel found
data
that
iteration partitioner send
methods
the
design
embed
these
of the
and
overheads
by the of our
fluid our
receive
28
coupler
unstructured We did
incurred
20 %).
costs.
mapper
calls
described
how
to
off-processor
implemented primitives,
dynamics
in the and
then
Our
in this by using
performance
and
the In
results
were
code
compare
PARTI
These
off-processor
primitives
paper.
by em-
results.
redundant
mesh not
code
performance
for eliminating
itself.
compilers
primitives.
presented
method
PARTI
in detail
memory
redundant
has been
computational
have
described
We also
by eliminating
on communications
incurred
and
distributed
partitions.
for this
mesh
by hand
compared
(no more
We first
methods,
to design
requirements required
the
how and
support
support
that
a comparison
workload
transformations a full
demonstrated
small
runtime
compiler
communication
on the
dynamic
primitives.
compilation
We described
communication
implemented our
two new runtime
support. out
This
performance
also
many
computations.
namic
out a range
of irregular
machines.
particle
the
require
researchers
particular
the
to carry
roughly
were
time
quite
required
[6] we presented
appear
to be quite
We havejoined forces with the Fortran D group in compiler development and are implementing the methodsdescribedin this paper in the context of Fortran D in cooperation with Kennedy's group at Rice. The non-incrementalPARTI primitives describedin Section3.1 are availablefor public distribution and can be obtained from netlib or from the anonymousftp cite ra.cs.yale.edu. The incremental PARTI primitives and the Mapper coupler primitives described in Section 3.2 will be releasedsoonand will be availablethrough the samesources..
Acknowledgements The
authors
would
versally
applicable
Seema
Hiranandani
support The and mesh
like to thank partitioners; for many
for irregular authors
his help
munications
we would useful
Fox for many
enlightening
also like to thank
discussions
about
Ken
discussions
Kennedy,
integrating
into
Chuck
about
uni-
Koelbel
and
Fortran-D
runtime
problems.
would
in getting
partitioning
Geoffrey
also
like to thank:
us started
software;
and
with
Dennis Faust,
Gannon
Horst
Venkatakrishnan
Simon
for useful
scheduling.
29
for the for the
use use
suggestions
of his
Faust
system
of his unstructured for low
level
com-
References [1] F.
Andre,
bution.
[2] C.
J.-L.
Ashcraft,
S. C. Eisenstat,
Baden.
Baxter,
for
and
algorithms
[7] M. vlsi.
[8] A.
Cheung
mentation. Group,
[9] N. and
and
Cooper
Proceedings
1990.
for distributed
partitioning
and
sparse
coordinating
To appear,
Pasadena,
SIAM
J. Sci.
CA,
In
pages
strategy
An experimental Proceedings
1698,
Execution
17il,
time
May
support
Concurrency:
study
of the
1988
January
for nonuniform
C-36(5):570-580,
and
The
1988.
problems
on
1987.
for adaptive Practice
scientific
and Experience,
paragon
School
Cornell
A software
CSD-TR-1025,
January
environment:
University
Engineering,
E. N. Houstis,
axchietctures
Languages,
multicomputer
of Electrical
decomposer:
to mulitprocessor
Programming
EE-CEG-89-9,
Report
September
its compilation
on Principles
Report
Domain
of the
K. Crowley.
methods.
architectures.
C. E. Houstis,
K.
krylov
J. Scroggs.
language
architectures.
and
and
A partitioning
memory
University
and Implementation,
version
and
A. P. Reeves.
J. R. Rice.
Department,
dynamically
on Computers,
A CM Symposium
P. Chrisochoides,
[11] Thinking
Trans.
A parallel
Cornell
June
distri-
1991.
Technical
to parallel
[10]K.
June
In 2nd
algorithm
data
1990.
S. Eisentstat,
S. H. Bokhari.
on distributed
C. Chen.
380-388,
on multiprocessors.
Conference,
J. Saltz,
3(3):159-178,
running
preconditioned
IEEE
Berryman,
for
to manage
pages
A fan-in
11(3):593-599,
M. Schultz,
parallel
multiprocessors.
[6] H.
on Supercomputing,
abstractions
Multiprocessor
J. Berger
A system
1991.
J. Saltz,
Hypercube
PANDORE:
J. W. H. Liu.
SISSC,
calculations
Comput.,
of methods
[5] M.
and
Programming scientific
Statist.
H. Thomas.
Conference
factorization.
localized
[4] D.
and
In International
numerical
[3] S.
Pazat,
July
1986.
A first
Computer
or
imple-
Engineering
1989.
P. N. Papachiou, tool
for mapping
Purdue
University,
S. K. pde
Kortesis,
computations
Computer
Science
1990.
Kennedy. ACM ACM
Interprocedural
SIGPLAN
88 Conference
SIGPLAN
Machines
Corporation.
1.0, Thinking
Machines
side-effect
Not. CM
Corporation,
3O
on Programming
23, 7, pages
Fortran Feb
analysis
reference 1991.
57-66,
July
manual.
in linear Language
time.
Design
1988. Technical
In
Report
[12]R.
Das,
J. Saltz,
(document
and
and
H. Berryman.
patti
software
A manual
available
for parti
runtime
primitives
netlib).
Interim
Report
through
- revision 91-17,
1
ICASE,
1991.
[13]
J. Ferrante,
K. Ottenstein,
in optimization.
[14]
I. Foster
and
Englewood
[15]
G. Fox. on the 13:
Cliffs,
network.
[17] G.
University,
Volumes
for
Modern
Load
Parallel
and
Computer
vector
its use
multiplication
its Applications.
Architectures
loosely
synchronous
problems
K. Kennedy,
C. Koelbel,
U. Kremer,
C. Tseng,
Department
December
Volume
Martin
Schultz
with
a neural
Applications,
pages
Report
G. Lyzenga,
P. Hatcher,
ACPC/TR
90-1,
A. Lapadula,
S. Otto,
of Parallel
S. Hiranandani,
Programming,
programming
Multiprocessors, 1991.
in Fortran
J. Saltz
and
Rice
M. Wu.
COMP
TR90-
P.
Englewood
Cliffs,
Center
D. Walker.
memory
for Parallel and
New
73-82,
J. Anderson.
C. Tseng. D.
April
Compiler
In Compilers
Mehrotra
Editors,
Jersey,
Problems
1988.
multiprocessing
Computation,
In 3rd A CM SIGPLAN pages
Solving
sys-
1990.
A production
Symposium
quality
on Principles
1991. support and
for machine-independent
Runtime
Amsterdam,
Software The
for
Scalable
Netherlands,
To
Elsevier.
Hiranandani, migration
and
and
M. Quinn,
machines.
K. Kennedy,
Science
J. Salmon,
for distributed
Austrian
R. Jones,
for hypercube
of Computer
and
1990.
parallelization
puting,
matrix
and
Automatic
data
and sparse
Computers
H. M. Gerndt.
[22] S.
and
Prentice-Hall,
Concurrent
Prentice-Hall,
appear
Programming.
on Hypercube
Computers.
parallel
in Parallel
in Mathematics
balancing
on Concurrent
C* compiler
[21]
Concepts
to load balancing
specificatiofi.
Fox, M. Johnson,
Practice
graph
1988.
Conf.
D language
tems.
dependence
1988.
141, Rice
[20]
New
IMA
Algorithms
S. Hiranandani,
Fortran
[19]
In The
In Third
Fox,
[18] G.
approach
W. Furmanski.
241-27278,
program
1987.
Strand:
Springer-Verlag,
G. Fox and
The
N J, 1990.
A graphical
Numerical
J. Warren.
TOPLAS,
S. Taylor.
hypercube.
Editor.
[16]
ACM
and
J. Saltz, schemes
to appear,
P. Mehrotra,
and
on multicomputers.
12, August
H. Berryman. Journal
1991. 31
Performance
of Parallel
and
of hashed Distributed
cache Com-
[23] S.
Horwitz,
ACM
[24]
TOPLAS,
K. Ikudome, tion
[25]
T. Reps,
for distributed
distributed
memory
architectures.
of Parallel
Programming,
M. Chen.
[27] J.
M. Chen.
distributed
arrays.
Computation, J. Li and
MIT
J. W.
primitives
runtime
J. Mavriplis.
Three
In AIAA
R. Mirchandaney, support
computers.
In Snd pages
and
symbolic
In DMCC5,
Supporting
ACM
paralleliza-
pages
shared
SIGPLAN
177-186.
ACM
data
Symposium
SIGPLAN,
communication
from
'90, November
alignment:
1105-1114,
structures
on
on Principles
March
1990.
shared-memory
program
1990.
Minimizing
cost of cross-reference
of the 3rd Symposium
the
and
coordination
Compiler
Software, Rogers
Conference
models
Computing,
for
and
pages
on the
Frontiers
of interprocessor
Parallel
and
Computing,
for parallel
between of Massively
communication. Cambridge
In
Mass,
1991.
Pingali.
on Programming
parallel
for the
sparse
France,
ICASE, euler
Conference,
D. M. Nicol,
Data-parallel
report,
Dynamics
In Proceedings
, St. Malo
September
Interim
multigrid
Fluid
processors.
for
cholesky
1986.
unstructured
R. M. Smith,
P. J. Hatcher.
scheduling
in progress.
Computational
J. H. Saltz,
K.
task
3:327-342,
dimensional lOth
69-76,
and
for compilers,
on Supercomputing
M. J. Quinn
June
domain
Computational
Parti
[34] A.
An automatic
Rosendale.
explicit
Automating
Parallel
Conference
[33]
graphs.
Press.
Liu.
runtime
dependence
1990.
Languages
91-1549cp. [32]
J. Van
In Proceedings
factorization.
[31] D.
parallel
Supercomputing
Index
October
Programming The
and
Generating
M. Chen.
using
1990.
In Proceedings
Li and
J. Flower.
memory
P. Mehrotra,
slicing
1990.
and
C. Koelbel,
references.
[30]
A. Kolawa,
SC, April
Li and
Interprocedural
January
Charleston,
[26] J.
[29]
D. Binkley.
12(1):26-60,
G. Fox,
system
Practice
[28]
and
and of the
pages
Kay
140-152,
programming
equations, June
Crowley.
1988
1991.
ACM July
paper
1991. Principles
of
International 1988.
on multicomputers.
IEEE
1990.
Process Language
decomposition Design
1989.
32
and
through
locality
Implementation.
of reference. ACM
SIGPLAN,
In
[35] M. Rosing
and
computation
R. Schnabel.
on distributed
University
of Colorado,
[36] M. Rosing, in Dino.
R. W. Schnabel,
Applications,
pages
[37] Y. Saad.
and
553-560,
multiprocessors.
- a new
language
Technical
Report
for
numerical
CU-CS-385-88,
R. P. Weaver. Conference
Expressing
complex
on Hypercubes,
parallel
Conurrent
algorithms
Computers
and
1989.
a basic
tool
kit
for sparse
matrix
computations.
Report
90-20,
1990.
[38]J. Saltz,
H. Berryman,
Concurrency,
[39]J. Saltz,
Practice
K. Crowley,
execution
of loops
Computing,
and
J. Wu.
and
Experience,
on
message
8:303-312,
for realistic Portland,
loops.
[41]H. Simon.
of the
Physics
Journal
ICASE,
to appear:
1990.
Run-time
scheduling
of Parallel
and
and
Distributed
Program
and
J. Wu. Parti
procedures
Memory
Computing
Conference,
1991.
Permagon
Mellon
H. Berryman,
of the 6th Distributed
mesh
on Parallel
A Parallelizing
[43]M. Weiser.
90-59,
Berryman.
machines.
of unstructured
Conference
Carnegie
Harry
D. Mavriplis,
April-May
Applications.
[42]P. S. Tseng.
and
Report
for multiprocessors,
1990.
Partitioning
ceedings
compilation
1991.
passing
In Proceedings
Oregon,
Runtime
R. Mirchandaney,
[4o]J. Saltz, R. Das, R. Ponnusamy,
thesis,
of Dino
1988.
of the 4th
Sparsekit:
overview
memory Boulder,
In Proceedings
RIACS,
An
Methods
Press,
slicing.
IEEE
for parallel
on Large
Scale
processing.
Structural
In Pro-
Analysis
and
1991.
Compiler
University,
problems
for Distributed
Memory
Pittsburgh,
PA,
May
Trans.
on Software
Parallel
Computers.
PhD
1989. Eng.,
SE-10(4):352-357,
July
1984. [44] R. D. Williams C3P
715,
[45]J. Wu,
and
Caltech J.
R. Glowinski. Concurrent
Saltz,
[46]H. Zima,
and
In Proceedings
pages H. Bast,
parallelization.
Computation
S. Hiranandani,
for multicomputers. Processing,
Distributed
II-26,II-30,
irregular Program,
H. Berryman.
of the
1991
finite
elements.
February Runtime
International
Technical
Report
1989. compilation Conference
methods on Parallel
1991.
and
M. Gerndt.
Superb:
Parallel
Computing,
6:1-18,
33
A tool 1988.
for semi-automatic
MIMD/SIMD
NASA
Report
Page
2. Government Accession No.
1, Report No, NASA CR-187635 ICASE
Documentation
Report
No.
3. Recipient's Catalog No.
91-73
4, Title and Subtitle DISTRIBUTED
5. Report Date
MEMORY
PROBLEMS
--
DATA
COMPILER COPY
METHODS
REUSE
AND
FOR
IRREGULAR
RUNTIME
September
PARTITIONING
8. Performing Organization Report No.
7. Author(s} Raja
Das
Ravi Joel
Ponnusamy Saltz
91-73 10. Work Unit No.
Dimitri Mavriplis 9. Pe_orming Organization Name and Address Institute and Mail
1991
6. Performing Organization Code
for
Computer
505-90-.52-01
Applications
11. Contract or Grant No.
inScience
Engineering Stop
NASI-18605
132C,
NASA
Langley
Research
Center 13. Type of Report and Period Covered
Hampton, VA 23665-5225 12. SponsoringAgency Name and Addre_
Contractor National
Aeronautics
Langley
Research
Hampton,
VA
and
Space
Report
Administration 14, Sponsoring,_gency Code
Center
23665-5225
15. Supplementaw Notes Langley Michael
Technical F. Card
To
Monitor:
appear in and Runtime Memory P.
Final Repor 16_'A_tract
any
We
describe
In
our
to
be
paper
outlines
distributed how
scheme,
memory to
link
two
with
methods
compiler runtime
programmers
distributed
plicit!y
Machines,
Mehrotra,
Editors:
Elsevier
Compilers Distributed
J.
Saltz
and
Press.
t
This in
book -- "Languages, Environments for
implicitly
processors.
potentially
to
complex
we
believe
handle
partitioners
can
between
which
able
to
algorithms
how
insulates that
play memory
data
and
users
carry
an
important
andunstructu=ed
distributed
specify This
will
sparse
role
problems.
compilerS.
loop
iterations
from
having
out
work
and
and
reusing
the
same
to
data
are
deal
ex-
partition-
ing. We processor memory
describe
data.
In
locations.
processor stored Euler
also
memory
a viable
many As
long
locations
off-processor solver run on
mechanism
programs, as
it
remain
data. We an iPSC/860
can
be
sparse,
mesh,
compiler,
verified we
NASA FORM 1626 OCT86
tha@
the
show
that
present experimental to demonstrate the
executor,
distributed
59
values we
data from usefulness
can
copies
of
off-
off-processor assigned
to
effectively
offreuse
a 3-D unstructured of our methods.
memory
_.
- Mathematical - Computer
. Up¢lassified Security Cla=ff. (of this pe_} Unclassified
and
Computer
Sciences
(General) 61
Unclassified
access
18. Distribution Statement
inspector,
19. Security Classif. (of this report)
tracking
loops
unmodified,
17. Key Words(SuggestedbyAuthor|s)) unstructured
for
several
Programming - Unlimited 21. No. of _ges 35
and
_.
Software
Price A03
NASA-L_Jey,1991