NASA Contractor Report 187635
ICASE Report No. 91-73

ICASE

DISTRIBUTED MEMORY COMPILER METHODS FOR IRREGULAR PROBLEMS --
DATA COPY REUSE AND RUNTIME PARTITIONING

Raja Das
Ravi Ponnusamy
Joel Saltz
Dimitri Mavriplis

Contract No. NAS1-18605
September 1991

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, Virginia 23665-5225

Operated by the Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23665-5225

DISTRIBUTED MEMORY COMPILER METHODS FOR IRREGULAR PROBLEMS --
DATA COPY REUSE AND RUNTIME PARTITIONING (1)

Raja Das (a), Ravi Ponnusamy (b), Joel Saltz (a) and Dimitri Mavriplis (a)

(a) ICASE, MS 132C, NASA Langley Research Center, Hampton, VA 23665
(b) Department of Computer Science, Syracuse University, Syracuse, NY 13244-4100

ABSTRACT

This paper outlines two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on an iPSC/860 to demonstrate the usefulness of our methods.

(1) This work was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-18605 while the authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23665. In addition, support for Saltz was provided by NSF Grant ASC-8819374.

1  Introduction

Over the past few years, a number of research groups have made it possible to produce distributed memory code for a large class of sparse and unstructured problems in which data access patterns are known only at runtime [15], [38]. In these cases, preprocessing is carried out at runtime to map data and loop iterations onto the memories and processors of a distributed memory machine and to reduce the number of communications needed to carry out off-processor data accesses.

There is currently widespread interest in compilers for SIMD and MIMD distributed memory architectures and in language extensions that allow programmers to specify how data and loop iterations are to be distributed between processors (see for instance [16]), and there is some movement toward standardization of such extensions. The idea of casting runtime preprocessing in a form that could be embedded in a compiler was outlined some years ago by G. Fox [17], and in the context of a pre-existing language, Fortran, the Fortran D extensions [15] represent a further development of these ideas. In sparse and unstructured problems, the communication requirements and the dependency structure of a loop arise from values known only at runtime, typically because distributed arrays are accessed through one or more levels of indirection. Consequently, the preprocessing needed to schedule interprocessor data movement and to generate efficient data transport code must itself be carried out at runtime; we call this process runtime compilation [32].

In this paper, we present two new methods that we believe will be widely applicable in compilers for distributed memory architectures. The first method links runtime data and loop iteration partitioners to the compiler. Programmers use Fortran extensions to specify which loops and which distributed arrays are to be partitioned. The compiler embeds calls that, at runtime, derive a graph representation of the dependency structure coupling distributed array elements; this graph is passed to a data partitioner, and the same dependency information also produces a graph that is used in a compiler embedded loop iteration partitioner. Once the data structure and loop iteration partitions have been determined, the runtime preprocessing described above generates the communication calls needed to efficiently carry out the required data transport. Because the dependency representation is general, this approach is applicable to a large class of partitioners developed for unstructured problems.

The second method allows us to reuse copies of off-processor data. In many cases, several loops access the same off-processor memory locations. As long as it is known that the values assigned to off-processor memory locations remain unmodified, it is possible to reuse stored off-processor data. A mixture of compile-time and run-time analysis can be used to recognize these situations. Compiler analysis determines when it is safe to assume that the off-processor data copy remains valid. Software primitives generate communications calls which selectively fetch only those off-processor data which are not available locally. We will call a communications pattern that eliminates redundant off-processor data accesses an incremental schedule.
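As an illustration (not part of the original report), the following minimal Fortran sketch computes an incremental schedule in the sense just defined: given the list of off-processor indices already fetched for a first loop, it keeps only the distinct off-processor indices of a second loop that are not yet covered. The array names, sizes and index values are made up.

c     Sketch: build an incremental schedule for a second loop given
c     the schedule (list of off-processor global indices) of a first
c     loop.  covered(g) marks globals already fetched by schedule 1.
      program increm
      integer maxg, n1, n2, i, g, ninc
      parameter (maxg=1000)
      integer sched1(4), refs2(5), incsched(5)
      logical covered(maxg)
      data sched1 /17, 42, 99, 256/
      data refs2  /42, 300, 99, 512, 300/
      n1 = 4
      n2 = 5
      do i = 1, maxg
         covered(i) = .false.
      end do
      do i = 1, n1
         covered(sched1(i)) = .true.
      end do
c     keep each off-processor reference of loop 2 only once and only
c     if it is not already covered by schedule 1
      ninc = 0
      do i = 1, n2
         g = refs2(i)
         if (.not. covered(g)) then
            ninc = ninc + 1
            incsched(ninc) = g
            covered(g) = .true.
         end if
      end do
      write(*,*) 'incremental schedule:', (incsched(i), i = 1, ninc)
      end

Note that the marker array also removes duplicates within the second reference list, so only a single copy of each datum would be communicated.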

The schedules used in this work are generated with the hash table based techniques described in [22] and [45]. The rest of the paper is organized as follows. In Section 2 we give an overview of Fortran D and of the runtime primitives on which this work builds. In Section 3.1 we describe the primitives used to produce and to use incremental schedules, and in Section 3.2 we describe the primitives used to couple runtime partitioners to compilers. In Section 4 we describe the language extensions and compiler transformations needed to make use of these primitives. Finally, in Section 5 we present performance data obtained from an unstructured Euler solver.

2  Overview

2.1  Overview of Fortran D

We will present our work in the context of Fortran D [17], a version of Fortran 77 enhanced with a rich set of data decomposition specifications. A definition of Fortran D is given in [17]; a less detailed description may be found in the article by Hiranandani et al. [25]. Many researchers have explored methods of specifying data decompositions in the context of languages and compilers (e.g. [33], [7, 27, 26, 28], [11], [35], [45]), and a significant compilation effort is currently under way for Fortran D. While we present our methods in the context of Fortran D, the same optimizations and primitives could be used with a wide range of languages aimed at distributed memory architectures.

Fortran D data decomposition specifications allow users to control explicitly how array elements are to be partitioned between processors and, consequently, the pattern of inter-processor communication. In Fortran D, the user declares a template, called a decomposition, which fixes the size, dimensionality and way in which a distributed array is to be partitioned. A distribution is produced using two declarations. The first, decomposition, fixes the name, dimensionality and size of the decomposition. The second, distribute, is an executable statement which specifies how a decomposition is to be mapped onto the processors. Fortran D provides the user with a choice of several regular distributions (e.g. block); in addition, a user can explicitly specify an irregular distribution using an integer mapping array. A specific array is associated with a decomposition using the Fortran D statement align. In Figure 1, we present an example of the Fortran D declarations used to define an irregular distribution.

S1    REAL*8 x(N), y(N)
S2    INTEGER map(N)
S3    DECOMPOSITION reg(N), irreg(N)
S4    DISTRIBUTE reg(block)
S5    ALIGN map with reg
S6    ... set values of map array using some mapping method ...
S7    DISTRIBUTE irreg(map)
S8    ALIGN x, y with irreg

           Figure 1: Fortran D Irregular Distribution

In statement S3 of Figure 1, two one-dimensional decompositions, reg and irreg, each of size N, are defined. In statement S4, decomposition reg is partitioned into equal sized blocks, with one block assigned to each processor. In statement S5, integer array map is aligned with decomposition reg; map will be used (statement S7) to specify how decomposition irreg is to be partitioned between processors. An irregular distribution is specified using an integer array: when map(i) is set equal to p, element i of irreg is assigned to processor p. Finally, in statement S8, arrays x and y are aligned with irreg, so that x and y are distributed in the irregular manner described by map. As we shall see in the following sections, our new language extensions make it possible for a user to implicitly specify how data and loop iterations are to be distributed, by coupling the program to a partitioner rather than by setting a mapping array explicitly.
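To make the bookkeeping implied by such an irregular distribution concrete, here is a small Fortran sketch (our illustration, not from the report; the values of N, P and the contents of map are invented). Given a mapping array like map in Figure 1, it determines, for every global element, its owning processor and its local index on that processor.

c     Sketch: ownership and local numbering induced by a map array.
      program irrmap
      integer N, P
      parameter (N=8, P=3)
      integer map(N), nowned(P), local(N)
      integer i, p0
      data map /1, 3, 1, 2, 2, 3, 1, 2/
      do i = 1, P
         nowned(i) = 0
      end do
      do i = 1, N
         p0 = map(i)
         nowned(p0) = nowned(p0) + 1
         local(i) = nowned(p0)
      end do
      do i = 1, N
         write(*,*) 'element', i, ' -> processor', map(i),
     &              ' local index', local(i)
      end do
      end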

2.2  Overview of PARTI

In this section we give an overview of the PARTI primitives that have been described in previous publications [6]; the rest of the paper builds on this runtime support.

In distributed memory MIMD architectures, there is typically a non-trivial communications latency or startup cost. For efficiency reasons, information to be transmitted should be collected into relatively large messages, and the cost of fetching array elements can be reduced by precomputing which data each processor needs to send and to receive. In irregular problems, such as solving PDEs on unstructured meshes or sparse matrix algorithms, the communications pattern depends on the input data; because distributed array references involve a level of indirection, it is typically not possible to predict at compile time what data must be prefetched. This lack of information is dealt with by transforming the original sequential loop into two constructs, an inspector and an executor [12] (see also [45], [38]). During program execution, the inspector examines the data references made by a processor and calculates what off-processor data need to be fetched and where those data will be stored once they are received. The executor loop then uses the information produced by the inspector to carry out the actual computation and the communication it requires.

We have developed a suite of primitives, called PARTI (Parallel Automated Runtime Toolkit at ICASE) [6], that can be used directly by programmers to generate inspector/executor pairs. Each inspector produces a set of schedules, which specify the communication calls needed to either

(a) obtain copies of data stored in specified off-processor memory locations (i.e. gather),
(b) modify the contents of specified off-processor memory locations (i.e. scatter), or
(c) accumulate (e.g. add or multiply) values to specified off-processor memory locations (i.e. scatter accumulate).

The executor contains embedded PARTI primitives that use these schedules to gather, scatter and accumulate data to and from off-processor memory locations. Schedulers use hash tables [22] to generate communication calls that, for each loop nest, transmit only a single copy of each unique off-processor datum [22], [45]. In this paper, the idea of eliminating duplicates has been taken a step further: if several loops require different but overlapping data references we can now avoid communicating redundant data (see Section 3.1 and Section 4.1.3).

In distributed memory machines, large data arrays need to be partitioned between local memories of processors. These partitioned data arrays are called distributed arrays. Long term
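To make the inspector/executor split described above concrete, here is a toy single-processor Fortran sketch (our illustration; the owned range lo..hi, the reference list and the "fetched" values are invented, and no real communication takes place). The inspector records which indirectly referenced elements are off-processor and where their copies will live; the executor then runs from local data plus the buffered copies.

c     Toy inspector/executor for a processor owning globals lo..hi of x.
      program inspex
      integer n, lo, hi
      parameter (n=5, lo=1, hi=4)
      integer ia(n), offidx(n), locp(n)
      real*8 x(hi-lo+1), buf(n), s
      integer i, nofp
      data ia /2, 7, 4, 9, 1/
c     --- inspector: find off-processor references ---
      nofp = 0
      do i = 1, n
         if (ia(i) .lt. lo .or. ia(i) .gt. hi) then
            nofp = nofp + 1
            offidx(nofp) = ia(i)
            locp(i) = -nofp
         else
            locp(i) = ia(i) - lo + 1
         end if
      end do
c     (a real inspector would build a schedule from offidx, remove
c      duplicates, and gather the values into buf; here we fake it)
      do i = 1, hi-lo+1
         x(i) = dble(i)
      end do
      do i = 1, nofp
         buf(i) = 100.0d0 + dble(offidx(i))
      end do
c     --- executor: compute using local values and fetched copies ---
      s = 0.0d0
      do i = 1, n
         if (locp(i) .gt. 0) then
            s = s + x(locp(i))
         else
            s = s + buf(-locp(i))
         end if
      end do
      write(*,*) 'sum of referenced values =', s
      end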

storage of each distributed array element is assigned to a specific processor. We frequently partition the nodes of an irregular computational mesh in a way that minimizes interprocessor communication; such a partition assigns globally numbered mesh points to processors in an arbitrary fashion that does not resemble a simple blocked or cyclic mapping. Each processor must therefore be able to find, for any distributed array element it references, the processor on which that element is stored and its local address in that processor's memory.

This correspondence is recorded in a translation table. Because the connectivity lists of an irregular mesh may be large, the translation table must itself be distributed over the memories of the processors. For a one-dimensional array of N elements distributed over P processors, we build a distributed translation table by putting the first N/P entries on the first processor, the second N/P entries on the second processor, and so on. If we need to access the mth element of the distributed array, we look up its current location in the mth entry of the translation table, which we know can be found on processor (m-1)/(N/P) + 1 (integer division). The translation table lists, for each element, the processor on which it resides and its local address. This arrangement lets us assign distributed array elements to processors in an arbitrary manner while still being able to locate any element. PARTI provides primitives to build and to access distributed translation tables; other primitives use the tables, along with hash tables, to build communication schedules.

The PARTI primitives described in this paper differ in a number of ways from those described earlier ( [6] and [45]). The scheduling primitives that carry out data movement are virtually identical in form to the earlier ones, but the new primitives (1) eliminate redundant off-processor references, and (2) make it simple to produce parallelized loops that are identical in form to the original sequential loops. The primitives that couple data and loop iteration partitioners to compilers are entirely new; they incorporate insights we have gained from parallelizing sparse and unstructured computations by hand.

to the

3  PARTI Primitives

      real*8 x(N),y(N)
C  Loop L1 over edges involving x, y
      do i = 1, n_edge
         n1 = edge_list(i)
         n2 = edge_list(n_edge+i)
S1       y(n1) = y(n1) + ...x(n1)... x(n2)
S2       y(n2) = y(n2) + ...x(n1)... x(n2)
      end do
C  Loop L2 over Boundary faces involving x, y
      do i = 1, n_face
         m1 = face_list(i)
         m2 = face_list(n_face+i)
         m3 = face_list(2*n_face + i)
S3       y(m1) = y(m1) + ...x(m1)... x(m2)... x(m3)
S4       y(m2) = y(m2) + ...x(m1)... x(m2)... x(m3)
      end do

           Figure 2: Sequential Code

To explain how the primitives work, we will use an example which is similar to loops found in unstructured computational fluid dynamics (CFD) codes. In most unstructured CFD codes, a mesh is constructed which describes an object and the physical region in which a fluid interacts with the object. Loops in fluid flow solvers sweep over this mesh structure. The two loops shown in Figure 2 represent a sweep over the edges of an unstructured mesh followed by a sweep over faces that define the boundary of the object. Since the mesh is unstructured, an indirection array has to be used to access the vertices during a loop over the edges or the boundary faces. In loop L1, a sweep is carried out over the edges of the mesh and the reference pattern is specified by integer array edge_list. Loop L2 represents a sweep over boundary faces, and the reference pattern is specified by face_list. The array x only appears on the right hand side of expressions in Figure 2 (statements S1 through S4), so the values of x are not modified by these loops. In Figure 2, array y is both read and written to. These references all involve accumulations in which computed quantities are added to specified elements of y (statements S1, S2, S3 and S4).

3.1  Primitives

In this section we describe the new PARTI primitives, using the running example of Section 2 (Figure 2) to present the runtime support and the preprocessing we need. As was the case with our earlier suite of primitives [6], this runtime support can either be embedded in a distributed memory compiler or be used directly by programmers to port sequential codes to distributed memory machines by hand. The primitives make it straightforward to produce parallelized loops that are identical in form to the original sequential loops, so that the node program can achieve the same quality of object code as the sequential program running on a single node. Our primitives make use of hash tables [22] to recognize situations in which several distributed array references access the same off-processor datum; in such situations, the primitives fetch only a single copy of each unique off-processor reference.

3.1.1  PARTI Executor

Figure 3 depicts the executor code that carries out the computation of Figure 2 on each node of a distributed memory multiprocessor. The Fortran callable PARTI procedures dfmgather, dfscatter_add and dfscatter_addnc appear in the executor; they use communication schedules produced in a preprocessing phase, to be described in Section 3.1.2. This executor code changes significantly when incremental schedules are employed; an example of the executor code obtained when the preprocessing is done without using incremental schedules is given in [40]. The arrays x and y are partitioned between processors; each processor is responsible for the long term storage of specified elements of each of these arrays. The way in which x and y are to be partitioned between processors is determined by the inspector. In this example, elements of x and y are partitioned between processors in exactly the same way, and each processor is responsible for n_on_proc elements of x and y.

It should be noted that the loops in Figure 3 are identical in form to those in Figure 2, except that the loop bounds and the indirection arrays are now local. On each processor, arrays x and y are declared to be larger than in Figure 2 so that they can also store copies of off-processor data; the buffer area for each array begins immediately after the locally owned elements, at x(n_on_proc+1) and y(n_on_proc+1). The procedure dfmgather (statement S1 in Figure 3) uses an array of communication schedules (sched_array) to fetch a single copy of every distinct off-processor element of x referenced by loop L1 or by loop L2, and places these copies in the buffer area beginning at x(n_on_proc+1). The procedure dfscatter_add (S2 in Figure 3) accumulates data from the buffer area of y to the off-processor memory locations in which those elements of y are stored, using the edge schedule. The procedure dfscatter_addnc (S3 in Figure 3) carries out the same kind of accumulation for the boundary face loop, but it uses the incremental face schedule together with an array, buffer_mapping, that specifies the (possibly non-consecutive) buffer locations involved. The distinctions between dfscatter_add and dfscatter_addnc will be described in Section 3.1.3. Data may be accumulated to a given off-processor memory location in loop L1 or in loop L2; in each case the communication pattern is specified by a schedule produced during the inspector phase.

3.1.2  PARTI Inspector

The preprocessing needed to generate the schedules and the other data structures used by the executor of Figure 3 is carried out by the inspector code depicted in Figure 4. We frequently partition the nodes of an irregular mesh onto processors in a way that minimizes interprocessor communication; the resulting assignment of globally numbered mesh points to processors does not follow any simple pattern. We consequently need to be able to map a globally indexed distributed array onto the processors in an arbitrary fashion. The PARTI procedure ifbuild_translation_table (S1 in Figure 4) allows us to do exactly that.

      real*8 x(n_on_proc+n_off_proc)
      real*8 y(n_on_proc+n_off_proc)
S1    call dfmgather(sched_array,2,x(n_on_proc+1),x)
C  Loop L1 over edges involving x, y
      do i = 1, local_n_edge
         n1 = local_edge_list(i)
         n2 = local_edge_list(local_n_edge+i)
         y(n1) = y(n1) + ...x(n1)... x(n2)
         y(n2) = y(n2) + ...x(n1)... x(n2)
      end do
S2    call dfscatter_add(edge_sched,y(n_on_proc+1),y)
C  Loop L2 over Boundary faces involving x, y
      do i = 1, local_n_face
         m1 = local_face_list(i)
         m2 = local_face_list(local_n_face+i)
         m3 = local_face_list(2*local_n_face+i)
         y(m1) = y(m1) + ...x(m1)... x(m2)... x(m3)
         y(m2) = y(m2) + ...x(m1)... x(m2)... x(m3)
      end do
S3    call dfscatter_addnc(face_sched,y(n_on_proc+1),buffer_mapping,y)

           Figure 3: Parallelized Code for Each Processor

S1    translation_table = ifbuild_translation_table(1,myvals,n_on_proc)
S2    call flocalize(translation_table,edge_sched,part_edge_list,
     &               local_edge_list,2*n_edge,n_off_proc)
S3    sched_array(1) = edge_sched
S4    call fmlocalize(translation_table,face_sched,
     &               incremental_face_sched,part_face_list,
     &               local_face_list,4*n_face,n_off_proc_face,
     &               n_new_off_proc_face,buffer_mapping,1,sched_array)
S5    sched_array(2) = incremental_face_sched

           Figure 4: Inspector Code for Each Processor

Each processor passes to ifbuild_translation_table a list of the globally indexed elements for which that processor will be responsible (myvals in S1); the procedure returns a pointer to a distributed translation table that other PARTI procedures can consult to obtain the current location of any distributed array element.

The PARTI procedure flocalize (S2 in Figure 4) carries out the bulk of the preprocessing needed to produce the executor code depicted in Figure 3. We will first describe flocalize and then fmlocalize. Flocalize is passed:

(i) a pointer to a distributed translation table (translation_table in S2),
(ii) a list of globally indexed distributed array references (part_edge_list in S2), and
(iii) the number of those distributed array references (2*n_edge in S2).

Flocalize returns:

(i) a schedule that can be used in PARTI gather and scatter procedures (edge_sched in S2),
(ii) an integer array that can be used to specify the pattern of indirection in the executor code (local_edge_list in S2), and
(iii) the number of distinct off-processor references found in edge_list (n_off_proc in S2).

Figure 5: Flocalize Mechanism. (The figure shows part_edge_list being localized into local_edge_list, with off-processor references redirected into the buffer at the bottom of the data array and the corresponding off-processor data gathered into that buffer.)

A sketch of how the procedure flocalize works is shown in Figure 5. The array edge_list of Figure 2 is partitioned between processors; the portion assigned to a processor is passed to flocalize as part_edge_list, so part_edge_list on each processor is a subset of edge_list. We cannot use part_edge_list to index arrays on a processor directly, because it refers to globally indexed elements of x and y. Flocalize therefore produces local_edge_list, a copy of part_edge_list in which the references are changed so that they index the correct data: references to locally stored elements point to the on-processor locations in which those elements are kept, while off-processor references point into the buffer that begins immediately after the on-processor data (for array x, at x(n_on_proc+1)). Flocalize also returns a schedule (edge_sched) that can be used to gather a single copy of every distinct off-processor datum into the buffer. Once this gather has been carried out, the executor loop can be run using local_edge_list in place of edge_list, and every reference finds valid data.

The procedure fmlocalize (S4 in Figure 4) does the same work as flocalize but, in addition, makes it possible to remove off-processor references that are duplicated in a given set of pre-existing schedules. In Figure 4, fmlocalize obtains the incremental schedule incremental_face_sched, which includes only those off-processor data accesses of the face loop that are not already included in the edge schedule edge_sched (passed in via sched_array). The duplicates are removed using a hash table: the off-processor references covered by the pre-existing schedules are hashed, each off-processor reference encountered by fmlocalize is hashed as well, and only the new references are retained in the incremental schedule. A pictorial representation of the formation of an incremental schedule is given in Figure 6. In Section 5 we will present experimental results showing the usefulness of incremental schedules.

Figure 6: Incremental Schedule. (The figure shows the off-processor fetches in the sweep over edges forming the edge schedule, the overlapping off-processor data required by the sweep over faces, and the shaded region of duplicates that is removed to form the incremental schedule.)

To review the calling sequence, fmlocalize is passed:

(i) a pointer to a distributed translation table (translation_table in S4),
(ii) a list of globally indexed distributed array references (part_face_list in S4),
(iii) the number of those distributed array references (4*n_face in S4),
(iv) the number of pre-existing schedules (1 in S4), and
(v) an array of pointers to the pre-existing schedules (sched_array in S4).

Fmlocalize returns:

(i) an incremental schedule that does not take into account any of the data accesses encountered in the pre-existing schedules (incremental_face_sched in S4),
(ii) an integer array used to specify the pattern of indirection in the executor code (local_face_list in S4),
(iii) a list of integers that specifies the buffer locations associated with the off-processor accesses (buffer_mapping in S4),
(iv) the number of distinct off-processor accesses, including those found in the pre-existing schedules (n_off_proc_face in S4), and
(v) the number of distinct off-processor accesses not encountered in any pre-existing schedule (n_new_off_proc_face in S4).
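The following Fortran sketch (ours, not the PARTI source; for simplicity it assumes the locally owned elements are globals 1..nown rather than consulting a translation table) illustrates the contract of a flocalize-style routine: global references are rewritten as local or buffer indices, each distinct off-processor element gets exactly one buffer slot, and the list of distinct off-processor globals is what the schedule would describe.

c     Sketch of flocalize-style index localization (illustration only).
      program floc
      integer nref, nown, maxg
      parameter (nref=6, nown=4, maxg=100)
      integer part_edge_list(nref), local_edge_list(nref)
      integer sched(nref), slot(maxg)
      integer i, g, noff
      data part_edge_list /2, 9, 4, 9, 17, 1/
      do i = 1, maxg
         slot(i) = 0
      end do
      noff = 0
      do i = 1, nref
         g = part_edge_list(i)
         if (g .ge. 1 .and. g .le. nown) then
            local_edge_list(i) = g
         else
c           give every distinct off-processor global one buffer slot
            if (slot(g) .eq. 0) then
               noff = noff + 1
               sched(noff) = g
               slot(g) = nown + noff
            end if
            local_edge_list(i) = slot(g)
         end if
      end do
      write(*,*) 'local_edge_list:', (local_edge_list(i), i=1,nref)
      write(*,*) 'off-processor refs:', (sched(i), i=1,noff)
      end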

3.1.3  A Return to the PARTI Executor

We now return to the executor procedures of Figure 3 and to the distinction between dfscatter_add and dfscatter_addnc. The procedure dfmgather, already discussed in Section 3.1.1, uses the schedules in sched_array (edge_sched and incremental_face_sched) to fetch, in a single communication phase, one copy of every distinct off-processor element of x referenced in loop L1 or loop L2; the copies are placed in consecutive buffer locations beginning at x(n_on_proc+1).

Accumulations to y are handled differently in the two loops. In loop L1, the off-processor elements of y accessed through edge_sched are assigned consecutive buffer locations beginning at y(n_on_proc+1). Because the buffer locations are consecutive, dfscatter_add needs only the schedule (edge_sched) and the beginning of the buffer in order to accumulate each buffered value into the off-processor memory location where the corresponding element of y is stored. In loop L2, some of the off-processor elements of y have already been assigned buffer locations on account of loop L1, so we no longer have the freedom to assign consecutive buffer locations to all of the off-processor elements referenced in face_list. In this situation we use dfscatter_addnc (nc stands for non-contiguous). Dfscatter_addnc is passed a schedule (the incremental face schedule), the beginning of the buffer, and the integer array buffer_mapping returned by fmlocalize, which specifies the buffer location associated with each off-processor access. In both procedures the accumulation is carried out so that each off-processor memory location receives the contributions accumulated into its buffer copies without overwriting anything already stored there.
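Below is a toy Fortran sketch (ours; a single address space, so the "off-processor" accumulation is just an indexed add, and all names and values are invented) of the accumulate-with-buffer-mapping idea behind dfscatter_addnc: contributions sitting at arbitrary buffer positions are added into their destination elements.

c     Sketch: scatter-accumulate through an explicit buffer mapping.
c     Buffer copy k carries a contribution for global element gdest(k);
c     bmap(k) says where copy k actually sits in the buffer.
      program scadd
      integer ncopy, maxg
      parameter (ncopy=3, maxg=10)
      real*8 y(maxg), buf(ncopy)
      integer gdest(ncopy), bmap(ncopy)
      integer k
      data gdest /7, 9, 7/
      data bmap  /1, 3, 2/
      do k = 1, maxg
         y(k) = 0.0d0
      end do
      buf(1) = 1.5d0
      buf(2) = 2.0d0
      buf(3) = 0.5d0
c     accumulate each buffered contribution into its destination
      do k = 1, ncopy
         y(gdest(k)) = y(gdest(k)) + buf(bmap(k))
      end do
      write(*,*) 'y(7) =', y(7), '   y(9) =', y(9)
      end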

3.2  Mapper Coupler

In irregular problems, it is frequently advantageous to allocate computational work by assigning all computations that involve a given loop iteration to a single processor. We consequently partition both distributed arrays and loop iterations. A number of partitioning heuristics have been developed for this purpose (e.g. [41], [15], [5]); most of them produce a data distribution, and loop iterations are then assigned to processors based on the data each iteration references. When such partitioners are used in a manual parallelization, the programmer must write code that extracts the dependency structure of the problem, casts it into the input format expected by a specific partitioner, and then uses the partitioner output to remap the data and the loop iterations. Our approach is to let the compiler and the runtime support carry out this coupling, so that a program need not be written with any particular partitioner in mind: the partitioners accept a standardized dependency structure as input, and language extensions (described in Section 4.1) let users specify which partitioner is to be invoked.

We currently restrict ourselves to loops in which the distributed arrays are indexed through a single, loop-invariant level of indirection and in which the distributed arrays appearing on the left hand side of a statement conform in size; this restriction will be relaxed in later work. We do not assume that the arrays appearing in different loops are partitioned identically, although in many cases it is desirable to partition most of a program's distributed arrays in the same way.

To represent the dependency structure that couples distributed array elements, we associate with each statement S in a loop a statement runtime dependence graph, and we merge statement graphs to form a loop runtime dependence graph (RDG). The statement graph is generated as a bipartite graph (BRDG): an edge <i,j> is generated between index i of the distributed array appearing on the left hand side of S and index j of a distributed array appearing on the right hand side. Because the left hand side and right hand side arrays conform, we collapse the BRDG into an undirected graph whose vertices correspond to distributed array indices, and we then merge the statement graphs of all statements in the loop to form the loop RDG, adding the weights of edges that appear in more than one statement graph. Each time an edge <i,j> is encountered during this edge generation process, a counter associated with the edge is incremented, so the edge weights estimate the volume of communication that would be induced if i and j were assigned to different processors. Accumulation type output dependencies of the sort shown in Figure 2 do not induce extra communication in our scheme, so edges of the form <i,i> are ignored. The loop RDG is stored in a distributed data structure closely related to Saad's Compressed Sparse Row (CSR) format [30] (see [37]).

Data and loop iteration partitioning is then carried out as follows:

(i) At compile time, we generate code that, at runtime, produces the loop RDG for the loop identified by the user.

(ii) The loop RDG is passed to a data partitioner. The partitioner can use any heuristic it wishes; it is only required to return its result in a standard form, namely a distributed translation table that assigns each distributed array element to a processor. All distributed arrays associated with the loop are partitioned identically, in the manner described by this translation table. Were we to partition the arrays associated with a statement S so that the element written by S's left hand side resided on one processor while the elements it reads resided elsewhere, every execution of S would require interprocessor communication; partitioning the conforming arrays identically, using the RDG edge weights to minimize the number of edges that cross processor boundaries, is aimed at reducing communication while maintaining load balance.

(iii) Once the distributed arrays have been partitioned, the loop iterations are partitioned. For this purpose we use a runtime iteration graph (RIG), which associates with each loop iteration i all of the distributed array indices accessed during iteration i. The RIG, together with the new distributed translation table, is used to produce a runtime iteration processor assignment (RIPA), which lists the processor on which each loop iteration is to be executed. We currently employ a simple heuristic: each iteration is assigned to the processor associated with the largest number of the distributed array elements referenced in that iteration. Many other iteration partitioning strategies could be used; the RIPA simply records the outcome.

3.3  Runtime Support for Mapping

In this section we outline the runtime support needed to carry out the compiler-linked data and loop iteration partitioning described above; the sequence of calls is depicted in Figure 7 for loop L1 of Figure 2. In many cases the distributed arrays already have an initial distribution when we begin this preprocessing; the initial distribution may be a simple regular (e.g. blocked) distribution or an irregular distribution left over from an earlier mapping. The runtime support handles either situation, since any pre-existing mapping is itself described by a translation table.

The loop RDG is built in the following manner. We begin with a default distribution of loop iterations and of the indirection array: loop iterations are partitioned between processors in blocks, and the integer indirection array edge_list is partitioned so that if iteration i is assigned to processor P, edge_list(i) and edge_list(n_edge+i) are also stored on P (the methods needed to carry out this preprocessing are described in [45]). Each processor then sweeps over its local loop iterations (do i = 1, n_edge) and, for each iteration, passes the dependency edges (n1,n2) and (n2,n1) to the procedure eliminate_dup_edges. Eliminate_dup_edges stores the edges in a local hash table, merges duplicate edges and keeps a count of the number of times each edge has been encountered. Once all local edges have been recorded, the procedure edges_to_RDG combines the hashed edge lists to form the distributed loop RDG.

The loop RDG is passed to the procedure RDG_partitioner, which invokes the chosen data partitioner and returns a pointer to a distributed translation table describing the new data mapping. The RIG for the loop is then generated; the procedure deref_rig uses the new translation table to find, for each distributed array reference in the RIG, the processor to which that element has been assigned, and the procedure iter_partition uses this information to produce the RIPA and to partition the loop iterations. The data and the indirection arrays are then remapped accordingly.

Figure 7: Runtime Support for Deriving Data Distributions. (The figure depicts the sequence of calls for loop L1: a loop over the edges passes the dependency edges (n1,n2) and (n2,n1) to eliminate_dup_edges; edges_to_RDG forms the loop RDG; RDG_partitioner returns a distributed translation table; deref_rig and iter_partition then partition the loop iterations.)

4  Compiler Support

In this section we describe the language extensions which allow a programmer to implicitly specify how distributed arrays and loop iterations are to be partitioned between processors, and the compiler transformations used to carry out this implicitly defined work and data mapping.
mapping. The compiler transformations generatecode which embedsthe mapper coupler primitives describedin Section3.2. In addition weoutline compiler transformations needed to take advantageof the incrementalschedulingprimitives describedin Section3.1. 4.1

Compiler-Linked

4.1.1

Problem

Mapping

Overview

The

current

define

Fortran

irregular

data

In Figure The

array

D syntax

map is used

(statement

real

arrays

to specify

reg (statement $6).

irreg is determined

by values

assigned

indicates

x(100)

both

difficulty

would pattern

to the

are

program

available

problems.

not

rich

enough

Our

coupler Figure

explicitly

coupler,

on the

dependency

all arrays

listed

loop in question

such

to the

linear

a nest

systems

of loops

From

the

loop

gives

is 10, this

the

how one

distribution The

Fortran-

of the

map array

of partitioning from

scratch

the partitioners

typically

heuristics can represent

and

operate

(e.g.

solvers,

L that

map(100)

generation

a wealth

programmer

blocks

of decomposition

a partitioner.

partitioners

literature

processors

it is not obvious

which

the

between

the

with

the

on data

meshes

different structures

in finite

difference

etc).

involves

each

L, we produce

at

irregularly compile

distributed

time

a mapper

from the

the

sequential

code

code

in Figure

2.

in Figure

2. The

To simplify

code

presentation,

in Figure only

8 contains

L1 is depicted

8.

statement

mapper

in sparse

are

$5).

10.

array

to couple

(statement

3.2)

L2 from

in Figure

We use

is known

to partition.

8 is derived

L1 and

explicitly

map is aligned

value

1 is that

map

there

in the

if the

by running

interface

described

is to identify

(see Section

The

user

While

is no standard

matrices

we will need

the

process.

interpretation

approach

array.

array

distribution

to processor

separately

for

the

in Figure

[41], [15], [5]), coding

There

sparse

to

irreg

by among

For example,

are assigned

distribute

Integer

$7 is that

depicted

to be generated

partitioners

physical

programmers

decomposition

reg is distributed

to map.

y(100)

the

of irreg.

statement

has

effort.

then

declarations

compilation

The

equations,

loops

the

(see for instance

a significant

whose

and

the irregularly

of irreg

D constructs

array

with

partition

2.1 requires

y with

distribution

$4) and of the

The

The

x and

the

meaning

that

in Section

decompositions.

1, we align

decomposition

outlined

$4 to designate

implicitmap(x,y) relations

L1 as the

indicates

between

in an implicitmap parallelizes,

loop

except

that

distributed statement for possible

19

that

an RDG arrays

are

loop

x and

will be used

graph

is to be generated

type

distributed output

and

a

based

y in loop L1. We assume

to be identically accumulation

to generate

that

dependencies

that the (If

,°,°

real*8

x(N),y(N)

decomposition

coupling(N)

S1 if(remap.eq.yes) $2

then

distribute

coupling(implielt

using

edges)

endif $3

align

x,y with

coupling

,°°,

$4

implicitmap(x,y)

C

Loop

L1

do i=l,n_edge

over

edges

edges involving

x, y

nl = edge_list(i) n2 = edge_list(n_edge+i) S1 y(nl) $2

= y(nl)

y(n2)-end

y(n2)

+ ...x(nl)...

x(n2)

+ ...x(nX)...

x(n2)

do

,,.,

L2

Loop

over

faces

involving

Figure

x, y

8: Example

of Implicit

2O

Mapping

the compiler

cannot

In many the

determine

codes

RDG

used

over

the

original

mesh

topology.

It is easy implicit based

code. the

using

will be used implicit

the

all relevant

in the

In our same

V that

procedure

there

using

any

implicit

the

information calls

pointer

to this RDG

Loop

hash

RIPA.

array.

RIG. The

table

are

distributed the

Section

members

RIPA

in turn

an

is generated

case,

how

to any

the

arrays

of the

flow

variables

in the same

between

a transformed

loop

L' is generated.

partitioner, at runtime

whenever

3.2,

is partitioned

iter_partition.

21

cases,

the

and

case,

indexing

L' contains loop

a hash

procedure

RDG

(see

table.

A

produces

RDG_partitioner.

to each

in Section

local

produces This

determine

as L. In this

L is identified

edges_to_RDG.

distribute

In many

When

eliminate_dup_edges

determine

can

analysis.

the

using

all variables

must

analysis

that

is located

identify

procedure

to generate

arrays

implicit

statement

V.

...

to determine

in L and

data

loop

distribute

...

be killed

code,

in this

distributed

to be able

must

in the

in the

statement

compiler

standard

embedded

pattern

distribution

of V could

to procedure

Corresponding

As described

The

are

distribute

implicit

will be needed

to a data

partitioned

when

loop.

3.2 that

is passed

executable

of distributed

to L is obtained,

from

the

we specify

loop RDG

indirection

We need

not be placed

that

so that

is encountered

to anticipate

8, the

made

might

The

for the

of interprocedural

is passed

iterations

any

been

L1 represents

$4 recaptures

primitives

using

be predicted

L. In this

have

results

Recall

which

generates the

loop

so that

loops.

user.

encountered.

specified

that

here

a multiple

compiler

be able

functions

to elirninate_dup_edges 3.2).

ularly

chance

pertaining

Section

loop

user

statement

the

implicit

(Figure

subscript

assignments

we will require

...

we must

example

and

distribution

the

by the

in L can

as the

is any

implicit

whether

the

simple

the

how

In order

L is next

patterns

determine

whether ...

when

reference

executes.

sense,

In this case,

of loops

8, loop

statement

described

from

a nest

in Figure from

is reported).

Primitives

L specified RDG.

instance,

obtained

arising

distribute

to make

in L will be indexed,

one loop.

an error

we can specify

extensions

8 to show

the

For

RDG

Coupler

loop

are valid,

problems,

mesh.

patterns

statement

to generate

using

based

The

than

in Figure

locates

assumptions

language

Mapper

the

compiler

the

dependency

example

When

mesh

original

more

Embedding

these

of a mesh.

to generalize

on merged

We use the

the

edges

mapping

4.1.2

to solve

will represent

a sweep

that

such

a loop loop

the

RIG

using

the

accesses

L is generated

at least a loop

one

irreg-

L" which

is passed

to deter_rig

to produce

iteration

partitioning

procedure,

a

4.1.3

Inspector/Executor

Generation for Incremental

Scheduling

Inspectors and executorsmust be generatedfor loops in which distributed arrays are accessedvia indirection. Inspectors and executorsare also neededin most loops that access irregularly distributed arrays. In this section we outline what must be done to generate distributed memory programs which makeeffectiveuseof incrementaland non-incremental schedules.Most of what wedescribeis asyet unimplemented,although we haveconstructed and benchmarkeda simple compiler capableof carrying out local transformations to embed non-incrementalschedules.This work is describedin [45]. We first outline what must be done to generate an inspector and an executor for a program loop L. We assumethat dependencyanalysishas determined that L either has no loop carried dependencies,or has only the simple accumulation type output dependencies.of the sort exemplifiedin Figure 2. It shouldbe notedthat the calling sequences of the compilerembeddablePARTI primitives differ somewhat from the primitives describedin Section3. The functionality describedin primitives flocalize and fmlocalize are each implemented as a larger

set of simpler

We scan

primitives.

through

the

distributed

or are

for a given

distributed

reference the

the

along

carried

out

functions

within

loop.

carried

can be hoisted

pattern.

in which

a loop,

For instance,

indexing

code consider

y(n2)

z(n2) do

22

are

array's

a compiler

n2 = nde(2*i-1)

end

the

subscript

We must

patterns

can

check

produces the

as the

loop:

irregularly a schedule

function

methods

pattern

of the sure

that

described

by computations is not modified

preprocessing

a representation

following

are

to make

modified

indexing generate

,At that

to generate

of ,4 are loop invariant

nl = nde(2*i)

......

needed from

distribution.

As long as a distributed

out within

arrays

Information

do i=l,n

.. = x(nl)..,

of distributed

can be produced

of an array's

cases

set

indirection.

out of L. This preprocessing

indexing

the

of all members

address

the

find

reference,

knowledge

do not

and

using

array

with

by computations

array's

indexed

subscript

in this paper

loop

code

that

of the distributed

The subscript function of y and z (using notation from the Fortran 90 array extensions) is nde(2:2*n:2), and the subscript function of x is nde(l:2*n-l:2). Recall from Section2.2, that schedulesspecify communication patterns and are not bound to a specific distributed array.

We can

avoid

communication and

pattern

z in the

a single

having

above

schedule

will reoccur

loop

are

to bring

Optimizations

to compute

that

reduce

impact

on storage

requirements.

should

be reasonably

The

out

distributed

schedules

have

can

of previously

processor dependencies. that

are Thus

off-processor

L. The first some

full

the

mation

possible have

computed that

been

modified

Analysis duplicate about

the

must data

the

program

subscript

statement

data

the

be

last

call

carried

behavior. S for which

we must

that

loop

iteration

ensure

L had

to a single

entering

of incremental

in two

In order

off-

carried ensures

to be valid

passes.

schedules.

schedules

the

processor

incremental

which

that

no loop

L will continue out

to make

the

in

A compiler second

pass,

to replace will have

a full already

L.

executors

for loop

When

L is called

time

L requires

us to obtain

multiple

times

we attempt

L is called,

we need

to determine

distributions

in the

set of distributed

inforto reuse

whether

it is

arrays

.A

to L.

out

if we are

to use

loops.

We need

between Consider

a

retransmission

In order

During

or loop

a favorable elimination

use

full schedules.

and

has

to avoid

5, proper

elements,

to know

time required

[45] we describe

for L with

within

Each

In

it possible

be carried

we need

functions

since

can

if y

to compute

subexpression

on communication.

loop

same

manner.

in Section

before

only

also

schedules.

we assumed

the

instance,

the preprocessing schedules

make

that

For

we need

of common

spent

each

with

as a whole.

communications

program

executor

schedule,

schedules.

also

that

schedules

inspectors

reduce

array

to assign

of off-processor

a program

time

immediately

and

of efficient

about

on the

will be replaced

storage

previously

obtained

an incremental

Generation

show

Recall

decision

an inspector

with

caused

our

As we wiIl

we know

z.

in a rudimentary

of distributed

valid.

manner,

redundant

3.1),

of incremental

schedules

schedule

still

data

generation

generates

copies

place

of y and

(Section

effect

in a loop.

modifications

optimization

a marked

one

of redundant

in identifying

schedules,

stored

copies

elimination

array.

when

of schedules

Minor

this

use of incremental

of unchanged

use

the

effective

carries

elements

the number

Obviously,

than

schedules

in a conforming

in off-processor

inspector.

that

in more

partitioned

by the

compiler

redundant

a right

we would

hand

incremental rather

side reference

schedules

comprehensive to distributed

like to use an incremental

schedule.

to eliminate information array

x in

We will need

to know when

off-processor

data

copies

of values

of x become

and

23

invalidated

by new

assignments,

which communicationsscheduleswill havealready been invokedby the time we reach X_

Methods

exist which

both

of these

(e.g.

[13],

and

objectives [10])

control

program Each

have

is a directed

predicates

time

is used,

the

off-processor

data

type

in a program

of edge

reuse

edge.

program

Using

that

In ongoing of slicing results

joint

5

work

methods

with

which

work

the

PARTI

Euler

of dependence

and

graph.

represent

We will call

[43], [23] we can find

values

of the

Kennedy's

distributed

group

will allow

elements

as part

kind

use

of the

schedules this

as a specific

of dependence and

reference

edge

predicates

a

of a

to x in statement developing

of incremental

Fortran

available.

We can view

dependence

we are currently the

dependence.

which

data.

all statements

array

at Rice,

us to automate

this

among

become

to know

this

statements

or a data

off-processor

graph

dependences

dependence array

reusable

assignment

represent

for x at S, we need

of potentially

will be implemented

S.

a variant

schedules.

D compiler

being

The

developed

a given

Results

[31].

mesh.

5 Experimental Results

In this section, we present experimental data obtained on the Intel iPSC/860. Section 5.1 quantifies the effect of incremental schedules on the communication costs of a 3-D unstructured mesh Euler solver, and Section 5.2 presents timings for the mapper coupler primitives.

5.1 Euler Solver

The PARTI primitives were used to port a 3-D unstructured mesh Euler solver [31] to the iPSC/860. The solver iterates over the vertices and edges of an unstructured mesh until a steady state solution is computed. It was tested using unstructured meshes of varying sizes: the smallest mesh had 3.6K vertices, and the largest had 210K vertices and 1.2 million edges. Figure 9 depicts a surface view of the largest mesh. The meshes were partitioned using the method described in [41]. On a single node of the iPSC/860 the code ran at approximately 4 Megaflops; we conjecture that the single node performance is relatively poor because the code has the kind of irregular data access patterns that appear in many sparse and unstructured scientific computations.

Two parallelized versions of the Euler solver were generated. One version used only full, non-incremental schedules produced by the flocalize procedure (Section 3.1); the other used incremental schedules produced by the fmlocalize procedure (Section 3.1.2). Timings obtained with non-incremental schedules are shown in Table 1, and timings obtained with incremental schedules are shown in Table 2.

    Mesh                          Number of Processors
    Size                      1        2        8       16       64
    3.6K   Mflops            4.1      6.0     12.0     14.4      -
           Time/iter(s)      4.6      3.1      1.5      1.3      -
           comm/iter(s)       -       0.5      0.9      0.9      -
    26K    Mflops             -        -      19.2     29.9      -
           Time/iter(s)       -        -       7.1      4.5      -
           comm/iter(s)       -        -       2.3      2.0      -
    210K   Mflops             -        -        -        -     118.6
           Time/iter(s)       -        -        -        -       8.4
           comm/iter(s)       -        -        -        -       3.7

Table 1: Timings from Intel iPSC/860 : Unstructured, Irregular Mesh Using Non-Incremental Schedules

Both parallelized versions of the Euler solver are virtually identical to the sequential code; the parallelization did not require us to introduce anything new beyond calls to the PARTI primitives. When we compared the sequential code with the parallelized code running on a single processor, we found only a 2 % performance degradation. The Euler solver typically requires at least 100 iterations to converge, so the cost of all preprocessing was less than 3 % of the total execution time. The larger meshes could not be run on small numbers of processors because of memory limitations.

The cost of communication on the iPSC/860 is very high relative to the cost of a floating point operation on the Intel 80860 node processor, and the irregular data references in the Euler solver make it difficult to keep communication costs down; we did not employ any form of communication scheduling beyond that supplied by the primitives. Using incremental schedules had a significant impact on communication costs. For instance, on the 210K mesh the communication cost per iteration on 64 processors dropped from 3.7 seconds with non-incremental schedules to 2.3 seconds with incremental schedules, and on the 26K mesh the communication cost per iteration on 16 processors dropped from 2.0 seconds to 1.1 seconds.

5.2 Mapper Coupler

In this section, we present timing results for the mapper coupler primitives described in Section 3.2. The measurements were made on the Intel iPSC/860 using unstructured mesh problems of three sizes; the results are summarized in Table 3.

    Mesh                          Number of Processors
    Size                      1        2        8       16       64
    3.6K   Mflops            4.1      7.1     16.9     17.4      -
           Time/iter(s)      4.6      2.6      1.1      1.1      -
           comm/iter(s)       -       0.3      0.5      0.7      -
    26K    Mflops             -        -      23.8     38.8      -
           Time/iter(s)       -        -       5.6      3.4      -
           comm/iter(s)       -        -       1.1      1.1      -
    210K   Mflops             -        -        -        -     144.3
           Time/iter(s)       -        -        -        -       7.1
           comm/iter(s)       -        -        -        -       2.3

Table 2: Timings from Intel iPSC/860 : Unstructured, Irregular Mesh Using Incremental Schedule

    Number of                            Number of Processors
    Vertices                          2        4        8       16       32       64
    3.6K   graph generation (secs)   0.94     0.57     0.42     0.34      -        -
           mapper (secs)             0.34     0.20     0.21     0.24      -        -
           partitioner (secs)       15.92    11.50    12.11    14.92      -        -
           comp/iter (secs)          2.4      1.31     0.6      0.34      -        -
    9.4K   graph generation (secs)    -       1.19     0.82     0.60     0.43      -
           mapper (secs)              -       0.86     0.69     0.53     0.35      -
           partitioner (secs)         -      70.96    62.3     65.2     89.7       -
           comp/iter (secs)           -       4.83     2.35     1.1      0.67      -
    54K    graph generation (secs)    -        -        -        -       1.50     0.94
           mapper (secs)              -        -        -        -       3.30     3.03
           partitioner (secs)         -        -        -        -     544.81   673.14
           comp/iter (secs)           -        -        -        -       6.06     3.81

Table 3: Mapper Coupler Timings from Intel iPSC/860

Figure 9: Surface View of Unstructured Mesh Employed for Computing Flow over ONERA M6 Wing; number of nodes = 210K

Table 3 shows that the costs incurred by the mapper coupler primitives were roughly on the order of the cost of a single iteration of our unstructured mesh code, and that the mapper coupler costs are quite small compared to the cost of partitioning the data. In Table 3, graph generation depicts the time required by the mapper interface to generate the runtime dependence graph (RDG) described in Section 3.2. This time includes the time needed to loop over the mesh edges, in a loop equivalent to loop L1 in Figure 2, and the time required to call eliminate_dup_edges; this loop can be carried out with high parallel efficiency. The mapper time gives the time needed to call edges_to_RDG (Section 3.3). The partitioner time depicts the time needed to partition the RDG using a parallelized version of Simon's eigenvalue partitioner [41], which was employed here as a mapper; it should be noted that only a modest effort was made to produce an efficient parallel implementation of this partitioner. Finally, comp/iter gives the time required for a single iteration of the parallelized Euler loop on a mesh partitioned among the same number of processors. Timings are shown for different problem sizes and numbers of processors.
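As an illustration of the duplicate-edge removal step mentioned above, the C sketch below filters the edge list produced by a loop over mesh edges through a small hash table so that each undirected edge contributes a single entry to the graph handed to the partitioner. It does not reproduce the actual eliminate_dup_edges interface; every identifier in it is hypothetical, and it is only a minimal sketch of the underlying idea.

    /* Illustrative sketch only; not the eliminate_dup_edges primitive. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NBUCKET 1024

    typedef struct node { int u, v; struct node *next; } node;

    /* Insert the undirected edge (u, v); return 1 if it was new, 0 if a
     * duplicate.  Edges are normalized so (5, 2) and (2, 5) collide. */
    static int add_unique(node **table, int u, int v) {
        if (u > v) { int t = u; u = v; v = t; }
        unsigned h = ((unsigned)u * 31u + (unsigned)v) % NBUCKET;
        for (node *p = table[h]; p; p = p->next)
            if (p->u == u && p->v == v) return 0;      /* already present */
        node *p = malloc(sizeof *p);
        p->u = u; p->v = v; p->next = table[h]; table[h] = p;
        return 1;
    }

    int main(void) {
        /* edges as produced by the loop; (2,5) occurs twice, once reversed */
        int edges[][2] = { {1, 2}, {2, 5}, {5, 2}, {3, 4} };
        node *table[NBUCKET] = { 0 };
        int kept = 0;
        for (int i = 0; i < 4; i++)
            kept += add_unique(table, edges[i][0], edges[i][1]);
        printf("unique edges kept for the graph: %d\n", kept);   /* prints 3 */
        return 0;                 /* allocations reclaimed by the OS at exit */
    }

Because each processor can apply such a filter to its own portion of the edge list before the pieces are combined, this step parallelizes naturally, which is consistent with the graph generation times in Table 3 decreasing as processors are added.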

6 Conclusions

Programs designed to carry out a range of irregular computations, including sparse iterative methods, particle calculations, and computations over unstructured or adaptive meshes, require runtime preprocessing before the work and data they define can be partitioned among the processors of a distributed memory machine. There are a variety of programming environment and compiler projects targeted at distributed memory multiprocessors, including [1], [8], [14], [18], [19], [20], [24], [29], [32], [34], [36], [42], and [46]; compilation methods aimed at sparse and irregular problems include [2], [4], and [7,27,26,28]. Williams [44] describes a programming environment (DIME) for calculations with 2-D or 3-D unstructured triangular meshes using domain decompositions, and Baden [3] has developed a programming environment targeted towards particle computations; both of these environments provide facilities that support dynamic load balancing. DecTool [9] is an interactive environment for either automatic or manual decomposition of problem domains. The Kali project [25] was the first compiler project to provide inspector/executor type facilities for loops that access irregularly distributed arrays. Runtime preprocessing of the type described in this paper [38], [39] is also employed to support the Fortran D project [21] and the ARF compiler [45]; this paper presents a more comprehensive treatment of several of the optimizations first described in [45].

This paper has outlined how a compiler, coupled with runtime support, can carry out the preprocessing required to reduce the interprocessor communication caused by off-processor data accesses. We first presented the design of two new kinds of runtime support: support for tracking and reusing copies of off-processor data, and support for coupling runtime data partitioners to the compiler so that workload and data can be partitioned dynamically. This runtime support has been implemented in the form of PARTI primitives. We then described the compiler transformations needed to embed the primitives, in particular the transformations that eliminate redundant off-processor data accesses by using incremental rather than full schedules.

We demonstrated the performance of these methods by embedding the PARTI primitives by hand in a 3-D unstructured mesh computational fluid dynamics code and running the resulting code on the Intel iPSC/860. The use of incremental schedules to eliminate redundant off-processor data copies had a marked effect on communication costs. The overheads incurred by the primitives, compared with carrying out the same communication directly with Intel send and receive calls, were modest (no more than 20 %). The costs of the mapper coupler calls were roughly on the order of the cost of a single iteration of the unstructured mesh code, and were quite small compared with the cost of the partitioner itself. In [6] we presented measurements which indicate that the overheads associated with this type of runtime support appear to be quite small. We believe that the runtime support and compiler methods described here will also prove useful for many other irregular and adaptive computations.

We have joined forces with the Fortran D group in compiler development and are implementing the methods described in this paper in the context of Fortran D in cooperation with Kennedy's group at Rice. The non-incremental PARTI primitives described in Section 3.1 are available for public distribution and can be obtained from netlib or from the anonymous ftp site ra.cs.yale.edu. The incremental PARTI primitives and the mapper coupler primitives described in Section 3.2 will be released soon and will be available through the same sources.

Acknowledgements

The authors would like to thank Geoffrey Fox for many enlightening discussions about universally applicable partitioners for irregular problems, and Ken Kennedy, Chuck Koelbel and Seema Hiranandani for many useful discussions about integrating this runtime support into Fortran-D. The authors would also like to thank: Dennis Gannon for the use of his Faust system; Horst Simon for the use of his unstructured mesh partitioning software and for his help in getting us started with it; and Venkatakrishnan for useful suggestions about low level communications scheduling.

References

[1] F. Andre, J.-L. Pazat, and H. Thomas. PANDORE: A system to manage data distribution. In International Conference on Supercomputing, pages 380-388, June 1990.
[2] C. Ashcraft, S. C. Eisenstat, and J. W. H. Liu. A fan-in algorithm for distributed sparse numerical factorization. SISSC, 11(3):593-599, 1990.
[3] S. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. SIAM J. Sci. Statist. Comput., to appear, 1991.
[4] D. Baxter, J. Saltz, M. Schultz, S. Eisentstat, and K. Crowley. An experimental study of methods for parallel preconditioned krylov methods. In Proceedings of the 1988 Hypercube Multiprocessor Conference, Pasadena, CA, pages 1698-1711, January 1988.
[5] M. J. Berger and S. H. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans. on Computers, C-36(5):570-580, May 1987.
[6] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory architectures. Concurrency: Practice and Experience, 3(3):159-178, June 1991.
[7] M. C. Chen. A parallel language and its compilation to multiprocessor machines or VLSI. In Proceedings of the ACM Symposium on Principles of Programming Languages, January 1986.
[8] A. Cheung and A. P. Reeves. The paragon multicomputer environment: A first implementation. Technical Report EE-CEG-89-9, Cornell University School of Electrical Engineering, Computer Engineering Group, September 1989.
[9] N. P. Chrisochoides, C. E. Houstis, E. N. Houstis, P. N. Papachiou, S. K. Kortesis, and J. R. Rice. Domain decomposer: A software tool for mapping PDE computations to parallel architectures. Report CSD-TR-1025, Computer Science Department, Purdue University, 1990.
[10] K. Cooper and K. Kennedy. Interprocedural side-effect analysis in linear time. In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation, SIGPLAN Notices 23(7), pages 57-66, July 1988.
[11] Thinking Machines Corporation. CM Fortran reference manual, version 1.0. Thinking Machines Corporation, February 1991.
[12] R. Das, J. Saltz, and H. Berryman. A manual for PARTI runtime primitives - revision 1 (document and PARTI software available through netlib). Interim Report 91-17, ICASE, 1991.
[13] J. Ferrante, K. Ottenstein, and J. Warren. The program dependence graph and its use in optimization. ACM TOPLAS, 9(3):319-349, July 1987.
[14] I. Foster and S. Taylor. Strand: New Concepts in Parallel Programming. Prentice-Hall, Englewood Cliffs, NJ, 1990.
[15] G. Fox. A graphical approach to load balancing and sparse matrix vector multiplication on the hypercube. In Martin Schultz, editor, Numerical Algorithms for Modern Parallel Computer Architectures, Volume 13 of the IMA Volumes in Mathematics and its Applications. Springer-Verlag, 1988.
[16] G. Fox and W. Furmanski. Load balancing loosely synchronous problems with a neural network. In Third Conference on Hypercube Concurrent Computers and Applications, pages 241-278, 1988.
[17] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Report COMP TR90-141, Department of Computer Science, Rice University, December 1990.
[18] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors, Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[19] H. M. Gerndt. Automatic parallelization for distributed memory multiprocessing systems. Report ACPC/TR 90-1, Austrian Center for Parallel Computation, 1990.
[20] P. Hatcher, A. Lapadula, R. Jones, M. Quinn, and J. Anderson. A production quality C* compiler for hypercube machines. In 3rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 73-82, April 1991.
[21] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler support for machine-independent parallel programming in Fortran D. In J. Saltz and P. Mehrotra, editors, Compilers and Runtime Software for Scalable Multiprocessors, Elsevier, Amsterdam, The Netherlands, to appear, 1991.
[22] S. Hiranandani, J. Saltz, P. Mehrotra, and H. Berryman. Performance of hashed cache data migration schemes on multicomputers. Journal of Parallel and Distributed Computing, 12, August 1991.
[23] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM TOPLAS, 12(1):26-60, January 1990.
[24] K. Ikudome, G. Fox, A. Kolawa, and J. Flower. An automatic and symbolic parallelization system for distributed memory parallel computers. In DMCC5, Charleston, SC, pages 1105-1114, April 1990.
[25] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 177-186. ACM SIGPLAN, March 1990.
[26] J. Li and M. Chen. Generating explicit communication from shared-memory program references. In Proceedings of Supercomputing '90, November 1990.
[27] J. Li and M. Chen. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In Proceedings of the 3rd Symposium on the Frontiers of Massively Parallel Computation, October 1990.
[28] J. W. H. Liu. Computational models and task scheduling for parallel sparse cholesky factorization. Parallel Computing, 3:327-342, 1986.
[31] D. J. Mavriplis. Three dimensional unstructured multigrid for the Euler equations. AIAA paper 91-1549cp, AIAA 10th Computational Fluid Dynamics Conference, June 1991.
[32] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and Kay Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, pages 140-152, July 1988.
[33] M. J. Quinn and P. J. Hatcher. Data-parallel programming on multicomputers. IEEE Software, pages 69-76, 1990.
[34] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proceedings of the ACM SIGPLAN 89 Conference on Programming Language Design and Implementation. ACM SIGPLAN, June 1989.
[35] M. Rosing and R. Schnabel. An overview of Dino - a new language for numerical computation on distributed memory multiprocessors. Technical Report CU-CS-385-88, University of Colorado, Boulder, 1988.
[36] M. Rosing, R. W. Schnabel, and R. P. Weaver. Expressing complex parallel algorithms in Dino. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers and Applications, pages 553-560, 1989.
[37] Y. Saad. Sparsekit: a basic tool kit for sparse matrix computations. Report 90-20, RIACS, 1990.
[38] J. Saltz, H. Berryman, and J. Wu. Multiprocessors and runtime compilation. Concurrency: Practice and Experience, 1991.
[39] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8:303-312, 1990.
[40] J. Saltz, R. Das, R. Ponnusamy, D. Mavriplis, H. Berryman, and J. Wu. Parti procedures for realistic loops. In Proceedings of the 6th Distributed Memory Computing Conference, Portland, Oregon, April-May 1991.
[41] H. Simon. Partitioning of unstructured mesh problems for parallel processing. In Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications. Pergamon Press, 1991.
[42] P. S. Tseng. A Parallelizing Compiler for Distributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, May 1989.
[43] M. Weiser. Program slicing. IEEE Trans. on Software Eng., SE-10(4):352-357, July 1984.
[44] R. D. Williams and R. Glowinski. Distributed irregular finite elements. Technical Report C3P 715, Caltech Concurrent Computation Program, February 1989.
[45] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation methods for multicomputers. In Proceedings of the 1991 International Conference on Parallel Processing, pages II-26 - II-30, 1991.
[46] H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.
