NASA/CR-1998-207636
ICASE Interim Report No. 31

Parallel PAB3D: Experiences with a Prototype in MPI

Fabio Guerinoni
Virginia State University

Khaled S. Abdol-Hamid
Analytical Services & Materials, Inc.

S. Paul Pao
NASA Langley Research Center

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center, Hampton, Virginia
Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-2199

April 1998

Prepared for Langley Research Center under Contract NAS1-19480

Available from the following:

NASA Center for AeroSpace Information (CASI)
7121 Standard Drive
Hanover, MD 21076-1320
(301) 621-0390

National Technical Information Service (NTIS)
5285 Port Royal Road
Springfield, VA 22161-2171
(703) 487-4650

PARALLEL PAB3D: EXPERIENCES WITH A PROTOTYPE IN MPI

FABIO GUERINONI*, KHALED S. ABDOL-HAMID†, AND S. PAUL PAO‡

Abstract. PAB3D is a three-dimensional Navier-Stokes solver that has gained considerable acceptance in the research and industrial communities. It takes as input a general structured grid covering the physical domain, defined as a set of disjoint blocks, together with a description of how the blocks interface ("patching"). This report describes the first implementation of a prototype of PAB3D using the Message Passing Interface (MPI), a standard for communication in parallel processing. We briefly discuss the characteristics of the code and define a simple data structure for MPI communication (COMMSYS), derived from the block-cell connectivity produced by the preprocessing. We identify some levels of improvement and discuss general techniques for problems likely to be encountered when working on codes of this nature. Last, we outline directions for future work.

Key words. Navier-Stokes solver, structured meshes, broadcasting, point-to-point communication, Message Passing Interface (MPI)

Subject classification. Computer Science

* Department of Mathematics, Virginia State University, P.O. Box 9068, Petersburg, VA 23806. This research was supported by the National Aeronautics and Space Administration under Contract No. NAS1-19480 while the author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.
† Analytical Services & Materials, Inc., Hampton, VA 23666.
‡ Configuration Aerodynamics Branch, NASA Langley Research Center, MS 499, Hampton, VA 23681-0001.

1. Introduction. Parallel processing has gained wide acceptance as a computational tool in the research and industrial communities over the last decade. In the late 80's, a number of large codes were written for, or ported to, shared-memory multiprocessors such as the Cray Y-MP and the NEC SX-4. Shared memory simplified the task of the programmer, since the system implicitly takes care of much of the communication between tasks. More recently there has been a trend of switching to distributed-memory systems, such as the Intel Paragon, and to clusters of workstations. On these systems an application runs as a number of more or less independent tasks, and the programmer has to take into account not only the computation but also the communication and the I/O, which become explicit and considerably more complicated. The trend has been driven by the cost of shared-memory machines and by issues of scalability as the number of processors grows. Message-passing libraries such as PARMACS and PVM emerged to support this form of processing; the Message Passing Interface (MPI) has since become the standard, and an extension, MPI-2, has been defined.

As a consequence, parallel processing of large codes is now widespread in the aerospace industry [5]. Examples showing the availability of parallel versions of production codes are Pratt & Whitney's NASTAR [6] and ENS3D, besides many others that can be found in conference proceedings.



Our first goal was to have a well-defined, working parallel version of a prototype of PAB3D, chosen primarily for the simplicity of its development, running on a number of processors, and to gain the experience needed for continuing work on the full code. The prototype is purposely of limited functionality, but it must be of realistic size and exercise the essential options of the code. Issues such as "load balancing" or "speed-up" are necessarily left out at this stage, as there is still a significant amount of work to be done before the parallel version achieves the full functionality of the sequential code. As it turns out, the key issue is the use of a data structure for communication, which we call COMMSYS (for COMMunication SYStem), derived from the block-cell connectivity produced by the preprocessing. COMMSYS has proven very useful in our particular case, and the techniques described here, although necessarily somewhat rough-hewn, are in our opinion correct and extend to other applications of this nature.

The report is organized as follows. Section 2 gives a brief description of PAB3D and of the prototype problem used for testing. Section 3 describes the principal decisions taken for the parallel implementation, and Section 4 reviews the MPI concepts used. Section 5 is the core of the report; in it we describe the structure of the MPI implementation and the communication subsystem. We conclude with the current status, the lessons learnt, and suggestions and directions for future work.

2. Description of Code.

2.1. Brief Characteristics. PAB3D (currently in its version 13) is a three-dimensional Reynolds-averaged Navier-Stokes (RANS) solver developed in 1986 by Khaled S. Abdol-Hamid. It was initially designed for supersonic jet exhaust flow analysis; after many enhancements it has become a general code for aerodynamic and propulsion configurations [1]. The code has several integrated turbulence models and a multi-block/multi-zone capability. We will restrict ourselves here to the following key characteristics:

1. Treatment of Convection terms: upwinding is used, among whose variants it is possible to choose:
   • Roe
   • van Leer flux splitting
   • van Leer implicit

2. Limiters, required to prevent oscillations in high order methods near shocks. A number of limiters are incorporated in the code:
   • Van Albada
   • Sweby's
   • S-V (Spekreijse-Venkat)
   • Modified min-mod S-V

3. Treatment of Viscosity terms: is as usual centrally approximated. There is a wide choice for the cross terms:
   • j-thin layers
   • k-thin layers
   • jk-uncoupled
   • jk-coupled

4. Turbulence model: Two-equation k-epsilon models and a number of algebraic Reynolds-stress models have been implemented, among them:
   • Shih-Zhu-Lumley
   • Gatski-Speziale
   • Girimaji
Other important features of the turbulence part include a compressibility correction and "effective" viscosity methods, from which boundary parameters are computed.

The code also has the ability to deal with real gas equations of state and with non-reacting multi-species flows, and it is possible to specify several other boundary conditions. A schematic view of the code and its relation with other programs and files is shown in Figure 2.1; a detailed description of PAB3D, with emphasis on the turbulence models, can be found in [3, 2].

[Figure: the PATCHER preprocessor, the connectivity file, the boundary and connectivity database, the initial conditions and solution expansion, and the solver input files (restart file, grid file, connectivity file, initial-condition file, control file).]

FIG. 2.1. Use of PAB3D, Version 13, and related programs and files.

2.2. Patching. The most significant improvement to the code occurred in 1990 with the introduction of "conservative patching" [4]. This allowed the creation of new cells at the block interfaces: a group of such cells is a patch, called a piece in this report. The space-integrated fluxes at the interfaces are computed and stored in arrays so that the exchange of information between blocks is done conservatively. Later, the patching databases were expanded and improved. Version 13 of the code (current) contains three databases:

• IPTF(block,blockinfo): the global patch (block) database. It provides, for each block of the structured grid and after the patching is done, the information involved in the exchange: the number of pieces, the list of local pieces, and the list of adjacent pieces.

• IPCB(piece,pieceinfo): the piece database. Depending on the value of pieceinfo, it provides the number of cells in the piece, the face/block to which the piece belongs, the adjacent face/block, and the amount of overlapping of cells, with ratios chosen so that the space-integrated fluxes are obtained conservatively.

• IPCBL(block,face,localpiece): the local-to-global mapping. A face in a block may contain several global pieces, each identified here by a local number; this array provides the corresponding global number, which can be used to access the piece database if required.

These are the only arrays involved in the communication part of the parallel system, as explained in Section 5.
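As an illustration only (the array bounds and the lookup routine below are placeholders, not the actual PAB3D declarations), tables of this kind are naturally carried as integer arrays, and the local-to-global mapping amounts to a simple indexed lookup:

      integer function glbpie(ipcbl, mxblk, mxface, mxpie,
     +                        ib, ifc, lp)
c     Illustrative lookup in an IPCBL-style table: return the global
c     piece number stored for local piece lp on face ifc of block ib.
c     The array bounds mxblk, mxface, mxpie are placeholders.
      integer mxblk, mxface, mxpie
      integer ib, ifc, lp
      integer ipcbl(mxblk, mxface, mxpie)
      glbpie = ipcbl(ib, ifc, lp)
      return
      end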

2.3. Prototype Problem. The prototype problem was designed around the computational grid for the jet physical model at the Jet Noise Laboratory at NASA Langley. The grid consists of nine blocks (nb) and a total of 1.29 million computational points. The block characteristics are given in the following table:

  nb   idm  jdm  kdm   nbt      Description
  1    61   33   61    122793   interior of nozzle
  2    53   33   61    106689   exterior of nozzle
  3    65   33   113   242385   downstream of nozzle
  4    97   33   113   361713   downstream of Block 3
  5    97   33   113   361713   downstream of Block 4
  6    61   17   17    17629    cartesian core 1
  7    65   17   17    18785    cartesian core 2
  8    97   17   17    28033    cartesian core 3
  9    97   17   17    28033    cartesian core 4

The physical model is a convergent-divergent nozzle in a laboratory-type test environment. The computational domain covers one quarter of the nozzle and its ambient, with symmetry about the axis. The flow path from the high pressure plenum chamber through the nozzle, with flow acceleration to Mach 2 at the nozzle exit, is contained in block 1.

FIG. 2.2. Blocks near the nozzle. Block 1 is the interior of the nozzle.

The exterior of the nozzle, the jet exhaust plume, and the surrounding ambient are covered by blocks 2 through 5. The polar grid has a known singularity at the axis; to eliminate it, the region near the axis is covered by the four cartesian core blocks, 6 through 9, which are patched to the polar blocks.

This example was chosen both for the simplicity of its block topology and for the complexity of the block interfaces, which exercise the general multiblock (patching) capability: each block face can be patched to one or more faces of other blocks. The grids were generated with standard utilities, and the connectivity tables describing each pair of interfacing block faces are produced automatically by the "Patcher" preprocessor. The block sizes come in two groups: the five polar blocks average 250,000 grid points each, while the four cartesian core blocks average 25,000 points. The blocks are small enough that the parallel process can be initiated on a single workstation or on workstation clusters with moderate memory capacities.

The numerical techniques chosen for the prototype are fairly standard, some of them listed under 2.1 above:
• Roe's flux differencing scheme
• Third order (spatial) upwind interpolation
• Coupled viscous terms in the j-k plane
• k-e turbulence models

FIG. 2.3. A cross section of the grid.

3. Parallel Model.

3.1. Decisions. From the beginning, it was clear that the parallelism would be at the block level; that is, each process would be in charge of a block. For this type of computation,

one finds in the literature two approaches. The first is the master-slave model, as used for example with PVM: one process, called the master, spawns the slave tasks, distributes the data among them, synchronizes them and keeps them supplied with work; the slaves do most of the computation, while the master often receives and resends data in point-to-point operations and takes care of the I/O. The other approach has no distinguished process: all processes are started together, run the same (or equivalent) executables, participate in the computation as well as in the communication, and each sends/receives data directly to and from the others as needed, asynchronously.

The master-slave model requires the ability to spawn tasks and to maintain the full addressing space of the problem in one node. Since the original MPI standard (MPI-1) does not support dynamic spawning of tasks, and since simplicity of the code was one of the principal goals of the project, we decided in favor of the second approach, in which all processors participate in the computation, each in charge of its designated blocks, and a designated process handles only those operations (such as I/O) that must not be distributed.

3.2. Data Distribution. When parallelizing a sequential code, there are two types of data distribution to handle. In the full block method, each process keeps the global addressing space of the sequential code (all the blocks), but is in charge of computations only on its designated blocks. In the shrunken block method, each processor holds only the data of the blocks in its charge. The former is easier to implement, since it requires the fewest differences from the sequential code; the latter is optimal in memory use but requires considerably more work to coordinate. Since memory was not of premium concern for the preliminary versions of parallel PAB3D, and since we were confident that we would get a working prototype with the least effort this way, as is now the case, we opted for the full block model. Most of the data is devoted to storing the unknown flow variables, the message buffers, and the global grid coordinates. Figure 3.1 shows a schematic of the two approaches.

[Figure: the global addressing space kept by every process in the full block method, contrasted with the reduced per-process storage of the shrunken block method.]

FIG. 3.1. Data distribution among processes: full block method and shrunken block method.
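To make the chosen model concrete, the sketch below (illustrative only; the array shape, the routine names, and the one-block-per-process assignment rule are assumptions, not PAB3D's actual layout) shows what the SPMD/full-block combination means in practice: every process allocates the complete arrays and identifies itself with MPI_COMM_RANK, but computes only on the blocks assigned to it.

      program fullblk
c     Sketch of the SPMD/full-block model: every process holds the
c     complete arrays, but each computes only on its own blocks.
c     Dimensions and the assignment rule are assumptions.
      include 'mpif.h'
      parameter (ni=97, nj=33, nk=113, nblk=9)
      real q(ni,nj,nk,nblk)
      integer ierr, myid, nprocs, ib
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      do 10 ib = 1, nblk
c        Static assignment: block ib belongs to process
c        mod(ib-1,nprocs).  Other processes skip the block but still
c        keep storage for it (the price of the full-block method).
         if (mod(ib-1, nprocs) .eq. myid) then
            call cmpblk(q(1,1,1,ib), ni, nj, nk)
         end if
 10   continue
      call MPI_FINALIZE(ierr)
      end

      subroutine cmpblk(qb, ni, nj, nk)
c     Placeholder for the per-block computation.
      integer ni, nj, nk
      real qb(ni,nj,nk)
      return
      end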

4. The MPI Implementation. Among the most widely used, most developed, and most robust implementations of MPI, all available free, are the Argonne MPICH and the Local Area Multicomputer (LAM) from the Ohio Supercomputer Center [7]. For the parallelization of PAB3D we have used LAM in its current version, 6.1.

LAM adds to the standard MPI functionality some extensions that allow process/processor control from within the application; for example, it supports dynamic process spawning, which the original MPI schema [9] does not allow. Some of this functionality is included in MPI-2. These extensions should not be confused with the standard, and we do not use them in the prototype. In addition, LAM comes with utility commands for process and configuration control. An application is started through a file called an application schema, which describes the executables and the remote hosts on which they run. For example, the command mpitask allows probing of the status of the processes; the following is a sample of its output for the prototype:

  TASK (G/L)   FUNCTION   PEER|ROOT   TAG   COMM     COUNT     DATATYPE
  0/0 pab3d    Bcast      0/0         0/0   WORLD*   6438865   REAL
  1/1 pab3d    WaitAll                0/0   WORLD*   256       REAL
  2/2 pab3d    Bcast      0/0         0/0   WORLD*   6438865   REAL
  3/3 pab3d    Bcast      0/0         8/8   WORLD*   6438865   REAL

The display shows the name of each process and the MPI function it is executing at the moment; a status of <running> indicates non-MPI activity, i.e. computation. The PEER and TAG fields apply to functions that involve point-to-point communication. COMM is the communicator (an MPI notion used to delimit the group of processes involved in an exchange of messages), COUNT indicates the size of the message, and DATATYPE its type.

MPI is complex enough, covering point-to-point and collective operations, communicator control, initialization and termination, and message identification; but most applications require only a dozen commands or so. A short sampler of the most important commands in each category follows.

Point-to-point communication:
• MPI_RECV
• MPI_IRECV
• MPI_SEND
• MPI_ISEND

The send operation comes in four flavours. MPI_RECV is blocking, in the sense that the call will not return until some "event" has happened, but MPI_IRECV is not.

Collective communication:
• MPI_BCAST
• MPI_GATHER
• MPI_REDUCE

MPI_BCAST is used to distribute information from one process to all processes in the communicator. MPI_GATHER collects information from the other processes in a single process. Similarly, MPI_REDUCE might be used to obtain, in a single process, a result which involves all processes in conjunction with an arithmetic operation, as when computing a scalar product. These are "blocking" operations.
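To make the distinction concrete, the following small program (not part of PAB3D; the buffer name, size, and tag are illustrative only) exercises one call from each category: a non-blocking receive matched by a blocking send, then a broadcast and a reduction.

      program sampler
c     Minimal MPI sampler, not taken from PAB3D: one point-to-point
c     exchange and two collective calls.  Buffer name and size are
c     illustrative.
      include 'mpif.h'
      integer ierr, myid, nprocs, req, i
      integer status(MPI_STATUS_SIZE)
      real qbuf(256), qsum
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      do 10 i = 1, 256
         qbuf(i) = real(myid)
 10   continue
      if (myid .eq. 0 .and. nprocs .gt. 1) then
c        MPI_IRECV returns at once; completion is checked with
c        MPI_WAIT (MPI_RECV would block here instead).
         call MPI_IRECV(qbuf, 256, MPI_REAL, 1, 99,
     +                  MPI_COMM_WORLD, req, ierr)
         call MPI_WAIT(req, status, ierr)
      else if (myid .eq. 1) then
         call MPI_SEND(qbuf, 256, MPI_REAL, 0, 99,
     +                 MPI_COMM_WORLD, ierr)
      end if
c     Collective calls: every process in the communicator takes part.
      call MPI_BCAST(qbuf, 256, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      call MPI_REDUCE(qbuf(1), qsum, 1, MPI_REAL, MPI_SUM, 0,
     +                MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end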

5. An Actual Implementation. The parallelization of PAB3D was carried out in four distinct phases:

M1: Start-up
M2: Communication
M3: Computation
M4: Updating

The approach that proved very useful was to develop the communication subsystem, which we call COMMSYS, independently of the main code, providing an interface to MPI through a set of routines. In each phase, only the routines corresponding to that phase were modified to include calls to COMMSYS, so that the original code remains virtually unchanged and most of the MPI calls are reduced to COMMSYS calls. The resulting code was tested after each step over a number of cases, from simple computations to the complete solver, which allowed the incorporation of message passing to be debugged one part at a time without any consideration of the phases still to come.

5.1. Phase M1: Start-up. It is widely acknowledged that the single characteristic that most complicates the parallelization of large codes is the issue of I/O. In large engineering software projects, developers tend to add I/O statements generously over a number of years, either for genuine output purposes or, in many cases, for debugging, and modifications and improvements are made essentially to the sequential code without regard for parallel processing. Such code is extremely difficult or expensive to parallelize. The reason is that each I/O statement cannot be executed by more than a single processor. Furthermore, if the code is modified so as to ensure execution by a single processor, input statements must be followed by expensive broadcasts. An exception to this is when the code is written from the start with distributed I/O; this is an active area of research these days.
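The standard workaround, which the B-segments introduced below formalize, is to let a single process execute the READ and then broadcast the values it produced. A minimal sketch of the pattern (the unit number, variable names, and MASTER parameter are illustrative, not taken from PAB3D):

      subroutine rdbcst(ncell, dt, myid)
c     One process reads, everybody else receives the result.
c     Unit 77 and the variable names are placeholders.
      include 'mpif.h'
      integer MASTER
      parameter (MASTER = 0)
      integer ncell, myid, ierr
      real dt
      if (myid .eq. MASTER) read(77,*) ncell, dt
c     Every process, including MASTER, makes the same broadcast calls.
      call MPI_BCAST(ncell, 1, MPI_INTEGER, MASTER,
     +               MPI_COMM_WORLD, ierr)
      call MPI_BCAST(dt, 1, MPI_REAL, MASTER,
     +               MPI_COMM_WORLD, ierr)
      return
      end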

PAB3D has about 1600 I/O statements, most of them at the beginning of the run and in high level routines, but some in the solver itself. The principal goals of phase M1 were two:

• ensure that all I/O is executed by a single process (to avoid conflicts), and
• broadcast the input data so that every process starts the computation with the proper information.

The straightforward, ad-hoc technique of simply restricting each I/O statement to one processor and following it with a broadcast was ruled out from the beginning: with this many statements the task would have been far too time consuming, and the resulting code would not scale properly. Broadcasting is also delicate in itself. Dummy arguments of a routine may be adjustable arrays whose actual size is determined only at run time, FORTRAN constants and commons must be accounted for, and the full-scale dimensions of the computational arrays cannot be found in the input routine alone. A new technique was needed and, in spite of being application-oriented, it has proven a very useful part of the overall project; to limit the risk, it was developed and tested on a reduced-scale version of the code before being applied in full scale.

The approach was to partition the subroutines where the I/O statements are concentrated into two types of segments: B-segments, which are I/O intensive, and G-segments, which are computation intensive. The B-segments were constructed so as to have the following characteristics:

1. There is no branching into the segment after the input statements. The idea is that the segment will execute in a single process, and code branching into the segment risks being improperly executed.
2. The segment must maximize the number of input operations, while at the same time containing a minimal amount of computation in between.
3. The derived variables must be broadcast, and their number must be kept to a minimum.
4. Due to potential side-effects, subroutine calls within the segment must be avoided whenever possible.

The concept of B-segments coincides with the notion of "critical code" established since the early days of parallel processing. The notion of derived variable is introduced here to mean any variable that is modified within the segment; in particular, any variable involved in a READ is a derived variable. All derived variables of a B-segment must be broadcast at its end.

Figure 5.1 shows the syntax of B-segments for two cases.

   B1:
      read(77,*) it, (igf(i,ib),i=1,nblock)
      do 40 j=1,kix
         p(j,ib) = a(j,ib)**2.0 * rho(j,ib) / gammar(nsp)
         e(j,ib) = p(j,ib)/(gammar(nsp)-1.0)
     +             + 0.5*rho(j,ib)*u(j,ib)
 40   continue

   Derived variables: it, igf, p, e

   B2:
      read(77,*) it, (igf(i,ib),i=1,nblock)
      call energy(e, ...)

   Derived variables: it, igf, e(?), p(?), a(?), rho(?), gammar(?)

FIG. 5.1. Two B-segments and their derived variables.

In B1 the derived variables can be determined by inspection: they are the variables read (it, igf) and the arrays computed from them (p, e), which depend on the input. The second case, B2, is more tricky: the segment contains a subroutine call, and the derived variables depend on which arguments and which non-local variables the routine modifies; they are marked with a question mark in the figure.

Detecting derived variables within called routines can be very difficult. Consider, for example, a dummy argument whose dimension is declared as real a(1): the declaration does not represent the actual dimension of the array, which is fixed at some higher level, possibly in a common, so one has to be careful. Proper use of good static analyzers (rarely available in practice) or of compilers that provide cross-reference outputs facilitates the detection of derived and non-local variables. A possibility is to make the routine in-line and let the compiler do the work, but even in this situation difficulties may arise from side-effects, as when variables in commons are explicitly passed as actual arguments. In some cases the existing PAB3D documentation showing which variables a routine modifies can be complemented, and this provides more than useful information. Fortunately, it is often possible to restructure a segment so as to avoid broadcasting the derived variables of a whole routine altogether.

Within a B-segment the broadcasts are issued with a fixed format: the length of the message is computed first and the corresponding MPI call follows. For each I/O-intensive routine, the M1 source provides a '*.bct' file which contains the broadcast calls and is included in the source. In many cases these files are generated automatically, since the message lengths for the arrays are determined from static data (parameters) and are thus known from the syntax. An example is shown in Figure 5.2.

      XX = nblk*nsec*6*20*nprt*nzon
      call MPI_BCAST(ibcf,XX,MPI_INTEGER,
     +               MASTER,MPI_COMM_WORLD,ierr)

      XX = nblk*nzon*ngt
      call MPI_BCAST(ibf,XX,MPI_INTEGER,
     +               MASTER,MPI_COMM_WORLD,ierr)

      XX = nblk*(21+2*npcmx+1)
      call MPI_BCAST(iptf,XX,MPI_INTEGER,
     +               MASTER,MPI_COMM_WORLD,ierr)

      XX = jkmx*(ncsp)+1
      call MPI_BCAST(q0s,XX,MPI_REAL,
     +               MASTER,MPI_COMM_WORLD,ierr)

FIG. 5.2. Automatically generated '*.bct' file.

[Table: for each segment B1-B5 and G1-G3, its length, its derived variables, the subroutine it belongs to (rinput, zonm, inidct, init, jkbar, solver, outfl), the I/O units it uses, and the corresponding BCT file (e.g. Sol3-M1B2, Sol3-M1B3, Sol3-M1B4, Sol3-M2B1).]

FIG. 5.3. Segment characteristics.

The ordering of the segments in the executable code is important: the broadcasts and the I/O statements that form them must be carried out in the order described above.

5.2. Phase M2: Communication. In order to describe the communication subsystem, it is necessary to understand a little more about the division of work in the solver, the routine where the most significant computation is carried out and where the residuals are computed. Figure 5.4 illustrates the actual breaking up of the solver source, Solver13.f, into B-segments and G-segments; the G-segments are the complement of the B-segments and contain most of the computation. It is also here that the buffer qbuf, used for exchanging data between processors, is defined; we retake this topic in Section 6. The output of phase M1 is a code that runs the global iterations in parallel, but the exchange of the partial solutions between processes is not yet in place; that is the subject of this phase.

[Figure: calls to routines such as regtbl2, turbke, initdct, zonm and out within Solver13.f, with the B-segments (e.g. B1) marked.]

FIG. 5.4. Segmentation of Solver13.f.

The core computations take place in the routine solver4, which consists of a system of nested loops: an iteration loop, a zone loop, and (a sequence of) block loops. It is at the lower level, the block loop, that the parallelism is designed: each process is responsible for its own blocks and must be able to work independently. However, before the computation on a block can proceed, the process must have received all the pertinent boundary information from the other blocks, that is, the pieces created off-line by the patcher. Referring to Figure 5.5, where the communication calls are shown as dashed lines connecting the loops (labelled 10 and 25), in each iteration process P1 sends the pieces at its block boundaries while P6 receives them, and vice versa, and we must make sure that the exchange is complete before the block computation starts. This is done through COMMSYS.

[Figure: matching send and receive operations between processes P1 and P6 inside the solver loops 10 and 25, with I/O restricted to a single process.]

FIG. 5.5. Exchange of pieces between processes inside the solver loops.

COMMSYS consists of two parts: include files, which contain exclusively declarations and definitions, and a set of routines which interact directly with the PAB3D patching databases. The include files are named commsyspar.h and commsys.h.

• commsyspar.h contains parameters: the maximum number of blocks, the maximum number of pieces per block, and the maximum total number of pieces. While this information can be obtained directly from the PAB3D databases, it might also be stated independently in the makefile, to provide more flexibility.

• commsys.h declares the commons and the arrays of the communication system. The global arrays correspond to the global patching databases, while the local ones contain the information each process needs for its own blocks.

The global arrays, declared in COMM_GLOBAL.h and dimensioned according to the maxima set in commsyspar.h, contain the following information:

• n_piece(block): the number of pieces per block;
• add_piece(piece): the address of each piece in the communication buffer qbuf;
• from_piece(piece), to_piece(piece): the sending and receiving sides of each global piece;
• send_piece(localpiece,block), recv_piece(localpiece,block): the lists of pieces to be sent and received by each block.

The local arrays, declared in COMM_LOCAL.h, contain the corresponding information for the pieces pertaining to the blocks of this process, that is, the pieces it actually sends and receives; the actual number of pieces depends on the run and is known only at runtime. Since two copies of this type of declaration are needed, it was natural to keep them in separate include files, COMM_GLOBAL and COMM_LOCAL. The communication itself is organized as matching loops over the pieces: non-blocking receives are issued first, followed by the blocking sends of the local pieces, and no communication request is issued outside COMMSYS. Figure 5.6 shows the relation between the PAB3D patching databases and the arrays of the communication subsystem.
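As a sketch of how such an exchange can be organized (the argument list, the length array lenp, and the mapping from blocks to MPI ranks are assumptions made for illustration, not the actual COMMSYS interface), one non-blocking receive is posted per incoming piece, the outgoing pieces are then sent, and completion is awaited before the solver proceeds:

      subroutine xchpie(npie, addp, lenp, fromr, tor, gpie,
     +                  sbuf, rbuf)
c     Illustrative piece exchange in the spirit of COMMSYS.
c     npie  : number of pieces handled by this process
c     addp  : address of each piece inside the buffers
c     lenp  : length (in reals) of each piece        (assumed)
c     fromr : MPI rank owning the adjacent block     (assumed)
c     tor   : MPI rank that receives the local piece (assumed)
c     gpie  : global piece number, used as the tag
      include 'mpif.h'
      integer MAXPIE
      parameter (MAXPIE = 512)
      integer npie, ip, ierr
      integer addp(*), lenp(*), fromr(*), tor(*), gpie(*)
      integer req(MAXPIE), stats(MPI_STATUS_SIZE, MAXPIE)
      real sbuf(*), rbuf(*)
c     Post all the non-blocking receives first.
      do 10 ip = 1, npie
         call MPI_IRECV(rbuf(addp(ip)), lenp(ip), MPI_REAL,
     +                  fromr(ip), gpie(ip), MPI_COMM_WORLD,
     +                  req(ip), ierr)
 10   continue
c     Send the matching local pieces with blocking sends.
      do 20 ip = 1, npie
         call MPI_SEND(sbuf(addp(ip)), lenp(ip), MPI_REAL,
     +                 tor(ip), gpie(ip), MPI_COMM_WORLD, ierr)
 20   continue
c     Wait until every expected piece has arrived.
      call MPI_WAITALL(npie, req, stats, ierr)
      return
      end

Posting every receive before any send is what keeps the exchange free of deadlock even when all processes send at the same time.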

5.3. Phase M3: Computation. As we mentioned, the innermost loop in solver4 consists of (a sequence of) block loops. In each of these, the index variable of the loop is called 'ib'; for simplicity we will refer to them as ib-loops. Early in solver4, an ib-loop invokes the communication system, executed as a separate instance for each block. The subsequent ib-loops do the computations on the blocks; since the proper information has already been exchanged (through qbuf), we can assure that they contain what they need, and the different blocks are executed as separate instances.

[Figure: the block database IPTF(block,blockinfo), the patch database IPCB(piece,pieceinfo) with its entries (number of cells, number of pieces, local piece, face/block belonging, face/block adjacency, adjacent local pieces), and the local-to-global table IPCBL(block,face,localpiece), mapped onto the COMMSYS arrays add_piece(2,npie), from_piece(npie), to_piece(npie), n_piece(nblk), send_piece(nlpi,nblk) and recv_piece(nlpi,nblk) of COMM_GLOBAL.h and COMM_LOCAL.h.]

FIG. 5.6. Relation between the databases of PAB3D and those of COMMSYS.

Using the newly provided information, the processor in charge of loop instance i finds the required data just as in the sequential case, where it would have been computed in the instance for the preceding block on the same processor. All communication was taken care of at an earlier stage, in phase M2, so there is no communication in this phase; the work required here is the computational use of the patching information already added to the system.
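A sketch of the resulting loop structure (the ownership array and the stub routine are illustrative, not PAB3D code): each ib-loop still runs over all blocks, but a process performs the work only for the blocks it owns, relying on the fact that the interface data for the others already sits in qbuf.

      subroutine m3loop(nblk, myid, owner, qbuf, nq)
c     Illustrative ib-loop after the M3 transformation: every process
c     scans all blocks but computes only on those it owns.
c     owner(), qbuf and the stub below are placeholders.
      integer nblk, myid, nq, ib
      integer owner(nblk)
      real qbuf(nq)
      do 100 ib = 1, nblk
         if (owner(ib) .eq. myid) then
c           Interface data from neighbouring blocks is already in
c           qbuf (exchanged in the communication ib-loop), so no
c           MPI call appears here.
            call blkwrk(ib, qbuf, nq)
         end if
 100  continue
      return
      end

      subroutine blkwrk(ib, qbuf, nq)
c     Stand-in for the per-block computation.
      integer ib, nq
      real qbuf(nq)
      return
      end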

5.4. Phase M4: Updating. The final phase is the updating of the global solution at the end of the global iterations: the block data in qbuf and the computed residuals are brought back to a designated root processor, which is in charge of the output. This is essentially the reverse of the start-up broadcasts: a gather operation for the block data and a reduce operation for the residuals. Point-to-point communication was dealt with in phase M2, so the only other interphase need here is self-identification, which is simply passed to the computational routines as a parameter.
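A minimal sketch of the two operations involved (the names, sizes, and the choice of MPI_SUM as the reduction operator are illustrative, not the report's actual routines):

      subroutine update(resloc, resglb, qloc, qglb, nq, myid)
c     Sketch of an M4-style update: combine the local residuals into
c     a global one and collect the block data on the root process
c     for output.  Names and sizes are illustrative.
      include 'mpif.h'
      integer MASTER
      parameter (MASTER = 0)
      integer nq, myid, ierr
      real resloc, resglb, qloc(nq), qglb(*)
c     Global residual: combination of the per-process contributions,
c     available on MASTER only (MPI_SUM here is just one choice).
      call MPI_REDUCE(resloc, resglb, 1, MPI_REAL, MPI_SUM,
     +                MASTER, MPI_COMM_WORLD, ierr)
c     Gather the nq values computed by each process into qglb on
c     MASTER (qglb must be able to hold nq values per process there).
      call MPI_GATHER(qloc, nq, MPI_REAL, qglb, nq, MPI_REAL,
     +                MASTER, MPI_COMM_WORLD, ierr)
      if (myid .eq. MASTER) write(6,*) 'global residual = ', resglb
      return
      end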

6. Status, Conclusions, and Further Work. As we have been able to compute correct residuals running the prototype on 9 different processors, we are very optimistic about the timely completion of the project. The experiences with our prototype of MPI communication for PAB3D are, however, successful to different degrees, and a few things need to be perfected. The lessons learned here should provide clues for the design of future versions of parallel PAB3D.

6.1. Optimal Messaging. In an environment where the computational loads of the blocks vary widely, messages may arrive before or after the matching receives are posted. The MPI communication routines provide a number of options, such as tagging and several specific methods of point-to-point communication. In order to use the optimal method for our purposes, a test-problem must be properly designed, with realistic data such as the one we worked with above.

[Figure: decision hierarchy over G segments, B segments, and their bct files, leading to the question "OPTIMAL?".]

FIG. Hierarchy of G segments, B segments, and bct files.

Certainly, the implementation of the whole code with realistic boundary conditions can answer the question of whether the organization described above is optimal or not; a test-problem, however, is needed to guide further improvement and, especially, to elaborate on robustness. A prototype such as the present one is quite convenient for making these decisions. The idea is illustrated in the figure above.

6.2. I/O Broadcast. The organization of the code described in Section 5 defines a hierarchy of partitions: the code is partitioned into segments, each segment is either B or G, and each B-segment has its bct file containing the broadcast calls, each with a declared message length. The segmentation of Section 5.1 has been purposely left in a uniform format: the broadcast lengths of most arrays are given by their declared dimensions complemented with a dummy parameter (the number of processes), a provision made for future transition. The lengths can therefore be changed from the bct files alone, without touching the rest of the code.

Clearly, the code in its present state admits a straightforward optimization: rearrange the bct files so that the actual lengths are broadcast, as derived variables, before they themselves are used, and then broadcast only the actual lengths of the arrays rather than the declared ones. Since a broadcast is not a trivial operation, this may involve a significant improvement, and finding the correct value for the size of each message should be relatively easy for someone who knows the code well. Nonetheless, a word of caution: defining the lengths properly in general is a little like an optimization problem in itself, and it might involve more work than it seems at first.
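In the style of the bct entries of Figure 5.2, the rearrangement amounts to broadcasting the scalar that carries the true length first and only then the array, trimmed to that length; the names nact and qbuf below are illustrative:

c     Illustrative reordering of a bct entry (names are placeholders):
c     make the actual length known everywhere first, then broadcast
c     exactly that many elements instead of the declared maximum.
      call MPI_BCAST(nact,1,MPI_INTEGER,
     +               MASTER,MPI_COMM_WORLD,ierr)
      call MPI_BCAST(qbuf,nact,MPI_REAL,
     +               MASTER,MPI_COMM_WORLD,ierr)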

6.3. Memory Distribution. In Section 3 we explained the reasons why we decided to start with the full block approach. For efficiency, however, it is imperative to leave open the way for distributed data structures: in the shrunken block model the total core memory requirement of each process must be reduced by keeping only the blocks in its charge. This is a major change of the code, as it modifies the data structures, and this improvement should also change the way the code does I/O.

7. Acknowledgements. The first author would like to thank David Keyes of ICASE and Old Dominion University for his support and for making the project possible.

REFERENCES

[1] K. Abdol-Hamid, A multiblock/multizone code (PAB3D-v2) for the three-dimensional Navier-Stokes equations: Preliminary applications, Tech. Report CR-182032, NASA, October 1990.
[2] ——, Implementation of algebraic stress model in a general 3-D Navier-Stokes method (PAB3D), Tech. Report CR-4702, NASA, 1995.
[3] K. Abdol-Hamid, J. Carlson, and B. Lakshmanan, Application of Navier-Stokes code PAB3D with k-e turbulence model to attached and separated flows, Tech. Report TP-3489, NASA, 1994.
[4] K. Abdol-Hamid, J. Carlson, and S. Pao, Calculation of turbulent flows using mesh sequencing and conservative patch algorithm, Tech. Report 95-2336, AIAA, 1995.
[5] A. Ecer, J. Periaux, N. Satofuka, and S. Taylor, eds., Parallel Computational Fluid Dynamics: Proceedings of the Parallel CFD Conference, June 26-28, 1995, Pasadena, California, North Holland, January 1996.
[6] C. C. Fischberg, C. Rhie, R. Zacharias, P. Bradley, and W. DesSureault, Using hundreds of workstations for production running of parallel CFD applications, in Parallel Computational Fluid Dynamics, A. Ecer, J. Periaux, N. Satofuka, and S. Taylor, eds., North Holland, 1996, pp. 9-22.
[7] GDB/RBD, MPI primer / developing with LAM, tech. report, Ohio Supercomputer Center, 1224 Kinnear Road, Columbus, OH 43212, November 1996.
[8] W. Gropp, E. Lusk, and A. Skjellum, Using MPI, The MIT Press, 1994.
[9] Oak Ridge National Lab, MPI: A message passing interface standard, tech. report, University of Tennessee, 1995.
