NASA/CR-1998-207636
ICASE Interim Report No. 31

Parallel PAB3D: Experiences with a Prototype in MPI

Fabio Guerinoni
Institute for Computer Applications in Science and Engineering, Hampton, Virginia

Khaled S. Abdol-Hamid
Analytical Services & Materials, Inc., Hampton, Virginia

S. Paul Pao
NASA Langley Research Center, Hampton, Virginia

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center, Hampton, Virginia
Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-2199

April 1998

Prepared for Langley Research Center under Contract NAS1-19480

Available from the following:

NASA Center for AeroSpace Information (CASI)
7121 Standard Drive
Hanover, MD 21076-1320
(301) 621-0390

National Technical Information Service (NTIS)
5285 Port Royal Road
Springfield, VA 22161-2171
(703) 487-4650
PARALLEL PAB3D: EXPERIENCES WITH A PROTOTYPE IN MPI

FABIO GUERINONI*, KHALED S. ABDOL-HAMID†, AND S. PAUL PAO‡

Abstract. PAB3D is a three-dimensional Navier-Stokes solver that has gained acceptance in the research and industrial communities. It takes as input a set of disjoint structured blocks covering the physical domain, together with the block connectivity ("patching") information. We report on the first implementation of a prototype of this code for parallel processing using the Message Passing Interface (MPI), a standard that can readily be ported to a number of distributed systems. The working version is derived from the current version of PAB3D after some preprocessing and limited testing. We briefly discuss the characteristics of the solver, identify the principal data structure used for communication, and define a simple interface (COMMSYS) for MPI communication. We also identify some general techniques for problems likely to be encountered when working on codes of this nature, like broadcasting and point-to-point communication. Last, we outline directions for future work.

Key words. Message Passing Interface (MPI), Navier-Stokes solver, structured meshes, broadcasting, point-to-point communication

Subject classification. Computer Science

* Department of Mathematics, Virginia State University, P.O. Box 9068, Petersburg, VA 23806. This research was supported by the National Aeronautics and Space Administration under Contract No. NAS1-19480 while the author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.
† Analytical Services & Materials, Inc., Hampton, VA 23666.
‡ Configuration Aerodynamics Branch, NASA Langley Research Center, MS 499, Hampton, VA 23681-0001.
1. Introduction. Many examples of large codes running in parallel can be found in conference proceedings, as parallel processing has established itself in the computational communities over the last decade [5]. In the late 80's, a number of codes were implemented on shared-memory multiprocessors such as the Cray Y-MP and the NEC SX-4. Parallelizing compilers for shared memory simplified the task of the programmer, since communication between tasks is handled implicitly by the system.

More recently, there has been a trend of switching to systems of a distributed-memory nature: massively parallel systems like the Intel Paragon on the one hand, and clusters of workstations on the other. The limitations of shared memory in terms of scalability and cost, and the widespread availability of workstations, have contributed to this trend. The aerospace industry shows a number of examples: large codes like Pratt & Whitney's NASTAR and ENS3D have been ported to run on workstation clusters [6]. Parallelizing a code originally designed for shared memory or for sequential execution is more complicated, since the programmer has to take the communication between processors into account explicitly. A number of interfaces for message passing were designed for this form of distributed computation: PARMACS, PVM, and, more recently, MPI. The Message Passing Interface (MPI) has emerged as a standard and has gained acceptance in the research and industrial communities.

This report describes our experiences implementing a prototype of PAB3D, a sequential Reynolds-averaged Navier-Stokes code, for distributed parallel processing using MPI. The prototype runs for a single, relatively simple but realistic case of PAB3D, chosen primarily because of the simplicity of its development. The goals were purposely limited, and the prototype is necessarily rough-hewn: a large amount of work certainly remains to be done. In particular, we restrict ourselves to elementary but essential functionality, and we do not address issues such as "load balancing" or "speed-up". As it turns out, the key issue was to have a well-defined block-cell connectivity structure, and the communication interface we call COMMSYS (for COMMunication SYStem) has proven very useful in this respect. In our opinion, the lessons learnt here extend to other applications of this kind.

The report is organized as follows. Section 2 contains a brief description of PAB3D, the key characteristics of the code, and the prototype problem chosen for parallelization. Section 3 describes the parallel model and the principal implementation decisions. Section 4 summarizes the elementary MPI concepts used. Section 5 is the core of the report: on it we describe the four phases of the MPI implementation. We conclude with a summary of achievements, suggestions for improvement, and directions for continuing work.
2. Brief Description of the Code.

2.1. The PAB3D Code. PAB3D (currently in its version 13) is a three-dimensional Reynolds-averaged Navier-Stokes (RANS) code developed in 1986 by Khaled S. Abdol-Hamid for supersonic jet exhaust flow analysis. After a number of enhancements it has become a general code for complex aerodynamic configurations, including propulsion [1], due in large part to its multi-block/multi-zone capability and to the integrated turbulence models. The code has several numerical options; here are some of them:

1. Treatment of Convection terms: upwinding is used, among several schemes for which it is possible to choose:
• Roe
• van Leer flux splitting
• van Leer variants

2. Limiters: required to prevent oscillations in high order methods near shocks. A number of limiters are incorporated in the code (common forms of the first two are recalled below):
• Van Albada
• Sweby's min-mod
• S-V (Spekreijse-Venkat)
• Modified S-V
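For reference (the report itself does not spell the formulas out), the van Albada limiter is commonly written as phi(r) = (r**2 + r)/(r**2 + 1), and Sweby's generalized min-mod family as phi(r) = max(0, min(beta*r, 1), min(r, beta)) with 1 <= beta <= 2.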
3. Treatment of Viscosity terms: as is usual, centrally approximated. There is a wide choice for the cross terms:
• j-thin layers
• k-thin layers
• jk-uncoupled
• jk-coupled

4. Turbulence model: Two-equation k-epsilon models have been implemented, from which a number of algebraic Reynolds-stress models are computed. A wide variety of models have been integrated into the code, including:
• Shih-Zhu-Lumley
• Gatski-Speziale
• Grimaji
Important features of the turbulence part include compressibility correction and "effective" viscosity methods, from which boundary parameters are computed.

• Other features: the code accepts several equations of state and has the ability to deal with real gas flows involving non-reacting multi-species simulation. It is also possible to specify several other independent boundary conditions.

A detailed description of the code can be found in [3, 2]. A schematic view of the use of PAB3D and its relation with other programs and files is shown in Figure 2.1.

FIG. 2.1. Use of PAB3D, Version 13, and related programs and files: the PATCHER preprocessor builds the boundary and connectivity database from the connectivity file; the solver takes a restart file, grid file, connectivity file, initial-condition file, and control file, with initial conditions and solution expansion.
2.2. Patching. The most significant improvement to the PAB3D code occurred in 1990, when "conservative patching" was introduced [4]. This allowed the creation of new cells at the interfaces of a multi-block/multi-zone grid. A group of such cells is a patch, called a piece in this report. The space-integrated fluxes are computed conservatively from the overlapping cell ratios, and the data are stored in arrays. Later, the databases were expanded and improved. Version 13 of the code (current) contains three databases:

• IPCB(piece,pieceinfo): the global patch database. Depending on the value of pieceinfo, information for the piece is provided: the number of cells in the piece, the face/block where the piece belongs, the adjacent face/block, and the corresponding overlapping cell ratios, so as to get the space-integrated fluxes conservatively.

• IPTF(block,blockinfo): the block database, which provides, for the structured block, the dimensions after patching is done, the number of pieces, the list of local pieces, and the list of adjacent pieces involved in the exchange of information.

• IPCBL(block,face,localpiece): the local-to-global mapping. A face in a block may contain several global pieces, each identified by a local number. This array provides the corresponding global number, used to access the databases if required.

These are the only arrays involved in the communication part of the parallel system, as explained in Section 5.
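As an illustration of how these databases combine (the array names are from the report; the layout of pieceinfo, here index 1 for the cell count, and the six-face block shape are assumptions), a loop over the local pieces of one face of a block might look like:

      subroutine walkfc(ib, iface, nloc, ipcb, ipcbl, npie, nblk)
c     Sketch only: walk the local pieces of face iface of block ib
c     through the local-to-global mapping and query the global
c     patch database.
      integer ib, iface, nloc, npie, nblk
      integer ipcb(npie,*), ipcbl(nblk,6,*)
      integer lp, ip, ncells
      do 10 lp = 1, nloc
c        global piece number of local piece lp
         ip = ipcbl(ib,iface,lp)
c        e.g. the number of cells of that piece (assumed index)
         ncells = ipcb(ip,1)
   10 continue
      return
      end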
2.3. Prototype Problem. The prototype was designed to test the main concepts of the parallel process in a realistic environment. The computational grid is the grid for the Jet Noise Laboratory nozzle model at NASA Langley. It consists of nine blocks and a total of 1.29 million computational points. The test case is a laboratory-type convergent-divergent nozzle fed by a high pressure plenum chamber, and the grid describes one quarter of the physical domain of the nozzle and its ambient. The following table gives the grid characteristics; nbt, the number of points per block, equals idm x jdm x kdm, and the nine blocks total 1,287,773 points:

nb   idm   jdm   kdm    nbt      description
1    61    33    61     122793   interior of nozzle (flow path)
2    61    33    53     106689
3    65    33    113    242385   exterior of nozzle
4    97    33    113    361713   downstream of Block 3
5    97    33    113    361713   downstream of Block 4
6    61    17    17     17629    cartesian core 1
7    65    17    17     18785    cartesian core 2
8    97    17    17     28033    cartesian core 3
9    97    17    17     28033    cartesian core 4

The physical flow acceleration to Mach 2 at the nozzle exit is contained in block 1. The interior nozzle flow path and the surrounding ambient are covered in blocks 2 through 5, while the jet exhaust plume is covered by the cartesian core blocks 6 through 9, which eliminate the numerical singularity of the polar axis at the axis of symmetry of the grid.

FIG. 2.2. Blocks near the nozzle. Block 1 is the interior of the nozzle.
The grids are generated with the "Patcher" preprocessor utility, which automatically provides the connectivity tables for each pair of blocks; block faces are described by one or more patched interfaces. This example is chosen both for the simplicity of a laboratory-type flow configuration and for the complexity of a multiblock grid with general (patching) connectivity between the polar and cartesian blocks.

Block sizes come in two groups: five blocks with an average of 250,000 points and four blocks with an average of 25,000 points. These are small enough that the parallel process can be initiated on a single workstation or on workstation clusters with moderate memory capacities and speed requirements, which is ideal to test the MPI environment.

As required by the prototype, the numerical techniques used are fairly standard; some of these were listed among the options above:
• Roe's upwind flux differencing scheme
• Third order (spatial) interpolation
• Coupled viscous terms in the j-k plane
• k-epsilon models

FIG. 2.3. A cross section of the computational grid.

3. Parallel Model.

3.1. Implementation Decisions. From the beginning, it was clear that the parallelism would be at the block level. That is, each process would be in charge of (a sequence of) blocks. For this type of computations, one finds in the literature two models of task organization.
The first is the master-slave approach, in which a main process, called the master, spawns the slave tasks, distributes the data, receives and resends intermediate results, and synchronizes the computation. This is natural in PVM, where any process can spawn another task. In MPI (MPI-1, the original standard), a process does not have the ability to spawn tasks: all executables must be started together, in a full addressing space, and it is up to the user to maintain the distinction between the node types. On the other hand, keeping one node busy only with bookkeeping is not optimal when processors are at a premium.

Thus, we inclined to the second approach, in which there is no distinguished master: the data are distributed evenly among the processors, each process participates in the computations as well, and information is exchanged through point-to-point send/receive operations, often asynchronously. Since the goals of our project were limited, the optimal distribution of tasks was not of premium concern.
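A minimal sketch of this model (illustrative only; the cyclic block assignment is our assumption, not the report's rule) shows how every process runs the same executable and self-identifies through its rank:

      program spmd
      include 'mpif.h'
      integer nblk
      parameter (nblk = 9)
      integer ierr, myid, nproc, ib
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
      do 10 ib = 1, nblk
c        each process computes only the blocks assigned to it
         if (mod(ib-1,nproc) .eq. myid) then
c           ... compute on block ib ...
         end if
   10 continue
      call MPI_FINALIZE(ierr)
      end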
3.2. Data Distribution. When the work is distributed among processes, two types of data distribution are possible. In the full block method, each process keeps the global addressing space of the sequential code: all arrays are dimensioned as in the sequential code, but the process is in charge of the computations on its own blocks only. In the shrunken block method, each process stores only the data of the blocks it is in charge of, using buffers to coordinate with the other blocks. Figure 3.1 shows a schematic graph of the two alternatives. Most of the data of a block is devoted to storing the unknown variables and the grid coordinates, so the difference in memory requirements is significant. However, the full block method is easier to implement, and when we started the preliminary versions of the parallel code we were confident that we would get at least as many processors as there are blocks, as is now the case. Thus, we decided in favor of the full block model for the prototype.
FIG. 3.1. Data distribution among processes: global addressing space (full block method) versus the shrunken block method. The full block method is easier to implement.

4. The MPI Environment. There are several implementations of MPI, all available free. Among the most widely used, developed, and robust are the Argonne MPICH and the Ohio Supercomputer Center LAM (Local Area Multicomputer) [7]. The MPI standard [9] does not allow dynamic process spawning; LAM has added functionality that allows some process/processor control, and it includes part of the functionality of MPI-2, for example dynamic process spawning.

For the parallelization of PAB3D we have used LAM version 6.1. Starting the original application is done through a file called an application schema (this should not be confused with a configuration schema). In addition, LAM comes with utility commands for process control.
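An application schema is a plain text file naming which nodes run which executable. As a hypothetical example (the exact syntax depends on the LAM release; see [7]), a schema starting nine pab3d processes on nodes n0 through n8 might read:

      n0-8 pab3d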
For example, the command mpitask allows probing of the status of processes on remote hosts. For the prototype described here it shows the following display:

TASK (G/L)   FUNCTION   PEER|ROOT   TAG   COMM     COUNT     DATATYPE
0/0 pab3d    Bcast      0/0               WORLD    6438865   REAL
1/1 pab3d    Bcast      0/0               WORLD    6438865   REAL
2/2 pab3d    Bcast      0/0               WORLD    6438865   REAL
3/3 pab3d    WaitAll    8/8               WORLD    256       REAL

The display shows the name of each process and the MPI function it is executing at the moment; a status of <running> indicates non-MPI activity. The PEER and TAG fields are shown when the function involves point-to-point communication. COMM is the communicator involved (an MPI notion used to delimit the exchange of messages), COUNT indicates the size of the message, and DATATYPE its type.

MPI itself is complex enough, but most applications require only a dozen commands or so [8]. Besides initialization and termination, and the identification of the process group, the most important commands are the point-to-point and collective operations. A short sampler of typical commands in each category follows.
Point-to-point communication:
• MPI_RECV
• MPI_IRECV
• MPI_SEND

MPI_SEND comes in four flavours. MPI_RECV is blocking, in the sense that the call will not return until some "event" has happened, but MPI_IRECV is not.
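A small sketch (ours, not the report's) makes the difference concrete: MPI_IRECV returns immediately with a request handle, so computation can overlap the transfer, and MPI_WAIT blocks until the message has actually arrived:

      subroutine rdemo(buf, n, isrc)
      include 'mpif.h'
      integer n, isrc, ierr, ireq
      integer istat(MPI_STATUS_SIZE)
      real buf(n)
c     post the receive; the call returns at once
      call MPI_IRECV(buf, n, MPI_REAL, isrc, 1,
     *               MPI_COMM_WORLD, ireq, ierr)
c     ... computation not touching buf can proceed here ...
c     now block until the message is in buf
      call MPI_WAIT(ireq, istat, ierr)
      return
      end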
Collective communication:
• MPI_BCAST
• MPI_GATHER
• MPI_REDUCE

MPI_BCAST is used to distribute information to all processes in the communicator. Similarly, MPI_GATHER collects information from the other processes. MPI_REDUCE also involves all processes, but is used in conjunction with arithmetic operations to obtain a result in a single process, as when doing a scalar product. These are "blocking" operations.
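For instance, a global sum of per-process partial residuals, in the spirit of the updating phase described in Section 5.4 (an illustrative sketch, not code from the report):

      subroutine glbres(resloc, resglb)
      include 'mpif.h'
      integer ierr
      real resloc, resglb
c     combine the partial residuals of all processes on process 0
      call MPI_REDUCE(resloc, resglb, 1, MPI_REAL, MPI_SUM,
     *                0, MPI_COMM_WORLD, ierr)
c     redistribute the result if every process needs it
      call MPI_BCAST(resglb, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      return
      end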
5. An Actual Implementation. The parallelization of PAB3D was carried out in four distinct phases. In each phase, the part of the code corresponding to the phase was identified, and we tried to include only the routines necessary for its complete implementation; the source code developed in each step is kept with an appropriate suffix. An important decision was to develop the incorporation of the MPI calls independently of any consideration of the computations proper, by providing an interface to a communication subsystem, which we call COMMSYS. In this way, most of the main code remains virtually unchanged. In each step, the resulting MPI code was tested without the full computations, a technique that proved very useful, as described below. The phases were:

M1: Start-up (input and broadcasting);
M2: Communication (the COMMSYS subsystem);
M3: Computation;
M4: Updating and output.
5.1. Phase M1: Start-up. It is widely acknowledged that the single characteristic of sequential codes that makes parallelization very difficult is the issue of I/O. In large engineering software projects, modifications and improvements of the sequential code are done essentially over a number of years, and developers tend to add I/O statements generously, either for genuine code reasons or, in many cases, for debugging purposes. As a consequence, such code is extremely difficult or expensive to parallelize. The reason is that each I/O statement cannot be executed by more than a single processor. Furthermore, if the code is modified so as to ensure execution by a single processor, input statements must be followed by expensive broadcasts. An exception to this is when the code is written from the start with distributed I/O. This is an active area of research these days.
The principal goals of M1 were two:
• ensure that each process has all the data it must have at the beginning of the solver;
• be able to deal with the actual, full-scale computational size of the application.

The first goal was achieved by broadcasting the input data, read in a single process (to avoid conflicts), to all computing processes. As it turned out, this was more delicate than expected: PAB3D has about 1600 I/O statements, so the straightforward technique of simply following each input statement with a broadcast issued as routine was ruled out from the beginning; it was clearly out of the question in terms of the work required, and it will not scale. The turnaround time of full-scale testing, which we could not afford, was by far the limiting factor of the overall operation. Thus, we developed ad-hoc utilities for the purpose and a reduced-scale prototype in which dummy arrays and commons, whose size is determined by FORTRAN constants rather than the actual computational size (the actual size is determined at run-time in the input routine), were used. By doing this we could take most of the risk out of testing in full scale, at the price of finding some things out later in the project.

A new, application-oriented technique was needed. The approach was to partition the code into two types of segments: B-segments, which are I/O intensive, and G-segments, which are computation intensive. The idea is that a B-segment will execute in a single process, which after input broadcasts the derived information to all processors; subroutines tend to be called within the same segment type, so the partition respects subroutine context and functionality. The B-segments were constructed so as to have the following characteristics:

1. There is no branching into the segment: code branching into the segment risks being improperly executed.
2. The segment must maximize the number of input operations while at the same time containing a minimal amount of computation in between; in particular, the number of derived variables must be kept to a minimum.
3. The derived variables of the segment must be broadcast.
4. Due to potential side-effects, subroutine calls within the segment must be avoided whenever possible.

The notion of "critical code", established since the early days of parallel processing, is complemented here with that of a derived variable, meaning any variable that is modified within the segment. In particular, any variable involved in a READ statement is a derived variable. All derived variables in a B-segment must be broadcast.
B1:
      read(77,*) it, (igf(i,ib),i=1,nblock)
      do 40 j=1,kix
         p(j,ib) = a(j,ib)**2.0
         e(j,ib) = p(j,ib)/(gammar(nsp)-1.0)/rho(j,ib)
     *           + 0.5*rho(j,ib)*u(j,ib)
   40 continue

Derived variables: it, igf, p, e

B2:
      read(77,*) it, (igf(i,ib),i=1,nblock)
      call energy(e, ...)

Derived variables: it, igf, e(?), p(?), a(?), rho(?), gammar(?)

FIG. 5.1. Two B-segments. In B1 the derived variables depend only on the syntax of the segment; in B2 they depend on the call to energy.
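In parallel execution, a B-segment like B1 runs in a single process and its derived variables are then broadcast. A hedged sketch of this pattern (the MASTER guard and the one-dimensional igf are our simplifications; PAB3D's actual broadcast calls are generated automatically, see Figure 5.2):

      subroutine bseg1(myid, it, igf, nblock)
      include 'mpif.h'
      integer MASTER
      parameter (MASTER = 0)
      integer myid, it, nblock, i, ierr
      integer igf(nblock)
c     the input statement runs on a single process only
      if (myid .eq. MASTER) read(77,*) it, (igf(i),i=1,nblock)
c     the derived variables are then broadcast to all processes
      call MPI_BCAST(it, 1, MPI_INTEGER,
     *               MASTER, MPI_COMM_WORLD, ierr)
      call MPI_BCAST(igf, nblock, MPI_INTEGER,
     *               MASTER, MPI_COMM_WORLD, ierr)
      return
      end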
Figure 5.1 shows the syntax of two B-segments. In the first case, B1, the derived variables can be determined from the syntax of the segment itself, and the proper broadcasts follow the segment. In the second case, B2, the segment contains a call to a routine (energy), and one has to be careful: the routine may modify non-local variables, that is, variables in commons or dummy arguments, and it becomes very difficult to determine the derived variables. Detecting derived variables within routines can be tricky; consider, for example, the existence of dummy output variables of a routine (rarely documented in practice). Some compilers provide cross-references, which facilitates the detection, and static analyzers (rarely available) may provide very useful information. A possibility is to make the routine in-line and let the compiler do the work, avoiding the difficulties of side-effects altogether. However, this leads to another problem: when subroutines are in-lined, variables in commons are explicitly passed as arguments, and some routines in PAB3D are called with close to 200 arguments. Moreover, the declared dimension of an array passed as an argument is often provided as real a(1), so the declaration might not represent its actual dimension, while at broadcast time the actual dimension, possibly determined at run time, is required. Fortunately, the PAB3D documentation showing the derived variables of each routine can be complemented with the output of a syntax analyzer, which makes it possible to detect the derived variables in most cases.

In many cases, the message lengths for the arrays are determined from static data and thus the broadcast calls can be generated automatically from the syntax. The M1 source provides a '*.bct' include file, with a fixed format, which contains the automatically generated broadcast calls, as shown in Figure 5.2.
      XX = nblk*nsec*6*20*nprt*nzon
      call MPI_BCAST(ibcf,XX,MPI_INTEGER,
     +               MASTER,MPI_COMM_WORLD,ierr)
      XX = nblk*nzon*ngt
      call MPI_BCAST(ibf,XX,MPI_INTEGER,
     +               MASTER,MPI_COMM_WORLD,ierr)
      XX = nblk*(21+2*npcmx+1)
      call MPI_BCAST(iptf,XX,MPI_INTEGER,
     +               MASTER,MPI_COMM_WORLD,ierr)
      XX = jkmx*(ncsp)+1
      call MPI_BCAST(q0s,XX,MPI_REAL,
     +               MASTER,MPI_COMM_WORLD,ierr)

FIG. 5.2. Automatically generated '*.bct' file.
FIG. 5.3. Segment structure of the M1 executable: segments B1-B5 and G1-G3, with their lengths, number of derived variables, subroutines (rinput, zonm, inidct, init, jkbar, solver, outfl), I/O units (e.g. 98, 99), and the corresponding BCT files (Sol3-M1B2, Sol3-M1B3, Sol3-M1B4, Sol3-M2B1).
The ordering of the segments is important: in all cases, the broadcast statements that form a B-segment are issued in the order of the corresponding input. Figure 5.3 above shows the structure of the M1 executable, and Figure 5.4 illustrates the actual breaking up of the solver.

5.2. Phase M2: Communication. Phase M1 carried the parallelization up to the solver, the high-level routine of the PAB3D code in which most of the computation, and the most significant communication, takes place: the partial solutions are exchanged and the global residuals are computed after the global iterations. To understand these characteristics, it is necessary to describe the division of the solver into B-segments and G-segments a little more, and to define the buffer, qbuf, used for exchanging data between processors. (In Section 6 we retake this topic.)

FIG. 5.4. Segmentation of Solver13.f; the routines involved include regtbl2, turbke, initdct, zonm, and out.
Referring to the figure, solver4 consists of nested loops, as follows: the global iteration loop, the zone loop, and (a sequence of) block loops. It is at the lower level (the block loop) that the core computations take place. Each process must work independently while computing; however, before the computations can proceed, we must make sure that the receiving process has received all the pertinent information from the other processes. And this is done through COMMSYS.

COMMSYS consists of two parts: include files, which contain only declarations/definitions, and the communication system proper, which interacts with the PAB3D patching databases while keeping its own. The include files are commsyspar.h and commsys.h. The former declares parameters: the maximum number of blocks, the maximum number of pieces per block, and the maximum total number of pieces. While this information can be obtained directly from the PAB3D executable, leaving these as independent parameters provides more flexibility; the dependencies might be stated in the makefile. The latter contains the commons and the global arrays:

• n_piece(block): the (scalar) number of pieces of the block;
• send_piece(localpiece,block): the global number of each piece the block sends;
• recv_piece(localpiece,block): the global number of each piece the block receives;
• add_piece(2,piece): the actual address of the global piece in the buffer qbuf;
• from_piece(piece) and to_piece(piece): the blocks sending and receiving the piece.

Each local piece is identified within its block by a local number, and the matching global information is set according to the patching databases.
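A sketch of what the two include files might contain (the array names and shapes follow Figure 5.6; the parameter values and the common block are assumptions):

c     commsyspar.h -- compile-time limits (values illustrative)
      integer nblk, nlpi, npie
      parameter (nblk = 9, nlpi = 20, npie = 100)

c     commsys.h -- global piece bookkeeping
      integer n_piece(nblk)
      integer send_piece(nlpi,nblk), recv_piece(nlpi,nblk)
      integer add_piece(2,npie), from_piece(npie), to_piece(npie)
      common /comsys/ n_piece, send_piece, recv_piece,
     *                add_piece, from_piece, to_piece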
In addition, the include files COMM_GLOBAL.h and COMM_LOCAL.h contain the declarations pertaining to global and local communication information; since all processes run from copies of the same executable, it was natural to include this type of declarations in the subsystem itself.

Sending is non-blocking and receiving is blocking: non-blocking sends are issued for the pieces a block sends, and matching blocking receives for the pieces it receives, the send requests being completed according to their status. The number of communications depends on the number of pieces, a parameter in commsyspar.h. Figure 5.5 shows the send/receive activity in two processes, P1 and P6, during one global iteration; Figure 5.6 shows the relation between the databases of PAB3D and those of COMMSYS.

FIG. 5.5. Send/receive activity in processes P1 and P6: P1 sends and P6 receives.
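An illustrative sketch of this pattern, with the COMMSYS arrays driving the exchange (the qbuf layout in add_piece, the piece-number tags, and the block-to-rank mapping are assumptions):

      subroutine xchang(ib, qbuf, lenq)
      include 'mpif.h'
c     assumes the commsyspar.h/commsys.h declarations sketched above
      include 'commsyspar.h'
      include 'commsys.h'
      integer ib, lenq, lp, ip, idest, isrc, ierr
      integer ireq(nlpi), istat(MPI_STATUS_SIZE)
      integer istats(MPI_STATUS_SIZE,nlpi)
      real qbuf(lenq)
c     non-blocking sends for every piece this block exports;
c     the global piece number doubles as the message tag
      do 10 lp = 1, n_piece(ib)
         ip = send_piece(lp,ib)
c        destination rank under the prototype's block-to-processor
c        mapping (block ib runs in processor ib-1)
         idest = to_piece(ip) - 1
         call MPI_ISEND(qbuf(add_piece(1,ip)), add_piece(2,ip),
     *        MPI_REAL, idest, ip, MPI_COMM_WORLD, ireq(lp), ierr)
   10 continue
c     matching blocking receives
      do 20 lp = 1, n_piece(ib)
         ip = recv_piece(lp,ib)
         isrc = from_piece(ip) - 1
         call MPI_RECV(qbuf(add_piece(1,ip)), add_piece(2,ip),
     *        MPI_REAL, isrc, ip, MPI_COMM_WORLD, istat, ierr)
   20 continue
c     complete the sends before reusing the buffer
      call MPI_WAITALL(n_piece(ib), ireq, istats, ierr)
      return
      end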
5.3. Phase M3: Computation. As we mentioned, the innermost loop in solver4 consists of block loops. In each of the cases, the index variable of the loop is called 'ib'; for simplicity we will refer to these as ib-loops. Early in solver4 there is the ib-loop that invokes the communication system. Subsequent ib-loops, which do the computations on the blocks, contain the proper information, since it has been exchanged (through qbuf) by the time they execute. Similarly, as the different instances execute on separate processors, we can assure that each contains the proper information.
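Schematically (a sketch under our assumptions; owner(), xchang() and comput() stand in for the prototype's self-identification, COMMSYS call, and block computation):

      subroutine m3loop(myid, qbuf, lenq)
      include 'mpif.h'
      include 'commsyspar.h'
      integer myid, lenq, ib
      real qbuf(lenq)
      integer owner
      external owner
c     early ib-loop: invoke the communication system
      do 10 ib = 1, nblk
         if (owner(ib) .eq. myid) call xchang(ib, qbuf, lenq)
   10 continue
c     subsequent ib-loops: computation only; the interface data
c     is already in qbuf
      do 20 ib = 1, nblk
         if (owner(ib) .eq. myid) call comput(ib)
   20 continue
      return
      end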
FIG. 5.6. Relation between the databases of PAB3D and those of COMMSYS. The PAB3D side shows IPCBL(block,face,localpiece), IPCB(piece,pieceinfo) (number of cells, number of pieces, face/block belonging, face/block adjacency, adjacent local pieces) and IPTF(block,blockinfo); the COMMSYS side shows add_piece(2,npie), from_piece(npie), to_piece(npie), n_piece(nblk), send_piece(nlpi,nblk) and recv_piece(nlpi,nblk), collected in COMM_GLOBAL.h and COMM_LOCAL.h.
Using the newly provided information, the processors will find the required data, as in the case of the patching information; all communication has been taken care of at an earlier stage, in phase M2. In the prototype, the ib-loops run in 9 different processors, instance ib of the loop running in processor ib-1. The only communication needed within the loops is for self-identification, which is simply a parameter.

5.4. Phase M4: Updating. The final phase is the updating of the global solution and the output. The global residuals are computed from the partial residuals of the blocks in qbuf. This is a gather operation, the reverse of a broadcast: the partial residuals of all processes are brought to a designated root processor, and the result is passed to the processor in charge of the output. The data involved has been taken care of in step M2, so there is no further interphase communication.

6. Status, Conclusions, and Future Work. As we have been able to compute correct residuals over a number of global iterations, we are very optimistic about the timely completion of the project. The experiences with our prototype show that the MPI implementation of PAB3D was successful; there are, however, a few things that need to be perfected, to various degrees. The lessons learned here should provide clues on the design of future versions of parallel PAB3D. Here is a partial list.

6.1. Messaging. In an environment where the computational loads of the blocks vary widely, as discussed in Section 2, it is fundamental that the messages which involve a different block arrive by the time the corresponding receives are posted. The LAM/MPI commands provide a number of options, such as tagging, and specific implementations of point-to-point communication. In order to use the optimal method for our purposes,
a test-problem
OPTIMAL ?
I
[
r
]
G
segment
B
segment
bcl
FIG.
must the
be propcrly code,
with
designed. realistic
implementation
6.2. scribed
I/O
Broadcast
trivial
operation
5 may
Certainly,
the above
the
The organization
work,
is necessary: required
or not.
can be a quite
conditions
on, answered
A test-problem
the question
is needed
However,
the whole
code either
G. Clearly,
improvement,
especially
dummy
this
(number
elaborated
robustness,
of on the construction
restrictive.
size of the broadcast
whether
to guide
further
of process).
arrays,
defines
the
hierarchy that
is making
This idea is illustrated
in Figure
the code in its present
the
broadcast
each with
a
6.2.
state
admits,
of most
derived
lengths
can be
files.
5, the message
declared
de-
of partitions:
problem,
in Section
involve
a whole
partition
an optimization
However,
only involve
as we explain
define
G or B or its opposite,
properly
problem. which
they
of a segment
dimension.
lengths These
lengths. of the bct files, Figure
for this provision.
but should
a little
like making
for future
variables, to actual
transition
in parallel
The
complement
a straightforward
Due to provisions
changed
seem
partitions B and the
being
broadcast
segmentation
such as the one wc worked
run
optimization.
parameter
in principle,
Optimal
decisions.
in Section
from the
can
file
6.1.
A prototype,
data,
I
Finding
be relatively
some
to rearrange
of the actual the
5.1 has been purposedly
the correct
easy for someone lenghts
who knows
are broadcast
bet so as to let them
left in a uniform
value for the size of the message thc code well.
as derived
bc broadcast
15
variables
before
they
format
to allow easily
a
involvc
significant
Nonetheless,
a word
of caution
themselves.
Thus,
are used.
might
it might
be
full
6.3.
Memory
block
approach.
For
shrunken
block
using
the
This
is a major
leave
the
6.3. Memory Distribution. In Section 3 we explained the reasons why we decided to start with the full block model. For efficiency, however, it is imperative that the total core memory requirements be reduced by moving to the shrunken block version of the code, with truly distributed data structures. This is a major change, as it modifies the I/O as well, and we would like to leave the way open for this improvement.

Acknowledgements. The first author would like to thank David Keyes of ICASE and Old Dominion University for his support and for making the project possible.
REFERENCES

[1] K. Abdol-Hamid, A multiblock/multizone code (PAB3D-v2) for the three-dimensional Navier-Stokes equations: Preliminary applications, Tech. Report CR-182032, NASA, October 1990.
[2] K. Abdol-Hamid, Implementation of algebraic stress model in a general 3-D Navier-Stokes method (PAB3D), Tech. Report CR-4702, NASA, 1995.
[3] K. Abdol-Hamid, J. Carlson, and B. Lakshmannan, Application of Navier-Stokes code PAB3D with k-e turbulence model to attached and separated flows, Tech. Report TP-3489, NASA, 1994.
[4] K. Abdol-Hamid, J. Carlson, and S. Pao, Calculation of turbulent flows using mesh sequencing and conservative patch algorithm, Tech. Report 95-2336, AIAA, 1995.
[5] A. Ecer, J. Periaux, N. Satofuka, and S. Taylor, eds., Parallel Computational Fluid Dynamics: Proceedings of the Parallel CFD conference, June 26-28, 1995, Pasadena, California, North Holland, January 1996.
[6] C. Fischberg, C. Rhie, R. Zacharias, P. Bradley, and W. DesSureauault, Using hundreds of workstations for production running of parallel CFD applications, in Parallel Computational Fluid Dynamics, A. Ecer, J. Periaux, N. Satofuka, and S. Taylor, eds., North Holland, 1996, pp. 9-22.
[7] MPI primer / developing with LAM, tech. report, Ohio Supercomputer Center, 1224 Kinnear Road, Columbus, OH 43212, November 1996.
[8] W. Gropp, E. Lusk, and A. Skjellum, Using MPI, The MIT Press, 1994.
[9] Oak Ridge National Lab, MPI: A message passing interface standard, tech. report, University of Tennessee, 1995.