MPI Programming Environment for IBM SP1/SP2

Hubertus Franke, C. Eric Wu, Michel Riviere, Pratap Pattnaik, Marc Snir
IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598
email: {frankeh,cwu,riviere,pratap,[email protected]

Abstract

In this paper we discuss an implementation of the Message Passing Interface standard (MPI) for the IBM Scalable POWERparallel 1 and 2 (SP1, SP2). Key to a reliable and efficient implementation of a message-passing library on these machines is the careful design of a UNIX-socket-like layer in user space, with controlled access to the communication adapters and with adequate recovery and flow control. The performance of this implementation is at the same level as the IBM-proprietary message passing library (MPL). We also show that on the IBM SP1 and SP2 we achieve an integrated tracing capability, in which both system events (such as context switches and page faults) and MPI-related activities are traced with minimal overhead to the application program, thus presenting application programmers with a trace of all the events that ultimately affect the efficiency of a parallel program.

1 Introduction

A large portion of the computations on big computers is performed using either commercially available or in-house developed large program packages. Examples of such packages are NASTRAN for structural engineering, SPICE for circuit simulation, PISCES for device simulation, CHARMM for biomolecular dynamics, GAUSS-xx for quantum chemistry, and ORACLE or DB2 for database applications. These packages are typically fairly large and contain code developed over a number of years by a number of developers. The availability of these packages for modern parallel computers, such as the IBM SP1 and SP2, is essential to the widespread use of parallel computers in production environments. Two of the requirements of application software developers for migrating their packages to parallel computers are the availability of standardized application programming interfaces (APIs) across a wide range of parallel platforms, and tools to easily debug and monitor programs.

To fulfill part of these requirements, a number of computer manufacturers, university faculties, and members of national laboratories have defined a commonly agreed-upon Message Passing Interface (MPI) standard, which provides a good degree of portability for applications, ranging from workstation clusters to massively parallel platforms. Even though public-domain versions of MPI are available [4], it is expected that parallel computer manufacturers will develop efficient implementations of MPI for their own hardware. MPI provides a rich set of both basic and advanced communication functions (a total of 132), which can be classified into one of the following categories: (1) point-to-point communication, (2) group management, (3) collective communication, (4) virtual topology, and (5) environment management; details can be found in [10]. The first part of this paper discusses the design of our MPI implementation, called MPI-F, and its integration with the SPx system software. We compare MPI-F performance with MPL, the proprietary library for the SP, and with the public-domain version MPICH [4] running on top of MPL. MPI also provides a simple but powerful profiling interface definition. The second part of this paper discusses the design of program performance tracing in our MPI implementation given this framework, and shows that, while one obtains traces for system events such as page faults, process swaps, and I/O delays along with MPI-related activities, the overhead encountered at the application level is very small.
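As a concrete illustration of the point-to-point part of this API, the following minimal C program (a sketch for orientation, not code from the paper) exchanges a single message between two processes using only standard MPI calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, msg = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send one integer with tag 0 to process 1 */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}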

2 System Architecture

The 9076 Scalable POWERparallel System (SP1/SP2) is a distributed-memory multiprocessor produced by IBM [8]. Each SP1/SP2 node consists of an RS/6000 processor, and each node runs a full copy of the AIX UNIX operating system.

Fast communication is provided by a multistage packet-switching network with a hardware capability of 40 Mbyte/s duplex transfers from each node and a total latency below one microsecond. The effective performance achieved on the SP1 is much lower, due to limitations of the SP1 communication adapter. This is a passive device attached to the Microchannel bus that provides a simple interface consisting of FIFO buffers for incoming and outgoing packets and control registers. The SP2 communication adapter is DMA-capable and can sustain significantly higher bandwidth. The parallel operating environment (POE) on the SP consists of several components: (1) communication libraries, (2) single parallel job management, (3) overall job management, and (4) tools. Including MPI-F, three high-performance communication protocols are currently supported on the SP: MPL [2], PVM [5], and MPI [10]. The communication library is linked with the application; note, however, that linking with one of these libraries is not required to execute a parallel job. When a parallel job is executed, a single partition manager (PM) runs; it contacts the system's resource manager (RM) and requests a number of resources, most importantly the number of nodes. In accordance with the chosen allocation strategies, the RM reserves and assigns a number of nodes to the PM. The PM then spawns the parallel job on these nodes and monitors its liveness via a control network. It is also responsible for directing stdin/stdout in accordance with a specified strategy. If any process of the parallel job terminates (normally or abnormally), the entire job is terminated and the PM returns all resources to the RM before terminating itself. In the tool category three basic tools are provided: a parallel debugger (pdbx), a tracing and visualization tool (VT), and a parallel output interface (pmarray). Although MPI-F required providing a new communication library, no changes were necessary to any of the POE tools (with the exception of message-passing tracing in VT). In the following sections we discuss the MPI-F communication software in detail.

3 Communication Architecture

High performance in terms of latency and bandwidth can only be achieved if expensive UNIX operating system calls are avoided in the critical communication code path. Hence, the entire communication stack executes in user space, rather than invoking a UNIX device driver.

With this implementation, the communication adapter is dedicated to one process per node. Protection is provided by labeling each packet with a sufficiently large partition key. The communication software is structured in three layers: a message layer providing point-to-point message-passing functionality; a commonly shared point-to-point layer, the pipe layer, on which the message layer sits; and a packet layer underneath, which implements interfaces to the different communication media. All other communication services, e.g. collective communication and virtual topology services, are implemented on top of point-to-point message passing. In MPI-F we currently utilize public-domain code [4] for these services, but we are gradually switching to proprietary code. After a discussion of the shared layers, i.e. the packet and the pipe layer, we discuss the issues involved in moving the message layer from MPL to MPI.
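The layering just described might be captured by an interface of roughly the following shape (a hypothetical sketch; the actual MPI-F internal interfaces are not given in the paper, and all names here are invented):

/* Hypothetical sketch of the three-layer structure: each packet driver
   (SP1 adapter, SP2 adapter, or UDP) plugs in below the shared pipe layer. */
typedef struct packet_driver {
    int (*send_packet)(int dest_node, const void *pkt, int len);
    int (*poll_packet)(void *pkt, int maxlen, int *src_node);
} packet_driver_t;

/* pipe layer: reliable, flow-controlled byte streams built on a packet_driver */
/* message layer (MPL, MPI-F, PVM): matching, tags, communicators, datatypes   */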

3.1 Packet Layer

The packet layer provides a software point-to-point packet transport facility among processors by direct interaction with the communication network via the various communication adapters. This layer creates packets and inserts the appropriate routing information into each generated packet. In order to preserve the flexibility of choosing various routing strategies and to simplify error recovery, this layer, analogous to IP/UDP, is not assumed to provide reliable transport or packet ordering. However, the layer is expected to deliver uncorrupted packets, performing a checksum on packets if necessary. Currently packet drivers for SP1 adapters, SP2 adapters, and UDP are supported.
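A packet at this level might carry a header along the following lines (field names are hypothetical; the paper only states that packets carry routing information, a partition key for protection, and optionally a checksum, and that ordering is not guaranteed):

/* Hypothetical packet header for the unreliable packet layer. */
typedef struct pkt_header {
    unsigned int   route;          /* switch routing information              */
    unsigned int   partition_key;  /* rejects stray packets from other jobs   */
    unsigned short src, dst;       /* source/destination task ids             */
    unsigned int   seq;            /* sequence number, used by the pipe layer
                                      to place out-of-order data correctly    */
    unsigned short len;            /* payload length in bytes                 */
    unsigned short checksum;       /* optional integrity check                */
} pkt_header_t;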

3.2 Pipe Layer

The pipe layer provides a reliable, flow-controlled, byte-stream-oriented communication layer. Each processor maintains a send and a receive buffer (called a pipe) to every other processor in the parallel job. These buffers perform functions similar to UNIX pipes and sockets (that is how their name evolved), yet they should not be confused with them, as they are placed in user space to avoid the cost of a UNIX kernel access. This layer uses the typical mechanisms, such as flow control, acknowledgments, and retry after time-out. Flow control is achieved by associating tokens with the contents of the buffers. Tokens flow back when the message layer reads data from the receive pipe. Accordingly, the sending side is not permitted to enter new data into the send pipe if tokens are not available. The pipe layer provides a non-blocking functional interface to the higher-level protocol, the message layer.

The pipe layer has to drain packets from the network to avoid network congestion. Since packets may arrive out of order, it must place the incoming data at the correct position in the pipe buffer and acknowledge the packet. When data arrives at the front of a receive pipe, the message layer is notified via a callback that data can be read. Data does not have to be read immediately, but once it is read, tokens flow back. When tokens arrive and new send-pipe buffer space is freed, the message layer is called to insert more data into the pipe. Since either data packets or acknowledgment packets can be lost, the pipe scheduler will resend unacknowledged data after a certain time-out, and consequently the pipe layer must recognize duplicate packets and drop them. The pipes are scheduled in a fair manner for access to the underlying network; sending a large message will not prevent the delivery of pending sends on other pipes. Contiguous messages larger than the pipe buffer can be copied straight out of the user buffer, only requiring that parts of the message be flow-controlled and copied to/from the pipe. This makes the amount of data to be copied linear in the number of messages, not in the message size. The pipe scheduler does not run as a separate thread. Rather, it is invoked by the message layer in almost every point-to-point communication request. Progress must be guaranteed even when the application executes communication-unrelated code for long periods of time. This is achieved by periodically invoking the scheduler asynchronously in the background using UNIX signal handling. An interrupt-driven version, which calls the scheduler when packets arrive from the high-performance switch, also exists. This implies that the message layer (MPL/MPI/PVM) must be reentrant and free of resource contention.
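The token scheme described above can be pictured as follows (a hypothetical sketch, assuming one token per packet-sized slot in the receive pipe; all names and the payload size are illustrative):

/* Hypothetical sketch of token-based flow control in the pipe layer. */
#define PKT_PAYLOAD 224   /* illustrative payload size per packet */

typedef struct pipe {
    int send_tokens;      /* receive-pipe slots currently free at the peer */
    /* ... send/receive buffers, sequence numbers, timers ...              */
} pipe_t;

/* Copy as much of 'data' into the send pipe as tokens allow;
   the message layer retries the remainder when tokens flow back. */
int pipe_send(pipe_t *p, const char *data, int len,
              void (*enqueue_packet)(const char *, int))
{
    int sent = 0;
    while (sent < len && p->send_tokens > 0) {
        int chunk = (len - sent < PKT_PAYLOAD) ? (len - sent) : PKT_PAYLOAD;
        enqueue_packet(data + sent, chunk);   /* hand one packet to the packet layer */
        p->send_tokens--;
        sent += chunk;
    }
    return sent;
}

/* Called when an acknowledgment returns tokens from the receiver. */
void pipe_tokens_returned(pipe_t *p, int tokens)
{
    p->send_tokens += tokens;   /* the message layer may now insert more data */
}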

3.3 Message Layer

In this section we describe the MPI-F message layer and the changes that were required to migrate from the MPL message layer to an efficient MPI message layer. The MPL message-passing library consists of a small number (32) of functions for point-to-point communication, collective communication, and environmental inquiry and management. The basic point-to-point layer consists of nine functions: blocking and nonblocking sends and receives of contiguous data, blocking send and receive of strided data, and a combined send-and-receive function. Messages carry a tag (or type), and a receive can select messages by sender or by tag, with wildcards allowed for either or both.

The MPI point-to-point message-passing layer is much richer than the MPL one (53 functions). Also, the functionality of the basic message-passing functions is somewhat different, requiring changes in the implementation. An overriding concern of our implementation has been to achieve the same level of performance as MPL for those basic communication functions that MPL supports. One significant change introduced by MPI is the use of communicators. In MPL (as in most other message-passing libraries) the dest (destination) parameter is an absolute index that identifies the message destination. In contrast, in MPI, dest is the relative index of the destination within an ordered group of processes identified by the comm (communicator) argument. This mechanism provides important support for modular development of large codes and libraries: a module running on a subset of processes can use a local name space for its communication. This change introduces one additional level of indirection, as an absolute address has to be derived from dest by a lookup in the communicator member table. The communicator argument (which is a handle to a communicator object) also identifies a communication context. Communications using different communicators occur in different "communication universes" and do not interfere with each other. To ensure this, messages carry an additional context field; exact matching on context is required at the receiving end. This requires minor changes in the implementation of sends and receives. More significant changes were required because of the different semantics of the basic send and receive functions. All message-passing libraries have to cope with the limited amount of buffer space that can be made available to the library. If message production runs too far ahead of message consumption, then the system may run out of buffers, which forces either blocking senders or aborting the program. If the first option is chosen, then programs may deadlock. Thus, any message-passing library has to assume that the user program is "well-behaved", to some extent, in its buffering requirements. Various trade-offs are possible in the implementation of a message-passing library. Generally, an implementation that is more "lenient" to the user (requires looser coordination between producer and consumer) and buffers more aggressively will use more memory and will have higher communication overheads, because of the need for additional memory-to-memory copies and dynamic buffer allocation.
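The extra level of indirection introduced by communicators, described above, amounts to little more than a table lookup; a hedged sketch (the structure and names are illustrative, not those of the MPI-F sources):

/* Hypothetical sketch: translate a communicator-relative rank into the
   absolute task id used by the lower layers. */
typedef struct communicator {
    int  context;      /* matched exactly at the receiving end */
    int  size;         /* number of processes in the group     */
    int *members;      /* members[i] = absolute task id        */
} communicator_t;

static inline int abs_dest(const communicator_t *comm, int dest)
{
    return comm->members[dest];   /* one extra memory reference per send */
}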

Figure 1: Latency in MPI-F, MPICH, and MPL on the SP2 (latency in microseconds vs. message length in bytes).

The current MPL implementation is very strict (or restricted) in its buffer allocation policy: no buffering is provided in addition to that available in the pipe layer. If the pipe between Ps and Pr is full, then no additional data can move into the pipe. A send from Ps to Pr blocks until a receive is posted at Pr for the message at the head of the pipe. Many users seem to desire more buffering. More importantly, MPI specifies that communication with different communicators is non-interfering. This becomes important if processes are multithreaded: blocked communication between two threads should not prevent communication in another context between two other threads of the same processes. Thus, the pipe between two processes needs to be fairly multiplexed by all contexts that use it, and, consequently, the pipes must be drained to allow succeeding messages to get through. To achieve this goal, we provide additional buffering for early-arriving messages at the receiver side. Short messages are sent eagerly and are buffered by the receiver. Long messages use a rendezvous protocol: the sender first issues a "request-to-send", and the receiver acknowledges this request when a matching receive is posted. At that point, the sender forwards the data. Thus, the amount of buffering that needs to be provided is proportional to the number of messages sent before a receive is posted, not to their total size. The additional overhead of the rendezvous protocol is negligible for long messages. In order to provide good buffer utilization, we dynamically allocate buffer space for early-arriving messages from one shared pool. Note that the malloc UNIX function cannot be used for that purpose: malloc is not a protected, atomic system call, while the communication library can be invoked asynchronously by a UNIX signal to handle an incoming message. Instead, MPI-F uses its own "private" buffer management. This buffer management is also used for allocating space for the various objects that are created dynamically by MPI calls. Additional message flow control can be enabled to restrain the sending side from sending too many unmatched messages. All parameters, i.e. message tokens, early-arrival buffer size, and rendezvous threshold, can be specified on a per-job basis at start time. To determine the latency and bandwidth for point-to-point communication we measured the time of a simple ping-pong program using MPI and MPL Send and Recv for contiguous messages of various sizes. We also provide performance numbers for MPICH, a portable public-domain version, which in this case sits on top of the native MPL library. Figure 1 shows latency numbers for the SP1 and SP2 ("wide nodes" = 590s).
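The choice between the eager and the rendezvous path can be summarized by a sketch like the following (hypothetical code; the threshold name and value are illustrative, and, like the real parameters, would be settable per job at start time):

/* Hypothetical sketch of the eager/rendezvous decision on the send side.
   pipe_write() stands in for the pipe-layer interface. */
#define EAGER_THRESHOLD (8 * 1024)

enum msg_kind { MSG_EAGER, MSG_REQUEST_TO_SEND };

void pipe_write(int dest, enum msg_kind kind, const void *buf, int len,
                int tag, int context);   /* provided by the pipe layer */

void msg_send(int dest, const void *buf, int len, int tag, int context)
{
    if (len <= EAGER_THRESHOLD) {
        /* Eager: data goes out immediately; an early arrival is buffered
           by the receiver until a matching receive is posted. */
        pipe_write(dest, MSG_EAGER, buf, len, tag, context);
    } else {
        /* Rendezvous: send only a request-to-send; the data is forwarded
           once the receiver acknowledges that a matching receive exists. */
        pipe_write(dest, MSG_REQUEST_TO_SEND, NULL, 0, tag, context);
    }
}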


Figure 2: Bandwidth in MPI-F, MPICH, and MPL on the SP2 (bandwidth in Mbytes/s vs. message length in bytes).

The additional functionality of MPI has not led to a change in performance as compared to MPL. The reason is that no additional buffering occurs in the "best" case, where receives are posted ahead of sends (no early arrivals). Essentially, MPI-F provides additional buffering only in situations where MPL would block. The steps in the latency indicate the fixed overhead associated with sending a packet (packet size = 232 bytes). MPICH has a constant overhead of 15 microseconds per message. The bandwidth shown in Figure 2 again demonstrates very similar performance between MPL and MPI-F. The setback at 8 Kbytes in MPI-F is due to the fact that MPI-F switches to a rendezvous-based protocol to avoid resource problems. The MPICH figures demonstrate that, although the public-domain version understandably penalizes small messages, it exposes the full bandwidth of the underlying system for large messages.

3.4 Derived Datatypes

The MPL call argument msglen specifies the length of the message in bytes. This argument is replaced in MPI by two arguments: type and count. Type specifies the basic datatype of the message component (integer, real, etc.), whereas count specifies the message length in multiples of this basic component. This simplifies the task of the programmer and provides better portability across machines with different sizes for the same basic type. More importantly, it allows data conversion for MPI implementations across heterogeneous systems. Support of MPI types requires no significant overhead for simple, predefined types in a homogeneous environment: the count argument merely needs to be scaled by a factor that depends on the type argument. However, MPI also supports user-defined types. Such a type basically specifies a sequence of displacements (relative to the initial address of the communication buffer) and the basic datatype of the element at each displacement. New types are built by applying a variety of type constructors to previously defined datatypes (such as concatenation, replication with stride, replication with a sequence of user-provided displacements, etc.). Datatype definitions can be nested to an arbitrary depth. With such user-defined datatypes one can send or receive, with one call, a structure, or a submatrix of an array, or, indeed, an arbitrary collection of objects. The communication operation needs to interpret the user-defined datatype in order to gather the data from the communication buffer, or scatter it to the communication buffer. One approach is to prepare a "flattened" description of the datatype, as a sequence of (displacement, blocklength) pairs. In order to collect the data, one need merely traverse this flat structure sequentially. The disadvantage of this approach is that the flattened datatype descriptor can be exponentially larger than the compact definition provided by the definition of that datatype (i.e., by the labeled directed acyclic graph that encodes the expression defining the datatype). Indeed, the flattened description can take a size proportional to the message size. The alternative is to "evaluate" the expression that defines the datatype on the fly, using a recursive traversal algorithm. This evaluation computes the sequence of displacements specified by the datatype expression and simultaneously gets (puts) the elements at these displacements from (into) the communication buffer.
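For example, a column (or any strided submatrix) of a C array can be described once with a derived datatype and then sent with a single call. The following sketch uses only standard MPI-1 calls; the array dimensions and the helper function are illustrative:

#include <mpi.h>

/* Send one column of a ROWS x COLS matrix of doubles to process 1.
   The buffer layout is row-major C. */
#define ROWS 64
#define COLS 32

void send_column(double a[ROWS][COLS], int col)
{
    MPI_Datatype column;

    /* ROWS blocks of 1 double each, separated by a stride of COLS doubles */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&a[0][col], 1, column, 1, /* tag */ 99, MPI_COMM_WORLD);

    MPI_Type_free(&column);
}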

A further degree of sophistication is needed if one wishes to avoid copying and packing the entire message before it is sent out (the pre/post-packing method). One needs to be able to collect the "next" n bytes of a communication buffer specified by such an expression, and save the state of the traversal algorithm at that point; n may be fixed (the size of a buffer) or variable (the amount of space currently available in the pipe). We have implemented such an on-the-fly packing/unpacking algorithm, so as to move data directly from the user communication buffer to the pipe, and vice versa. The state of the recursive traversal algorithm is encoded in a simple frame-stack object attached to the internal message descriptor. Each frame holds the five variables necessary to emulate the recursive algorithm. If a frame was prematurely interrupted, the next time data can be sent to or received from the pipe the algorithm picks up at that state. Note that several messages to different destinations can be concurrently active. The copying of contiguous blocks in this algorithm uses pipelined copies through floating-point registers for 8-byte-aligned data and block sizes >= 64 bytes, thus yielding higher copy bandwidth. We measured the performance for sending/receiving large 2-D strided vectors as a function of the inner block size using a ping-pong program. The stride was chosen to be double the inner block length. Besides on-the-fly and pre/post packing, which utilize the same frame-object-based algorithm, we also implemented three possible schemes by which users could pre/post pack, having direct knowledge of the data layout. The first is referred to as usr-C-packing, where data is copied simply using the most appropriate C type in the inner loop (char, short, int, or double); the second we refer to as memcpy-packing, where the user uses memcpy as the copy mechanism for the inner loop. The third is called fmemcpy-packing and is similar to memcpy-packing, but uses the MPI-F internal fast memcpy mechanism. Figure 3 shows that the on-the-fly method performs significantly better than the other methods for inner loop sizes of 8 or more bytes. For smaller inner blocks, direct C-type coding performs best, but levels off very quickly. The memcpy method has virtually no advantages on the SP. Regardless, both usr methods reduce portability. Comparing fmemcpy-packing with pre/post packing demonstrates that the general internal frame-stack-based packing/unpacking mechanism used to implement MPI_Pack and MPI_Unpack does not have any significant overhead compared to user-written optimized code.
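The two simplest user-level packing schemes compared above can be pictured as follows (a hypothetical sketch of the kind of code measured, shown here for double elements; the measurement itself also used char, short, and int inner loops):

#include <string.h>

/* usr-C-packing: typed inner loop over 'nblocks' blocks of 'blocklen'
   doubles, separated by 'stride' doubles in the source array. */
void pack_c(double *dst, const double *src, int nblocks, int blocklen, int stride)
{
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; i < blocklen; i++)
            *dst++ = src[b * stride + i];
}

/* usr-memcpy-packing: one memcpy per inner block. */
void pack_memcpy(double *dst, const double *src, int nblocks, int blocklen, int stride)
{
    for (int b = 0; b < nblocks; b++) {
        memcpy(dst, src + b * stride, blocklen * sizeof(double));
        dst += blocklen;
    }
}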

Figure 3: Bandwidth for 2-D strided communication in MPI-F (bandwidth in Mbytes/s vs. inner block size in bytes, for the pack-on-the-fly, pre-/post-packing, usr-fmemcpy, usr-memcpy, and usr-C-packing methods).

Figure 4: Bandwidth for 3-D strided communication in MPI-F (noncontiguous 3-D vector T[..][d1][d0]; bandwidth vs. d0 and d1).

Note that for small messages (less than the pipe size) the advantage of the on-the-fly method is even greater, since even contiguous messages are copied entirely into the pipe buffer (see the pipe-layer section). We show, in Figure 4, the bandwidth achieved for communication of a 3-D char matrix A[d2][d1][d0] as a function of d0 and d1 (with d2 = 16M / (d1 * d0)). The matrix is transferred using a derived datatype with a three-level nested definition. As can be expected, for small d0 the overhead is large. However, the graph shows that the bandwidth basically depends only on the size d0 of the inner block, and that 85% of the maximum achievable bandwidth is already reached at approximately d0 = 128 bytes (for the SP1 this is at 64 bytes); i.e., the overhead of one transition in the datatype evaluation process is less than 15% if the "leaf" object has 128 bytes.

4 UTE/MPI Trace Library

Parallel programs are more difficult to understand than sequential ones. It is common to conceive of a parallel algorithm, implement it, and then be puzzled by its disappointing performance, even though the program executes correctly. What is needed then is instrumentation to collect data that leads to an understanding of the program's behavior. To facilitate the building of program instrumentation, MPI provides a profiling interface in which all of the MPI-defined functions may be accessed with a name shift. That is, all of the MPI functions which normally start with the prefix MPI_ are also accessible with the prefix PMPI_. Thus, the profiling interface provides a simple mechanism to "wrap" original MPI functions with any code (e.g. tracing, graphics, printfs, etc.) and export them as official MPI functions. Typically this is achieved by instructing the linker to support each MPI function also under the shifted name. Providing such a general mechanism has several advantages for building profiling libraries:

1. The overhead of generating traces is only present in the profiling library and is not part of the base communication library.

2. Different tracing and profiling facilities can be utilized with the same base communication library.

3. The profiling library can be partial, e.g. only certain functions may be "wrapped".

4. Application code does not have to be changed.

A no-op routine, MPI_Pcontrol(), is also provided in the MPI library for the purpose of enabling and disabling profiling in an MPI profiling library. We use this MPI profiling interface to build our MPI trace library on top of the AIX trace facility for IBM SP systems. The choice to build a Unified Trace Environment (UTE) for MPI applications using the AIX trace facility provides a unified and easily expandable trace environment for performance analysis. Without such a unified environment, multiple trace facilities would be required to trace the various software layers such as MPI, PIOFS (a parallel file system), and HPF. That would not only make trace generation more intrusive, but also make performance analysis tedious and difficult.
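The MPI_Pcontrol() hook mentioned above lets an application bracket a phase of interest; a small sketch (the interpretation of the level argument is left to the profiling library by the MPI standard, and the function name solve_phase is a placeholder):

#include <mpi.h>

void traced_phase(void (*solve_phase)(void))
{
    MPI_Pcontrol(1);    /* a profiling library may start tracing here */
    solve_phase();      /* application code (placeholder)             */
    MPI_Pcontrol(0);    /* ...and stop it here                        */
}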

4.1 Tracing under AIX

The AIX trace facility is able to capture a sequential flow of time-stamped events, providing a fine or coarse level of detail on message-passing and system activities. The AIX operating system is instrumented to provide general visibility of system events. Possible system events include process dispatch, page faults, system calls, and I/O events such as read and write. The AIX tracing facility allows user-specific events to be defined and generated with the same mechanism. This extension, together with the name shifting provided by the MPI profiling interface, was chosen to implement the UTE/MPI tracing library. Many trace systems, such as those in [3, 6, 11], require source code modification to generate trace events. The UTE/MPI trace library, on the other hand, requires only re-linking for trace generation. If the application source code is available, additional user markers provided by the trace facility can be inserted into the source code. This allows the user to mark specific portions of the program for performance analysis. An ideal trace facility should be able to generate user-controllable events with minimal overhead. If the trace overhead is large, the timestamp associated with each event may be altered significantly, and the workload statistics and run-time data obtained in performance analysis may be hardly meaningful. Care was taken in the design and implementation of the AIX trace facility to make the collection of trace data efficient, so that system performance and flow are minimally altered by activating trace generation. For example, the trace facility pins the data collection buffer in main memory to reduce trace generation overhead, and the size of the data collection buffer can be specified by the user at the time of activating trace generation. This also avoids tracing side effects, e.g. page faults, which would otherwise lead to nondeterministic overhead in the tracing itself. In the UTE/MPI trace library, trace generation is controlled by an environment variable, TRACEOPT, which defines trace options such as the system and message-passing events the user is interested in, the size of the data collection buffer pinned in main memory, and the file-name prefix for trace files. This allows generation of events (system or message-passing events) to be selectively enabled at execution time. If the environment variable is not defined, the application runs without generating any trace events.
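Reading such an option string at start-up requires nothing beyond the standard C library; a hypothetical sketch (the actual TRACEOPT syntax and the class names used here are not specified in the paper):

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: enable trace classes named in TRACEOPT. */
static int trace_mpi_events, trace_dispatch_events;

void trace_init(void)
{
    const char *opt = getenv("TRACEOPT");
    if (opt == NULL)
        return;                         /* variable unset: run without tracing */
    trace_mpi_events      = (strstr(opt, "mpi")      != NULL);
    trace_dispatch_events = (strstr(opt, "dispatch") != NULL);
    /* the buffer size and trace-file prefix would be parsed similarly */
}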

If a user is only interested in message-passing and process dispatch events, other system events, such as page faults and I/O events, will not be generated as long as the user does not explicitly ask for them in the environment variable TRACEOPT. The cost of cutting a trace record is broken into two parts: the cost of testing whether the event is enabled and then calling the trace-buffer insertion routine, and the cost of the trace-buffer insertion routine itself. If a typical trace record has three words of data in addition to a one-word event header (the so-called hookword, which identifies the event type and record length) and a one-word timestamp, the average cost of cutting a trace record is around 110 machine instructions. Thus, the trace generation facility is very efficient and adds only a few microseconds to the elapsed time for each trace event. The AIX trace facility uses its local RTC (real-time clock) to generate timestamps. Since separate trace streams are produced independently by multiple processors in an SP system, the logical order of events cannot be guaranteed, due to discrepancies among local clocks. This may lead to the partial ordering problem described in [9]. In the presence of a global clock, as provided by the IBM SP1/SP2 communication switch, this problem could be completely avoided if all events used the global clock instead of the local clock. However, our experience shows that it is much more expensive to access the global clock than the local one. This is because the local clock register resides inside the processor and can be accessed in a few machine instructions, while the global clock register is on the adapter, and several microseconds, including software overhead, are required to access it. In addition, this approach is unfeasible as it requires changes to the AIX tracing facility. Not only are the system clocks typically out of sync, but they also drift relative to each other. We have monitored the drift of the system clocks in an SP1 machine over 3 months. The maximum drift observed was 40 msec/hour. Hence, just cutting a clock-adjustment trace event at the beginning of the program execution is not sufficient. We therefore access the global clock once every 400 msec (i.e. when the low-level communication timer fires) to guarantee that the maximum drift between two timestamps can be adjusted to less than 5 microseconds, which is well below the message-passing latency.
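During trace merging, a local timestamp can then be converted to an approximate global one by interpolating between the periodically sampled local/global clock pairs; a hedged sketch of that adjustment (names and structure are illustrative):

/* Hypothetical sketch: adjust a local timestamp using the two
   global-to-local clock samples that bracket it. */
typedef struct clock_sample {
    double local;    /* local RTC reading at sampling time        */
    double global;   /* switch global clock read at the same time */
} clock_sample_t;

double to_global(double t_local, const clock_sample_t *a, const clock_sample_t *b)
{
    /* linear interpolation accounts for both offset and drift */
    double rate = (b->global - a->global) / (b->local - a->local);
    return a->global + (t_local - a->local) * rate;
}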

4.2 MPI Event Generation, Analysis, and Visualization

For MPI events, we capture the begin and end events for each MPI routine. Figure 5 illustrates how the UTE/MPI trace library is written and how it interfaces with the AIX trace facility.

#define ev_send_start (0x10)
#define ev_send_end   (0x11)

int MPI_Send(void* buf, int cnt, MPI_Datatype type,
             int dst, int tag, MPI_Comm comm)
{
    int rc;
    cut_event(ev_send_start, cnt, type, dst, tag, comm);
    rc = PMPI_Send(buf, cnt, type, dst, tag, comm);
    cut_event(ev_send_end);
    return rc;
}

Figure 5: MPI_Send in the profiling library.

The profiling interface as defined also has certain drawbacks. Without access to the MPI internal data structures, it can be difficult to trace all functions efficiently. For instance, for visualization, ranks are most likely to be displayed as global ranks and not as communicator-relative local ranks. This information is readily available in the MPI internal data structures but, if not accessible, must be obtained through a series of MPI function invocations, ultimately increasing the tracing overhead. A merge utility, utemerge, is used to merge multiple trace files. Trace files are merged based on global timestamps generated from the local timestamps and the adjustment computed from the periodically taken global-to-local timestamps. The merged trace stream is then passed to other tools for trace listing, performance analysis, or visualization. Another utility, lsute, is used to list and analyze UTE/AIX trace files. With no option set, the lsute utility lists each event, including node ID, timestamp, event name, and associated data words. The utilities can also generate a histogram for MPI routines, reporting the number of times each MPI routine is called and the total and average elapsed times for each routine called in the application. Since each node in an IBM SP system may be shared by other processes, information on how the total elapsed time was partitioned may be very useful. For the main process, the utility shows both the time when the CPU is running it and the time when the main process is in its compute mode (i.e. not running in any MPI routine). Table 1 shows an example of a time partition table for a set of four trace files.

Table 1: A time partition table (times in seconds)

Node                  0        1        2        3
Main pid          15076    15183    18901    11172
Elapsed time      27.155   27.696   27.799   27.702
Other processes    0.689    0.210    0.127    0.111
Idle time          0.293    0.259    0.310    0.282
Main process      26.171   27.226   27.360   27.308
Compute time      14.649   14.726   14.609   14.559

The analysis of parallel program tracing typically involves matching events in one stream with related events in another stream. For example, in message-passing systems it is important to provide users with run-time data such as the observed message-passing time and local wait time for each message. Detailed descriptions of the analysis techniques can be found in [12]. For visualization we use upshot, a public-domain visualization tool. A conversion utility, ute2ups, was developed to convert events to the format which upshot understands. Matched sends and receives can be displayed by arrows, from the begin event of a send (such as MPI_Send) to the end event of the corresponding receive (such as MPI_Recv). Figure 6 shows an example of an upshot visualization, in which an arrow indicates a pair of send and receive. This is only a first step as a proof of concept, and we plan to interface with other visualization tools.

Figure 6: An upshot visualization (per-node timelines of MPI routines; arrows connect matched send/receive pairs).

5 Conclusion

Our current experience with MPI indicates that, notwithstanding the large number of functions and options, basic communication can be implemented to be as fast in MPI as in simpler libraries, such as MPL. Furthermore, the added functionality proves useful, both for achieving better performance for more complex communication patterns and as support for parallel libraries. The profiling interface has proven to be an effective interface for building powerful program analysis tools, which will hopefully speed up the development of industry-strength portable parallel libraries and application packages.

References

[1] D. Bailey, J. Barton, T. Lasinski, and H. Simon, "The NAS Parallel Benchmarks," International Journal of Supercomputer Applications, vol. 5, pp. 63-73, 1991.


[2] V. Bala et al., "The IBM External User Interface for Scalable Parallel Systems," Parallel Computing, vol. 20, pp. 445-462, 1994.

[3] H. Davis, S. Goldschmidt, and J. Hennessy, "Multiprocessor Simulation and Tracing Using Tango," Proc. 1991 Int'l Conf. on Parallel Processing, pp. II-99 - II-107, Aug. 1991.

[4] N. Doss, W. Gropp, E. Lusk, and A. Skjellum, "An Initial Implementation of MPI," Technical Report MCS-P393-1193, Mathematics and Computer Science Division, Argonne National Laboratory, Dec. 1993.

[5] A. Geist et al., "PVM 3 User's Guide and Reference Manual," Tech. Rep. ORNL/TM-1287, Oak Ridge National Laboratory, May 1993.

[6] G. Geist, M. Heath, B. Peyton, and P. Worley, "A Users' Guide to PICL: A Portable Instrumented Communication Library," Technical Report ORNL/TM-11616, Oak Ridge National Laboratory, Oct. 1990.

[7] IBM, "IBM AIX Parallel Environment Operation and Use," Manual SH26-7230, IBM Corp., Sept. 1993.

[8] G. Khermouch, "Technology 1994: Large Computers," IEEE Spectrum, vol. 31, no. 1, pp. 46-49, 1994.

[9] L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, vol. 21, no. 7, pp. 558-565, July 1978.

[10] MPI Forum, "Document for a Standard Message-Passing Interface," Tech. Rep. CS-93-214, University of Tennessee, Nov. 1993.

[11] K. So, A. Bolmarcich, F. Darema, and V. Norton, "A Speedup Analyzer for Parallel Programs," Proc. 1987 Int'l Conf. on Parallel Processing, pp. 653-662, Aug. 1987.

[12] C. E. Wu, Y. H. Liu, and Y. Hsu, "Timestamp Consistency and Trace-Driven Analysis for Distributed Parallel Systems," IBM T. J. Watson Research Center, Sept. 1994.