SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 25(4), 429–461 (APRIL 1995)
(Received 23 May 1994; revised 11 November 1994)

Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs using the AIMS Toolkit

JERRY YAN, SEKHAR SARUKKAI AND PANKAJ MEHRA

Recom Technologies, MS T27A-1, NASA Ames Research Center, Moffett Field, CA 94035-1000, U.S.A. (e-mail: {yan,sekhar,mehra}@nas.nasa.gov)

SUMMARY

Writing large-scale parallel and distributed scientific applications that make optimum use of the multiprocessor is a challenging problem. Typically, computational resources are underused due to performance failures in the application being executed. Performance-tuning tools are essential for exposing these performance failures and for suggesting ways to improve program performance. In this paper, we first address fundamental issues in building useful performance-tuning tools and then describe our experience with the AIMS toolkit for tuning parallel and distributed programs on a variety of platforms. AIMS supports source-code instrumentation, run-time monitoring, graphical execution profiles, performance indices and automated modeling techniques as ways to expose performance problems of programs. Using several examples representing a broad range of scientific applications, we illustrate AIMS' effectiveness in exposing performance problems in parallel and distributed programs.

key words: parallel and distributed programming; performance evaluation; visualization tools; modeling and prediction; message-passing programs

INTRODUCTION

Although parallel processing promises to speed up scientific applications by several orders of magnitude in the near future, achieving and sustaining high performance on scientific programs requires that (i) the best known algorithm be used for solving the problem at hand, (ii) the parallel implementation of the algorithm be tailored to the given architecture, and (iii) the implementation be evaluated and tuned so as to run efficiently on a large number of processors as well as on large problem sizes. The tools required for each of these phases are varied and important. In general, a parallel implementation of the best sequential algorithm will perform well. However, a parallel implementation whose sustained performance will eventually scale up to teraflops is, in general, harder to design. The performance achieved depends upon several factors, including the multiprocessor architecture, system software, data distribution and alignment, as well as the methods used for partitioning the application and mapping its components onto the architecture. As more complex applications are developed, as new parallel-programming languages/paradigms are adopted, and as large computing systems with heterogeneous components are used, there will be a growing need for a common platform on which to measure and tune parallel-program performance. Tools for performance tuning and analysis developed by researchers at academic institutions and government laboratories1 continue to exist only as research prototypes, because the following characteristics of real-world parallel-processing environments hinder the development of robust tools.

1. Lack of stable hardware/software platforms. In the past two years, almost all major computer vendors have announced new scalable, highly parallel multiprocessors (e.g. TMC's CM-5, Intel's Paragon, Cray's T3D, IBM's SP-1, and Convex's SPP). Each of these machines offers its own programming model, communication library and hardware architecture. Their system software is relatively immature (less than five years old), delivered incrementally as the hardware changes, and their hardware is incrementally exploited in the field as new versions of system software are delivered.

2. Many logistical issues still prevent the development and dissemination of tool technology. The large start-up cost of building performance tools* makes it difficult for the research community to keep up with the industry. Many research groups do not have the expertise or interest to produce software of release quality. The few that do lack the necessary financial support and personnel to produce complete, robust tools. Although a few conferences enable tool developers to meet annually and exchange information, many organizations stay away because of concerns about intellectual property rights and national security; this hinders open and free exchange of parallel-tool technologies.

3. The gap between user expectations and tool capabilities persists. On the one hand, users do not want to 'waste time' testing prototype tools from the research community. On the other hand, tool developers are more concerned with exploring and incorporating new performance metrics and tuning methodologies into their tools; they do not want to 'waste time' making their tools robust and appealing or rewriting the documentation to meet commercial standards. Even though vendors are beginning to bundle performance tools with their machines (e.g. MPP Apprentice on the Cray T3D, XiPD on the Intel Paragon and Prism on the TMC CM-5), the use of such proprietary tools confines the user to a specific methodology for improving program performance on one particular machine. One cannot experiment with new performance methodologies across different platforms.

In addressing these issues, it becomes apparent that performance-tuning tools should not be closely coupled to a specific architecture or particular system software. Instead, the architecture-specific components of a toolkit should be isolated, and interfaces to them clearly defined, in order to enhance portability across platforms from different vendors.

* A performance tool-set would at least include a (language-dependent) source-code analyser and a (machine-dependent) run-time monitor, both of which require substantial effort to bring up for a single ⟨paradigm, machine⟩ combination, not to mention porting across different machines.


Toolkits based on source-code-level instrumentation (e.g. Pablo,2 AIMS3) meet this requirement. Instrumented programs generate streams of performance data at execution time by making calls to a performance-monitoring library (or monitor). Typically, this monitor is the only part that needs to be ported to new hardware platforms, because the collected performance data can be characterized using platform-independent performance indices and visualized using architecture-independent views.

In addition to rapid changes in architecture and system software, newer programming paradigms—such as object-oriented programming and data-parallel languages—are becoming more popular. In order to continually support the latest developments in such a dynamic environment, performance-tuning tools should leverage compiler front-end tools and open software architectures. For performance tools to be useful in practice, a number of requirements must be met:

(a) automated instrumentation of parallel programs
(b) low overhead in monitoring program execution
(c) compensation of intrusion caused by monitoring
(d) execution profiles that isolate program regions whose performance dominates overall performance
(e) graphical views and indices to expose performance bottlenecks
(f) rapid prediction and characterization of large-scale applications.

The first three requirements ensure that performance data can be collected reliably and with low overhead. Profiles, indices and graphical views can be used to analyse the performance of a single program execution. Scalability-analysis tools are useful for predicting a program's performance without having to first execute it; these are especially useful for studying large-scale scenarios, i.e. when system parameters (such as the number of processors) and application parameters (such as the problem size) take on values so large that it is not feasible to obtain reasonable execution traces.

In this paper, we present a methodology based on AIMS (automated instrumentation and monitoring system), a toolkit for tuning and predicting the performance of message-passing programs on both tightly- and loosely-coupled multiprocessors. AIMS was developed at NASA Ames Research Center under the High Performance Computing and Communications Program. It consists of a suite of software tools for measurement and analysis of performance, as shown in Figure 1.

Figure 1. Overview of the AIMS toolkit: AIMS includes 1. source-code instrumentors for various message-passing paradigms, 2. monitor libraries for various multiprocessors, 3. an intrusion compensator, 4. a view kernel for performance visualization, 5. a statistics kernel for profiling, 6. an index kernel to guide performance tuning, 7. a modeling kernel for performance modeling, and 8. trace converters enabling the use of other performance-tuning tools

An instrumentor first parses the application's source code and inserts instrumentation at selected points. Upon execution of the instrumented program (linked with a machine-specific monitor), a stream of events is collected into a tracefile. This tracefile's format is consistent across architectures. Run-time measurements needed to generate tracefiles perturb program execution and can significantly alter the communication characteristics and computation times of recorded events. Such intrusion is removed by an intrusion-compensation module using appropriate communication-cost models for the underlying architecture.4 This yields a compensated tracefile that can then be input to the various post-processing tools that determine the performance characteristics of a program. AIMS provides four trace post-processing kernels to support, respectively, performance visualization, profiling, performance tuning, and performance modeling:

1. VK. The view kernel displays the dynamics of program execution using animations.
2. SK. The statistics kernel presents an overall execution profile of a program in terms of the time spent in each procedure doing various activities (namely computation, communication and synchronization).
3. IK. The index kernel locates and explains performance failures via simple indices.
4. MK. The modeling kernel automates the process of building and simulating parallel-program models. Based on such models, users can obtain asymptotic performance characteristics for either the entire program or individual components thereof.

Users may use VK and SK to determine how well execution progressed during a monitored run of a program. IK complements these tools by providing plausible explanations for observed performance in terms of commonly occurring performance problems in message-passing programs. Finally, MK provides a convenient means of estimating how the program would behave if the execution environment were modified. Using VK, SK and IK, the performance of a scaled-down program can be tuned; MK and IK can also be used to determine whether, and why, performance will fail to scale up. In addition to these kernels, trace converters can translate AIMS traces into formats compatible with other performance tools (such as ParaGraph5 and Pablo2) as well as commercial visualization packages (such as Explorer and AVS).

In this paper, we show how AIMS was used to characterize, tune and predict the performance of a number of applications on the iPSC/860 and of two PVM applications on a cluster of Sun workstations. The next section describes first the overall architecture of AIMS and then its components. Subsequent sections describe the processes of source-code instrumentation and monitoring in parallel and distributed systems. The remaining sections describe how the individual kernels of AIMS are used for analyzing the performance of parallel and distributed applications. Our examples cover a wide range of scientific applications. We conclude with a summary of results and the future direction of our work.

OVERVIEW OF AIMS

The AIMS toolkit provides a framework to which different architecture-specific building blocks can be added. Portability is particularly important for researchers at NASA Ames Research Center because computing platforms are constantly evolving as new and faster machines become available. AIMS' modular design has resulted in its being ported to the iPSC/860, CM-5, Intel Paragon and workstation clusters (Sun SPARCs and SGIs). Even so, certain platform-specific monitoring concepts have to be incorporated. For example, inter-process interactions and background load are monitored on workstation clusters but not in the implementations for other platforms. AIMS' user interface provides a common look and feel across all these diverse platforms. This important feature eliminates the need for users to learn new techniques for tuning the performance of their programs on different platforms. System-level issues are hidden from the user, and efforts have been made to present performance information in terms of source-code entities. For example, views in VK have a 'click-back' feature that pinpoints the exact source-code location and data structure causing a performance problem. Similarly, IK and SK present performance figures in terms of procedures and source-level data-structure interactions. This design philosophy is carried over to MK, where the model structure mirrors the program structure, tracefiles from simulation are compatible with those from actual execution, and all of MK's views support direct click-back to source code. In addition to providing a wide variety of tools, AIMS also offers a robust platform for instrumenting large scientific applications with complex directory structure that may include a combination of C and FORTRAN modules.*

Figure 2 shows an overview of AIMS' approach to performance tuning. A number of important components enable the study of parallel-program performance:

(i) a source-code instrumentor (xinstrument)
(ii) a performance-monitoring library (the monitor)
(iii) an intrusion-compensation module
(iv) a set of trace post-processing kernels.

* The original source code is left unaffected. Instead, a copy of the instrumented source code is saved in a new directory tree having the same nesting structure as the original source. The level and detail of instrumentation can be controlled by the user. Commonly used choices are provided as defaults.

These components work together to instrument, measure and tune a program's performance. Xinstrument inserts instrumentation into the source code to record performance data from the program's execution. Instrumentation is essentially a source-to-source program transformation that uses the control-flow and data-flow information of the program. Xinstrument is written using the Sage compiler front-end library.6 The use of Sage enables us to perform simple program transformations to insert instrumentation calls at appropriate locations, and to perform flow analysis for tracking data structures.7 In addition, the use of such a compiler front-end allows us to leverage rapidly advancing compiler technology while focusing on important performance-tuning issues.

Figure 2. Performance tuning with AIMS involves four simple steps: 1. instrument source code, 2. link with monitor and execute, 3. apply intrusion compensation, 4. performance analysis via various kernels

The monitor consists of data-collection routines called by the instrumented code during program execution. The monitor is also responsible for flushing traced events to a tracefile and for generating a data-structure look-up table used by the post-processing tools to present performance information in terms of source-level data structures. The tracefile is compensated for intrusion,8 and subsequently used by VK, IK, SK and MK.

INSTRUMENTATION

Performance evaluation requires some form of instrumentation—a mechanism whereby performance data can be generated during program execution. Many such methodologies have been proposed;1 these include event sampling, hardware instrumentation, and software instrumentation. Although event sampling produces the smallest amount of data, it was not adopted in AIMS because many anomalous behaviors cannot be explained with sampled events. Besides, event sampling can also be intrusive if there is no hardware support. On the other hand, relying on hardware instrumentation renders the monitor non-portable. In order to support uniform performance evaluation in heterogeneous computing environments and across different platforms, software instrumentation provides the most pragmatic approach. Although intrusion may not be negligible for software instrumentation, we have shown that we can characterize and compensate for its effects and recover the original program behavior from tracefiles.4,8

There are two possible mechanisms for software instrumentation: (i) instrumented system software and (ii) instrumented source code. The use of instrumented versions of communication libraries and operating systems is quite appealing, since performance data can be obtained without modifying the application program. However, this requires vendor participation and does not provide an easy mechanism for the user to turn monitoring on/off for different portions of the application. Source-code instrumentation performs better on both counts: it does not require vendor participation and collects exactly what the user deems necessary. Furthermore, source-code instrumentation is highly portable, since the instrumented programs will run on any machine where the original program runs. However, the biggest drawback of source-code instrumentation is the effort required to insert the instrumentation calls. Insertion of all instrumentation calls should not be left to the programmer, as it can easily become very tedious and time-consuming. AIMS automatically identifies important regions of code and instruments them. It also provides default instrumentation strategies (such as placing instrumentation calls to monitor the beginning and end of procedures and communications) and, at the same time, allows users to selectively turn instrumentation on/off and to insert instrumentation calls that depend on the semantics of the application being executed.

Xinstrument: AIMS' source-code instrumentor

There are two ways to instrument programs using AIMS. One can use a batch command to specify the platform, language and files that need to be instrumented; this works very well for standard instrumentation of programs. For more complex instrumentation operations, the user can use a graphical user interface to selectively disable or enable instrumentation.

The graphical user interface, called xinstrument, is shown in Figure 3. The dialog window on the left allows users to load files for instrumentation. The window in the center shows the construct tree, with instrumented constructs highlighted. The program segment associated with a particular node can be displayed in a pop-up source-program window (shown on the right).

Figure 3. The xinstrument interface: users may select specific instruments to be turned on/off via simple mouse clicks. Both FORTRAN and C modules are accepted. With the Intel iPSC/860, the user needs to specify whether the module to be loaded executes on a node or on the host (i.e. the front end). The desired message-passing library also has to be specified (in this case PVM)

One may instrument all loaded files using a default strategy. For custom control of instrumentation, one can browse through the source code using a source-tree abstraction. This source-tree abstraction allows entire files to be viewed in one window so the user can quickly select or deselect instrumentable constructs in a program; individual leaves of the tree can be expanded to reveal the corresponding points in the source code. To instrument a construct, the user simply clicks on the corresponding leaf of the source-tree display.9 A help window is also available to provide information about the available options. Currently, AIMS supports parallel programs written in FORTRAN and C using three possible message-passing paradigms: Intel's NX, TMC's CMMD, and PVM.

Besides inserting instrumentation at appropriate locations in the program code, xinstrument also generates two key data structures—an application database (or APPL DB) and an instrument-enabling profile (profile) (see Figure 2). The APPL DB stores static information about the application's source code (e.g. file names, unique construct identifiers, and line numbers of instrumented constructs). An application's APPL DB is inserted into each tracefile (created when its instrumented version executes). This enables AIMS' analysis tools to relate traced events back to the instrumented syntactic constructs and data structures in the source code. The profile contains a table of flags that selectively trigger the instrumentation inserted in the program. The profile can also be edited and saved using xinstrument. Since the monitor reads the profile at the beginning of execution, the user need not recompile the application in order to obtain performance data for different parts of the program.

Customized instrumentation strategies supporting data-movement tracking and automated performance modeling

In addition to the standard instrumentation options, special operations are performed for data-structure instrumentation and for the automatic generation of models. Data-structure monitoring is useful in the context of data-parallel languages and in tracking data-structure movements in explicit message-passing programs. Automatic generation of models is useful in the context of studying the scalability of parallel programs. Detailed discussions of these issues can be found in References 11 and 17, respectively.

Our approach to tracking data-structure movements exploits methodologies and tools already available for tracking messages in distributed-memory programs. Our model of interprocessor communication is as follows. Communication is required when a portion of an array (the source array) on one node is needed for a computation (by a destination array) on another node. In order to determine the source and destination arrays, flow analysis is performed* with support from the Sage compiler front-end. The source arrays and destination arrays are identified and recorded into the APPL DB.

* The portion of the source array to be sent may have to be copied into a temporary array (a send buffer) to fit it into contiguous memory before sending. The immediate destination for the data (the received array) on the receiving node may only serve as a buffer, to be unpacked (into the destination array) for use in actual computation.
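As a concrete illustration of the source-array/send-buffer/destination-array path described in the footnote, here is a minimal PVM sketch; the array names and sizes are invented for this example, and AIMS merely observes such calls rather than requiring this coding style:

```c
#include <pvm3.h>

#define N 128
double a[N][N];             /* source array, distributed by rows          */
double x[N];                /* destination array on the receiving node    */

/* Sender: a non-contiguous column of the source array is copied into a
 * temporary packing buffer (the "send buffer" of the footnote) and then
 * packed into PVM's active message buffer. */
void send_column(int col, int dest_tid)
{
    double tmp[N];
    int i;
    for (i = 0; i < N; i++)
        tmp[i] = a[i][col];              /* gather the strided column     */
    pvm_initsend(PvmDataDefault);        /* initialize the active buffer  */
    pvm_pkdouble(tmp, N, 1);             /* pack contiguous data          */
    pvm_send(dest_tid, 10);              /* message tag chosen arbitrarily*/
}

/* Receiver: the received message is only a staging area; it is unpacked
 * into the destination array actually used in the computation. */
void recv_column(int src_tid)
{
    pvm_recv(src_tid, 10);
    pvm_upkdouble(x, N, 1);              /* unpack into destination array */
}
```

Flow analysis identifies a as the source array and x as the destination array even though the message itself only carries the temporary buffer; this is the binding recorded in the APPL DB.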
In order to predict large-scale performance, the modeling kernel requires the source code to be instrumented in such a way that events are generated for all communication and synchronization calls and each time the control thread crosses a syntactic boundary surrounding a communication or synchronization call.11 Traces thus obtained contain sufficient information for modeling the complexities of those application components that contain communications and are, therefore, likely targets for performance tuning. Xinstrument supports an MK instrumentation strategy that detects these conditions and inserts the appropriate instrumentation calls.

MONITOR

After the source code has been instrumented, it must be compiled and linked with a performance-monitoring library (the monitor). The monitor defines the set of monitoring (or event-recording) routines that xinstrument inserts into the application; these generate the tracefiles used by AIMS' analysis tools. Figure 4 illustrates the basic function of these routines during program execution: they write event records into a buffer at each processing node. Either at the end of execution or whenever the buffer fills up, the buffer is written (or flushed) to the system's disks. Generated event records include:

(a) state entrance/exit—subroutines, loops and user-defined code segments
(b) communication—sending/receiving of messages
(c) I/O—file-system read/write times
(d) barriers—global reduction operations, barrier synchronization, and event waits
(e) data-structure definitions—a look-up table for data-structure address bindings, and
(f) markers—user-specified events.

Each event includes a time-stamp, a processor ID, an event type and some event-specific information. The mechanisms for obtaining time-stamps, the strategies for writing out to tracefiles, and the resolution and synchrony of system clocks distinguish monitoring on parallel machines from monitoring on distributed machines.

Figure 4. Data collection in AIMS: during program execution, inserted instrumentation writes event records into a buffer at each node. Either at the end of execution or whenever the buffer fills up, the buffer is written (or flushed) to the system’s disks
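The paper does not specify the monitor's record layout; the sketch below shows one plausible arrangement consistent with the description above (the field names, sizes and routine name are assumptions, not AIMS' actual internals):

```c
#include <stdio.h>

/* One trace event: time-stamp, processor ID, event type, and
 * event-specific information (e.g. construct ID, message length). */
typedef struct {
    double time_stamp;      /* from the node's clock                   */
    int    node;            /* processor ID                            */
    int    type;            /* state entry/exit, send, recv, ...       */
    int    info[2];         /* event-specific information              */
} Event;

#define BUF_EVENTS 4096     /* fixed, user-selectable buffer size      */
static Event buf[BUF_EVENTS];
static int   n_events = 0;
static FILE *tracefile;     /* opened during monitor initialization    */

/* Called by the inserted instrumentation; flushes when the buffer is
 * full.  The cost of the flush itself is what intrusion compensation
 * later removes from the time-stamps. */
void log_event(double t, int node, int type, int i0, int i1)
{
    Event e = { t, node, type, { i0, i1 } };
    buf[n_events++] = e;
    if (n_events == BUF_EVENTS) {
        fwrite(buf, sizeof(Event), (size_t)n_events, tracefile);
        n_events = 0;
    }
}
```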


Monitoring on parallel machines

Figure 4 represents a typical scenario for monitoring parallel applications on tightly-coupled multiprocessors. Upon initialization, the monitor allocates a fixed-size (user-selectable) buffer for storing events generated at each node of the parallel machine. When the buffer fills up, each processor either asynchronously flushes it to a tracefile on a shared disk (as is the case on the iPSC/860 and the Paragon) or sends it to a front-end that handles all trace I/O (as is the case with AIMS on the CM-5). The smaller the event buffer, the more often it flushes; and the more often it flushes, the larger the intrusion introduced during monitoring. Even when trace events are of minimal size, flushing may be inevitable for long executions. Hence, the intrusion compensator (see below) forms an integral part of AIMS; it ensures that event time-stamps present a realistic picture of program execution.

Monitoring PVM programs on distributed machines: issues and approach

PVM application programs can be thought of as communicating sequential tasks. More than one task may reside on the same processor. Task-initiation primitives (pvm_spawn and pvm_mytid) and task-termination primitives (pvm_kill and pvm_exit) need to be monitored. In addition to the PVM analogs of the traditional communication calls (pvm_send, pvm_mcast, pvm_recv and pvm_nrecv), one might need to monitor the calls used to initialize the sender's active buffer (pvm_initsend) and the calls used to pack data into it (the various pvm_pk⟨type⟩ calls). Since PVM supports dynamic process groups, global operations (such as pvm_bcast and pvm_barrier) and calls that affect group membership (pvm_joingroup and pvm_lvgroup) need to be monitored as well.

Workstation clocks can drift apart at significant rates (tens of milliseconds per second). Consequently, when the monitor starts up, there may be a significant mutual skew between clocks at different sites. Moreover, even if one were to compensate for the initial skew, clock drift could cause significant skew to build up during execution. Since clock skews distort one's view of a program's computation and communication patterns, and may even result in an apparent loss of causality between the sender and receiver of a message, it is important to monitor both skew and drift and to (eventually) use this information to correct the event time-stamps recorded by the monitor. Two factors further complicate skew compensation: (i) the lack of a global point of reference against which to measure initial skews; and (ii) the possibility of accruing significant drift during periods of no communication.

Numerical computations programmed as communicating sequential tasks12 are also affected by loads on the CPU and the network. The processes of a parallel application may be delayed waiting for resources held by competing processes, because workstation-based computing environments are characterized by significant background workload: the load generated by processes belonging to other users and/or other applications.13 Parallel applications in workstation environments may also be affected by configurational heterogeneity, whereby different sites may have different amounts of raw computing power. It is the monitor's responsibility to calibrate and measure resource availability14 so that dynamic differences in performance across sites can be explained (and possibly exploited).
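One portable way to monitor the PVM calls listed above is for the instrumentor to substitute thin wrappers for them; the following is a hypothetical sketch (AIMS' actual monitor routine names are not given in the paper):

```c
#include <pvm3.h>

/* Hypothetical event codes and logging routine (see the buffer sketch
 * above); wall_clock() stands for the platform's time-stamp source. */
enum { EV_SEND_BEGIN = 1, EV_SEND_END };
void   log_event(double t, int node, int type, int i0, int i1);
double wall_clock(void);

static int my_node;         /* initialized once from pvm_mytid()       */

/* Wrapper substituted for pvm_send by the instrumentor: records a
 * time-stamped event on either side of the real call. */
int mon_pvm_send(int tid, int msgtag)
{
    int rc;
    log_event(wall_clock(), my_node, EV_SEND_BEGIN, tid, msgtag);
    rc = pvm_send(tid, msgtag);
    log_event(wall_clock(), my_node, EV_SEND_END, tid, rc);
    return rc;
}
```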


Future releases of AIMS will incorporate more comprehensive techniques for skew and drift compensation* as well as tracing of background workload and calibration of resource availability. At present, we use the standard monitor, the xload utility for monitoring load, and AIMS' intrusion compensation for preserving the causality of communications. The next subsection considers intrusion compensation in more detail.

* Clock skew due to the lack of a global clock can be attacked by explicitly communicating time-stamps across all processors used by an application soon after all the tasks start up. The drift problem can be resolved by explicitly forcing each processor to communicate with other processors for the purpose of tracking variations in clock skew. CPU workload can be measured by the number of processes waiting for the CPU at any given time, and raw CPU capacity can be calibrated by running a representative computation loop under no load. Network workload may be measured in terms of the number of packets communicated per second, and raw network capacity can be calibrated by the round-trip times for messages of different lengths under no load.

Intrusion compensation

Tracefiles generated by the monitors at different nodes are concatenated, sorted by time and then passed through the intrusion compensator. The intrusion compensator removes the overhead associated with calls to the monitor library as well as the time spent flushing events to the tracefile. The result of intrusion compensation is an updated tracefile with time-stamps that more closely match the execution characteristics of the uninstrumented program. This tracefile can then be used by AIMS' post-processing kernels to study program behavior.

Our intrusion compensator uses the execution semantics of communication operations to reduce or eliminate intrusions introduced during measurement. It attempts to satisfy the following semantic truths:

(a) causality of communication (messages should not be received before they have been sent)
(b) conservation of partial order for events along each thread of control (e.g., a communication cannot be initiated before the completion of a previous communication upon which it depends)
(c) deadlocks are not introduced into a deadlock-free program.

Causality can be preserved by enforcing ordering and timing relationships between send and receive operations. Partial order is preserved within each processor by guaranteeing that the logical order of any two events is not changed. By satisfying the causality and partial-order constraints during intrusion compensation, it follows that no deadlocks will be present in the compensated execution trace if none occurred during the monitored execution of the program.

Our algorithm for intrusion compensation attempts to resolve the timing of instrumented events on each processor in sequence. The intrusion compensator uses a simple model to predict message-transmission time. The time-stamp of an event is adjusted based on the intrusions introduced by events occurring prior to it and on the compensated timing of events upon which it depends. Since the intrusion compensator uses a single program execution, it cannot accurately remove intrusions from an inherently non-deterministic program. For a wide range of applications, compensated execution times lie within ±6 per cent of the uninstrumented execution times.4,8 In the rest of the paper, we discuss how compensated traces can be used to study the performance of various parallel scientific applications.
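In outline, the compensation pass can be thought of as a single sweep over the merged, time-sorted trace, as in the following simplified sketch; the real algorithm of References 4 and 8 is more elaborate (per-architecture transmission models, buffered flushes), and this is not AIMS' actual code:

```c
/* Simplified intrusion compensation.  Events are processed in time
 * order; removed[p] accumulates the monitoring overhead already
 * subtracted on processor p, which preserves the partial order of
 * events within each processor. */
#define MAX_NODES 1024

static double removed[MAX_NODES];    /* intrusion removed so far       */

/* Shift an event earlier by the overhead accumulated on its node.    */
double compensate(double t, int p, double overhead_before_event)
{
    removed[p] += overhead_before_event;   /* e.g. logging, flushing  */
    return t - removed[p];
}

/* Causality: a receive may not complete before the compensated send
 * time plus the modeled transmission time for this message size.
 * Because events are never reordered, a deadlock-free trace stays
 * deadlock-free after compensation. */
double fix_receive(double recv_t, double send_t, double modeled_latency)
{
    double earliest = send_t + modeled_latency;
    return (recv_t < earliest) ? earliest : recv_t;
}
```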


VK: VIEW KERNEL

VK displays a program's execution using animated views. These views present information indicating when specific program constructs were executed, when messages were sent, how long messages were queued up before being processed, and when certain processors were idle. Some views scroll as time passes, showing a segment of the program's history, whereas others animate each state in sequence (updating the previous state). Some of VK's views are shown in Figure 5. A full description of these views can be found in Reference 3. Each horizontal bar in OverVIEW, for example, shows the progress of execution on each node, using colors to represent the different procedures of the application program; lines connecting these bars represent messages. The color of a message line indicates the data structures communicated in that message. The tracefile can be viewed step-by-step or at high speed. VK can be instructed to pause (or start) at any instrumented construct (such as a particular message send/receive or when a certain subroutine is invoked) or after a certain time (say, 3·54 ms) into the execution.

VK also provides a source-code click-back capability, allowing tracefile events pictured on the display to be mapped back to the application code. For example, clicking on a procedure bar in OverVIEW reveals the name of the subroutine active at that time/node point. Clicking on a message line reveals information about the associated send and receive statements. As shown in Figure 5, this information may be displayed either in a text window containing the corresponding code (wherein the exact source line is pointed to by a marker) or as a construct-tree view showing the relationship between an observed event and an instrumented point in the source code. In addition to the location and type of a message, the names of the data structures involved in the communication are also displayed.

Figure 5. Source-code click-back: by clicking on a message line in OverVIEW, users can obtain more information about the selected message; this information may be displayed as (i) a text window with the corresponding source code (where the exact line is pointed to by a marker) or (ii) a construct-tree view showing the relationship between the observed event and an instrumented point in the source code

Figure 6(a) shows the OverVIEW for a parallel integer-sort program.15 In the initial run, sorting required 841 ms. Note that white areas represent idle times. Using AIMS' click-back feature, the programmer identified those regions of the source code where the program spent its time idling. He simply replaced a loop of global operations with one single global operation (done on a vector) and reduced the sorting time to 184 ms (Figure 6(b)).

Figure 6(a). The original integer-sort program
Figure 6(b). The integer-sort program after its modification
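The paper does not show the sort's source, but the shape of that rewrite can be sketched with NX-style global reductions (gisum is NX's global integer sum; the bucket-count arrays and sizes here are invented):

```c
/* Before: one global reduction per bucket.  Every gisum() is a
 * separate synchronizing global operation, so all nodes wait
 * NBUCKETS times. */
#define NBUCKETS 1024
long counts[NBUCKETS], work[NBUCKETS];

void counts_slow(void)
{
    long i;
    for (i = 0; i < NBUCKETS; i++)
        gisum(&counts[i], 1L, &work[i]);   /* NBUCKETS global ops */
}

/* After: one global reduction over the whole vector, the kind of
 * rewrite that cut the sort from 841 ms to 184 ms. */
void counts_fast(void)
{
    gisum(counts, (long)NBUCKETS, work);   /* a single global op  */
}
```

The saving comes from paying the global operation's start-up and synchronization cost once instead of NBUCKETS times.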
VK can also be used for analyzing the behavior of PVM programs. Figure 7 shows an OverVIEW of a PVM implementation of the MG benchmark. This code, due to White et al.,12 assigns (N/P + 1) rows of an N × N grid to each of the P processors. The resulting communication pattern is cyclic, as shown by VK's SpokesVIEW in the right portion of the Figure. The middle portion of the Figure shows the average background workload on each of the sites of the distributed system. This particular execution involved four PVM tasks, one on each of four workstations connected by a LAN. Task 2 ran on the fileserver under conditions of heavy background workload. Thus, even though the work was divided equally among all processors, task 2 took longer to finish its share of the work. The states shown in SpokesVIEW explain this phenomenon of the slowest processor dominating the execution time: tasks 0 and 3, having finished their portion of the work for this iteration, are waiting for messages from task 2; task 1, having done its share and then some more, is waiting for messages from tasks 0 and 3.

Figure 7. Analyzing the PVM version of MG using VK

VK can also explain the impact of multiprogrammed distributed systems—characterized by long message latencies and highly variable loading conditions—on the execution of PVM-based parallel programs. CG, a NAS parallel benchmark kernel, uses the conjugate-gradient method for the iterative solution of linear systems. In this implementation,12 each CG iteration involves one matrix–vector product and two dot products. Given a p × p mesh of processes, it partitions the matrix (block, block) among these processes. The vector is partitioned among the columns of the mesh and then replicated among the row processes. Each CG iteration involves the following steps (a sketch of the corresponding task skeleton follows the list):

1. getpfull: processes along the diagonal of the p × p mesh broadcast their portion of the vector to all other processes in the same column of the mesh.
2. matvec: each process computes its share of the matrix–vector product locally.
3. rowsums: the resultant vectors are summed along each row of the p × p mesh by a gather-and-scatter implementation of global summation.
4. dotpro: the vectors are partitioned along the rows of the p × p mesh and replicated along its columns. Each process computes its share of the dot product. The global summation needed to compute the final result is achieved via a gather-and-scatter operation done by the first process.
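The following PVM skeleton suggests how such an iteration might be organized; the group name and the local computation routines are placeholders, and this is not White et al.'s actual code:

```c
#include <pvm3.h>

/* Placeholders for the benchmark's local computations. */
void local_matvec(void);
void local_dotprod(void);
int  on_mesh_diagonal(void);

void cg_iteration(void)
{
    /* 1. getpfull: diagonal processes broadcast their vector portion
     *    to the other processes in the same mesh column.  "col" is a
     *    hypothetical PVM group holding this process's mesh column. */
    if (on_mesh_diagonal()) {
        pvm_initsend(PvmDataDefault);
        /* ... pack this process's vector portion ... */
        pvm_bcast("col", 20);
    } else {
        pvm_recv(-1, 20);
        /* ... unpack the vector portion ... */
    }

    /* 2. matvec: purely local computation; the only phase that long
     *    message latencies leave untouched (cf. Figure 8(a)). */
    local_matvec();

    /* 3. rowsums: gather-and-scatter summation along each mesh row:
     *    partial vectors are pvm_send'ed to a row head, which sums
     *    them and sends the result back to the row. */

    /* 4. dotpro: local dot-product shares, then a gather-and-scatter
     *    summation performed by the first process. */
    local_dotprod();
}
```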


Figure 8(a) shows the impact of long message latencies on the different phases of CG. This particular execution involved four PVM tasks running on four workstations. Other than matvec, which involves local computation only, all phases are adversely affected by the large latencies. This is evident from the amount of idle time (white space) on each of the timelines. Figure 8(b) shows the impact of load imbalance on the execution time of a CG iteration later during the same run. A sudden surge in load on the machine on which task 0 was executing caused all other processes to idle, waiting for task 0 to finish its matvec phase and begin participating in rowsums.

Figure 8. (a) Impact of long message latencies on the execution of the CG benchmark (PVM version); (b) impact of load imbalance on an iteration of CG; (c) effect of multiprogramming on PVM tasks sharing a CPU

One might consider tuning the performance of this application by executing the four tasks on two processors (instead of four), with the tasks in the same column assigned to the same processor. Figure 8(c) shows that matvec then takes almost twice as long to finish, because both of the PVM tasks sharing a CPU are slowed down by contention. Any gains in communication time due to the (supposedly) low latency of local communication are offset by the multiprogramming delays suffered by matvec.

SK: STATISTICS KERNEL

Although graphical representations of program measurements expose the nature of program execution, they fail to scale to large problem sizes or large numbers of processors. SK (the statistics kernel) generates a list of resource-utilization statistics on a node-by-node and routine-by-routine basis. These statistics can help point out inefficient sections of code, which can then be more closely examined with VK or IK. Other tools, such as IPS-216 and TMC's Prism, have concentrated on presenting performance statistics in the flavor of those presented by SK. These tools do not pinpoint possible causes of poor performance; they can only be used as indicators of locations that exhibit poor performance, and the burden of inferring the cause is left largely to the user.

SK's performance data is presented in the form of a table that can be easily plotted using standard spreadsheet packages. Aggregate data such as total execution time, communication blocking time, global-operation (such as a barrier or a reduction) blocking time and busy time are presented in terms of subroutines and processors. In addition, performance metrics such as the ratio of communication time to lifetime in each subroutine, and the ratio of communication time in a subroutine to the total lifetime of the program, can be used to sort the order in which subroutines should be considered for tuning.

We illustrate the operation of SK with ARC2D, a two-dimensional CFD solver written by S. K. Weeratunga at the Numerical Aerodynamic Simulation Systems Division, NASA Ames Research Center. As shown in Figure 9, SK carries out statistical analysis on the tracefile and classifies elapsed time into four categories (a classification sketch follows the list):

1. Busy time: CPU time spent executing the user program.
2. Send-blocked time: time spent waiting for 'synchronous sends' to complete.
3. Receive-blocked time: time spent waiting for the arrival of a message.
4. Global-blocked time: time spent waiting for the completion of a barrier operation.
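SK's post-processing pass can be sketched as follows; the event codes and record layout are the hypothetical ones used in the monitor sketch earlier, since SK's real internals are not shown in the paper:

```c
/* Walk the begin/end event pairs of the compensated trace in time
 * order, charging each interval to one of SK's four categories,
 * per node and per procedure. */
enum { BUSY, SEND_BLOCKED, RECV_BLOCKED, GLOBAL_BLOCKED, NCAT };

#define MAX_NODES 64
#define MAX_PROCS 128           /* procedures, not processors */

static double elapsed[MAX_NODES][MAX_PROCS][NCAT];

void account(int node, int proc, int cat, double t_begin, double t_end)
{
    elapsed[node][proc][cat] += t_end - t_begin;
}

/* For example: the interval between a receive's begin and end events
 * is charged to RECV_BLOCKED; a barrier's, to GLOBAL_BLOCKED; time
 * between communication events is charged to BUSY for the procedure
 * currently on top of the node's call stack. */
```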

Figure 9. (a) Examples of some execution statistics distributed across nodes; (b) examples of some execution statistics distributed across procedures

Figure 9(a) shows that most of the time is spent computing at each node. On inspecting the same data distributed across procedures (as shown in Figure 9(b)), we quickly identify the procedures ypledge, edge_news and edge_ns as communication-intensive. Although SK can help in detecting where the CPU is idling (or busy), it does not explain the cause. To determine why the program exhibits poor performance, IK or VK has to be used. In the next section, we show how metrics (or indices) of performance can provide a scalable method of detecting and explaining performance problems.

IK: INDEX KERNEL

IK attempts to explain performance failures by isolating possible causes and expressing the results in terms of the application program's data structures and procedures. It differentiates between locator and characterizer indices. Locators identify a region of code or a data structure exhibiting poor performance, while characterizers explain the reason for poor performance in the located regions.10 Since the aim of any performance-tuning process is to minimize the total execution time (or lifetime) of the program, each index in IK highlights the contribution of a particular performance problem to the lifetime of the program. Each index represents the percentage reduction in lifetime that could be achieved if the problem were completely eliminated. By comparing these indices, one can determine the most significant performance problems and consider ways of eliminating them from the program. Such issues are considered in detail in Reference 10; here we simply mention the causes of performance bottlenecks and consider their use in studying the performance of two programs.

Some of the common causes of poor performance include load imbalance (LI), communication-link contention (LC), communication overhead (CO), and poor link utilization (LU). Normalized indices for each of these factors are defined, with the goal of identifying the causes of poor performance in a particular data-structure interaction. All of IK's indices are presented with respect to individual data structures, subroutines, data-structure pairs, and data-structure pairs in individual subroutines (or user-defined blocks of code). IK generates these indices automatically from trace data. The range of values for all of these indices is between 0 and 1, with 0 indicating good and 1 indicating bad performance. Having a normalized range for performance indices helps in quickly tracking down the significant factors contributing to poor performance. Owing to space limitations, we do not discuss how the index values are extracted from tracefiles (refer to Reference 10). However, we show how these statistics can help in analyzing program performance.

MG: the multigrid benchmark

The multigrid method offers an approach to increasing the robustness of iterative finite-element methods. Meshes of increasingly coarser refinement are used; the solution obtained on a coarse grid at modest computational cost is projected back onto the next finer grid as input to the next approximation to the final solution. The projection and interpolation of solutions between different mesh refinements involve concurrent matrix–vector operations similar to those of the conjugate-gradient method.

The standard multigrid method processes each grid level one at a time and in parallel. The results of a grid level are projected and interpolated from one grid to another using specific operators. For good performance, the number of levels should take into consideration the granularity of work assigned to each processor. In the truncated method, the number of levels is restricted so that each processor has at least the minimum grain size. The overheads at coarser levels do not significantly affect the performance of the program, mainly because the amount of work performed at coarser levels is much less than that performed at finer levels. For small mesh sizes, however, computation at coarse levels could become a bottleneck.

Table I compares the execution of the multigrid application on mesh sizes of 32 × 32 × 32, 64 × 64 × 64 and 128 × 128 × 128, executed on 16 processors. For the 32 × 32 × 32 problem, load imbalance is a significant problem: the load-imbalance index LI = 0·23, indicating a possible reduction of 23 per cent in execution time if the load-imbalance problem were completely eliminated.

Table I. Comparison of indices with varying problem size (executed with 16 processors). Communication block-times are significant for smaller grid sizes

                          Performance indices
  N                       LI        LC        CO
  32 × 32 × 32            0·23      0·01      0·02
  64 × 64 × 64            0·04      0·00      0·00
  128 × 128 × 128         0·035     0·00      0·00
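The paper defers the exact index definitions to Reference 10. Purely as an illustration of the 'percentage reduction in lifetime' reading of Table I, one plausible form of the load-imbalance index is:

```latex
% Hypothetical form only -- Reference 10 gives the real definitions.
% B_{j,p}: busy time of processor p in phase j; T: program lifetime.
% LI is the fraction of T saved if every phase ran in its average,
% rather than its maximum, per-processor busy time:
\[
  \mathrm{LI} \;=\; \frac{1}{T}\sum_{j}
     \Bigl( \max_{p} B_{j,p} \;-\; \frac{1}{P}\sum_{p} B_{j,p} \Bigr)
\]
% For the 32 x 32 x 32 run, LI = 0.23: a 23 per cent potential saving.
```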


Further, while communication overhead contributes about 2 per cent to the lifetime of the program for small problem sizes, it has no effect as the problem size is increased.

In order to pinpoint the nature and location of the load imbalance, the tracefile is animated using VK. Figure 10 displays an OverVIEW of MG about 15·8 s into its execution. It is evident that while processor 11 is busy computing (in the lowest level of grid refinement), the other processors are blocked on a receive (indicated by the white spaces)—leading to poor processor utilization.

Figure 10. MG's load-imbalance problem pinpointed using VK—AIMS' trace-animation tool

In this example, the performance indices automatically exposed the existence of a significant load-imbalance problem. Because we knew what we were looking for before the animation began, the load-imbalance problem was pinpointed rapidly. For larger numbers of processors, graphical views become more cluttered, making the detection of performance problems and the isolation of their causes even harder. In such cases, the use of performance indices becomes even more appealing. Unlike graphical views, which only indicate the presence of a performance problem, performance indices prioritize the significance of the various problems based on the percentage savings in lifetime that could be achieved if certain performance problems were resolved.

Xtrid: an illustrative application program

In this section we show, with the help of a tridiagonal-solver example, how performance indices expressed in terms of data structures can help the user locate performance problems. The solution of scalar and block tri- and penta-diagonal linear systems is a recurring step in the numerical solution of PDEs using implicit methods.15 The particular benchmark considered here, called Xtrid, is a FORTRAN program for the iPSC/860 Hypercube; it supports 14 different methods for the simultaneous solution of N scalar tridiagonal systems, each containing N equations in N unknowns. The program contains approximately 5000 lines of code and was written by S. K. Weeratunga at NASA Ames.

Figure 11 graphically illustrates the problem.

Figure 11. Running example (Xtrid) and the initial data distribution of matrix A for 4 processors. Matrices B, C, and F are similarly distributed

Since each matrix of coefficients (shown leftmost in the Figure) contains non-zero entries only along the three diagonals, the complete data for the Ith system can be captured in four vectors (AI, BI, CI and FI). The right side of the Figure shows the initial distribution of the four matrices (A, B, C, and F) formed by juxtaposing their corresponding vectors. One parallel solution method requires the four matrices to be transposed in a first stage so that data for several complete systems of equations become local to each processor. With P processors, each processor then has complete information about N/P systems. In the second stage, these N/P systems are solved locally on each processor using Gaussian elimination. Finally, the matrix of solution vectors, XK, is transposed so that the solution vectors are distributed in the same way as the coefficients were distributed in the beginning. The first and third stages are communication-intensive because of the complete exchange required by the transpose operations. The particular array-transposition algorithm implemented here requires log2(P) communication steps, each involving bidirectional nearest-neighbor communication along a different dimension of the Hypercube. Each message in the transpose operation involves N²/2P values.
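A sketch of this exchange pattern in C, using Intel NX's csend/crecv (packing and merging are elided; the sketch assumes sufficient system buffering for the blocking sends, whereas Xtrid itself uses the forced-message protocol discussed in the footnote below):

```c
/* Hypercube transpose sketch: log2(P) stages of pairwise exchange.
 * csend/crecv are Intel NX calls (prototypes come from the NX
 * header).  At stage d, node 'me' exchanges N*N/(2P) values with
 * its neighbor me XOR 2^d. */
void transpose_exchange(double *block, double *tmp, long nvals,
                        long me, int logp)
{
    int d;
    for (d = 0; d < logp; d++) {
        long partner = me ^ (1L << d);    /* neighbor along dim d */
        /* ... pack the half of 'block' destined for 'partner' ... */
        csend(100 + d, (char *)block, nvals * (long)sizeof(double),
              partner, 0);
        crecv(100 + d, (char *)tmp, nvals * (long)sizeof(double));
        /* ... merge the received values into 'block' ... */
    }
}
```

Note that each node is busy on exactly one of its links at any stage, which is what the link-utilization discussion below quantifies.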

Figure 12 shows a summary view of all of the indices for each array pair in the tridiagonal-solver program. Each axis corresponds to one performance index; six different indices are shown. A line is drawn connecting the corresponding values on each axis for each array-pair interaction. If the performance indices for two data-structure interactions are the same, their values are plotted on top of each other.

Figure 12. Comparison of various performance indices for array interactions. (A) Performance indices for arrays xtrans.tmp and xrtrans.tmp. (B) Performance indices for arrays A, B, C and F

Figure 12(A) shows the indices for two buffers (both called tmp but local to two different functions). From the Figure, we can gather that the most significant performance problem with these buffer movements is associated with communication overhead.* These temporary buffers are used to transmit zero-byte messages so that the transpose can be carried out using forced messages. The start-up time is (of course) significant with respect to the communication time of these messages (of zero length!). However, since the communication index involving these data structures is very small, we know that reducing the time to communicate these two buffers will have little or no significant impact on the lifetime of the program.

* The value of communication overhead in Figure 12 is expressed as a fraction of the communication time between the two data structures and not in terms of the entire program's execution time. Forced messages are messages which exhibit better communication characteristics on the iPSC/860 by eliminating the need for lower-level handshaking. Zero-byte messages are used to set up the links and buffers for the forced message containing the actual data to be communicated.

Figure 12(B), on the other hand, shows that the other arrays (A, B, C and F) do not have a communication-overhead problem, because their message sizes are large (in this case, each message is approximately 16 Kbytes). The Figure also indicates that the only performance issue that could potentially be improved is communication-link utilization. As shown in the Figure, the link-utilization index is around 0·67. This implies that, on the average, only about a third of the links are used during communication phases. Hence, in order to improve the performance of this program, we need to reorganize the code to use the links more effectively.

Consider the algorithm for transpose to understand why only a third of the links are used, on the average, during communication. There are log2(P) stages (P = 8) in a transpose. The processors communicate by exchanging data along each dimension of the cube, one dimension at a time. Thus, each processor uses only one of its bidirectional links at any given time. This implies that for a P-processor Hypercube, only P/2 bidirectional links are used during each stage of the transpose. In other words, [P × log2(P) − P]/2 links are not in use. For a cube with eight processors, the number of links not in use during each stage is therefore twice the number in use. This agrees with the link-utilization index, for each data structure, presented in the data-structure statistics.
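The 0·67 value follows directly from these link counts:

```latex
\[
  \underbrace{\tfrac{1}{2}P\log_2 P}_{\text{links in a hypercube}}
  \qquad
  \underbrace{\tfrac{1}{2}P}_{\text{links used per stage}}
  \qquad
  \mathrm{LU} \;=\; 1-\frac{P/2}{\tfrac{1}{2}P\log_2 P}
              \;=\; 1-\frac{1}{\log_2 P}
              \;=\; \tfrac{2}{3}\approx 0.67 \quad (P=8)
\]
```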

Figure 13. A time-line diagram of the tri-diagonal solver: data-structure information is indicated by the color of the communication lines * The value of communication overhead in Figure 11 is expressed as a fraction of the communication time between the two data structures and not in terms of the entire program’s execution time. Forced messages are messages which exhibit better communication characteristics on the iPSC/860, by eliminating the need for lower-level handshaking. Zero-byte messages are used to set up links and buffers for the forced message containing the actual data to be communicated.

performance measurement, visualization and modeling

451

have the same color. For example, in Figure 13, the first 75 per cent of the computation is dominated by communication. During this time, four different arrays are transposed (each in log2(P) = 3 stages); this is reflected in the OverVIEW by four communication phases, each represented by messages of a different color. Displaying data-movement information in views such as this enhances the user’s understanding of program execution and performance. MK: THE MODELING KERNEL A key performance-related issue in parallel programming is that of scalability, i.e. predicting how the performance characteristics of a parallel program will be affected by increasing the size of the problem it solves (hereafter, N) and the number of processors it uses (hereafter, P). Whereas instrumentation, monitoring, and visualization concern a single performance point in a program’s N–P space, scalability concerns the asymptotic behavior of a program’s performance for all large values of P and N. Scalability analyses help developers by identifying those values of N and P where performance saturates. Scalability tools are only as accurate as the models they use. However, interactive usability and response time of a scalability tool decrease as its accuracy of the modeling increases. At one end of this spectrum are simulation tools that can be used to simulate program execution for every N–P pair and at the other are tools that express scalability as functions of N and P. The latter technique is more difficult to implement, but can provide rapid first-order estimates of program performance17 over a wide range of N and P. We present two ways to analyze scalability using automatically generated models of parallel programs based on both static program-syntax information and dynamic program-execution information. The first subsection describes compiler-assisted complexity analysis; the next, regression-based modeling and simulation-based performance prediction. The final subsection compares and contrasts the two methods. Compiler-assisted complexity analysis Although traditional scalability metrics18 express the program performance as functions of N and P, they have been targeted to specific applications, and there are no tools to automatically and rapidly obtain simple first-order scalability trends for message-passing parallel programs. A tool that fits well with this need is an executiondriven scalability analyzer. Such a scalability analyzer has the following features: 1. It provides a simple first-order model of program scalability, based on static and dynamic information of the program execution. 2. It expresses scalability as a function of N and P. Hence, speed-up curves for a wide range of N and P can be generated by simply plotting these functions. Only a few existing tools can perform scalability analysis, primarily because of the complexity of the tasks to be performed. ATexpert,19 developed for the Cray Y/MP, predicts speed-ups obtainable in various regions of a program as the number of processors is increased. However, it does not deal with the issue of problem size and SPMD (single-program-multiple-data) programs. In Reference 20, a methodology for incorporating scalability studies into SIEVE and the approach for answering what-if? questions is presented. However, neither of these tools automatically pro-


Only a few existing tools can perform scalability analysis, primarily because of the complexity of the tasks involved. ATexpert,19 developed for the Cray Y/MP, predicts the speed-ups obtainable in various regions of a program as the number of processors is increased; however, it does not deal with the issue of problem size or with SPMD (single-program-multiple-data) programs. In Reference 20, a methodology for incorporating scalability studies into SIEVE, and an approach for answering what-if? questions, is presented. However, neither of these tools automatically produces program models, and neither produces the detailed tracefiles produced by MK's simulations.

Inputs from various components are needed in order to automatically generate scalability trends. The three important components are:
(a) static information regarding the orders of execution time for communications and computations;
(b) dynamic information in the form of trace files, from which constant factors can be extracted;
(c) an algorithmic communication-cost model to determine the expected communication times on a given architecture, for varying N and P.

The source program is analyzed to determine its communication and computation complexity with the help of the Sage toolkit.6 The output of this analysis is an expression indicating the communication and computation complexity of each function. The source program is then automatically instrumented using xinstrument. The AIMS-generated tracefile is piped to a symbolic engine (such as gnuplot or Mathematica), which interprets the complexity expressions and uses the communication-cost models to plot 3-D and 2-D curves showing the expected speed-ups and execution times of selected functions. Any function can be selected and analyzed interactively. Further, the complexity expressions can be modified to study the effect of modifications to the computations or communications on the expected behavior of the program.

Since we are interested in generating communication complexities, we need static program analysis to express the computation complexity and the number and volume of communications as functions of N and P. Determining this information by hand can be cumbersome; program-analysis tools can generate it automatically. Although the complexity-analysis problem is undecidable in general, there have been a few efforts to solve restricted cases of the problem for sequential programs, such as the prototype tool Metric (for Lisp programs). Previous work by Sarkar21 presents a framework for determining the average execution time of a sequential program based on its internal structure and its control-dependence graph; however, this work did not consider symbolic complexity analysis.

The scalability analyzer considers only loops whose indices follow an arithmetic progression, that is, (possibly multiply nested) loops with constant strides. This method does not work for generic 'Do ... while' loops, since their iteration spaces are not well defined and cannot be determined easily. However, loops in a number of typical scientific applications have well-defined iteration spaces.

The scalability analyzer does not itself determine the coefficients associated with the (N,P) terms in the complexity expressions. These coefficients can be determined by executing an appropriately instrumented program and collecting data about the time spent in the various computation and communication regions of interest. The granularity of instrumentation determines the accuracy to which execution times can be predicted; our approach is, therefore, only as accurate as achievable at a given granularity of instrumentation. Since instrumenting all loops is expensive (both in time and space), we instrument just the beginning and end of functions and communication operations. This implies that only the leading-order terms (and hence the most significant contributors to execution time) in the computation and communication complexity are used in determining expected speed-ups.
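To illustrate the coefficient-fitting step, here is a minimal sketch under the assumption that static analysis has produced a single leading-order term (N²/P below); the measured timing and all numbers are invented for the example.

```python
def fit_coefficient(measured_time, N, P, leading_term):
    """Estimate the constant c in a leading-order model t ~ c * f(N, P)."""
    return measured_time / leading_term(N, P)

# Hypothetical measurement: 0.42 s spent in a routine whose static
# analysis gives a leading computation term of N**2/P, at N=128, P=8.
c = fit_coefficient(0.42, 128, 8, lambda N, P: N * N / P)

# The fitted model c * N**2/P can now be evaluated at any (N, P):
print(c * 512**2 / 64)    # predicted computation time at N=512, P=64
```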


While communication complexity specifies how communication time scales up due to the nature of the algorithm, no information regarding the nature of the network architecture and its influence on expected performance has been considered so far. We use the communication-cost model most commonly used in algorithm design for large-scale multiprocessors. This model assumes that communication and computation phases alternate during program execution, and that communication requires time linear in the size of the message (n) plus a start-up time (o). Communication time may also be influenced by the number of links (h) that a message must traverse. Thus, communication time can be expressed as

o + n·c1 + h·c2

where c1 and c2 are defined by the machine architecture. For instance, the communication time for a message of size n traveling over h hops of an Intel iPSC/860 Hypercube is found to be 76 + n × 0·04 + h × 11 µs if n < 100, and 136 + n × 0·4 + h × 33 µs if n ≥ 100. Such expressions effectively capture the communication overhead and latency of message passing, but do not account for resource contention.

Given this framework, we can automatically obtain first-order estimates of a function's scalability. Given a tracefile from an execution of a program for one specific (N,P) pair, we can automatically determine the expected behavior of the same program when the problem size or the number of processors is scaled up.

Results of compiler-assisted complexity analysis

We now reconsider the tridiagonal solver (Xtrid) introduced in the previous section. The execution time of Xtrid is dominated by the explicit transpose of various arrays. Since communication times are the most difficult to predict accurately, and since the execution of this program is communication-intensive, we compare the predicted and observed communication times for performing the transposes. In this example, the tracefile was generated by an execution of the program with N = 128 and P = 8 and array dimensions 128 × 128. The constant values obtained using just this single tracefile are c1 = 0·821, c2 = 0·271 and c3 = 23·5. Table II compares the predicted communication times in xrtrans with the actual execution times. The predicted communication times are quite accurate, and were obtained with the help of a single eight-processor trace. The close match seen here is partly due to contention-free communication; for programs with contention, there could be some deviation between the predicted and actual times.

Table II. Comparison of predicted and actual communication times across all processors (in milliseconds) in transpose, based on an execution trace obtained with N = 128 and P = 8

            N = 128              N = 256              N = 512
        Predicted  Actual   Predicted  Actual   Predicted  Actual
P = 8       *        *        288·5     288·5    1134·5    1135
P = 16    111      112        393       394      1521      1519
P = 32    161      165        513·3     516       192       192·5
P = 64    245      238        668       676      2360      2377
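As an illustration of how such predictions can be generated from the cost model alone, consider the following sketch; the assumption that the transpose performs log2(P) exchange stages, each sending half of the local N×N/P block of 8-byte words, is ours for illustration and is not the exact message pattern behind Table II.

```python
import math

def msg_time_us(n_bytes, hops=1):
    # Piecewise-linear iPSC/860 cost model quoted in the text (microseconds).
    if n_bytes < 100:
        return 76 + 0.04 * n_bytes + 11 * hops
    return 136 + 0.4 * n_bytes + 33 * hops

def transpose_comm_ms(N, P, word_bytes=8):
    # Illustrative assumption: log2(P) exchange stages, each sending half
    # of the local N*N/P block; NOT the measured pattern of xrtrans.
    stages = int(math.log2(P))
    n_bytes = word_bytes * (N * N // P) // 2
    return stages * msg_time_us(n_bytes) / 1000.0   # microseconds -> ms

# Extrapolate from the model to larger N and P, in the spirit of Table II:
for P in (8, 16, 32, 64):
    print(P, round(transpose_comm_ms(512, P), 2))
```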


Although this approach is appealing for its flexibility and its scope for automatically generating scalability trends for individual subroutines, scalability analysis of an entire program requires sophisticated inter-procedural analysis to propagate those input variables whose values affect problem size. Moreover, the communication-cost model currently used is accurate only as long as the overlap between communication and computation does not change with problem size or the number of processors.

Performance prediction via modeling and simulation

Modeling and simulation of parallel programs and architectures are important tools for both scalability analysis and instrumentation-monitoring-visualization. From the scalability point of view, they allow one to predict the performance of large-scale runs that may be either impossible or too time-consuming to attempt on actual machines. From the instrumentation point of view, simulation allows the collection of large performance traces without unduly perturbing the execution of the program being modeled. Simulation is also a convenient tool for answering 'what-if' questions, such as: (i) What if the communication links were twice as fast? (ii) What if the CPU on each node could be speeded up twofold? (iii) What if one could have a machine with 8192 processors running an accordingly scaled-up problem?

AIMS' modeling kernel (MK) represents a collection of tools that help the user answer these questions. The main* component of MK is GPPM:11 it models parallel programs at the coarsest level, capturing only the durations of sequential blocks, the lengths and destinations of messages, loop bounds and conditional branch probabilities; it ignores all references to I/O and memory. Model structure mirrors program structure and is derived from Xinstrument-generated parse trees, one per Fortran or C module.

* MK includes the AXE multiprocessor-architecture simulator, the BDL programming language for representing models of parallel applications, and GPPM for automated model building.23

In order to strike a balance between accuracy of modeling and ease of simulation, one must be careful in selecting which control constructs (loops, procedures and conditionals) of the application are preserved in its model. GPPM preserves all control constructs that surround communications, plus any others that the user wishes to retain. The basic data structure used by GPPM is the augmented parse tree, or APT; it is obtained from AIMS' parse trees by first pruning out the subtrees corresponding to all unmodeled control constructs and then augmenting each of the remaining nodes with formulae (in terms of N and P) for their numerical attributes.

Associated with each node in the parse tree are certain attributes. A loop node carries a formula representing its loop bound (i.e. how many times the loop iterates), as well as the name of its index variable. Nodes representing sequential blocks (BLK) have a formula representing time as a function of N, P and any surrounding loop indices. Nodes associated with calls to message-sending functions (SND) have formulae expressing the destination and length in terms of N, P and any surrounding loop indices; those associated with message-receiving functions (RCV) have only one formula, for the source.

GPPM estimates all of the aforementioned formulae from AIMS-generated tracefiles. First, the structure of the model is inferred from the Xinstrument-generated APPL DB file of the parallel program. Next, timing information about various constructs of interest is extracted from AIMS-generated tracefiles by grammar-driven interpretation (see below). Finally, large tables of trace information are replaced by simple formulae describing the computation and communication complexities of the application.
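To make the APT structure concrete, a node might be sketched as follows; this is a minimal illustration in Python, and the node kinds, field names and example formulae are our assumptions rather than GPPM's actual (unpublished) data layout.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class APTNode:
    kind: str                          # 'LOOP', 'BLK', 'SND' or 'RCV'
    children: List["APTNode"] = field(default_factory=list)
    # LOOP: iteration count as a function of (N, P), plus the index name
    bound: Optional[Callable] = None
    index_var: Optional[str] = None
    # BLK: elapsed time as a function of (N, P) and surrounding indices
    time: Optional[Callable] = None
    # SND: destination and message length; RCV: source only
    dest: Optional[Callable] = None
    length: Optional[Callable] = None
    source: Optional[Callable] = None

# e.g. a loop of N/P iterations surrounding one send of 8*N bytes:
send = APTNode("SND", dest=lambda N, P: 0, length=lambda N, P: 8 * N)
loop = APTNode("LOOP", children=[send],
               bound=lambda N, P: N // P, index_var="i")
```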


Inference of model structure

Figure 14 shows a parse tree produced by AIMS and the corresponding APT skeleton (so called because it lacks the formulae that would make it a complete APT). Given the nesting structure shown in Figure 14(a), the constructs to be modeled are selected first; these are shown with thick black lines. Then a regular expression is derived as follows: a unique identifier is associated with each uninterrupted block of sequential code, an index variable is associated with each selected loop, and the various send and receive statements are numbered consecutively. The expression associated with the selected constructs is shown to the right of Figure 14(a). The interesting property of this expression is that it completely specifies the structure of the AIMS trace produced at any one node in the parallel system. Once the expression has been derived, the structure of the APT skeleton shown in Figure 14(b) follows immediately. The APT skeletons are written out, as are the regular expressions describing the structure of the tracefiles; together they capture the essential structure of a coarse-level parallel-program model. The regular expressions are consumed by a grammar generator and are used in creating a tree-structured database of information about the missing numerical attributes of APTs.

Extraction of timing information

Timing information is extracted by grammar-driven interpretation of tracefiles11 as follows. Each tracefile is first sorted by node. Then the data for each node are parsed using a Yacc grammar automatically derived from the regular expression generated in the previous phase.
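As a minimal illustration of such a structure expression, assuming invented token names, consider a model consisting of a sequential block, a loop enclosing a send, a block and a receive, and then a final block; the sketch below shows how the derived expression fully determines the legal event streams.

```python
import re

# Structure expression for the model B1 ( SND1 B2 RCV1 )* B3.
# Token names (B1, SND1, ...) are illustrative assumptions.
trace_structure = re.compile(r"B1 (SND1 B2 RCV1 )*B3")

# One node's event stream for two loop iterations matches the expression:
events = "B1 SND1 B2 RCV1 SND1 B2 RCV1 B3"
assert trace_structure.fullmatch(events)
# In MK, such an expression is fed to a grammar generator; the parser's
# actions file timing data under the corresponding APT skeleton nodes.
```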

Figure 14. The nesting structure (a) of an SPMD parallel program and the associated APT (Augmented Parse Tree) skeleton (b)


This operation is repeated for all nodes and all tracefiles; the actions associated with the rules of the grammar attach timing information to the various nodes of the APT skeleton in memory. Intermittently, the data are written out to files; each file contains data useful for the statistical estimation of one numerical attribute. Thus, loop iterations are counted and then used for estimating loop bounds. Sequential-block timings are extracted from the times of the events immediately preceding and immediately following the block in the APT; these timings are then used in estimating a formula for the computational complexity of each block.

Statistical estimation of complexity

Statistical regression is performed on the information tabulated in the previous phase. Interesting non-linear terms are computed ahead of time, and linear regression (linear in the coefficients) is performed on the expanded data set. First, correlation analysis is performed in order to select the terms exhibiting significant correlation with the dependent variable. Linear regression is then performed using only the significant terms. Finally, partial-correlation analysis is used to check the significance of the estimated coefficients. Significant terms are retained in the formula and the final coefficients computed. (See Reference 11 for details; a minimal sketch of this step appears at the end of this subsection.)

Results from modeling and simulation

Figure 15 shows how MK derives models using information contained in the application database and in the tracefiles of an instrumented parallel program. The bottom part of the figure uses VK's OverVIEW to visually compare traces from simulation with those from actual execution. This particular comparison (involving a 32-node simulation of the Xtrid benchmark for N = 512) revealed problems in AXE's implementation of global synchronization operations on the Hypercube, which caused the large amount of idle time at the beginning of the simulation trace (Figure 15, bottom right). Once a more accurate simulation of global synchronization was implemented, there was even tighter agreement between simulation and measurement: total execution times for several different N and P values were found to fall within 8 per cent of each other. (See Reference 22 for details.) Thus, the good qualitative agreement evident here is accompanied by good quantitative agreement as well, and simulation can be relied upon to rapidly locate the causes of performance problems; for example, a 512-node simulation of a model of the Xtrid benchmark described above (with N = 8192) took approximately 30 minutes on a Sun4.

Automated modeling with GPPM currently lacks reasonable ways to invent the simple expressions in N and P that appear in the overall complexity expressions; instead, it performs a combinatorial search in the space of possible expressions, which is slow and data-intensive. Considerable computational effort must therefore be expended, and tens of small-scale traces are needed, for regression to be effective. Integration with compiler-assisted complexity analysis is one potential way to reduce modeling costs.
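To make the statistical-estimation step concrete, here is a minimal sketch using invented run data and an arbitrary correlation threshold; the real implementation additionally applies the partial-correlation check described above.

```python
import numpy as np

# (N, P) of the small-scale runs, and measured block times in seconds
# (all values invented for illustration).
runs = np.array([[64, 4], [128, 4], [128, 8], [256, 8], [256, 16]])
times = np.array([0.105, 0.412, 0.208, 0.821, 0.414])

N, P = runs[:, 0].astype(float), runs[:, 1].astype(float)
# Candidate non-linear terms, computed ahead of time:
candidates = {"N^2/P": N**2 / P, "N/P": N / P, "log2(P)": np.log2(P)}

# Keep only terms strongly correlated with the measured times.
kept = {k: v for k, v in candidates.items()
        if abs(np.corrcoef(v, times)[0, 1]) > 0.9}

# Linear least squares (linear in the coefficients) on the kept terms.
X = np.column_stack(list(kept.values()))
coeffs, *_ = np.linalg.lstsq(X, times, rcond=None)
print(dict(zip(kept.keys(), coeffs)))   # e.g. {'N^2/P': ~1.0e-4}
```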

Figure 15. A schematic of MK showing the automatic derivation of models using information about program structure (contained in the Xinstrument-generated application database) and information about time spent in various constructs (contained in tracefiles obtained from small-scale runs of instrumented programs). Simulation of MK's models produces traces compatible with VK, so that simulations can be validated both visually and by quantitative means


Comparison of the two approaches

As noted earlier, analysis of scalability requires predicting the performance of a parallel application over a large range of N and P values. Even though simulation using AXE/BDL is fast, it is still not practical to perform hundreds, if not thousands, of simulations in order to analyze how performance scales to larger problem and system sizes. This is where the wide range of tools that form MK proves especially useful: compiler-assisted complexity analysis can project trends in expected behavior, while simulation can be used to understand program behavior for a specific N–P pair.

For well-balanced parallel applications with contention-free communication patterns, we can have high confidence in the accuracy of the scalability plots (obtained using either of the two methods described), because of the good quantitative agreement between the speed-ups predicted by analysis and by simulation. Figure 16 shows that (for method 5 of the Xtrid benchmark) the predictions made by analysis and simulation fall within 6 per cent of each other; in fact, for reasonable values of N (N > 2P), the error is less than 2 per cent. The scalability plot shown in Figure 1 (also for method 5 of the Xtrid benchmark) reveals that speed-up saturates as N is increased for a given P, or when P is increased for a given N. Simulation can then be performed with specific values of N and P in the region of poor performance in order to find the exact causes; a sketch of how such a region might be located automatically appears after the list below.

A few issues in MK are still pending consideration:
1. Performance prediction for programs with data-dependent complexities has not yet been resolved satisfactorily and is a topic of ongoing work.
2. Modeling the behavior of programs in which conditional constructs surround communication operations is difficult, because large-scale communication patterns can differ drastically depending on whether a certain condition evaluates to true for given values of N and P; worse still, the condition may be a function of the input data.
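The following minimal sketch shows how such a saturation region might be located automatically, given any callable execution-time model T(N, P); the toy model below is invented for illustration and is not an AIMS output.

```python
def saturation_point(N, exec_time, max_p=1024, gain_threshold=1.05):
    """Largest P for which doubling P still improves T(N, P) by >= 5%."""
    prev = exec_time(N, 1)
    P = 2
    while P <= max_p:
        cur = exec_time(N, P)
        if prev / cur < gain_threshold:
            return P // 2            # last P that still paid off
        prev, P = cur, P * 2
    return max_p

# Toy model in which communication overhead grows linearly with P:
T = lambda N, P: (N * N / P) * 1e-6 + P * 10e-6
print(saturation_point(512, T))      # -> 128 for this toy model
# Simulation would then be run at (N, P) pairs near this value to
# explain exactly why performance saturates there.
```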

Figure 16. A comparison of predicted execution times: simulation versus analysis


Even so, MK represents not only a significant advance in simulation technology (where it removes the burden of manual modeling) but also an entirely new way of characterizing scalability (where asymptotic complexity analysis is first used to locate the problematic values of N and P, and simulation is then used to explain poor performance). By producing traces compatible with AIMS, MK's models can leverage the growing body of AIMS analysis tools, such as SK and IK, to explain and optimize simulated performance.

CONCLUSIONS

We have described a comprehensive set of tools required to expose performance problems in parallel message-passing programs. We raised and addressed a number of issues in performance tuning and then described the AIMS toolkit in detail. Using a number of examples that represent the computation and communication behavior of key scientific applications, we demonstrated the effectiveness of AIMS in detecting and characterizing performance problems.

The first example considered was a parallel integer sort, whose total execution time we were able to reduce systematically using VK. PVM versions of the MG and CG benchmarks were considered next; using these examples, we showed that skew-compensated tracefiles can convey meaningful information about distributed-memory programs and can provide invaluable feedback regarding static load-balancing issues and background load. Next, we demonstrated the effectiveness of IK in explaining performance problems with the MG benchmark. IK's indices not only emphasize potential causes of poor performance but also indicate the percentage reduction in execution time achievable if a performance problem were completely eliminated. A tridiagonal solver, which represents the core computation in the numerical solution of partial differential equations, was then analyzed; we showed that indices expressed in terms of data structures can identify specific performance problems, and solutions to them. Both of these examples show that non-graphical means of representing performance information can also be valuable in characterizing performance failures. Finally, we considered the efficacy and importance of using MK for scalability analysis. Using the tridiagonal-solver example, we showed that our simulation models closely match actual execution and are useful in exposing the saturation of speed-ups for the tridiagonal solver on scaled-up executions.

AIMS is currently functional on a number of platforms, including the CM-5, Intel iPSC/860, Intel Paragon and workstation clusters running PVM, and is available for distribution from the authors. It currently supports Fortran 77 and C (with message-passing calls); both the native message-passing libraries and PVM are supported. Our ongoing research includes the following topics:
1. On the instrumentation front, we are working on extensions that will allow us to handle data-parallel languages (such as HPF) as well as other message-passing libraries (e.g. MPI).
2. On the monitoring front, we are considering techniques that help reduce the amount of performance data gathered and that automatically discover the node-to-node interconnection topology and resource utilization of parallel virtual machines, as well as porting AIMS to other multiprocessor platforms (such as the IBM SP-2).
3. On the trace-analysis side, we plan to work on scalable and informative graphical views incorporating data-structure information, an automated performance-debugging guide, integrated visualization of computation and background workload, and a more robust and extensible modeling kernel.


ACKNOWLEDGEMENTS

The authors wish to acknowledge the entire AIMS development team, which includes Melisa Schmidt, Cathy Schulbach and Brian VanVoorst. We also wish to acknowledge the computing facilities provided to us by the Numerical Aerodynamic Simulation Systems Division, NASA Ames Research Center. The work described here is supported under NASA's High Performance Computing and Communications Program.

REFERENCES

1. E. Kraemer and J. Stasko, 'The visualization of parallel systems: an overview', J. Parallel and Distributed Computing, 18, (2), 105–117 (1993).
2. D. A. Reed, R. D. Olson, R. A. Aydt, T. M. Madhyastha, T. Birkett, D. W. Jensen, B. A. A. Nazief and B. K. Totty, 'Scalable performance environments for parallel systems', Proc. 6th Distributed Memory Computing Conference, April 1991.
3. J. C. Yan, 'Performance tuning with AIMS—an automated instrumentation and monitoring system for multicomputers', Proc. 27th Hawaii International Conference on System Sciences, Wailea, Hawaii, January 1994.
4. S. R. Sarukkai and A. Malony, 'Perturbation analysis of high level instrumentation for SPMD programs', Proc. Symp. on Principles and Practice of Parallel Programming, San Diego, May 1993, pp. 44–53.
5. M. Heath and J. Ethridge, 'Visualizing the performance of parallel programs', IEEE Software, 8, (5), 29–39 (1991).
6. D. Gannon, J. K. Lee, B. Shei, S. R. Sarukkai, S. Narayana, N. Sunderam, D. Atapattu and F. Bodin, 'SigmaII: a toolkit for building parallelizing compilers and performance analysis systems', Proc. Programming Environments for Parallel Computing Conf., Edinburgh, April 1992.
7. S. R. Sarukkai and J. K. Gotwals, 'Analyzing data-structure movements in message-passing programs', Technical Report #402, Indiana University, Dept. of Computer Science, March 1994.
8. J. C. Yan and S. Listgarten, 'Intrusion compensation for performance evaluation of parallel programs on a multicomputer', Proc. ISCA 6th International Conference on Parallel and Distributed Computing Systems, Louisville, KY, 14–16 October 1993, pp. 427–431.
9. J. C. Yan, P. Houtalas, S. Listgarten, C. Fineman, M. Schmidt and C. Schulbach, 'The automated instrumentation and monitoring system (AIMS) reference manual', NASA Technical Memorandum 108795, NASA Ames Research Center, Moffett Field, CA 94035-1000, U.S.A., December 1993.
10. S. R. Sarukkai, J. C. Yan and J. K. Gotwals, 'Normalized performance indices for message passing parallel programs', Proc. Int. Conf. on Supercomputing, July 1994.
11. P. Mehra, M. Gower and M. Bass, 'Automated modeling of message-passing programs', Proc. Int. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 94), IEEE Computer Society Press, Durham, NC, January 1994, pp. 187–192.
12. S. White, A. Alund and V. S. Sunderam, 'The NAS parallel benchmarks on virtual parallel machines', Dept. of Mathematics and Computer Science, Emory University, Atlanta, GA, 1993.
13. C. H. Cap and V. Strumpen, 'Efficient parallel computing in distributed workstation environments', Parallel Computing, 19, 1221–1234 (1993).
14. S. T. Leutenegger and X-H. Sun, 'Distributed computing feasibility in a non-dedicated homogeneous distributed system', Proc. Supercomputing '93, ACM, 1993, pp. 143–147.
15. D. Bailey, J. Barton, T. Lasinski and H. Simon (eds), 'The NAS parallel benchmarks', Report RNR-91-002, NASA Ames Research Center, January 1991.
16. J. K. Hollingsworth, R. B. Irvin and B. P. Miller, 'The integration of application and system based metrics in a parallel program performance tool', Proc. Third ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPOPP), 1991, pp. 189–200.
17. S. R. Sarukkai, 'Scalability analysis tools for SPMD message-passing parallel programs', Proc. Int. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 94), IEEE Computer Society Press, Durham, NC, January 1994, pp. 180–186.
18. D. Nussbaum and A. Agarwal, 'Scalability of parallel machines', Comm. ACM, 34, (3), 57–61 (1991).
19. J. Kohn and W. Williams, 'ATExpert', J. Parallel and Distributed Computing, 18, (2), 205–222 (1993).


20. S. R. Sarukkai and D. Gannon, 'SIEVE: a performance debugging environment for parallel programs', J. Parallel and Distributed Computing, 18, (2), 147–168 (1993).
21. V. Sarkar, 'Determining average program execution times and their variance', SIGPLAN Notices, 24, (7), 298–312 (1989).
22. P. Mehra, C. Schulbach and J. Yan, 'A comparison of two model-based performance-prediction techniques for message-passing parallel programs', Proc. SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, ACM, Nashville, TN, May 1994.