Performance Debugging Shared Memory Parallel Programs Using Run-Time Dependence Analysis

Ramakrishnan Rajamony† and Alan L. Cox‡
Departments of Electrical & Computer Engineering† and Computer Science‡
Rice University, Houston, TX 77251-1892

Abstract

We describe a new approach to performance debugging that focuses on automatically identifying computation transformations to reduce synchronization and communication. By grouping writes together into equivalence classes, we are able to tractably collect information from long-running programs. Our performance debugger analyzes this information and suggests computation transformations in terms of the source code. We present the transformations suggested by the debugger on a suite of four applications. For Barnes-Hut and Shallow, implementing the debugger suggestions improved the performance by factors of 1.32 and 34, respectively, on an 8-processor IBM SP2. For Ocean, our debugger identified excess synchronization that did not have a significant impact on performance. ILINK, a genetic linkage analysis program widely used by geneticists, is already well optimized. We use it only to demonstrate the feasibility of our approach on long-running applications.

We also give details on how our approach can be implemented. We use novel techniques to convert control dependences to data dependences, and to compute the source operands of stores. We report on the impact of our instrumentation on the same application suite we use for performance debugging. The instrumentation slows down the execution by a factor of between 4 and 169. The log files produced during execution were all less than 2.5 Mbytes in size.

1 Introduction

Programmers often rely on performance analysis tools to improve the performance of parallel applications. These performance debuggers typically provide data and performance metrics from the execution that can be used to transform the program, improving its performance.

* This research was supported in part by the National Science Foundation under NYI Award CCR-945770 and by the Texas Advanced Technology Program. Ram Rajamony is also supported by an IBM Cooperative Fellowship.

To appear in the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, June 1997, Seattle, WA

Synchronization and communication are two significant sources of overhead in shared memory parallel programs. In this paper, we describe a performance debugger that suggests source-level transformations to reduce these overheads in explicitly parallel shared memory programs. In contrast to existing performance tools, our debugger analyzes the data and control dependences within and across processors at run-time. This enables us to suggest changes to the computation structure.

In the shared memory paradigm, processors can access a global namespace from anywhere in the system. Multiple processors can simultaneously read and write locations in this shared address space. To order conflicting accesses (read-write or write-write accesses to the same location), the processors must synchronize. By requiring processors to interact, synchronization may slow down execution. In addition, it may also cause processors to block. Hence, while the parallel programmer must use sufficient synchronization in order to prevent data races (concurrent conflicting accesses) [16], any extra synchronization can lead to poor performance.

A synchronization operation is unnecessary if no true, anti- or output dependences [4] need to be enforced, or if other synchronization satisfies the dependences it enforces. Excess synchronization can arise for several reasons. Programmers often oversynchronize, forcing more processors to interact than needed. The synchronization constructs that are used may also impose more ordering than required. In a previous paper, we describe a performance debugger that automatically analyzes data collected from a program execution to detect excess synchronization [17]. By providing feedback in terms of lines in the source program, our debugger eliminates the need for the programmer to reason about low-level execution statistics to determine the cause of poor performance.

Even if a program contains no excess synchronization as written, it may be possible to reduce synchronization and data communication by changing the computation structure. If the results of some computation are not needed immediately and the source operands are still available later, the computation can be postponed. Postponement will be profitable if the dependences due to the computation can be enforced by other synchronization. Synchronization that is otherwise necessary may then be weakened or even eliminated, resulting in improved performance.

Figure 1: Eliminating synchronization by postponing computation. The execution of four processors is shown, with computation indicated in the boxes. In the top half, synchronization S0 can be eliminated by postponing computation on Pb. In the bottom half, postponement is not possible: there are anti-dependences between the reads of x, y and z and their subsequent writes, so the operands are not available at the postponement point. S1 is required due to dependences not shown in the figure. Shaded computation is not relevant to the discussion.

Figure 1 explains this with an example. In the top half, processor Pb reads data written by the other processors after synchronizing with them. This true dependence necessitates synchronization S0. After reading x, y and z, Pb writes t. The results of this computation (t) are not read by the other processors after synchronization S1. Moreover, the source operands of the computation (x, y, z) are still available unmodified if this computation is postponed to after S1. Synchronization S0 can then be eliminated. In the bottom half of figure 1, x, y and z are modified after synchronization S1. The source operands will therefore not be available if the computation is postponed. Postponing the computation would thus affect program correctness.

The amount of data communicated is also an important factor in determining program performance. Ideally, computation should be performed on the processor that has computed most of the source operands and uses (reads) most of the results. By attributing data movement to the computation whose memory accesses induce the movement, we can determine if it would be better for the computation to be performed on another processor. If this relocation eliminates all the data dependences enforced by some synchronization, the synchronization can also be removed. In figure 2, processor Pb performs some computation that uses data created by Pa. The results of this computation are again used by Pa. Hence, as it stands, there is data communication between the processors, and synchronizations S0 and S1 are required. Both the data communication and the synchronization can be removed by relocating the computation to Pa.

Our objective is to enable a performance debugger to detect such cases, and identify them to the programmer.
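To make the postponement in the top half of figure 1 concrete, here is a minimal C sketch of the before/after structure. POSIX barriers stand in for the synchronization operations S0 and S1, and extern variables stand in for the shared data; none of this is the actual benchmark code.

    #include <pthread.h>

    /* Shared data written by the other processors before S0. */
    extern double x, y, z;
    extern pthread_barrier_t S0, S1;

    /* Original structure on processor Pb (top half of figure 1). */
    void pb_original(void)
    {
        double t;
        pthread_barrier_wait(&S0);   /* orders the reads of x, y, z       */
        t = x + y + z;               /* t is not read by others before S1 */
        pthread_barrier_wait(&S1);   /* required for other dependences    */
        (void)t;
    }

    /* Restructured version: the computation is postponed past S1, and
     * S0 is no longer needed on this path. */
    void pb_postponed(void)
    {
        double t;
        pthread_barrier_wait(&S1);
        t = x + y + z;
        (void)t;
    }

The second form is legal only because, as discussed above, x, y and z are not modified before S1 and t is not consumed by the other processors until after S1.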

Figure 2: Eliminating communication and synchronization by relocating computation. Arrows indicate data dependences.

Identifying such cases requires two pieces of information. First, we need to know the source operands of the computation. During program execution, every write to memory can be used to generate information linking that write to its operands. However, recording the operands for all writes becomes infeasible for programs that execute for more than a very short period of time. In addition, using a performance debugger to analyze the large amount of gathered data also becomes infeasible. Hence, performance tuning real-world applications requires a tractable means of collecting run-time dependence information. We present a method for doing this that groups writes together and collects operand information in terms of equivalence classes. Second, we need to know the cross-processor data and control dependences at run-time. Hence, we also record reads and writes to shared locations during program execution. Later, during analysis, we combine the access information with the operand data and use it for performance debugging the application.

Figure 3: A high-level overview of our performance debugging approach. The explicitly parallel shared-memory program is compiled (gcc) to assembly code and instrumented for run-time dependence analysis; the resulting executable runs on the parallel computer, writing trace output to log files; the performance debugger analyzes the log files and program output and provides feedback to the programmer.

Figure 3 gives a high-level overview of our entire approach. We compile the explicitly parallel source program using a standard compiler. We analyze the assembly output from the compiler and instrument it to collect the dependence information at run-time. The executable is then run on the shared memory parallel computer. The instrumentation we insert produces trace data, which the program writes to log files. After the program has finished executing, we process the log files using our performance debugger. The debugger makes suggestions to the programmer for transforming the program, in terms of the source code.

Our approach requires programmer involvement at two levels. First, we use information about the synchronization accesses in the program to both collect and analyze the run-time data. Thus, the performance debugger is able to provide useful feedback only if the programmer uses synchronization primitives provided by the shared memory system. Second, as the analysis is carried out with information gathered at run-time, the performance debugger only makes suggestions for improving performance. The programmer has to ensure program correctness when incorporating feedback from our debugger into the program. The feedback we provide is still very useful, as it greatly reduces the state space of transformations the programmer has to explore in order to improve performance.

We have implemented methods both for collecting dependence information at run-time and for analyzing the collected data in a performance debugger. We present the results of applying our approach to four applications. Ocean and Barnes-Hut are applications from the Splash benchmark suite [27]. Shallow is a version of the shallow water benchmark from the National Center for Atmospheric Research [20]. Unlike the other three, ILINK [7] is a "real" application widely used by geneticists for genetic linkage analysis. For Ocean, the performance debugger found excess synchronization in the program. For Barnes-Hut and Shallow, the debugger suggested computation transformations that reduced synchronization and communication. Implementing these suggestions improved the performance of the programs by factors of 1.32 and 34 respectively, on an 8-processor IBM SP2 running TreadMarks [9]. TreadMarks is a software distributed shared memory system that provides a shared memory abstraction on message-passing machines. Our debugger did not find any transformations that would benefit ILINK. As ILINK is already a well optimized program, this is not surprising. Our intent in including it for analysis was to demonstrate the feasibility of our approach for long-running applications. Our instrumentation slowed down the execution of these programs by a factor between 4 and 169, producing log files that were less than 2.5 Mbytes in size.

This paper makes contributions in both methodology and implementation. Our performance debugging approach analyzes data gathered at run-time to suggest changes (at the source-code level) to the computation structure of the program. These changes, when implemented, reduce the synchronization and communication in the program. To make the run-time data collection tractable, we use equivalence classes to group writes together. These are defined in such a way that the collected data still has the information content needed for analysis by a performance debugger. At the implementation level, we present run-time methods for converting control dependences to data dependences and for recording the source operands of computation. We present an efficient method to save and restore the control dependences in effect when entering and leaving the scope of a controlling basic block (one that can change the control flow). To determine the source operands of computation, we instrument the basic blocks of the program. When an instrumented basic block executes, it produces an access trace that we feed to its mirror basic block, which we generate. The mirror basic block uses this trace to determine the source operands for the computation in the original basic block.

The outline of the rest of the paper is as follows. In section 2, we describe the method for collecting operand information in terms of equivalence classes. In section 3, we explain how the performance debugger analyzes this information. The algorithms used in the debugger to determine when computation can be postponed or relocated are explained there. We also use a suite of four applications to report on the computation restructuring suggested by the debugger, and its impact on performance.

The remaining sections discuss how our approach can be implemented. In section 4, we describe the algorithm used at run-time to collect operand information in terms of equivalence classes. Section 5 describes the static analysis we perform in order to instrument the program. Section 6 presents details specific to the run-time system we use, and discusses the effects of instrumentation on the application suite. We conclude with related work and conclusions in sections 7 and 8.


2 Equivalence Classes

To enable a performance debugger to suggest changes to the computation structure, we need two pieces of information. First, we need to know the data on which the computation depends. This is made up of the source operands that are used in the computation. Second, we need to see if changing the computation structure would cause a dependence (true, anti- or output) to no longer be enforced by existing synchronization. This requires knowledge of the accesses made by the program.

We first discuss how to determine the source operands of computation. As a program executes, it writes values to memory that are derived from the contents of other memory locations.^1 Thus, during execution, a write can be expressed in terms of earlier writes to the memory locations on which it depends at that point (these are the source operands of the computation). However, collecting operand information at such fine detail presents several problems. First, the amount of memory required at run-time to keep track of this information becomes very large. Storing the collected data for analysis also becomes impractical. Data analysis in the performance debugger typically requires (costly) combinatorial operations such as set intersections and unions, which, with detailed operand information, can become time-bound. All of these factors combine to make this approach impractical.

We can avoid these problems by grouping writes to memory locations into equivalence classes and collecting the operand information for later writes in terms of these equivalence classes. By referring to the group instead of to the individual writes, a smaller amount of data gets collected. Different criteria can be used to form equivalence classes; we describe the method we use below.

In the shared memory programming paradigm, memory accesses are ordered using synchronization. These synchronization operations divide the execution of a process^2 into intervals. Consider the writes made by a process in a particular interval. We can group these writes into an equivalence class characterized by a [processor,interval] tuple. Then, at a later write, instead of recording the actual locations on which the write depends (i.e., the source operands), we use the equivalence classes corresponding to the last writes to these locations. Figure 4, showing the execution of processors Pa, Pb and Pc, illustrates this. Here, the source operands of the write to t by Pa are q, s, u, v, x, y and z. These locations were last written to by Pa and Pc in interval 2 and Pb in interval 3.
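As a rough illustration of this bookkeeping (not the paper's actual data structures), the sketch below tags every location with the [processor,interval] tuple of its last write and records a new write's operands as the set of distinct tuples of the locations it reads. The names last_writer and record_operand are hypothetical.

    #include <stddef.h>

    #define MAX_CLASSES 16

    typedef struct {            /* an equivalence class: [processor, interval] */
        int proc;
        int interval;
    } eqclass_t;

    /* Hypothetical per-location tag: the class of the last write to it. */
    extern eqclass_t last_writer(const void *addr);

    /* Operand set for the write currently being recorded. */
    static eqclass_t operands[MAX_CLASSES];
    static int       noperands;

    /* Record that the pending write depends on the location at 'addr'.
     * Only the [processor,interval] tuple is kept, so the set stays small
     * even when a write has many source operands. */
    static void record_operand(const void *addr)
    {
        eqclass_t c = last_writer(addr);
        for (int i = 0; i < noperands; i++)
            if (operands[i].proc == c.proc && operands[i].interval == c.interval)
                return;                      /* tuple already present */
        if (noperands < MAX_CLASSES)
            operands[noperands++] = c;
    }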

Figure 4: Using equivalence classes to represent operand information. Arrowheads indicate the beginning and end of intervals (shown as boxed numbers). Synchronization is not depicted. The marks on each timeline represent writes to variables. In the figure, q and s belong to the equivalence class [Pa, 2]; x, y and z belong to [Pb, 3]; and u and v belong to [Pc, 2].

The operand information can hence be represented compactly as

    Writes in [Pa, 2], [Pb, 3] & [Pc, 2]  --src ops-->  Write to t

For brevity, in the remainder of this paper, we represent this simply as

    [Pa, 2], [Pb, 3], [Pc, 2]  --src ops-->  t

where this is understood to mean that the source operands for the write on the right-hand side come from the writes in the intervals on the left-hand side.

Defining equivalence classes in this manner enables us to reduce the amount of operand data collected. Although a write may have a large set of source operands, the number of distinct [processor,interval] tuples to which its operands belong is much smaller.

For analysis by the performance debugger, we need to regenerate the memory locations making up the source operands. Hence, at the end of each interval, we aggregate the reads and writes to shared memory (in the interval just completed) and record them as access ranges. By combining this with the operand information, we can determine a superset of writes which make up the operands. Within an interval, we do not need to know the order in which accesses occur. This permits us to record them compactly using access ranges. The access data is also used to determine if there are true, anti- and output dependences that may be unenforceable if the computation is restructured. This is the second piece of information that we need for performance debugging.

3 Performance Debugging Using Equivalence Classes

In this section, we explain how the performance debugger uses the operand data (recorded in terms of equivalence classes) and the access information to regenerate the memory locations that are the operands for a write. We then describe how this information is used to determine when to postpone and relocate computation.

During the execution of a program, the program order and the inter-process synchronization impose a partial ordering on all accesses.

^1 We assume I/O from/to the external environment to be reads and writes from and to memory that is private to each processor.
^2 We use the terms process and processor interchangeably.


This partial ordering is used to create a partial-order graph, made up of as many disjoint subgraphs as there are processors. The nodes of a subgraph represent the synchronization operations issued by a processor. The program order imposes a total ordering on all nodes in a subgraph, represented by directed edges connecting the nodes. Between subgraphs, synchronization operations are used to create directed edges. An edge is drawn from the release of a synchronization variable to its subsequent acquire, as observed during the execution. A release is a synchronization write and an acquire a synchronization read [6]. This graph enables us to evaluate the happens-before-1 relation (hb1) [1]. Given accesses x and y, x -hb1-> y holds iff there is a path in the partial-order graph from the release immediately succeeding x to the acquire immediately preceding y. This relation allows us to determine the earliest interval in which a processor can become aware of a write on another processor.

Consider a write to x on processor Pa. Let the source operands of the write be memory locations that were last written in interval i on processor Pb. At run-time, the operand information for the write to x is recorded as

    [Pb, i]  --src ops-->  x

We use the hb1 relation to determine the actual memory locations which make up the source operands for the write. Let j be the earliest interval on Pa such that [Pb, i] -hb1-> [Pa, j] (where the relation is defined for intervals as it is for accesses). We examine the shared reads on Pa, beginning with interval j, and find the locations read that were written in [Pb, i]. This gives us a superset of the locations which make up the operands of the write to x. Note that it is not enough to examine the shared reads in just the interval in which x is written. This is because x may depend on a private location written in a previous interval, which in turn may depend on data written in [Pb, i].

A limitation of our approach is that we can determine only a superset of the memory locations that make up the source operands for a write. Consider an example where x and y are written by a processor in the same interval. Let x depend on u and y on v, where u and v are last written in [Pb, i]. At run-time, the source operands for these writes are recorded as

    [Pb, i]  --src ops-->  x    and    [Pb, i]  --src ops-->  y

During analysis, we can only determine that the superset of locations which are the source operands for x and y is {u, v}. We discuss this more in sections 3.1 and 3.2.
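Since the hb1 relation is just reachability in the partial-order graph, the debugger's offline evaluation can be pictured with the following C sketch. The node and edge representation (one node per [processor,interval] pair, a dense adjacency matrix) is an assumption of the sketch, not the paper's implementation.

    #include <stdbool.h>

    #define MAX_NODES 1024

    /* Nodes are [processor,interval] pairs, numbered 0..nnodes-1.
     * Edges come from program order and observed release->acquire pairs. */
    typedef struct {
        int  nnodes;
        bool edge[MAX_NODES][MAX_NODES];
    } po_graph;

    /* Returns true if node 'from' reaches node 'to' in the partial-order
     * graph, i.e. the corresponding intervals are related by hb1. */
    static bool hb1(const po_graph *g, int from, int to)
    {
        bool seen[MAX_NODES] = { false };
        int  stack[MAX_NODES], top = 0;

        seen[from] = true;
        stack[top++] = from;
        while (top > 0) {
            int n = stack[--top];
            if (n == to)
                return true;
            for (int m = 0; m < g->nnodes; m++) {
                if (g->edge[n][m] && !seen[m]) {
                    seen[m] = true;          /* push each node at most once */
                    stack[top++] = m;
                }
            }
        }
        return false;
    }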

3.1 Postponing Computation

Let Pb be a processor that synchronizes with Pa. Our aim is to find out whether this synchronization, S, can be removed. We start by finding the set of intervals on Pa bounded by the acquire corresponding to S and the next acquire corresponding to a release by Pb. Let j0 ... jn be these intervals, which we collectively refer to as j. Next, we find the shared writes in [Pa, j0] ... [Pa, jn] that have source operands that were last written on Pb, and denote these by W^Pb_j0 ... W^Pb_jn. This is obtained from the operand information recorded as [processor,interval] tuples at run-time. We use W^Pb_jk to refer to any one of these, and W^Pb to refer to them collectively.

We next determine if these writes can be postponed. Let j' be the interval after j. If the writes in W^Pb are postponed to j', the synchronization S can possibly be removed. If any other processor reads a location in W^Pb before Pa synchronizes with that processor after interval j, the writes in W^Pb cannot be postponed. We use the partial-order graph to find the intervals where we need to check if elements of W^Pb are read. Let m be the first interval on a processor Pp such that [Pa, jk] -hb1-> [Pp, m]. Also, let l be the first interval on Pp such that [Pa, j'] -hb1-> [Pp, l]. We refer to the intervals from [Pp, m] to [Pp, l], including the former and excluding the latter, as the "in-between" intervals. We examine the reads in the "in-between" intervals on all processors. If any element of W^Pb_jk is read in these intervals, the computation characterized by the writes in W^Pb cannot be postponed.

If the writes in W^Pb can be postponed based on the above criterion, we next determine if the source operands of W^Pb are still available unmodified when W^Pb is postponed to [Pa, j']. To do this, we first determine the set of memory locations on which W^Pb depends (see section 3). Let DEP_jk denote the elements of this set that are read in [Pa, jk]. We refer to these collectively as DEP_j. We then examine whether any element of DEP_jk is written in the "in-between" intervals. For all processors Pp, if no element of DEP_j is written, the source operands will still be available if the computation is postponed. Note that, as described in section 3, only a superset of the locations involved in a particular dependence can be determined. This limitation does not affect us for the following reason. Consider the writes in [Pa, j] which depend on data written in [Pb, i]. In order to remove synchronization from Pb to Pa, all these writes need to be postponed. For a particular write, only the superset of locations on which it depends can be determined. However, all the writes taken together depend on a collective set of locations which can be precisely determined. This is what we require for the analysis.

After determining that the writes in W^Pb can be postponed, a final check is required before the performance debugger can suggest that synchronization S can be removed. This arises because of the transitive nature of synchronization. Consider the [processor,interval] tuples on which the writes in the "in-between" intervals depend. If removing S results in one of these dependences not being enforced (i.e., all paths in the partial-order graph enforcing this dependence pass through S), we cannot remove S.
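The core of this test can be summarized in the following sketch. The helpers first_aware(), reads_any() and writes_any() are assumed to be derived from the logged access ranges and the partial-order graph, and the code mirrors the checks described above rather than the debugger's actual implementation; the final transitivity check is omitted.

    #include <stdbool.h>

    typedef struct { int proc, interval; } ival;   /* a [processor,interval] tuple */

    /* Assumed helpers over the logged data (not from the paper): */
    extern bool reads_any(ival v, const void *W);     /* does v read any of W?     */
    extern bool writes_any(ival v, const void *DEP);  /* does v write any of DEP?  */
    extern ival first_aware(int proc, ival of);       /* first interval on 'proc'
                                                         related to 'of' by hb1    */

    /* Can the writes W (made in [pa, jk], depending on locations DEP)
     * be postponed to [pa, jprime] without being observed too early? */
    static bool can_postpone(int nprocs, int pa, ival jk, ival jprime,
                             const void *W, const void *DEP)
    {
        for (int p = 0; p < nprocs; p++) {
            if (p == pa)
                continue;
            ival m = first_aware(p, jk);       /* first interval that could read W */
            ival l = first_aware(p, jprime);   /* first interval that would see    */
                                               /* the postponed writes             */
            /* the "in-between" intervals [m, l) on processor p */
            for (int i = m.interval; i < l.interval; i++) {
                ival v = { p, i };
                if (reads_any(v, W))      /* W would be read before it is written  */
                    return false;
                if (writes_any(v, DEP))   /* a source operand would be overwritten */
                    return false;
            }
        }
        return true;
    }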

3.2 Relocating Computation

Relocating computation with the aim of reducing synchronization and communication is at cross-purposes with parallelization. For instance, reducing communication by relocating computation greedily could result in all the computation being collapsed onto a single processor. Therefore, load balancing also needs to be taken into account. Our performance debugger uses a local algorithm to determine when computation can be relocated, looking only at intervals separated by barriers.

Consider the writes in a barrier-separated interval j on processor Pa. We first determine the superset of memory locations on which these writes depend. This gives an upper bound on the amount of data that is communicated from every other processor to Pa. We next find the amount of data communicated from Pa to the other processors when they read the data written in [Pa, j]. This can be exactly determined by finding subsequent reads to this data on the other processors. For each processor, we find the sum of the amount of data it supplies for the writes in [Pa, j] and the amount of the results that it subsequently reads. The performance debugger then suggests that the computation in [Pa, j] be relocated to the processor for which this sum is maximum (if different from the one on which the computation is currently performed).
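A sketch of this heuristic: for each candidate processor, sum the data it supplies to the writes in [Pa, j] and the results it later reads, and suggest the processor with the largest sum. The byte-count helpers are assumed to be computed from the logged access ranges; they are not part of the paper's code.

    #include <stddef.h>

    /* Assumed helpers derived from the access logs (illustrative only): */
    extern size_t bytes_supplied(int proc, int pa, int j);   /* operands proc wrote */
    extern size_t bytes_read_back(int proc, int pa, int j);  /* results proc reads  */

    /* Suggest where the computation of interval [pa, j] should run. */
    static int suggest_relocation(int nprocs, int pa, int j)
    {
        int    best     = pa;
        size_t best_sum = bytes_supplied(pa, pa, j) + bytes_read_back(pa, pa, j);

        for (int p = 0; p < nprocs; p++) {
            size_t sum = bytes_supplied(p, pa, j) + bytes_read_back(p, pa, j);
            if (sum > best_sum) {
                best_sum = sum;
                best = p;              /* this processor exchanges the most data */
            }
        }
        return best;   /* equals pa when no relocation is suggested */
    }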

3.3 Relating Computation to Source Lines

Once we identify computation that must be postponed or relocated, we need to relate these writes to source lines in the program. Similar to memory locations, we also maintain operand information for source code lines. The operand information of all the stores corresponding to a source line is merged and attributed to the line. Given a set of stores that need to be postponed or relocated, we identify the source lines as those that depend on the equivalence classes on which the stores depend.

3.4 Findings

We report on the results of applying our performance debugging approach to four programs: Ocean, Barnes-Hut, Shallow and ILINK. We describe these programs below, along with the feedback suggested by our performance debugger. We also present the improvements in speedup obtained after implementing the debugger suggestions. These results are obtained from an 8-processor IBM SP2 running TreadMarks [9], a software distributed shared memory system. TreadMarks provides a shared memory abstraction on distributed memory (message-passing) machines.

3.4.1 Applications

Barnes-Hut and Ocean are complete applications that are part of the SPLASH benchmark suite [27]. Barnes-Hut is a simulation of a system of bodies influenced by gravitational forces. We ran Barnes-Hut with 16384 bodies for 5 timesteps. Ocean is a simulation to study the role of eddy and boundary currents in influencing large-scale ocean movements. We used a 130 x 130 grid as the input data set.

Shallow is a version of the shallow water benchmark from the National Center for Atmospheric Research [20]. This program solves difference equations on a two-dimensional grid for the purpose of weather prediction. The code we used was given to us with a request for improving its performance on TreadMarks. It makes an excellent test case for our performance debugger, because it represents programs written by inexperienced parallel programmers. We ran Shallow for 6 timesteps on a 256 x 256 input grid.

ILINK is a parallelized version of a genetic linkage analysis program and is part of the FASTLINK package [7]. In contrast to the other programs, this application is widely used by geneticists and is heavily optimized for performance. Our goal, therefore, is not to detect performance problems in the program, but rather to demonstrate the feasibility of our run-time dependence analysis approach to real-world applications. We ran ILINK with the data set corresponding to the bipolar affective disorder disease gene.

3.4.2 Debugger Feedback and Performance Impact

Our debugger identified several instances of excess synchronization and suggested computation restructuring to reduce synchronization and communication in the programs. We present some of these findings below.

Barnes: This program has a phase where one processor collects information from the others and prints it out. Our debugger found out that the data being printed was not modified until much later, and suggested postponing the printing to a later point. This resulted in a barrier in the program being required only to satisfy anti-dependences. We were thus able to weaken this synchronization. On an 8-processor IBM SP2 running TreadMarks, the speedup of the program improved from 4.1 to 5.4 after restructuring. A significant portion of this improvement came about as a result of reduced false sharing [24].

Shallow: The main data structures in this program are a number of arrays. Each processor is assigned the computation on a band (set of contiguous rows) of the arrays. Sharing is required only at the boundary rows of the bands. Periodically, the border elements of the arrays are exchanged. In the code we examined, one processor was doing all of the data exchanges. Our debugger pointed out that the exchanges should be relocated to the processors which were assigned the rows. This reduced the data communication in the program significantly. The computation restructuring had a large impact on the running time of the program. On an 8-processor IBM SP2 running TreadMarks, the restructured program achieved a speedup of 3.4 compared to 0.1 (a slowdown!) for the original version.

Ocean: Although our debugger did not identify any computation restructurings for this program, it pointed out a barrier which was not used most of the time, and which could then be conditionally executed on each iteration. This transformation did not have a significant impact on the running time.

4 Collecting Source Operands

In this section, we explain the run-time algorithm we use to collect operand data at the level of [processor,interval] tuples. Our algorithm works by associating certain state with all memory locations and source lines that correspond to shared stores. This state is used to record operand information as well as the shared locations accessed in an interval. To relate computation (i.e., writes) to lines in the source code, we look at the symbol table information and the assembly output from the compiler.

This enables us to determine a superset of the lines that correspond to shared stores.

A private location has only the PROCINT state. Source lines that could correspond to shared stores also have CREAT associated with them. Locations in shared memory have D associated as well.

PROCINT: The PROCINT field of a location keeps track of the [processor,interval] tuples of the source operands of the data in the location. It specifies when and where the operands were created. For a source line, the PROCINT field keeps track of the PROCINT fields of all stores in this interval that correspond to the line.

CREAT: If a shared location is marked CREAT, it was written by the marking processor in this interval. If a source line is marked CREAT, a store corresponding to the line wrote a shared location in this interval.

D: If a shared location is marked D, it contains or contained data created on some processor in a previous interval that has been read in this interval by the marking processor.

The state for private locations and source lines is maintained in private memory. State for shared locations is maintained in shared memory. Moreover, shared locations have as many CREAT and D fields as there are processors. Thus, when multiple processors read a shared location written elsewhere, each processor marks its D field of the location. Figure 5 specifies the algorithm that uses this state to obtain the dependence information. Pcurr signifies the processor executing the algorithm. We explain each step in the algorithm below.

1. When a shared location is read, if it was written in a previous interval or on a different processor, we mark it as being read (with D). This permits us to later determine the shared locations read by Pcurr in this interval. Note that the equivalence class to which the location belongs is already available in its PROCINT field, which was set by the processor that last wrote the location (see step 2).

2. To determine which shared locations have been written by Pcurr in an interval, we mark these locations with CREAT. Locations in private memory are not marked, as they cannot be involved in cross-processor dependences. Although a private location cannot be read by other processors, it can still be read by the writing processor in a later interval. Consider a location x in private memory that depends on shared data. When x is read in a later interval, we want to know the locations on which x depends. Hence, for private as well as shared writes, we keep track of the equivalence classes of the operands on which the write depends. A single line in the source code may correspond to stores to different locations. Hence, in contrast to the stores, we merge the PROCINT information of the operands of the store with the existing PROCINT information for the line.

3. For every interval, we need to know two pieces of information. First, we need the identity of shared locations written in the interval, along with the equivalence classes of their operands. Second, we need the set of shared locations accessed in the interval. While we need the identity of the locations, we do not need the order in which they were accessed. Since most programs exhibit some degree of temporal and spatial locality, this information can typically be represented very compactly. This in turn enables the log files to which we write this information to grow slowly. Accesses to private memory are not recorded. Since these locations can be read only by the writing processor, we effectively short-circuit them when recording operand information.

Once this information has been recorded (in the log files), we clear the state associated with the shared locations in preparation for the next interval. By clearing the Pcurr fields of D and CREAT, we are able to pinpoint the interval in which shared accesses occur. For shared locations written in an interval, we also set the PROCINT information. This is the equivalence class to which the write belongs, and is used by subsequent writes that use this data as a source operand. This information is used when the location is later read. The existing synchronization in the program orders the read-write accesses to the PROCINT field. We also clear the PROCINT information associated with the source lines. We do not clear the PROCINT information for private locations. This enables us to short-circuit these locations when recording operand information.

For both logging the state of shared locations and for clearing it, we need to scan the shared state information. We defer a discussion of how this can be done efficiently, without scanning all the shared state, to section 6.
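The per-location state and the handlers of figure 5 might look roughly like the following in C. The bitmask layout, the fixed bound on PROCINT tuples, and the helper names are assumptions made for this sketch, not the paper's data structures; the merging of source-line PROCINT information and tuple de-duplication are omitted for brevity.

    #include <stdint.h>

    #define MAX_TUPLES 4        /* per-processor bound on recorded tuples (section 6) */

    typedef struct { int proc, interval; } tuple_t;

    typedef struct {
        tuple_t procint[MAX_TUPLES];  /* equivalence classes of the operands */
        int     ntuples;
        uint8_t creat;                /* one CREAT bit per processor         */
        uint8_t dread;                /* one D bit per processor             */
    } loc_state;

    extern int        my_proc, cur_interval;
    extern loc_state *state_of(void *addr);       /* assumed state lookup         */
    extern void       log_interval(int proc);     /* assumed: dump CREAT/D to log */

    /* Step 1: on a shared load. */
    static void on_shared_load(void *addr)
    {
        state_of(addr)->dread |= (uint8_t)(1u << my_proc);
    }

    /* Step 2: on a store; 'ops' are the operand locations of the store. */
    static void on_store(void *addr, void **ops, int nops, int shared)
    {
        loc_state *s = state_of(addr);
        if (shared)
            s->creat |= (uint8_t)(1u << my_proc);
        s->ntuples = 0;                           /* merge operand PROCINTs */
        for (int i = 0; i < nops; i++) {
            loc_state *o = state_of(ops[i]);
            for (int k = 0; k < o->ntuples && s->ntuples < MAX_TUPLES; k++)
                s->procint[s->ntuples++] = o->procint[k];
        }
    }

    /* Step 3: at a synchronization operation. */
    static void on_sync(void *shared_locs[], int nlocs)
    {
        log_interval(my_proc);                    /* write CREAT/D info to the log */
        for (int i = 0; i < nlocs; i++) {
            loc_state *s = state_of(shared_locs[i]);
            s->dread &= (uint8_t)~(1u << my_proc);
            if (s->creat & (1u << my_proc)) {
                s->creat &= (uint8_t)~(1u << my_proc);
                s->procint[0] = (tuple_t){ my_proc, cur_interval };
                s->ntuples = 1;
            }
        }
    }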

5 Static Analysis and Instrumentation

In section 4 we outlined the algorithm to collect the dependence information in terms of equivalence classes. In this section, we explain how we determine the source operands of a store. This is then recorded in terms of the equivalence classes of the last writes to these locations. We obtain this information in several steps. First, as a basic block executes, we record the locations accessed by the loads and stores. We also convert control dependences (as they are encountered during execution) into data dependences. Then, at the end of the basic block, we process this address trace, using information from the basic block, to derive the dependence relations between memory locations. All of our static analyses are carried out with the assembly code output from the compiler.

5.1 Collecting Address Traces for Basic Blocks

We instrument the loads and stores in the program, working with the assembly code.

1. On every load, if the location is shared and was last written
   - on another processor, OR
   - in a previous interval on the same processor,
   set the Pcurr field of D for that location.

2. On every store:
   (a) If the location is in shared memory, set the Pcurr field of CREAT for that location.
   (b) Set the PROCINT field of the location to the merged result of the PROCINT fields of all its operands.
   (c) If the location is in shared memory, merge the PROCINT information of the line corresponding to the store with the values already there.

3. On reaching a synchronization operation:
   (a) Write to the log file:
       - All locations in shared memory that have the Pcurr field of CREAT set, along with their PROCINT fields.
       - All locations in shared memory that have the Pcurr field of D set.
       - All source lines marked CREAT, along with their PROCINT fields.
   (b) Clear the Pcurr field of D for all shared locations.
   (c) If a shared location has its Pcurr field of CREAT set:
       - Reset the Pcurr field of CREAT.
       - Set PROCINT to [Pcurr, current interval].
   (d) If a source line has CREAT set:
       - Reset CREAT.
       - Reset PROCINT.

Figure 5: The run-time algorithm used for obtaining dependence information. Pcurr is the processor executing the algorithm.

The instrumentation we insert collects the address trace for the basic block in a statically allocated area. At the end of the basic block, we insert a call to a function that processes this information. As we fully process the trace generated by each basic block at its end, the space required for collecting the trace (while the basic block is executing) is minimal.

We reserve four registers from the application for the instrumentation. The -ffixed-register option of the Gnu C compiler makes this easy to do. One register each is used for computing the access address and for holding the pointer to the location in which to store the address trace. The hardware platform we use is based on the SPARC v8 architecture, in which user-level code cannot read the integer condition code registers except via branch instructions. Hence, when we insert an instrumentation call between the setting of the condition code (by, say, a compare instruction) and its subsequent use (by, say, a branch instruction), we use these registers to temporarily store the operands of the instruction that sets the condition code. In the SPARC, only ALU instructions affect the condition code. Hence, just prior to the branch, we re-execute the instruction that set the condition codes using the saved values, discarding the result. The instrumentation procedures do not contain any floating point code. Hence, we do not have to take similar measures when inserting calls between the setting of a floating point condition code and its subsequent use.
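The effect of the load/store instrumentation can be pictured with the C sketch below: each instrumented access appends its address to a small static buffer, and the call inserted at the end of the basic block hands the buffer to that block's mirror function. In the real system this is done in SPARC assembly using the reserved registers; the names here are illustrative.

    #include <stddef.h>

    #define TRACE_MAX 64                     /* enough for one basic block */

    static void  *trace_buf[TRACE_MAX];
    static size_t trace_len;

    /* Emitted before every instrumented load or store: record the address. */
    static inline void record_access(void *addr)
    {
        trace_buf[trace_len++] = addr;
    }

    /* Each basic block has a generated mirror that consumes the trace. */
    typedef void (*mirror_fn)(void *const *trace, size_t n);

    /* Emitted at the end of every basic block: process and reset the trace. */
    static inline void end_of_basic_block(mirror_fn mirror)
    {
        mirror(trace_buf, trace_len);        /* runs the algorithm of figure 5 */
        trace_len = 0;                       /* buffer is reused immediately   */
    }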

5.2 Converting Control to Data Dependences

During execution, control dependences must be converted to data dependences. Consider the code segment shown in figure 6. Here the write to x depends not only on y and z, but also on v. Moreover, all accesses made inside doit() (and any function calls it might make) depend on v too.

    if (v == 0) {
        x = y + z
        t = u + x
        doit (t)
    }

Figure 6: A code fragment illustrating control dependences.

The first step in converting control to data dependences is to compute the control flow graph (CFG) [5] of each function in the program. This is a directed graph with nodes representing basic blocks, and edges the possible flow of control. We use the assembly output to compute this graph.

    if (cinfo[ipostdom] is EMPTY)
        Save ccd in cinfo[ipostdom]
        Set cinfo[ipostdom] to FULL
    Merge data dependences of controlling instruction with ccd

Figure 7: Algorithm executed after a controlling basic block.

Next, we use this graph to compute the immediate postdominator of each basic block in a function. The immediate postdominator is defined as follows [5]: if X and Y are CFG nodes, and X appears on every path from Y to the exit node, X postdominates Y. If X postdominates Y, but X != Y, then X strictly postdominates Y. The immediate postdominator of Y is the closest strict postdominator of Y on any path from Y to the exit node. The exit node is the (possibly artificial) node through which all basic blocks exit a function.

We use the statically computed immediate postdominator information to determine the run-time scope of a controlling basic block (one that can change the flow of control). Within this basic block, we refer to the instruction(s) causing control flow as the controlling instruction(s). A store between a controlling basic block B and its immediate postdominator depends on the accesses in B that cause the change in control flow. For instance, in figure 9, stores in basic block F depend on the locations accessed by the controlling instructions of basic block A. The stores in C and D depend on the controlling instructions of B as well.

At any point in the execution, we maintain the equivalence classes of the locations on which a store at that point will be control dependent. We use the term ccd (current control dependence) to refer to this information. The ccd needs to be updated only after executing a controlling basic block or its immediate postdominator. Hence, after executing a controlling basic block, we first save the current ccd in some state associated with the immediate postdominator for that basic block. We then update the ccd with the operands of the controlling instruction in the just-executed basic block. On reaching an immediate postdominator for any controlling basic block, we first restore the saved ccd. As multiple controlling basic blocks may have the same immediate postdominator, we save the ccd only if a previously executed controlling basic block has not already saved it.

Figures 7 and 8 show, respectively, the pseudo-code for the instrumentation added at the end of controlling basic blocks and at the beginning of their immediate postdominators. In both figures, cinfo refers to the area in memory where the ccd information is stored; ipostdom refers to the immediate postdominator. We discuss the storage requirements for the ccd information in greater detail towards the end of this section.

    Restore ccd from cinfo[this basic block]
    Set cinfo[this basic block] to EMPTY

Figure 8: Algorithm executed before the immediate postdominator of a controlling basic block.

    A (if (*u == *v) ...): 1. Save ccd in cinfo[G]  2. Set cinfo[G] to FULL  3. ccd <- ccd, u, v
    B (if (*p) ...):       1. Save ccd in cinfo[E]  2. Set cinfo[E] to FULL  3. ccd <- ccd, p
    C, D, F:               (no controlling instruction)
    E (if (*q) ...):       1. Restore ccd from cinfo[E]  2. Set cinfo[E] to EMPTY
                           3. Don't save ccd (cinfo[G] is FULL)  4. ccd <- ccd, q
    G:                     1. Restore ccd from cinfo[G]  2. Set cinfo[G] to EMPTY

Figure 9: A control flow graph illustrating the working of the algorithms in figures 7 and 8. "ccd <- ccd, p" indicates that the dependence information of p is merged with ccd.

Figure 9 illustrates the working of these two algorithms with an example. Each node represents a basic block in the CFG and is annotated with the controlling instruction (in italics) and the operations carried out to convert the control dependence to a data dependence. Note that basic block G is the immediate postdominator of both A and E. Hence, if control reaches node E, it does not save the ccd in the state associated with G.

Converting control to data dependences requires us to maintain the ccd through function calls. Thus, a callee inherits the ccd at the call site. At the exit basic block(s), the ccd is restored to that in effect when the call was made. The maximum number of concurrently saved ccd's in a function is bounded by the number of immediate postdominators of its controlling basic blocks. This number is independent of the run-time control flow behavior of the function. Each ccd takes a small amount of space for representing the [processor,interval] pairs of the operands on which it depends. Thus, on entry to a function, we allocate the space needed for saving the ccd's in a separate stack. This stack grows as the program makes calls, much like the data stack.
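Figures 7 and 8 translate into C roughly as follows. The cinfo slots are identified here by the immediate postdominator's basic-block id, and the ccd is a small set of [processor,interval] tuples; this representation, and the omission of the per-function ccd stack used across calls, are assumptions of the sketch.

    #include <stdbool.h>

    #define MAX_BLOCKS 1024
    #define CCD_MAX    8

    typedef struct { int proc, interval; } tuple_t;
    typedef struct { tuple_t t[CCD_MAX]; int n; } ccd_t;

    static ccd_t ccd;                         /* current control dependence       */
    static ccd_t cinfo[MAX_BLOCKS];           /* saved ccd, indexed by ipostdom id */
    static bool  cinfo_full[MAX_BLOCKS];

    /* Figure 7: executed after a controlling basic block.  'ipostdom' is the
     * block's immediate postdominator; 'ops'/'nops' are the equivalence
     * classes of the controlling instruction's operands. */
    static void after_controlling_block(int ipostdom, const tuple_t *ops, int nops)
    {
        if (!cinfo_full[ipostdom]) {          /* save only if not already saved */
            cinfo[ipostdom] = ccd;
            cinfo_full[ipostdom] = true;
        }
        for (int i = 0; i < nops && ccd.n < CCD_MAX; i++)
            ccd.t[ccd.n++] = ops[i];          /* ccd <- ccd, operands */
    }

    /* Figure 8: executed on entering an immediate postdominator. */
    static void at_postdominator(int block)
    {
        if (cinfo_full[block]) {
            ccd = cinfo[block];               /* restore the saved ccd */
            cinfo_full[block] = false;
        }
    }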

5.3 Generating the Mirror Functions

A basic block specifies the operands on which each store depends, in terms of the registers live on entering the basic block and the loads performed inside it.


In addition, it also specifies the operands on which registers that are live on exiting the basic block depend. A trace of the memory locations accessed in a basic block is not sufficient to determine the source operands of a store. For instance, in figure 6, the code for the two assignments before the call to doit() could result in a run-time trace

    load y, load z, load u, store x, store t

where we have assumed a compiler that re-orders memory accesses. We therefore scan the assembly code to determine register-register, register-memory and memory-memory dependences for a basic block. This gives us operand information for the stores in the basic block and the registers live on exiting the basic block, in terms of the loads in the basic block and the registers live on entering the basic block.

A possible method of keeping track of the dependences at run-time would be to call an interpreter at the end of every basic block that combines the statically determined information with the run-time address trace. This would provide the operands of a store, which could then be expressed in terms of the equivalence classes used in the algorithm in section 4. As interpreting the basic block information can be slow, we take a different approach. After scanning the basic blocks in the assembly code, we use the register-register, register-memory and memory-memory dependences to generate a mirror basic block. When executed, the mirror takes the run-time address trace from the original basic block as its argument and executes the algorithm in figure 5. This approach shares some similarities with abstract execution [10], which expands a small set of events recorded at run-time into a full trace. However, while our mirror basic blocks can be considered abstract versions of the original basic blocks, we execute them in conjunction with the original program to efficiently compute operand information.

Our prototype currently generates the mirror code in the form of C functions, which we compile and link with the application program. At the end of each basic block, we insert a call to its mirror function, which then maintains the operand information for the basic block just executed. The mirror functions also set all assignments to registers and memory to depend on the current control dependence information (section 5.2). The SPARC architecture uses register windows, causing a function to get some new registers and share some of its registers with its caller (a caller's output and a callee's input registers are the same). Hence, like the current control dependence information, we keep the register information in a stack that grows as the program makes function calls. Instrumentation that we insert at the entry and exit of every function ensures that the mirror functions operate on the proper register set.

Although we could have used an instrumentation system such as ATOM [23] for collecting the per-basic-block address trace, or EEL [11], which also provides control flow information, we would need to extend them to convert control to data dependences and to examine the basic blocks to obtain the operand information. Hence, we could not trivially use these existing instrumentation systems.
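For the basic block of figure 6, a generated mirror function might look like the sketch below. It receives the block's address trace in the order the accesses were recorded and replays the statically known dependence structure, invoking the operand-recording step of figure 5 for each store. The on_store() hook, the trace layout, and the function name are assumptions of this sketch, not the prototype's actual generated code.

    /* Assumed hook into the algorithm of figure 5: 'addr' is the stored
     * location and 'ops' are the addresses of its source operands. */
    extern void on_store(void *addr, void *const *ops, int nops);

    /* Mirror for the block  { x = y + z; t = u + x; }  assuming the trace
     * layout  [ &y, &z, &u, &x(store), &t(store) ]  produced at run time. */
    void mirror_block_17(void *const *trace)
    {
        void *y = trace[0], *z = trace[1], *u = trace[2];
        void *x = trace[3], *t = trace[4];

        /* x = y + z : x depends on the locations read as y and z */
        void *ops_x[] = { y, z };
        on_store(x, ops_x, 2);

        /* t = u + x : t depends on u and on x written earlier in the block */
        void *ops_t[] = { u, x };
        on_store(t, ops_t, 2);
    }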

5.4 Perturbation Effects

The instrumentation we discuss in sections 5.1, 5.2 and 5.3 is highly intrusive and could significantly distort the run-time behavior of the program. Here, we show how the effects of this perturbation can be reduced so that it has negligible impact on the performance debugging. As our goal is to pinpoint sources of program inefficiency that are independent of the target architecture, we need only architecture-independent information (memory accesses) from the execution. Hence, we can first execute the program, tracing only the relative order in which synchronization operations take place. The perturbation introduced in doing this is minimal. The recorded trace of synchronization events can then be used to "replay" the program, enforcing that the synchronization operations take place in the same order. As long as the program is free of data races, this results in the same access ordering as during the "record" phase. This technique and the perturbation it introduces are discussed in greater detail in [19]. We have not yet implemented this technique.

The applications we consider here use only barriers and locks. The locks are used only to enforce mutual exclusion when computing reductions. Hence, for these applications, the instrumentation did not distort the synchronization order. Currently, we use a sparse data structure for storing the state information described in section 4. This permits faster execution of the instrumented code at the expense of extra memory. By using a more compact data structure, we can afford to spend more time during execution, using the "record-replay" technique to reduce perturbation effects.

6 Run–time System

After instrumenting the assembled source program, we compile and link with the mirror functions. Our run-time system is the TreadMarks software distributed shared memory system [9]. TreadMarks provides a shared memory abstraction on distributed memory (message-passing) machines.

In the algorithm described in section 4, when reaching a synchronization point, we need to scan the state associated with shared locations to log data and also to clear the state. We use the virtual memory protection mechanism to do this efficiently. Initially, we protect all shared pages from read and write accesses. On the first access to a page, we record the page as having been accessed and unprotect the page. On reaching a synchronization point, we scan the state for only those pages that have been accessed. Before proceeding, we again protect these pages from all accesses. When the accesses demonstrate spatial locality, this method has a low overhead. TreadMarks already performs these actions for maintaining coherence. Hence, we use TreadMarks' internal data structures to determine the pages that have been accessed.

As the parallel program executes, it generates log information. We compress this data on-the-fly using the Gnu gzip utility and write it to disk in the form of per-process log files. After the program has finished executing, we analyze these files using the performance debugger.
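The page-level access tracking described above can be sketched with the standard mprotect/SIGSEGV idiom shown below. This is illustrative only; the actual system reuses TreadMarks' own fault handling and internal data structures rather than code like this.

    #include <signal.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAX_PAGES 65536

    extern char *shared_base;                 /* start of the shared region   */
    static bool  page_touched[MAX_PAGES];     /* pages accessed this interval */

    /* First access to a protected shared page: note it and open it up.
     * Installed with sigaction(SIGSEGV, ...) using SA_SIGINFO. */
    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        long pagesize = sysconf(_SC_PAGESIZE);
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesize - 1));
        page_touched[(page - shared_base) / pagesize] = true;
        mprotect(page, (size_t)pagesize, PROT_READ | PROT_WRITE);
    }

    /* At a synchronization point: scan and log only the touched pages,
     * then re-protect them for the next interval. */
    static void end_of_interval(size_t npages)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        for (size_t i = 0; i < npages; i++) {
            if (!page_touched[i])
                continue;
            /* ... scan the state for this page and write it to the log ... */
            mprotect(shared_base + i * (size_t)pagesize, (size_t)pagesize, PROT_NONE);
            page_touched[i] = false;
        }
    }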

App        1-P       4-P       4-P, instr   Dilation   Log size
Ocean      15.53     56.39     233.0        4.1        1.899
Shallow    3.58      23.46     339.7        14.5       0.244
Barnes     31.56     14.20     1372.4       122.4      0.657
ILINK      285.71    155.96    26369.6      169.1      2.410

Table 1: Execution characteristics

In our implementation, we limit the number of [processor,interval] tuples used for recording operand information. When a location depends on more [processor,interval] tuples than we can accommodate, we discard the oldest interval and keep the rest. For the experiments we report on, we limit this number to four intervals per processor (per location). During analysis in the performance debugger we can thus encounter operand information that has "overflowed". In this case, we (conservatively) attribute all the reads that are not accounted for to each write in the interval. We then carry out the analysis in section 3.

6.1 Instrumentation Impact

In this section, we report on the instrumentation impact of our approach. The applications and data sets we used are described in section 3.4.1. We ran the programs on a network of four SparcStation-20s connected by a 155 Mbps ATM network.

Table 1 presents the results from running the applications. The first column lists the application. The second column shows the time taken to run the program on one processor (the sequential running time). The next two columns list the times taken to run the uninstrumented and instrumented versions of the program on four processors. All of the times shown are in seconds. The next column lists the execution time dilation as the ratio of the instrumented and uninstrumented parallel execution times. The last column reports the sizes (in Mbytes) of the compressed log files generated during execution.

There are two points worth mentioning. First, the execution time dilation shows a wide disparity. We explain the reason for this below. The dominant cost of instrumentation is the merging of the PROCINT information in the mirror basic blocks. The total number of merge operations during execution is proportional to the number of source operands of the stores (see figure 5). Function calls to the mirror basic blocks, and to the profiling code for memory accesses, also have an impact, though to a much lesser degree. We profiled the sequential versions of the applications to measure the effects of the instrumentation costs. The results are presented in table 2. We used the sequential code here to separate the instrumentation effects from the parallelization quality of the parallel program. Column 2 presents the average rate at which the mirror functions have to merge the source operands for all writes in the program. Columns 3 and 4 present the rate at which instrumentation calls are made. Column 3 presents this figure for mirror function calls, and column 4 for the memory accesses that are instrumented using function calls. As can be seen, Barnes-Hut and ILINK have more operands per store than the other programs; hence they get penalized.

The second point is that the running times and the log file sizes do not exhibit a correlation. This is because the log file sizes are determined by the synchronization rate and the spatial and temporal locality in the accesses of the program. When we write the data to the log files, we merge adjacent address ranges together. Thus, if the program has high spatial and temporal locality, and its synchronization rate is small, we are able to merge a large number of the accesses and represent them more compactly.

App        Operands/sec   Mirror-calls/sec   Acc-calls/sec
Ocean      14213729       561239             2058847
Shallow    53048918       510085             18671110
Barnes     120965950      20655665           21752421
ILINK      127207042      11752997           19690907

Table 2: Instrumentation effects

7 Related Work

Existing methods to improve performance fall into two categories. Compile-time transformations change the control structure or data layout of the program based on static analysis. Run-time performance debuggers observe the program execution and present the gathered information to the programmer.

7.1 Compile–time Techniques

Compile-time techniques restructure either the data or the computation in the program to reduce synchronization and improve locality. Data dependence analysis [15] forms the core of all compile-time data transformations. However, dependence analysis is not very accurate in the presence of procedure calls, unknown loop bounds, indirection arrays and pointer aliases. Computation transformations are centered around loop interchange and loop tiling [15, 26]. These techniques can be applied successfully only to highly regular codes.

Tseng has implemented an algorithm to eliminate barriers [25] in the SUIF compiler for shared memory multiprocessors [2]. Two optimizations are used to reduce overhead and synchronization. By combining adjacent SPMD (fork-join) regions, the overhead of starting up parallel tasks is reduced. By augmenting dependence analysis with communication analysis [8], the compute partitioning is taken into account when checking whether a barrier is needed. In some instances, the barrier can also be replaced by pair-wise synchronization. The performance of the implementation is compared against code generated by the original compiler; modest improvements are reported.

Parallelizing compilers also resort to dependence information obtained at run-time [12, 18]. When a loop cannot be parallelized, the compiler uses the inspector-executor paradigm [21], in which the inspector determines dependences at run-time. It then prepares an execution schedule, which is used by the executor to carry out the operations in the loop. The goal of the run-time analysis is to find if there are true, anti- or output dependences between different iterations of a loop. In contrast, we present a complete, run-time dependence analysis technique that works over the whole program, also determining the source operands of computation.

App        Operands/sec   Mirror-calls/sec   Acc-calls/sec
Ocean          14213729             561239         2058847
Shallow        53048918             510085        18671110
Barnes        120965950           20655665        21752421
ILINK         127207042           11752997        19690907

Table 2: Instrumentation effects


7.2 Run–time Techniques

Paradyn [14] dynamically instruments the program being traced, and searches the execution for performance bottlenecks. The search space is constructed from a set of hypotheses postulated by the tool builder, and is refined until a few hypotheses accurately reflect the performance problem. Quartz [3] uses the normalized processor time metric to rank the contribution of procedures in the program to the overall execution. The resulting listing of the "importance" of different procedures to the execution, a la gprof, can be used for performance tuning.

MemSpy [13] and ParaView [22] concentrate on the memory performance of programs. Both of these tools simulate the program being tuned and present the collected trace information in various formats. MemSpy categorizes cache misses and presents data-oriented statistics. ParaView presents the times spent in computation, synchronization, and the memory hierarchy. It also identifies false sharing in the program.

In contrast to the tools and methods outlined above, we concentrate on automating the analysis of information gathered at run-time. Our approach differs from tools like Paradyn and Quartz that use architecture-dependent metrics based on resource profiles. Instead, by using architecture-independent information from the program execution, we are able to pinpoint sources of program inefficiency that are independent of the target architecture. In contrast to performance tools that rely on simulation, we gather run-time information from an instrumented version of the program that executes in parallel on a parallel machine. Our data-gathering phase therefore takes less time.
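As a rough illustration of the kind of resource-profile metric we are contrasting with, the sketch below accumulates a normalized-processor-time style ranking from sampled profile records. The record layout and function name are our own and are not taken from Quartz.

    /* Each sample says procedure `proc` ran for `dt` seconds while `busy`
     * processors (including this one) were concurrently busy.  Time spent
     * while few processors are busy counts for more, so serial bottlenecks
     * rank high.  Illustrative only, under the stated assumptions. */
    struct sample { int proc; double dt; int busy; };

    void normalized_time(const struct sample *s, int nsamples,
                         double acc[], int nprocedures)
    {
        for (int p = 0; p < nprocedures; p++) acc[p] = 0.0;
        for (int i = 0; i < nsamples; i++)
            acc[s[i].proc] += s[i].dt / s[i].busy;   /* busy >= 1 assumed */
    }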

References

[1] S. V. Adve and M. D. Hill. A unified formalization of four shared-memory models. IEEE Transactions on Parallel and Distributed Systems, 4(6):613-624, June 1993.
[2] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
[3] T. E. Anderson and E. D. Lazowska. Quartz: A tool for tuning parallel program performance. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '90), May 1990.
[4] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Boston, Massachusetts, 1988.
[5] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, October 1991.
[6] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.
[7] S. K. Gupta, A. A. Schaffer, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. Integrating parallelization strategies for linkage analysis. Computers and Biomedical Research, 28:116-139, June 1995.
[8] S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.
[9] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115-131, January 1994.
[10] J. R. Larus. Abstract execution: A technique for efficiently tracing programs. Software Practice and Experience, 20(12):1241-1258, December 1990.
[11] J. R. Larus and E. Schnarr. EEL: Machine-independent executable editing. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.

8 Conclusions

The goal of our work is to assist inexperienced programmers in developing efficient parallel programs. To this end, we have developed a new approach to performance debugging that focuses on automatically analyzing a program execution to arrive at computation transformations that reduce synchronization and communication. This requires knowledge of the cross-processor data and control dependences encountered at run-time, and of the source operands of computation.

During program execution, every write to memory can be used to generate information that links the write to its operands. Collecting operand information at this level of detail is infeasible for all but very small applications. Instead, we group writes into equivalence classes and collect operand information at this higher level, making the problem tractable. Our performance debugger then analyzes this information and suggests, in terms of the source code, computation that can be postponed or relocated.

We present the transformations suggested by the debugger on a suite of four applications. Three of the applications, Barnes-Hut, Ocean, and Shallow, are benchmark programs; the fourth, ILINK, is a genetic linkage analysis application widely used by geneticists. For Barnes-Hut and Shallow, implementing the debugger suggestions improved the performance by a factor of 1.32 and 34 times respectively on an 8-processor IBM SP2. ILINK is already well optimized; we used it only to demonstrate the feasibility of our approach to long-running applications.

We also give details on how our approach can be implemented. We use two novel techniques in our implementation to convert control dependences to data dependences, and to compute the source operands of stores. We carry out the former by using the immediate postdominator information to save and restore the control dependences in effect when entering and leaving the scope of a controlling basic block (one that affects control flow). To determine the source operands of computation, we instrument the basic blocks of the program. When an instrumented basic block executes, it produces an access trace that we feed to its mirror basic block. The mirror is generated from the original basic block and uses this trace to determine the source operands for the computation in the original basic block.

We report on the impact of our instrumentation on the same application suite we use for performance debugging. The instrumentation slows down the execution by a factor of between 4 and 169. The log files produced during execution were all less than 2.5 Mbytes in size.
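As a deliberately simplified illustration of the save/restore idea for converting control dependences to data dependences, the sketch below shows the kind of calls one could emit around a controlling basic block. The names push_control and pop_control and their placement are our own illustration, not the tool's actual interface.

    /* A branch on `cond` makes the writes it guards control dependent on the
     * write that produced `cond`.  Recording that writer when the controlling
     * block executes, and restoring the previous state at the branch's
     * immediate postdominator, turns the control dependence into a data
     * dependence on the recorded write.  Illustrative names only. */
    extern void push_control(void *cond_addr);  /* save state; record cond's writer */
    extern void pop_control(void);              /* restore at immediate postdominator */

    void example(int *cond, int *a, int *b)
    {
        push_control(cond);          /* entering the controlling block's scope   */
        if (*cond)
            *a = 1;                  /* now also depends on the writer of *cond  */
        else
            *b = 2;                  /* likewise                                 */
        pop_control();               /* immediate postdominator of the branch    */
    }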

[12] S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proceedings of the 1993 Conference on the Principles and Practice of Parallel Programming, pages 83-91, May 1993.
[13] M. Martonosi, A. Gupta, and T. E. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4), April 1995.
[14] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn parallel performance measurement tools. IEEE Computer, 28(11), November 1995.
[15] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.
[16] D. Perkovic and P. Keleher. Online data-race detection via coherency guarantees. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation, November 1996.
[17] R. Rajamony and A. L. Cox. A performance debugger for eliminating excess synchronization in shared-memory parallel programs. In Proceedings of the Fourth International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 250-256, February 1996.
[18] L. Rauchwerger, N. M. Amato, and D. A. Padua. Run-time methods for parallelizing partially parallel loops. In Proceedings of the 1995 International Conference on Supercomputing, July 1995.
[19] M. Ronsse and W. Zwaenepoel. Execution replay for TreadMarks. In Proceedings of the Fifth EUROMICRO Workshop on Parallel and Distributed Processing, January 1997.
[20] R. Sadourny. The dynamics of finite-difference models of the shallow-water equations. Journal of Atmospheric Sciences, 32(4), April 1975.
[21] J. Saltz, R. Mirchandaney, and K. Crowley. Runtime parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603-612, May 1991.
[22] E. Speight and J. K. Bennett. ParaView: Performance debugging of shared-memory parallel programs. Technical Report ELEC TR 9403, Rice University, March 1994.
[23] A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994.
[24] J. Torrellas, M. S. Lam, and J. L. Hennessy. False sharing and spatial locality in multiprocessor caches. IEEE Transactions on Computers, 43(6):651-663, June 1994.
[25] C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In Proceedings of the 5th Symposium on the Principles and Practice of Parallel Programming, July 1995.
[26] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991.
[27] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, June 1995.
