Cluster Computing Conference, 1997

Structural Prediction Models for High-Performance Distributed Applications

Jennifer M. Schopf
Computer Science and Engineering Department
University of California, San Diego
http://www.cs.ucsd.edu/users/jenny
[email protected]

Abstract

We present a structural performance model to predict an application's performance on a set of distributed resources. We decompose application performance in accordance with the structure of the application, that is, into interacting component models that correspond to component sub-tasks. Then, using the application profile and available information as guides, we select models for each component appropriately. This allows different modeling approaches for different application components as needed. As a proof-of-concept, we have implemented this approach for two distributed applications, a master-slave genetic algorithm code and a red-black stencil successive over-relaxation code. We achieve predictions within 10% while demonstrating the flexibility this framework allows.

1 Introduction

Clusters of distributed machines have become a common platform for high-performance applications but remain a challenging environment in which to achieve good performance. One reason for this is the difficulty of predicting an application's execution time in this dynamic setting where only minimal information may be available. In particular, as a part of the AppLeS scheduling project [BW96, BWF+96], we are interested in estimating the performance of applications for use in dynamic scheduling. Because this environment combines varying non-dedicated resources and possibly dissimilar implementation paradigms, many established modeling techniques do not satisfy the needs of our scheduler.

The goal of this paper is to describe our approach to modeling high-performance distributed parallel applications. We intend this approach to be applicable to production distributed systems with their constraints on information availability and inherent variability. Since no single modeling approach may have usable results for all applications in this environment, we allow for the needed flexibility to choose the best-suited model for each application component. By examining an explicit approach to modeling, we are able to gain insights into the reasons performance predictions are difficult, and to show where additional effort is best spent.

This paper is organized as follows: Section 2 provides context and describes an example application. Section 3 presents our structural approach to modeling. We decompose the performance model in accordance with the structure of the application and its constituent sub-tasks. Then, using a high-level application profile and available input information as guides, we choose separate component models appropriately, as described in Section 4. Section 5 presents the results of this approach for a genetic algorithm, and Section 6 for a successive over-relaxation code, both demonstrating the need for a flexible framework approach. We conclude in Section 7 with future work.

2 Context

In this section we give some basic application characteristics for distributed parallel applications, and describe how our approach addresses each characteristic. Then we present a master-slave application to be used as an example throughout.

2.1 Application Characteristics

Through conversations with developers [ASWB95], we found several common characteristics for distributed parallel applications. A good prediction model should address these concerns, and we have developed our approach to do so.

This research is supported in part by NASA GSRP grant #NGT-1-52133. This paper is also available as UCSD CSE Technical Report #CS97-528.


Very coarse grain. We mirror the coarse grain structure of the application by breaking the overall prediction model into coarse-grain components; individual models are then chosen to represent these components. Since component sub-tasks of the application are often implemented with a modular approach, shifting from one implementation to another depending on the specific architecture or load characteristics at run-time, we allow component models to be exchanged individually as needed.

Small number of sub-tasks. Commonly, applications are broken into a small number of sub-tasks, usually fewer than five [CS95, DS95, DTMH96, MMFM93, PRW94, WK93, YB95]. This is a manageable number to model individually. In addition, we can use an application profile to help direct our efforts to model the most important components in more detail.

Implementations are different. Sub-tasks are implemented in different languages using different computational paradigms (data parallel, sequential, vector, etc.), so no single modeling approach will suit all possible implementations. Even if we had equally detailed information, one approach (e.g. using a benchmark model for computation time) might be adequate for one implementation and still achieve insufficient accuracy for another. This is due to both application features (non-determinism, unstructured code, etc.) and system features (dramatically varying contention, poor memory performance, etc.). In some cases we may not even have the same information about two different implementations, so there must be a choice of models available.

A sub-task may have multiple implementations. Each sub-task may have multiple implementations, each tuned to a different platform or performance characteristic. We would like our application models to apply when different implementations are used. Our framework allows a prediction to be updated by simply changing the component model corresponding to an altered sub-task implementation, since the application structure is represented separately from the underlying component models.

Not all sub-tasks are created equal. Sub-tasks may contribute unequally to execution time. There is no sense in spending a great deal of time exactly modeling a sub-task whose execution is 1% of the application's execution time. This is especially true if another sub-task (using, say, 50% of the execution time) is modeled poorly. The error in the model of the larger component will hide any benefit from the accurate model for the smaller component. In using the application profile as a guide, we make sure our modeling effort is well spent.

Developers can supply basic information. From our work in application-centric scheduling we found that application developers could often supply the low-level details specific to their application domains. That is, while a single developer may not know all of the low-level details for his or her platform, the details that are relevant to that application will be understood. Included in those details are a knowledge of the structure and profile information for the application. We have tried to allow for a variety of sources of application data of varying qualities and use. As in any setting, more (and better) data will lead to better predictions, so our approach also allows the upgrading of information if initial data leads to flawed predictions. Examples of the possible data to be supplied are: a time per element estimate for the computation components (in our experience, these are achieved most often through operation counts or benchmarks); message sizes and bandwidth or latency information for communication components; and size of data and memory speeds for memory components. This level of information is generally available to some degree of accuracy since distributed parallel codes are often highly tuned.

One common application type that has these characteristics is master-slave. Below we describe an example master-slave application which we will use to explicate our approach in the following sections.

2.2 Master-Slave Example Application

We demonstrate our approach using a common application paradigm for cluster computing, namely master-slave computations. This computational paradigm is widely used in the cluster computing setting due to the simple programming style and coarse structure of tasks. For many problems it is easy to load balance across the slave computations running in parallel and achieve good performance on heterogeneous workstations of differing performance abilities.

Suppose our example application consists of a Master task that determines how much work each of P Slave tasks performs. Also suppose that the Master communicates to the Slaves using a Scatter routine, perhaps a multicast, and receives information back from the Slaves with a Gather routine, perhaps a series of receives. This is represented by the graph in Figure 1 and the pseudocode in Figure 2. Using this example we will walk through our structural modeling approach.

[Figure 1. Graph of Master-Slave example execution: the Master performs a Scatter to Slave1, Slave2, ..., SlaveP, and the Slaves' results are returned to the Master by a Gather.]

    (1) For i = 1 to MaxIterations
    (2)     Master Computation
    (3)     Broadcast Data
    (4)     Each Slave:
    (5)         Receive Data
    (6)         Compute Slave Work
    (7)         Send Data to Master
    (8)     Master receives all Slave data

Figure 2. Pseudocode for Sample Master-Slave Computation.

3 Our Approach

In general, a performance model can be thought of as a black box that uses two inputs, a problem size and a resource set, to predict execution time. For our purposes, a model is an equation expressing the performance of an application or a part of an application. Informally, we follow four steps to build our prediction models. First, we examine the structure of the application and construct a top-level model to represent this structure. Second, we use an application profile that describes where execution time for the application is spent to determine which components must be modeled to achieve the needed accuracy. Third, available data sources and error allowance guide the choice of component models. Finally, we analyze the accuracy of the model and determine a course of action.

We address two types of models, top-level structural models and constituent component models. Structural models represent the top-most functionality of the application. We define a structural model to be a delineation of the functional execution of an application to the level of independent sub-tasks and their interactions. These models consist of component models and interaction operators, such as Max, Sum, or +. All interactions between modeled components (including overlap) are reflected in the top-most structural model.

Component models represent the performance of individual components, or sub-tasks, where a sub-task is some functional unit of the application. The definition of sub-task varies between applications: for some it is a function [CS95], for others an inner loop where most of the work is done [DS95], and for others a sub-task of their application is another application that can stand on its own [MMFM93]. We define a component to be a free-standing operational unit of the application that is not functionally split between two machines. That is, several machines can perform the same component (for example, the data-parallel slave operation in the example); however, it will not be split into two or more constituent components, each running on a different machine. Component models are discussed in detail in Section 3.2.

Interaction operators show the interactions between component models. For example, we use the operator + as a combine operation. If the results of the components are execution times, then + will be equivalent to addition. However, it is possible that additional data such as variances or error projections will be included for each component model, in which case the combine operation would need to take these factors into account. Likewise with other interaction operations, such as Max below.

3.1 Structural Models

We use an application developer's description of an application to construct the structural model. Often in describing an application a developer will use a graphical representation like a program dependency graph (PDG) [FOW84], a more detailed representation such as Zoom [ASWB95], or even a visual programming language representation such as HeNCE [BDGM93], CODE 2.0 [NB92], Enterprise [SSLP93] or VPE [DN95]. From these graphical representations, defining a structural model is straightforward. As a part of the AppLeS project [BW96, BWF+96], we are building a library of structural templates to guide the structural model definition process. Certain classes of applications (master-slave, stencil, etc.) are common, and we believe that a library of common structural models will cover the most common applications. Notationally, we show component models in boldface in the structural models below, and the component interactions in italics.

3.1.1 Example: Structural Model

We provide a structural model for the example master-slave application. Our model has four components: one for the Master computation, one for the Scatter communication, one for the Slave tasks and one for the Gather communication. Throughout, our execution time is wall clock time.

One possible top-level structural model is:

    ExTime = Master + Scat + Max_i[Slave_i] + Gat    (1)

where

- ExTime: Overall execution time for the application,
- Master: Execution time for the Master computation (line 2 of the pseudocode in Figure 2),
- Scat: Execution time for the Master to send data to all of the Slaves, and for it to be received (lines 3 and 5),
- Max: A maximum function,
- Slave_i: Execution time for the Slave computation on processor i (line 6), and
- Gat: Execution time for all Slaves to send data to the Master, and the Master to receive it (lines 7 and 8).

This structural model makes several assumptions about the interactions of the component tasks. The first is that the time for the slave tasks is equal to Max_i[Slave_i]; that is, the amount of time for the total slave operation is equal to the maximum over all the individual slave operations. To evaluate this function, all i slave times must be assessed. If the code were load balanced (so that each slave processor received the right amount of data to balance the compute times), we could minimize the complexity of this function by calculating only one slave time. In this case, we have:

    ExTime = Master + Scat + Slave_i + Gat    (2)

for some Slave task i.

Another assumption both these structural models make is that there is a synchronization between each sub-task. This may be the case if the processor running the Master task also has a Slave task, and the entire application is load balanced well. More likely (and especially if there is no Slave process on the Master processor), there will only be a synchronization point at the Master component. In this case we would have the following structural model:

    ExTime = Master + Max_i[Scat_i + Slave_i + Gat_i]    (3)

where

- Scat_i: Execution time for the Master to send data to Slave_i, and for it to be received, and
- Gat_i: Execution time for Slave_i to send data to the Master, and for it to be received.

Each of these structural models is suitable for master-slave applications with different implementations or environments. Determining whether the proper one has been chosen is part of the error analysis process discussed in Section 4.3. Once a structural model has been defined, individual models must be built for the constituent components.

3.1.2 Related Work: Structural Models

One common approach to modeling parallel applications is to separate the performance into application characteristics and system characteristics [TB86, Moh84, KME89, ML90]. This approach may also be adequate when the underlying workstations are very similar (for example, Zhang [YZS96] considers only workstations that do floating point arithmetic identically). However, in the cluster environment this line is not easily drawn. We do not separate application and system characteristics at the structural level because developers address their codes as implementations, that is, combinations of both application and system. For example, a task will be tuned to fit an architecture's specific cache size or to use an architecture-specific library routine.

The structural modeling approach is similar in spirit to the use of skeletons for parallel programming [Col89, DFH+93]. With skeletons, useful patterns of parallel computation and interactions are packaged together as a construct, and then parameterized with other pieces of code. Such constructs are skeletons in that they have structure, but lack detail, much as the top-most structural model shows the structure of the application with respect to its constituent task implementations, but the details of the tasks themselves are supplied by individual component models. Similarly, Chandy [Cha94] addresses programming archetypes, abstractions from a class of programs with a common structure that include class-specific design strategies and a collection of example program designs and implementations, optimized for a collection of target machines. However, this work concentrates on deriving program specifications for reasoning about correctness and performance, not developing models for the same.
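Before turning to the component models themselves, a minimal Python sketch makes the separation concrete. It implements Equation 3, assuming each component model is simply a callable returning a predicted time in seconds; the component models and the values they return below are invented placeholders, not measurements from this paper.

    from typing import Callable, List

    ComponentModel = Callable[[], float]  # returns predicted seconds

    def structural_model(master: ComponentModel,
                         scat: List[ComponentModel],
                         slave: List[ComponentModel],
                         gat: List[ComponentModel]) -> float:
        """ExTime = Master + Max_i[Scat_i + Slave_i + Gat_i] (Equation 3)."""
        per_slave = [s() + w() + g() for s, w, g in zip(scat, slave, gat)]
        return master() + max(per_slave)

    # Placeholder component models for two slaves; any entry could be
    # swapped for a benchmark- or operation-count-based model without
    # touching the structural model itself.
    master = lambda: 0.5
    scat = [lambda: 0.10, lambda: 0.20]
    slave = [lambda: 3.00, lambda: 2.50]
    gat = [lambda: 0.10, lambda: 0.15]

    print(structural_model(master, scat, slave, gat))  # ~3.7 seconds

Because the structure is captured separately, exchanging one component model for another leaves the structural model itself unchanged, which is exactly the flexibility the framework is meant to provide.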

3.2 Component Models

Our models consist not only of a structural model but also of underlying component models for each sub-task. As stated above, a sub-task is a functional unit of the application that is not split between two machines. These component models have two important features. First, they are models of implementations that combine both application and system characteristics, and a single sub-task may have several different implementations, each needing its own model. Second, a single implementation may have several different model choices to select from, depending on available information, needed accuracy, use of the model, etc. In this section we discuss some possible component models for the master-slave example. Their selection is discussed in Section 4.

Component models are defined, possibly recursively, as combinations of input values (benchmarks, constants, arithmetical operations, etc.) and/or other component models. In the equations, component models are labeled with boldface, and input values with italics. A given component model may have several possible instantiations. We envision a suite of component models where selection is based on available information and needed accuracy according to the application profile.

Input information for the models can come from several places. As a part of the AppLeS project, application developers fill out a Heterogeneous Application Template (HAT) form that contains data about the general structure, specific implementations, and the interface between two given sub-task implementations. In the MARS system [GR96] this data is obtained by instrumenting the code and using an application monitor to record the application's behavior. System-specific data, such as bandwidth values, CPU capacity, memory sizes, etc., can be supplied by system databases as envisioned for most resource management systems (such as the Meta-Computing Information Service (MCIS) for the Globus project [FK97] or the Host Object Database for Legion [Kar96, GWtLt97]) or by online tools such as the Network Weather Service [Wol96], which can supply dynamic values for bandwidth, CPU usage and memory on a given system.

3.2.1 Example: Communication Components

As an example of what we mean by component models, let's examine the Scatter routine from the example application in more depth. On different resource management systems, and on distinct architectures controlled by a single resource management system, Scatter and Gather routines may be implemented differently. In addition, these routines can be affected by the computation of both the master and the slave [FB96], so a model that provides accurate predictions in a production setting may not be intuitive.

If we had the structural model in Equation 1 or 2, then we would need to examine the Scatter routine as a whole. There are three straightforward ways to do this. If the Scatter is implemented naively as a series of sequential sends of equal sizes, then the time it will take to execute will be the sum of P point-to-point sends from the Master to the P Slaves. If the sends are executed in parallel, the time would be the maximum of the P point-to-point sends. A third way to predict the execution time would be to use a benchmark specific to that system and the number of slaves. The performance models for these three cases are:

    Scat1 = Sum_i[PtToPt(M, S_i)]
    Scat2 = Max_i[PtToPt(M, S_i)]
    Scat3 = MulticastBM(P)

where

    PtToPt(x, y) = NumElt * Size(Elt) / BW(x, y)    (4)

and

- NumElt: Number of elements in a message,
- Size(Elt): Size of a single element in bytes,
- BW(x, y): Bandwidth in bytes per second between x and y,
- MulticastBM(P): Benchmark for multicast, parameterized by P, the number of processors in the receiving group, also the number of Slaves in this case.

The component models Scat1 and Scat2 are defined in terms of both input values that would need to be supplied by the user or system, and another component model for the point-to-point communication times, namely PtToPt. This nested structure is common in the component models. Scat3 is defined in terms of only the input benchmark for the multicast. It is unlikely that this benchmark would be feasible in a heterogeneous environment. Group operations are not only non-deterministic on a distributed platform, but even in a dedicated heterogeneous environment many factors play a part in their execution, and benchmarks are, at best, unreliable. This is in contrast to bandwidth and latency figures, of which rough estimates are almost always available.

If instead we were using a structural model like Equation 3, we would need only the time for one point-to-point message, and could use the PtToPt model above. If this were not sufficient we could extend it with latency or contention information:

    PtToPt2(x, y) = Lat(x) + PtToPt(x, y)
    PtToPt3(x, y) = Contention(x, y) * PtToPt(x, y)

where

- Lat(x): Latency or startup costs for a message on processor x, and
- Contention(x, y): Fraction of bandwidth between processor x and processor y available to this message.
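As a concrete illustration, the following Python sketch implements PtToPt (Equation 4) and the three Scatter variants. The message sizes and bandwidth values are invented for illustration, and Scat3 simply returns whatever multicast benchmark value is supplied.

    def pt_to_pt(num_elt: int, size_elt: int, bw: float) -> float:
        """PtToPt(x, y) = NumElt * Size(Elt) / BW(x, y), in seconds."""
        return num_elt * size_elt / bw

    def scat1(msgs) -> float:
        """Sequential sends: the sum of the P point-to-point times."""
        return sum(pt_to_pt(n, s, bw) for n, s, bw in msgs)

    def scat2(msgs) -> float:
        """Parallel sends: the maximum of the P point-to-point times."""
        return max(pt_to_pt(n, s, bw) for n, s, bw in msgs)

    def scat3(multicast_bm: float) -> float:
        """A system-specific multicast benchmark, parameterized by P."""
        return multicast_bm

    # One (NumElt, Size(Elt) in bytes, BW in bytes/sec) tuple per Slave.
    msgs = [(1000, 8, 1.0e6), (1000, 8, 5.0e5), (1000, 8, 1.0e6)]
    print(scat1(msgs))  # ~0.032 s
    print(scat2(msgs))  # ~0.016 s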

3.2.2 Example: Computation Components

Likewise, the two computation components of the code have several possible models, which we will demonstrate for the Slave component. Most estimates of computation are based on evaluating some time per data element, and then multiplying by the number of elements being computed for the overall problem. There are two widely used approaches for this: counting the number of operations involved in computing one element, and benchmarking. The performance models for these two cases are:

    Slave_i1 = NumElt_i * Op(i, Elt) / CPU_i
    Slave_i2 = NumElt_i * BM(Slave_i)

where

- NumElt_i: Number of elements computed by Slave_i,
- Op(i, Elt): Number of operations to compute a single element on Slave processor i,
- CPU_i: Time to perform one operation on Slave processor i, and
- BM(Slave_i): Benchmark time for Slave processor i to process a single element.

If contention on the machines affected the computation time of the slave routines, we might want to add a slowdown (SD) factor to the above equations:

    LoadedS_i = SD * Slave_i1
    LoadedS_i = SD * Slave_i2

The model for SD could be a dynamically supplied value, for example from Wolski's Network Weather Service [Wol96], or from an analytical model of contention, for example [FB96, LS93, ZY95].

3.2.3 Related Work: Components

In some sense, all modeling work grounded in practical methods can be considered related work to the component model approach. We hope to leverage off of existing modeling approaches to build the component models, and to make selections between them. One of the problems we have encountered in doing this is that many approaches to modeling assume the availability of pieces of data that we do not have for our system. Thus, we have grounded our work in the models used for current applications in this area [CS95, DS95, DTMH96, MMFM93, PRW94, WK93, YB95]. These models all take into account complexity of computation, information availability, and the accuracy needed for predictions in this setting.

4 Component Model Selection

Our model selection is based on the principle of using the simplest model that provides accurate results. Model selection is based on several factors. The application profile guides selection by showing where execution time is spent, and thus where more accurate models are needed. In addition, the available information for a given component will limit the possible model choices. After these decisions are made, error analysis is performed on the models, and component selection may be altered. These three forms of selection are discussed in the following subsections.

4.1 Application Profile

For many applications, execution time is concentrated in a subset of the sub-tasks. In order to best spend our modeling effort, we use an application profile to identify these areas of execution significance. A profile is data that describes the significant characteristics of some object or set of objects. For our purposes, an application profile describes where execution time is spent with respect to the components of the model. One easy, and common, way to express this is by using estimated percentage values for each component, given a fixed set of resources and a medial problem size. Admittedly, this is a difficult value to define exactly, as it will depend on many factors: architecture, specific implementations of the application, load on relevant machines, problem size, etc. In practice, these values can be found through profiling tools [GKM82, Sil] or timing runs of the application, or an estimated value can be supplied by the application developer. Often, these estimates are accurate enough for our purposes, namely deciding the component models on which to concentrate our modeling effort.

4.1.1 Example: Application Profile

For example, on a congested network with a large amount of data flowing between the master and slave applications, we might have a profile for the example application of:

    Master = 10%    Scat = 40%    Max_i[Slave_i] = 10%    Gat = 40%

In this case the most important pieces to model would be the Scatter and Gather components. If instead the networks were lightly loaded, the amount of information shared between the Master and the Slaves was small, or the workstations were computationally slow, we might have the following profile:

    Master = 10%    Scat = 5%    Max_i[Slave_i] = 80%    Gat = 5%

In this case, we could concentrate our modeling efforts on just the Slave component. The number of components modeled will depend in part on the needed accuracy for the use of the model. A small sketch of this profile-guided selection appears below.
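The following Python sketch ties the profile to component-model choice, assuming a simple cutoff rule: model in detail any component above the cutoff, set the rest to zero, and then choose between the Slave models of Section 3.2.2. The 25% cutoff, operation counts, rates, and benchmark values are all invented; note also that the sketch reads CPU_i as an operation rate (operations per second) so the estimate comes out in seconds.

    def components_to_model(profile: dict, cutoff: float = 0.25) -> list:
        """Model only the components that dominate the profile."""
        return [name for name, share in profile.items() if share >= cutoff]

    def slave_op_count(num_elt: int, ops_per_elt: int, cpu_rate: float) -> float:
        """Slave_i1 = NumElt_i * Op(i, Elt) / CPU_i (operation-count model)."""
        return num_elt * ops_per_elt / cpu_rate

    def slave_benchmark(num_elt: int, bm_per_elt: float) -> float:
        """Slave_i2 = NumElt_i * BM(Slave_i) (benchmark model)."""
        return num_elt * bm_per_elt

    def loaded(sd: float, slave_time: float) -> float:
        """LoadedS_i = SD * Slave_i: scale by a dynamic slowdown factor."""
        return sd * slave_time

    light = {"Master": 0.10, "Scat": 0.05, "Slave": 0.80, "Gat": 0.05}
    print(components_to_model(light))                 # ['Slave']
    print(slave_op_count(500, 2000, 2.0e7))           # 0.05 s
    print(loaded(1.3, slave_benchmark(500, 1.0e-4)))  # ~0.065 s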

4.1.2 Related Work: Application Profile

Various application profiles have been addressed in the literature. A detailed application profile is of common use in compiler optimizations [Wal90, CGS95]. For program comprehension and tuning, tools like gprof [GKM82] and pixie [Sil] can supply application profile tables of data to help determine the percentage of time spent in a function, how many times a basic block was run, how often a global variable was accessed, and, for some tools, even the number of memory accesses and how many misses occur at each level of memory.

Many scheduling approaches use what can be thought of as an application profile for the performance prediction portion of their scheduling methodology. They require information about the frequency and average execution time of each I/O operation, memory read or write, and floating point operation throughout the application. Obtaining this level of information is difficult and often time consuming, if even possible. Zhang et al. [YZS96] have developed a tool to measure the basic timing results needed for their approach, but they avoid the problem of having the profile change with respect to problem size by analyzing a single problem size over a range of possible workstation networks. Others, like Simon and Wierum [SW96], require complex microbenchmarks to be run on each system.

4.2 Selection Based on Available Input Information

We are currently basing component selection on the sources of information that are available. The basis for this approach is to use the simplest model that provides the level of needed accuracy. Simple models are based on simple input parameters, which are more often available than more detailed or esoteric input parameters. As data for a component model is likely to be scarce, the possible choices of a model are narrowed significantly. In situations where several choices are possible, we also plan to analyze accuracy issues and their tradeoff with the complexity of the chosen model. Our framework provides the flexibility needed to allow for additional information at this level.

4.2.1 Related Work: Selection

Related work to our approach can be found in the area of algorithm and platform classifications. Jamieson [Jam87] examines the relationship between architecture and algorithm characteristics for signal processing applications. This work attempts to identify the influence of specific application characteristics on architecture characteristics. Saavedra-Barrera [SBSM89] defined micro-benchmarks to characterize machine performance, and defined a visualization technique called pershape to provide a quantitative way of measuring the performance similarities of multiple machines. Both of these efforts provide some quantification of application/architecture affinity.

4.3 Error Analysis

Finally, we analyze the accuracy of the model and determine a course of action. The degree of accuracy needed by a prediction model varies with the use of the model. For example, in the AppLeS scheduling agents, when models are used as a part of the Resource Selection subsystem, models can be fairly inaccurate as long as they give a valid relative ranking of resource sets. However, in the Performance Prediction subsystem, models should provide an exact estimate of high accuracy. If our model is not sufficiently accurate for its use, there are several transformations we can perform (a sketch of the error-analysis step follows this list):

- Change a part of a component model. This is done to avoid using input data that is inaccurate.
- Refine a component model by increasing the detail. This is done when an execution property is not being modeled within a component: for example, memory hierarchy behavior or communication contention over a shared network.
- Add other component models. This is done in accordance with the profile, or when there is evidence that the profile is in error.
- Re-structure the top-level model. This is done when tasks or task interactions are not being captured by the current model approach.

Examples of each of these appear in Section 6.5. Evaluating the exact cause of an error in a prediction involves making use of additional information about the input data. Several forms of data about input data are common: variance information showing how values may range in time, confidence factors demonstrating a developer's confidence in the accuracy of the data, or history information if the values are input repetitively. In addition, if a given application has been modeled previously, there may be information about the component values used and their accuracy. This is the subject of our current research.
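The Python sketch below illustrates the error-analysis step, assuming per-component timings can be measured for a validation run; the component names, times, and the 10% tolerance are invented examples. A component whose prediction misses by more than the tolerance becomes the candidate for one of the four transformations above.

    def diagnose(predicted: dict, measured: dict, tol: float = 0.10) -> list:
        """Return the components whose predictions miss the measured times."""
        return [name for name, pred in predicted.items()
                if abs(pred - measured[name]) > tol * measured[name]]

    predicted = {"Master": 1.0, "Scat": 2.0, "Slave": 30.0, "Gat": 2.0}
    measured  = {"Master": 1.1, "Scat": 2.1, "Slave": 48.0, "Gat": 2.2}
    print(diagnose(predicted, measured))  # ['Slave'] -> refine that model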

5 Modeling a Genetic Algorithm

As an example from the master-slave class of applications, we have developed performance models for a genetic algorithm (GA) optimization for the Traveling Salesman Problem (TSP) [LLKS85, WSF89]. Genetic algorithms were originally developed by the artificial intelligence community as an optimization technique for NP-complete and NP-hard problems. As such, they are now being used by several groups of computational scientists [DTMH96, SK96, PM94] to address problems such as protein folding. Our distributed implementation uses a global population with synchronization between generations [Bha96]. It was written in C using PVM on a heterogeneous cluster of Sparc and RS6000 workstations in the Parallel Computation Laboratory (PCL) at UCSD.

5.1 Structural Model for GA

This application is structured much like the master-slave example presented in Section 2.2. All of the Slaves operate on a global population (each member of the population is a solution to the TSP for a given set of cities) which is broadcast to them by the Master using a PVM multicast routine. Each Slave works in isolation to create a specified number of children (representing tours), and to evaluate them (in this case each child indicates how long a tour will take). This data is sent back to the Master. Once all the sets of children are received by the Master, they are sorted (by efficiency of the tour), some percentage are chosen to be the next generation, and the cycle begins again. Figure 3 shows the pseudocode for this application. In our implementation, no Slave process runs on the Master processor. The effect of this is that synchronization occurs only at the Master component.

    For i = 1 to NumberofGenerations
        Master startup
        Mcast entire population
        Each Slave:
            Receive population from Master
            For j = 1 to NumberofChildren
                Randomly pick two parents
                Cross parents to create child
                Evaluate child
            Send all children to Master
        Master receives all children subsets
        Master sorts children
        Percentage for next generation is selected
    Return best child

Figure 3. Pseudocode for GA Application.

The first step in modeling this application is to build a top-level structural model. Because of the synchronization structure, we will use the same structural model as Equation 3:

    ExTime = Master + Max_i[Scat_i + Slave_i + Gat_i]    (5)

We now need to identify possible component models for the four components identified here.

5.2 Modeling GA Computation

We examined two methods for evaluating the computational tasks of this application: counting the number of operations, and using benchmarks. In the case of operation counts, we have the following models for the Master and Slave_i computations:

    Master_1 = NumElt * Op(i, Elt) / CPU_i    (6)
    Slave_i1 = N_i * Op(i, Elt) / CPU_i    (7)

where

- NumElt: Number of elements computed by the Master,
- Op(i, Elt): Number of operations to compute a single element on processor i,
- CPU_i: Time to perform one operation on processor i, and
- N_i: Number of individuals Slave_i generates.

For many high-performance applications, the majority of the computation takes place in well-defined inner loops, so an accurate estimate of Op(i, Elt) can be achieved by evaluating the number of operations in the inner loops. However, the computations involved in the GA Slave routines are involved and non-deterministic. Therefore, we also used benchmark models to determine the Master and Slave computation time per element. For the Master and Slave_i computations, we have the following models:

    Master_2 = NumElt * BM(M)    (8)
    Slave_i2 = N_i * BM(S_i)    (9)

where

- NumElt: Number of elements computed by the Master,
- BM(M): Time for the Master processor to process an individual,
- N_i: Number of individuals Slave_i generates, and
- BM(S_i): Time for Slave processor i to generate and evaluate an individual.
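Assembling Equations 5, 8 and 9 gives the benchmark-based GA prediction. The Python sketch below shows the composition; the element counts, per-individual benchmarks, and communication times are invented placeholders rather than the PCL measurements.

    def ga_extime(master: float, scat: list, slave: list, gat: list) -> float:
        """ExTime = Master + Max_i[Scat_i + Slave_i + Gat_i] (Equation 5)."""
        return master + max(s + w + g for s, w, g in zip(scat, slave, gat))

    def master_bm(num_elt: int, bm_m: float) -> float:
        """Master_2 = NumElt * BM(M) (Equation 8)."""
        return num_elt * bm_m

    def slave_bm(n_i: int, bm_s_i: float) -> float:
        """Slave_i2 = N_i * BM(S_i) (Equation 9)."""
        return n_i * bm_s_i

    # Two slaves, 100 individuals each; per-individual benchmark times.
    master = master_bm(200, 1.0e-4)
    slaves = [slave_bm(100, 2.0e-3), slave_bm(100, 3.0e-3)]
    scat, gat = [0.01, 0.01], [0.02, 0.02]
    print(ga_extime(master, scat, slaves, gat))  # ~0.35 s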

5.3 Modeling GA Communication

In addition to the two computational tasks, there are two communication tasks to model, a Scatter and a Gather. For this GA, the Scatter is implemented as a multicast that sends the same data to each of the Slave processes. Because we are evaluating each Slave communication individually, both the scatter and the gather are simple point-to-point communications:

    Scat_i1 = PtToPt(M, S_i)    (10)
    Gat_i1 = PtToPt(S_i, M)    (11)

where PtToPt was defined for the example application (Equation 4). Another possible model for communication would include latency information:

    PtToPt2(x, y) = Lat(x) + PtToPt(x, y)    (12)

Having identified various possible model components and associated input information, the next subsection will discuss their selection.

5.4 Selecting GA Component Models

Our selection methodology is based on three criteria: the application profile, available input information, and error analysis. The GA application was implemented on networked resources in the UCSD Parallel Computation Lab. The configuration includes two Sparc 2's (one running only the Master process), one Sparc 5, one Sparc 10 and two RS6000's, all connected over Ethernet. The timings were taken using a dedicated system in order to test the validity of the models in isolation (we will be adding contention effects to the models in the future so that they can be used in production environments).

For this application, because the networks were unloaded, the message sizes were small, and several of the processors involved were slow, the application developer informed us that the majority of the computation time would be spent in the Slave portion of the code [Bha96]. If needed, this information could be verified by instrumenting the code with timing routines. Following this application profile, we decided to concentrate our modeling effort on the Slave section of the model and analyzed two fairly detailed choices for this component, an operation count model (Equation 7) and a benchmark model (Equation 9), setting the other components equal to zero.

Our first choice, driven by the availability of input information, was to model the Slave computation using an operation count model. However, we did not have strong confidence in this information: the inner loops of the GA code are non-deterministic and irregularly structured, so it was difficult to derive an accurate operation count. Upon error analysis for this model, we found the predictions to be inadequate; in fact, off by up to 100%, as seen in Figure 4. Therefore, we chose another component model that avoided the faulty data and, using benchmarks for a median problem size, achieved results within 10% of the actual run time. However, obtaining this benchmark significantly added to the development time needed for this model.

[Figure 4. Actual times versus model times for the 5-slave GA application: execution time in seconds versus problem size (up to 1000), comparing the operation count model, the actual execution time, and the benchmark model.]

Figure 4 shows the results of the two models, one using an operation count model, the other using a benchmark model, as compared to the actual execution time of the GA code. The problem size is the total population divided evenly over the 5 Slaves. The benchmark models become less accurate as the number of individuals per Slave increases. We believe this is in part due to caching effects in the Slave computation. Benchmarks were taken only at a single medial data size, so memory effects for large data sizes would not be accurately reflected. Cache effects could be included in the model at a cost to the complexity of computing component values. The trade-off between the complexity of computing a model and the accuracy of its estimation is a key issue we intend to explore.

We evaluated several models for the communication components (bandwidth only, using Equations 10 and 11, as well as latency with bandwidth, using Equation 12), but they had no effect on the accuracy of the prediction. Communication costs were low, in part because the experiments were run in isolation, so there was no extraneous network traffic. Communication costs would contribute more if latency were higher, or if contention caused considerable communication delay. In the face of network contention, we would need to address better communication models for both the Gather and Scatter routines. These would also need to include a factor of interference from the communications of the other Slaves, either through using a partial bandwidth value or some other mechanism.
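The error check described above amounts to comparing predictions against measured runs and switching component models when the relative error exceeds the target. A minimal sketch, with invented (predicted, actual) pairs rather than the measured GA data:

    def relative_error(predicted: float, actual: float) -> float:
        return abs(predicted - actual) / actual

    # Invented (predicted, actual) times in seconds for two problem sizes.
    runs = [(120.0, 95.0), (200.0, 110.0)]
    errors = [relative_error(p, a) for p, a in runs]
    if max(errors) > 0.10:  # outside the 10% target
        print("prediction inadequate: switch component model or inputs")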

6 Red-Black SOR Application

We also examined an SOR (successive over-relaxation) algorithm which solves Laplace's equation. This application is a typical distributed stencil application, using an N by N grid of data divided over the processors in strips [Fig96]. It was written in C with PVM for the same heterogeneous cluster of machines in the PCL as the GA code.

6.1 Structural Model for SOR

In our implementation, the application is divided into "red" and "black" phases, with communication and computation alternating for each [Bri87]. This repeats for a predefined number of iterations. The data decomposition for this code is depicted in Figure 5.

[Figure 5. SOR Application: the N by N grid is divided into strips, one per processor (P1, P2, P3, ...).]

A structural model for this application might be:

    ExTime = Max_i[RedComp_i + RedComm_i + BlackComp_i + BlackComm_i]

where

- Max: A maximum function,
- RedComp_i: Execution time for the Red computation phase on processor i,
- RedComm_i: Time to send and receive data with neighbors during the Red phase,
- BlackComp_i: Execution time for the Black computation phase on processor i, and
- BlackComm_i: Time to send and receive data with neighbors during the Black phase.

The time for a single iteration is equivalent to the time for the slowest processor to complete that iteration, assuming there is synchronization only once an iteration. In reality, there will be loose synchronization between neighbors at each communication. If we use this method as a part of a scheduler, it is possible that processors running faster will be idle undetected, due to "drift" in the loose synchronization. However, if the work is load balanced across all processors to fit their execution capacities, the drift between processors will be small. If drift were to become a problem, it would appear as wait time during the communication components. In that case, we would need to factor it into the structural model as an outside influence on a component. For example, we could use the following structural model instead of the one above:

    ExTime = Max_i[RedComp_i] + Max_i[RedComm_i] + Max_i[BlackComp_i] + Max_i[BlackComm_i]

However, in our implementation the drift is small, so we can use the original structural model without penalty.

6.2 Modeling SOR Communication

The two communication models, unlike the Gather and Scatter for the GA, contain both sends and receives. Component models for them could be:

    RedComm = SendLR + ReceLR
    BlackComm = SendLR + ReceLR

where

    SendLR = PtToPt(i, i+1) + PtToPt(i, i-1)
    ReceLR = PtToPt(i, i+1) + PtToPt(i, i-1)

using the component PtToPt as defined for the GA and example applications (Equation 4).

6.3 Modeling SOR Computation

For computation, we can use models similar to those from the GA application. In the case of operation counts, we have:

    RedComp_i1 = NumElt_i * Op(i, Elt) / CPU_i    (13)
    BlackComp_i1 = NumElt_i * Op(i, Elt) / CPU_i    (14)

Likewise, using benchmarks we have:

    RedComp_i2 = NumElt_i * BM(Red_i)    (15)
    BlackComp_i2 = NumElt_i * BM(Black_i)    (16)

where

- BM(Red_i): Time for processor i to compute a Red element, and
- BM(Black_i): Time for processor i to compute a Black element.
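A minimal Python sketch of the SOR structural model, assuming per-processor red/black times are supplied by the component models of Sections 6.2 and 6.3 and that the per-iteration time is simply multiplied by the iteration count; all numbers are invented placeholders.

    def sor_iteration(red_comp, red_comm, black_comp, black_comm) -> float:
        """ExTime = Max_i[RedComp_i + RedComm_i + BlackComp_i + BlackComm_i]."""
        return max(rc + rm + bc + bm
                   for rc, rm, bc, bm in zip(red_comp, red_comm,
                                             black_comp, black_comm))

    red_comp, red_comm = [0.30, 0.45], [0.02, 0.03]
    black_comp, black_comm = [0.31, 0.44], [0.02, 0.03]
    iterations = 100
    print(iterations * sor_iteration(red_comp, red_comm,
                                     black_comp, black_comm))  # ~95 s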

6.4 Selecting SOR Component Models

Our initial application profile suggested this application would be computationally intense on our network of machines, due to the relative slowness of several workstations and the unloaded network. Therefore we concentrated on developing component models for the two computation components. The SOR application was implemented on the same platform as the GA application, but each processor ran similar code, as there is no master process. Again, the workstations and networks were quiescent.

In this case, as shown in Figure 6, both a benchmark model (using Equations 15 and 16) and an operation count model (using Equations 13 and 14) achieved predictions within 10% of actual execution time, until the problem size spilled from memory. In order to model larger problem sizes, the model would need to be transformed to include memory. As such, this application reinforces the usability of our approach in identifying models for a given problem range, and allowing flexibility as needed over a range of problem sizes or implementations.

[Figure 6. Actual times versus model times for the SOR application: execution time versus problem size (up to 5000), comparing the actual computation time, the benchmark model, and the operation count model.]

6.5 Transformations to Reduce Error for SOR

Our current work involves identifying which transformations on the model should be performed in order to reduce error (Section 4.3). To clarify them, we detail each in this section using the SOR example.

Change a part of a component model. To avoid using inaccurate data we might change a part of a component model. For example, if we were modeling the SOR in a contentious environment we might add a dynamic input for machine capacity values, since static values would not be adequate. The resulting models would be:

    LoadedComp_i = Capacity * Comp_i

where capacity would need to be supplied from a dynamic source, for example [Wol96].

Refine a component model. It might be possible to capture the memory behavior as a part of the computation component. For example:

    BetterComp = Memory_i * Comp_i

where Memory_i would represent the behavior of the memory access on processor i for larger problem sizes.

Add other component models. If this application were executed over a contentious network, communication would need to be modeled instead of being set to Nil, as it was in our experiments.

Re-structure the top-level model. It is possible that the memory spill behavior is not limited to the computation module of the application. In this case, we would need to examine different top-level structural models that could include the memory behavior and its interactions.

7 Conclusion and Future Work

We have presented a new structural approach to modeling cluster applications, and two proof-of-concept examples. Both real applications require flexible models to predict their behavior, but the choice of models is not obvious. Our paradigm is well suited to this decision. Our current work involves formalizing the error analysis and transformation decisions, and compensating for error in input information. Future work includes defining a library of structural and component models as a part of the AppLeS project, and defining selection techniques based on accuracy and complexity.

Acknowledgements

I would like to thank Fran Berman of UCSD for her valuable comments, discussions and corrections. Thanks also to Allen Downey, of UCB and SDSC, who read and commented on drafts of this paper. The original implementations of the GA and SOR codes are the result of hard work by Karan Bhatia and Silvia Figueira, both at UCSD. Thanks also to the UCSD Parallel Lab members and the AppLeS Corp for many useful discussions and their patience with my midnight experiments.

References

[ASWB95] Cosimo Anglano, Jennifer Schopf, Richard Wolski, and Fran Berman. Zoom: A hierarchical representation for heterogeneous applications. Technical Report CS95-451, University of California, San Diego, Computer Science Department, 1995.

[BDGM93] A. Beguelin, J. Dongarra, A. Geist, and R. Manchek. HeNCE: A heterogeneous network computing environment. Scientific Programming, 3(1):49-60, 1993.

[Bha96] Karan Bhatia. Personal communication, 1996.

[Bri87] William L. Briggs. A Multigrid Tutorial. Society for Industrial and Applied Mathematics, Lancaster Press, 1987.

[BW96] Fran Berman and Richard Wolski. Scheduling from the perspective of the application. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, 1996.

[BWF+96] Fran Berman, Richard Wolski, Silvia Figueira, Jennifer Schopf, and Gary Shao. Application-level scheduling on distributed heterogeneous networks. In Proceedings of SuperComputing '96, 1996.

[CGS95] Brad Calder, Dirk Grunwald, and Amitabh Srivastava. The predictability of branches in libraries. In Proceedings of the 28th International Symposium on Microarchitecture, 1995. Also available as WRL Research Report 95/6.

[Cha94] K. M. Chandy. Concurrent program archetypes. In Proceedings of the 1994 Scalable Parallel Libraries Conference, 1994.

[Col89] Murray Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. Pitman/MIT Press, 1989.

[CS95] Robert L. Clay and Peter A. Steenkiste. Distributing a chemical process optimization application over a gigabit network. In Proceedings of SuperComputing '95, 1995.

[DFH+93] J. Darlington, A. J. Field, P. G. Harrison, P. H. J. Kelly, D. W. N. Sharp, Q. Wu, and R. C. While. Parallel programming using skeleton functions. In Proceedings of Parallel Architectures and Languages Europe (PARLE) '93, 1993.

[DN95] J. Dongarra and Peter Newton. Overview of VPE: A visual environment for message-passing parallel programming. In Proceedings of the 4th Heterogeneous Computing Workshop, 1995.

[DS95] J. Demmel and S. L. Smith. Performance of a parallel global atmospheric chemical tracer model. In Proceedings of SuperComputing '95, 1995.

[DTMH96] D. M. Deaven, N. Tit, J. R. Morris, and K. M. Ho. Structural optimization of Lennard-Jones clusters by a genetic algorithm. Chemical Physics Letters, 256:195, 1996.

[FB96] Silvia Figueira and Fran Berman. Modeling the effects of contention on the performance of heterogeneous applications. In Proceedings of the Fifth IEEE Symposium on High Performance Distributed Computing, 1996.

[Fig96] Silvia Figueira. Personal communication, 1996.

[FK97] Ian Foster and Carl Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, (to appear) 1997. Also available at ftp://ftp.mcs.anl.gov/pub/nexus/reports/globus.ps.Z.

[FOW84] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. In Proceedings of the International Symposium on Programming, 6th Colloquium, 1984.

[GKM82] S. Graham, P. Kessler, and M. McKusick. gprof: A call graph execution profiler. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, 1982. Also published in SIGPLAN Notices, 17(6):120-126.

[GR96] Jorn Gehring and Alexander Reinefeld. MARS - a framework for minimizing the job execution time in a metacomputing environment. Future Generation Computer Systems, Spring 1996.

[GWtLt97] Andrew S. Grimshaw, William A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), January 1997.

[Jam87] Leah Jamieson. Characterizing parallel algorithms. In The Characteristics of Parallel Algorithms, chapter 3. MIT Press, 1987. Eds. L. Jamieson, D. Gannon and R. Douglass.

[Kar96] John F. Karpovich. Support for object placement in wide area heterogeneous distributed systems. Technical Report CS-96-03, University of Virginia, Department of Computer Science, January 1996.

[KME89] A. Kapelnikov, R. R. Muniz, and M. D. Ercegovac. A modeling methodology for the analysis of concurrent systems and computations. Journal of Parallel and Distributed Computing, 6:568-597, 1989.

[LLKS85] Lawler, Lenstra, Kan, and Shmoys. The Traveling Salesman Problem. John Wiley & Sons, 1985.

[LS93] S. Leuttenegger and X. Sun. Distributed computing feasibility in a non-dedicated homogeneous distributed system. Technical Report 93-65, NASA ICASE, 1993.

[ML90] V. W. Mak and S. F. Lundstrom. Predicting the performance of parallel computations. IEEE Transactions on Parallel and Distributed Systems, pages 257-270, July 1990.

[MMFM93] C. R. Mechoso, C. Ma, J. D. Farrara, and R. W. Moore. Parallelization and distribution of a coupled ocean-atmosphere general circulation model. Monthly Weather Review, 121(7), July 1993.

[Moh84] J. Mohan. Performance of Parallel Programs: Model and Analyses. PhD thesis, Carnegie Mellon University, July 1984.

[NB92] P. Newton and J. C. Browne. The CODE 2.0 graphical parallel programming language. In Proceedings of the ACM International Conference on Supercomputing, July 1992.

[PM94] Jan Pedersen and John Moult. Determination of the structure of small protein fragments using torsion space Monte Carlo and genetic algorithm methods. In Proceedings of the Meeting on Critical Assessment of Techniques for Protein Structure Prediction, Asilomar Conference Center, December 1994.

[PRW94] A. T. Phillips, J. B. Rosen, and V. H. Walke. Molecular structure determination by convex global underestimation of local energy minima. Technical Report UMSI 94/126, University of Minnesota Supercomputer Institute, 1994.

[SBSM89] R. Saavedra-Barrera, A. J. Smith, and E. Miya. Machine characterization based on an abstract high-level language machine. IEEE Transactions on Computers, 38:1659-1679, December 1989.

[Sil] Silicon Graphics. Pixie man page. Also available at http://web.cnam.fr/Docs/man/Ultrix-4.3/pixie.1.html.

[SK96] Steffen Schulze-Kremer. http://www.techfak.uni-bielefeld.de/bcd/curric/proten/proten.html, 1996.

[SSLP93] Jonathan Schaeffer, Duane Szafron, Greg Lobe, and Ian Parsons. The Enterprise model for developing distributed applications. IEEE Parallel and Distributed Technology, 1(3), August 1993.

[SW96] Jens Simon and Jens-Michael Wierum. Accurate performance prediction for massively parallel systems and its applications. In Proceedings of Euro-Par '96 Parallel Processing, volume 2, 1996.

[TB86] A. Thomasian and P. F. Bay. Analytic queueing network models for parallel processing of task systems. IEEE Transactions on Computers, C-35(12):1045-1054, December 1986.

[Wal90] David W. Wall. Predicting program behavior using real or estimated profiles. Technical Report WRL TN-18, Digital Western Research Laboratory, 1990.

[WK93] Yi-Shuen Mark Wu and Aron Kupermann. Prediction of the effect of the geometric phase on product rotational state distributions and integral cross sections. Chemical Physics Letters, 201:178-186, January 1993.

[Wol96] Rich Wolski. Dynamically forecasting network performance using the Network Weather Service. Technical Report TR-CS96-494, University of California, San Diego, Computer Science and Engineering Dept., October 1996.

[WSF89] D. Whitley, T. Starkweather, and D'Ann Fuquay. Scheduling problems and traveling salesman: The genetic edge recombination operator. In Proceedings of the International Conference on Genetic Algorithms, 1989.

[YB95] W. Young and C. L. Brooks. Dynamic load balancing algorithms for replicated data molecular dynamics. Journal of Computational Chemistry, 16:715-722, 1995.

[YZS96] Yong Yan, Xiaodong Zhang, and Yongsheng Song. An effective and practical performance prediction model for parallel computing on non-dedicated heterogeneous NOW. Journal of Parallel and Distributed Computing, October 1996. Also University of Texas, San Antonio, Technical Report TR-96-0401.

[ZY95] X. Zhang and Y. Yan. A framework of performance prediction of parallel computing on non-dedicated heterogeneous networks of workstations. In Proceedings of the 1995 International Conference on Parallel Processing, 1995.

Yi-Shuen Mark Wu and Aron Kupermann. Prediction of the e ect of the geometric phase on product rotational state distributions and integral cross sections. Chemical Physics Letters, 201:178{86, January 1993. Rich Wolski. Dynamically forecasting network performance using the network weather service. Technical Report TR-CS96-494, University of California, San Diego, Computer Science and Engineering Dept., October 1996. D. Whitley, T. Starkweather, and D'Ann Fuquay. Scheduling problems and traveling salesman: The genetic edge recombination operator. In Proceedings of International Conference on Genetic Algorithms, 1989. W. Young and C. L. Brooks. Dynamic load balancing algorithms for replicated data molecular dynamics. Journal of Computational Chemistry, 16:715{722, 1995. Yong Yan, Xiaodong Zhang, and Yongsheng Song. An e ective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW. Journal of Parallel and Distributed Computing, October 1996. Also University of Texas, San Antonio, Technical Report # TR-96-0401. X. Zhang and Y. Yan. A framework of performance prediction of parallel copmuting on non-dedicated heterogeneous networks of workstations. In Proceedings of 1995 International Conference of Parallel Processing, 1995.