Performance Prediction in a Grid Environment

Rosa M. Badia², Francesc Escalé², Edgar Gabriel¹,³, Judit Gimenez², Rainer Keller¹, Jesús Labarta², Matthias S. Müller¹

¹ High Performance Computing Center Stuttgart, Allmandring 30, D-70550 Stuttgart, Germany
² European Center for Parallelism of Barcelona (CEPBA), Technical University of Catalonia (UPC), Campus Nord, Mòdul D6, Jordi Girona 1-3, 08034 Barcelona, Spain
³ Innovative Computing Laboratories, Computer Science Department, University of Tennessee, Knoxville, TN, USA

{keller,gabriel,mueller}@hlrs.de, {rosab,fescale,jesus,judit}@cepba.upc.es

Abstract. Knowing the performance of an application in a Grid environment is an important issue in application development and for scheduling decisions. In this paper we describe the analysis and optimisation of a computation- and communication-intensive application from the field of bioinformatics, which was demonstrated at the HPC-Challenge of Supercomputing 2002 in Baltimore. This application has been adapted to run on a heterogeneous computational Grid by means of PACX-MPI. The analysis and optimisation are based on trace-driven tools, mainly Dimemas and Vampir. All these methodologies and tools are being extended within the framework of the DAMIEN IST project.

1 Introduction

The efficient execution of a complex scientific application in a distributed, heterogeneous environment where resources are shared with others is a challenge for the developer and user. The DAMIEN [4, 9] project (Distributed Applications and Middleware for Industrial Use of European Networks) aims to produce a tool-chain that supports developers and scientific users of Grid applications. To achieve this goal, new tools are created and existing, widely accepted tools are extended to work in a distributed and heterogeneous Grid environment. For the HPC-Challenge at Supercomputing 2002 in Baltimore, we demonstrated a computation- and communication-intensive application from the area of bioinformatics, RNAfold [7], running on a computational Grid, successfully employing the DAMIEN tool-chain. This computational Grid consisted of 22 high-performance computers installed at the sites of our partners, constituting a distributed, heterogeneous Metacomputer.

In this paper we focus on two of the DAMIEN tools: PACX-MPI, which is introduced in the next section, and Dimemas, described in Section 3. Section 4 then describes experiments performed with these two tools in conjunction with the application RNAfold.

2 PACX-MPI

The middleware PACX-MPI [6] is an optimized MPI implementation that enables MPI-conforming applications to run on a heterogeneous computational Grid without requiring the programmer to change the source code. The hosts making up the computational Grid are coupled through an interconnecting network, e.g. the Internet. Communication between MPI processes within a host is done with the optimized vendor MPI, while communication between MPI processes on different hosts is carried out over the interconnecting network. For this communication to run efficiently, two MPI processes on each host, the so-called daemons, are employed. Any communication between two remote MPI processes is passed to these daemons, which then transfer the message to the other host. While the internal connections provide very high bandwidth in the range of several hundred MB per second with latencies as low as 4 µs, the external connections between hosts sometimes offer only poor quality, depending on various factors, e.g. the distance between hosts, the network provider and even the time of day. The main interest is therefore to speed up these external connections.
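As a minimal illustration of this transparency, the following sketch (our own example, not taken from PACX-MPI or any DAMIEN code) is an ordinary MPI ring exchange. Started under PACX-MPI with its processes distributed over two hosts, the same source code runs unchanged; any send whose destination lives on the other host is relayed through the communication daemons.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {            /* the ring needs at least two processes */
        MPI_Finalize();
        return 0;
    }
    if (rank == 0) {
        /* If rank 1 runs on the other machine, PACX-MPI routes this send
         * through the local and remote communication daemons. */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token passed around %d processes, value %d\n", size, token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        token++;
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}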

2.1 Extensions of PACX-MPI within the DAMIEN project

In order to speed up the interconnection between hosts, several methods are being examined within the scope of the DAMIEN project. The first one is to use several TCP/IP connections instead of just one connection between the daemons. These multiple connections are implemented using threads, each one serving a single TCP/IP connection between the hosts. Depending on the thread-safety of the underlying MPI implementation, two different approaches are implemented on the receiving end of a connection:

1. For MPI implementations offering only funneled thread-safety, i.e. only the main thread may execute MPI calls, all incoming data is queued for the main thread, which forwards it to the receiving MPI processes.
2. For fully thread-safe MPI implementations, each thread of the communication daemon forwards its data to the receiving MPI processes directly.

A sketch of the sending side of this striping scheme is given below. The second method considered in the DAMIEN project consists of giving the application programmer the ability to use the QoS capabilities offered by the network.

RNAfold is optimized to hide communication latencies behind computation. Still, the application is tightly coupled: after every computational step, communication has to take place, since either a row or a column of a matrix has to be sent or received. Because only one communication is started at a time, PACX-MPI is not able to take advantage of parallel connections by sending multiple packets at the same time.
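The following C sketch illustrates the sending side of the first method. It is not PACX-MPI source code: the names send_stripe and send_striped, the equal-sized stripes and the use of local socket pairs as stand-ins for the WAN connections are our own simplifications. On the real receiving daemon the stripes would be reassembled and handed to the destination MPI processes according to one of the two approaches listed above.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define NSTREAMS 4

typedef struct {
    int fd;            /* connected stream socket carrying this stripe */
    const char *data;  /* start of the stripe inside the full message  */
    size_t len;        /* number of bytes this thread has to send      */
} stripe_t;

static void *send_stripe(void *arg)
{
    stripe_t *s = (stripe_t *)arg;
    size_t done = 0;
    while (done < s->len) {                      /* cope with short writes */
        ssize_t n = write(s->fd, s->data + done, s->len - done);
        if (n <= 0) { perror("write"); break; }
        done += (size_t)n;
    }
    return NULL;
}

/* Split 'len' bytes over the given sockets and send the stripes in parallel. */
static void send_striped(const char *buf, size_t len, int fds[NSTREAMS])
{
    pthread_t tid[NSTREAMS];
    stripe_t st[NSTREAMS];
    size_t chunk = len / NSTREAMS, off = 0;

    for (int i = 0; i < NSTREAMS; i++) {
        st[i].fd = fds[i];
        st[i].data = buf + off;
        st[i].len = (i == NSTREAMS - 1) ? len - off : chunk;
        off += st[i].len;
        pthread_create(&tid[i], NULL, send_stripe, &st[i]);
    }
    for (int i = 0; i < NSTREAMS; i++)
        pthread_join(tid[i], NULL);
}

int main(void)
{
    int sv[NSTREAMS][2], send_fds[NSTREAMS];
    char msg[4096], recv_buf[4096];
    size_t chunk = sizeof(msg) / NSTREAMS, total = 0;

    memset(msg, 'x', sizeof(msg));
    for (int i = 0; i < NSTREAMS; i++) {   /* local stand-ins for the WAN links */
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv[i]);
        send_fds[i] = sv[i][0];
    }
    send_striped(msg, sizeof(msg), send_fds);

    /* The receiving daemon would reassemble the stripes in order. */
    for (int i = 0; i < NSTREAMS; i++)
        total += (size_t)read(sv[i][1], recv_buf + i * chunk, chunk);
    printf("reassembled %zu of %zu bytes\n", total, sizeof(msg));
    return 0;
}

The sketch compiles with a POSIX threads library (e.g. cc -pthread) and reduces error handling to the minimum.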

3 Performance prediction: Dimemas

Dimemas is a performance prediction simulator for message-passing applications. It reconstructs the time behavior of a parallel application on a target architecture. The inputs to Dimemas are a tracefile of an execution of the application on a source machine, which captures the CPU bursts and the communication pattern, and a configuration file containing a set of parameters that model the target architecture. Complemented with a visualization tool [2, 10, 12], it allows the user to gain insight into the application behavior.

The initial Dimemas architecture model considered networks of Shared Memory Processors (SMPs). In this architecture, each node is composed of a set of processors and local memory. The interconnection of these nodes is modeled with input/output links and a set of buses. In the IST project DAMIEN [4], Dimemas is being extended to predict performance for distributed or Grid architectures. The target architecture is now a set of machines, each of which can itself be a network of SMPs. Different machines are connected through an external Wide Area Network (WAN) or through dedicated connections. Dedicated connections are assumed to provide a certain QoS, with guaranteed latency and bandwidth, while the WAN latency and bandwidth can be affected by the traffic present on the network.

Dimemas has different models for point-to-point communications and collective operations, with user-configurable parameters to fit different communication library implementations and network characteristics. For point-to-point communications the very simple model shown in Fig. 1 is used [5]. This general model can be applied to any type of communication, whether local or remote; what varies between the cases are the parameter values (for example, a local latency will often be shorter than a remote latency) or whether some of the parameters apply at all (for example, the traffic function is only applied to remote communications through the WAN, to model its contention).

Fig. 1. Point-to-point Dimemas communication model

Collective operations are modeled with a general formula that takes the structure of these operations into account:

T_{coll\_op} = FAN_{in}^{external} + FAN_{in}^{internal} + FAN_{out}^{external} + FAN_{out}^{internal}

with

FAN_{in/out}^{external} = \left( L + \frac{size}{BW} \right) \cdot model\_factor    (1)

where L and BW are the external latency and bandwidth, size is the size of the communication, and model_factor is a user-configured parameter that allows different implementation behaviors to be modeled (logarithmic, linear, constant, or null when the phase does not apply). When the collective operation involves only one machine, the external components of the formula are null. For the internal components the formula is the same as above, but it is evaluated for each of the machines involved in the collective communication and the maximum value is taken.
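As a small numerical illustration of formula (1), the following sketch evaluates the external fan-in/fan-out term for an assumed set of values. The mapping of "logarithmic" and "linear" behavior onto concrete model_factor values (log2 P and P for P processes) is our assumption for the example, not part of the Dimemas definition.

#include <stdio.h>

/* Predicted duration of one external fan-in/fan-out phase, formula (1):
 * (L + size/BW) * model_factor; model_factor = 0 when the phase does not apply. */
static double fan_external(double L_s, double BW_Bps, double size_B,
                           double model_factor)
{
    return (L_s + size_B / BW_Bps) * model_factor;
}

int main(void)
{
    /* Values in the range used later in the paper: 0.5 MB/s external
     * bandwidth, 4.9 ms flight time, and a 1 MB message. */
    double L = 0.0049, BW = 0.5e6, size = 1.0e6;

    /* How a logarithmic or linear behavior maps onto model_factor is an
     * assumption here: log2(8) = 3 and 8 for an 8-process operation. */
    printf("external fan-in, logarithmic: %.2f s\n", fan_external(L, BW, size, 3.0));
    printf("external fan-in, linear:      %.2f s\n", fan_external(L, BW, size, 8.0));
    return 0;
}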

4 Experiments

Every organism's code of life, its genes, is stored in deoxyribonucleic acid (DNA), which is present in every cell of the organism. Still, ribonucleic acid (RNA) plays an even more important role in gene expression and protein synthesis. While DNA forms a stable double helix, RNA is created through a process called transcription, i.e. copying the relevant part of the DNA into messenger RNA (pre-mRNA). The sequence of the resulting mRNA is complementary to the strand of DNA it was synthesized from and forms a single strand. This single-stranded mRNA is subject to a process called folding, where the RNA's bases start pairing in a semi-chaotic fashion, increasing the order of the mRNA structure into a tertiary (three-dimensional) structure and resulting in a state of minimal energy. This tertiary structure determines the RNA sequence's function. The ab-initio computation of such a tertiary structure is infeasible with contemporary computers, but it may be predicted with the help of the corresponding secondary structure [1].

The application RNAfold calculates the secondary structure of RNA. It was developed as part of the ViennaRNA package at the University of Vienna [8] and MPI-parallelized. For iGrid2002 and the HPC-Challenge at SC2002, this application was enhanced by Sandia National Labs and HLRS: the energy parameters were updated, the application was integrated into the framework of a virtual reality environment based on Covise [3], and the communication pattern was improved in several consecutive steps to make it more efficient. Particularly the last optimization made it viable to run on a Metacomputer. Still, it is computationally expensive, on the order of O(n³), and also communication-intensive, O(n²), with n being the number of bases in the sequence.

The communication was first analyzed using Vampir [12]. This showed that the computation is split into two parts. The first main communication step follows a regular pattern: each process communicates with its predecessor and its successor. In addition, process zero, which collects the intermediate results, distributes them onto a single, distinguished node that stores them for the second step. In the second step, the secondary structure of minimal energy is reconstructed by collecting the data onto process zero again. The former communication step originally used five MPI_Isend calls to/from the predecessor and successor, which were merged into a single MPI_Isend (a sketch of this merging is shown below). The second communication pattern was particularly inefficient on high-latency Metacomputing connections: process zero requests the data necessary for reconstruction from the process storing it (sending three integers), which then delivers the data (returning one integer). The data requested consists of the F^M-matrix, where F^M_{ij} stores the minimum free energy of substructure [i, j] of the sequence if i and j are part of a so-called multi-loop, and the F^B-matrix, where F^B_{ij} is the minimum free energy of substructure [i, j] with bases i and j forming a pair. Since it was recognized that process zero requests all values twice (some values even more often), a caching mechanism was implemented. It was also recognized that values are requested following a regular pattern (see Fig. 2 for the access patterns of the F^B- and F^M-matrices; colors encode the order of accesses, from blue to red). To improve the efficiency of the communication, prefetching was introduced: for the F^B-matrix a square-shaped area of the matrix is transmitted, while for the F^M-matrix a heuristic guesses the values that will be asked for next, either a horizontal or a vertical line, incrementing or decrementing.
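The following MPI sketch illustrates the first of these optimizations, merging several non-blocking sends to the same neighbor into a single MPI_Isend of a packed buffer, so a high-latency link pays the WAN latency once instead of five times. The buffer layout, message sizes and tags are illustrative assumptions and do not reflect the actual RNAfold data structures.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NROWS  5        /* number of small messages merged into one (illustrative) */
#define ROWLEN 1024     /* doubles per message (illustrative)                      */

/* Original pattern: five latency-bound non-blocking sends per step
 * (shown for comparison; not called below). */
static void send_unmerged(double rows[NROWS][ROWLEN], int succ, MPI_Request req[NROWS])
{
    for (int i = 0; i < NROWS; i++)
        MPI_Isend(rows[i], ROWLEN, MPI_DOUBLE, succ, 100 + i, MPI_COMM_WORLD, &req[i]);
}

/* Optimized pattern: one packed message per step. */
static void send_merged(double rows[NROWS][ROWLEN], double packed[NROWS * ROWLEN],
                        int succ, MPI_Request *req)
{
    for (int i = 0; i < NROWS; i++)
        memcpy(&packed[i * ROWLEN], rows[i], ROWLEN * sizeof(double));
    MPI_Isend(packed, NROWS * ROWLEN, MPI_DOUBLE, succ, 100, MPI_COMM_WORLD, req);
}

int main(int argc, char **argv)
{
    static double rows[NROWS][ROWLEN], packed[NROWS * ROWLEN], recv[NROWS * ROWLEN];
    MPI_Request sreq;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int succ = (rank + 1) % size, pred = (rank + size - 1) % size;
    send_merged(rows, packed, succ, &sreq);                 /* one message, not five */
    MPI_Recv(recv, NROWS * ROWLEN, MPI_DOUBLE, pred, 100, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("merged exchange of %d doubles completed\n", NROWS * ROWLEN);
    MPI_Finalize();
    return 0;
}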

Fig. 2. Access pattern of the F^M-matrix (left) and the F^B-matrix (right)

This improved the computation time even for calculations of large sequences on a single HPC system. Still, the performance on Metacomputers was slow. Since memory was abundant, it was recognized that process zero is able to store the complete matrices. This eliminated the need for communication from process zero to every other process; more importantly, it made it possible to eliminate the last communication step and the need for caching, leaving a very regular communication pattern for Metacomputing, as can be seen in Fig. 3.

Fig. 3. Bytes sent between 8 processes with the Metacomputing version (1000 base pairs)

4.1 Experiments at CEPBA-UPC

A first set of experiments was performed at the European Center for Parallelism of Barcelona (CEPBA) at the Technical University of Catalonia (UPC). The heterogeneous Metacomputer used in this case was composed of two machines: an SGI Origin 2000 server with 64 processors (Karnak) and an IBM RS/6000 SP with 8 by 16 Nighthawk Power3 processors (Kadesh). Dimemas tracefiles of local executions on the SP3 were extracted with mpidtrace, the Dimemas tracefile generator. These tracefiles were then used to feed the Dimemas simulator with different input configurations. To estimate the characteristics of the connection between the two machines, the ping program was used; the nominal values of the WAN parameters can vary from moment to moment, depending on the network situation. Afterwards, measurements of distributed executions between these two machines were performed to validate the Dimemas predictions. Table 1 shows some of the results obtained in this environment. The predictions for these cases were obtained by setting the network bandwidth to 0.5 MB/s and the flight time (network latency) to 4.9 ms.

Benchmark        Processors (Kadesh + Karnak)    Predicted time    Measured time
Test 5000.seq    4 + 4                           245 s             289 s
Test 5000.seq    14 + 14                         107 s             116 s
Yeast DNA seq    4 + 4                           338 s             409 s

Table 1. Results obtained at the CEPBA-UPC site.

4.2 Parametric studies with Dimemas

Rather than predicting the behavior of a single given configuration, Dimemas can also be used for parametric studies of applications. In this way, one can study which configuration of machines, nodes and processors is best suited to run an application, or at which threshold the network parameters start to degrade the behavior of the application's communications. As an example, the yeast DNA sequence was used as input data, run with eight processors. The input configuration was set to two machines with four processors each. A set of two hundred simulations was performed, randomly varying two parameters: the network bandwidth and the flight time (the latency of the network due to the distance). For each pair of flight time and network bandwidth values a prediction was calculated (the structure of such a sweep is sketched below). Figure 4 shows the execution time predicted by Dimemas plotted against the network bandwidth and the flight time; since this is a three-dimensional relationship, only its projections are shown. This experiment was performed using the Metacomputing tool ST-ORM [11].
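The sketch below reproduces the structure of this sweep with a plain loop and a placeholder cost model. The function predict_runtime() merely stands in for launching the Dimemas simulator with a modified configuration file (which is what ST-ORM automates); neither the function nor its constants correspond to a real Dimemas interface or to measured data.

#include <stdio.h>
#include <stdlib.h>

/* Placeholder cost model standing in for a full Dimemas run: a fixed compute
 * part plus a communication part that grows with the flight time and shrinks
 * with the bandwidth. The constants are illustrative, not measured. */
static double predict_runtime(double bw_MBps, double flight_s)
{
    const double compute_s = 250.0, volume_MB = 120.0, messages = 4000.0;
    return compute_s + messages * flight_s + volume_MB / bw_MBps;
}

int main(void)
{
    srand(42);
    for (int i = 0; i < 200; i++) {                            /* 200 random samples */
        double bw = 0.1 + 3.9 * (double)rand() / RAND_MAX;     /* 0.1 .. 4.0 MB/s    */
        double ft = 0.001 + 0.099 * (double)rand() / RAND_MAX; /* 1 .. 100 ms        */
        printf("%.2f MB/s  %5.1f ms  ->  %6.1f s\n", bw, ft * 1e3,
               predict_runtime(bw, ft));
    }
    return 0;
}

In the real study each sample is a full Dimemas simulation of the yeast tracefile, and ST-ORM collects the resulting execution times for the projections shown in Fig. 4.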

Fig. 4. Results of parametric study using Dimemas and ST-ORM: predicted execution time [s] plotted against network bandwidth [MB/s] (left) and flight time [s] (right)

4.3 Experiments between both sites

Testing the Dimemas predictions on high-latency links was performed on a distributed heterogeneous Metacomputer consisting of a 512-processor Cray T3E-900 (Hwwt3e) installed at HLRS, and the IBM SP3 as well as the Origin 2000 installed at CEPBA. Figure 5 summarizes the results of these tests. The x-axis shows the different benchmarks and configurations used for these experiments. The benchmarks are Test 1000 to Test 6000, synthetic data sets for RNAfold, with the number indicating the number of bases. The configurations considered are 4+4, with 4 processors on the O2000 and 4 on the T3E, and 6+6 and 14+14, with 6 and 14 processors on the IBM SP and on the T3E respectively. For some of these cases the measurements were repeated (Measure 1 and Measure 2), showing how much influence the network state can have on the achieved performance. To estimate the latency and bandwidth of the network between Barcelona and Stuttgart, a combination of the ping and traceroute tools was used. From the values obtained with these tools, a range of flight times and bandwidths was identified. This range of values was then used to feed Dimemas and obtain different prediction lines: for the first line the bandwidth was set to 70 KB/s and the flight time to 10 ms; for the second line to 100 KB/s and 10 ms; and for the third line to 200 KB/s and 1 ms. It can be observed that Dimemas is able to capture the behavior of the application across this range of values.

Fig. 5. Results obtained with a distributed Metacomputer: measured execution times (Measure 1, Measure 2) and Dimemas prediction lines for 0.07 MB/s / 10 ms, 0.1 MB/s / 10 ms and 0.2 MB/s / 1 ms

5 Conclusion

The use of performance analysis and prediction tools for parallel codes is a must on computational Grids due to the complexity of the environment and of the codes themselves. The DAMIEN project aims at the development of tools for message passing applications on heterogeneous computational Grids. This paper presented some of the work performed in this project. We demonstrated that the tools work for real applications, especially that Dimemas is not only able to analyze the sensitivity to network parameters like latency and bandwidth, but also to predict the behavior of the application in the extreme environment of the Grid. In addition, we introduced some application optimizations, like caching and prefetching, to reduce the sensitivity to latency and thus improve the performance.

Acknowledgments. This work was supported by the European Commission under contract number DAMIEN IST-2000-25406.

References

1. R. L. Baldwin and G. D. Rose. Is protein folding hierarchic? II – Local structure and peptide folding. TIBS 24, pages 77–83, February 1999.
2. H. Brunst, W. E. Nagel, and H.-C. Hoppe. Group-Based Performance Analysis of Multithreaded SMP Cluster Applications. In R. Sakellariou, J. Keane, J. Gurd, and L. Freeman, editors, Euro-Par 2001 Parallel Processing, pages 148–153. Springer, 2001.
3. Covise. Internet, October 2002. http://www.hlrs.de/organization/vis/covise.
4. DAMIEN. Internet, October 2002. http://www.hlrs.de/organization/pds/projects/damien.
5. Dimemas. Internet, October 2002. http://www.cepba.upc.es/dimemas.
6. E. Gabriel, M. Resch, T. Beisel, and R. Keller. Distributed Computing in a Heterogeneous Computing Environment. In PVM/MPI, pages 180–187, 1998.
7. I. Hofacker, W. Fontana, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Vienna RNA Package. http://www.tbi.univie.ac.at/~ivo/RNA.
8. I. L. Hofacker, W. Fontana, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Vienna RNA Package. Internet, October 2002. http://www.tbi.univie.ac.at/~ivo/RNA.
9. M. S. Müller, E. Gabriel, and M. Resch. A Software Development Environment for Grid-Computing. In Concurrency and Computation: Practice and Experience, 2003. Special issue on Grid computing environments.
10. Paraver. Internet, October 2002. http://www.cepba.upc.es/paraver.
11. ST-ORM. Internet, October 2002. http://www.easi.de/storm.
12. Vampir. Internet, October 2002. http://www.pallas.com/e/products/vampir/index.htm.