IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 12, DECEMBER 2006

A Parallel Computational Model for Heterogeneous Clusters

Jose Luis Bosque and Luis Pastor

Abstract—Heterogeneous clusters call for new models and algorithms. In this paper, a new parallel computational model is presented. The model, based on the LogGP model, has been extended to deal with heterogeneous parallel systems. For that purpose, LogGP's scalar parameters have been replaced by vector and matrix parameters that take the different nodes' features into account. The work presented here includes the parametrization of a real cluster, which illustrates the impact of node heterogeneity on the model's parameters. Finally, the paper presents some experiments that can be used for assessing the method's validity, together with the main conclusions and future work.

Index Terms—Parallel computational models, performance evaluation, heterogeneous systems, cluster computing, LogGP model.

1 INTRODUCTION

During the last decade, Beowulf clusters have had tremendous dissemination and acceptance. In just a few years, Beowulfs have mobilized a community around a standard architecture and tools, making high-performance computing much less expensive and much more accessible for industries, laboratories, and universities. Improving the clusters' hardware and software is essential to improving the efficiency and effectiveness of these systems [8]. However, the design and implementation of efficient parallel algorithms for clusters is still a problematic issue. The diverse architectures, interconnection network technologies and topologies, routing algorithms, and the presence of some degree of system heterogeneity have a pervasive impact on algorithm performance. Each architecture has distinct features which can deeply affect algorithm performance. The problem is thus how to design parallel algorithms that accommodate the particularities of different computers.

One way to solve this problem is by developing practical parallel computational models and using them to guide the high-level design of effective parallel algorithms and to estimate their performance before actually running them on a parallel machine (a parallel computational model is a mathematical abstraction of parallel computers which hides architecture details from software designers). The challenge then becomes to develop a general-purpose parallel model that is sufficiently detailed to take into account realistic aspects impacting algorithm performance while still remaining abstract enough to be architecture independent and easy to analyze. During the past few years, a range of parallel computational models have been developed as a basis for the design of fast and portable parallel algorithms [25].

The authors are with the Departamento de Arquitectura de Computadores y Ciencias de la Computación e Inteligencia Artificial, Universidad Rey Juan Carlos, Calle Tulipan S/N, 28.933 Mostoles, Spain. E-mail: {joseluis.bosque, luis.pastor}@urjc.es.

Manuscript received 20 Jan. 2005; revised 30 Sept. 2005; accepted 12 Nov. 2005; published online 25 Oct. 2006. Recommended for acceptance by Y. Robert. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0038-0105.

1045-9219/06/$20.00 © 2006 IEEE

The techniques proposed most recently often use many parameters in order to achieve better accuracy. However, a careful balance must be sought between incorporating detail and becoming too finely parametrized, which increases model complexity too much. Therefore, identifying resource metrics and selecting them accordingly is critical in the design of models of parallel computation.

All of the models proposed in the literature so far have been developed with shared and distributed memory multiprocessors in mind, typically composed of a potentially large number of identical processors. The situation for clusters is different since their flexibility permits easy system reconfigurations, which frequently result in heterogeneous systems composed of nodes with different features and performance. Consequently, there is a strong need for developing tools and techniques for analyzing the behavior of heterogeneous environments.

In this paper, a new heterogeneous parallel computational model based on the LogGP model is proposed. The model takes into account system heterogeneity regarding both the computational nodes and the communication network. The LogGP model has been selected as the starting point for several reasons, listed in Section 3.1.

The rest of the paper is organized as follows: First, a brief summary of parallel computational models is presented. Next, the heterogeneous parallel computational model presented in this paper is described. Section 4 presents a parametrization of a real cluster using the proposed method. Finally, Section 5 presents some experimental results obtained while applying this model to a practical problem and Section 6 summarizes the main conclusions reached and future work to be performed.

2 BACKGROUND

It has always been difficult to develop reasonable abstractions of parallel computers because these machines have traditionally exhibited a remarkable structural diversity. During the last decade, a large number of parallel computer architectures have been developed and, consequently, a large number of new parallel computing models have been proposed. Each new approach tries to incorporate more detail in order to increase the model's accuracy and reliability.

2.1 Basic Synchronous Models
Historically, the most widely used parallel model is PRAM (Parallel Random Access Machine) [22]. This model extends the sequential RAM model by replacing the processor part: A PRAM machine is a set of sequential processors sharing a global memory, each of them having its own private unbounded local memory. PRAM assumes that all of the processors are synchronized after executing each instruction. The cost of each memory access (either to local or global memory) and of each computation step is the same, and interprocess communication is free. Another synchronous extension of the serial RAM model is the Vector Random Access Machine (VRAM) model for vector architectures. The VRAM model is a serial random access machine with the addition of a vector memory, a vector processor, and vector input and output ports [13].

PRAM has proven to be useful for allowing algorithm designers to focus on the structure of the computational tasks rather than on the architecture details of a currently available machine. But PRAM also makes a large number of assumptions in order to simplify algorithm design and is often inaccurate in predicting the real runtime behavior of many existing parallel machines. This problem has spurred the development of several extensions of PRAM and other models.

2.2 Asynchronous Models
One of the strongest restrictions of the models cited above is that they are accurate enough only while modeling tightly coupled multiprocessor systems because of the synchronous features included in them. In order to have models that can work with both tightly and loosely coupled systems, it is necessary to incorporate some degree of asynchrony into these models. For example, Phase PRAM extends PRAM by including semi-asynchrony [24]. Another proposal that can be cited is the Asynchronous PRAM model (APRAM), which is a "fully" asynchronous model [17]. As in PRAM, an APRAM machine consists of a global shared memory and a set of processors with their own local memories.

2.3 Models with Latency and Bandwidth
The models presented here fall into two different subclasses: models that extend the PRAM paradigm by introducing latency and those which also address the issue of bandwidth limitations. For example, the Local-Memory PRAM model (LPRAM) adds the notion of latency to the PRAM model [1]. Also, the Block PRAM model takes into account the reduced cost associated with the transfer of data blocks [2].

In addition to the PRAM extensions, a set of completely different models have also been proposed. The Postal Model proposed by Bar-Noy and Kipnis [7] is a distributed memory model with point-to-point communication that incorporates message transfer latencies. The Bulk-Synchronous Parallel (BSP) model is a distributed memory and semi-asynchronous paradigm [32]. In BSP, each computation is decomposed into a set of sequential supersteps, with all of the processors being synchronized after each superstep. The BSP model is described in terms of three elements: processes/memory modules, a router which delivers messages between pairs of components, and a synchronizer which synchronizes all or a subset of the components.


Specific cluster features such as asynchrony and loose coupling have forced new models to be developed, such as [14], [20] for dedicated clusters and [5], [11] for borrowed clusters. For heterogeneous clusters, several frameworks have been proposed [6], [9], [10]. The HiHCoHP model (Hierarchical Hyperclusters of Heterogeneous Processors) was proposed in [16], [29] as a realistic communication model for heterogeneous clusters. It includes several features, such as the bandwidth and transit costs of networks and their ports, while retaining enough mathematical tractability to allow algorithm design. The authors present a complete theoretical development, but they do not report any experimental results.

Finally, the LogP model is motivated by current technological trends in high performance computing toward networks of large-grained sophisticated processors. The LogP model uses the following parameters:

- Latency, L: An upper bound for the latency associated with transferring a single word message from its source to its target processor.
- Overhead, o: The period of time that a processor is engaged in sending or receiving a message.
- Gap between messages, g: The minimum time interval between consecutive message transmissions or receptions at a specific processor. The inverse of g corresponds to the available per-processor communication bandwidth for short messages.
- Number of processors, P: The machine's number of processors.

LogP allows processors to run in a completely asynchronous way, although the communication performance can only be accurately predicted when short messages are sent. In that case, the total time for a point-to-point communication in the LogP model requires $L + 2o$ cycles. Alexandrov et al. [3], [4] extended the basic LogP model with a linear model for long messages, adding an additional parameter, the Gap per byte, G, which captures the bandwidth constraints for long messages. Under the LogP model, sending a k-byte message requires sending $\lceil k/w \rceil$ messages, where w is the underlying message size of the machine. This would take $o + (\lceil k/w \rceil - 1) \cdot \max(g, o) + L + o$ cycles. In contrast, sending everything as a single large message takes $o + (k - 1) \cdot G + L + o$ cycles under the LogGP model. It has to be noted that this model is valid for predicting communication performance only as long as there is not a considerable amount of contention in the communication network. Nevertheless, the LogGP model has proven to be useful, achieving a good compromise between accuracy and complexity, maybe because distributed memory multiprocessors are most efficient for applications where the computation/communication ratio is high.
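To make the two cost expressions above concrete, the following C sketch computes the predicted point-to-point transfer time for a k-byte message under LogP (fragmented into w-byte messages) and under LogGP. The parameter values in main are purely illustrative assumptions and do not correspond to any measured machine.

#include <math.h>
#include <stdio.h>

/* Predicted time for a k-byte message under LogP, fragmented into w-byte
 * messages: o + (ceil(k/w) - 1) * max(g, o) + L + o                        */
double logp_time(double L, double o, double g, double k, double w) {
    double fragments = ceil(k / w);
    double step = (g > o) ? g : o;           /* max(g, o) between fragments */
    return o + (fragments - 1.0) * step + L + o;
}

/* Predicted time for the same data sent as one long message under LogGP:
 * o + (k - 1) * G + L + o                                                  */
double loggp_time(double L, double o, double G, double k) {
    return o + (k - 1.0) * G + L + o;
}

int main(void) {
    /* Hypothetical parameter values (microseconds, bytes), for illustration only. */
    double L = 50.0, o = 10.0, g = 15.0, G = 0.1, w = 8.0, k = 4096.0;
    printf("LogP  estimate: %.1f us\n", logp_time(L, o, g, k, w));
    printf("LogGP estimate: %.1f us\n", loggp_time(L, o, G, k));
    return 0;
}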

3 HETEROGENEOUS LOGGP

In this section, a new extension of LogGP which has been specifically developed for heterogeneous systems is presented. This section describes the method's most salient features; subsequent sections describe how this method has been applied to a real cluster and how the model was used to predict the cluster response when a real application was executed.


3.1 Reasons for Selecting the LogGP Model
There are a number of reasons for selecting the LogGP parallel computational model as the starting point for devising a parallel computational model for heterogeneous clusters. The most important are the following:

- LogGP seems appropriate to be adapted to heterogeneous clusters because the underlying architecture is very similar to a cluster [18]. The authors of LogGP developed it for distributed-memory multiprocessors, where processors have a large amount of local memory and communicate with each other using point-to-point messages. The processors run asynchronously and the interconnection network has a set of limiting features, such as network latency and bandwidth. Additionally, the model takes into account a set of practical problems which allow it to perform more realistic predictions.
- LogGP removes the synchronization points needed in other models such as PRAM and BSP. This situation is much more realistic and corresponds to the typical structure of parallel applications developed for clusters, based on independent processes communicating over the interconnection network. Having a synchronization point forces system processors to block themselves until all of them reach that point. Therefore, this point introduces idle processor times and heavy network overheads, especially for heterogeneous clusters.
- The model allows overlapping computation and communication operations, within the limits imposed by network capacity. This overlap is possible because the model decomposes communication operations into pipelined sequences of independent steps, matching a different parameter to each of these steps. This allows algorithm designers to get more efficient implementations.
- Furthermore, LogGP allows considering both short and long messages [3], [4]. Other models assume fixed size messages, generally one word long. But, in many architectures, including clusters, it is much more efficient to send a long message with multiple pieces of information than several independent short messages.
- LogGP assumes finite network capacity, avoiding situations where the network becomes a bottleneck, damaging the model accuracy. For example, this prevents creating algorithms where a node sends a large number of consecutive messages to other nodes without taking into account bandwidth limits. In this case, when the network is saturated, the communication times will severely penalize the system's performance because the processor will need to wait between consecutive messages during a period of time inversely proportional to the communication bandwidth.
- Finally, this model encourages techniques that yield good results in practice, such as reducing the overall communication demands, designing algorithms with balanced communication patterns, and coordinating the work assignment and data placement to reduce communication requirements.

All of these advantages have made LogP and its extension LogGP the most frequently used models for designing and analyzing parallel algorithms and architectures [12], [21], [26], [27], [28], [33].

3.2 HLogGP Definition
There are two sources of heterogeneity in a cluster: the computational nodes and the communication subsystem. The nodes can have different processor, memory, and input/output features. With respect to communications, two questions have to be considered: the network interface and the network structure and technology. In the LogGP model, each parameter is associated with a communication phase and therefore with a component. In this section, an extension of the LogGP model for heterogeneous clusters is presented. For this purpose, each of the model's parameters will be analyzed in order to study how heterogeneity affects it. This study is based on an M-node cluster, $(N_1, \ldots, N_M)$, in which both the processors and the network are heterogeneous. The proposed model is based on the following parameters:

- Latency, L: Communication latency depends on both network technology and topology. In a heterogeneous cluster, different network technologies such as Fast Ethernet, Gigabit Ethernet, or Myrinet can coexist. The network's topology defines the number of hubs, routers, and switches and their interconnection patterns, which also affect the messages' latencies: each of the elements included within a message's path adds some time to the final latency. Given the fact that communication among nodes is so dependent on the cluster's communication configuration, it is not possible to use a single latency parameter for all node-to-node communication. In the proposed model, rather than a single scalar, an $M \times M$ cost matrix is used in which each element gives the latency cost of a communication between two specific nodes. The latency matrix of a heterogeneous cluster can be defined as a square matrix $L = \{l_{1,1}, \ldots, l_{M,M}\}$, in which each element $l_{i,j}$ records the latency associated with the delivery of a message sent from node $N_i$ to node $N_j$. L is usually a symmetric matrix, with its diagonal representing the communication latencies associated with processes which are being executed in the same node.
- Overhead, o: The time needed by a processor to send or receive a message is referred to as overhead. The operations involved in sending or receiving messages are quite different, therefore producing different amounts of overhead. In consequence, the communication overhead must be broken down into two independent parameters [19]: sender overhead, Os, which can be defined as the time needed by the sender to emit a message, and receiver overhead, Or, which can be defined as the time needed by the receiver to get a message. Additionally, due to system heterogeneity, these overheads depend on each node's computational power, resulting in different Os and Or values for different nodes. Rather than scalar parameters, vector parameters for both overheads are needed, resulting in a sender overhead vector, $Os = \{os_1, \ldots, os_M\}$, in which each element $os_i$ records the sender overhead for node $N_i$. Also, a receiver overhead vector, $Or = \{or_1, \ldots, or_M\}$, records the receiver overhead for each node $N_i$.
- Gap between messages, g: This parameter reflects each node's proficiency at sending consecutive short messages (it has to be noted that, in LogP, the message size is an inherent parameter since all of the messages considered in this model are short). Therefore, the gap is the minimum period of time for sending two consecutive short messages. Since this value depends on the network interface, it depends on the nodes' features just like the overheads do. Therefore, a gap vector $g = \{g_1, \ldots, g_M\}$ can be defined in which each element $g_i$ represents node i's gap for one-byte-long messages.
- Gap per byte, G: The Gap per byte depends on network technology, in particular, on the network's capabilities for managing long messages. In a cluster under TCP/IP, all messages are long. Consequently, this parameter is fixed by the network bandwidth. On the other hand, the bandwidth in a heterogeneous cluster depends on both the network technology and topology. Due to the fact that, in a heterogeneous network, a message can cross different switches with different bandwidths, the total Gap for a particular message depends on its path. The switch with the lowest bandwidth between two particular nodes determines the real communication bandwidth between those two nodes. For example, if two nodes have Myrinet 1 Gbps network interfaces and they are connected through a single Myrinet switch, their real bandwidth is 1 Gbps. But, if the nodes are connected through several switches and one of them is a 100 Mbps switch, the bandwidth between these two particular nodes is limited to 100 Mbps. Just as for latency, a Gap matrix $G = \{G_{1,1}, \ldots, G_{M,M}\}$ can be defined in which each element $G_{i,j}$ determines the gap per byte for sending a long message from node $N_i$ to node $N_j$. The inverse of each element, $1/G_{i,j}$, is the available bandwidth for sending long messages between this pair of nodes. Values in the diagonal correspond to the bandwidth between a pair of processes executing in the same node.
- Computational power, $P_i$: The number of nodes cannot be used in a heterogeneous model for measuring the system's computational power because each node can have very different computational features, yielding very different response times for a specific task. Therefore, it is necessary to account for each node's computational power, denoted by $P_i$, which can be defined as the amount of work finished by that processor during a unit time span for a specific application. $P_i$ depends on the node's physical features (CPU organization and speed, memory and I/O capabilities, etc.), but also on the particular algorithm implementation actually being processed. The original scalar P parameter can be replaced by a computational power vector $P = \{P_1, \ldots, P_M\}$ in which each element $P_i$ corresponds to the computational power of node $N_i$. Considering this vector, a possible simplification is to compute $P_i$ for a standard task, although the model's accuracy will be reduced depending on the actual task to be processed.

Globally, the model's main drawback is the number of parameters that need to be estimated before a particular system can be simulated. Nevertheless, it should be noted that, whenever a system becomes heterogeneous, the model should allow different nodes to present different features [16]. In other words, if a homogeneous model does not yield accurate results and a heterogeneous model becomes mandatory, each machine's own features have to be estimated and the number of parameters in the model will grow accordingly. This increase in model complexity can be seen as a direct consequence of the system's own additional complexity. Additionally, an automated procedure for estimating model parameters will be extremely useful for modeling large systems.

4 HLOGGP VALIDATION

The model described in the previous section is very general, being well suited for describing and analyzing a wide variety of parallel machines, including heterogeneous systems (which were not covered by many previous approaches). After describing the model, the following two sections are devoted to showing how the model can be applied to predict the behavior of a specific cluster running a real application. For that purpose, it is necessary to fulfill two tasks:

1. To estimate the different parameters that describe a specific parallel system.
2. To analyze how the application will run on that system.

Subsequent comparisons between theoretical predictions and actual execution results can cast some light on the model's validity and applicability. In this section, a real cluster will be described and parametrized. Section 5 will present the predicted and measured results when a specific application is executed in this cluster.

4.1 Cluster Description
In order to assess the method's ability to model heterogeneous systems, a number of trials have been performed on a real heterogeneous PC cluster. This system is composed of eight nodes, four of them based on 733 MHz Pentium III processors and the remaining four based on 550 MHz Pentium III processors; throughout this paper, they will be referred to as fast (F) and slow (S) processors. The cluster communication network is composed of a Fast Ethernet 100 Mbps switch and an Ethernet 10 Mbps hub. To increase the heterogeneity, two fast and two slow processors were connected to the switch and the remaining two of each kind to the hub. Table 1 summarizes the cluster configuration.

TABLE 1 Node Main Features and Identifier

4.2 Parametrization Benchmarks
A set of specific benchmarks has been developed in order to measure the cluster parameters. These benchmarks have been programmed in C using the MPI/LAM 6.5.6 library [23], [30]. The first benchmark has been used for determining overhead, latency, and gap per byte for different message sizes. It is composed of two independent processes executed in each of the nodes to be parametrized; Benchmark 1 shows a fragment of its code. The two processes are symmetric, being composed of a couple of loops where the send and receive functions are executed. Measures for message sizes ranging from 1 byte to 16 million bytes have been taken to analyze how message size affects the system behavior.

Benchmark 1: Source code of the benchmark for measuring communication overhead, latency, and gap per byte.
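The original listing is not reproduced in this extraction; the following MPI/C sketch only illustrates the structure described in the text: two symmetric processes, with process_1 timing the blocking MPI_Send to estimate the sender overhead and timing an MPI_Probe after a barrier to estimate the latency (the start_L/stop_L marks mentioned in Section 4.4). Loop counts, message sizes, and variable names are hypothetical.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS     100            /* hypothetical number of repetitions      */
#define MSG_SIZE 1024           /* one of the message sizes to be measured */

int main(int argc, char *argv[]) {
    int rank, word = 0;
    char *buf;
    double t0, os_total = 0.0, lat_total = 0.0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(MSG_SIZE);

    for (int i = 0; i < REPS; i++) {
        /* Sender overhead: time spent inside the blocking send (process_1). */
        if (rank == 0) {
            t0 = MPI_Wtime();
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            os_total += MPI_Wtime() - t0;
        } else {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
        }

        /* Latency: after a barrier, process_2 sends a one-word message and
         * process_1 waits on MPI_Probe until it has arrived.               */
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) {
            t0 = MPI_Wtime();                       /* start_L              */
            MPI_Probe(1, 1, MPI_COMM_WORLD, &st);   /* message has arrived  */
            lat_total += MPI_Wtime() - t0;          /* stop_L               */
            MPI_Recv(&word, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &st);
        } else {
            MPI_Send(&word, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("avg Os = %g s, avg arrival delay = %g s\n",
               os_total / REPS, lat_total / REPS);
    free(buf);
    MPI_Finalize();
    return 0;
}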


The gap between messages has been measured using a second benchmark, also specifically developed for this purpose. As in the previous case, the benchmark is composed of two independent symmetric processes which are executed in different nodes; Benchmark 2 shows a fragment of its code.

Before explaining the results achieved with these two benchmarks, it is necessary to discuss some issues about the behavior of the MPI functions. These functions change their behavior when the message size is larger than the TCP/IP buffer (64 KB by default in the LAM/MPI implementation). For message sizes shorter than this threshold, the blocking functions follow a fire-and-forget protocol, where the message is sent in a unique burst without any receiver acknowledgment and the send function blocks until the send buffer is free. If the message size is increased beyond the default threshold, the local node's TCP/IP buffer overflows and the send function uses a rendezvous protocol. In this case, the sender node sends a request message to notify the remote node that it wants to begin a communication process, remaining blocked until an acknowledgment message is received from the remote node. After this acknowledgment is received, the message is sent using the procedure described above. Therefore, above the 64 KB threshold, the send function changes its behavior, becoming synchronized and having the communication procedure wait until both processors are ready. The main reason behind the rendezvous protocol is the desire to avoid additional memcpy/allocations on the receiver node when the message size is too large. However, the 64 KB TCP buffer threshold is a software limitation which can easily be changed using suitable arguments on the LAM/MPI functions. This is an important issue that has to be taken into account whenever a new benchmark is developed.

Benchmark 2: Source code of the benchmark for measuring the gap between messages.
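Again, the original listing is not reproduced here; the following MPI/C sketch only illustrates the structure described in the text for Benchmark 2: one process sends 50 consecutive short messages while the response time of each blocking MPI_Send is recorded. Names and constants are hypothetical.

#include <mpi.h>
#include <stdio.h>

#define NUM_MSGS 50     /* consecutive short messages, as described in the text */
#define MSG_SIZE 1      /* one-byte messages                                    */

int main(int argc, char *argv[]) {
    int rank;
    char buf[MSG_SIZE];
    double t[NUM_MSGS], t0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send the messages back to back, timing each blocking send; growing
         * send times would reveal that the node is hitting its gap limit.    */
        for (int i = 0; i < NUM_MSGS; i++) {
            t0 = MPI_Wtime();
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            t[i] = MPI_Wtime() - t0;
        }
        for (int i = 0; i < NUM_MSGS; i++)
            printf("message %2d: send time %g s\n", i, t[i]);
    } else if (rank == 1) {
        for (int i = 0; i < NUM_MSGS; i++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
    }

    MPI_Finalize();
    return 0;
}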

4.3 Overhead
The sending overhead is evaluated by measuring the response time of an MPI blocking send function in process_1 of the first benchmark. The results achieved are presented in Tables 2 and 3. These results show several aspects:

- Given the fact that sender and receiver overheads are quite different, having a different parameter for each of them increases the model's accuracy. The differences between these two parameters come from the different operations that the send and receive functions have to perform in MPI.
- The actual value of both sender and receiver overheads depends on the nodes' processor, but not on the network: nodes with identical processors have similar values, even though they are interconnected in different ways. Furthermore, the difference between nodes with different processors is very large, which shows that heterogeneity is an important feature that should be taken into account in any parallel computational model.
- The values of both sender and receiver overheads also depend on the message size. For message sizes between 1 and 1,000 bytes, the overhead figures stay almost constant. For message sizes above 1,000 bytes, the overheads increase almost linearly. As a consequence, for the system under consideration, the overhead should be considered a function of the message's size, having a constant fraction that predominates for messages shorter than some hundred bytes and a linear term which becomes apparent for longer messages.

TABLE 2 Sender Overhead in Microseconds for Different Node Configuration and Message Sizes

TABLE 3 Receiver Overhead in Microseconds for Different Node Configuration and Message Sizes

TABLE 4 Communication Latency between Different Nodes, Given in Microseconds

4.4 Latency
Estimating latency is not easy. In this paper, it has been done by measuring the period of time needed to transmit a unit message from the sender to the receiver node in a point-to-point communication. This is done by the MPI_Probe function in process_1 of the first benchmark. After a synchronization barrier, process_2 sends a message and process_1 measures the time taken by the message to arrive (the difference between the start_L and stop_L marks in process_1). By definition, the latency parameter in the LogGP model is an upper bound of the delay associated with the transmission of a single word message from its source to its target node. Therefore, it is independent of the message size and it has been measured only for one-word messages.

Table 4 shows estimated latency values for each kind of node and communication path. From the results shown in Table 4, the following aspects can be pointed out:

- Communication latencies are largely independent of processor features. This can be shown by comparing columns in Table 4 corresponding to nodes with different processors but similar network connections, or the other way around.
- Finally, communication latencies do depend on the communication path, as also depicted in Table 4: with the current configuration, there are three paths in the cluster: switch-switch, switch-hub, and hub-hub. Considering these values and the cluster's topology, the latency matrix can be estimated. The smallest latency has been obtained for nodes connected to the same switch. The second best value corresponds to nodes connected to the hub since this device has more internal delay. Finally, the worst value is obtained when the messages have to cross both the switch and the hub because each device adds some time to the final latency figure.


Fig. 1. Response times for (a) switch-hub and (b) switch-switch connections.

4.5 Gap between Messages
This parameter depends on each node's capability of transferring consecutive messages, therefore being determined by the network interface (NIC) capacity and the interface between the NIC and the processor. In order to estimate the gap between messages, a set of 50 consecutive short messages were sent in Benchmark 2, measuring afterward the response time of the send function (MPI_Send) for each of them. The increment of the response time of the send function can give an estimation of the gap parameter; Fig. 1 shows the results achieved in these tests. It can be seen that the response time of the send function is similar in all cases, which implies that the gap parameter does not have any influence on the communication time in this case. Also, it can be seen that the communication path between nodes does not have a significant effect on the gap parameter. The reason for this behavior is that the sender overhead is the real bottleneck in this cluster and is always larger than the gap. Therefore, when a new message is ready for transmission, the last message has already been sent and there is no collision between consecutive messages. As a consequence, this parameter can be ignored without affecting the predictions' accuracy while using this cluster.

4.6 Gap per Byte
The inverse of the Gap per byte (1/G) characterizes the available per-processor communication bandwidth for long messages. For simplicity, the Gap per byte will be presented in this section in the form of bandwidth values. The bandwidth can be measured from the time needed by a long message to go from the sender to the receiver node, subtracting the sender overhead and latency. The results obtained point out that, in our case, the bandwidth is not node-dependent. Table 5 shows the results for fast processors. From these results, it is also noticeable that the bandwidth depends on the message's path. For short messages (up to 1,000 bytes), the maximum per-processor bandwidth is not reached, and the measured values are not representative of the G parameter. Differences begin to appear for messages longer than 1,000 bytes; in those cases, there is a large difference in bandwidth between the nodes connected to the switch (100 Mbps) and the nodes connected to the hub (10 Mbps). Table 5 shows that, when the message size is increased, the bandwidth approaches the theoretical maximum values. From these values, the Gap per byte matrix can be easily computed.
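As a small illustration of the computation described above, the C helpers below derive a Gap-per-byte entry from the measured end-to-end time of a k-byte message by subtracting the sender overhead and the latency; the function names and exact bookkeeping are assumptions made here, not code from the paper.

/* Hypothetical helper: Gap per byte G[i][j] derived from the measured
 * end-to-end time T of a k-byte message sent from node i to node j.      */
double gap_per_byte(double T, double Os_i, double L_ij, double k) {
    return (T - Os_i - L_ij) / (k - 1.0);
}

/* The corresponding per-link bandwidth (bytes per second) is the inverse. */
double bandwidth(double G_ij) {
    return 1.0 / G_ij;
}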

4.7 Computational Power
The last parameter for characterizing the cluster is the nodes' computational power. In order to estimate it, a benchmark has been executed a number of times in each node and an average response time has been calculated. The computational power is then obtained as the inverse of the average response time. This value depends both on the nodes' capabilities and on the selected benchmark. Table 6 shows the computational power vector for this cluster: figures in the computational power vector cluster around two values, representative of fast (c1-c4) and slow (c5-c8) nodes. The differences in computational power among nodes with theoretically similar capabilities are small, as expected (although noticeable since they correspond to different computers). But the differences among fast and slow processors are much higher (around 37.5 percent). Again, the processing power's heterogeneities are an important issue for accurate performance predictions.
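The estimation just described reduces to a one-line computation; the sketch below spells it out for a single node, assuming the response times of the repeated benchmark runs have already been collected. The names are hypothetical.

/* Hypothetical helper: computational power of one node, obtained as the
 * inverse of the average response time over `runs` benchmark executions. */
double computational_power(const double *response_times, int runs) {
    double sum = 0.0;
    for (int i = 0; i < runs; i++)
        sum += response_times[i];
    return (double)runs / sum;          /* P_i = 1 / average response time */
}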

TABLE 5 Bandwidth for Different Communication Paths (Bits per Second)

TABLE 6 Computational Power of the Cluster Nodes
The values correspond to the inverse of the benchmark's response times.

5 EXPERIMENTAL RESULTS

A practical experiment was set up in order to test the heterogeneous model's accuracy by comparing predictions and experimental results achieved using the cluster presented in the previous section. Three objectives were pursued in the tests presented here:

1. To verify experimentally that the heterogeneous model HLogGP introduced in this paper is accurate enough to predict the response time of a parallel program using a theoretical analysis which includes computation and communication time.
2. To verify that heterogeneity has a strong impact on system performance and that it is necessary to take it into consideration in order to ensure the model accuracy.
3. To show how the cluster parametrization may be used for determining the performance of a parallel program in a real application environment.

5.1 Experimental Setup
A volumetric magnetic resonance image compression application was selected and a theoretical response time analysis was performed based on the heterogeneous LogGP model and the parameters estimated for the cluster described above. After its implementation, real response time results were computed for a certain number of cases in order to validate the theoretical studies. The volumetric image is composed of a set of planar slices and the compression is performed in the three spatial dimensions (X, Y, and Z). The sequential process may be divided into the following stages:

1. Data acquisition: The image is stored offline in float16 format using two bytes per pixel.
2. Data read and memory allocation: It is necessary to allocate two different buffers, the first one in float16 format (2 bytes per pixel) for reading the saved image from hard disk and the second one in float format (4 bytes per pixel) for performing the compression operations.
3. Computation of the 3D Haar wavelet transform: To transform 3D signals, the 1D wavelet transform can be applied in all three dimensions separately, resulting in a 3D tensor product wavelet transform [31]. This operation has a complexity of $O(n)$, where n = (number of pixels in the X dimension) × (number of pixels in the Y dimension) × (number of pixels in the Z dimension).
4. Thresholding: The user may define a threshold for discarding the pixels smaller than it.
5. Encoding of the subbands using the run-length encoding (RLE) compression algorithm: Repetitive strings of data are replaced by a control character followed by the number of repeated characters and the repetitive character itself (a generic sketch of this scheme is given after the list).
6. Write back of the compressed image.
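The following C sketch is one possible byte-oriented realization of the run-length encoding scheme described in stage 5 (a control character, the repetition count, and the repeated character). The escape value, the run-length cut-off, and the assumption that the subband coefficients have already been serialized to bytes are choices made here for illustration, not details taken from the paper.

#include <stddef.h>

#define RLE_ESC 0x1B   /* hypothetical control character marking a run      */

/* Encode `n` input bytes into `out` (which must be able to hold up to 3*n
 * bytes in the worst case); returns the compressed size.  Long runs are
 * written as control character + count + repeated byte; short runs are
 * copied verbatim.                                                          */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        if (run >= 4 || in[i] == RLE_ESC) {   /* encode runs (and escapes)   */
            out[o++] = RLE_ESC;               /* control character           */
            out[o++] = (unsigned char)run;    /* number of repetitions       */
            out[o++] = in[i];                 /* the repeated character      */
        } else {                              /* short runs copied verbatim  */
            for (size_t r = 0; r < run; r++)
                out[o++] = in[i];
        }
        i += run;
    }
    return o;
}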

There are three stages which involve a much larger computational load than the others: the computation of the 3D Haar wavelet transform, the image thresholding, and the RLE compression algorithm for each of the subbands that compose the transformed image. The need to execute these steps in order, due to functional dependences, together with the large amount of data to be processed, requires solving the problem by exploiting data parallelism. This can be done just by dividing the workload among a set of independent nodes. Due to the complexity of the whole process, the parallelization strategy is based on a multiphase approach, where there are two tightly interleaved computational phases separated by synchronization points [34]. In this case, the first phase is the 3D Haar wavelet transform and the second phase involves the thresholding and the compression steps.

Initially, the volumetric image is stored in the master node. Let S be the number of slices in the image volume and N the number of nodes. In order to perform the wavelet transform in the spatial domain, the image is partitioned into N blocks, each of them having a number of whole slices. Then, each block is sent to a slave process. After receiving its blocks, each slave can start computing the wavelet transform of the image portion assigned to it, obtaining as a result a partial wavelet transform in the Z dimension.

After the wavelet transform is computed, each slave process has only a block of each of the transformed image's subbands. Before the run-length encoding can be done, it is necessary that each complete subband be available to some specific processor. As a consequence, the slave processors have to send back the partial results to the master node, which has to gather them and finish computing the wavelet transform at slice boundaries. Then, the master has to distribute each of the eight subbands of the transformed image (after one step of the wavelet transform) to a node. After each slave process receives a complete subband of the transformed image, it can start thresholding and compressing it. The resulting compressed subbands have to be sent back to the master, which has to set up the compressed image and write it back to disk.

Finally, the application has been programmed using the MPI libraries [23] (LAM/MPI 7.0.2 version from the Laboratory for Scientific Computing of Notre Dame University [30]).

5.2 Time Response Analysis
In this section, a theoretical analysis of the application's response time is presented. For simplicity, the application will be split into six stages (a sketch that combines the resulting cost expressions is given after the list):

1. First stage: The master distributes the raw data among the slave processors. If the workload were evenly distributed, the application response time would be determined by the slowest node, which would finish its work last. As a consequence, the size of each data package should take into consideration each node's own computational power. If S is the total number of slices and $P_T$ is the cluster's total computational power (excluding the master processor), each slave i with computational power $P_i$ will receive $S_i$ slices, according to the following expression:

   $S_i = \frac{S \cdot P_i}{P_T}$.   (1)

   This way, the master node performs an implicit static load balancing operation. As soon as the first message arrives at the first slave node, it starts computing the wavelet transform of its image slices while the remaining slices are being transferred to the other nodes. As a consequence, the time involved in the transmission of these packages should not be taken into consideration since it will be included in the second stage, where the wavelet transform is computed. The delays involved in the operation of the last node will be considered later on. Following the LogGP model, sending a message of k bytes involves first $Os$ cycles of sending overhead to get the first byte into the network. Subsequent bytes take G cycles each to be sent. The last byte is sent at time $Os + (k - 1) \cdot G$. Each byte travels through the network for L cycles and, finally, the receiving processor spends $Or$ cycles in receiving overhead. Therefore, the whole message is available at the receiving processor at time $Os + (k - 1) \cdot G + L + Or$. The sending and the receiving processors are only busy during the overhead time; the rest of the time they can overlap computation and communication. Therefore, the total time for this stage is:

   $T_1 = Os_m + (S_1 - 1) \cdot G_{m,s_1} + L_{m,s_1} + Or_{s_1}$,   (2)

   where $S_1$ is the number of slices sent to the first slave according to (1).

2. Second stage: In this case, the response time is the time spent by the last slave to finish its work. The implicit load distribution performed in the first stage implies that all of the cluster nodes take a similar time for computing the wavelet transform. Therefore, the total response time for the second stage is estimated as the response time of a generic slave processor:

   $T_2 = \frac{S_i}{P_i} = \frac{S \cdot P_i}{P_T \cdot P_i} \Rightarrow T_2 = \frac{S}{P_T}$.   (3)

   For this stage, the nodes' computational power has been measured as the inverse of the time taken by each of them to calculate the Haar wavelet transform of a single slice.

3. Third stage: During this stage, the master process has to first gather the partial results produced by all of the slave processes. Afterward, it has to generate the whole transformation. It was shown above that sending a k-byte-long message takes $Os + (k - 1) \cdot G + L + Or$ time. During this stage, there are N slave processes sending messages to the master node and, therefore, some degree of network contention is expected. Nevertheless, if the implicit load balancing operation performed in stage 1 is adequate, each node should finish computing its portion of the wavelet transform approximately at the same time as the previous node has just finished sending back its results, therefore minimizing the amount of network contention. Due to the differences in the overhead and latency parameters of each node, the total response time of the third stage is calculated as a summation:

   $T_3 = \sum_{i=1}^{N} \left( Os_{s_i} + (S_i - 1) \cdot G_{s_i,m} + L_{s_i,m} + Or_m \right)$.   (4)

4. Fourth stage: This stage is similar to the first stage because the master process has to send an image subband to each of the slave processes. The only difference is that, in this case, the message size is the same for all of the slave nodes, being given by the subband size (16 MB). Also, the number of nodes involved in this stage might be smaller than N. Therefore, the total time for this stage is:

   $T_4 = Os_m + (k - 1) \cdot G_{m,s_1} + L_{m,s_1} + Or_{s_1}$,   (5)

   where k = 16 MB.

5. Fifth stage: This stage is similar to the second, although, in this case, the amount of work is not distributed according to the nodes' computational power. During Stage 5, the amount of work at each slave process is the same and is given by the subband size. Therefore, the total time for this stage will be given by the slowest slave. This time could be given approximately by the following expression:

   $T_5 = \frac{\text{Subband size}}{\min_{i=1}^{N} P_i}$.   (6)

   In this case, the computational power of each node has been measured as the inverse of the time needed to process the RLE encoding algorithm on a certain number of elements. Using different $P_i$ figures for Stages 2 and 5 improves the accuracy; if the computations to be performed are similar, a single set of $P_i$ values could be used.

6. Sixth stage: This stage is similar to the third stage, although, in this case, the message sizes cannot be determined a priori because they depend on the compression ratio reached by the RLE algorithm on each subband. As a consequence, it is only possible to give an upper bound by estimating the total time needed to transmit the data when there is no compression:

   $T_6 = \sum_{i=1}^{N} \left( Os_{s_i} + (k - 1) \cdot G_{s_i,m} + L_{s_i,m} + Or_m \right)$,   (7)

   where k is determined by the subband size.
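As announced above, the sketch below combines the six expressions into a single response-time estimate. It is only an illustration of how the formulas fit together: the array layout, the choice of expressing all message sizes in bytes, and the fixed number of slaves are assumptions made here, not details prescribed by the paper.

#define N 7   /* hypothetical number of slave nodes (the master is index 0) */

/* LogGP-style cost of a k-byte message between two nodes. */
static double send_time(double Os_i, double G_ij, double L_ij, double Or_j,
                        double k) {
    return Os_i + (k - 1.0) * G_ij + L_ij + Or_j;
}

/* Predicted response time over the six stages of Section 5.2.  Si[i] holds
 * the slices assigned to slave i (index 0 unused); Pw and Pr are the powers
 * measured for the wavelet and RLE phases, respectively.                   */
double predicted_response_time(double slices, double slice_bytes,
                               double subband_bytes,
                               const double Si[N + 1],
                               const double Os[N + 1], const double Or[N + 1],
                               const double L[N + 1][N + 1],
                               const double G[N + 1][N + 1],
                               const double Pw[N + 1], const double Pr[N + 1]) {
    double PT = 0.0, T[7] = { 0.0 };
    for (int i = 1; i <= N; i++)
        PT += Pw[i];

    /* Stage 1: first block sent to the first slave (the rest overlaps stage 2). */
    T[1] = send_time(Os[0], G[0][1], L[0][1], Or[1], Si[1] * slice_bytes);

    /* Stage 2: implicit load balancing makes every slave finish at about S/PT.  */
    T[2] = slices / PT;

    /* Stage 3: partial results gathered from all slaves (summed, eq. (4)).      */
    for (int i = 1; i <= N; i++)
        T[3] += send_time(Os[i], G[i][0], L[i][0], Or[0], Si[i] * slice_bytes);

    /* Stage 4: one fixed-size subband sent to the first slave (eq. (5)).        */
    T[4] = send_time(Os[0], G[0][1], L[0][1], Or[1], subband_bytes);

    /* Stage 5: equal work per slave, so the slowest node dominates (eq. (6)).   */
    double Pmin = Pr[1];
    for (int i = 2; i <= N; i++)
        if (Pr[i] < Pmin)
            Pmin = Pr[i];
    T[5] = subband_bytes / Pmin;

    /* Stage 6: compressed subbands returned; upper bound without compression.   */
    for (int i = 1; i <= N; i++)
        T[6] += send_time(Os[i], G[i][0], L[i][0], Or[0], subband_bytes);

    double total = 0.0;
    for (int s = 1; s <= 6; s++)
        total += T[s];
    return total;
}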


TABLE 7 Theoretically Estimated and Experimentally Measured Execution Times Using HLogGP Model, Expressed in Seconds

5.3 Execution Results
The HLogGP model has been developed taking into consideration the different aspects which affect the performance of heterogeneous parallel systems. In order to assess how useful this model is, it is interesting to compare real execution times with those predicted using the HLogGP model. For that purpose, the application described in Section 5.1 has been executed on the cluster presented in Section 4. The workload is a volumetric image composed of 128 slices of 512 × 512 pixels. Using these data, it is possible to predict the system's response time. The comparison of this prediction with the experimental times can give an indication of the model's accuracy.

Table 7 lists the predicted and measured times for a volumetric image. The excellent correlation between the estimated and experimental times has to be noted, with a relative error around 8.3 percent in the worst case and 0.5 percent in the best case for each stage. The relative error on the total application response time is less than 1 percent. These values show an excellent model accuracy for a cluster which is heterogeneous both in node computational power and in communication network.

TABLE 8 Theoretically Estimated and Experimentally Measured Execution Times Using LogGP Model, without Taking into Account Cluster Heterogeneity, Expressed in Seconds

With respect to the necessity of addressing heterogeneity issues specifically, Table 8 shows the estimated times if the LogGP model's values are estimated without taking cluster heterogeneity into account. It should be pointed out that the errors here are very high, especially in the communication-intensive stages, due to the large difference in network bandwidth capabilities. Regarding the compute-intensive stages, the execution times in different nodes can also be widely different, depending on each node's particular features and the task to be performed. Thus, depending on how each kind of node takes part in a parallel task, the differences between theoretical estimations and experimental results can grow very strongly whenever heterogeneity is not considered in the theoretical model.

Another interesting aspect is that the HLogGP model allows programmers to optimize the application's response time. For example, during the fourth stage, the workload is distributed evenly among nodes by assigning them exactly the same amount of data. Since the data distribution is performed in a sequential manner, it is interesting to send the data corresponding to the fastest nodes last so that the slower ones can start processing their data while the faster ones are waiting to be served. Also, the part of the wavelet-transformed image corresponding to the high-pass filter will show much fewer coefficients that fall below the threshold and, therefore, will achieve a much smaller compression. Possibly some of the cluster nodes will be best fit for performing this task and, therefore, should be in charge of it. But all of this node ranking and discrimination process can be performed only from a heterogeneity-based approach because it will precisely take advantage of node heterogeneity to improve the response time.

6 CONCLUSIONS AND FUTURE WORK

In this paper, a new computational model for heterogeneous clusters has been proposed and validated, using a single example. Other examples using different parallel applications can be found in [15]. The model can be applied to heterogeneous clusters where either the nodes, the interconnection network, or both are heterogeneous. It can also be applied to other parallel systems which show a certain degree of heterogeneity. The proposed method extends the LogGP model by replacing scalar parameters by vector parameters for overhead and gap and by matrix parameters for latency and Gap. The number of system processors is replaced by a computational power vector, which describes the computational power of every node within the cluster.

In order to test the proposed model, a heterogeneous cluster has been parametrized, measuring the different vector and matrix parameters by means of a set of benchmarks especially developed for this purpose. The parametrization results show that there are strong differences between the values corresponding to different processors and interconnection networks. The experiments carried out in this paper also show that the proposed model can accurately predict the behavior of heterogeneous clusters, independently of the source of heterogeneity (the nodes' processing power or the interconnection network's features). Additionally, the experiments conducted with HLogGP show that estimating the parameters that allow customizing the model for adapting it to a particular machine can be done through simple procedures, yielding values that permit performing accurate predictions.

The availability of accurate models such as the one presented here allows parallel application developers to perform theoretical analysis of algorithm performance over heterogeneous clusters. These studies allow the comparison of different algorithms or implementations, the detection of bottlenecks for optimization issues, or the performance of static process and data assignments in order to optimize the system performance.

Future work in this area includes considering the effects of network traffic since the proposed model so far can neither manage collisions nor situations with heavy network traffic, just as LogGP does. Another interesting issue to be analyzed is the behavior of collective communication operations, which need synchronization of all of the cluster nodes. Additionally, an automated procedure for estimating model parameters would be useful for modeling large systems.

ACKNOWLEDGMENTS

This work has been partially funded by the Spanish Commission for Science and Technology (grant TIC200308933-C02) and Madrid's community GATARVISA group. The authors wish to thank the referees for their helpful comments and suggestions.

REFERENCES

[1] A. Aggarwal, A.K. Chandra, and M. Snir, "On Communication Latency in PRAM Computations," Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 11-21, June 1989, preliminary version.
[2] A. Aggarwal, A.K. Chandra, and M. Snir, "Communication Complexity of PRAMs," Theoretical Computer Science, vol. 71, no. 1, pp. 3-28, Mar. 1990.
[3] A. Alexandrov, M. Ionescu, K.E. Schauser, and C. Scheiman, "LogGP: Incorporating Long Messages into the LogP Model—One Step Closer towards a Realistic Model for Parallel Computation," Proc. Seventh Ann. ACM Symp. Parallel Algorithms and Architectures (SPAA '95), pp. 95-105, July 1995.
[4] A. Alexandrov, M.F. Ionescu, K.E. Schauser, and C. Scheiman, "LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation," J. Parallel and Distributed Computing, vol. 44, no. 1, pp. 71-79, July 1997.
[5] B. Awerbuch, Y. Azar, A. Fiat, and T. Leighton, "Making Commitments in the Face of Uncertainty: How to Pick a Winner Almost Every Time (Extended Abstract)," Proc. 28th Ann. ACM Symp. Theory of Computing, pp. 519-530, May 1996.
[6] M. Banikazemi, V. Moorthy, and D.K. Panda, "Efficient Collective Communication on Heterogeneous Networks of Workstations," Proc. 27th Int'l Conf. Parallel Processing (ICPP '98), Aug. 1998.
[7] A. Bar-Noy and S. Kipnis, "Designing Broadcasting Algorithms in the Postal Model for Message-Passing Systems," Proc. Fourth Ann. ACM Symp. Parallel Algorithms and Architectures (SPAA '92), pp. 13-22, June 1992.
[8] G. Bell and J. Gray, "What's Next in High-Performance Computing?" Comm. ACM, vol. 45, no. 2, pp. 91-95, Feb. 2002.
[9] P. Bhat, C.S. Raghavendra, and V. Prasanna, "Efficient Collective Communication in Distributed Heterogeneous Systems," Proc. 19th Int'l Conf. Distributed Computing Systems (ICDCS '99), May 1999.
[10] P.B. Bhat, V.K. Prasanna, and C.S. Raghavendra, "Adaptive Communication Algorithms for Distributed Heterogeneous Systems," J. Parallel and Distributed Computing, vol. 59, no. 2, pp. 252-279, Nov. 1999.
[11] S.N. Bhatt, F.R.K. Chung, F.T. Leighton, and A.L. Rosenberg, "On Optimal Strategies for Cycle-Stealing in Networks of Workstations," IEEE Trans. Computers, vol. 46, no. 5, pp. 545-557, May 1997.
[12] G. Bilardi, K.T. Herley, A. Pietracaprina, and G. Pucci, "On Stalling in LogP," Lecture Notes in Computer Science, vol. 1800, 2000.
[13] G.E. Blelloch, Vector Models for Data-Parallel Computing. MIT Press, 1990.
[14] R.D. Blumofe and D.S. Park, "Scheduling Large-Scale Parallel Computations on Networks of Workstations," Proc. Third Int'l Symp. High-Performance Distributed Computing, pp. 96-105, Aug. 1994.
[15] J.L. Bosque and L. Pastor, "Hloggp: A New Parallel Computational Model for Heterogeneous Clusters," Proc. IEEE/ACM Int'l Conf. Cluster Computing and the Grid, Apr. 2004.
[16] F. Cappello, P. Fraigniaud, B. Mans, and A.L. Rosenberg, "HiHCoHP: Toward a Realistic Communication Model for Hierarchical HyperClusters of Heterogeneous Processors," Proc. 15th Int'l Parallel and Distributed Processing Symp. (IPDPS '01), pp. 42-42, Apr. 2001.
[17] R. Cole and O. Zajicek, "The APRAM: Incorporating Asynchrony into the PRAM Model," Proc. First Ann. ACM Symp. Parallel Algorithms and Architectures, pp. 169-178, June 1989.

[18] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation," Proc. Fourth ACM SIGPLAN Symp. Principles & Practice of Parallel Programming (PPOPP '90), ACM SIGPLAN Notices, pp. 1-12, July 1993.
[19] D.E. Culler, L.T. Liu, R.P. Martin, and C. Yoshikawa, "LogP Performance Assessment of Fast Network Interfaces," IEEE Micro, Feb. 1996.
[20] S.R. Donaldson, J.M.D. Hill, and D.B. Skillicorn, "Predictable Communication on Unpredictable Networks: Implementing BSP over TCP/IP and UDP/IP," Concurrency: Practice and Experience, vol. 11, no. 11, pp. 687-700, Sept. 1999.
[21] A.C. Dusseau, D.E. Culler, K.E. Schauser, and R.P. Martin, "Fast Parallel Sorting under LogP: Experience with the CM-5," IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 8, pp. 791-805, Aug. 1996.
[22] S. Fortune and J. Wyllie, "Parallelism in Random Access Machines," Proc. 10th ACM Symp. Theory of Computing, pp. 114-118, 1978.
[23] MPI Forum, "A Message-Passing Interface Standard," 1995, http://www.mpi-forum.org.
[24] P.B. Gibbons, "A More Practical PRAM Model," Proc. First Ann. ACM Symp. Parallel Algorithms and Architectures, pp. 158-168, June 1989.
[25] B.H.H. Juurlink and H.A.G. Wijshoff, "A Quantitative Comparison of Parallel Computation Models," ACM Trans. Computer Systems, vol. 16, no. 3, pp. 271-318, Aug. 1998.
[26] T. Kalinowski, I. Kort, and D. Trystram, "List Scheduling of General Task Graphs under LogP," Parallel Computing, vol. 26, no. 9, pp. 1109-1128, July 2000.
[27] T. Kielmann, H.E. Bal, and K. Verstoep, "Fast Measurement of LogP Parameters for Message Passing Platforms," Lecture Notes in Computer Science, vol. 1800, 2000.
[28] W. Löwe and W. Zimmermann, "Scheduling Balanced Task-Graphs to LogP-Machines," Parallel Computing, vol. 26, no. 9, pp. 1083-1108, July 2000.
[29] A.L. Rosenberg, "Sharing Partitionable Workloads in Heterogeneous NOWs: Greedier Is Not Better," Proc. Third IEEE Int'l Conf. Cluster Computing (Cluster '01), pp. 12-131, 2001.
[30] J.M. Squyres, K.L. Meyer, M. McNally, and A. Lumsdaine, LAM/MPI User Guide, 1998.
[31] E.J. Stollnitz, T.D. DeRose, and D.H. Salesin, Wavelets for Computer Graphics: Theory and Applications. Morgan Kaufmann, 1996.
[32] L.G. Valiant, "A Bridging Model for Parallel Computation," Comm. ACM, vol. 22, no. 8, pp. 103-111, Aug. 1990.
[33] J. Verriet, "Scheduling Outtrees of Height One in the LogP Model," Parallel Computing, vol. 26, no. 9, pp. 1065-1082, July 2000.
[34] J. Watts, M. Rieffel, and S. Taylor, "A Load Balancing Technique for Multiphase Computations," Proc. High Performance Computing Conf., pp. 15-20, 1997.

Jose L. Bosque graduated in computer engineering from the Universidad Politécnica de Madrid in 1994. He received the PhD degree in computer science from the Universidad Politécnica de Madrid in 2003. He has been an associate professor at the Universidad Rey Juan Carlos in Madrid, Spain, since 1998. His research interests are parallel and distributed processing, performance and scalability evaluation, and load balancing.

Luis Pastor (BSEE 80, MSEE 83, PhD 85) is a professor at the Rey Juan Carlos University. His research interests include parallel algorithms and architectures and VR and imaging applications.