Optimization for Hybrid MPI-OpenMP Programs on a Cluster of SMP PCs

TA QUOC VIET, TSUTOMU YOSHINAGA, YOSHIO OGAWA, BEN A. ABDERAZEK AND MASAHIRO SOWA†

This paper applies a hybrid MPI-OpenMP programming model with a thread-to-thread communication method on a cluster of dual Intel Xeon processor SMPs connected by a Gigabit Ethernet network. The experiments include the well-known HPL and CG benchmarks. We also describe optimization techniques for achieving a high cache hit ratio on the given architecture. As a result, the hybrid model outperforms the pure MPI model by about 27% for CG and 12% for HPL. In addition, with a relatively small programming effort, we have succeeded in reducing the cache miss ratio and thereby raising the performance of the CG benchmark by as much as 4.5 times in some cases.

† The Graduate School of Information Systems, University of Electro-Communications

1. Introduction

Pure MPI and hybrid MPI-OpenMP are currently the strongest candidates for a programming model on a cluster of SMPs. In a previous work, we demonstrated the advantages of a hybrid model with a thread-to-thread communication method (the hybrid TC model) over a pure MPI model by solving a dense linear system on a cluster of Sun Enterprise SMP workstations connected by a Fast Ethernet network [1]. In this study, we continue to present the advantages of the hybrid TC model on a more up-to-date architecture, a cluster of Intel-based SMP PCs. Clusters of SMP PCs have gained wide popularity in the high-performance computing world for their cost-efficiency as well as their simplicity of construction, operation, and management. The bottleneck of such a cluster is, in many cases, the limited bandwidth of the main memory bus [2]. For this reason, in applications requiring frequent memory access, such as the CG benchmark, using both processors of a node gains no computation benefit over using a single one. Hence, to achieve good performance on such a cluster, decreasing memory access by increasing the cache hit ratio is one of the most important factors. This study presents a technique to increase the cache hit ratio in sub-section 4.2.4.

For the experiments, we chose two contrasting problems from two very popular parallel benchmark suites, HPL and NAS CG. HPL solves a random dense linear equation system using a block LU decomposition algorithm, whereas CG uses the inverse power method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix. While HPL targets a dense linear system, the heaviest part of CG is to solve a sparse one.

The original solutions for both HPL and CG were implemented with a pure MPI model and are referred to as "MPI" in this paper. The MPI model's contenders are "hybrid PC" and "hybrid TC", which stand for "hybrid MPI-OpenMP with process-to-process communication" and "hybrid MPI-OpenMP with thread-to-thread communication", respectively. The hybrid solutions are based on the original MPI code with as few changes as possible.

The problem sizes in this study are limited by the computation capacity of the hardware. For CG, the problem size is specified by classes. While class A is too small for our system, classes C and D are too large; we therefore use class B for the experiments. With HPL, the matrix size varies between 16000 and 20000 rows (columns). Since the 2GB of main memory of a single node cannot accommodate such a large problem, the HPL experiments are performed with two nodes or more.

2. Environment Specification

2.1 Hardware Configuration

Our cluster consists of 8 dual Intel Xeon 2.8GHz processor nodes connected via a Gigabit Ethernet network. Each processor has a 512KB level-two (L2) cache, and each node has 2GB of DDR-SDRAM main memory. The load latencies of the L2 cache and main memory are 18 and 370 clock cycles, respectively.

2.2 Software Configuration

The cluster runs Red Hat Linux 8.0 with kernel 2.4.18-24.8.0 as the operating system. MPI is implemented with MPICH 1.2.5.2 [3]. OpenMP is supported by the Intel C and Fortran Compilers 7.1 for Linux. ATLAS 3.4.1 [4] supplies the BLAS library required by HPL, and the PAPI 3.0 beta version [5] provides access to the hardware performance counters for performance analysis and optimization.

2.3 Hyper-Threading Efficiency

With CG, Hyper-Threading (HT) Technology shows no benefit. A possible reason is the memory bus bottleneck of the dual-processor Intel node architecture: the computation flow is stalled at this bottleneck no matter how the HT function is set. The CG results presented in this study are therefore obtained with HT disabled.

With HPL, HT in most cases gives only a small improvement, for both the MPI and the hybrid solutions. The computation part of HPL contains a large amount of matrix-matrix multiplication, in which matrix elements are accessed several times consecutively; this yields a high cache hit ratio and probably enables the HT Technology to be effective. Table 1 shows the performance improvement obtained by enabling the HT feature. The data were taken with a matrix size of 20000 and an optimal block size. The hyphen for MPI with 2 nodes indicates that HT is worthless there.

Table 1  Hyper-Threading efficiency for HPL

  Nodes    MPI      Hybrid TC
  2        -        7.63%
  3        0.50%    3.18%
  4        1.20%    2.68%
  5        4.10%    1.90%
  6        5.69%    3.30%
  7        9.08%    6.35%
  8        0.94%    8.25%

This study does not aim to discuss HT in depth; hereafter, all the results presented in this paper are the better of those obtained with and without this feature.

3. Pure MPI and Hybrid MPI-OpenMP

3.1 Key Factors in Parallel Performance

Obviously, the total number of executed instructions directly affects the overall performance of any problem on any platform. In this study, however, since we keep almost the whole original algorithm for both HPL and CG, this number does not change much, with a small exception when we optimize a subroutine of CG. This exception is discussed in sub-section 4.2.4. The number of executed instructions can be determined through the "preset event" PAPI_TOT_INS of PAPI, the Performance Application Programming Interface.

The total number of main memory accesses is also extremely important, especially for Intel PC clusters, where the memory bus bandwidth is the main bottleneck degrading performance in almost all cases. To examine main memory accesses, we use the PAPI_L2_LDM (level-2 load misses) and PAPI_TLB_TL (total translation look-aside buffer misses) preset events.
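As an illustration only, such counters can be read with PAPI's high-level Fortran interface roughly as sketched below. This is a minimal sketch assuming the PAPI 3.x Fortran wrappers (PAPIF_start_counters / PAPIF_stop_counters) and the f77papi.h include; it is not the instrumentation actually used in the paper.

      program count_example
c     Minimal sketch: count executed instructions and L2 load misses
c     around a kernel, assuming the PAPI 3.x high-level Fortran API.
      implicit none
      include 'f77papi.h'
      integer nev
      parameter (nev = 2)
      integer events(nev), check
      integer*8 values(nev)

      events(1) = PAPI_TOT_INS
      events(2) = PAPI_L2_LDM
      call PAPIF_start_counters(events, nev, check)

c     ... the code section being measured goes here ...

      call PAPIF_stop_counters(values, nev, check)
      print *, 'instructions  :', values(1)
      print *, 'L2 load misses:', values(2)
      end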

Besides computation, communication has a considerable impact on parallel performance. Communication time includes the communication itself (sending and receiving time) and synchronization time (waiting time). The time spent on communication is a common limit on the scalability of a parallel algorithm, and communication costs are the major difference between the MPI and hybrid models.

3.2 Why Hybrid?

A cluster of SMP nodes combines the distributed and shared memory models, so a hybrid model naturally seems suitable for such a platform. A hybrid model has a hierarchical structure: each node is mapped to an MPI process, and each processor element is responsible for an OpenMP thread. While inter-node communication is realized by MPI communication functions, intra-node communication is unnecessary, since all the processors of a node can directly access its shared memory space. The number of MPI processes in a hybrid model is significantly decreased, being equal to the number of nodes rather than to the total number of processors as in the MPI model [1][6][7].

In practice, a hybrid model can demonstrate advantages over a pure MPI model, and the difference usually comes from communication costs. First, the communication volume decreases significantly with the hybrid model: all the intra-node communication required by the pure MPI model is eliminated, and inter-node communication is usually reduced as well when the number of MPI processes decreases. Furthermore, scalability is a potential advantage of the hybrid model. Parallel efficiency usually drops as the number of MPI processes grows, so a hybrid model tends to reach the "peak parallel performance point" later than its opponent, where the "peak parallel performance point" is the point at which additional processors can no longer improve overall performance. However, a hybrid model also has drawbacks. A hybrid code is obviously more complicated than an MPI version, and in order to realize all the potential advantages, an adjustment in task scheduling is also needed.

3.3 Hybrid PC and Hybrid TC

Some previous studies suggested a hybrid model in a relatively simple form, the hybrid PC, which is illustrated in figure 1. A hybrid PC solution can easily be generated from a pure MPI source code by parallelizing the computation tasks on each node with OpenMP directives.

With such a code amendment, all the communication tasks are located outside the OpenMP parallel regions.

Figure 1  The hybrid PC model (one MPI process per SMP node; OpenMP threads fork for computation and join before communication)
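To make the idea concrete, a hybrid PC version of a loop is obtained simply by adding an OpenMP work-sharing directive to the computation and leaving the MPI calls outside the parallel region. The sketch below is illustrative only; the routine and variable names are hypothetical and are not taken from HPL or CG.

      subroutine hybrid_pc_step(x, y, n, partner)
c     Illustrative hybrid PC pattern: OpenMP parallelizes the local
c     computation, while MPI communication stays outside the parallel
c     region and is performed by the (single-threaded) MPI process.
      implicit none
      include 'mpif.h'
      integer n, partner, ierr, i
      integer status(MPI_STATUS_SIZE)
      double precision x(n), y(n)

c     Computation: all processors of the node share the loop
!$omp parallel do private(i)
      do i = 1, n
         y(i) = 2.0d0*x(i) + y(i)
      enddo
!$omp end parallel do

c     Communication: performed outside the OpenMP region
      call mpi_sendrecv(y, n, MPI_DOUBLE_PRECISION, partner, 0,
     &                  x, n, MPI_DOUBLE_PRECISION, partner, 0,
     &                  MPI_COMM_WORLD, status, ierr)
      end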

Figure 2 shows the hybrid TC model, a small adjustment of the hybrid PC model: the communication tasks are moved inside the OpenMP "fork and join" parallel regions. To cope with the possible lack of thread safety in MPI, within an OpenMP parallel region only one thread per node is allowed to call MPI communication functions. This can be implemented with the OpenMP SINGLE or MASTER pragmas.

Figure 2  The hybrid TC model (one MPI process per SMP node; communication is performed by a single thread inside the OpenMP parallel region)
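A minimal sketch of this pattern is given below: one thread of the team performs an MPI reduction while the others continue with independent computation. The routine and variable names are hypothetical; the paper's actual schedules for CG and HPL appear in figures 10 and 12.

      subroutine hybrid_tc_step(part, total, n, z, m)
c     Illustrative hybrid TC pattern: the master thread calls MPI from
c     inside the parallel region (funneled communication), while the
c     remaining threads of the node execute an independent loop.
      implicit none
      include 'mpif.h'
      integer n, m, ierr, i
      double precision part(n), total(n), z(m)

!$omp parallel private(i)
!$omp master
c     Only one thread per process touches MPI
      call mpi_allreduce(part, total, n, MPI_DOUBLE_PRECISION,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
!$omp end master
c     Work that does not depend on the reduction result
!$omp do
      do i = 1, m
         z(i) = z(i)*0.5d0
      enddo
!$omp end do
!$omp end parallel
      end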

The idea behind this adjustment of the communication part is to salvage computation resources during communication time. In the hybrid PC model, nodes communicate through their MPI processes, and during a communication session all the processors except the one actually performing the communication must idle. With the hybrid TC model, when one processor performs a communication task, the others are free to execute computation duties. However, with the original pure MPI task schedule it is usually impossible to find tasks to assign to the non-communicating threads, so an adjustment in task scheduling is necessary. We describe such adjustments for CG and HPL in sections 4.4.2 and 5.3, respectively.

4. CG Optimization

This section presents optimization techniques applied to both computation and communication. These techniques are applicable to both the MPI and the hybrid models.

4.1 Problem Description

4.1.1 Problem Definition

The CG benchmark uses the inverse power method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix with a random pattern of nonzeros. Since the heaviest part of the algorithm is solving a sparse linear system by the conjugate gradient method, the benchmark is named "CG". The sparse matrix has n rows and n columns, where n=75000 for class B [8].

4.1.2 Process Grid

MPI CG accepts only a power-of-two number of processes num_procs, which are mapped onto a process grid npcols * nprows, where num_procs = npcols * nprows. If num_procs is a square, npcols and nprows are equal; otherwise, npcols = 2 * nprows.

4.1.3 Computation Pattern

Each process stores and operates on an (n/nprows)-by-(n/npcols) sparse sub-matrix A, so the computation volume is equally distributed among the processes. Performance analysis shows that more than 90% of the computation cost is spent on a matrix-vector multiplication [9]. Figure 3 shows the part of the original Fortran source code that multiplies the matrix A by a vector p and stores the result in a vector w.

      do j=1,lastrow-firstrow+1
         sum=0.d0
         do k=rowstr(j), rowstr(j+1)-1
            sum=sum+a(k)*p(colidx(k))
         enddo
         w(j)=sum
      enddo

Figure 3  The w=Ap part of the original MPI CG

Table 2  MPI CG, process grid and communication costs for a single iteration

  Nprocs   2      4      8      16     32
  Grid     2*1    2*2    4*2    4*4    8*4
  Msize    0.5    0.5    0.25   0.25   0.125
  Msgs     1      2      3      3      4
  Costs    0.5    1      0.75   0.75   0.5

4.1.4 Communication Pattern

In CG, processes communicate by exchanging messages of both small and large sizes. Since large messages dominate the CG communication time, especially when the number of processes is small [7], we focus on the large messages only; these are implemented with nonblocking communication in the original MPI CG. In a single iteration, a process exchanges data with its partners. The sending and receiving messages have the same length, msize, equal to 1/npcols times n double precision numbers, where n is the number of rows (columns) of the matrix A; for class B, n=75000. The number of messages to be exchanged is msgs = log2(npcols) + 1. With nprocs = 2, a process exchanges one message with itself, and this message should not be counted. The process grids and the communication costs for a single iteration are listed in table 2. The unit of measurement for message size is n double precision numbers, and the communication cost of a process is the number of messages multiplied by the message size. For example, with nprocs = 8 the grid is 4*2, so msize = 0.25 and msgs = log2(4) + 1 = 3, giving a cost of 0.75.

4.1.5 MPI Data Exchange Bandwidth

Bandwidth is measured as the speed at which processors exchange data. With a 40KB message, the intra-node bandwidth is about 2.5Gbit per second for both blocking and nonblocking communication. Intra-node communication appears only in a pure MPI solution.

Figure 4 presents the inter-node data exchange bandwidth, measured using one and two processor(s) per node with both blocking and nonblocking communication functions. Nonblocking communication is used throughout the original MPI solution, so its bandwidth is measured with the original message sizes, which can be found in table 2. With blocking communication, the bandwidth depends on the message size; the figure's data correspond to a fixed message of 1750 double precision numbers, for which the bandwidth is optimal or close to optimal for any number of nodes. In all cases, the blocking bandwidth is higher than the nonblocking bandwidth.

Figure 4  Data exchange bandwidth with blocking and nonblocking communication (Mb/s versus number of nodes, for 1 PE and 2 PEs per node)
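As an aside, inter-node bandwidth of the kind plotted in figure 4 is commonly obtained with a simple ping-pong test. The sketch below, with hypothetical names, shows one way to measure the blocking case using MPI_Wtime; it is not the measurement code used by the authors.

      subroutine pingpong(len, reps, mbps)
c     Hypothetical ping-pong sketch: rank 0 sends a message of len
c     doubles to rank 1 and waits for it to come back, reps times;
c     mbps returns the resulting bandwidth in Mbit/s (meaningful on
c     ranks 0 and 1 only).
      implicit none
      include 'mpif.h'
      integer len, reps, rank, ierr, r
      integer status(MPI_STATUS_SIZE)
      double precision buf(len), t0, t1, mbps

      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      do r = 1, len
         buf(r) = 0.d0
      enddo
      t0 = mpi_wtime()
      do r = 1, reps
         if (rank .eq. 0) then
            call mpi_send(buf, len, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, ierr)
            call mpi_recv(buf, len, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, status, ierr)
         else if (rank .eq. 1) then
            call mpi_recv(buf, len, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, status, ierr)
            call mpi_send(buf, len, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, ierr)
         endif
      enddo
      t1 = mpi_wtime()
c     2*reps messages of 8*len bytes travel during (t1-t0) seconds
      mbps = (2.d0*reps*8.d0*len*8.d0) / ((t1-t0)*1.d6)
      end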

4.1.6 The Super-linear Speedup Feature

Figure 8 shows the results of the original MPI CG benchmark executed on our platform. A sudden performance jump for two processes compared with one process is remarkable; results with the same feature can also be found in other studies [10]. Table 3 shows this super-linear speedup.

Table 3  Original CG speedup

  num_procs   speedup
  2           4.10
  4           5.52
  8           10.68

4.2 Computation Optimization

4.2.1 The Matrix-vector Multiplication

To explain the super-linear speedup, we analyze the CG performance improvement when the number of processors increases from one to two. Since inter-node communication does not exist in either case, the 4.1-time speedup must result from computation activities, of which the matrix-vector multiplication occupies more than 90%. We can therefore focus on the source code shown in figure 3, which performs the operation w=Ap.

4.2.2 The Effect of p Size

In CG, the number of columns of the local sub-matrix A equals the length of the local vector p.

Table 4  Data sizes for CG class B with 1 and 2 process(es)

  Data item   1 process      2 processes
  Matrix A    75000*75000    75000*37500
  Vector p    75000          37500

Looking at table 4, we see that the p size is halved with two processes. During a matrix-vector multiplication, each element of the matrix A is accessed only once, whereas the elements of the vector p are accessed repeatedly after a certain period of time. If that period is small enough, the required element of p still remains in the cache and we have a cache hit; otherwise a cache miss and, consequently, a main memory access occur. The length of such a period is determined by the p size. We therefore suppose that the p size affects the cache hit ratio and causes the super-linear speedup.

To confirm the role of the p size, we simulate the w=Ap operation on a single processor with different values of p size, where the A size is adjusted accordingly so that all cases have the same computation volume. The operation is executed repeatedly with the same number of iterations as the real CG benchmark, i.e. 75*26 times for class B. The results of the simulation, shown in figure 5, completely confirm our supposition: a p size of 37500 is about 4.3 times quicker than a p size of 75000.

We also carried out a test forcing both processors of a node to execute the same code on the same data simultaneously, i.e. doubling the computation volume. The test was done with both MPI and OpenMP implementations. The execution time also doubled, implying that no speedup is obtained with either MPI or OpenMP. The result can be explained by the memory bus bottleneck of the platform: in CG's matrix-vector multiplication, each element of matrix A is accessed only once, with a very small probability of a cache hit. This causes extremely busy access to the main memory bus, which jams the computation flow no matter how many processors are in use.

Figure 5  w=Ap execution time with a single process (seconds versus p size)

4.2.3 L2 and TLB Misses

Figure 6 shows the number of L2 and TLB misses of the multiplication. In reality the TLB miss penalty is not constant; for simplicity, however, we assume that it is invariable and equal to an L2 miss penalty, so that the two values can be added. The summary line has a shape similar to the time line shown in figure 5. The small differences between the two figures can be explained by other factors such as the total number of executed instructions and the L1 hit ratio.

Figure 6  L2 and TLB misses in the w=Ap operation (misses *10^6 versus p size)

4.2.4 The w=Ap Optimization

As discussed in 4.2.2, reducing the p size improves performance. We can reduce the p size by splitting the vector p into small parts, p1 to pn; accordingly, the matrix A is divided into A1 to An. The result vector w is then formed from the partial vectors wi, where wi=Ai*pi. The operations wi=Ai*pi now work with a smaller pi size, so they incur a smaller cache miss ratio and gain a performance improvement.

Figure 7  w=Ap data splitting (p split into p1, p2, p3 and A into A1, A2, A3)

Based on the graph shown in figure 5, we selected a size of 25000 for the pi, because it lies within the range where the w=Ap operation obtains optimal results.
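A minimal sketch of this splitting is given below. It assumes, purely for illustration, that each column block Ab has been stored as its own CSR structure with block-local column indices; the names and the storage layout are hypothetical and do not reproduce the benchmark's actual data structures.

      subroutine w_eq_ap_split(nrows, nblk, pblk, maxnz,
     &                         rowstr, colidx, a, p, w)
c     Hypothetical sketch of the w=Ap splitting of section 4.2.4.
c     Column block b of A is assumed to be stored as its own CSR
c     structure (rowstr(:,b), colidx(:,b), a(:,b)) with column
c     indices local to the block, so that each pass over A_b reuses
c     only a pblk-long slice of p (about 25000 elements here).
      implicit none
      integer nrows, nblk, pblk, maxnz
      integer rowstr(nrows+1,nblk), colidx(maxnz,nblk)
      double precision a(maxnz,nblk), p(*), w(nrows)
      integer b, j, k, off
      double precision sum

      do j = 1, nrows
         w(j) = 0.d0
      enddo
c     w accumulates the partial products A_1*p_1 + ... + A_n*p_n
      do b = 1, nblk
         off = (b-1)*pblk
         do j = 1, nrows
            sum = 0.d0
            do k = rowstr(j,b), rowstr(j+1,b)-1
               sum = sum + a(k,b)*p(off+colidx(k,b))
            enddo
            w(j) = w(j) + sum
         enddo
      enddo
      end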

4.2.5 Computation Optimization Result

Figure 8 compares the MPI CG performance before and after optimization. The gains come from both computation and communication optimization. With a single process there is no communication, so the improvement is due to the computation amendment alone; at this point the optimized MPI CG outperforms the original by more than 4.5 times. In the optimized version, adding a second processor from the same node does not improve performance; the reason is the main memory bus bottleneck, which was also discussed in 4.2.2. Figure 5 also implies that the computation optimization is meaningful only if the original p size is greater than 30000, i.e. npcols does not exceed 2, or nprocs does not exceed 4. With nprocs equal to 2 or 4, the original p size is 37500, and the optimized computation part outperforms the original one by about 30%. The overall performance, however, must also include the communication part.

4.3 Communication Optimization

4.3.1 Nonblocking and Blocking Communication

The original MPI CG applies a nonblocking communication method. However, as illustrated in figure 4, nonblocking communication is remarkably slower than blocking communication on our platform, so performing all communication sessions with blocking functions is beneficial. One problem is the message size limitation of the blocking mode: in our MPICH implementation, it is impossible to send or receive a message longer than 12000 double precision numbers in blocking mode. We solve this problem by splitting a long message into smaller ones of 1750 double precision numbers each.

4.3.2 Communication Optimization Result

The efficiency of blocking communication can be seen in figure 8. At the points where nprocs exceeds 4, all the improvement is caused by the communication optimization. In particular, overall MPI improvements of 17.7% and 26.9% were recorded with 8 and 16 processes (running on 4 and 8 nodes).
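One possible shape of the message splitting described in 4.3.1 is sketched below, with hypothetical names; the actual CG code exchanges data with its grid partners rather than through this helper. Each long exchange becomes a sequence of blocking send/receive pairs of at most chunk double precision numbers, 1750 in the paper.

      subroutine exchange_blocked(sbuf, rbuf, len, partner, chunk)
c     Hypothetical sketch: exchange len doubles with a partner as a
c     sequence of blocking transfers of at most chunk doubles each
c     (1750 in the paper, where the measured bandwidth is optimal).
      implicit none
      include 'mpif.h'
      integer len, partner, chunk, ierr
      integer status(MPI_STATUS_SIZE)
      double precision sbuf(len), rbuf(len)
      integer pos, m

      pos = 1
      do while (pos .le. len)
         m = min(chunk, len - pos + 1)
         call mpi_sendrecv(sbuf(pos), m, MPI_DOUBLE_PRECISION,
     &                     partner, 0,
     &                     rbuf(pos), m, MPI_DOUBLE_PRECISION,
     &                     partner, 0,
     &                     MPI_COMM_WORLD, status, ierr)
         pos = pos + m
      enddo
      end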

Figure 8  MPI CG performance before and after computation and communication optimization (performance versus number of PEs, 1 to 16, original and optimized)

4.4 Hybrid TC with CG

4.4.1 The q=Ap Part of CG

In CG, the most time-consuming task is to compute the vector q=Ap, where A and p are global data. The global A and p can be formed by merging all the local sub-matrices A and sub-vectors p. CG finds q by the following process:

+ Computation: processes carry out the operation w=Ap, which was discussed and optimized above.
+ Communication: processes exchange w and summarize the local results to find q.

This process is looped 75*26 times for class B and takes about 95% of the overall execution time on our platform. All the task-schedule adjustments are applied to this process.

4.4.2 CG Task-schedule Adjustment

Based on the optimized MPI source code, we build the hybrid PC and hybrid TC solutions. The hybrid PC code can simply be generated by adding OpenMP pragmas to the MPI computation parts. For hybrid TC, an adjustment of the task schedule is necessary. The original and adjusted task schedules for the q=Ap loop are shown in figure 9. Circles with letters inside represent tasks: tasks Ci and Ei compute and exchange a partial vector wi, respectively. Grey-background circles represent tasks containing communication, which must be executed by one thread only. The dotted lines between the end of Ci and the beginning of Ei indicate that Ei must wait until Ci finishes; such a dependency can be implemented with a status flag. Pseudo-code for the new task schedule is given in figure 10. With the new task schedule, hybrid TC can perform most of the communication simultaneously with computation.

Figure 9  MPI (a) and hybrid TC (b) task schedules for CG

      flag=0
!$omp parallel private (i)
!$omp single
      do i=1, n
         wait until flag=i
         q(i)=E(i)        ! Exchanging w(i)
      enddo
!$omp end single nowait
!$omp do
      do i=1, n
         C(i)             ! Computing w(i)
         flag=i
      enddo
!$omp end do
!$omp end parallel

Figure 10  CG's q=Ap part, pseudo-code for hybrid TC

4.4.3 Hybrid TC Efficiency

With the MPI model, the experiments can use one or two processor(s) per node; for short, we call these variants MPI one PE and MPI two PEs. Figure 11 shows the relative performance of hybrid TC and hybrid PC over the already-optimized MPI as a percentage; a value greater than 100% indicates an advantage. The MPI performance is taken as the better of MPI one PE and MPI two PEs. These two variants differ only in communication, because the computation performance is equivalent whether one or two processor(s) per node are used. For the same reason, hybrid PC gains the same performance as MPI one PE. At 2 and 8 nodes, MPI one PE outperforms MPI two PEs, so hybrid PC obtains an efficiency of exactly 100%. At 4 nodes, hybrid PC is weaker than MPI because MPI two PEs outperforms MPI one PE. Hybrid TC demonstrates an advantage over not only hybrid PC but also MPI in all cases, thanks to the improved task schedule: at any moment there is a running computation task that keeps the bottlenecked memory bus busy, whereas the MPI solution leaves the memory bus partly idle during communication phases.

Figure 11  CG, efficiency of hybrid TC and hybrid PC over MPI (2, 4, and 8 nodes)

5. HPL Optimization

5.1 Problem Description

HPL solves a random dense linear equation system using a block LU decomposition algorithm [11]. The heaviest computation part is a matrix-matrix multiplication, which is executed by calling already-optimized functions from the BLAS library. With such a computation pattern, MPI two PEs always significantly outperforms MPI one PE, so all the results in this section are obtained with MPI two PEs. In addition, we do not attempt computation optimization with HPL; our focus is to apply and improve the task schedule we developed for hybrid TC in a previous study [1] and to confirm its efficiency on the current platform.

5.2 Process Grid

As in CG, the processes in HPL are mapped onto a P*Q grid. With the original MPI, users can freely choose P and Q and search for the optimal pair. With the hybrid models, P and Q are selected so that their sum is as small as possible; for example, 8 nodes use a 2*4 grid rather than 1*8. Such a grid yields a minimal communication volume.

5.3 Hybrid TC Task-schedule

Based on the MPI algorithm of HPL, we reconstructed the iterative loops to fit the hybrid TC model. The new task schedule is illustrated in figure 12. The task groups containing communication (1-5-6 and 10-11) are executed simultaneously with the computation task groups (2-3 and 4, respectively). A hybrid model with such a task schedule already demonstrated a performance improvement of about 10% over the MPI solution on another platform, a cluster of Sun Enterprise 3500 SMPs connected via a 100Mbps Fast Ethernet network [1].

Figure 12  HPL, a task schedule for hybrid TC: (a) data blocks D, U0, U, L0, L, A, P; (b) task list; (c) task schedule, grouping the tasks as 2-3, 1-5-6, 7-8-9, 10-11, 4, and 12

  Index   Description
  1       D = D - L0*U0
  2       U = U - L0*U0
  3       L = L - L0*U0
  4       A = A - L0*U0
  5       Broadcast(D)
  6       Decom(D)
  7       Pivot(P)
  8       U = Solve(DU=U)
  9       L = Solve(LD=L)
  10      Broadcast(U)
  11      Broadcast(L)
  12      InitializeNextLoop
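The overlap of a communication-containing group with the computation groups could, in a very simplified form, be expressed with OpenMP sections as below. This is only a hedged illustration: the names and the dummy operations are placeholders, the ordering inside the groups is a guess, and the paper's actual schedule handles the dependencies of groups 7-8-9 and 10-11 with status flags rather than a plain barrier.

      subroutine hpl_tc_iteration(d, nd, a, na, root)
c     Simplified, hypothetical sketch of one iteration of the schedule
c     in figure 12(c): one OpenMP section handles the group containing
c     communication (work on the panel D, then broadcast it), while
c     the other section updates the trailing data meanwhile.
      implicit none
      include 'mpif.h'
      integer nd, na, root, ierr, i
      double precision d(nd), a(na)

!$omp parallel private(i)
!$omp sections
!$omp section
c     Group 1-5-6: panel work, then broadcast (MPI is called by this
c     one thread only)
      do i = 1, nd
         d(i) = d(i) - 1.0d0
      enddo
      call mpi_bcast(d, nd, MPI_DOUBLE_PRECISION, root,
     &               MPI_COMM_WORLD, ierr)
!$omp section
c     Groups 2-3 and 4: update the remaining blocks in the meantime
      do i = 1, na
         a(i) = a(i) - 1.0d0
      enddo
!$omp end sections
c     Implicit barrier here; tasks 7-12 of the task list would follow
!$omp end parallel
      end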

5.4 Communication Pattern

Unlike CG, HPL processes communicate by broadcasting the sub-matrices L and U along the process columns and rows. In the kth loop, the owner process of block D needs to send (1/P) of L to the (Q-1) processes of its process column and (1/Q) of U to the (P-1) processes of its process row. Since size(U)=size(L) at every iteration and the total size of U (or L) over the whole solution is A/2, the total amount of data sent is A/2 * ((P-1)/Q + (Q-1)/P). For example, a 2*3 grid gives (1/2) * (1/3 + 2/2) = 0.667. As the number of nodes varies from 2 to 8, we obtain the process grids and communication costs shown in table 5, where the unit of measurement for the communication cost is the size of A.

Table 5  Communication costs for HPL

  Nodes   Grid   Communication cost
  2       1*2    0.5
  3       1*3    1
  4       2*2    0.5
  5       1*5    2
  6       2*3    0.667
  7       1*7    3
  8       2*4    1

5.5 Performance with the Hybrid Models

Figure 13 illustrates the experimental results achieved with 8 nodes. All the models reach their best performance with a 20000 by 20000 matrix; at that point, hybrid TC outperforms MPI by about 12%, and with smaller data sizes the difference is even bigger.

Figure 13  HPL performance with 8 nodes (GFlops versus matrix size, 16000 to 20000)

In figure 14, we fix the matrix size and change the number of nodes. The hybrid models show good results at the points where the P*Q grid is well balanced. With 7 nodes there is no such balance between P and Q, so the performance is fairly weak, even worse than with 6 nodes; this can be explained by the communication costs presented in table 5. MPI does not suffer from this problem, because it always has a fairly well-balanced P*Q grid, where P is the number of nodes and Q is the number of processors per node. In both figures, hybrid PC shows poor performance, even slower than the MPI model, because of its intra-node synchronization time.

Figure 14  HPL performance with a 20000-row matrix (GFlops versus number of nodes, 2 to 8)

6. Conclusions

On a cluster of SMP PCs, hybrid TC has the potential to outperform MPI. The key factor is a suitable task schedule in which tasks can be assigned to the non-communicating processors during communication sessions. Applying such a task schedule on an 8-node cluster of dual Intel Xeon processor SMPs, hybrid TC outperforms the MPI model by 27% for CG and 12% for HPL. With CG and HPL as examples, the study also demonstrated the simplicity of building such a task schedule from an already existing MPI solution. We also improved CG performance by as much as 4.5 times at some points through simple adjustments of the data pattern and the communication method. This implies that, for a given platform, an appropriate data pattern and communication method greatly affect the overall performance; they should be tuning parameters not only for benchmarks but for any high-performance application. Our hybrid TC solution for HPL can still be optimized further at the poorly balanced points by applying a better data broadcasting algorithm, which can be found in [11].

References

[1] T. Q. Viet, T. Yoshinaga, B. A. Abderazek and M. Sowa, "A Hybrid MPI-OpenMP Solution for a Linear System on a Cluster of SMPs", Proc. Symposium on Advanced Computing Systems and Infrastructures, pp.299-306, 2003.
[2] T. Boku, I. Itakura, S. Yoshikawa, M. Kondo and M. Sato, "Performance Analysis of PC-CLUMP based on SMP-Bus Utilization", Proc. WCBC'00 (Workshop on Cluster Based Computing 2000), Santa Fe, May 2000.
[3] "MPICH, a portable MPI implementation", http://www-unix.mcs.anl.gov/mpi/mpich/
[4] "Automatically Tuned Linear Algebra Software (ATLAS)", http://math-atlas.sourceforge.net/
[5] "PAPI", http://icl.cs.utk.edu/papi/
[6] F. Cappello, O. Richard and D. Etiemble, "Investigating the Performance of Two Programming Models for Clusters of SMP PCs", Proc. High Performance Computer Architecture (HPCA-6), pp.349-359, 2000.
[7] F. Cappello and D. Etiemble, "MPI versus MPI+OpenMP on IBM SP for the NAS Benchmark", Proc. Supercomputing 2000 (SC2000), Dallas, November 2000.
[8] "The NAS Parallel Benchmark", http://www.nas.nasa.gov/Software/NPB/
[9] F. Cappello and O. Richard, "Intra Node Parallelization of MPI Programs with OpenMP", http://www.lri.fr/~fci/goinfreWWW/1196.ps.gz
[10] http://www.riam.kyushu-u.ac.jp/sanny/activity/member/kitazawa/2002/kyoto_NPB.html
[11] "HPL algorithm", http://www.netlib.org/benchmark/hpl/algorithm.html