High-Performance Incremental Scheduling on Massively Parallel Computers — A Global Approach

Min-You Wu and Wei Shu
Department of Computer Science
State University of New York at Buffalo
Buffalo, NY 14260
{wu,shu}@cs.buffalo.edu

ABSTRACT — Runtime incremental parallel scheduling (RIPS) is a new approach to load balancing. In parallel scheduling, all processors cooperate to balance the workload, and the load is balanced accurately by using global load information. In incremental scheduling, the system scheduling activity alternates with the underlying computation work. RIPS produces high-quality load balancing and adapts to applications with nonuniform structures. This paper presents methods for scheduling a single job on a dedicated parallel machine.

1 Introduction

There are two basic scheduling strategies: static scheduling and dynamic scheduling. Static scheduling distributes the workload at compile time, whereas dynamic scheduling performs scheduling activities at runtime.

Static scheduling applies to problems with a predictable structure, which are called static problems; this class includes Gaussian elimination, FFT, etc. Static scheduling utilizes knowledge of problem characteristics to reach a globally optimal, or near-optimal, solution [34, 13, 35, 7]. The quality of static scheduling relies heavily on the accuracy of weight estimation. Its scalability is also restricted, because a large memory space is required to store the task graph. In addition, it is not able to balance the load for problems with an unpredictable structure.

Dynamic scheduling applies to problems with an unpredictable structure, which are called dynamic problems. This class of problems includes multi-grid matrix operations, chess, N-Queens,

and many divide-and-conquer algorithms. Dynamic scheduling has certain advantages. It is a general approach suitable for a wide range of applications, and it can adjust the load distribution based on runtime system load information [10, 11, 23]. However, most runtime scheduling algorithms, when making a load balancing decision, utilize neither problem characteristics nor global load information. Efforts to collect load information for scheduling decisions inevitably compete for resources with the underlying computation at runtime, and a dynamic system that tries to balance the load both quickly and accurately can become unstable.

It is possible to design a scheduling strategy that combines the advantages of static and dynamic scheduling: one that generates a well-balanced load without incurring large overhead. With advanced parallel scheduling techniques, this ideal scheduling becomes feasible. In parallel scheduling, all processors cooperate to schedule work. Parallel scheduling is stable because of its synchronous operation. It utilizes global load information and is able to balance the load accurately. As an alternative to the commonly used static and dynamic scheduling, parallel scheduling can provide high-quality load balancing. A general description of parallel scheduling can be found in [33]. The basic idea behind parallel scheduling is that instead of identifying one task to be scheduled at a time, we identify a set of tasks that can be scheduled in parallel.

Parallel scheduling can also be applied incrementally to adapt to dynamic problems. When parallel scheduling is applied at runtime, it becomes incremental collective scheduling: it is applied whenever the load becomes unbalanced, and all processors collectively schedule the workload. In this paper, we propose a new method, called Runtime Incremental Parallel Scheduling (RIPS). RIPS is a runtime version of global parallel scheduling, in which the system scheduling activity alternates with the underlying computation work. The RIPS system paradigm is shown in Figure 1. A RIPS system starts with a system phase which schedules the initial tasks. It is followed by a user computation phase that executes the scheduled tasks and possibly generates new tasks. In the second system phase, the old tasks that have not been executed will be scheduled together with the newly

Figure 1: Runtime Incremental Parallel Scheduling (RIPS). (Flowchart: Start, then alternating SYSTEM PHASE (collect load information, task scheduling) and USER PHASE (task execution), until Terminate.)

generated tasks. This process repeats until the entire computation is completed. Note that we assume the Single Program Multiple Data (SPMD) programming model is used; therefore, we rely on a uniform code image accessible at each processor. RIPS can be used for a single job on a dedicated machine or in a multiprogramming environment, and it can be applied to both shared-memory and distributed-memory machines. Algorithms for scheduling a single job on a dedicated distributed-memory machine are described in this paper.

Section 2 is devoted to the issues of incremental scheduling and Section 3 to the parallel scheduling algorithms. Experimental study and comparisons are presented in Section 4, and previous works are reviewed in Section 5.

2 Incremental Scheduling

RIPS consists of two major components: incremental scheduling and parallel scheduling. The incremental scheduling policy decides when to transfer from a user phase to a system phase and which tasks are selected for scheduling. The parallel scheduling algorithm is applied in the system phase to collect system load information and to balance the load, which will be discussed in

the next section.

The transfer policy in RIPS includes two sub-policies: a local policy and a global policy. Based on its local condition, each individual processor determines whether it is ready to transfer to the next system phase. Then all processors cooperate to determine the transfer from the user phase to the system phase, based on the global condition.

We consider two local policies: eager scheduling and lazy scheduling. With eager scheduling, every task must be scheduled before it can be executed. With lazy scheduling, scheduling is postponed for as long as possible, so some tasks can be executed directly without ever being scheduled.

Eager scheduling is implemented with two queues in each processor: a ready-to-execute (RTE) queue and a ready-to-schedule (RTS) queue. At the beginning of a user phase, all the RTS queues in the system are empty and the RTE queue of every processor holds an almost equal number of tasks ready to execute. During the user phase, new tasks may be generated and entered into the local RTS queue, while the tasks in the RTE queue are consumed, as shown in Figure 2(a). When the RTE queue is empty, the processor is ready to transfer from the user phase to the next system phase. At the transfer, some of the RTE queues may be empty while others still have tasks left, because consumption rates differ due to unequal task grain sizes. At the beginning of the system phase, any tasks left in the RTE queues are moved back to the RTS queues and rescheduled together with the newly generated tasks. The system phase schedules the tasks in all RTS queues and distributes them evenly to the RTE queues; a task in an RTS queue enters the local RTE queue or a remote RTE queue depending on the scheduling result.

Lazy scheduling uses only a single queue, the RTE queue, to hold all tasks, as shown in Figure 2(b). Tasks scheduled to the processor and tasks generated at the processor are not distinguished: newly generated tasks enter the RTE queue directly, and some tasks may be generated and executed on the same processor without ever being scheduled. The transfer condition from a user phase to the next system phase is the same as for eager scheduling, that is, when the

Figure 2: RTE and RTS Queues. (a) Two-queue policy: tasks to be consumed are taken from the RTE queue, while newly generated tasks enter the RTS queue. (b) One-queue policy: a single RTE queue both supplies the tasks to be consumed and receives the newly generated tasks.

RTE queue becomes empty. In this way, only a fraction of the tasks are scheduled and the total number of system phases can be reduced.

Two possible global policies are ALL and ANY. The ALL policy states that the transfer from a user phase to the next system phase is initiated only when all processors satisfy their local conditions. With the ANY policy, as soon as one processor has met its local condition, the transfer can be initiated.

To test whether a transfer condition is satisfied, a naive implementation periodically invokes a global reduction operation. If the condition is satisfied, the system switches from the user phase to the next system phase; otherwise, it continues in the user phase. The time interval between two consecutive global reduction operations should be chosen carefully: an interval that is too short increases communication overhead, and an interval that is too long may cause unnecessary processor idle time. The optimal interval length must be determined by empirical study.

Although periodic reduction is a simple and general implementation, it may interfere with the underlying computation many times before the condition is satisfied. This overhead can be eliminated for some policies. The following method can be used for the ALL policy: a processor sends a ready signal to its parent

when its local condition is satisfied and a ready signal has been received from each of its children. When the root processor satisfies its local condition and has received a ready signal from each of its children, the global ALL condition has been reached; the root processor then broadcasts an init signal to all other processors to start the system phase. Some processors can be idle for a while until every processor has finished all tasks in its RTE queue.

For the ANY policy, an alternative implementation allows any processor that satisfies the local condition to become an initiator and broadcast an init signal to all other processors. A processor, upon receiving the init signal, switches from the user phase to the next system phase. Because of communication delay, more than one processor could claim to be an initiator, so a processor may receive more than one init signal. A phase index variable is used to eliminate redundant init signals: each init signal is tagged with a phase index, and all init signals carrying the same phase index, except the one received first, are considered redundant. Some machines provide a fast or-barrier synchronization, such as the eureka mode on the Cray T3D [16], which can be used to implement the ANY policy. Note that when an idle processor initiates a phase transfer, other processors may still be executing tasks; the idle processor must wait until every processor finishes its current task execution. The ANY-Lazy combination has been shown to be the best of all four combinations [24].

Transfer from the system phase to the user phase does not require synchronization. Each processor finishes the system phase by itself and proceeds to the next user phase.
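To make the phase alternation concrete, the sketch below shows how one processor's main loop might look under the ANY-Lazy policy. This is our own schematic, not the authors' implementation; the communication object `comm` and all of its methods (init_received, broadcast_init, wait_until_all_paused, parallel_schedule, global_task_count) are assumed interfaces.

```python
# Schematic sketch (not the authors' code) of one processor's RIPS main loop
# under the ANY-Lazy policy. The `comm` object and its methods are assumed
# interfaces, not a real library.
from collections import deque

def rips_processor_loop(initial_tasks, comm):
    rte = deque(initial_tasks)                   # lazy policy: a single RTE queue
    while True:
        # ---- user phase: execute tasks; newly generated tasks join the queue ----
        while rte and not comm.init_received():
            task = rte.popleft()
            rte.extend(task.execute())           # execute() may return new tasks
        # Local condition met (empty RTE queue): this processor may initiate.
        if not rte and not comm.init_received():
            comm.broadcast_init()                # ANY policy: an idle processor initiates
        comm.wait_until_all_paused()             # others finish their current task first
        # ---- system phase: all processors cooperatively reschedule ----
        rte = deque(comm.parallel_schedule(list(rte)))   # e.g., MWA from Section 3
        if comm.global_task_count() == 0:
            return                               # entire computation has finished
```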

3 Parallel Scheduling

The goal of scheduling is to reach the same workload on each processor. To achieve this, it is necessary to estimate task execution times, which can be done either by the programmer or by the compiler. Such estimates are sometimes application-specific, leading to a less general approach, and sometimes difficult to obtain at all. Because of these difficulties, each task is presumed to require equal execution time, and the goal of the algorithm is to schedule tasks so that each processor has the same

number of tasks. The inaccuracy due to grain-size variation can be corrected in the next system phase.

Another goal of the algorithm is to minimize the communication overhead of the load balancing process. Ideally, tasks with lower communication/computation ratios should have a higher priority to migrate. Although the communication/computation ratio is difficult to predict, it has been observed that it does not change substantially within a single application. Thus, we can use the number of migrated tasks, instead of the actual communication cost of migrating them, as the objective function.

In a parallel system, $N$ computing nodes are connected by a given topology. Each node $i$ has $w_i$ tasks when parallel scheduling is applied. A scheduling algorithm redistributes tasks so that the number of tasks in each node is equal. Assume the sum of the $w_i$ over all nodes is evenly divisible by $N$. The average number of tasks is
$$w_{avg} = \frac{1}{N} \sum_{i=0}^{N-1} w_i.$$
Each node should have $w_{avg}$ tasks after executing the scheduling algorithm. If $w_i > w_{avg}$, the node must determine where to send its surplus tasks.

Many methods can be used to achieve a balanced load. An optimal scheduling minimizes the number of tasks transferred between nodes; the objective is to minimize $\sum_k e_k$, where $e_k$ is the number of tasks transferred through edge $k$. In general, this problem can be converted to the minimum-cost maximum-flow problem [18] as follows. Each edge is given a tuple (capacity, cost), where capacity is the capacity of the edge and cost is the cost of the edge. Set capacity $= \infty$ and cost $= 1$ for all edges. Then, add a source node $s$ with an edge $(s, i)$ to each node $i$ with $w_i > w_{avg}$, and a sink node $t$ with an edge $(j, t)$ from each node $j$ with $w_j < w_{avg}$. Set $capacity_{si} = w_i - w_{avg}$ and $cost_{si} = 0$ for all such $i$, and $capacity_{jt} = w_{avg} - w_j$ and $cost_{jt} = 0$ for all such $j$. A minimum-cost integral flow yields a solution to the problem. The complexity of the minimum-cost flow algorithm is $O(n^2 v)$, where $n$ is the number of nodes and $v$ is the desired flow value [18]. This high complexity is not practical for runtime scheduling.
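As an illustration of this formulation (our sketch, not part of the paper), the following code builds the flow network with the networkx library and returns how many tasks should cross each link. The function name, the example mesh, and the task counts are illustrative assumptions.

```python
# Sketch (ours, not the paper's) of the minimum-cost maximum-flow formulation
# described above, using the networkx library.
import networkx as nx

def min_migration_flow(links, w):
    """links: undirected machine links (i, j); w[i]: tasks at node i."""
    N = len(w)
    assert sum(w) % N == 0, "the sketch assumes an evenly divisible total load"
    w_avg = sum(w) // N

    G = nx.DiGraph()
    for i, j in links:                       # every link: unbounded capacity, cost 1
        G.add_edge(i, j, weight=1)           # missing 'capacity' attr means infinite
        G.add_edge(j, i, weight=1)
    for i in range(N):                       # source feeds surpluses, sink drains deficits
        if w[i] > w_avg:
            G.add_edge('s', i, capacity=w[i] - w_avg, weight=0)
        elif w[i] < w_avg:
            G.add_edge(i, 't', capacity=w_avg - w[i], weight=0)

    # flow[u][v] = number of tasks to ship across link (u, v)
    return nx.max_flow_min_cost(G, 's', 't')

# A 2x2 mesh with loads 6, 2, 4, 4 (average 4): node 0 ships 2 tasks to node 1.
print(min_migration_flow([(0, 1), (2, 3), (0, 2), (1, 3)], [6, 2, 4, 4]))
```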

For certain topologies, such as trees, the complexity can be reduced to $O(\log n)$ on $n$ processors [25]. For other topologies, a heuristic algorithm is needed, and a good heuristic can be designed by utilizing global load information. Here we present a new parallel scheduling algorithm for the mesh topology, called the Mesh Walking Algorithm (MWA), shown in Figure 3.

Assume an $n_1 \times n_2$ mesh, and let $w_{i,j}$ be the number of tasks in node $(i,j)$ before the algorithm is applied. The first step scans the partial vector $w$ along each row $i$, so that each node $(i,j)$ records the vector $w_{i,k}$, $k = 0, \dots, j$; that is, each node stores the weight information of the nodes in the same row whose column number is less than or equal to $j$. At step 2, each node in column $n_2 - 1$ calculates the row sum $s_i = \sum_{k=0}^{n_2-1} w_{i,k}$, and $t_i = \sum_{k=0}^{i} s_k$ is calculated by a scan-with-sum operation along these nodes. The value $T = t_{n_1-1}$ at node $(n_1-1, n_2-1)$ is the total number of tasks in the mesh. Node $(n_1-1, n_2-1)$ then computes the average number of tasks per node, $w_{avg}$. If the total number of tasks is not evenly divisible by the number of nodes, the remaining $R$ tasks are distributed to the first $R$ nodes so that they hold one more task than the others. The values of $w_{avg}$ and $R$ are broadcast to all nodes, and the values of $s_i$, $t_i$, and $t_{i-1}$ are spread along each row $i$. In step 3, a quota vector $q$ is calculated in each node so that each node knows whether it is overloaded or underloaded. Each node $(i,j)$ also calculates a row accumulation quota $Q_i$, which is the quota of the submesh consisting of all nodes from row 0 to row $i$. A small sketch of this quota computation is given below, before step 4 is described.

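The sketch below is our reading of steps 2 and 3, not the paper's code: the $R$ leftover tasks go to the first $R$ nodes in row-major order, so every quota is either $w_{avg}$ or $w_{avg} + 1$.

```python
# Sketch (an assumption based on the description above) of the quota
# computation in steps 2-3 of MWA for an n1 x n2 mesh.
def quotas(w):                           # w[i][j] = tasks at node (i, j)
    n1, n2 = len(w), len(w[0])
    w_avg, R = divmod(sum(map(sum, w)), n1 * n2)
    q = [[w_avg + (1 if i * n2 + j < R else 0) for j in range(n2)]
         for i in range(n1)]
    # Row accumulation quota Q[i]: the quota of the submesh of rows 0..i.
    Q = [w_avg * n2 * (i + 1) + min((i + 1) * n2, R) for i in range(n1)]
    return q, Q

q, Q = quotas([[3, 1], [0, 5]])          # 9 tasks on a 2x2 mesh: w_avg = 2, R = 1
print(q, Q)                              # [[3, 2], [2, 2]] [5, 9]
```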

Step 4 balances the load among rows. All nodes calculate the values of $y$ and $x$, where $y_i = t_i - Q_i$ and $x_i = t_{i-1} - Q_{i-1}$. If $y_i < 0$, row $i$ will receive $|y_i|$ tasks from row $i+1$; if $x_i > 0$, row $i$ will receive $x_i$ tasks from row $i-1$. If $y_i > 0$, the submesh from row 0 to row $i$ is overloaded, and $y_i$ tasks need to be sent to row $i+1$; if $x_i < 0$, the submesh from row 0 to row $i-1$ is underloaded, and $|x_i|$ tasks need to be sent to row $i-1$. Then a $\delta$ vector is calculated, giving the number of extra tasks at each node $(i,k)$, $k = 0, \dots, j$. When $y_i > 0$, an $\eta$ vector is calculated to determine how many tasks $d_{i,j}$ each node needs to send. The total number of extra tasks in row $i$, $\sum_{k=0}^{n_2-1} \delta_{i,k}$, is larger than or equal to $y_i$. In this algorithm, the first $m$ nodes (where $\sum_{k=0}^{m} \delta_{i,k} \geq y_i$ and $\sum_{k=0}^{m-1} \delta_{i,k} < y_i$) together send $y_i$ tasks to row $i+1$. The $\gamma$ vector indicates how many tasks are needed by previous nodes. When $x_i < 0$, the $u$ vector is calculated in a similar way; it indicates how many tasks at node $(i,j)$ are to be sent to row $i-1$.


Mesh Walking Algorithm (MWA)

Assume an $n_1 \times n_2$ mesh; the number of nodes is $N = n_1 \times n_2$.

1. Let $w_{i,j}$ be the number of tasks in node $(i,j)$. Scan the partial vector $w$ along each row. Each node $(i,j)$ records a vector $w_{i,k}$, $k = 0, \dots, j$.

2. Nodes $(i, n_2-1)$ compute $s_i = \sum_{k=0}^{n_2-1} w_{i,k}$. Perform a scan-with-sum $t_i = \sum_{k=0}^{i} s_k$. $T = t_{n_1-1}$ is the total number of tasks. Node $(n_1-1, n_2-1)$ computes $w_{avg} = \lfloor T/N \rfloor$ and $R = T \bmod N$. The values of $w_{avg}$ and $R$ are broadcast to all nodes. The values of $s_i$, $t_i$, and $t_{i-1}$ are spread along each row $i$.

3. Each node computes its quota $q_{i,j}$ and row accumulation quota $Q_i$:
   $q_{i,j} = w_{avg}$ if $(i \cdot n_2 + j) \geq R$, and $q_{i,j} = w_{avg} + 1$ otherwise;
   $Q_i = w_{avg} \cdot n_2 \cdot (i+1) + r_i$, where $r_i = (i+1) \cdot n_2$ if $(i+1) \cdot n_2 \leq R$, and $r_i = R$ otherwise.
   Each node that is not in row 0 also computes the value of $Q_{i-1}$.

4. All nodes $(i,j)$ compute $y_i = t_i - Q_i$, $x_0 = 0$, and $x_i = t_{i-1} - Q_{i-1}$ for $i > 0$.
   If $x_i > 0$, receive tasks and the $d$ vector from node $(i-1, j)$: $w_{i,j} = w_{i,j} + d_{i-1,j}$.
   If $y_i < 0$, receive tasks and the $u$ vector from node $(i+1, j)$: $w_{i,j} = w_{i,j} + u_{i+1,j}$.
   $\delta_{i,k} = w_{i,k} - q_{i,k}$, $k = 0, \dots, j$.
   If $y_i > 0$, initialize $\eta_{i,0} = y_i$, $\gamma_{i,0} = 0$, and compute the $d$ vector for $k = 0, 1, \dots, j$:
      $d_{i,k} = \eta_{i,k}$ if $\delta_{i,k} > \eta_{i,k} + \gamma_{i,k} > 0$;
      $d_{i,k} = \delta_{i,k} - \gamma_{i,k}$ if $\eta_{i,k} + \gamma_{i,k} \geq \delta_{i,k} > \gamma_{i,k}$;
      $d_{i,k} = 0$ otherwise;
      with $\gamma_{i,k+1} = \gamma_{i,k} - (\delta_{i,k} - d_{i,k})$, $\delta_{i,k} = \delta_{i,k} - d_{i,k}$, $\eta_{i,k+1} = \eta_{i,k} - d_{i,k}$.
   Send $d_{i,j}$ tasks and the $d$ vector to node $(i+1, j)$: $w_{i,j} = w_{i,j} - d_{i,j}$.
   If $x_i < 0$, initialize $\eta_{i,0} = |x_i|$, $\gamma_{i,0} = 0$, and compute the $u$ vector for $k = 0, 1, \dots, j$:
      $u_{i,k} = \eta_{i,k}$ if $\delta_{i,k} > \eta_{i,k} + \gamma_{i,k} > 0$;
      $u_{i,k} = \delta_{i,k} - \gamma_{i,k}$ if $\eta_{i,k} + \gamma_{i,k} \geq \delta_{i,k} > \gamma_{i,k}$;
      $u_{i,k} = 0$ otherwise;
      with $\gamma_{i,k+1} = \gamma_{i,k} - (\delta_{i,k} - u_{i,k})$, $\eta_{i,k+1} = \eta_{i,k} - u_{i,k}$.
   Send $u_{i,j}$ tasks and the $u$ vector to node $(i-1, j)$: $w_{i,j} = w_{i,j} - u_{i,j}$.

5. All nodes $(i,j)$ compute $z_{i,0} = 0$, $z_{i,j} = \sum_{k=0}^{j-1} (w_{i,k} - q_{i,k})$ for $j > 0$, and $v_{i,j} = z_{i,j} + w_{i,j} - q_{i,j}$.
   If $z_{i,j} > 0$, receive tasks from node $(i, j-1)$. If $v_{i,j} < 0$, receive tasks from node $(i, j+1)$.
   If $v_{i,j} > 0$, send $v_{i,j}$ tasks to node $(i, j+1)$. If $z_{i,j} < 0$, send $|z_{i,j}|$ tasks to node $(i, j-1)$.

Figure 3: The Mesh Walking Algorithm.

In step 5, the load is balanced within each row. The $z$ and $v$ vectors are calculated to determine the task exchange between adjacent nodes in a row.

In this algorithm, step 1 spends $n_2$ communication steps to collect load information along each row, and step 2 spends $n_1$ communication steps to collect load information across rows. The broadcasting and spreading operations spend $n_1 + n_2$ communication steps. Steps 4 and 5 spend at most $n_1$ and $n_2$ communication steps, respectively. Therefore, the total number of communication steps of this algorithm is $3(n_1 + n_2)$.

Now we prove that the MWA algorithm achieves a fully balanced load.

Theorem 1: After executing the MWA algorithm, the difference in the number of tasks between any two processors is at most one.

Proof: After executing the algorithm, the number of tasks in each processor is equal to its quota $q_{i,j}$. Since each quota is either $w_{avg}$ or $w_{avg} + 1$, the difference in the number of tasks between any two processors is at most one. $\Box$


The MWA algorithm also maximizes locality. Local tasks are tasks that are not transferred to other processors, and non-local tasks are tasks that are transferred to other processors. Maximum locality means the maximum number of local tasks and, equivalently, the minimum number of non-local tasks. In the following, we assume that the total number of tasks $T$ is evenly divisible by $N$, the number of processors; when it is not, the algorithms are nearly optimal. The following lemma gives the minimum number of non-local tasks. Here $w_j^0$ denotes the number of tasks initially held by processor $j$.

Lemma 1: To reach a balanced load, the minimum number of non-local tasks is
$$m = \sum_j (w_{avg} - w_j^0), \quad j \in \{\, i \mid N > i \geq 0 \text{ and } w_i^0 < w_{avg} \,\}.$$

Proof: Each processor $j$ with $w_j^0 < w_{avg}$ must receive $(w_{avg} - w_j^0)$ tasks from other processors to reach a balanced load. Therefore, a total of $m$ tasks must be transferred between processors. $\Box$
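A minimal illustration of the lemma (ours, not from the paper): the total deficit of the underloaded processors must move, no matter which scheduling algorithm is used.

```python
# Minimal sketch (an assumption, not the paper's code) of the Lemma 1 lower bound.
def min_nonlocal_tasks(w):
    w_avg = sum(w) // len(w)             # sketch assumes an evenly divisible load
    return sum(w_avg - wj for wj in w if wj < w_avg)

print(min_nonlocal_tasks([6, 2, 4, 4]))  # only node 1 is short, by 2 tasks -> 2
```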

The next theorem proves that the MWA algorithm maximizes locality.

Theorem 2: The number of non-local tasks produced by the MWA algorithm is
$$m = \sum_j (w_{avg} - w_j^0), \quad j \in \{\, i \mid N > i \geq 0 \text{ and } w_i^0 < w_{avg} \,\}.$$

Proof: After each step of the MWA algorithm, the number of tasks in each processor $j$ is not less than $\min(w_j^0, w_{avg})$. Over all processors, at least $\sum_j \min(w_j^0, w_{avg})$ tasks are local. Therefore, the number of non-local tasks is no more than $N \cdot w_{avg} - \sum_j \min(w_j^0, w_{avg})$, which is equal to $m$. From Lemma 1, there are at least $m$ non-local tasks. Therefore, there are exactly $m$ non-local tasks in this algorithm. $\Box$

MWA is a heuristic algorithm and in general does not minimize the communication cost. However, for a system with two or four processors, it does minimize the communication cost $\sum_k e_k$.

Lemma 2: The MWA algorithm minimizes the communication cost in a system with no more than four processors.

Proof: The communication cost is minimized if there is no negative cycle [18]. In a system of two processors, there is no negative cycle. In a system of four processors, only a path consisting of at least three edges can form a negative cycle; with the MWA algorithm, the longest path has two edges. Therefore, there is no negative cycle. $\Box$

4 Experimental Study

First, we present a performance study of the MWA algorithm. For this purpose, we consider a test set of load distributions in which the load at each processor is randomly generated, with the mean equal to a specified average number of tasks; the average number of tasks (average weight) per processor varies from 2 to 100. The communication cost of MWA relative to the optimal algorithm is measured by the normalized cost
$$\frac{C_{MWA} - C_{OPT}}{C_{OPT}},$$
where $C_{MWA}$ and $C_{OPT}$ are the numbers of task transfers of the MWA and optimal algorithms, respectively. As mentioned in Lemma 2, the number of task transfers of MWA on two or four processors is the minimum.
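For example, with illustrative numbers of our own (not taken from the measurements): if the optimal algorithm transfers 100 tasks and MWA transfers 105, the normalized cost is
$$\frac{105 - 100}{100} = 5\%.$$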

Figure 4: Normalized Communication Cost of MWA, plotted against the average weight (2 to 100). (a) 8, 16, and 32 processors; (b) 64, 128, and 256 processors.

Figure 4 shows the normalized communication costs on 8 to 256 processors. The mesh organization is either $M \times M$ or $M \times M/2$. Each data point is the average of 100 test cases. For small meshes, MWA provides a nearly optimal result; however, the cost increases when the number of processors is large, and the algorithm can be further improved.

A RIPS system using the MWA algorithm has been implemented on a Paragon machine and tested with three application problems. The first, an exhaustive search for the N-Queens problem, has an irregular and dynamic structure: the number of tasks generated and the amount of computation in each task are unpredictable. The second, iterative deepening A* (IDA*) search, is a good example of parallel search techniques [17]. The sample problem is the 15-puzzle with three different configurations. The grain size may vary substantially, since it depends dynamically on the currently estimated cost; moreover, synchronization at each iteration reduces the effective parallelism, so the performance of this problem is not as good as that of the others. The third, a molecular dynamics program named GROMOS, is a real application [29, 28]. The test data for GROMOS is the bovine superoxide dismutase molecule (SOD), which has 6968 atoms [22]. The cutoff radius is predefined to 8 Å, 12 Å, and 16 Å. GROMOS has a more predictable structure: the number of processes is known for the given input data, but the computation density in each process varies, so a load balancing mechanism is still necessary.

In Table I, we compare RIPS to three other dynamic load balancing strategies: random allocation, the gradient model, and RID. The performance of RIPS shown in Table I is obtained with the ANY-Lazy policy. The comparison is made in terms of (1) the number of tasks sent to other processors, which is a measure of locality; (2) the overhead time $T_h$, which includes all system overhead; (3) the idle time $T_i$, which is a measure of load imbalance; (4) the execution time $T$; and (5) the efficiency $\mu$, defined as $\mu = T_s / (T_p \cdot N)$, where $N$ is the number of processors, $T_s$ is the sequential execution time, and $T_p$ is the parallel execution time.

The randomized allocation, although its locality is not good, can balance the load fairly well. The gradient model does not show good performance for the N-Queens problem; however, it performs fairly well on the less irregular, highly parallel GROMOS program.

Table I: Comparison of Scheduling Algorithms on 32 Processors

| Problem | Algorithm | # of tasks | # of non-local tasks | Th | Ti | T | µ |
|---|---|---|---|---|---|---|---|
| Exhaustive search 13-Queens | Random | 7579 | 7342 | 0.12 | 0.03 | 0.41 | 65% |
| | Gradient | 7579 | 4250 | 0.41 | 0.38 | 1.05 | 25% |
| | RID | 7579 | 2588 | 0.13 | 0.05 | 0.44 | 60% |
| | RIPS | 7579 | 312 | 0.11 | 0.02 | 0.39 | 68% |
| Exhaustive search 14-Queens | Random | 11166 | 10830 | 0.31 | 0.11 | 2.02 | 79% |
| | Gradient | 11166 | 6310 | 0.55 | 1.60 | 3.75 | 43% |
| | RID | 11166 | 4218 | 0.20 | 0.10 | 1.90 | 84% |
| | RIPS | 11166 | 647 | 0.18 | 0.04 | 1.82 | 88% |
| Exhaustive search 15-Queens | Random | 15941 | 15459 | 1.03 | 0.54 | 11.9 | 87% |
| | Gradient | 15941 | 9058 | 1.91 | 7.33 | 19.6 | 53% |
| | RID | 15941 | 7111 | 0.79 | 0.07 | 11.2 | 92% |
| | RIPS | 15941 | 922 | 0.51 | 0.03 | 10.9 | 95% |
| IDA* search config. #1 | Random | 2895 | 2804 | 0.31 | 0.18 | 0.80 | 39% |
| | Gradient | 2895 | 2011 | 0.32 | 0.60 | 1.23 | 25% |
| | RID | 2895 | 619 | 0.40 | 0.97 | 1.68 | 19% |
| | RIPS | 2895 | 203 | 0.37 | 0.03 | 0.71 | 44% |
| IDA* search config. #2 | Random | 3382 | 3277 | 1.56 | 1.01 | 3.52 | 27% |
| | Gradient | 3382 | 2155 | 1.32 | 2.80 | 5.07 | 19% |
| | RID | 3382 | 383 | 2.41 | 6.50 | 9.80 | 10% |
| | RIPS | 3382 | 258 | 1.53 | 0.22 | 2.72 | 35% |
| IDA* search config. #3 | Random | 29046 | 28137 | 1.37 | 0.72 | 6.81 | 69% |
| | Gradient | 29046 | 19223 | 2.00 | 3.78 | 10.5 | 45% |
| | RID | 29046 | 3031 | 3.22 | 1.37 | 9.31 | 51% |
| | RIPS | 29046 | 1137 | 0.75 | 0.15 | 5.60 | 85% |
| GROMOS (8 Å) | Random | 4986 | 4828 | 0.28 | 0.13 | 2.19 | 81% |
| | Gradient | 4986 | 2305 | 0.26 | 0.31 | 2.35 | 76% |
| | RID | 4986 | 530 | 0.14 | 0.03 | 1.95 | 91% |
| | RIPS | 4986 | 495 | 0.11 | 0.02 | 1.91 | 93% |
| GROMOS (12 Å) | Random | 4986 | 4833 | 0.81 | 0.49 | 6.64 | 80% |
| | Gradient | 4986 | 2184 | 0.84 | 1.13 | 7.81 | 68% |
| | RID | 4986 | 549 | 0.32 | 0.12 | 5.78 | 92% |
| | RIPS | 4986 | 555 | 0.30 | 0.05 | 5.69 | 94% |
| GROMOS (16 Å) | Random | 4986 | 4832 | 2.42 | 1.68 | 15.4 | 73% |
| | Gradient | 4986 | 2353 | 1.04 | 2.26 | 14.6 | 77% |
| | RID | 4986 | 485 | 0.78 | 0.34 | 12.4 | 91% |
| | RIPS | 4986 | 583 | 0.68 | 0.14 | 12.1 | 93% |

Th: overhead time (seconds); Ti: idle time (seconds); T: execution time (seconds); µ: efficiency.

Generally speaking, the gradient model cannot balance the load well, since the load spreads slowly; in addition, its system overhead is large because information and tasks are exchanged frequently.

In RID, the three parameters $L_{LOW}$, $L_{threshold}$, and $u$ are set to 2, 1, and 0.4, respectively. According to [31], the value of the load update factor $u$ should be 0.9; in our implementation, however, a value of 0.9 causes load information to be exchanged too frequently and the overhead becomes too large, so $u$ is set to its optimal value, 0.4. RID performs better than the randomized allocation in most cases. However, it does not perform well for IDA*, because the synchronization at each iteration effectively reduces parallelism and a receiver-initiated approach does not do well in a lightly loaded system [11]. When the problem size becomes large, as in configuration #3, RID's performance improves.

In RIPS, the MWA algorithm balances the load very well, and the incremental scheduling corrects the remaining load imbalance. One may suspect that such an accurate load balancing algorithm carries a large overhead. A perhaps surprising observation is that the overhead of RIPS is smaller than that of the other dynamic scheduling algorithms. We illustrate the task migration communication overhead with 15-Queens as an example. Execution of this problem takes 8 system phases. There are about 1000 non-local tasks, for an average of 125 non-local tasks per system phase. Tasks are packed together for transmission, reducing communication overhead; since a uniform code image is accessible at each processor, only data are transferred. The maximum distance in an $8 \times 4$ mesh is 12. Each communication step to migrate tasks takes about 1 ms, and each system phase takes about 12 ms for task migration, so the total task migration time over 8 system phases is about 96 ms. This is a small fraction of the total system overhead of 510 ms; the rest includes the overhead of phase transfers, task creation, data communication, etc. The load is well balanced and the idle time is about 30 ms. The parallel execution time is 10.9 seconds, resulting in a speedup of 30.4 and an efficiency of 95%.

Next, we use the randomized allocation as a baseline and show the relative performance of the other scheduling algorithms. First, an optimal efficiency is calculated assuming (1) optimal scheduling and (2) no overhead. The optimal efficiency is the best possible


efficiency that can be obtained for a given problem on an ideal system. The optimal efficiencies for the different problem sizes are shown in Table II. A measure of the effectiveness of a scheduling algorithm $g$ is the normalized quality factor
$$\frac{\mu_{opt} - \mu_{rand}}{\mu_{opt} - \mu_g},$$
where $\mu_{opt}$ is the optimal efficiency, $\mu_{rand}$ is the efficiency of the randomized allocation algorithm, and $\mu_g$ is the efficiency of algorithm $g$. The factor of the randomized allocation algorithm itself is equal to 1; an algorithm that performs better than the randomized allocation has a factor larger than 1, and one that performs worse has a factor smaller than 1. The normalized quality factors for these test problems are shown in Figure 5. For small problem sizes, the factor is dominated by the system overhead, whereas for large problem sizes it is dominated by scheduling quality because the system overhead is relatively small. Therefore, the difference between scheduling algorithms is most easily recognized when the problem size is large.

Table II: Optimal Efficiencies for Test Problems

| Exhaustive Search | | IDA* Search | | GROMOS | |
|---|---|---|---|---|---|
| 13-Queens | 98.8% | config. #1 | 91.7% | 8 Å | 98.9% |
| 14-Queens | 99.2% | config. #2 | 97.2% | 12 Å | 98.9% |
| 15-Queens | 99.4% | config. #3 | 85.3% | 16 Å | 98.9% |
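As a worked example (our own arithmetic from the entries of Tables I and II, not a value reported in the paper), the normalized quality factor of RIPS on 15-Queens is
$$\frac{\mu_{opt} - \mu_{rand}}{\mu_{opt} - \mu_{RIPS}} = \frac{0.994 - 0.87}{0.994 - 0.95} \approx 2.8.$$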

Table III shows a speedup comparison of the scheduling algorithms on 64 and 128 processors. Both the randomized allocation and RID scale up well, and RIPS performs even better. The gradient model does not scale up well because it spreads the load slowly. RID performs well for the N-Queens problem, but not for IDA* search, because configuration #3 does not have enough parallelism on large systems. The value of $u$ needs to be adjusted for low parallelism on

Figure 5: Normalized Quality Factors of the Random, Gradient, RID, and RIPS algorithms. (a) Exhaustive Search (13-, 14-, and 15-Queens); (b) IDA* Search (15-puzzle configurations #1 to #3); (c) GROMOS (8 Å, 12 Å, and 16 Å).

large systems. For IDA*, $u$ has been adjusted to 0.7; for the other two problems, it remains at 0.4.

Table III: Speedup Comparison on 64 and 128 Processors

| Problem | Algorithm | 64 processors | 128 processors |
|---|---|---|---|
| Exhaustive search 15-Queens | Random | 52.2 | 99.4 |
| | Gradient Model | 28.8 | 33.7 |
| | RID | 57.3 | 96.8 |
| | RIPS | 60.2 | 107 |
| IDA* search config. #3 | Random | 35.0 | 53.9 |
| | Gradient Model | 24.6 | 36.8 |
| | RID | 20.2 | 15.4 |
| | RIPS | 43.1 | 67.1 |
| GROMOS (16 Å) | Random | 46.7 | 93.5 |
| | Gradient Model | 41.9 | 58.1 |
| | RID | 45.1 | 78.8 |
| | RIPS | 51.0 | 99.8 |

5 Previous Works

RIPS and static scheduling share some common ideas [34, 13, 12, 35, 7]. Both utilize system-wide information and perform scheduling globally to achieve high-quality load balancing, and both clearly separate the time spent scheduling from the time spent computing. However, RIPS differs from static scheduling in three respects. First, the scheduling activity is performed at runtime, so it can deal with dynamic problems. Second, any load imbalance caused by inaccurate grain-size estimation can be corrected in the next round of scheduling. Third, because scheduling is conducted incrementally, RIPS eliminates the need for a large memory space to store task graphs, which leads to better scalability for massively parallel machines and large applications.

RIPS is similar to dynamic scheduling to a certain degree. Both methods schedule tasks at runtime instead of compile time, and their scheduling decisions, in principle, depend on and adapt to

the runtime system information. However, there are substantial differences. First, in dynamic scheduling the system functions and the user computation are mixed together, whereas in RIPS there is a clear separation between system and user phases, which potentially offers easier management and lower overhead. Second, in dynamic scheduling the placement of a task is basically an individual action by a processor, based on partial system information, whereas in RIPS the scheduling activity is always an aggregate operation based on global system information.

Large research efforts have been directed towards process allocation in distributed systems [23, 10, 11, 27, 5, 6, 15, 26, 30, 2, 19]. A recent comparative study of dynamic load balancing strategies on highly parallel computers is given by Willebeek-LeMair and Reeves [31]. Eager et al. compared sender-initiated and receiver-initiated algorithms [11].

There are other scheduling algorithms that work in this "stop-and-schedule" fashion. This approach is sometimes referred to as prescheduling, and it is more closely related to RIPS. Prescheduling utilizes partial load information for load balancing. Fox et al. first adapted prescheduling to application problems with geometric structures [14, 20], and some other works also deal with this type of problem [9, 3, 1]. The PARTI project automates prescheduling for nonuniform problems [21, 4]. The dimension exchange method (DEM) is a parallel scheduling algorithm applied to application problems without geometric structure [8]; it balances the load for independent tasks of equal grain size. The method has been extended by Willebeek-LeMair and Reeves [31] so that the algorithm can run incrementally to correct load imbalance due to varied task grain sizes. The DEM scheduling algorithm generates redundant communications; it is designed specifically for the hypercube topology and is implemented much less efficiently on a simpler topology, such as a tree or a mesh [31]. RIPS uses optimal parallel scheduling algorithms to minimize the number of communications as well as the data movement. Furthermore, RIPS is a general method and applies to different topologies, such as the tree, mesh, and hypercube [32].


6 Concluding Remarks

Parallel scheduling gives load balancing a new direction. Unlike the traditional approach, processors schedule the load in parallel; this balances the load very well and effectively reduces processor idle time. Parallel scheduling is fast and scalable. Parallel incremental scheduling combines the advantages of static scheduling and dynamic scheduling, adapts to dynamic problems, and produces high-quality load balance.

It has been widely believed that a scheduling method that utilizes global information is neither practical nor scalable. This is not necessarily true when an advanced parallel scheduling technique is used. We have demonstrated a scalable scheduler that uses global load information to optimize load balancing, with overhead comparable to that of the low-overhead randomized allocation. Parallel incremental scheduling is a synchronous approach which eliminates the stability problem and is able to balance the load quickly and accurately. It applies to a wide range of applications, from slightly irregular ones to highly irregular ones.

ACKNOWLEDGMENTS — We are very grateful to Reinhard Hanxleden for the GROMOS program, Terry Clark for the SOD data, and Marc Feeley for the elegant N-Queens program. This research was partially supported by NSF grant CCR-9505300.


References

[1] S. B. Baden. Dynamic load balancing of a vortex calculation running on multiprocessors. Technical Report Vol. 22584, Lawrence Berkeley Lab., 1986.
[2] A. Barak and A. Shiloh. A distributed load-balancing policy for a multicomputer. Software-Practice and Experience, 15(9):901–913, September 1985.
[3] M. J. Berger and S. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans. Computers, C-26:570–580, 1987.
[4] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory machines. Concurrency: Practice and Experience, 1991.
[5] T. L. Casavant and J. G. Kuhl. A formal model of distributed decision-making and its application to distributed load balancing. In Int'l Conf. on Distributed Computing Systems, pages 232–239, May 1986.
[6] T. L. Casavant and J. G. Kuhl. Analysis of three dynamic distributed load-balancing strategies with varying global information requirements. In Int'l Conf. on Distributed Computing Systems, pages 185–192, May 1987.
[7] Y. C. Chung and S. Ranka. Applications and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed memory multiprocessors. In Supercomputing '92, November 1992.
[8] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J. of Parallel and Distrib. Comput., 7:279–301, 1989.
[9] K. M. Dragon and J. L. Gustafson. A low-cost hypercube load balance algorithm. In Proc. of the 4th Conf. on Hypercube Concurrent Computers and Applications, pages 583–590, 1989.
[10] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive load sharing in homogeneous distributed systems. IEEE Trans. Software Eng., SE-12(5):662–674, May 1986.
[11] D. L. Eager, E. D. Lazowska, and J. Zahorjan. A comparison of receiver-initiated and sender-initiated adaptive load sharing. Performance Eval., 6(1):53–68, March 1986.
[12] H. El-Rewini and H. H. Ali. Scheduling conditional branching using representative task graphs. The Journal of Combinatorial Mathematics and Combinatorial Computing, 1991.
[13] H. El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. Journal of Parallel and Distributed Computing, June 1990.
[14] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving Problems on Concurrent Processors, volume I. Prentice-Hall, 1988.
[15] A. Hac and X. Jin. Dynamic load balancing in a distributed system using a decentralized algorithm. In Int'l Conf. on Distributed Computing Systems, pages 170–177, May 1987.
[16] R. K. Koeninger, M. Furtney, and M. Walker. A shared memory MPP from Cray Research. Digital Technical Journal, 6(2):8–21, 1994.
[17] R. E. Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence, 27(1):97–109, September 1985.
[18] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
[19] Z. Lin. A distributed fair polling scheme applied to parallel logic programming. International Journal of Parallel Programming, 20, August 1991.
[20] J. K. Salmon. Parallel hierarchical N-body methods. Technical Report CRPC-90-14, Center for Research in Parallel Computing, Caltech, 1990.
[21] J. Saltz, R. Mirchandaney, R. Smith, D. Nicol, and K. Crowley. The PARTY parallel runtime system. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing. SIAM, 1987.
[22] J. Shen and J. A. McCammon. Molecular dynamics simulation of superoxide interacting with superoxide dismutase. Chemical Physics, 158:191–198, 1991.
[23] Niranjan G. Shivaratri, Phillip Krieger, and Mukesh Singhal. Load distributing for locally distributed systems. IEEE Computer, 25(12):33–44, December 1992.
[24] W. Shu and M. Y. Wu. Runtime Incremental Parallel Scheduling (RIPS) for large-scale parallel computers. In Proceedings of the 5th Symposium on the Frontiers of Massively Parallel Computation, pages 456–463, Feb. 1995.
[25] W. Shu and M. Y. Wu. Runtime parallel scheduling for distributed memory computers. In Int'l Conf. on Parallel Processing, August 1995.
[26] V. Singh and M. R. Genesereth. A variable supply model for distributing deductions. In 9th Int'l Joint Conf. on Artificial Intelligence, pages 39–45, August 1985.
[27] J. A. Stankovic. Simulations of three adaptive, decentralized controlled, job scheduling algorithms. Computer Networks, 8(3):199–217, June 1984.
[28] Reinhard v. Hanxleden and Ken Kennedy. Relaxing SIMD control flow constraints using loop transformations. Technical Report CRPC-TR92207, Center for Research on Parallel Computation, Rice University, April 1992.
[29] W. F. van Gunsteren and H. J. C. Berendsen. GROMOS: GROningen MOlecular Simulation software. Technical report, Laboratory of Physical Chemistry, University of Groningen, Nijenborgh, The Netherlands, 1988.
[30] Y.-T. Wang and R. J. T. Morris. Load sharing in distributed systems. IEEE Trans. Comput., C-34(3):204–217, March 1985.
[31] Marc Willebeek-LeMair and Anthony P. Reeves. Strategies for dynamic load balancing on highly parallel computers. IEEE Trans. Parallel and Distributed Systems, 4(9):979–993, September 1993.
[32] M. Y. Wu. On runtime parallel scheduling. Technical Report 9534, Dept. of Computer Science, State University of New York at Buffalo, April 1995.
[33] M. Y. Wu. Parallel incremental scheduling. Parallel Processing Letters, 1995.
[34] M. Y. Wu and D. D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Trans. Parallel and Distributed Systems, 1(3):330–343, July 1990.
[35] T. Yang and A. Gerasoulis. PYRROS: Static task scheduling and code generation for message-passing multiprocessors. In The 6th ACM Int'l Conf. on Supercomputing, July 1992.

Author Biographical Sketches:

Min-You Wu received the Ph.D. degree from Santa Clara University, California. Before joining the Department of Computer Science, State University of New York at Buffalo, where he is currently an Assistant Professor, he held various positions at the University of Illinois at Urbana-Champaign, the University of California at Irvine, Yale University, and Syracuse University. His research interests include parallel operating systems, compilers for parallel computers, programming tools, applications of parallel systems, and VLSI design. He has published over 40 journal and conference papers in these areas. He has served as a guest editor for The Journal of Supercomputing and Scientific Programming. He is a member of the IEEE. More information can be found at http://www.cs.buffalo.edu/pub/WWW/faculty/wu/wu.html.

Wei Shu received the Ph.D. degree from the University of Illinois at Urbana-Champaign in 1990. From 1989 to 1990, she worked at Yale University as an associate research scientist. Since then, she has been an Assistant Professor in the Department of Computer Science at the State University of New York at Buffalo. Her current interests include dynamic scheduling, runtime support systems for parallel processing, and parallel operating systems. She is a member of the ACM and the IEEE. More information can be found at http://www.cs.buffalo.edu/pub/WWW/faculty/shu/shu.html.

Copyright © 1995 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that new copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., via fax at +1 (212) 869-0481, or via email at [email protected].