
Annals of Operations Research 129, 65–80, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Scheduling Malleable Tasks on Parallel Processors to Minimize the Makespan ∗

JACEK BŁAŻEWICZ ∗∗, MACIEJ MACHOWIAK and JAN WĘGLARZ

[email protected]
Instytut Informatyki, Politechnika Poznanska, ul. Piotrowo 3, 60-965 Poznan, Poland

MIKHAIL Y. KOVALYOV

Faculty of Economics, Belarus State University, Minsk, Belarus

DENIS TRYSTRAM

Laboratory Informatique et Distribution, IMAG, Grenoble, France

∗ Partially supported by KBN.
∗∗ Corresponding author.

Abstract. The problem of optimal scheduling of n tasks in a parallel processor system is studied. The tasks are malleable, i.e., a task may be executed by several processors simultaneously and the processing speed of a task is a nonlinear function of the number of processors allocated to it. The total number of processors is m and it is an upper bound on the number of processors that can be used by all the tasks simultaneously. It is assumed that the number of processors is sufficient to process all the tasks simultaneously, i.e. n ≤ m. The objective is to find a task schedule and a processor allocation such that the overall task completion time, i.e. the makespan, is minimized. The problem is motivated by real-life applications of parallel computer systems in scientific computing of highly parallelizable tasks. An O(n) algorithm is presented to solve this problem when all the processing speed functions are convex. If these functions are all concave and the number of tasks is a constant, the problem can be solved in polynomial time. A relaxed problem, in which the number of processors allocated to each task is not required to be integer, can be solved in O(n max{m, n log^2 m}) time. It is proved that the minimum makespan values for the original and relaxed problems coincide. For n = 2 or n = 3, an optimal solution for the relaxed problem can be converted into an optimal solution for the original problem in constant time.

Keywords: scheduling, resource allocation, parallel computing

Introduction

The problem of scheduling malleable tasks on parallel processors can be formulated as follows, see also (Ludwig, 1995; Rapine, Scherson, and Trystram, 1998; Scherson et al., 1996; Turek, Wolf, and Yu, 1992). There are n independent tasks that are available at time zero. They are to be scheduled for processing in a system consisting of m, n ≤ m, identical parallel processors. We will use natural numbers to denote tasks and processors. At each time instant, any number of processors can be used to execute a task.



However, no processor can handle more than one task at a time and the total number of processors executing the tasks cannot exceed m at any time. An amount of work $p_j > 0$ is associated with each task j. If r processors are used to execute task j in a time interval of length t, then the amount of work done on this task within this interval is equal to $g_j(r) \cdot t$, where $g_j(r) \ge 0$ is a nondecreasing processing speed function defined for $r \in \{0, 1, \ldots, m\}$, with $g_j(0) = 0$. The total amount of work done on task j must be equal to $p_j$, $j = 1, \ldots, n$.

For each task, a schedule specifies the time intervals within which this task is executed and the numbers of processors allocated to the task within these intervals. The objective is to find a schedule which satisfies the above constraints and minimizes the maximum task completion time, i.e. the makespan $C_{\max}$. Denote the minimum $C_{\max}$ value by $C_{\max}^*$.

A motivation for this problem comes from optimal scheduling of multiprocessor systems designed for large scale parallel computations. This issue will be discussed in more detail in the next section. A similar problem, in which the processors represent a continuously divisible renewable resource bounded from above, has been studied by Węglarz (1982) (cf. also Błażewicz et al., 1996). In the sequel, we refer to our original problem with a discrete (integer) resource as problem P-DSCR and to the problem where the processor (resource) allocation is not required to be integer (i.e. with a continuously divisible resource) as problem P-CNTN. In problem P-CNTN, the functions $g_j(r)$ are interpolated by piecewise linear functions between integer points. Observe that in problem P-DSCR the number of processors allocated to a task may change during the task execution. A malleable task can be preempted and the processors allocated to it can be redistributed during the execution.

The following results are available for problem P-FIX, in which the (integer) number of processors allocated to a task cannot change. Du and Leung (1989) proved that this problem is NP-hard. If each task can use either one or k processors, then the problem can be solved in polynomial time (Błażewicz, Drabowski, and Węglarz, 1986). Turek, Wolf, and Yu (1992) showed that any λ-approximation algorithm for the two-dimensional bin-packing problem can be polynomially transformed into a λ-approximation algorithm for problem P-FIX. Based on this result, Ludwig (1995) developed a 2-approximation algorithm for P-FIX, and Rapine, Mounie, and Trystram (1999) developed a two-phase approximation algorithm with worst-case performance guarantee $\sqrt{3}$ for this problem. Prasanna and Musicus (1995) considered a P-FIX problem with precedence constraints associated with the task set. In the special case where the processing speed function of each task is $p^{\alpha}$, where p is the amount of processing capacity applied to the task and $0 < \alpha < 1$, a closed form solution for a series-parallel task graph was derived. Drozdowski (1996) provided a survey of the complexity and algorithms for multiprocessor task scheduling problems, see also (Błażewicz et al., 2000).

In this paper, we distinguish two special cases of the problem: (a) all processing speed functions are convex and (b) all of them are concave.



For case (a), an O(n) algorithm is presented to solve both problems P-CNTN and P-DSCR. For case (b), problem P-CNTN is shown to be solvable in O(n max{m, n log^2 m}) time. Moreover, it is proved that the minimum makespan values for problems P-DSCR and P-CNTN coincide. It is shown that problem P-DSCR can be solved in O((m+1)^{3n-3}) time, which is polynomial if n is a constant. For n = 2 or n = 3, an optimal solution for problem P-CNTN can be transformed into an optimal solution for problem P-DSCR in constant time.

The organization of the paper is as follows. Section 1 contains a discussion of the validity of the malleable task model and presents basic results from optimal resource allocation theory. Section 2 discusses the scheduling problem with convex processing speed functions, while in section 3 scheduling algorithms for concave processing speed functions are presented. The paper concludes with a summary of the results and suggestions for future research.

1. Preliminaries

1.1. Examples of highly parallelizable computations

We start our considerations with a few real-life applications of highly parallelizable tasks, justifying the model used.

Simulation of molecular dynamics

The simulation of molecular dynamics is one of the most challenging problems in science. The computation of atom movements is irregular if interactions are spatially limited (cut-off). An efficient execution requires advanced techniques allowing for overlapping of communications by computations, such as asynchronous buffered communications and multithreading. In the case of protein behavior, computations may require calculating interactions between hundreds of thousands of atoms (Bernard, Gautier, and Trystram, 1999). Needless to say, such an execution needs a large memory. On some of the top parallel computers, like the Cray T3E, in order to simplify hardware and optimize communications, there is no virtual memory management. Hence, when the instance of the problem does not fit into the memory of a processor, the execution cannot be performed directly. To complete the execution, the virtual memory management needs to be done "by hand" using out-of-core computations, that is, loading and storing intermediate computations on a disk. This sometimes increases the execution time. Thus, when the number of processors is sufficient for storing the whole data in the memory of these processors, a superlinear speed-up will be observed; otherwise, the processing speed function is concave (see figure 1).

Cholesky factorization

Another example of computations which can be modeled by malleable tasks are big matrix calculations (Dongarra et al., 1999) used, e.g., in Cholesky factorization. Let us assume that the whole matrix does not fit into the cache and thus slows down the computation on one processor.



Figure 1. Processing speed vs. number of processors.

When the blocks fit into the cache, the speed of the computation becomes more important than the delays caused by additional processor resources. This leads to a superlinear dependence of the processing speed on the number of processors allotted. Otherwise, we have a concave processing speed function.

Operational oceanography

Numerical modeling of the ocean circulation started in the sixties and has been continuously developed since then for climate study and operational oceanography, i.e. near real-time forecast of the "oceanic weather", in a way similar to operational meteorology. A major practical problem to be dealt with in ocean general circulation models (OGCM) is their large computational cost, which is notably greater than the cost of corresponding atmospheric models, due to differences in the typical scales of motion. For example, the order of magnitude of the size of dynamic structures like fronts or eddies is a few tens of kilometers in the ocean, while 5 or 10 times larger in the atmosphere. The horizontal resolution of OGCMs should allow the explicit representation of such structures, which leads to very important memory and CPU requirements. The computations involved in these models are run on vector and/or parallel supercomputers, and any simulation requires several hundred or thousand hours of CPU time. The objective today is to use low-cost clusters of PC machines for solving these problems.

The parallelization of ocean models is performed by domain decomposition techniques. The geographical domain is divided into subdomains, each of them being allocated to a processor. Most of the existing works usually consider as many subdomains as processors. The computation of the explicit terms is mainly local; it requires only, at the beginning of each time step, some communications between processors corresponding to adjacent subdomains, to exchange model variables along the common interfaces. On the other hand, linear systems for the implicit terms are not local, since they correspond to the discretized form of elliptic equations.



Solving these global systems is performed, for instance, by a preconditioned conjugate gradient method or by domain decomposition techniques (Blayo and Debreu, 1998).

An important point for the purpose of this work is to emphasize that ocean models are regular applications, in the sense that the volume of computations can be estimated quite precisely as a function of the grid size and the number of processors. The model of malleable tasks is an efficient way for solving such problems. Typically, the total number of malleable tasks to be solved in parallel remains very small.

1.2. Continuous resource allocation

Now, we will recall basic results from the theory of optimal continuous resource allocation (Węglarz, 1982) which are useful in solving the considered problem.

When developing a non-integer model, the question arises of how to define the processing speed function at non-integer points. A natural requirement for such a function is to maintain the main properties of the original function, such as monotonicity and convexity or concavity. These properties will be maintained if we extend the processing speed functions $g_j$ by the following piecewise linear functions $f_j$, $j = 1, \ldots, n$. Let $f_j$ be the piecewise linear function such that
$$f_j(r) = \alpha_r g_j(\lceil r \rceil) + (1 - \alpha_r) g_j(\lfloor r \rfloor), \quad \text{where } \alpha_r = r - \lfloor r \rfloor,\ 0 \le r \le m.$$
Note that $f_j(s) = g_j(s)$ for $s \in \{0, 1, \ldots, m\}$ and $j = 1, \ldots, n$. Functions $f_j$ can be convex, concave or arbitrary nondecreasing, see figure 2.

Let us calculate coefficients $b_{j,s}$ and $d_{j,s}$ such that $f_j(r) = b_{j,s} r + d_{j,s}$ for $r \in [s-1, s]$, $s = 1, \ldots, m$, $j = 1, \ldots, n$, and $b_{j,0} = d_{j,0} = 0$. Given $g_j(r)$, $r = 0, 1, \ldots, m$, $j = 1, \ldots, n$, all the coefficients $b_{j,s}$ and $d_{j,s}$ can be calculated in O(mn) time. Thus, functions $f_j$ can be represented as
$$f_j(r) = b_{j,\lceil r \rceil} r + d_{j,\lceil r \rceil} \quad \text{for } 0 \le r \le m \text{ and } j = 1, \ldots, n.$$
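The coefficient computation is straightforward. The following is a minimal Python sketch (the function and variable names are illustrative, not from the paper) of how the $b_{j,s}$ and $d_{j,s}$ of one task could be obtained from the integer-point values $g_j(0), \ldots, g_j(m)$, and of how $f_j$ is then evaluated at a fractional point.

```python
import math

def piecewise_coeffs(g):
    """Slopes b[s] and intercepts d[s] of f on [s-1, s], s = 1..m,
    given the integer-point speeds g[0..m] of one task (g[0] == 0)."""
    m = len(g) - 1
    b = [0.0] * (m + 1)   # b[0] = d[0] = 0 by convention
    d = [0.0] * (m + 1)
    for s in range(1, m + 1):
        b[s] = g[s] - g[s - 1]            # slope of the linear piece
        d[s] = g[s - 1] - b[s] * (s - 1)  # intercept, so that f(s-1) = g(s-1)
    return b, d

def f(r, b, d):
    """Interpolated speed f(r) = b[ceil(r)] * r + d[ceil(r)] for 0 <= r <= m."""
    s = max(1, math.ceil(r))
    return b[s] * r + d[s]

# Example with g(r) = sqrt(r): f(1.5) lies halfway between g(1) and g(2).
b, d = piecewise_coeffs([math.sqrt(r) for r in range(6)])
print(round(f(1.5, b, d), 4))   # 1.2071 = (1 + sqrt(2)) / 2
```

Calling piecewise_coeffs once per task gives all the coefficient pairs in the O(mn) time stated above.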

Figure 2. Piecewise linear processing speed functions.



Since $f_j(r) = g_j(r)$ for $r = 0, 1, \ldots, m$, problem P-DSCR with functions $f_j$ and problem P-DSCR with functions $g_j$ are equivalent.

Consider now the relaxed problem P-CNTN with functions $f_j$, $j = 1, \ldots, n$, in which a non-integer number of processors can be used to execute a task. In problem P-CNTN, processors represent a renewable continuously divisible resource whose amount is bounded from above by m at any time. Let $C_{\max}^0$ denote the minimum $C_{\max}$ value for problem P-CNTN. Clearly, $C_{\max}^0 \le C_{\max}^*$.

Following (Węglarz, 1982), introduce the set
$$R = \Big\{ r = (r_1, \ldots, r_n) \;\Big|\; r_j \ge 0,\ \sum_{j=1}^{n} r_j \le m \Big\}$$
of feasible (with respect to problem P-CNTN) resource allocations and the set
$$U = \big\{ u = (u_1, \ldots, u_n) \;\big|\; u_j = f_j(r_j),\ j = 1, \ldots, n,\ r \in R \big\}$$
of feasible transformed resource allocations. Denote $p = (p_1, \ldots, p_n)$.

Theorem 1 (Węglarz, 1981, 1982). Let $n \le m$, let conv U be the convex hull of the set U, i.e., the set of all convex combinations of the elements of U, and let $u = p/C$ be a straight line in the space of transformed resource allocations given by the parametric equations $u_j = p_j/C$, $j = 1, \ldots, n$. Then the minimum makespan value for problem P-CNTN can be found from
$$C_{\max}^0 = \min\Big\{ C \;\Big|\; C > 0,\ \frac{p}{C} \in \operatorname{conv} U \Big\}. \qquad (1)$$

From theorem 1 it follows that the minimum makespan value $C_{\max}^0$ for problem P-CNTN is determined by the intersection point of the straight line $u = p/C$, $C > 0$, and the boundary of the set conv U in the n-dimensional space of transformed resource allocations. Denote this intersection point by $u^0$. We have $C_{\max}^0 = p_j/u_j^0$, $j = 1, \ldots, n$. Note that the nondecreasing function $f_j$ may have a flat piece at the end of the interval [0, m]. Therefore, the equation $f_j(r) = u_j^0$ can be satisfied by several r and there may exist several optimal resource allocations corresponding to $u^0$. Let us denote by $r^0$ a tight optimal resource allocation, i.e., one such that $\sum_{j=1}^{n} r_j^0 = m$ and $f_j(r_j^0) = u_j^0$, $j = 1, \ldots, n$.
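To make (1) concrete, consider a small illustrative instance (chosen here only for illustration, not taken from the paper) with n = 2, m = 2, $p = (3, 1)$ and $f_1(r) = f_2(r) = r$. Then
$$U = \operatorname{conv} U = \{(u_1, u_2) \mid u_1, u_2 \ge 0,\ u_1 + u_2 \le 2\},$$
and (1) gives
$$C_{\max}^0 = \min\{ C > 0 \mid 3/C + 1/C \le 2 \} = 2,$$
so $u^0 = (3/2, 1/2)$ and the tight optimal allocation is $r^0 = (3/2, 1/2)$.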

2. Convex processing speed functions

In this section, we assume that all the functions $f_j$ are convex. Since the functions are piecewise linear with at most m breakpoints, their convexity can be verified in O(mn) time. Recall that the case of convex speed functions is realistic for some big computational tasks, when there is not enough fast memory (out-of-core computations).



Figure 3. Sets U, conv U and line u = p/C for convex piecewise linear processing speed functions.

Figure 4. Optimal schedule for convex processing speed functions.

It is known (see, for example, (Węglarz, 1981, 1982)) that if all the functions $f_j$ are convex, then set U is a proper subset of conv U. Consider points $v^{(j)}$ such that $v_j^{(j)} = f_j(m)$ and $v_l^{(j)} = 0$ for $l \ne j$, $j = 1, \ldots, n$, $l = 1, \ldots, n$. It is easy to see that $v^{(1)}, \ldots, v^{(n)}$ represent all vertices of a facet of the convex polytope conv U, and the intersection point $u^0$ belongs to this facet. Therefore, $u^0$ can be represented as a convex combination of $v^{(1)}, \ldots, v^{(n)}$: $u^0 = \sum_{i=1}^{n} \lambda_i v^{(i)}$, where $\lambda_i \ge 0$, $i = 1, \ldots, n$, and $\sum_{i=1}^{n} \lambda_i = 1$. This situation is shown in figure 3 for n = 2.

We have $u_j^0 = p_j/C_{\max}^0 = \sum_{i=1}^{n} \lambda_i v_j^{(i)}$, or equivalently, $p_j = \sum_{i=1}^{n} (\lambda_i C_{\max}^0) v_j^{(i)}$, $j = 1, \ldots, n$. Calculate $\Delta_j = \lambda_j C_{\max}^0 = p_j / v_j^{(j)} = p_j / f_j(m)$, $j = 1, \ldots, n$. Since $\sum_{j=1}^{n} \Delta_j = C_{\max}^0$ and $p_j = \Delta_j f_j(m)$, $j = 1, \ldots, n$, an optimal schedule is obtained by assigning one processing interval of length $\Delta_j$ to task j and, within this interval, allocating all m processors to this task, $j = 1, \ldots, n$. A diagram of an optimal schedule is given in figure 4.



Thus, if the processing speed functions $f_j$ are all convex, then an optimal schedule for problem P-CNTN can be found in O(n) time. Since the processor allocation is integer in this case, this schedule is optimal for our original problem P-DSCR as well. The case when the functions $f_j$ are not convex is more difficult.
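A minimal Python sketch of this O(n) construction (illustrative names; it assumes the values $f_j(m)$ are supplied) is given below.

```python
def convex_case_schedule(p, f_at_m):
    """Optimal schedule for convex speed functions: task j runs alone on all
    m processors for Delta_j = p_j / f_j(m) time units, one task after another.
    p[j] is the work of task j and f_at_m[j] = f_j(m)."""
    schedule = []          # list of (task, start, finish) triples, all m processors
    t = 0.0
    for j, (work, speed) in enumerate(zip(p, f_at_m)):
        delta = work / speed
        schedule.append((j, t, t + delta))
        t += delta
    return schedule, t     # t equals the optimal makespan C*_max

# Two tasks with works 8 and 3, and f_1(m) = 4, f_2(m) = 2:
print(convex_case_schedule([8.0, 3.0], [4.0, 2.0]))
# ([(0, 0.0, 2.0), (1, 2.0, 3.5)], 3.5)
```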

3. Concave processing speed functions

In this section, we assume that the functions $f_j$ are all concave. Again, their concavity can be verified in O(mn) time. Concave processing speed functions are more adequate for the majority of real large-scale parallel computations, because the efficiency of the task processing degrades as the number of processors increases, due to communication delays.

If functions $f_j$ are all concave and piecewise linear, then set U is a convex polytope in the n-dimensional space of transformed resource allocations. Therefore, conv U = U in this case.

3.1. Problem P-CNTN

We show that problem P-CNTN with a continuously divisible resource can be solved in O(n max{m, n log^2 m}) time.

Let $\hat f_j(r)$ denote the strictly increasing part of the function $f_j(r)$. Thus, there exists a number $m_j \in \{1, \ldots, m\}$ such that $\hat f_j(r) = f_j(r)$ for $0 \le r \le m_j$ and $f_j(r) = f_j(m)$ for $m_j \le r \le m$. Furthermore, function $f_j$ has no breakpoint in the interval $(m_j, m)$. Consider a tight optimal resource allocation $r^0$. Let X be the set of tasks j for which $r_j^0 > m_j$. It is easy to see that the resource allocation $\hat r^0$ such that $\hat r_j^0 = r_j^0$, $j \notin X$, and $\hat r_j^0 = m_j < r_j^0$, $j \in X$, is optimal as well. Let $\hat f_j^{-1}$ denote the inverse function of $\hat f_j$, $j = 1, \ldots, n$, see, for example, (Stanat and McAllister, 1977). Denote $\hat f(r) = u$ and $\hat f^{-1}(u) = r$ for r and u such that $\hat f_j(r_j) = u_j$, $j = 1, \ldots, n$.

To construct an optimal solution, we use theorem 1. The intersection point $u^0$ of the straight line $u = p/C$, $C > 0$, and the boundary of the convex polytope U, and the corresponding tight optimal resource allocation $r^0$, can be found as follows. Consider the set I of all intersection points of the hyperplanes $u_j = f_j(r_j)$, $j = 1, \ldots, n$, $r_j = 1, \ldots, m_j$, with the line $u = p/C$, $C > 0$, in the n-dimensional space of transformed resource allocations, see figure 5. Find the point $u^{\min} \in I$ with the minimum value of C. If $\min\{p_j/f_j(m_j) \mid j = 1, \ldots, n\}$ is reached for $j = j^*$, then this point is the intersection of the hyperplane $u_{j^*} = f_{j^*}(m_{j^*})$ with the line $u = p/C$. Furthermore, if $\hat f^{-1}(u^{\min}) = r^{\min}$ and $\sum_{j=1}^{n} r_j^{\min} \le m$, then $r^{\min}$ is an optimal resource allocation, from which a tight optimal resource allocation $r^0$ can easily be derived. In the rest of this subsection, we assume that $\sum_{j=1}^{n} r_j^{\min} > m$.



Figure 5. Finding the intersection point $u^0$ of the line u = p/C and the boundary of the convex polytope U for n = 2.

Let $\bar u \in I$, $\hat f^{-1}(\bar u) = \bar r$, be a point such that $\sum_{j=1}^{n} \bar r_j \ge m$ and there is no point $u \in I \setminus \{\bar u\}$ such that $u_1 < \bar u_1$ and $\sum_{j=1}^{n} r_j \ge m$, where $r = \hat f^{-1}(u)$.

Since both $u^0$ and $\bar u$ lie on the same line $u = p/C$, $C > 0$, and $u^0$ is closer to the origin than $\bar u$, we have $u_j^0 \le \bar u_j$ and, consequently, $\hat r_j^0 \le \bar r_j$, $j = 1, \ldots, n$. Moreover, since there is no other point $u \in I$ between $u^0$ and $\bar u$, no function $f_j$ has a breakpoint in the interval $(\hat r_j^0, \bar r_j)$ and, hence, in the interval $(r_j^0, \bar r_j)$, $j = 1, \ldots, n$. Therefore, we must have
$$u_j^0 = f_j(r_j^0) = b_{j,\lceil \bar r_j \rceil} r_j^0 + d_{j,\lceil \bar r_j \rceil} = \frac{p_j}{C_{\max}^0}, \quad j = 1, \ldots, n. \qquad (2)$$
From (2), it follows that
$$r_j^0 = \frac{p_j}{C_{\max}^0 b_{j,\lceil \bar r_j \rceil}} - \frac{d_{j,\lceil \bar r_j \rceil}}{b_{j,\lceil \bar r_j \rceil}}, \quad j = 1, \ldots, n. \qquad (3)$$
Then, from the condition $\sum_{j=1}^{n} r_j^0 = m$, we can find
$$C_{\max}^0 = \frac{\sum_{j=1}^{n} p_j / b_{j,\lceil \bar r_j \rceil}}{m + \sum_{j=1}^{n} d_{j,\lceil \bar r_j \rceil} / b_{j,\lceil \bar r_j \rceil}}$$
and apply formulae (2) and (3) to calculate $u^0$ and $r^0$, respectively.

Since $p_j = u_j^0 C_{\max}^0$, $j = 1, \ldots, n$, and $\sum_{j=1}^{n} r_j^0 = m$, there is an optimal schedule for problem P-CNTN in which all the tasks are processed in the interval $[0, C_{\max}^0]$ and task j uses $r_j^0$ processors, $j = 1, \ldots, n$.

From these considerations it follows that, in order to calculate $r^0$ and $u^0$, we have to find the resource allocation $\bar r$. It can be found as follows. For each task j, $j = 1, \ldots, n$, perform the following bisection search procedure BS. Set $\bar u^{(j)} = \phi$, $l = 0$ and $t = m_j$.



In each intermediate iteration of procedure BS, for the trial value $r = (l + t)/2$, and in the first iteration, for $r = m_j$, find the intersection point $u^{(j)} = (u_1^{(j)}, \ldots, u_n^{(j)})$ of the hyperplane $u_j = f_j(r)$ and the line $u = p/C$: $u_j^{(j)} = f_j(r)$ and $u_i^{(j)} = p_i f_j(r)/p_j$ for $i \ne j$, $i = 1, \ldots, n$. This can be done in O(n) time.

Find $r^{(j)} = \hat f^{-1}(u^{(j)})$ as follows. First, set $r_j^{(j)} = r$. Then, for $i = 1, \ldots, n$, $i \ne j$, find the breakpoint $s_i \in \{1, \ldots, m_i\}$ such that $f_i(s_i - 1) < u_i^{(j)} \le f_i(s_i)$. The value $s_i$ can be found in O(log m) time by using a bisection search over the range $\{0, 1, \ldots, m_i\}$. Then $u_i^{(j)} = f_i(r_i^{(j)}) = b_{i,s_i} r_i^{(j)} + d_{i,s_i}$ and $r_i^{(j)} = (u_i^{(j)} - d_{i,s_i})/b_{i,s_i}$. Thus, $r^{(j)}$ can be found in O(n log m) time.

If $\sum_{i=1}^{n} r_i^{(j)} \ge m$, then reset $\bar u^{(j)} = u^{(j)}$ and $t = r$, keep l unchanged and go to the next iteration of BS. If $\sum_{i=1}^{n} \hat f^{-1}(u^{(j)})_i < m$, then reset $l = r$, keep t unchanged and go to the next iteration of BS. The procedure stops when $t - l < 1$.

The resource allocation $\bar r$ can then be found as $\bar r = \hat f^{-1}(\bar u)$, where $\bar u$ is determined from $\bar u_1 = \min\{\bar u_1^{(j)} \mid \bar u^{(j)} \ne \phi,\ j = 1, \ldots, n\}$.

Since the number of iterations of procedure BS does not exceed O(log m) and each iteration can be executed in O(n log m) time, the point $\bar u^{(j)}$ can be found in O(n log^2 m) time for each j. Therefore, the resource allocation $\bar r$ can be found in O(n^2 log^2 m) time. Since O(mn) time is needed to find all the coefficients of the piecewise linear functions $f_j$, problem P-CNTN can be solved in O(n max{m, n log^2 m}) time.
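The closed-form step above is easy to implement. The following Python sketch (illustrative names, not the paper's code) computes $C_{\max}^0$ from the displayed closed form and $r^0$ from formula (3), assuming the segment indices $\lceil \bar r_j \rceil$ have already been found, e.g., by procedure BS; it reproduces the numbers of the numerical example given at the end of section 3.2.

```python
import math

def segment_coeffs(g, s):
    """Slope and intercept of the linear piece of f on [s-1, s]."""
    b = g[s] - g[s - 1]
    return b, g[s - 1] - b * (s - 1)

def cntn_optimum(p, m, g, seg):
    """C^0_max (closed form above) and r^0 (formula (3)), given the segment
    indices seg[j] = ceil(bar r_j)."""
    coeffs = [segment_coeffs(g[j], seg[j]) for j in range(len(p))]
    cmax = (sum(pj / b for pj, (b, _) in zip(p, coeffs))
            / (m + sum(d / b for (b, d) in coeffs)))
    r = [pj / (cmax * b) - d / b for pj, (b, d) in zip(p, coeffs)]
    return cmax, r

# Instance used in the example below: g_j(r) = sqrt(r), p = (7, 5, 2), m = 5,
# with bar r = (3, 1.57, 0.49), i.e. segment indices (3, 2, 1).
g = [[math.sqrt(r) for r in range(6)] for _ in range(3)]
cmax, r = cntn_optimum([7.0, 5.0, 2.0], 5, g, [3, 2, 1])
print(round(cmax, 2), [round(x, 2) for x in r])   # 4.07 [2.96, 1.55, 0.49]
```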



3.2. Problem P-DSCR

General case

Firstly, we prove that $C_{\max}^0 = C_{\max}^*$. Then, we show that problem P-DSCR can be solved in O((m+1)^{3n-3}) time. For n = 2 or n = 3, we show that an optimal solution for problem P-CNTN can be transformed into an optimal solution for problem P-DSCR in constant time.

We begin by analyzing the properties of the convex polytope U. Let $0 = (0, \ldots, 0)$ represent the zero transformed resource allocation. We prove that the resource allocations corresponding to the vertices of U, excluding vertex 0, are feasible with respect to problem P-DSCR. Let V(U) denote the set of vertices of the polytope U excluding vertex 0.

Lemma 2. For each vertex $v \in V(U)$, there exists a tight feasible resource allocation r(v) for problem P-DSCR such that $f_j(r(v)_j) = v_j$, $r(v)_j \in \{0, 1, \ldots, m\}$, $j = 1, \ldots, n$, and $\sum_{j=1}^{n} r(v)_j = m$.

Proof. Let B(U) denote the part of the boundary of the polytope U such that $u \in B(U)$ if and only if there exists a tight resource allocation r(u) corresponding to u, i.e., one such that $f_j(r(u)_j) = u_j$, $j = 1, \ldots, n$, and $\sum_{j=1}^{n} r(u)_j = m$. Here, we do not require integrality of $r(u)_j$, $j = 1, \ldots, n$.

It is easy to see that $V(U) \subset B(U)$. Furthermore, from $v \in V(U)$ it follows that there are no two distinct points $v'$ and $v''$ from U such that $v = \frac{1}{2}(v' + v'')$. Assume that $v \in V(U)$ and the corresponding tight resource allocation r(v) is not integer in some coordinates. Since $\sum_{j=1}^{n} r(v)_j = m$ is integer, there exist at least two coordinates of r(v), say $r(v)_i$ and $r(v)_j$, which are both non-integer. Consider resource allocations $r'$ and $r''$ such that $r'_k = r''_k = r(v)_k$ for $k \notin \{i, j\}$ and $r'_i = r(v)_i - \delta$, $r'_j = r(v)_j + \delta$, $r''_i = r(v)_i + \delta$, $r''_j = r(v)_j - \delta$, where $\delta = \min\{\lceil r(v)_i \rceil - r(v)_i,\ r(v)_i - \lfloor r(v)_i \rfloor,\ \lceil r(v)_j \rceil - r(v)_j,\ r(v)_j - \lfloor r(v)_j \rfloor\} > 0$. It is easy to see that $r'$ and $r''$ are feasible for problem P-CNTN and $r(v) = \frac{1}{2}(r' + r'')$. Moreover, δ is defined so that no function $f_j$ has a breakpoint between $r'$ and $r''$. Therefore, $v = \frac{1}{2}(u' + u'')$, where $u' = f(r') \in U$ and $u'' = f(r'') \in U$. This contradiction completes the proof. □

We now show that $C_{\max}^0 = C_{\max}^*$. Assume that $u^0$ and $C_{\max}^0$ are found (see section 3.1). Let $F^0$ be a facet of the polytope U such that $u^0 \in F^0$ and let $v^{(i)}$, $i = 1, \ldots, k$, be all vertices of the facet $F^0$. Then, similarly to the case of convex functions $f_j$ (see section 2), the intersection point $u^0$ can be represented as a convex combination of the points $v^{(1)}, \ldots, v^{(k)}$: $u^0 = \sum_{i=1}^{k} \lambda_i v^{(i)}$, where $\lambda_i \ge 0$, $i = 1, \ldots, k$, and $\sum_{i=1}^{k} \lambda_i = 1$. If $v^{(1)}, \ldots, v^{(k)}$ are known, then the coefficients $\lambda_i$, $i = 1, \ldots, k$, can be found in O(k^3) time. Notice that the number of strictly positive coefficients $\lambda_i$ does not exceed n. Assume without loss of generality that $\lambda_i > 0$ for $i = 1, \ldots, l$, $l \le n$.

Thus, $p_j/C_{\max}^0 = \sum_{i=1}^{l} \lambda_i v_j^{(i)}$, or equivalently, $p_j = \sum_{i=1}^{l} (\lambda_i C_{\max}^0) v_j^{(i)}$, $j = 1, \ldots, n$. Calculate $\Delta_i = \lambda_i C_{\max}^0$, $i = 1, \ldots, l$. Since $\sum_{i=1}^{l} \Delta_i = C_{\max}^0$ and $p_j = \sum_{i=1}^{l} \Delta_i v_j^{(i)}$, $j = 1, \ldots, n$, a schedule with the makespan value $C_{\max}^0$ can be described as follows. There are l, $l \le n$, processing intervals of lengths $\Delta_1, \ldots, \Delta_l$. In the interval of length $\Delta_i$, the number of processors allocated to task j is equal to $r(v^{(i)})_j$, $j = 1, \ldots, n$, where $r(v^{(i)})$ is a tight feasible resource allocation corresponding to $v^{(i)}$. Therefore, $C_{\max}^0 = C_{\max}^*$ and the constructed schedule is optimal for both problems P-CNTN and P-DSCR.

The presented approach to solving problem P-DSCR is efficient if the facet $F^0$ can be found efficiently and the number of its vertices is sufficiently small. However, an efficient procedure for identifying $F^0$ is unknown. Moreover, there exists no reasonable upper bound on the number of vertices of $F^0$.

We now show that problem P-DSCR can be solved in O((m+1)^{3n-3}) time. Let us construct the set N(U) such that $u \in N(U)$ if and only if there exists a tight feasible resource allocation corresponding to u. Let $v^{(1)}, \ldots, v^{(|N(U)|)}$ be all elements of N(U). It is clear that $u^0$ can be represented as a convex combination of the elements of N(U): $u_j^0 = \sum_{i=1}^{|N(U)|} \lambda_i v_j^{(i)}$, $j = 1, \ldots, n$. This system of linear equations with |N(U)| variables $\lambda_i$ and n constraints can be solved in O(|N(U)|^3) time. In a solution of this system, the number of strictly positive coefficients $\lambda_i$ is at most n. Let them be $\lambda_1, \ldots, \lambda_l$.



Figure 6. Points $r^0$, $r^{(1)}$ and $r^{(2)}$.

An optimal solution to problem P-DSCR can be constructed as shown above for the case $u^0 \in F^0$. Since $|N(U)| \le (m+1)^{n-1}$, problem P-DSCR can be solved in O((m+1)^{3n-3}) time, which is polynomial if the number of tasks n is fixed.

Case n ∈ {2, 3}

We now show that if $r^0$, $u^0$ and $C_{\max}^0$ are known and n ∈ {2, 3}, then an optimal solution for problem P-DSCR can be constructed in constant time.

Let n = 2. If $r^0$ is integer, then an optimal schedule is obtained by having one processing interval $[0, C_{\max}^0]$ for both tasks and, within this interval, allocating $r_j^0$ processors to task j, j = 1, 2. Assume that $r^0$ is not integer. Then $r^0$ belongs to the interval $(r^{(1)}, r^{(2)})$, where $r^{(1)} = (\lfloor r_1^0 \rfloor, \lceil r_2^0 \rceil)$ and $r^{(2)} = (\lceil r_1^0 \rceil, \lfloor r_2^0 \rfloor)$ (see figure 6). In this case, $r^0 = \lambda r^{(1)} + (1 - \lambda) r^{(2)}$, where $\lambda = \lceil r_1^0 \rceil - r_1^0 = r_2^0 - \lfloor r_2^0 \rfloor$. Since no function $f_1$ or $f_2$ has a breakpoint between $r^{(1)}$ and $r^{(2)}$, we have $u^0 = \lambda f(r^{(1)}) + (1 - \lambda) f(r^{(2)})$.

In an optimal schedule for problem P-DSCR with two tasks, there are two processing intervals of lengths $\Delta_1 = (\lceil r_1^0 \rceil - r_1^0) C_{\max}^0$ and $\Delta_2 = (1 + r_1^0 - \lceil r_1^0 \rceil) C_{\max}^0$. In the interval of length $\Delta_1$, the numbers of processors allocated to tasks 1 and 2 are equal to $\lfloor r_1^0 \rfloor$ and $\lceil r_2^0 \rceil$, respectively, and in the interval of length $\Delta_2$ they are equal to $\lceil r_1^0 \rceil$ and $\lfloor r_2^0 \rfloor$, respectively.
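A minimal Python sketch of this two-task rounding (illustrative names; it assumes $r^0$ and $C_{\max}^0$ have been computed as in section 3.1) is shown below.

```python
import math

def round_two_tasks(r0, cmax0):
    """Turn a continuous optimum r0 = (r1, r2) with r1 + r2 = m into the one or
    two integer-allocation intervals described above; returns a list of
    (interval_length, (processors_task1, processors_task2)) pairs."""
    r1, r2 = r0
    if r1 == int(r1):                       # integer allocation: a single interval
        return [(cmax0, (int(r1), int(r2)))]
    lam = math.ceil(r1) - r1                # lambda = ceil(r1) - r1 = r2 - floor(r2)
    return [(lam * cmax0, (math.floor(r1), math.ceil(r2))),
            ((1 - lam) * cmax0, (math.ceil(r1), math.floor(r2)))]

print(round_two_tasks((2.25, 1.75), 4.0))
# [(3.0, (2, 2)), (1.0, (3, 1))]
```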

Let n = 3. There are three cases to consider:

(1) all coordinates of $r^0$ are integer, (2) one coordinate of $r^0$ is integer and the two other coordinates are non-integer, and (3) all coordinates of $r^0$ are non-integer.



Figure 7. Points $r^0$, $r^{(1)}$, $r^{(2)}$ and $r^{(3)}$.

In case (1), an optimal schedule is obtained by having one processing interval $[0, C_{\max}^0]$ for all tasks and, within this interval, allocating $r_j^0$ processors to task j, j = 1, 2, 3.

Consider case (2). Assume without loss of generality that only $r_1^0$ is integer. Then it is easy to see that $r^0$ belongs to the interval $(r^{(1)}, r^{(2)})$, where $r^{(1)} = (r_1^0, \lfloor r_2^0 \rfloor, m - (r_1^0 + \lfloor r_2^0 \rfloor))$ and $r^{(2)} = (r_1^0, \lceil r_2^0 \rceil, m - (r_1^0 + \lceil r_2^0 \rceil))$. Then $r^0 = \lambda r^{(1)} + (1 - \lambda) r^{(2)}$, where $\lambda = \lceil r_2^0 \rceil - r_2^0$.

In an optimal schedule, there are two processing intervals of lengths $\Delta_1 = \lambda C_{\max}^0$ and $\Delta_2 = (1 - \lambda) C_{\max}^0$. In both intervals, the number of processors allocated to task 1 is equal to $r_1^0$. In the interval of length $\Delta_1$, the numbers of processors allocated to tasks 2 and 3 are equal to $\lfloor r_2^0 \rfloor$ and $m - (r_1^0 + \lfloor r_2^0 \rfloor)$, respectively, and in the interval of length $\Delta_2$ they are equal to $\lceil r_2^0 \rceil$ and $m - (r_1^0 + \lceil r_2^0 \rceil)$, respectively.

In case (3), it is easy to see that $r^0$ belongs to a triangle with vertices $r^{(1)}$, $r^{(2)}$, $r^{(3)}$ which satisfy (i) $r_j^{(i)} \in \{\lfloor r_j^0 \rfloor, \lceil r_j^0 \rceil\}$, j = 1, 2, 3, i = 1, 2, 3, and (ii) $r_1^{(i)} + r_2^{(i)} + r_3^{(i)} = m$, i = 1, 2, 3. This situation is shown in figure 7.

Points $r^{(1)}$, $r^{(2)}$, $r^{(3)}$ can be found as follows. Observe that only the following four points may satisfy properties (i) and (ii):
$$a = \big(\lceil r_1^0 \rceil, \lceil r_2^0 \rceil, m - (\lceil r_1^0 \rceil + \lceil r_2^0 \rceil)\big), \qquad b = \big(\lfloor r_1^0 \rfloor, \lfloor r_2^0 \rfloor, m - (\lfloor r_1^0 \rfloor + \lfloor r_2^0 \rfloor)\big),$$
$$c = \big(\lfloor r_1^0 \rfloor, \lceil r_2^0 \rceil, m - (\lfloor r_1^0 \rfloor + \lceil r_2^0 \rceil)\big), \qquad d = \big(\lceil r_1^0 \rceil, \lfloor r_2^0 \rfloor, m - (\lceil r_1^0 \rceil + \lfloor r_2^0 \rfloor)\big).$$



Moreover, only one of the first two points may satisfy (i) and (ii). Suppose both a and b satisfy (i) and (ii). From (i) and (ii), it follows that for a we must have $m - (\lceil r_1^0 \rceil + \lceil r_2^0 \rceil) = \lfloor r_3^0 \rfloor$ and for b we must have $m - (\lfloor r_1^0 \rfloor + \lfloor r_2^0 \rfloor) = \lceil r_3^0 \rceil$. By subtracting the former equation from the latter one, we obtain $\lceil r_3^0 \rceil - \lfloor r_3^0 \rfloor = (\lceil r_1^0 \rceil + \lceil r_2^0 \rceil) - (\lfloor r_1^0 \rfloor + \lfloor r_2^0 \rfloor) = 2$, because both $r_1^0$ and $r_2^0$ are non-integer. This contradiction shows that only one of the points a and b satisfies (i) and (ii). If $\lceil r_1^0 \rceil + \lceil r_2^0 \rceil + \lfloor r_3^0 \rfloor = m$, then we set $r^{(1)} = a$. Otherwise, we set $r^{(1)} = b$. The remaining two points of the triangle containing $r^0$ are $r^{(2)} = c$ and $r^{(3)} = d$.

Thus, there exist $\lambda_1 > 0$, $\lambda_2 > 0$ and $\lambda_3 > 0$ such that $\sum_{i=1}^{3} \lambda_i = 1$ and $r_j^0 = \lambda_1 r_j^{(1)} + \lambda_2 r_j^{(2)} + \lambda_3 r_j^{(3)}$, j = 1, 2, 3. This system of three linear equations with three variables $\lambda_1$, $\lambda_2$, $\lambda_3$ can be solved in constant time.

In an optimal schedule for problem P-DSCR with three tasks, there are three processing intervals of lengths $\Delta_i = \lambda_i C_{\max}^0$, i = 1, 2, 3. In the interval of length $\Delta_i$, the number of processors allocated to task j is equal to $r_j^{(i)}$, i = 1, 2, 3, j = 1, 2, 3.

As a result of the above procedure we see that problem P-DSCR can be solved in constant time for n = 2 or 3, provided that one knows an optimal solution for the continuous case. To illustrate the above approach we present the following example.

Example. Consider problem P-DSCR, in which there are n = 3 tasks with amounts of work $p_1 = 7$, $p_2 = 5$ and $p_3 = 2$. The number of processors is equal to m = 5. The processing speed functions are the same for all tasks: $g_j(r) = \sqrt{r}$ for $r = 1, \ldots, m$, $j = 1, \ldots, n$. This problem can be solved as follows.

1. Calculate the coefficients of the piecewise linear functions $f_j(r) = f(r) = b_{\lceil r \rceil} r + d_{\lceil r \rceil}$, $0 \le r \le m$, $j = 1, \ldots, n$:
$$b_1 = 1,\ d_1 = 0, \qquad b_2 = \sqrt{2} - 1,\ d_2 = 2 - \sqrt{2}, \qquad b_3 = \sqrt{3} - \sqrt{2},\ d_3 = 3\sqrt{2} - 2\sqrt{3},$$
$$b_4 = 2 - \sqrt{3},\ d_4 = 4\sqrt{3} - 6, \qquad b_5 = \sqrt{5} - 2,\ d_5 = 10 - 4\sqrt{5}.$$

2. Find the resource allocation $\bar r$ such that $f(\bar r) \in I$, $\sum_{j=1}^{n} \bar r_j \ge m$ and there is no point $u \in I \setminus \{f(\bar r)\}$ such that $u_1 < \bar u_1$ and $\sum_{j=1}^{n} r_j \ge m$, where $r = f^{-1}(u)$ (for the definition of I see section 3.1). We have $\bar r = (3, 1.57, 0.49)$.

3. Calculate
$$C_{\max}^* = C_{\max}^0 = \frac{\sum_{j=1}^{n} p_j / b_{\lceil \bar r_j \rceil}}{m + \sum_{j=1}^{n} d_{\lceil \bar r_j \rceil} / b_{\lceil \bar r_j \rceil}} = 4.07.$$
Using formula (3), calculate $r^0 = (2.96, 1.55, 0.49)$.

4. Since $r^0$ is not integer, find points $r^{(1)} = (2, 2, 1)$, $r^{(2)} = (3, 1, 1)$ and $r^{(3)} = (3, 2, 0)$ such that properties (i) and (ii) are satisfied.



Figure 8. An optimal schedule for the example (n = 3 and m = 5).

5. Solve the following system of linear equations:
$$2.96 = 2\lambda_1 + 3\lambda_2 + 3\lambda_3, \qquad 1.55 = 2\lambda_1 + \lambda_2 + 2\lambda_3, \qquad 0.49 = \lambda_1 + \lambda_2,$$
and find $\lambda_1 = 0.04$, $\lambda_2 = 0.45$ and $\lambda_3 = 0.51$.

6. Calculate $\Delta_1 = \lambda_1 C_{\max}^0 = 0.17$, $\Delta_2 = \lambda_2 C_{\max}^0 = 1.83$ and $\Delta_3 = \lambda_3 C_{\max}^0 = 2.07$.

An optimal schedule is shown in figure 8.
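Steps 5 and 6 can be checked numerically. The short Python sketch below (illustrative; it uses the rounded values printed above, so the first interval length comes out as 0.16 rather than the 0.17 obtained from unrounded intermediate values) solves the 3 × 3 system and computes the interval lengths.

```python
import numpy as np

A = np.array([[2.0, 3.0, 3.0],     # columns correspond to r^(1), r^(2), r^(3)
              [2.0, 1.0, 2.0],
              [1.0, 1.0, 0.0]])
r0 = np.array([2.96, 1.55, 0.49])
lam = np.linalg.solve(A, r0)
cmax0 = 4.07
print(np.round(lam, 2))            # [0.04 0.45 0.51]
print(np.round(lam * cmax0, 2))    # [0.16 1.83 2.08]
```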

4. Conclusions

The problem of scheduling n malleable tasks on m ≥ n parallel processors has been studied. Processing speeds of the tasks depend nonlinearly on the number of processors granted. If the processing speed functions are all convex, then the problem is solvable in O(n) time. If all of them are concave, then the continuous case of resource (processor) allocation can be solved in O(n max{m, n log^2 m}) time. On the other hand, the more realistic discrete case can be solved in O((m+1)^{3n-3}) time for arbitrary n, and in constant time for n = 2 or 3, given an optimal solution for the continuous case. The number of tasks is often equal to 2 or 3 in centers solving highly parallelizable scientific computing programs. Further research can concentrate on finding more efficient algorithms for the discrete case with the number of tasks n ≥ 4, as well as for the case n > m.

Acknowledgments

The results of this paper were obtained while M.Y. Kovalyov was visiting the Technical University of Poznan. The support of this visit is gratefully acknowledged.



References

Bernard, P.E., T. Gautier, and D. Trystram. (1999). "Large Scale Simulation of Parallel Molecular Dynamics." In Proceedings of the Second Merged Symposium IPPS/SPDP, 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, San Juan, Puerto Rico.
Blayo, E. and L. Debreu. (1998). "Adaptive Mesh Refinement for Finite Difference Ocean Models: First Experiments." Journal of Physical Oceanography 29, 1239–1250.
Błażewicz, J., M. Drabowski, and J. Węglarz. (1986). "Scheduling Multiprocessor Tasks to Minimize Schedule Length." IEEE Transactions on Computers 35, 389–393.
Błażewicz, J., K. Ecker, E. Pesch, G. Schmidt, and J. Węglarz. (1996). Scheduling Computer and Manufacturing Processes. Berlin/New York: Springer.
Błażewicz, J., K. Ecker, B. Plateau, and D. Trystram (eds.). (2000). Handbook on Parallel and Distributed Processing. Berlin/New York: Springer.
Dongarra, J., I. Duff, D. Sorensen, and H. van der Vorst. (1999). Numerical Linear Algebra for High Performance Computers (Software, Environments, Tools). Philadelphia, PA: SIAM.
Drozdowski, M. (1996). "Scheduling Multiprocessor Tasks – An Overview." European Journal of Operational Research 94, 215–230.
Du, J. and J.Y.-T. Leung. (1989). "Complexity of Scheduling Parallel Task Systems." SIAM Journal on Discrete Mathematics 2, 473–487.
Ludwig, W.T. (1995). "Algorithms for Scheduling Malleable and Nonmalleable Parallel Tasks." Ph.D. thesis, University of Wisconsin–Madison, Department of Computer Science.
Mounie, G., C. Rapine, and D. Trystram. (1999). "Efficient Approximation Algorithms for Scheduling Malleable Tasks." In Eleventh ACM Symposium on Parallel Algorithms and Architectures (SPAA'99), ACM, pp. 23–32.
Prasanna, G.N.S. and B.R. Musicus. (1995). "The Optimal Control Approach to Generalized Multiprocessor Scheduling." Algorithmica.
Rapine, C., I. Scherson, and D. Trystram. (1998). "On-line Scheduling of Parallelizable Jobs." In Lecture Notes in Computer Science, Vol. 1470. New York: Springer.
Scherson, I., R. Subramanian, V. Reis, and L. Campos. (1996). "Scheduling Computationally Intensive Data Parallel Programs." In Ecole Française de Parallélisme, Réseaux et Systèmes, Placement Dynamique et Répartition de Charge: Application aux Systèmes Parallèles et Répartis, pp. 107–129. Presqu'île de Giens: INRIA.
Stanat, D.F. and D.F. McAllister. (1977). Discrete Mathematics in Computer Science. Englewood Cliffs, NJ: Prentice-Hall.
Turek, J., J. Wolf, and P. Yu. (1992). "Approximate Algorithms for Scheduling Parallelizable Tasks." In 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 323–332.
Węglarz, J. (1981). "Project Scheduling with Continuously-Divisible, Doubly Constrained Resources." Management Science 27, 1040–1052.
Węglarz, J. (1982). "Modelling and Control of Dynamic Resource Allocation Project Scheduling Systems." In S.G. Tzafestas (ed.), Optimization and Control of Dynamic Operational Research Models. Amsterdam: North-Holland.