Selective Preemption Strategies for Parallel Job Scheduling

Rajkumar Kettimuthu, Vijay Subramani, Srividya Srinivasan, Thiagaraja Gopalsamy, D. K. Panda, and P. Sadayappan

R. Kettimuthu is with Argonne National Laboratory. V. Subramani and S. Srinivasan are with Microsoft Corporation. T. Gopalsamy is with Altera Corporation. D. K. Panda and P. Sadayappan are with the Ohio State University.

Abstract— Although theoretical results have been established regarding the utility of preemptive scheduling in reducing average job turnaround time, job suspension/restart is not much used in practice at supercomputer centers for parallel job scheduling. A number of questions remain unanswered regarding the practical utility of preemptive scheduling. We explore this issue through a simulation-based study, using real job logs from supercomputer centers. We develop a tunable selective-suspension strategy and demonstrate its effectiveness. We also present new insights into the effect of preemptive scheduling on different job classes and address the impact of suspensions on worst-case response time. Further, we analyze the performance of the proposed schemes under different load conditions.

Index Terms— Preemptive scheduling, parallel job scheduling, backfilling.

I. INTRODUCTION

Although theoretical results have been established regarding the effectiveness of preemptive scheduling strategies in reducing average job turnaround time [1]–[5], preemptive scheduling is not currently used for scheduling parallel jobs at supercomputer centers. Compared to the large number of studies that have investigated nonpreemptive scheduling of parallel jobs [6]–[21], little research has been reported on the evaluation of preemptive scheduling strategies using real job logs [22]–[25].

The basic idea behind preemptive scheduling is simple: if a long-running job is temporarily suspended and a waiting short job is allowed to run to completion first, the wait time of the short job is significantly decreased, without much fractional increase in the turnaround time of the long job. Consider a long job with run time $T_l$ that starts at time 0. After time $t$, let a short job arrive with run time $T_s$. If the short job were to run after completion of the long job, the average turnaround time would be $(T_l + (T_l - t) + T_s)/2$. Instead, if the long job were suspended when the short job arrived, the turnaround times of the short and long jobs would be $T_s$ and $T_l + T_s$, respectively, giving an average of $(T_l + 2T_s)/2$. The average turnaround time with suspension is therefore lower if $T_s < T_l - t$,

that is, if the remaining run time of the running job is greater than the run time of the waiting job. The suspension criterion has to be chosen carefully to ensure freedom from starvation. Also, the suspension scheme should bring down the average turnaround times without increasing the worst-case turnaround times. Even though theoretical results [1]–[5] have established that preemption improves the average turnaround time, it is important to evaluate preemptive scheduling schemes using realistic job mixes derived from actual job logs from supercomputer centers, in order to understand the effect of suspension on various categories of jobs. The primary contributions of this work are as follows:
- Development of a selective-suspension strategy for preemptive scheduling of parallel jobs,
- Characterization of the significant variability in the average job turnaround time for different job categories, and
- Demonstration of the impact of suspension on the worst-case turnaround times of various categories, and development of a tunable scheme to improve worst-case turnaround times.

This paper is organized as follows. Section II provides background on parallel job scheduling and discusses prior work on preemptive job scheduling. Section III characterizes the workload used for the simulations. Section IV presents the proposed selective preemption strategies and evaluates their performance under the assumption of accurate estimation of job run times. Section V studies the impact of inaccuracies in user estimates of run time on the selective preemption strategies; it also models the overhead for job suspension and restart and evaluates the proposed schemes in the presence of this overhead. Section VI describes the performance of the selective preemption strategies under different load conditions. Section VII summarizes the results of this work.

II. BACKGROUND AND RELATED WORK

Scheduling of parallel jobs is usually viewed in terms of a 2D chart with time along one axis and the number of processors along the other axis. Each job can be thought of as a rectangle whose width is the user-estimated run time


and height is the number of processors requested. Parallel job scheduling strategies have been widely studied in the past [26]–[33]. The simplest way to schedule jobs is to use the first-come-first-served (FCFS) policy. This approach suffers from low system utilization, however, because of fragmentation of the available processors. Consider a scenario where a few jobs are running in the system and many processors are idle, but the next queued job requires all the processors in the system. An FCFS scheduler would leave the free processors idle even if there were waiting queued jobs requiring only a few processors. Some solutions to this problem are to use dynamic partitioning [34] or gang scheduling [35]. An alternative approach to improve the system utilization is backfilling.

A. Backfilling

Backfilling was developed for the IBM SP1 parallel supercomputer as part of the Extensible Argonne Scheduling sYstem (EASY) [13] and has been implemented in several production schedulers [36], [37]. Backfilling works by identifying "holes" in the 2D schedule and moving forward smaller jobs that fit those holes. With backfilling, users are required to provide an estimate of the length of the jobs submitted for execution. This information is used by the scheduler to predict when the next queued job will be able to run. Thus, a scheduler can determine whether a job is sufficiently small to run without delaying any previously reserved jobs. It is desirable that a scheduler with backfilling support two conflicting goals. On the one hand, it is important to move forward as many short jobs as possible, in order to improve utilization and responsiveness. On the other hand, it is also important to avoid starvation of large jobs and, in particular, to be able to predict when each job will run. There are two common variants of backfilling, conservative and aggressive (EASY), that attempt to balance these goals in different ways.

1) Conservative Backfilling: With conservative backfilling, every job is given a reservation (start time guarantee) when it enters the system. A smaller job is allowed to backfill only if it does not delay any previously queued job. Thus, when a new job arrives, the following allocation procedure is executed by a conservative backfilling scheduler. Based on the current knowledge of the system state, the scheduler finds the earliest time at which a sufficient number of processors are available to run the job for a duration equal to the user-estimated run time. This is called the "anchor point." The scheduler then updates the system state to reflect the allocation of processors to this job starting from its anchor point. If the job's anchor point is the current time, the job is started immediately.

Fig. 1. Conservative backfilling.

An example is given in Fig. 1. The first job in the queue does not have enough processors to run. Hence, a reservation is made for it at the anticipated termination time of the


longer-running job. Similarly, the second queued job is given a reservation at the anticipated termination time of the first queued job. Although enough processors are available for the third queued job to start immediately, it would delay the second job; therefore, the third job is given a reservation after the second queued job’s anticipated termination time. Thus, in conservative backfilling, jobs are assigned a start time when they are submitted, based on the current usage profile. But they may actually be able to run sooner if previous jobs terminate earlier than expected. In this scenario, the original schedule is compressed by releasing the existing reservations one by one, when a running job terminates, in the order of increasing reservation start time guarantees and attempting backfill for the released job. If as a result of early termination of some job, “holes” of the right size are created for a job, then it gets an earlier reservation. In the worst case, each released job is reinserted in the same position it held previously. With this scheme, there is no danger of starvation, since a reservation is made for each job when it is submitted. 2) Aggressive Backfilling: Conservative backfilling moves jobs forward only if they do not delay any previously queued job. Aggressive backfilling takes a more aggressive approach and allows jobs to skip ahead provided they do not delay the job at the head of the queue. The objective is to improve the current utilization as much as possible, subject to some consideration for the queue order. The price is that execution guarantees cannot be made, because it is impossible to predict how much each job will be delayed in the queue. An aggressive backfilling scheduler scans the queue of waiting jobs and allocates processors as requested. The scheduler gives a reservation guarantee to the first job in the queue that does not have enough processors to start. This reservation is given at the earliest time at which the required processors are expected to become free, based on the current system state.
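Both backfilling variants rely on the same primitive: finding the earliest time at which a job fits into the free-processor profile for its whole estimated duration. The following Python fragment is a minimal sketch of that anchor-point search; it is our own illustration (the names Job, find_anchor_point, and the profile representation are invented for this example and are not taken from the paper or from any production scheduler).

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Job:
        procs: int          # processors requested
        est_runtime: float  # user-estimated run time (seconds)

    def find_anchor_point(job: Job, profile: List[Tuple[float, int]]) -> float:
        """Return the earliest start time at which `job` fits.

        `profile` is a list of (start_time, free_procs) segments sorted by time,
        with existing reservations already reflected in it; the last segment is
        assumed to extend indefinitely.
        """
        for i, (t_start, _) in enumerate(profile):
            end_needed = t_start + job.est_runtime
            fits = True
            for t_seg, free in profile[i:]:
                if t_seg >= end_needed:
                    break
                if free < job.procs:
                    fits = False
                    break
            if fits:
                return t_start
        return profile[-1][0]  # fall back to the last segment of the profile

    # Example: 4 of 10 processors are busy until t = 100, all free afterwards.
    profile = [(0.0, 6), (100.0, 10)]
    print(find_anchor_point(Job(procs=8, est_runtime=50.0), profile))  # 100.0
    print(find_anchor_point(Job(procs=4, est_runtime=50.0), profile))  # 0.0 (backfills into the hole)

In conservative backfilling this search is run for every arriving job; in aggressive backfilling a similar computation yields the reservation for the first blocked job in the queue.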


Fig. 2. Aggressive backfilling.

The scheduler then attempts to backfill the other queued jobs. To be eligible for backfilling, a job must require no more than the currently available processors and must satisfy either of two conditions that guarantee it will not delay the first job in the queue:
- it must terminate by the time the first queued job is scheduled to commence, or
- it must use no more nodes than are free at the time the first queued job is scheduled to start.
Figure 2 shows an example.

B. Metrics

Two common metrics used to evaluate the performance of scheduling schemes are the average turnaround time and the average bounded slowdown. We use these metrics in our studies. The bounded slowdown [38] of a job is defined as

$\text{Bounded slowdown} = \dfrac{\text{wait time} + \max(\text{run time}, 10)}{\max(\text{run time}, 10)} \qquad (1)$

The threshold of 10 seconds is used to limit the influence of very short jobs on the metric. Preemptive scheduling aims at providing lower delay to short jobs relative to long jobs. Since long jobs have greater tolerance to delays than short jobs, our suspension criterion is based on the expansion factor (xfactor), which increases rapidly for short jobs and gradually for long jobs:

$\text{xfactor} = \dfrac{\text{wait time} + \text{estimated run time}}{\text{estimated run time}} \qquad (2)$
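Both metrics translate directly into code. The short Python sketch below (our own illustration, with times in seconds) follows equations (1) and (2) and shows why the xfactor grows quickly for short jobs and slowly for long ones.

    def bounded_slowdown(wait_time: float, run_time: float, threshold: float = 10.0) -> float:
        """Bounded slowdown, Eq. (1): very short run times are clamped by `threshold`."""
        return (wait_time + max(run_time, threshold)) / max(run_time, threshold)

    def xfactor(wait_time: float, estimated_run_time: float) -> float:
        """Expansion factor, Eq. (2)."""
        return (wait_time + estimated_run_time) / estimated_run_time

    # A 5-minute job and a 10-hour job, each after a one-hour wait:
    print(xfactor(3600, 300))     # 13.0 -- the short job is hurt badly by the wait
    print(xfactor(3600, 36000))   # 1.1  -- the long job barely notices the same wait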


C. Related Work

Although preemptive scheduling is universally used at the operating system level to multiplex processes on single-processor systems and shared-memory multiprocessors, it is rarely used in parallel job scheduling. A large number of studies have addressed the problem of parallel job scheduling (see [38] for a survey of work on this topic), but most of them address nonpreemptive scheduling strategies. Further, most of the work on preemptive scheduling of parallel jobs considers the jobs to be malleable [3], [25], [39], [40]; in other words, the number of processors used to execute the job is permitted to vary dynamically over time. In practice, parallel jobs submitted to supercomputer centers are generally rigid; that is, the number of processors used to execute a job is fixed. Under this scenario, the various schemes proposed for a malleable job model are inapplicable. Few studies have addressed preemptive scheduling under a model of rigid jobs, where the preemption is "local," that is, a suspended job must be restarted on exactly the same set of processors on which it was suspended.

Chiang and Vernon [23] evaluate a preemptive scheduling strategy called "immediate service (IS)" for shared-memory systems. With this strategy, each arriving job is given an immediate timeslice of 10 minutes, by suspending one or more running jobs if needed. The selection of jobs for suspension is based on their instantaneous xfactor, defined as (wait time + total accumulated run time) / (total accumulated run time). Jobs with the lowest instantaneous xfactor are suspended. The IS strategy significantly decreases the average job slowdown for the traces simulated. A potential shortcoming of the IS strategy, however, is that its preemption decisions do not reflect the expected run time of a job. The IS strategy can be expected to significantly improve the slowdown of aborted jobs in the trace; hence, it is unclear how much, if any, of the improvement in slowdown is experienced by the jobs that completed normally, and no information is provided on how different job categories are affected. Chiang et al. [22] examine the run-to-completion policy together with a suspension policy that allows a job to be suspended at most once. Both this approach and the IS strategy limit the number of suspensions, whereas we use a "suspension factor" to control the rate of suspensions, without limiting the number of times a job can be suspended.

Parsons and Sevcik [25] discuss the design and implementation of a number of multiprocessor preemptive scheduling disciplines. They study the effect of preemption under the models of rigid, migratable, and malleable jobs. They conclude that their proposed preemption scheme may increase the response time for the model of rigid jobs. So far, few simulation-based studies have been done on preemption strategies for clusters. With no process migration, the


distributed-memory systems impose an additional constraint that a suspended job should get the same set of processors when it restarts. In this paper, we propose tunable suspension strategies for parallel job scheduling in environments where process migration is not feasible.

III. WORKLOAD CHARACTERIZATION

We perform simulation studies using a locally developed simulator with workload logs from different supercomputer centers. Most supercomputer centers keep a trace file as a record of the scheduling events that occur in the system. This file contains information about each job submitted and its actual execution. Typically, the following data is recorded for each job:
- Name of the job, user name, and so forth
- Job submission time
- Job resources requested, such as memory and processors
- User-estimated run time
- Time when the job started execution
- Time when the job finished execution
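A simulator might represent each such record with a small structure like the one below. This is only an illustrative sketch; the class and field names are ours and do not correspond to the fields of any particular archive format.

    from dataclasses import dataclass

    @dataclass
    class TraceRecord:
        job_name: str          # job identifier (user name etc. would be similar fields)
        submit_time: float     # job submission time, seconds from the start of the trace
        procs_requested: int   # processors requested
        est_runtime: float     # user-estimated run time (seconds)
        start_time: float      # time when the job started execution
        end_time: float        # time when the job finished execution

        @property
        def actual_runtime(self) -> float:
            return self.end_time - self.start_time

        @property
        def wait_time(self) -> float:
            return self.start_time - self.submit_time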

From the collection of workload logs available from Feitelson's archive [41], subsets of the CTC, SDSC, and KTH workload traces were used to evaluate the various schemes. The CTC trace was logged from a 430-node IBM SP2 system at the Cornell Theory Center, the SDSC trace from a 128-node IBM SP2 system at the San Diego Supercomputer Center, and the KTH trace from a 100-node IBM SP2 system at the Swedish Royal Institute of Technology. The other traces in the archive did not contain user estimates of run time. We observed similar performance trends with all three traces; to minimize the number of graphs, we report performance results for the CTC and SDSC traces alone. This selection is purely arbitrary. Although user estimates are known to be quite inaccurate in practice, we first studied the effect of preemptive scheduling under the idealized assumption of accurate estimation, before studying the effect of inaccuracies in user estimates of job run time. Similarly, we first studied the impact of preemption under the assumption that the overhead for job suspension and restart is negligible, and then studied the influence of this overhead.

Any analysis that is based only on the average slowdown or turnaround time of all jobs in the system cannot provide insights into the variability across different job categories. Therefore, in our discussion, we classify the jobs into various categories based on the run time and the number of processors requested, and we analyze the slowdown and turnaround time for each category. To analyze the performance of jobs of different sizes and lengths, we classified jobs into 16 categories, using four partitions for run time (Very Short (VS), Short (S), Long (L), and Very Long (VL)) and four partitions for the number of processors requested (Sequential (Seq), Narrow (N), Wide (W), and Very Wide (VW)). The criteria used for job classification are shown in Table I.

TABLE I
JOB CATEGORIZATION CRITERIA

Run time         1 Proc    2-8 Procs   9-32 Procs   >32 Procs
0 - 10 min       VS Seq    VS N        VS W         VS VW
10 min - 1 hr    S Seq     S N         S W          S VW
1 hr - 8 hr      L Seq     L N         L W          L VW
> 8 hr           VL Seq    VL N        VL W         VL VW

TABLE II
JOB DISTRIBUTION BY CATEGORY - CTC TRACE

Run time         1 Proc    2-8 Procs   9-32 Procs   >32 Procs
0 - 10 min       14%       8%          13%          9%
10 min - 1 hr    18%       4%          6%           2%
1 hr - 8 hr      6%        3%          9%           2%
> 8 hr           2%        2%          1%           1%

TABLE III
JOB DISTRIBUTION BY CATEGORY - SDSC TRACE

Run time         1 Proc    2-8 Procs   9-32 Procs   >32 Procs
0 - 10 min       8%        29%         9%           4%
10 min - 1 hr    2%        8%          5%           3%
1 hr - 8 hr      8%        5%          6%           1%
> 8 hr           3%        5%          3%           1%

TABLE IV
AVERAGE SLOWDOWN FOR VARIOUS CATEGORIES WITH NONPREEMPTIVE SCHEDULING - CTC TRACE

Run time         1 Proc    2-8 Procs   9-32 Procs   >32 Procs
0 - 10 min       2.6       4.76        13.01        34.07
10 min - 1 hr    1.26      1.76        3.04         7.14
1 hr - 8 hr      1.13      1.43        1.88         1.63
> 8 hr           1.03      1.05        1.09         1.15

TABLE V
AVERAGE SLOWDOWN FOR VARIOUS CATEGORIES WITH NONPREEMPTIVE SCHEDULING - SDSC TRACE

Run time         1 Proc    2-8 Procs   9-32 Procs   >32 Procs
0 - 10 min       2.53      14.41       37.78        113.31
10 min - 1 hr    1.15      2.43        4.83         15.56
1 hr - 8 hr      1.19      1.24        1.96         2.79
> 8 hr           1.03      1.09        1.18         1.43
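The 16-way classification of Table I is easy to express in code. The sketch below is our own helper (the function name is invented for this example); it uses the boundaries from Table I, with run time in seconds and the processor-count breakpoints at 1, 8, and 32.

    def job_category(runtime_s: float, procs: int) -> str:
        """Map a job to one of the 16 categories of Table I, e.g. 'VS VW' or 'VL Seq'."""
        if runtime_s <= 600:            # 0 - 10 min
            length = "VS"
        elif runtime_s <= 3600:         # 10 min - 1 hr
            length = "S"
        elif runtime_s <= 8 * 3600:     # 1 hr - 8 hr
            length = "L"
        else:                           # > 8 hr
            length = "VL"

        if procs == 1:
            width = "Seq"
        elif procs <= 8:
            width = "N"
        elif procs <= 32:
            width = "W"
        else:
            width = "VW"

        return f"{length} {width}"

    print(job_category(300, 64))    # 'VS VW'
    print(job_category(86400, 1))   # 'VL Seq'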


The distribution of jobs in the CTC and SDSC traces, corresponding to the sixteen categories, is given in Tables II and III, respectively. Tables IV and V show the average slowdowns for the different job categories under a nonpreemptive aggressive backfilling strategy. The overall slowdown was 3.58 for the CTC trace and 14.13 for the SDSC trace. Even though the overall slowdowns are low, one can observe from the tables that some of the Very Short categories have slowdowns as high as 34 (CTC trace) and 113 (SDSC trace). Preemptive strategies aim at reducing the high average slowdowns for the short categories without significant degradation to long jobs.

IV. SELECTIVE SUSPENSION

We first propose a preemptive scheduling scheme called Selective Suspension (SS), in which an idle job may preempt a running job if its "suspension priority" is sufficiently higher than that of the running job. An idle job attempts to suspend a collection of running jobs so as to obtain enough free processors. In order to control the rate of suspensions, a suspension factor (SF) is used: it specifies the minimum ratio of the suspension priority of a candidate idle job to the suspension priority of a running job for preemption to occur. The suspension priority used is the xfactor of the job.

A. Theoretical Analysis

Fig. 3. Two simultaneously submitted tasks T1 and T2, each requiring N processors for L seconds.

Let T1 and T2 be two tasks submitted to the scheduler at the same time. Let both tasks be of the same length L and require the entire system for execution, with the system being free when the two tasks are submitted (Fig. 3). Let s be the suspension factor. Before starting, both tasks have a suspension priority of 1. The suspension priority of a task remains constant while the task executes and increases while the task waits. One of the two tasks, say T1, will start instantly. The other task, say T2, will wait until its suspension priority becomes s times the priority of T1 before it can preempt T1. Now T1 will have to wait until its suspension priority becomes s times that of T2 before it can preempt T2. Thus, execution of the two tasks will alternate, controlled by the suspension factor. Figures 4, 5, and 6 show the execution pattern of the tasks T1 and T2 for various values of SF.

The optimal value of SF, to restrict the number of repeated suspensions by two similar tasks arriving at the same time, can be obtained as follows. Let $P_w$ represent the suspension priority of the waiting job and $P_r$ the suspension priority of the running job. The condition for the first suspension is $P_w = s \, P_r = s$. The preemption swaps the running job and the waiting job; thus, after the preemption, $P_w = 1$ and $P_r = s$. The condition for the second suspension is therefore $P_w = s \, P_r = s^2$, and, similarly, the condition for the $n$th suspension is $P_w = s^n$.

When the running job completes, the wait time of the waiting job equals the run time L of the running job, so the priority of the waiting job is at most $(L + L)/L = 2$. The lowest value of s for which at most n suspensions occur is therefore given by $s^{n+1} = 2$, that is, $s = 2^{1/(n+1)}$. Thus, if the number of suspensions is to be 0, then $s = 2$; for at most one suspension, $s = \sqrt{2}$. With s = 1, the number of suspensions is very large, bounded only by the granularity of the preemption routine. With all jobs having equal length, any suspension factor greater than 2 will not result in suspension and behaves the same as a suspension factor of 2. With jobs of varying length, however, the number of suspensions decreases as the suspension factor increases. Thus, to avoid thrashing and to reduce the number of suspensions, we use suspension factors between 1.5 and 5 in evaluating our schemes.

Fig. 4. Execution pattern of the tasks T1 and T2 when SF = 1. Here, t represents the minimum time interval between two suspensions.

Fig. 5. Execution pattern of the tasks T1 and T2 when 1 < SF < sqrt(2).

Fig. 6. Execution pattern of the tasks T1 and T2 when SF = sqrt(2).
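The alternation argument above can be checked with a short simulation. The sketch below is our own construction (not code from the paper): it takes the xfactor of a task of length L as (wait + L)/L and counts how many suspensions two identical tasks inflict on each other for a given suspension factor s, confirming the closed form s = 2^(1/(n+1)).

    def count_suspensions(s: float, L: float = 1.0) -> int:
        """Two identical tasks of length L, each needing the whole machine."""
        run_wait, run_rem = 0.0, L      # task that is currently executing
        idle_wait, idle_rem = 0.0, L    # task that is currently waiting
        suspensions = 0
        while run_rem > 0 and idle_rem > 0:
            # Extra waiting time after which the waiting task's priority
            # (idle_wait + t + L)/L reaches s times the running task's priority.
            t_preempt = s * (run_wait + L) - L - idle_wait
            if t_preempt < run_rem - 1e-12:   # preemption happens before completion
                run_rem -= t_preempt
                idle_wait += t_preempt
                suspensions += 1
                (run_wait, run_rem), (idle_wait, idle_rem) = \
                    (idle_wait, idle_rem), (run_wait, run_rem)
            else:                             # running task finishes first
                idle_wait += run_rem
                run_rem = 0.0
        return suspensions

    for n in range(4):
        s = 2 ** (1.0 / (n + 1))   # lowest SF that allows at most n suspensions
        print(f"s = {s:.3f} -> {count_suspensions(s)} suspension(s)")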

B. Preventing Starvation without Reservation Guarantees

With priority-based suspension, an idle job can preempt a running job only if its priority is at least SF times greater than the priority of the running job. All the idle jobs that are able to find the required number of processors by suspending lower


priority running jobs are selected for execution by preempting the corresponding jobs. All backfilling scheduling schemes use job reservations for one or more jobs at the head of the idle queue as a means of guaranteeing finite progress and thereby avoiding starvation. But start time guarantees do not have much significance in a preemptive context. Even if we give start time guarantees for the jobs in the idle queue, they are not guaranteed to run to completion. Since the SS strategy uses the expected slowdown (xfactor) as the suspension priority, there is an automatic guarantee of freedom from starvation: ultimately any job’s xfactor will get large enough that it will be able to preempt some running job(s) and begin execution. Thus, one can use backfilling without the usual reservation guarantees. We therefore remove guarantees for all our preemption schemes. Jobs in some categories inherently have a higher probability

of waiting longer in the queue than jobs with a comparable xfactor from other job categories. For example, consider a VW job needing 300 processors and a Sequential job that are in the queue at the same time. If both jobs have the same xfactor, the probability that the Sequential job finds a running job to suspend is higher than the probability that the VW job finds enough lower-priority running jobs to suspend. Therefore, the average slowdown of the VW category will tend to be higher than that of the Sequential category. To redress this inequity, we impose a restriction that the number of processors requested by a suspending job must be at least half of the number of processors requested by the job that it suspends, thereby preventing wide jobs from being suspended by narrow jobs. The scheduler periodically (every minute) invokes the preemption routine.
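The core of the suspension test can be sketched as follows. This is a simplified illustration in Python (our own code and names); it works with processor counts only, ignoring processor identities and the re-entry of previously suspended jobs, which are handled by the full algorithm of Section IV-C.

    from typing import List, Optional

    class RunningJob:
        def __init__(self, name: str, xfactor: float, procs: int):
            self.name, self.xfactor, self.procs = name, xfactor, procs

    def select_victims(idle_xfactor: float, idle_procs: int, free_procs: int,
                       running: List[RunningJob], sf: float) -> Optional[List[RunningJob]]:
        """Return the running jobs to suspend so the idle job can start, or None."""
        # A running job is a candidate only if the idle job's priority is at least
        # SF times higher and the running job is not more than twice as wide as the
        # idle job (so narrow jobs cannot knock out wide ones).
        candidates = [r for r in running
                      if idle_xfactor >= sf * r.xfactor and r.procs <= 2 * idle_procs]
        # Suspend the widest candidates first, so the fewest jobs are disturbed.
        candidates.sort(key=lambda r: r.procs, reverse=True)
        victims, procs = [], free_procs
        for r in candidates:
            if procs >= idle_procs:
                break
            victims.append(r)
            procs += r.procs
        return victims if procs >= idle_procs else None

    running = [RunningJob("A", 1.0, 64), RunningJob("B", 1.2, 16), RunningJob("C", 3.0, 32)]
    victims = select_victims(idle_xfactor=4.0, idle_procs=48, free_procs=8,
                             running=running, sf=2.0)
    print([v.name for v in victims] if victims else None)   # ['A']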


Fig. 7. Average slowdown: SS scheme, CTC trace. Compared to NS, SS provides significant benefit for the VS, S, W, and VW categories; slight improvement for most of the L categories; but a slight deterioration for the VL categories. Compared to IS, SS performs better for all the categories except for the VS categories.

Fig. 8. Average turnaround time: SS scheme, CTC trace. The trends are similar to those with the average slowdown metric (Fig. 7).

Fig. 9. Average slowdown: SS scheme, SDSC trace. Compared to NS, SS provides significant benefit for the VS, S, W, and VW categories; slight improvement for most of the L categories; but a slight deterioration for the VL categories. Compared to IS, SS performs better for all the categories except for the VS categories.

Fig. 10. Average turnaround time: SS scheme, SDSC trace. The trends are similar to those with the average slowdown metric (Fig. 9).

C. Algorithm

Let $SP_{T_i}$ be the suspension priority of a task $T_i$ that requests $n_{T_i}$ processors, and let $A_{T_i}$ represent the set of processors allocated to $T_i$. Let $F$ represent the set of free processors and $n_F$ the number of free processors at the time the preemption is attempted. The set of tasks that can be preempted by an idle task $T_i$ is given by

$S = \{\, T_j \mid T_j \text{ is running},\ SP_{T_i} \ge SF \cdot SP_{T_j},\ n_{T_j} \le 2\, n_{T_i} \,\}$

Task $T_i$ can be scheduled by preempting one or more tasks in $S$ if and only if

$n_F + \sum_{T_j \in S} n_{T_j} \ \ge\ n_{T_i}$

Let $T_{j_1}, T_{j_2}, \ldots, T_{j_x}$ be the elements of $S$, and let $(k_1, k_2, \ldots, k_x)$ be a permutation of $(1, 2, \ldots, x)$ such that $n_{T_{j_{k_1}}} \ge n_{T_{j_{k_2}}} \ge \ldots \ge n_{T_{j_{k_x}}}$, with ties broken first by the start time of the jobs and then by their queue time. The set of tasks actually preempted by $T_i$ is the smallest prefix $\{ T_{j_{k_1}}, \ldots, T_{j_{k_m}} \}$ of this ordering such that

$n_F + \sum_{l=1}^{m} n_{T_{j_{k_l}}} \ \ge\ n_{T_i}$

If $T_i$ is a previously suspended task attempting reentry, then it has to get the same set of processors that it was using before it was suspended. Here we remove the restriction that the number of processors requested by a suspending job should be at least half of the number of processors requested by the job that it suspends; otherwise, if a VW job happens to suspend a narrow job, then in the worst case the narrow job has to wait until the VW job completes before it can be rescheduled. So the set of tasks that can be preempted by $T_i$ in this case is given by

$S = \{\, T_j \mid T_j \text{ is running},\ SP_{T_i} \ge SF \cdot SP_{T_j},\ A_{T_j} \cap A_{T_i} \neq \emptyset \,\}$

and $T_i$ can be scheduled by preempting one or more tasks in $S$ if and only if

$A_{T_i} \subseteq F \cup \bigcup_{T_j \in S} A_{T_j}$

where $A_{T_i}$ here denotes the set of processors that $T_i$ held before it was suspended.

Pseudocode for the selective suspension scheme:

    sort the list of running jobs in ascending order of suspension priority
    sort the list of idle jobs in descending order of suspension priority
    for each idle job do
        set the candidate_job_set to be the null set
        if (idle job is a previously suspended job) then
            goto already_suspended
        else
            available_processors = number of free processors
            for each running job do
                if (number of processors requested by the idle job > available_processors) then
                    if ((suspension priority of the idle job >= SF * suspension priority of the running job)
                        && (number of processors used by the running job <=
                            2 * number of processors requested by the idle job)) then
                        available_processors = available_processors +
                                               number of processors used by the running job
                        candidate_job_set = {candidate_job_set} U {running job}
                    else
                        goto next_idle_job
                    end if
                else
                    goto suspend_jobs_1
                end if
            end for
            goto next_idle_job
        end if
    already_suspended:
        available_processor_set = {set of free processors}
        for each running job do
            if (suspension priority of the idle job >= SF * suspension priority of the running job) then
                if ({set of processors used by the running job} intersect
                    {set of processors requested by the idle job} is not empty) then
                    available_processor_set = {available_processor_set} U
                                              {set of processors used by the running job}
                    candidate_job_set = {candidate_job_set} U {running job}
                end if
            else
                goto next_idle_job
            end if
            if ({set of processors requested by the idle job} is contained in {available_processor_set}) then
                goto suspend_jobs_2
            end if
        end for
        goto next_idle_job
    suspend_jobs_1:
        sort job(s) in candidate_job_set in descending order of number of processors used
        available_processors = number of free processors
        for each job in candidate_job_set do
            if (number of processors requested by the idle job > available_processors) then
                suspend the job
                available_processors = available_processors +
                                       number of processors used by the suspended job
            else
                schedule the idle job
                goto next_idle_job
            end if
        end for
        goto next_idle_job
    suspend_jobs_2:
        suspend all jobs in the candidate_job_set
        schedule the idle job
    next_idle_job:
        do nothing
    end for

Fig. 11. Worst-case slowdown: SS scheme, CTC trace. SS is much better than NS for most of the categories and is slightly worse for some of the VL categories. Compared to IS, SS is much better for all the categories except for the VS categories.

Fig. 12. Worst-case turnaround time: SS scheme, CTC trace. The trends are similar to those with the worst-case slowdown metric (Fig. 11).

Fig. 13. Worst-case slowdown for the TSS scheme: CTC trace. TSS improves the worst-case slowdowns for many categories without affecting the worst-case slowdowns for other categories.

Fig. 14. Worst-case turnaround times for the TSS scheme: CTC trace. TSS improves the worst-case turnaround times for many categories without affecting the worst-case turnaround times for other categories.

D. Results

We compare the SS scheme, run under various suspension factors, with the No-Suspension (NS) scheme with aggressive backfilling and with the IS scheme. From Figs. 7–10, we can see that the SS scheme provides significant improvement for the



Very-Short (VS) and Short (S) length categories and Wide (W) and Very-Wide (VW) width categories. For example, for the VS-VW category, slowdown is reduced from 113 for the NS scheme to 7 for SS with SF = 2 for the SDSC trace (reduced from 34 for the NS scheme to under 3 for SS with SF = 2 for the CTC trace). For the VS and S length categories, a lower SF results in lowered slowdown and turnaround time. This is because a lower SF increases the probability that a job in these categories will suspend a job in the Long (L) or Very-Long (VL) category. The same is also true for the L length category, but the effect of change in SF is less pronounced. For the VL length category, there is an opposite trend with decreasing SF: the slowdown and turnaround times worsen. This is due to the


increasing probability that a Long job will be suspended by a job in a shorter category as SF decreases. In comparison to the base No-Suspension (NS) scheme, the SS scheme provides significant benefits for the VS and S categories and a slight improvement for most of the Long categories but is slightly worse for the VL categories. The performance of the IS scheme is very good for the VS categories. It is better than the SS scheme for the VS categories and worse for the other categories. Although the overall slowdown for IS is considerably less than for the No-Suspension scheme, it is not better than SS. Moreover, with IS the VW and VL categories get significantly worse.

Fig. 15. Worst-case slowdown: SS scheme, SDSC trace. SS is much better than NS for most of the categories and is slightly worse for some of the VL categories. Compared to IS, SS is much better for all the categories except for the VS categories.

Fig. 16. Worst-case turnaround time: SS scheme, SDSC trace. The trends are similar to those with the worst-case slowdown metric (Fig. 15).

Fig. 17. Worst-case slowdown for the TSS scheme: SDSC trace. TSS improves the worst-case slowdowns for many categories without affecting the worst-case slowdowns for other categories.

Fig. 18. Worst-case turnaround times for the TSS scheme: SDSC trace. TSS improves the worst-case turnaround times for many categories without affecting the worst-case turnaround times for other categories.

Fig. 19. Average slowdown: Inaccurate estimates of run time; CTC trace. Compared to NS, SS improves the slowdowns for most of the categories with little deterioration to other categories. The performance of IS is bad for the long jobs.

Fig. 20. Average slowdown of well estimated jobs: CTC trace. Compared to NS, SS significantly improves the slowdowns for most of the categories with little deterioration to other categories. The performance of SS is better than or comparable to IS for VS categories.

Fig. 21. Average slowdown of badly estimated jobs: CTC trace. Compared to NS, SS provides a slight improvement in slowdowns for many categories. SS tends to penalize the badly estimated jobs in VS categories. IS gives better performance for VS, S, and VL categories.

E. Tunable Selective Suspension (TSS)

From the graphs of the previous section, one can observe that the SS scheme significantly improves the average slowdown and turnaround time of various job categories. From a practical point of view, however, the worst-case slowdowns and turnaround times are also very important. A scheme that improves the average slowdowns and turnaround times for most of the categories but makes the worst-case slowdown and turnaround time for the long categories worse is not acceptable. For example, a delay of 1 hour for a 10-minute job (slowdown = 7) is tolerable, whereas a slowdown of 7 for a 24-hour job is unacceptable. Figure 11 compares the worst-case slowdowns for SF = 2 with the worst-case slowdowns of the NS scheme and the IS scheme for the CTC trace. One can observe that the worst-case slowdowns with the SS scheme are much better than with the NS scheme for most of the cases. But the worst-case slowdowns for some of the long categories are worse than for the NS scheme. Although the worst-case slowdown with SS is generally less than that with NS, the absolute worst-case slowdowns are much higher than the average slowdowns for some of the short categories. For the IS scheme, the worst-case slowdowns for the very short categories are lower, but they are very high for the long jobs, an unacceptable situation. Figure 12 compares the worst-case turnaround times for the SS scheme with the worst-case turnaround times for the NS scheme and the IS scheme, for the CTC trace. Even though the trends observed here are similar to those with the worst-case slowdowns, the categories where SS is best with respect to worst-case turnaround time are not the same as the categories for which SS is best with respect to worst-case slowdowns. This is because the job with the worst-case turnaround time need not be the one with the worst-case slowdown. Similar trends can be observed for the SDSC trace from Figs. 15 and 16.

We next propose a tunable scheme to improve the worst-case slowdown and turnaround time without significant deterioration of the average slowdown and turnaround time. This scheme controls the variance in the slowdowns and turnaround times by associating a limit with each job: preemption of a job is disabled when its priority exceeds this limit. This limit is set to 1.5 times the average slowdown of the category that the job belongs to. The candidate set of tasks that can be preempted by a task $T_i$ is given by

$S = \{\, T_j \mid T_j \text{ is running},\ SP_{T_i} \ge SF \cdot SP_{T_j},\ SP_{T_j} \le 1.5 \cdot ASD(\text{category}(T_j)),\ n_{T_j} \le 2\, n_{T_i} \,\}$

If $T_i$ is a previously suspended task attempting reentry, then

$S = \{\, T_j \mid T_j \text{ is running},\ SP_{T_i} \ge SF \cdot SP_{T_j},\ SP_{T_j} \le 1.5 \cdot ASD(\text{category}(T_j)),\ A_{T_j} \cap A_{T_i} \neq \emptyset \,\}$

where $ASD(\text{category}(T_j))$ represents the average slowdown for the job category to which $T_j$ belongs. All the other conditions remain the same as mentioned in Section IV-C.

Figures 13 and 14 show the result of this tunable scheme for the CTC trace. It improves the worst-case slowdowns for some long categories (VL W, VL VW, L N) and some short categories (VS Seq, VS N, S Seq) without affecting the worst-case slowdowns of the other categories. It improves the worst-case turnaround times for most of the categories without affecting the worst-case turnaround times of the other categories. Figures 17 and 18 show similar trends for the SDSC trace. This scheme can also be applied to selectively tune the slowdowns or turnaround times for particular categories. The TSS scheme is used for all the subsequent experiments, and the term "Selective Suspension" or "SS" in the following sections refers to "Tunable Selective Suspension."

Fig. 22. Average turnaround time: Inaccurate estimates of run time; CTC trace. Compared to NS, SS improves the turnaround times for most of the categories with little deterioration to other categories. The performance of IS is bad for the long jobs.

Fig. 23. Average turnaround time of well estimated jobs: CTC trace. Compared to NS, SS significantly improves the turnaround times for most of the categories with little deterioration to other categories. The performance of SS is comparable to IS for VS categories.

Fig. 24. Average turnaround time of badly estimated jobs: CTC trace. Compared to NS, SS provides a slight improvement in turnaround times for many categories. SS tends to penalize the badly estimated jobs in VS categories. IS gives better performance for VS, S, and VL categories.

V. IMPACT OF USER ESTIMATE INACCURACIES

We have so far assumed that the user estimates of job run time are perfect. Now, we consider the effect of user estimate inaccuracies on the proposed schemes. This analysis is needed for modeling an actual system workload. In this context, we believe that a problem has been ignored by previous studies when analyzing the effect of overestimation on scheduling strategies. Abnormally aborted jobs tend to excessively skew the average slowdown of jobs in a workload. Consider a job requesting a wall-clock limit of 24 hours, which is queued for 1 hour and then aborts within one minute because of some fatal exception. The slowdown of this job would be computed to be 60, whereas the average slowdown of normally completing long jobs is typically under 2. If even 5% of the jobs have a high slowdown of 60, while 95% of the normally completing jobs have a slowdown of 2, the average slowdown over all jobs would be around 5. Now consider a scheme such as the


speculative backfilling strategy evaluated in [29]. With this scheme, a job is given a free timeslot to execute in, even if that slot is considerably smaller than the requested wall-clock limit. Aborting jobs will quickly terminate, and since they did not have to be queued till an adequately long window was available, their slowdown would decrease dramatically with the speculative backfilling scheme. As a result, the average slowdown of the entire trace would now be close to 2, assuming that the slowdown of the normally completing jobs does not change significantly. A comparison of the average slowdowns would seem to indicate that the speculative backfilling scheme results in a significant improvement in job slowdown from 5 to 2. However, under the above scenario, the change is due only to the small fraction of aborted jobs, and not due to any benefits to the normal jobs. In order to avoid this problem, we group the jobs into two different estimation categories:


- Jobs that are well estimated (the estimated run time is not more than twice the actual run time of the job), and
- Jobs that are poorly estimated (the estimated run time is more than twice the actual run time).

Within each group, the jobs are further classified into 16 categories based on their actual run time and the number of processors requested. One can observe from Figs. 19 and 25 that the Selective Suspension scheme improves the slowdowns for most of the categories without adversely affecting the other categories. The slowdowns for the short and wide categories are quite high compared to the other categories, mainly because of the overestimation. Since the suspension priority used by the SS scheme is the xfactor, it favors short jobs. But if a short job was badly estimated, it would be treated as a long job and its

priority would increase only gradually. So it will not be able to suspend running jobs easily and will end up with a high slowdown. This situation does not happen with IS, because of the 10-minute time quantum given to each arriving job irrespective of its estimated run time; therefore, the slowdowns for the very short category (jobs whose length is less than or equal to 10 minutes) are better with IS than with the other schemes. For the other categories, however, SS performs much better than IS. Figures 22 and 28 compare the average turnaround times of the SS scheme with those of the NS and IS schemes for the CTC and SDSC traces, respectively. The improvement in performance for the short and wide categories is much less when compared to the improvement achieved with the accurate


A. Modeling of Job Suspension Overhead

We have so far assumed no overhead for the preemption of jobs. In this section, we present simulation results that incorporate overheads for job suspension and restart. Since the job traces do not contain information about job memory requirements, we modeled the memory requirement of each job as a random value uniformly distributed between 100 MB and 1 GB. The overhead for a suspension is calculated as the time taken to write the main memory used by the job to disk. The memory transfer rate is based on the following scenario: with a commodity local disk on every node and each node being a quad-processor node, the transfer rate per processor was assumed to be 2 MB/s (corresponding to a disk bandwidth of 8 MB/s per node).

Figures 31 and 32 compare, respectively, the slowdowns and turnaround times of the proposed tunable scheme with those of NS and IS in the presence of overhead for job suspension/restart for the CTC trace. Figures 33 and 34 present the corresponding comparison for the SDSC trace. One can observe that the overhead does not significantly affect the performance of the SS scheme.
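The sketch below illustrates this overhead model; the assumption that a job's memory is spread evenly across its processors, with each processor writing its share to a node-local disk at 2 MB/s, is ours and is not spelled out in the text.

```python
import random

MB = 1 << 20
PER_PROC_RATE = 2 * MB        # assumed disk write rate per processor (2 MB/s)

def suspension_overhead(num_procs, rng=random):
    """Seconds needed to write a job's memory image to local disk.

    The job's memory requirement is drawn uniformly between 100 MB and 1 GB
    (the traces carry no memory information). We assume the memory is spread
    evenly over the job's processors and that each processor writes its share
    at 2 MB/s to its node-local disk.
    """
    mem_bytes = rng.uniform(100 * MB, 1024 * MB)
    return (mem_bytes / num_procs) / PER_PROC_RATE

# Example: a 16-processor job holding ~512 MB takes roughly
# (512 MB / 16) / 2 MB/s = 16 seconds to suspend (and again to restart).
print(round(suspension_overhead(16), 1))
```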


Fig. 31. Average slowdown with modeling of overhead for suspension/restart: CTC trace. The impact of overhead on the performance of the SS scheme is minimal.

Fig. 32. Average turnaround time with modeling of overhead for suspension/restart: CTC trace. The impact of overhead on the performance of the SS scheme is minimal.

Fig. 33. Average slowdown with modeling of overhead for suspension/restart: SDSC trace. The impact of overhead on the performance of the SS scheme is minimal.

Fig. 34. Average turnaround time with modeling of overhead for suspension/restart: SDSC trace. The impact of overhead on the performance of the SS scheme is minimal.

VI. LOAD VARIATION

We have so far seen the performance of the Selective Suspension scheme under normal load. In this section, we present the performance of the SS scheme under different load conditions, starting from the normal load (the original trace) and increasing the load until the system reaches saturation. The different loads correspond to modifications of the traces in which the arrival times of the jobs are divided by suitable constants, keeping the run times the same as in the original trace. For example, the job trace for a load factor of 1.1 is obtained by dividing the arrival times of the jobs in the original trace by 1.1. For simplicity, we have reduced the number of job categories from sixteen to four for the load variation studies: two categories based on run time, Short (S) and Long (L), and two categories based on the number of processors requested, Narrow (N) and Wide (W). The criteria used for job classification are shown in Table VI. The distribution of jobs in the CTC and SDSC traces over the four categories is given in Table VII and Table VIII, respectively.
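As an illustration (with hypothetical field names, and treating the 1-hour and 8-processor boundaries of Table VI as inclusive, which is our assumption), the modified traces and the four categories could be generated as follows:

```python
def scale_load(jobs, load_factor):
    """Return a copy of the trace with arrival times divided by load_factor;
    run times (and estimates) are left unchanged."""
    return [dict(job, arrival=job["arrival"] / load_factor) for job in jobs]

def categorize(job):
    """Classify a job as SN, SW, LN, or LW using the Table VI criteria.
    Run time <= 1 hour is Short, <= 8 processors is Narrow (whether the
    boundaries are inclusive is an assumption)."""
    size = "S" if job["runtime"] <= 3600 else "L"
    width = "N" if job["procs"] <= 8 else "W"
    return size + width

# Example: a load factor of 1.1 compresses inter-arrival gaps by about 10%.
trace = [{"arrival": 0, "runtime": 600, "procs": 4},
         {"arrival": 1100, "runtime": 7200, "procs": 64}]
heavier = scale_load(trace, 1.1)
print([categorize(j) for j in trace], heavier[1]["arrival"])  # ['SN', 'LW'] 1000.0
```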

Figures 35 and 38 show the overall system utilization achieved by the different schemes under different load conditions for the CTC and SDSC traces, respectively. One can observe that the SS scheme achieves better utilization than the NS scheme at higher loads, whereas the overall system utilization is very low under the IS scheme. Also, there is no significant increase in the overall system utilization (for either the SS or the NS scheme) when the load factor is increased beyond 1.6 for CTC and 1.3 for SDSC, indicating that the system reaches saturation at those load factors. We therefore report the performance of the SS scheme for load factors between 1.0 (normal) and 1.6 for the CTC trace and between 1.0 and 1.3 for the SDSC trace. Figures 36 and 37 and Figures 39 and 40 compare the performance of the SS scheme with that of the NS and IS schemes for the different job categories under different load conditions for the CTC and SDSC traces, respectively.
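For reference, the overall system utilization plotted in Figs. 35 and 38 can be computed from a completed schedule as the fraction of available processor-seconds actually used; the sketch below shows one common way to do this, which may differ in detail from the accounting used for the figures.

```python
def system_utilization(jobs, total_procs):
    """Fraction of processor-seconds used over the span of the schedule.

    Each job contributes runtime * procs processor-seconds; the available
    capacity is total_procs times the elapsed time from the first start to
    the last finish. This is one common definition, stated here as an
    assumption rather than the paper's exact accounting.
    """
    used = sum(j["runtime"] * j["procs"] for j in jobs)
    start = min(j["start"] for j in jobs)
    end = max(j["start"] + j["runtime"] for j in jobs)
    return used / (total_procs * (end - start))

# Two 10-hour, 64-processor jobs run back to back on a 128-processor system:
jobs = [{"start": 0, "runtime": 36000, "procs": 64},
        {"start": 36000, "runtime": 36000, "procs": 64}]
print(system_utilization(jobs, 128))  # 0.5
```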


Fig. 35. Overall system utilization under different load conditions: CTC trace. The overall system utilization with the SS scheme is better than or comparable to that of the NS scheme. The performance of IS is much worse.

Fig. 36. Average slowdown: varying load; CTC trace. The improvements achieved by the SS scheme are more pronounced under high load.

Fig. 37. Average turnaround time: varying load; CTC trace. The improvements achieved by the SS scheme are more pronounced under high load.

Fig. 38. Overall system utilization under different load conditions: SDSC trace. The overall system utilization with the SS scheme is better than or comparable to that of the NS scheme. The performance of IS is much worse.

Fig. 39. Average slowdown: varying load; SDSC trace. The improvements achieved by the SS scheme are more pronounced under high load.

Fig. 40. Average turnaround time: varying load; SDSC trace. The improvements achieved by the SS scheme are more pronounced under high load.


TABLE VI
CATEGORIZATION CRITERIA FOR LOAD VARIATION STUDIES

                     Run time <= 1 Hr   Run time > 1 Hr
  <= 8 Processors           SN                 LN
  >  8 Processors           SW                 LW

TABLE VII
JOB DISTRIBUTION BY CATEGORY FOR LOAD VARIATION STUDIES - CTC TRACE

                     Run time <= 1 Hr   Run time > 1 Hr
  <= 8 Processors          44%                30%
  >  8 Processors          13%                13%

TABLE VIII
JOB DISTRIBUTION BY CATEGORY FOR LOAD VARIATION STUDIES - SDSC TRACE

                     Run time <= 1 Hr   Run time > 1 Hr
  <= 8 Processors          47%                21%
  >  8 Processors          22%                10%

It can be observed that the improvements obtained by the SS scheme are more pronounced under high load. The trends for the different categories under higher loads are similar to those observed under normal load: SS provides significant benefit to the short jobs without affecting the performance of the long jobs. The IS scheme is better than the SS scheme only for the SN jobs in terms of average turnaround time, whereas it is better than SS for both the SN and SW jobs in terms of average slowdown. This implies that the IS scheme improves the performance of only the relatively shorter jobs in the SW category, at the cost of the relatively longer jobs in that category. Moreover, the performance of the IS scheme is much worse for the long jobs, a very undesirable situation.

Fig. 41. Average slowdown versus system utilization: CTC trace. SS provides better performance even when the system is heavily utilized.

Fig. 42. Average turnaround time versus system utilization: CTC trace. SS provides better performance even when the system is heavily utilized.

Fig. 43. Average slowdown versus system utilization: SDSC trace. SS provides better performance even when the system is heavily utilized.

Fig. 44. Average turnaround time versus system utilization: SDSC trace. SS provides better performance even when the system is heavily utilized.


Figures 41 and 42 compare, respectively, the average slowdowns and the average turnaround times of the jobs in the CTC trace against the overall system utilization for the various schemes. Figures 43 and 44 present the corresponding comparison for the SDSC trace. The SS scheme is clearly much better than both the IS and NS schemes. Even when the system is highly utilized, the SS scheme is able to provide much better response times for all categories of jobs. The IS scheme is not able to achieve high system utilization.

VII. CONCLUSIONS

In this paper, we have explored the issue of preemptive scheduling of parallel jobs, using job traces from different supercomputer centers. We have proposed a tunable selective-suspension scheme and demonstrated that it provides significant improvement in the average and worst-case slowdowns of most job categories. It was also shown to provide better slowdowns for most job categories than a previously proposed Immediate Service scheme. We modeled the overhead of job suspension and restart, showing that even under stringent assumptions about the available bandwidth to disk, the proposed scheme provides significant benefits over nonpreemptive scheduling and the Immediate Service strategy. We also evaluated the proposed schemes in the presence of inaccurate estimates of job run times and showed that they continue to perform well. Further, we showed that the Selective Suspension strategy provides greater benefits under high system loads compared to the other schemes.

ACKNOWLEDGMENTS

We thank the anonymous referees for their helpful suggestions on improving the presentation of the paper. This work was supported in part by Sandia National Laboratories, the University of Chicago under NSF Grant #SCI0414407, and the U.S. Department of Energy under Contract W-31-109-ENG-38.

REFERENCES

[1] B. DasGupta and M. A. Palis, “Online real-time preemptive scheduling of jobs with deadlines,” in APPROX, 2000, pp. 96–107.
[2] X. Deng and P. Dymond, “On multiprocessor system scheduling,” in Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM Press, 1996, pp. 82–88.
[3] X. Deng, N. Gu, T. Brecht, and K. Lu, “Preemptive scheduling of parallel jobs on multiprocessors,” in SODA: ACM-SIAM Symposium on Discrete Algorithms, 1996.


[4] L. Epstein, “Optimal preemptive scheduling on uniform processors with non-decreasing speed ratios,” Lecture Notes in Computer Science, vol. 2010, pp. 230–248, 2001.
[5] U. Schwiegelshohn and R. Yahyapour, “Fairness in parallel job scheduling,” Journal of Scheduling, vol. 3, no. 5, pp. 297–320, 2000.
[6] S. V. Anastasiadis and K. C. Sevcik, “Parallel application scheduling on networks of workstations,” Journal of Parallel and Distributed Computing, vol. 43, no. 2, pp. 109–124, 1997.
[7] W. Cirne and F. Berman, “Adaptive selection of partition size for supercomputer requests,” in Workshop on Job Scheduling Strategies for Parallel Processing, 2000, pp. 187–208.
[8] D. G. Feitelson, “Analyzing the root causes of performance evaluation results,” Leibniz Center, Hebrew University, Tech. Rep., 2002.
[9] J. P. Jones and B. Nitzberg, “Scheduling for parallel supercomputing: A historical perspective of achievable utilization,” in Workshop on Job Scheduling Strategies for Parallel Processing, 1999, pp. 1–16.
[10] W. A. W. Jr., C. L. Mahood, and J. E. West, “Scheduling jobs on parallel systems using a relaxed backfill strategy,” in Workshop on Job Scheduling Strategies for Parallel Processing, 2002.
[11] B. G. Lawson and E. Smirni, “Multiple-queue backfilling scheduling with priorities and reservations for parallel systems,” in Workshop on Job Scheduling Strategies for Parallel Processing, 2002.
[12] B. G. Lawson, E. Smirni, and D. Puiu, “Self-adapting backfilling scheduling for parallel systems,” in Proceedings of the International Conference on Parallel Processing, 2002.
[13] D. Lifka, “The ANL/IBM SP scheduling system,” in Workshop on Job Scheduling Strategies for Parallel Processing, 1995, pp. 295–303.
[14] A. W. Mu’alem and D. G. Feitelson, “Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 6, pp. 529–543, 2001.
[15] G. Sabin, R. Kettimuthu, A. Rajan, and P. Sadayappan, “Scheduling of parallel jobs in a heterogeneous multi-site environment,” in Workshop on Job Scheduling Strategies for Parallel Processing, 2003.
[16] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, “Selective reservation strategies for backfill job scheduling,” in Workshop on Job Scheduling Strategies for Parallel Processing, 2002.
[17] S. Srinivasan, V. Subramani, R. Kettimuthu, P. Holenarsipur, and P. Sadayappan, “Effective selection of partition sizes for moldable scheduling of parallel jobs,” in Proceedings of the 9th International Conference on High Performance Computing, 2002.
[18] V. Subramani, R. Kettimuthu, S. Srinivasan, J. Johnston, and P. Sadayappan, “Selective buddy allocation for scheduling parallel jobs on clusters,” in Proceedings of the IEEE International Conference on Cluster Computing, 2002.
[19] V. Subramani, R. Kettimuthu, S. Srinivasan, and P. Sadayappan, “Distributed job scheduling on computational grids using multiple simultaneous requests,” in Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, 2002, pp. 359–366.
[20] D. Talby and D. G. Feitelson, “Supporting priorities and improving utilization of the IBM SP scheduler using slack-based backfilling,” in Proceedings of the 13th International Parallel Processing Symposium, 1999, pp. 513–517.
[21] D. Zotkin and P. Keleher, “Job-length estimation and performance in backfilling schedulers,” in Proceedings of the 8th High Performance Distributed Computing Conference, 1999, pp. 236–243.
[22] S. H. Chiang, R. K. Mansharamani, and M. K. Vernon, “Use of application characteristics and limited preemption for run-to-completion parallel processor scheduling policies,” in ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1994, pp. 33–44.


[23] S. H. Chiang and M. K. Vernon, “Production job scheduling for parallel shared memory systems,” in Proceedings of the International Parallel and Distributed Processing Symposium, 2002.
[24] L. T. Leutenneger and M. K. Vernon, “The performance of multiprogrammed multiprocessor scheduling policies,” in ACM SIGMETRICS Conference on Measurement and Modelling of Computer Systems, May 1990, pp. 226–236.
[25] E. W. Parsons and K. C. Sevcik, “Implementing multiprocessor scheduling disciplines,” in Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph, Eds. Springer-Verlag, 1997, pp. 166–192, Lecture Notes in Computer Science, vol. 1291.
[26] K. Aida, “Effect of job size characteristics on job scheduling performance,” in Workshop on Job Scheduling Strategies for Parallel Processing, 2000, pp. 1–17.
[27] O. Arndt, B. Freisleben, T. Kielmann, and F. Thilo, “A comparative study of online scheduling algorithms for networks of workstations,” Cluster Computing, vol. 3, no. 2, pp. 95–112, 2000.
[28] W. Cirne, “When the herd is smart: The emergent behavior of SA,” IEEE Transactions on Parallel and Distributed Systems, 2002.
[29] D. Perkovic and P. J. Keleher, “Randomization, speculation, and adaptation in batch schedulers,” in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 2000, p. 7.
[30] P. J. Keleher, D. Zotkin, and D. Perkovic, “Attacking the bottlenecks of backfilling schedulers,” Cluster Computing, vol. 3, no. 4, pp. 245–254, 2000.
[31] J. Krallmann, U. Schwiegelshohn, and R. Yahyapour, “On the design and evaluation of job scheduling algorithms,” in Workshop on Job Scheduling Strategies for Parallel Processing, 1999, pp. 17–42.
[32] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, “Characterization of backfilling strategies for parallel job scheduling,” in Proceedings of the ICPP-2002 Workshops, 2002, pp. 514–519.
[33] A. Streit, “On job scheduling for HPC-clusters and the dynP scheduler,” in Proceedings of the 8th International Conference on High Performance Computing. Springer-Verlag, 2001, pp. 58–67.
[34] C. McCann, R. Vaswani, and J. Zahorjan, “A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors,” ACM Transactions on Computer Systems, vol. 11, no. 2, pp. 146–178, 1993.
[35] D. G. Feitelson and M. A. Jette, “Improved utilization and responsiveness with gang scheduling,” in Workshop on Job Scheduling Strategies for Parallel Processing. Springer-Verlag, 1997, pp. 238–261.
[36] D. Jackson, Q. Snell, and M. J. Clement, “Core algorithms of the Maui scheduler,” in Workshop on Job Scheduling Strategies for Parallel Processing, 2001, pp. 87–102.
[37] J. Skovira, W. Chan, H. Zhou, and D. Lifka, “The EASY LoadLeveler API project,” in Workshop on Job Scheduling Strategies for Parallel Processing, 1996, pp. 41–47.
[38] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, “Theory and practice in parallel job scheduling,” in Workshop on Job Scheduling Strategies for Parallel Processing. Springer-Verlag, 1997, pp. 1–34.
[39] K. C. Sevcik, “Application scheduling and processor allocation in multiprogrammed parallel processing systems,” Performance Evaluation, vol. 19, no. 2-3, pp. 107–140, 1994.
[40] J. Zahorjan and C. McCann, “Processor scheduling in shared memory multiprocessors,” in ACM SIGMETRICS Conference on Measurement and Modelling of Computer Systems, May 1990, pp. 214–225.
[41] D. G. Feitelson, “Logs of real parallel workloads from production systems,” http://www.cs.huji.ac.il/labs/parallel/workload/logs.html.

Rajkumar Kettimuthu is a researcher in Argonne National Laboratory’s Mathematics and Computer Science Division. His research interests include data transport in high-bandwidth, high-delay networks and scheduling and resource management for cluster computing and the Grid. He has a bachelor of engineering degree in computer science and engineering from Anna University, Madras, India, and a master of science in computer and information science from the Ohio State University.

Vijay Subramani received his B.E. in computer science and engineering from Anna University, India, in 2000 and his M.S. in computer and information science from the Ohio State University in 2002. His research at Ohio State included scheduling and resource management for parallel and distributed systems. He currently works at Microsoft Corporation in Redmond, WA. His past work experience includes an internship at Los Alamos National Laboratory, where he worked on buffered coscheduling.

Srividya Srinivasan works as a software engineer at Microsoft Corporation in Redmond, WA. Earlier, she worked as a software developer at Bloomberg L.P. in New York. She received a B.E. degree in computer science and engineering from Anna University, Chennai, India, in 2000 and an M.S. in computer and information science from the Ohio State University in 2002. Her research at Ohio State focused on parallel and distributed systems, with an emphasis on parallel job scheduling.

Thiagaraja Gopalsamy is a senior software engineer with Altera Corporation, San Jose. He received his bachelor’s degree in computer science and engineering in 1999 from Anna University, India and his master’s degree in computer and information science in 2001 from the Ohio State University. His past research interests include mobile ad hoc networks and parallel computing. He is currently working on field programmable gate arrays and reconfigurable computing.


D. K. Panda is a professor of computer science at the Ohio State University. His research interests include parallel computer architecture, high performance networking, and network-based computing. He has published over 150 papers in these areas. His research group is currently collaborating with national laboratories and leading companies on designing various communication and I/O subsystems of next-generation HPC systems and datacenters with modern interconnects. The MVAPICH (MPI over VAPI for InfiniBand) package developed by his research group (http://nowlab.cis.ohio-state.edu/projects/mpi-iba/) is being used by more than 160 organizations worldwide to extract the potential of InfiniBand-based clusters for HPC applications. Dr. Panda is a recipient of the NSF CAREER Award, OSU Lumley Research Award (1997 and 2001), and an Ameritech Faculty Fellow Award. He is a senior member of IEEE Computer Society and a member of ACM.

P. Sadayappan received the B. Tech. degree from the Indian Institute of Technology, Madras, India, and M.S. and Ph.D. degrees from the State University of New York at Stony Brook, all in electrical engineering. He is currently a professor in the Department of Computer Science and Engineering at the Ohio State University. His research interests include scheduling and resource management for parallel/distributed systems and compiler/runtime support for high-performance computing.

The submitted manuscript has been created in part by the University of Chicago as Operator of Argonne National Laboratory (“Argonne”) under Contract No. W-31-109-ENG-38 with the U.S. Department of Energy. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
