INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING AND NETWORKING
Selective Preemption Strategies for Parallel Job Scheduling

Rajkumar Kettimuthu, Vijay Subramani, Srividya Srinivasan, Thiagaraja Gopalsamy, D. K. Panda, and P. Sadayappan
Abstract— Although theoretical results have been established regarding the utility of preemptive scheduling in reducing average job turnaround time, job suspension/restart is rarely used in practice at supercomputer centers for parallel job scheduling, and a number of questions remain unanswered regarding its practical utility. We explore this issue through a simulation-based study, using real job logs from supercomputer centers. We develop a tunable selective-suspension strategy and demonstrate its effectiveness. We also present new insights into the effect of preemptive scheduling on different job classes and address the impact of suspensions on worst-case response time. Further, we analyze the performance of the proposed schemes under different load conditions.

Index Terms— Preemptive scheduling, Parallel job scheduling, Backfilling.
I. INTRODUCTION

Although theoretical results have been established regarding the effectiveness of preemptive scheduling strategies in reducing average job turnaround time [1]–[5], preemptive scheduling is not currently used for scheduling parallel jobs at supercomputer centers. Compared to the large number of studies that have investigated nonpreemptive scheduling of parallel jobs [6]–[21], little research has been reported on the evaluation of preemptive scheduling strategies using real job logs [22]–[25].

R. Kettimuthu is with Argonne National Laboratory. V. Subramani and S. Srinivasan are with Microsoft Corporation. T. Gopalsamy is with Altera Corporation. D. K. Panda and P. Sadayappan are with the Ohio State University.

The basic idea behind preemptive scheduling is simple: If a long-running job is temporarily suspended and a waiting short job is allowed to run to completion first, the wait time of the short job is significantly decreased, without much fractional increase in the turnaround time of the long job. Consider a long job with run time Tl. After time t, let a short job arrive with run time Ts. If the short job were to run after completion of the long job, the average turnaround time would be (Tl + (Tl - t) + Ts)/2, or Tl + (Ts - t)/2. Instead, if the long job were suspended when the short job arrived, the turnaround times of the short and long jobs would be Ts and Tl + Ts, respectively, giving an average of Tl/2 + Ts. The average turnaround time with suspension is less if Ts < Tl - t,
that is, if the remaining run time of the running job is greater than the run time of the waiting job. The suspension criterion has to be chosen carefully to ensure freedom from starvation. Also, the suspension scheme should bring down the average turnaround times without increasing the worst-case turnaround times. Even though theoretical results [1]–[5] have established that preemption improves the average turnaround time, it is important to evaluate preemptive scheduling schemes using realistic job mixes derived from actual job logs from supercomputer centers, to understand the effect of suspension on various categories of jobs. The primary contributions of this work are as follows:
- development of a selective-suspension strategy for preemptive scheduling of parallel jobs,
- characterization of the significant variability in the average job turnaround time for different job categories, and
- demonstration of the impact of suspension on the worst-case turnaround times of various categories, and development of a tunable scheme to improve worst-case turnaround times.

This paper is organized as follows. Section II provides background on parallel job scheduling and discusses prior work on preemptive job scheduling. Section III characterizes the workload used for the simulations. Section IV presents the proposed selective preemption strategies and evaluates their performance under the assumption of accurate estimation of job run times. Section V studies the impact of inaccuracies in user estimates of run time on the selective preemption strategies; it also models the overhead for job suspension and restart and evaluates the proposed schemes in the presence of that overhead. Section VI describes the performance of the selective preemption strategies under different load conditions. Section VII summarizes the results of this work.

II. BACKGROUND AND RELATED WORK
Scheduling of parallel jobs is usually viewed in terms of a 2D chart with time along one axis and the number of processors along the other axis. Each job can be thought of as a rectangle whose width is the user-estimated run time
and height is the number of processors requested. Parallel job scheduling strategies have been widely studied in the past [26]–[33]. The simplest way to schedule jobs is to use the first-come-first-served (FCFS) policy. This approach suffers from low system utilization, however, because of fragmentation of the available processors. Consider a scenario where a few jobs are running in the system and many processors are idle, but the next queued job requires all the processors in the system. An FCFS scheduler would leave the free processors idle even if there were waiting queued jobs requiring only a few processors. Some solutions to this problem are to use dynamic partitioning [34] or gang scheduling [35]. An alternative approach to improving system utilization is backfilling.

A. Backfilling

Backfilling was developed for the IBM SP1 parallel supercomputer as part of the Extensible Argonne Scheduling sYstem (EASY) [13] and has been implemented in several production schedulers [36], [37]. Backfilling works by identifying "holes" in the 2D schedule and moving forward smaller jobs that fit those holes. With backfilling, users are required to provide an estimate of the length of the jobs submitted for execution. This information is used by the scheduler to predict when the next queued job will be able to run. Thus, a scheduler can determine whether a job is sufficiently small to run without delaying any previously reserved jobs. It is desirable that a scheduler with backfilling balance two conflicting goals. On the one hand, it is important to move forward as many short jobs as possible, in order to improve utilization and responsiveness. On the other hand, it is also important to avoid starvation of large jobs and, in particular, to be able to predict when each job will run. There are two common variants of backfilling, conservative and aggressive (EASY), that attempt to balance these goals in different ways.
1) Conservative Backfilling: With conservative backfilling, every job is given a reservation (start time guarantee) when it enters the system. A smaller job is allowed to backfill only if it does not delay any previously queued job. Thus, when a new job arrives, the following allocation procedure is executed by a conservative backfilling scheduler. Based on the current knowledge of the system state, the scheduler finds the earliest time at which a sufficient number of processors are available to run the job for a duration equal to the user-estimated run time. This is called the "anchor point." The scheduler then updates the system state to reflect the allocation of processors to this job starting from its anchor point. If the job's anchor point is the current time, the job is started immediately. An example is given in Fig. 1.

Fig. 1. Conservative backfilling.

In the example, the first job in the queue does not have enough free processors to run. Hence, a reservation is made for it at the anticipated termination time of the longer-running job. Similarly, the second queued job is given a reservation at the anticipated termination time of the first queued job. Although enough processors are available for the third queued job to start immediately, it would delay the second job; therefore, the third job is given a reservation after the second queued job's anticipated termination time.

Thus, in conservative backfilling, jobs are assigned a start time when they are submitted, based on the current usage profile. But they may actually be able to run sooner if previous jobs terminate earlier than expected. In this scenario, the original schedule is compressed: when a running job terminates, the existing reservations are released one by one, in increasing order of reservation start time, and backfill is attempted for each released job. If, as a result of the early termination of some job, "holes" of the right size are created for a job, then it gets an earlier reservation. In the worst case, each released job is reinserted in the same position it held previously. With this scheme, there is no danger of starvation, since a reservation is made for each job when it is submitted.

2) Aggressive Backfilling: Conservative backfilling moves jobs forward only if they do not delay any previously queued job. Aggressive backfilling takes a more aggressive approach and allows jobs to skip ahead provided they do not delay the job at the head of the queue. The objective is to improve the current utilization as much as possible, subject to some consideration for the queue order. The price is that execution guarantees cannot be made, because it is impossible to predict how much each job will be delayed in the queue. An aggressive backfilling scheduler scans the queue of waiting jobs and allocates processors as requested. The scheduler gives a reservation guarantee to the first job in the queue that does not have enough processors to start.
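The anchor-point search can be sketched as follows. This is a simplified illustrative model, not the paper's simulator: existing reservations are represented as (start, end, procs) triples, and the function name `find_anchor` is ours. The earliest feasible start is either the current time or the end of some reservation, and feasibility within a candidate window need only be checked at the points where the free-processor count changes:

```python
def find_anchor(reservations, total_procs, req_procs, duration, now=0):
    """Earliest time at which req_procs are free for `duration` seconds,
    given existing reservations as (start, end, procs) triples."""
    # Free processors only increase when a reservation ends, so the anchor
    # is either `now` or some reservation's end time.
    candidates = sorted({now} | {end for _, end, _ in reservations if end > now})
    for t in candidates:
        # Within [t, t + duration), the free count only changes where a
        # reservation starts; check t and each such start point.
        points = {t} | {s for s, _, _ in reservations if t < s < t + duration}
        if all(total_procs
               - sum(p for s, e, p in reservations if s <= q < e)
               >= req_procs
               for q in points):
            return t
    return None  # unreachable when req_procs <= total_procs
```

A job whose anchor point equals `now` starts immediately; a job that exactly fits a "hole" in the profile is precisely a backfill.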
This reservation is given at the earliest time at which the required processors are expected to become free, based on the current system state.
Fig. 2. Aggressive backfilling.
The scheduler then attempts to backfill the other queued jobs. To be eligible for backfilling, a job must require no more than the currently available processors and must satisfy either of two conditions that guarantee it will not delay the first job in the queue:
- it must terminate by the time the first queued job is scheduled to commence, or
- it must use no more nodes than are free at the time the first queued job is scheduled to start.
Figure 2 shows an example.

B. Metrics

Two common metrics used to evaluate the performance of scheduling schemes are the average turnaround time and the average bounded slowdown. We use these metrics for our studies. The bounded slowdown [38] of a job is defined as follows:
    bounded slowdown = (wait time + max(run time, 10)) / max(run time, 10)    (1)

The threshold of 10 seconds is used to limit the influence of very short jobs on the metric. Preemptive scheduling aims at providing lower delay to short jobs relative to long jobs. Since long jobs have greater tolerance to delays than short jobs, our suspension criterion is based on the expansion factor (xfactor), which increases rapidly for short jobs and gradually for long jobs:

    xfactor = (wait time + estimated run time) / estimated run time    (2)
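Both metrics follow directly from a job's wait time and (estimated) run time; a minimal sketch, with function names of our choosing:

```python
def bounded_slowdown(wait, run, tau=10.0):
    """Eq. (1): slowdown with the run time clamped below by a 10 s threshold."""
    denom = max(run, tau)
    return (wait + denom) / denom

def xfactor(wait, est_run):
    """Eq. (2): expansion factor, used later as the suspension priority."""
    return (wait + est_run) / est_run

# A 5-minute wait quadruples the xfactor of a 100 s job but barely moves
# a 10-hour job, so short jobs gain priority quickly.
print(xfactor(300, 100))    # 4.0
print(xfactor(300, 36000))  # ~1.0083
```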
C. Related Work

Although preemptive scheduling is universally used at the operating system level to multiplex processes on single-processor systems and shared-memory multiprocessors, it is rarely used in parallel job scheduling. A large number of studies have addressed the problem of parallel job scheduling (see [38] for a survey of work on this topic), but most of them address nonpreemptive scheduling strategies. Further, most of the work on preemptive scheduling of parallel jobs considers the jobs to be malleable [3], [25], [39], [40]; in other words, the number of processors used to execute the job is permitted to vary dynamically over time. In practice, parallel jobs submitted to supercomputer centers are generally rigid; that is, the number of processors used to execute a job is fixed. Under this scenario, the various schemes proposed for a malleable job model are inapplicable. Few studies have addressed preemptive scheduling under a model of rigid jobs, where the preemption is "local"; that is, a suspended job must be restarted on exactly the same set of processors on which it was suspended.

Chiang and Vernon [23] evaluate a preemptive scheduling strategy called "immediate service (IS)" for shared-memory systems. With this strategy, each arriving job is given an immediate timeslice of 10 minutes, by suspending one or more running jobs if needed. The selection of jobs for suspension is based on their instantaneous-xfactor, defined as (wait time + total accumulated run time) / (total accumulated run time). Jobs with the lowest instantaneous-xfactor are suspended. The IS strategy significantly decreases the average job slowdown for the traces simulated. A potential shortcoming of the IS strategy, however, is that its preemption decisions do not reflect the expected run time of a job. The IS strategy can also be expected to significantly improve the slowdown of aborted jobs in the trace.
Hence, it is unclear how much, if any, of the improvement in slowdown is experienced by the jobs that completed normally. However, no information is provided on how different job categories are affected. Chiang et al. [22] examine the run-to-completion policy with a suspension policy that allows a job to be suspended at most once. Both this approach and the IS strategy limit the number of suspensions, whereas we use a “suspension factor” to control the rate of suspensions, without limiting the number of times a job can be suspended. Parsons and Sevcik [25] discuss the design and implementation of a number of multiprocessor preemptive scheduling disciplines. They study the effect of preemption under the models of rigid, migratable, and malleable jobs. They conclude that their proposed preemption scheme may increase the response time for the model of rigid jobs. So far, few simulation-based studies have been done on preemption strategies for clusters. With no process migration, the
distributed-memory systems impose an additional constraint: a suspended job should get the same set of processors when it restarts. In this paper, we propose tunable suspension strategies for parallel job scheduling in environments where process migration is not feasible.

III. WORKLOAD CHARACTERIZATION

We perform simulation studies using a locally developed simulator with workload logs from different supercomputer centers. Most supercomputer centers keep a trace file as a record of the scheduling events that occur in the system. This file contains information about each job submitted and its actual execution. Typically, the following data are recorded for each job:
- name of job, user name, and so forth
- job submission time
- job resources requested, such as memory and processors
- user-estimated run time
- time when job started execution
- time when job finished execution
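Such a trace record, together with the run-time/width categorization introduced below (Table I), can be represented as follows. The field and function names here are illustrative, not the simulator's actual interface:

```python
from dataclasses import dataclass

@dataclass
class JobRecord:
    name: str          # job name / user name
    submit: float      # submission time (s)
    procs: int         # processors requested
    est_run: float     # user-estimated run time (s)
    start: float       # time execution started (s)
    finish: float      # time execution finished (s)

def category(procs, run_s):
    """16-way class: run-time partition (VS/S/L/VL) x width partition
    (Seq/N/W/VW), per the criteria of Table I."""
    if run_s <= 600:
        length = "VS"
    elif run_s <= 3600:
        length = "S"
    elif run_s <= 8 * 3600:
        length = "L"
    else:
        length = "VL"
    if procs == 1:
        width = "Seq"
    elif procs <= 8:
        width = "N"
    elif procs <= 32:
        width = "W"
    else:
        width = "VW"
    return length + " " + width
```

For example, a 2-hour job on 16 processors falls in the "L W" (Long, Wide) category.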
From the collection of workload logs available from Feitelson's archive [41], subsets of the CTC workload trace, the SDSC workload trace, and the KTH workload trace were used to evaluate the various schemes. The CTC trace was logged from a 430-node IBM SP2 system at the Cornell Theory Center, the SDSC trace from a 128-node IBM SP2 system at the San Diego Supercomputer Center, and the KTH trace from a 100-node IBM SP2 system at the Swedish Royal Institute of Technology. The other traces in the archive did not contain user estimates of run time. We observed similar performance trends with all three traces; to minimize the number of graphs, we report the performance results for the CTC and SDSC traces alone. This selection is purely arbitrary. Although user estimates are known to be quite inaccurate in practice, we first studied the effect of preemptive scheduling under the idealized assumption of accurate estimation, before studying the effect of inaccuracies in user estimates of job run time. Similarly, we first studied the impact of preemption under the assumption that the overhead for job suspension and restart is negligible and then studied the influence of the overhead.

TABLE I
JOB CATEGORIZATION CRITERIA

              0-10 min   10 min-1 hr   1 hr-8 hr   >8 hr
1 Proc        VS Seq     S Seq         L Seq       VL Seq
2-8 Procs     VS N       S N           L N         VL N
9-32 Procs    VS W       S W           L W         VL W
>32 Procs     VS VW      S VW          L VW        VL VW

TABLE II
JOB DISTRIBUTION BY CATEGORY - CTC TRACE

              0-10 min   10 min-1 hr   1 hr-8 hr   >8 hr
1 Proc        14%        18%           6%          2%
2-8 Procs     8%         4%            3%          2%
9-32 Procs    13%        6%            9%          1%
>32 Procs     9%         2%            2%          1%

TABLE III
JOB DISTRIBUTION BY CATEGORY - SDSC TRACE

              0-10 min   10 min-1 hr   1 hr-8 hr   >8 hr
1 Proc        8%         2%            8%          3%
2-8 Procs     29%        8%            5%          5%
9-32 Procs    9%         5%            6%          3%
>32 Procs     4%         3%            1%          1%

TABLE IV
AVERAGE SLOWDOWN FOR VARIOUS CATEGORIES WITH NONPREEMPTIVE SCHEDULING - CTC TRACE

              0-10 min   10 min-1 hr   1 hr-8 hr   >8 hr
1 Proc        2.6        1.26          1.13        1.03
2-8 Procs     4.76       1.76          1.43        1.05
9-32 Procs    13.01      3.04          1.88        1.09
>32 Procs     34.07      7.14          1.63        1.15

TABLE V
AVERAGE SLOWDOWN FOR VARIOUS CATEGORIES WITH NONPREEMPTIVE SCHEDULING - SDSC TRACE

              0-10 min   10 min-1 hr   1 hr-8 hr   >8 hr
1 Proc        2.53       1.15          1.19        1.03
2-8 Procs     14.41      2.43          1.24        1.09
9-32 Procs    37.78      4.83          1.96        1.18
>32 Procs     113.31     15.56         2.79        1.43

Any analysis that is based only on the average slowdown or turnaround time of all jobs in the system cannot provide insights into the variability within different job categories. Therefore, in our discussion, we classify the jobs into various categories based on the run time and the number of processors requested, and we analyze the slowdown and turnaround time for each category. To analyze the performance of jobs of different sizes and lengths, we classified jobs into 16 categories: considering four partitions for run time — Very Short (VS), Short (S), Long (L) and Very Long (VL) — and four partitions for the number of processors requested — Sequential (Seq), Narrow (N), Wide (W) and Very Wide (VW). The criteria used for job classification are shown in Table I. The distribution of jobs in
the trace, corresponding to the sixteen categories, is given in Tables II and III. Tables IV and V show the average slowdowns for the different job categories under a nonpreemptive aggressive backfilling strategy. The overall slowdown was 3.58 for the CTC trace and 14.13 for the SDSC trace. Even though the overall slowdowns are low, the tables show that some of the Very Short categories have slowdowns as high as 34 (CTC trace) and 113 (SDSC trace). Preemptive strategies aim at reducing the high average slowdowns for the short categories without significant degradation to long jobs.

IV. SELECTIVE SUSPENSION

We first propose a preemptive scheduling scheme called Selective Suspension (SS), in which an idle job may preempt a running job if its "suspension priority" is sufficiently higher than that of the running job. An idle job attempts to suspend a collection of running jobs so as to obtain enough free processors. In order to control the rate of suspensions, a suspension factor (SF) is used. This specifies the minimum ratio of the suspension priority of a candidate idle job to the suspension priority of a running job for preemption to occur. The suspension priority used is the xfactor of the job.

A. Theoretical Analysis
Fig. 3. Two simultaneously submitted tasks T1 and T2, each requiring 'N' processors for 'L' seconds.

Let T1 and T2 be two tasks submitted to the scheduler at the same time. Let both tasks be of the same length and require the entire system for execution, with the system being free when the two tasks are submitted (Fig. 3). Let s be the suspension factor. Before starting, both tasks have a suspension priority of 1. The suspension priority of a task remains constant when the task executes and increases when the task waits. One of the two tasks, say T1, will start instantly. The other task, T2, will wait until its suspension priority becomes s times the priority of T1 before it can preempt T1. Then T1 will have to wait until its suspension priority becomes s times that of T2 before it can preempt T2. Thus, execution of the two tasks will alternate, controlled by the suspension factor. Figures 4, 5, and 6 show the execution pattern of the tasks T1 and T2 for various values of SF.

The optimal value for SF, to restrict the number of repeated suspensions by two similar tasks arriving at the same time, can be obtained as follows. Let Pw represent the suspension priority of the waiting job and Pr represent the suspension priority of the running job. The condition for the first suspension is Pw = s * Pr = s. The preemption swaps the running job and the waiting job; thus, after the preemption, Pw = 1 and Pr = s. The condition for the second suspension is Pw = s * Pr = s^2. Similarly, the condition for the nth suspension is Pw = s^n. When the running job completes, Pw = 2, since by then the wait time of the waiting job equals the run time of the running job, giving Pw = (L + L)/L = 2. The lowest value of s for which at most n suspensions occur is therefore given by

    s = 2^(1/(n+1))

Thus, if the number of suspensions is to be 0, then s = 2; for at most one suspension, s = 2^(1/2). With s = 1, the number of suspensions is very large, bounded only by the granularity of the preemption routine. With all jobs having equal length, any suspension factor greater than 2 will not result in suspension and is equivalent to a suspension factor of 2. However, with jobs of varying length, the number of suspensions decreases with higher suspension factors. Thus, to avoid thrashing and to reduce the number of suspensions, we use suspension factors between 1.5 and 5 in evaluating our schemes.
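The alternation argument can be checked with a small event-driven simulation. This is an illustrative sketch under the stated assumptions (two identical tasks of length L needing the whole machine, xfactor as suspension priority, frozen while running); the function name is ours:

```python
def count_suspensions(s, L=100.0):
    """Number of preemptions before either task finishes (requires s > 1)."""
    rem = [L, L]        # remaining run time of each task
    wait = [0.0, 0.0]   # accumulated wait time of each task
    run, idle = 0, 1    # task 0 starts running
    suspensions = 0
    while True:
        pr_run = (wait[run] + L) / L           # priority frozen while running
        # extra wait needed before the idle task's xfactor reaches s * pr_run
        need = s * pr_run * L - L - wait[idle]
        if need >= rem[run]:
            return suspensions                 # the running task finishes first
        wait[idle] += need
        rem[run] -= need
        run, idle = idle, run                  # preemption: swap roles
        suspensions += 1
```

Consistent with the analysis, s = 2 yields no suspension, s = 2^(1/2) at most one, and smaller factors progressively more.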
Fig. 4. Execution pattern of the tasks T1 and T2 when SF = 1. Here, t represents the minimum time interval between two suspensions.

Fig. 5. Execution pattern of the tasks T1 and T2 when 1 < SF < 2.

Fig. 6. Execution pattern of the tasks T1 and T2 when SF = 2.

B. Preventing Starvation without Reservation Guarantees

With priority-based suspension, an idle job can preempt a running job only if its priority is at least SF times that of the running job. All the idle jobs that are able to find the required number of processors by suspending lower
priority running jobs are selected for execution by preempting the corresponding jobs. All backfilling scheduling schemes use job reservations for one or more jobs at the head of the idle queue as a means of guaranteeing finite progress and thereby avoiding starvation. But start time guarantees do not have much significance in a preemptive context. Even if we give start time guarantees for the jobs in the idle queue, they are not guaranteed to run to completion. Since the SS strategy uses the expected slowdown (xfactor) as the suspension priority, there is an automatic guarantee of freedom from starvation: ultimately any job’s xfactor will get large enough that it will be able to preempt some running job(s) and begin execution. Thus, one can use backfilling without the usual reservation guarantees. We therefore remove guarantees for all our preemption schemes. Jobs in some categories inherently have a higher probability
of waiting longer in the queue than do jobs with comparable xfactor from other job categories. For example, consider a VW job needing 300 processors and a Sequential job in the queue at the same time. If both jobs have the same xfactor, the probability that the Sequential job finds a running job to suspend is higher than the probability that the VW job finds enough lower-priority running jobs to suspend. Therefore, the average slowdown of the VW category will tend to be higher than that of the Sequential category. To redress this inequity, we impose a restriction that the number of processors requested by a suspending job be at least half the number of processors requested by the job that it suspends, thereby preventing wide jobs from being suspended by narrow jobs. The scheduler invokes the preemption routine periodically (every minute).
Fig. 7. Average slowdown: SS scheme, CTC trace. Compared to NS, SS provides significant benefit for the VS, S, W, and VW categories; slight improvement for most of the L categories; but a slight deterioration for the VL categories. Compared to IS, SS performs better for all the categories except the VS categories.

Fig. 8. Average turnaround time: SS scheme, CTC trace. The trends are similar to those with the average slowdown metric (Fig. 7).

Fig. 9. Average slowdown: SS scheme, SDSC trace. Compared to NS, SS provides significant benefit for the VS, S, W, and VW categories; slight improvement for most of the L categories; but a slight deterioration for the VL categories. Compared to IS, SS performs better for all the categories except the VS categories.

Fig. 10. Average turnaround time: SS scheme, SDSC trace. The trends are similar to those with the average slowdown metric (Fig. 9).

C. Algorithm

Let x_T be the suspension priority (xfactor) of a task T that requests n_T processors. Let P(T) represent the set of processors allocated to T. Let F_t represent the set of free processors and f_t the number of free processors at time t, when the preemption is attempted. The set of tasks that can be preempted by task T is given by

    S_T = { R : R is a running task, x_T >= SF * x_R, and n_R >= n_T / 2 }

Task T can be scheduled by preempting one or more tasks in S_T if and only if

    f_t + (sum of n_R over all R in S_T) >= n_T

Let R_1, R_2, ..., R_x be the elements of S_T, and let p be a permutation of (1, 2, ..., x) that orders them for suspension; ties between candidates are broken first by start time and then by queue time.
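The candidate-set test and a greedy choice of victims can be sketched as follows. This is an illustrative reimplementation, not the simulator's code: jobs are dicts with assumed keys 'xfactor' and 'procs', and a widest-first greedy order stands in for the permutation used in the text:

```python
def preemption_candidates(idle, running, sf):
    """Running jobs that `idle` may suspend under the SS rule:
    priority ratio at least SF, and the victim at least half as wide."""
    return [r for r in running
            if idle["xfactor"] >= sf * r["xfactor"]
            and r["procs"] >= idle["procs"] / 2]

def can_schedule(idle, running, free_procs, sf):
    """Free processors plus all candidates' processors must cover the request."""
    cands = preemption_candidates(idle, running, sf)
    return free_procs + sum(r["procs"] for r in cands) >= idle["procs"]

def jobs_to_suspend(idle, running, free_procs, sf):
    """Greedily pick the widest candidates first until enough processors free up."""
    need = idle["procs"] - free_procs
    chosen = []
    for r in sorted(preemption_candidates(idle, running, sf),
                    key=lambda r: -r["procs"]):
        if need <= 0:
            break
        chosen.append(r)
        need -= r["procs"]
    return chosen if need <= 0 else None
```

Suspending the widest eligible victims first frees the requested processors with as few suspensions as possible.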
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING AND NETWORKING
80
SF = 2
60
No Suspension IS
40 20 0 Narrow
Wide
Very Wide
SF = 2
40
No Suspension
30
IS
20 10 0 Seq
Narrow
96.6
93.49
IS
Seq
Narrow
SF = 2 No Suspension IS
Wide
Very Wide
No Suspension IS
Narrow
SF = 2
30
SF = 2 Tuned
25
No Suspension
20
IS
15 10 5 0 Seq
Narrow
Wide
Very Wide
31.72
35.59
93.49
SF = 2 SF = 2 Tuned No Suspension IS
Seq
Narrow
Wide
27.58
SF = 2 SF = 2 Tuned No Suspension IS
Very Wide
Seq
Narrow
Very Short
SF = 2
6000
No Suspension IS
4000 2000 0
Seq
Narrow
Wide
45000 40000 35000 30000 25000 20000 15000 10000 5000 0
SF = 2 No Suspension IS
10000
SF = 2
8000
SF = 2 Tuned
6000
No Suspension
4000
IS
2000 0
Seq
Very Wide
Seq
Narrow
Width
Wide
Very Wide
Narrow
Wide
107743 147603
45000 40000 35000 30000 25000 20000 15000 10000 5000 0
SF = 2 SF = 2 Tuned No Suspension IS
Seq
Very Wide
Narrow
1121651
50000
SF = 2
40000
No Suspension
30000
IS
20000
173584
140000
10000
223338
712283
1711958
120000 100000
SF = 2
80000
No Suspension
60000
IS
40000 20000
0
112806
Narrow
Wide
205273
438900
Very Wide
Very Long 1121651
140000
60000 50000
SF = 2
40000
SF = 2 Tuned
30000
No Suspension
20000
IS
10000
Seq
Narrow
Wide
223338
712283 1711958
100000
SF = 2
80000
SF = 2 Tuned
60000
No Suspension
40000
IS
20000 0
Seq
Very Wide
173584
120000
0
0
Seq
Very Wide
Narrow
Wide
Very Wide
Seq
Width
Narrow
Wide
Very Wide
Width
Width
Width
Fig. 12. Worst-case turnaround time: SS scheme, CTC trace. The trends are similar to those with the worst-case slowdown metric (Fig. 11).
" 201 +)
% )
Fig. 14. Worst-case turnaround times for the TSS scheme: CTC trace. TSS improves the worst-case turnaround times for many categories without affecting the worst-case tunraround times for other categories.
( # %
is given by
# - -
The set of tasks preempted by task
%
Worst case TAT
438900
60000
70000
Worst case TAT
Very Long
Worst case TAT
Worst case TAT
205273
Wide
Width
Width
Long 112086
1754226
Width
Long 70000
Very Wide
Short 44371
12000
Worst case TAT
8000
16470 1754226
Worst case TAT
Worst case TAT
10000
Worst case TAT
16470
12000
174603
Wide
Width
Fig. 13. Worst-case slowdown for the TSS scheme: CTC trace. TSS improves the worst-case slowdowns for many categories without affecting the worstcase slowdowns for other categories.
Short 107743
26.65
10 9 8 7 6 5 4 3 2 1 0
Width
Fig. 11. Worst-case slowdown: SS scheme, CTC trace. SS is much better than NS for most of the categories and is slightly worse for some of the VL categories. Compared to IS, SS is much better for all the categories except for the VS categories.
44371
Very Wide
Very Long 96.6
10 9 8 7 6 5 4 3 2 1 0
Width
Very Short
Wide
Width
Long
SF = 2
Seq
Very Wide
35
26.65
10 9 8 7 6 5 4 3 2 1 0
Width
Wide
40
Width
27.58
Worst case Slowdown
Worst case Slowdown
No Suspension
757.6
41.91
41.4
SF = 2 Tuned
Very Long
Long 40 35 30 25 20 15 10 5 0 Narrow
Very Wide
Short 746.15
SF = 2
Width
Width
Seq
Wide
Worst case Slowdown
Seq
50
291.51
Worst case Slowdown
100
60
135.48
92.74 40 35 30 25 20 15 10 5 0
Worst case Slowdown
120
757.65 70
Worst case Slowdown
746.15
Worst case Slowdown
Worst case Slowdown
291.51
Very Short
Short
Very Short 135.48
8
If τ is a previously suspended task attempting reentry, then it has to get the same set of processors that it was using before it was suspended. Here we remove the restriction that the number of processors requested by a suspending job should be at least half of the number of nodes requested by the job that it suspends; otherwise, if a VW job happened to suspend a narrow job, then in the worst case the narrow job would have to wait until the VW job completes to get rescheduled. So the set of tasks that can be preempted by τ in this case is given by

    S_τ = { T ∈ RunningJobs : x(τ) ≥ SF · x(T) and proc(T) ∩ proc(τ) ≠ ∅ }

where proc(T) denotes the set of processors used by task T.
Task τ can be scheduled by preempting one or more tasks in S_τ if and only if

    F + Σ_{T ∈ S_τ} p(T) ≥ p(τ)

where F is the number of free processors.
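As an illustration of the two definitions above, here is a minimal Python sketch (not the authors' implementation); the `Job` record and the names `xfactor`, `procs`, and `free_procs` are assumptions standing in for the suspension priority, processor count, and free-processor count:

```python
# Sketch of the SS preemption test, under the assumptions stated above.
from dataclasses import dataclass

@dataclass
class Job:
    xfactor: float   # suspension priority (expansion factor)
    procs: int       # number of processors used/requested

def candidate_set(idle, running, sf):
    """Running jobs that the idle job may preempt: the idle job's priority
    must be at least SF times the running job's, and the running job may
    use at most twice the processors the idle job requests."""
    return [r for r in running
            if idle.xfactor >= sf * r.xfactor and r.procs <= 2 * idle.procs]

def can_schedule(idle, running, free_procs, sf):
    """The idle job fits iff the free processors plus those reclaimable
    from its candidate set cover its request."""
    reclaimable = sum(r.procs for r in candidate_set(idle, running, sf))
    return free_procs + reclaimable >= idle.procs
```

For example, an idle job with xfactor 10 requesting 8 processors can preempt a 4-processor job with xfactor 2 (under SF = 2) but not a 32-processor job, since the latter exceeds twice its own request.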
D. Results

We compare the SS scheme, run under various suspension factors, with the No-Suspension (NS) scheme with aggressive backfilling and with the IS scheme. From Figs. 7 – 10, we can see that the SS scheme provides significant improvement for the
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING AND NETWORKING
Pseudocode for the selective suspension scheme:

    sort the list of running jobs in ascending order of suspension priority
    sort the list of idle jobs in descending order of suspension priority
    for each idle job do
        set candidate_job_set to the null set
        if (idle job is a suspended job) then
            goto already_suspended
        else
            available_processors = number of free processors
            for each running job do
                if (number of processors requested by the idle job > available_processors) then
                    if ((suspension priority of the idle job >= SF * suspension priority of the running job)
                            && (number of processors used by the running job <= 2 * number of processors requested by the idle job)) then
                        candidate_job_set = {candidate_job_set} ∪ {running job}
                        available_processors = available_processors + number of processors used by the running job
                    else
                        goto next_idle_job
                    end if
                else
                    goto suspend_jobs_1
                end if
            done
            goto next_idle_job
        end if
    already_suspended:
        available_processor_set = {set of free processors}
        for each running job do
            if ({set of processors requested by the idle job} is not contained in {available_processor_set}) then
                if (suspension priority of the idle job >= SF * suspension priority of the running job) then
                    if ({set of processors used by the running job} ∩ {set of processors requested by the idle job} is not empty) then
                        available_processor_set = {available_processor_set} ∪ {set of processors used by the running job}
                        candidate_job_set = {candidate_job_set} ∪ {running job}
                    end if
                else
                    goto next_idle_job
                end if
            else
                goto suspend_jobs_2
            end if
        done
        goto next_idle_job
    suspend_jobs_1:
        sort the jobs in candidate_job_set in descending order of number of processors used
        available_processors = number of free processors
        for each job in candidate_job_set do
            if (number of processors requested by the idle job > available_processors) then
                suspend the job
                available_processors = available_processors + number of processors used by the suspended job
            else
                schedule the idle job
                goto next_idle_job
            end if
        done
    suspend_jobs_2:
        suspend all jobs in the candidate_job_set
        schedule the idle job
    next_idle_job:
        do nothing
    done
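The non-reentry path of the pseudocode above can be rendered as a short, runnable Python sketch. This is a hedged illustration under assumed job fields (`xfactor`, `procs`) and caller-supplied `suspend`/`schedule` callbacks, not the authors' scheduler:

```python
# Illustrative Python rendering of the non-reentry path of the selective
# suspension pseudocode. Jobs are plain dicts; field names are assumptions.

def try_schedule_idle_job(idle, running, free_procs, sf, suspend, schedule):
    """Try to start `idle` by suspending lower-priority running jobs.
    Returns True if the idle job was scheduled."""
    candidates = []
    available = free_procs
    # Running jobs are examined in ascending order of suspension priority.
    for r in sorted(running, key=lambda j: j["xfactor"]):
        if idle["procs"] <= available:
            break  # enough processors already collected
        if (idle["xfactor"] >= sf * r["xfactor"]
                and r["procs"] <= 2 * idle["procs"]):
            candidates.append(r)
            available += r["procs"]
        else:
            return False  # a non-preemptable job blocks this idle job
    if idle["procs"] > available:
        return False
    # Suspend the widest candidates first, only as many as needed.
    available = free_procs
    for c in sorted(candidates, key=lambda j: -j["procs"]):
        if idle["procs"] > available:
            suspend(c)
            available += c["procs"]
        else:
            break
    schedule(idle)
    return True
```

Sorting the candidates by width before suspending mirrors the `suspend_jobs_1` step: it frees the requested processor count while suspending as few jobs as possible.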
Fig. 16. Worst-case turnaround time: SS scheme, SDSC trace. The trends are similar to those with the worst-case slowdown metric (Fig. 15).
Very-Short (VS) and Short (S) length categories and Wide (W) and Very-Wide (VW) width categories. For example, for the VS-VW category, slowdown is reduced from 113 for the NS scheme to 7 for SS with SF = 2 for the SDSC trace (reduced from 34 for the NS scheme to under 3 for SS with SF = 2 for the CTC trace). For the VS and S length categories, a lower SF results in lowered slowdown and turnaround time. This is because a lower SF increases the probability that a job in these categories will suspend a job in the Long (L) or Very-Long (VL) category. The same is also true for the L length category, but the effect of change in SF is less pronounced. For the VL length category, there is an opposite trend with decreasing SF: the slowdown and turnaround times worsen. This is due to the
Fig. 17. Worst-case slowdown for the TSS scheme: SDSC trace. TSS improves the worst-case slowdowns for many categories without affecting the worst-case slowdowns for other categories.
Fig. 15. Worst-case slowdown: SS scheme, SDSC trace. SS is much better than NS for most of the categories and is slightly worse for some of the VL categories. Compared to IS, SS is much better for all the categories except for the VS categories.
Fig. 18. Worst-case turnaround times for the TSS scheme: SDSC trace. TSS improves the worst-case turnaround times for many categories without affecting the worst-case turnaround times for other categories.
increasing probability that a Long job will be suspended by a job in a shorter category as SF decreases. In comparison to the base No-Suspension (NS) scheme, the SS scheme provides significant benefits for the VS and S categories and a slight improvement for most of the Long categories but is slightly worse for the VL categories. The performance of the IS scheme is very good for the VS categories. It is better than the SS scheme for the VS categories and worse for the other categories. Although the overall slowdown for IS is considerably less than for the No-Suspension scheme, it is not better than SS. Moreover, with IS the VW and VL categories get significantly worse.
Fig. 21. Average slowdown of badly estimated jobs: CTC trace. Compared to NS, SS provides a slight improvement in slowdowns for many categories. SS tends to penalize the badly estimated jobs in VS categories. IS gives better performance for VS, S and VL categories.
Fig. 19. Average slowdown: Inaccurate estimates of run time; CTC trace. Compared to NS, SS improves the slowdowns for most of the categories with little deterioration to other categories. The performance of IS is bad for the long jobs.
Fig. 20. Average slowdown of well estimated jobs: CTC trace. Compared to NS, SS significantly improves the slowdowns for most of the categories with little deterioration to other categories. The performance of SS is better than or comparable to IS for VS categories.
E. Tunable Selective Suspension (TSS)

From the graphs of the previous section, one can observe that the SS scheme significantly improves the average slowdown and turnaround time of various job categories. From a practical point of view, however, the worst-case slowdowns and turnaround times are also very important. A scheme that improves the average slowdowns and turnaround times for most of the categories but worsens the worst-case slowdown and turnaround time for the long categories is not acceptable. For example, a delay of 1 hour for a 10-minute job (slowdown = 7) is tolerable, whereas a slowdown of 7 for a 24-hour job is unacceptable. Figure 11 compares the worst-case slowdowns for SF = 2 with the worst-case slowdowns of
the NS scheme and the IS scheme for the CTC trace. One can observe that the worst-case slowdowns with the SS scheme are much better than with the NS scheme for most of the cases, but the worst-case slowdowns for some of the long categories are worse than for the NS scheme. Although the worst-case slowdown with SS is generally less than that with NS, the absolute worst-case slowdowns are much higher than the average slowdowns for some of the short categories. For the IS scheme, the worst-case slowdowns for the very short categories are lower, but they are very high for the long jobs, an unacceptable situation. Figure 12 compares the worst-case turnaround times for the SS scheme with the worst-case turnaround times for the NS scheme and the IS scheme, for the CTC trace. Even though the trends observed here are similar to those for the worst-case slowdowns, the categories where SS is best with respect to worst-case turnaround time are not the same as the categories for which SS is best with respect to worst-case slowdown; the job with the worst-case turnaround time need not be the one with the worst-case slowdown. Similar trends can be observed for the SDSC trace from Figs. 15 and 16. We next propose a tunable scheme to improve the worst-case slowdown and turnaround time without significant deterioration of the average slowdown and turnaround time. This scheme controls the variance in slowdowns and turnaround times by associating a limit with each job: preemption of a job is disabled when its suspension priority exceeds this limit. The limit is set to 1.5 times the average slowdown of the category that the job belongs to. The candidate set of tasks that can be preempted by a task τ is given by

    S_τ = { T ∈ RunningJobs : x(τ) ≥ SF · x(T), p(T) ≤ 2 · p(τ), and x(T) < 1.5 · SD_avg(category(T)) }
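Expressed as code, the tunable limit amounts to one extra clause in the preemptability test. The following is a minimal sketch under assumed field names (`xfactor`, `category`) and an assumed per-category average-slowdown table, not the authors' implementation:

```python
# Sketch of the TSS preemptability test: the usual SS priority condition
# plus a per-category cap that disables preemption of a job once its own
# priority reaches 1.5x the average slowdown of its category.

def tss_preemptable(idle, running_job, sf, avg_slowdown_by_category):
    """True if `running_job` is still a preemption candidate for `idle`."""
    limit = 1.5 * avg_slowdown_by_category[running_job["category"]]
    return (idle["xfactor"] >= sf * running_job["xfactor"]
            and running_job["xfactor"] < limit)
```

With an average slowdown of 2 for a category, a running job in it stops being preemptable once its own priority reaches 3, which bounds how far its worst-case slowdown can be pushed by repeated suspensions.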
V. IMPACT OF USER ESTIMATE INACCURACIES

We have so far assumed that the user estimates of job run time are perfect. Now, we consider the effect of user estimate inaccuracies on the proposed schemes. This analysis is needed
All the other conditions remain the same as mentioned in Section IV-C. Figures 13 and 14 show the results of this tunable scheme for the CTC trace. It improves the worst-case slowdowns for some long categories (VL-W, VL-VW, L-N) and some short categories (VS-Seq, VS-N, S-Seq) without affecting the worst-case slowdowns of the other categories. It improves the worst-case turnaround times for most of the categories without affecting the worst-case turnaround times of the other categories. Figures 17 and 18 show similar trends for the SDSC trace. This scheme can also be applied to selectively tune the slowdowns or turnaround times for particular categories. The TSS scheme is used for all the subsequent experiments, and the term "Selective Suspension" or "SS" in the following sections refers to "Tunable Selective Suspension."
If τ is a previously suspended task attempting reentry, then

    S_τ = { T ∈ RunningJobs : x(τ) ≥ SF · x(T), proc(T) ∩ proc(τ) ≠ ∅, and x(T) < 1.5 · SD_avg(category(T)) }
Fig. 23. Average turnaround time of well estimated jobs: CTC trace. Compared to NS, SS significantly improves the turnaround times for most of the categories with little deterioration to other categories. The performance of SS is comparable to IS for VS categories.
where SD_avg(category(T)) represents the average slowdown for the job category to which task T belongs.
Fig. 22. Average turnaround time: Inaccurate estimates of run time; CTC trace. Compared to NS, SS improves the turnaround times for most of the categories with little deterioration to other categories. The performance of IS is bad for the long jobs.
Fig. 24. Average turnaround time of badly estimated jobs: CTC trace. Compared to NS, SS provides a slight improvement in turnaround times for many categories. SS tends to penalize the badly estimated jobs in VS categories. IS gives better performance for VS, S, and VL categories.
for modeling an actual system workload. In this context, we believe that a problem has been ignored by previous studies when analyzing the effect of overestimation on scheduling strategies. Abnormally aborted jobs tend to excessively skew the average slowdown of jobs in a workload. Consider a job requesting a wall-clock limit of 24 hours, which is queued for 1 hour and then aborts within one minute because of some fatal exception. The slowdown of this job would be computed to be 60, whereas the average slowdown of normally completing long jobs is typically under 2. If even 5% of the jobs have a high slowdown of 60, while 95% of the normally completing jobs have a slowdown of 2, the average slowdown over all jobs would be around 5. Now consider a scheme such as the
Fig. 25. Average slowdown: Inaccurate estimates of run time; SDSC trace. Compared to NS, SS improves the slowdowns for most of the categories with little deterioration to other categories. The performance of IS is bad for the long jobs.
Fig. 26. Average slowdown of well estimated jobs: SDSC trace. Compared to NS, SS significantly improves the slowdowns for most of the categories with little deterioration to other categories. The performance of IS is bad except for VS categories.
speculative backfilling strategy evaluated in [29]. With this scheme, a job is given a free timeslot to execute in, even if that slot is considerably smaller than the requested wall-clock limit. Aborting jobs will quickly terminate, and since they did not have to be queued till an adequately long window was available, their slowdown would decrease dramatically with the speculative backfilling scheme. As a result, the average slowdown of the entire trace would now be close to 2, assuming that the slowdown of the normally completing jobs does not change significantly. A comparison of the average slowdowns would seem to indicate that the speculative backfilling scheme results in a significant improvement in job slowdown from 5 to 2. However, under the above scenario, the change is due only to the small fraction of aborted jobs, and not due to any benefits to the normal jobs. In order to avoid this problem, we group the jobs into two different estimation categories:
Fig. 27. Average slowdown of badly estimated jobs: SDSC trace. Compared to NS, SS provides a slight improvement in slowdowns for many categories. SS tends to penalize the badly estimated jobs in VS categories.
• Jobs that are well estimated (the estimated run time is not more than twice the actual run time of that job) and
• Jobs that are poorly estimated (the estimated run time is more than twice the actual run time).
Within each group, the jobs are further classified into 16 categories based on their actual run time and the number of processors requested. One can observe from Figs. 19 and 25 that the Selective Suspension scheme improves the slowdowns for most of the categories without adversely affecting the other categories. The slowdowns for the short and wide categories are quite high compared to the other categories, mainly because of the overestimation. Since the suspension priority used by the SS scheme is the xfactor, it favors short jobs. But if a short job was badly estimated, it would be treated as a long job and its priority would increase only gradually. So it will not be able to suspend running jobs easily and will end up with a high slowdown. This situation does not happen with IS, because each arriving job gets a 10-minute time quantum irrespective of its estimated run time; therefore the slowdowns for the very short category (whose length is less than or equal to 10 minutes) are better with IS than with the other schemes. For the other categories, however, SS performs much better than IS. Figures 22 and 28 compare the average turnaround times of the SS scheme with those of the NS and IS schemes for the CTC and SDSC traces, respectively. The improvement in performance for the short and wide categories is much less when compared to the improvement achieved with the accurate
Fig. 30. Average turnaround time of badly estimated jobs: SDSC trace. Compared to NS, SS provides a slight improvement in turnaround times for many categories. SS tends to penalize the badly estimated jobs in VS categories. The performance of IS is bad except for VS categories.
outperforms IS in all other categories.

A. Modeling of Job Suspension Overhead
Fig. 28. Average turnaround time: Inaccurate estimates of run time; SDSC trace. Compared to NS, SS improves the turnaround times for most of the categories with little deterioration to other categories. The performance of IS is bad except for VS categories.
Fig. 29. Average turnaround time of well estimated jobs: SDSC trace. Compared to NS, SS significantly improves the turnaround times for most of the categories with little deterioration to other categories. The performance of IS is very bad for long jobs.
user estimate case. The reasoning provided above for the increase in slowdowns for the short and wide categories holds for this case also. The seemingly long jobs (badly estimated short jobs) are unable to suspend running jobs easily and have to wait in the queue longer, thus ending up with a high turnaround time. From Figs. 20 - 21 and Figs. 26 - 27, the higher slowdowns for the VS categories with SS are clearly due to the badly estimated jobs. Figures 23 - 24 and Figures 29 - 30 show that the reduction in the percentage improvement of the average turnaround times for the short and wide categories with SS is due to the badly estimated jobs. One can also observe that, for the well estimated jobs, SS is better than or comparable to IS for the VS categories and SS
We have so far assumed no overhead for preemption of jobs. In this section, we present simulation results that incorporate overheads for job suspension. Since the job traces did not have information about job memory requirements, we considered the memory requirement of jobs to be random and uniformly distributed between 100 MB and 1 GB. The overhead for suspension is calculated as the time taken to write the main memory used by the job to disk. The memory transfer rate that we considered is based on the following scenario: with a commodity local disk for every node, and with each node being a quad, the transfer rate per processor was assumed to be 2 MB/s (corresponding to a disk bandwidth of 8 MB/s). Figures 31 and 32 compare, respectively, the slowdowns and turnaround times of the proposed tunable scheme with NS and IS in the presence of overhead for job suspension/restart for the CTC trace. Figures 33 and 34 make the same comparison for the SDSC trace. One can observe that the overhead does not significantly affect the performance of the SS scheme.

VI. LOAD VARIATION

We have so far seen the performance of the Selective Suspension scheme under normal load. In this section, we present the performance of the SS scheme under different load conditions, starting from the normal load (original trace) and increasing the load until the system reaches saturation. The
Fig. 33. Average slowdown with modeling of overhead for suspension/restart: SDSC trace. The impact of overhead on the performance of the SS scheme is minimal.
Fig. 31. Average slowdown with modeling of overhead for suspension/restart: CTC trace. The impact of overhead on the performance of the SS scheme is minimal.
Fig. 32. Average turnaround time with modeling of overhead for suspension/restart: CTC trace. The impact of overhead on the performance of the SS scheme is minimal.
Fig. 34. Average turnaround time with modeling of overhead for suspension/restart: SDSC trace. The impact of overhead on the performance of the SS scheme is minimal.
different loads correspond to modification of the traces by dividing the arrival times of the jobs by suitable constants, keeping their run time the same as in the original trace. For example, the job trace for a load factor of 1.1 is obtained by dividing the arrival times of the jobs in the original trace by 1.1. For simplicity, we have reduced the number of job categories from sixteen to four for the load variation studies: two categories based on their run time — Short (S) and Long (L) — and two categories based on the number of processors requested — Narrow (N) and Wide (W). The criteria used for job classifications are shown in Table VI. The distribution of jobs in the CTC and SDSC traces, corresponding to the four categories, is given in Table VII and Table VIII, respectively. Figures 35 and 38 show the overall system utilization for
different schemes under different load conditions for the CTC and SDSC traces. One can observe that the SS scheme is able to achieve a better utilization than the NS scheme at higher loads, whereas the overall system utilization is very low under the IS scheme. Also, there is no significant increase in the overall system utilization (for both the SS and NS schemes) when the load factor is increased beyond 1.6 (for CTC) and 1.3 (for SDSC). This result indicates that the system reaches saturation at a load factor of 1.6 (for CTC) and 1.3 (for SDSC). We report the performance of the SS scheme for various load factors between 1.0 (normal) and 1.6 for the CTC trace and between 1.0 and 1.3 for the SDSC trace. Figures 36 and 37 and Figures 39 and 40 compare the performance of the SS scheme with the NS and IS schemes for different job categories under different load conditions
Fig. 35. Overall system utilization under different load conditions: CTC trace. The overall system utilization with the SS scheme is better than or comparable to the NS scheme. The performance of IS is much worse.
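The trace-scaling procedure described in this section (dividing each job's arrival time by the load factor while leaving its run time unchanged) can be sketched as follows; the tuple layout of a trace record is an assumption for illustration:

```python
# Sketch of the load-scaling transformation applied to a job trace:
# arrival times are divided by the load factor, run times are unchanged,
# so a factor of 1.1 compresses interarrival gaps by roughly 9%.

def scale_load(trace, load_factor):
    """trace: list of (arrival_time, run_time, procs) tuples (assumed layout)."""
    return [(arrival / load_factor, run_time, procs)
            for (arrival, run_time, procs) in trace]
```

Because only arrivals are compressed, the same work is offered to the simulated machine in less wall-clock time, which is how the higher-load variants of the CTC and SDSC traces are produced.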
Fig. 37. Average turnaround time: varying load; CTC trace. The improvements achieved by the SS scheme are more pronounced under high load.
1.5
2
Short Wide
3000 2500 SF = 2 Tuned No Suspension
1500
IS
1000 500
120000 100000 SF = 2 Tuned
80000
No Suspension
60000
IS
40000 20000 0
0.5
1
1.5
2
0
0.5
80000
SF = 2 Tuned
60000
No Suspension IS
40000 20000
0.5
1
Load
1.5
2
Long Wide
100000
0
1
Load
0 0.5
1
Long Narrow
150000
0
0.5
Load
120000
0 1.5
0
2
Load
Turnaround Time
Turnaround Time
IS
1.5
2000
Long Wide
No Suspension
IS
140000
0
250000
SF = 2 Tuned
No Suspension
20
0 0.5
0 0
SF = 2 Tuned
30
10
Load
45000 40000 35000 30000 25000 20000 15000 10000 5000 0 1
2
40
Short Narrow
25000
2
Fig. 39. Average slowdown: varying load; SDSC trace. The improvements achieved by the SS scheme are more pronounced under high load.
0
Load
IS
3500
Load
0.5
No Suspension
3
Load
Turnaround Time
SF = 2 Tuned
Turnaround Time
Turnaround Time
2500
0
SF = 2 Tuned
4
Short Wide
3000
1.5
50
5
0
35000
3500
1
Long Wide
0
2
Fig. 36. Average slowdown: varying load; CTC trace. The improvements achieved by the SS scheme are more pronounced under high load.
Short Narrow
0.5
60
Load
4000
IS
Load
1
Load
No Suspension
0
6
10
0 2
2
SF = 2 Tuned
Long Narrow
2 1.5
1.5
7
Slowdown
IS
Slowdown
No Suspension
1
1
180 160 140 120 100 80 60 40 20 0
Load
12
0.5
0.5
Long Wide
SF = 2 Tuned
0
10
2
14
1
IS
Load
1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 0.5
No Suspension
15
5
Load
0
SF = 2 Tuned
20
Slowdown
IS
10
25
Slowdown
No Suspension
15
SF = 2 Tuned
Turnaround Time
20
30
100
Slowdown
SF = 2 Tuned
Short Wide
35
120
25
Slowdown
Slowdown
30
Slowdown
Short Narrow
140
1.5
2
Turnaround Time
35
Turnaround Time
Fig. 38. Overall system utilization under different load conditions: SDSC trace. The overall system utilization with the SS scheme is better than or comparable to the NS scheme. The performance of IS is much worse.
1000000 900000 800000 700000 600000 500000 400000 300000 200000 100000 0
SF = 2 Tuned No Suspension IS
0
0.5
1
1.5
2
Load
Fig. 40. Average turnaround time: varying load; SDSC trace. The improvements achieved by the SS scheme are more pronounced under high load.
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING AND NETWORKING
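The per-job metrics plotted in these figures can be computed from completed-job records as in the sketch below. The record fields and the simple utilization formula are assumptions for illustration (the paper's simulator additionally accounts for suspension overheads), not the authors' code.

```python
# Generic computation of the metrics shown in Figs. 35-44.
# Record fields ("arrival", "end", "runtime", "processors") are assumed names.

def turnaround(job):
    # completion time minus arrival time: wait time plus (possibly
    # suspension-stretched) execution time
    return job["end"] - job["arrival"]

def slowdown(job):
    # turnaround normalized by run time; short jobs suffer most from waiting
    return turnaround(job) / job["runtime"]

def utilization(jobs, total_processors, horizon):
    # fraction of available processor-seconds actually spent running jobs
    busy = sum(j["runtime"] * j["processors"] for j in jobs)
    return busy / (total_processors * horizon)

jobs = [
    {"arrival": 0.0, "end": 3600.0, "runtime": 1800.0, "processors": 4},
    {"arrival": 0.0, "end": 9000.0, "runtime": 7200.0, "processors": 16},
]
print([slowdown(j) for j in jobs])    # [2.0, 1.25]
print(utilization(jobs, 32, 9000.0))  # 0.425
```

Averaging slowdown rather than turnaround time weights short jobs more heavily, which is why the two metrics can rank the IS and SS schemes differently for the SW category.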
It can be observed that the improvements obtained by the SS scheme are more pronounced under high load. The trends for the different job categories under higher loads are similar to those observed under normal load: the SS scheme provides significant benefit to short jobs without affecting the performance of long jobs. The IS scheme is better than the SS scheme only for SN jobs in terms of average turnaround time, whereas it is better than SS for both SN and SW jobs in terms of average slowdown. This implies that the IS scheme improves the performance of only the relatively shorter jobs in the SW category, by adversely affecting the relatively longer jobs in that category. Moreover, the performance of the IS scheme is much worse for the long jobs, a very undesirable situation.

TABLE VI
JOB CATEGORIZATION CRITERIA FOR LOAD VARIATION STUDIES

                  Run time ≤ 1 Hr   Run time > 1 Hr
≤ 8 Processors    SN                LN
> 8 Processors    SW                LW

TABLE VII
JOB DISTRIBUTION BY CATEGORY FOR LOAD VARIATION STUDIES - CTC TRACE

                  ≤ 1 Hr   > 1 Hr
≤ 8 Processors    44%      30%
> 8 Processors    13%      13%

TABLE VIII
JOB DISTRIBUTION BY CATEGORY FOR LOAD VARIATION STUDIES - SDSC TRACE

                  ≤ 1 Hr   > 1 Hr
≤ 8 Processors    47%      21%
> 8 Processors    22%      10%

Fig. 41. Average slowdown versus system utilization: CTC trace. SS provides better performance even if the system is heavily utilized.

Fig. 42. Average turnaround time versus system utilization: CTC trace. SS provides better performance even if the system is heavily utilized.

Fig. 43. Average slowdown versus system utilization: SDSC trace. SS provides better performance even if the system is heavily utilized.

Fig. 44. Average turnaround time versus system utilization: SDSC trace. SS provides better performance even if the system is heavily utilized.

Figures 41 and 42 compare, respectively, the average slowdowns and the average turnaround times of the jobs in the CTC trace against the overall system utilization for the various schemes. Figures 43 and 44 present the corresponding comparisons for the SDSC trace. The SS scheme is clearly much better than both the IS and NS schemes: even when the system is highly utilized, it provides much better response times for all categories of jobs. The IS scheme is not able to achieve high system utilization.

VII. CONCLUSIONS

In this paper, we have explored the issue of preemptive scheduling of parallel jobs, using job traces from different supercomputer centers. We have proposed a tunable selective-suspension scheme and demonstrated that it provides significant improvement in the average and worst-case slowdown of most job categories. It was also shown to provide better slowdown for most job categories than a previously proposed Immediate Service scheme. We modeled the effect of overheads for job suspension, showing that even under stringent assumptions about the available bandwidth to disk, the proposed scheme provides significant benefits over nonpreemptive scheduling and the Immediate Service strategy. We also evaluated the proposed schemes in the presence of inaccurate estimates of job run times and showed that they still produce good results. Further, we showed that the Selective Suspension strategy provides greater benefits under high system loads than the other schemes.

ACKNOWLEDGMENTS

We thank the anonymous referees for their helpful suggestions on improving the presentation of the paper. This work was supported in part by Sandia National Laboratories, the University of Chicago under NSF Grant #SCI0414407, and the U.S. Department of Energy under Contract W-31-109-ENG-38.

REFERENCES

[1] B. DasGupta and M. A. Palis, "Online real-time preemptive scheduling of jobs with deadlines," in APPROX, 2000, pp. 96–107. [Online]. Available: citeseer.nj.nec.com/dasgupta00online.html
[2] X. Deng and P. Dymond, "On multiprocessor system scheduling," in Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM Press, 1996, pp. 82–88.
[3] X. Deng, N. Gu, T. Brecht, and K. Lu, "Preemptive scheduling of parallel jobs on multiprocessors," in SODA: ACM-SIAM Symposium on Discrete Algorithms, 1996. [Online]. Available: citeseer.nj.nec.com/deng00preemptive.html
[4] L. Epstein, "Optimal preemptive scheduling on uniform processors with non-decreasing speed ratios," Lecture Notes in Computer Science, vol. 2010, pp. 230–248, 2001. [Online]. Available: citeseer.nj.nec.com/epstein00optimal.html
[5] U. Schwiegelshohn and R. Yahyapour, "Fairness in parallel job scheduling," Journal of Scheduling, vol. 3, no. 5, pp. 297–320, 2000. [Online]. Available: citeseer.ist.psu.edu/schwiegelshohn00fairness.html
[6] S. V. Anastasiadis and K. C. Sevcik, "Parallel application scheduling on networks of workstations," Journal of Parallel and Distributed Computing, vol. 43, no. 2, pp. 109–124, 1997. [Online]. Available: citeseer.nj.nec.com/article/anastasiadis96parallel.html
[7] W. Cirne and F. Berman, "Adaptive selection of partition size for supercomputer requests," in Workshop on Job Scheduling Strategies for Parallel Processing, 2000, pp. 187–208. [Online]. Available: citeseer.nj.nec.com/479768.html
[8] D. G. Feitelson, "Analyzing the root causes of performance evaluation results," Leibniz Center, Hebrew University, Tech. Rep., 2002.
[9] J. P. Jones and B. Nitzberg, "Scheduling for parallel supercomputing: A historical perspective of achievable utilization," in Workshop on Job Scheduling Strategies for Parallel Processing, 1999, pp. 1–16. [Online]. Available: citeseer.nj.nec.com/patton99scheduling.html
[10] W. A. Ward, Jr., C. L. Mahood, and J. E. West, "Scheduling jobs on parallel systems using a relaxed backfill strategy," in Workshop on Job Scheduling Strategies for Parallel Processing, 2002.
[11] B. G. Lawson and E. Smirni, "Multiple-queue backfilling scheduling with priorities and reservations for parallel systems," in Workshop on Job Scheduling Strategies for Parallel Processing, 2002.
[12] B. G. Lawson, E. Smirni, and D. Puiu, "Self-adapting backfilling scheduling for parallel systems," in Proceedings of the International Conference on Parallel Processing, 2002.
[13] D. Lifka, "The ANL/IBM SP scheduling system," in Workshop on Job Scheduling Strategies for Parallel Processing, 1995, pp. 295–303.
[14] A. W. Mu'alem and D. G. Feitelson, "Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 6, pp. 529–543, 2001.
[15] G. Sabin, R. Kettimuthu, A. Rajan, and P. Sadayappan, "Scheduling of parallel jobs in a heterogeneous multi-site environment," in Workshop on Job Scheduling Strategies for Parallel Processing, 2003.
[16] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, "Selective reservation strategies for backfill job scheduling," in Workshop on Job Scheduling Strategies for Parallel Processing, 2002.
[17] S. Srinivasan, V. Subramani, R. Kettimuthu, P. Holenarsipur, and P. Sadayappan, "Effective selection of partition sizes for moldable scheduling of parallel jobs," in Proceedings of the 9th International Conference on High Performance Computing, 2002.
[18] V. Subramani, R. Kettimuthu, S. Srinivasan, J. Johnston, and P. Sadayappan, "Selective buddy allocation for scheduling parallel jobs on clusters," in Proceedings of the IEEE International Conference on Cluster Computing, 2002.
[19] V. Subramani, R. Kettimuthu, S. Srinivasan, and P. Sadayappan, "Distributed job scheduling on computational grids using multiple simultaneous requests," in Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, 2002, pp. 359–366.
[20] D. Talby and D. G. Feitelson, "Supporting priorities and improving utilization of the IBM SP scheduler using slack-based backfilling," in Proceedings of the 13th International Parallel Processing Symposium, 1999, pp. 513–517. [Online]. Available: citeseer.nj.nec.com/talby99supporting.html
[21] D. Zotkin and P. Keleher, "Job-length estimation and performance in backfilling schedulers," in Proceedings of the 8th High Performance Distributed Computing Conference, 1999, pp. 236–243. [Online]. Available: citeseer.nj.nec.com/196999.html
[22] S. H. Chiang, R. K. Mansharamani, and M. K. Vernon, "Use of application characteristics and limited preemption for run-to-completion parallel processor scheduling policies," in ACM SIGMETRICS
Conference on Measurement and Modeling of Computer Systems, 1994, pp. 33–44. [Online]. Available: citeseer.nj.nec.com/chiang94use.html
[23] S. H. Chiang and M. K. Vernon, "Production job scheduling for parallel shared memory systems," in Proceedings of the International Parallel and Distributed Processing Symposium, 2002. [Online]. Available: citeseer.nj.nec.com/196999.html
[24] S. T. Leutenegger and M. K. Vernon, "The performance of multiprogrammed multiprocessor scheduling policies," in ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1990, pp. 226–236. [Online]. Available: citeseer.nj.nec.com/196999.html
[25] E. W. Parsons and K. C. Sevcik, "Implementing multiprocessor scheduling disciplines," in Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph, Eds. Springer-Verlag, 1997, pp. 166–192, Lecture Notes in Computer Science, vol. 1291.
[26] K. Aida, "Effect of job size characteristics on job scheduling performance," in Workshop on Job Scheduling Strategies for Parallel Processing, 2000, pp. 1–17. [Online]. Available: citeseer.nj.nec.com/319169.html
[27] O. Arndt, B. Freisleben, T. Kielmann, and F. Thilo, "A comparative study of online scheduling algorithms for networks of workstations," Cluster Computing, vol. 3, no. 2, pp. 95–112, 2000. [Online]. Available: citeseer.nj.nec.com/article/arndt98comparative.html
[28] W. Cirne, "When the herd is smart: The emergent behavior of SA," in IEEE Transactions on Parallel and Distributed Systems, 2002. [Online]. Available: citeseer.nj.nec.com/457615.html
[29] D. Perkovic and P. J. Keleher, "Randomization, speculation, and adaptation in batch schedulers," in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM). IEEE Computer Society, 2000, p. 7.
[30] P. J. Keleher, D. Zotkin, and D. Perkovic, "Attacking the bottlenecks of backfilling schedulers," Cluster Computing, vol. 3, no. 4, pp. 245–254, 2000. [Online]. Available: citeseer.nj.nec.com/467800.html
[31] J. Krallmann, U. Schwiegelshohn, and R. Yahyapour, "On the design and evaluation of job scheduling algorithms," in Workshop on Job Scheduling Strategies for Parallel Processing, 1999, pp. 17–42. [Online]. Available: citeseer.nj.nec.com/krallmann99design.html
[32] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, "Characterization of backfilling strategies for parallel job scheduling," in Proceedings of the ICPP-2002 Workshops, 2002, pp. 514–519.
[33] A. Streit, "On job scheduling for HPC-clusters and the dynP scheduler," in Proceedings of the 8th International Conference on High Performance Computing. Springer-Verlag, 2001, pp. 58–67.
[34] C. McCann, R. Vaswani, and J. Zahorjan, "A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors," ACM Transactions on Computer Systems, vol. 11, no. 2, pp. 146–178, 1993.
[35] D. G. Feitelson and M. A. Jette, "Improved utilization and responsiveness with gang scheduling," in Workshop on Job Scheduling Strategies for Parallel Processing. Springer-Verlag, 1997, pp. 238–261.
[36] D. Jackson, Q. Snell, and M. J. Clement, "Core algorithms of the Maui scheduler," in Workshop on Job Scheduling Strategies for Parallel Processing, 2001, pp. 87–102. [Online]. Available: citeseer.nj.nec.com/479768.html
[37] J. Skovira, W. Chan, H. Zhou, and D. Lifka, "The EASY LoadLeveler API project," in Workshop on Job Scheduling Strategies for Parallel Processing, 1996, pp. 41–47. [Online]. Available: citeseer.nj.nec.com/479768.html
[38] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, "Theory and practice in parallel job scheduling," in Workshop on Job Scheduling Strategies for Parallel Processing. Springer-Verlag, 1997, pp. 1–34.
[39] K. C. Sevcik, "Application scheduling and processor allocation in multiprogrammed parallel processing systems," Performance Evaluation, vol. 19, no. 2-3, pp. 107–140, 1994. [Online]. Available: citeseer.nj.nec.com/sevcik93application.html
[40] J. Zahorjan and C. McCann, "Processor scheduling in shared memory multiprocessors," in ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1990, pp. 214–225. [Online]. Available: citeseer.nj.nec.com/196999.html
[41] D. G. Feitelson, "Logs of real parallel workloads from production systems," http://www.cs.huji.ac.il/labs/parallel/workload/logs.html.
Rajkumar Kettimuthu is a researcher at Argonne National Laboratory’s Mathematics and Computer Science Division. His research interests include data transport in high-bandwidth and high-delay networks and scheduling and resource management for cluster computing and the Grid. He has a bachelor of engineering degree in computer science and engineering from Anna University, Madras, India, and a master of science in computer and information science from the Ohio State University.
Vijay Subramani received his B.E. in computer science and engineering from Anna University, India, in 2000 and his M.S. in computer and information science from the Ohio State University in 2002. His research at Ohio State included scheduling and resource management for parallel and distributed systems. He currently works at Microsoft Corporation in Redmond, WA. His past work experience includes an internship at Los Alamos National Laboratory, where he worked on buffered coscheduling.
Srividya Srinivasan currently works as a software engineer at Microsoft Corporation in Redmond, WA. She earlier worked as a software developer at Bloomberg L.P. in New York. She received a B.E. degree in computer science and engineering from Anna University, Chennai, India, in 2000 and her M.S. in computer and information science from the Ohio State University in 2002. Her research at Ohio State focused on parallel and distributed systems with an emphasis on parallel job scheduling.
Thiagaraja Gopalsamy is a senior software engineer with Altera Corporation, San Jose. He received his bachelor’s degree in computer science and engineering in 1999 from Anna University, India, and his master’s degree in computer and information science in 2001 from the Ohio State University. His past research interests include mobile ad hoc networks and parallel computing. He is currently working on field programmable gate arrays and reconfigurable computing.
D. K. Panda is a professor of computer science at the Ohio State University. His research interests include parallel computer architecture, high performance networking, and network-based computing. He has published over 150 papers in these areas. His research group is currently collaborating with national laboratories and leading companies on designing various communication and I/O subsystems of next-generation HPC systems and datacenters with modern interconnects. The MVAPICH (MPI over VAPI for InfiniBand) package developed by his research group (http://nowlab.cis.ohio-state.edu/projects/mpi-iba/) is being used by more than 160 organizations worldwide to extract the potential of InfiniBand-based clusters for HPC applications. Dr. Panda is a recipient of the NSF CAREER Award, OSU Lumley Research Award (1997 and 2001), and an Ameritech Faculty Fellow Award. He is a senior member of IEEE Computer Society and a member of ACM.
P. Sadayappan received the B. Tech. degree from the Indian Institute of Technology, Madras, India, and an M.S. and Ph.D. from the State University of New York at Stony Brook, all in electrical engineering. He is currently a professor in the Department of Computer Science and Engineering at the Ohio State University. His research interests include scheduling and resource management for parallel/distributed systems and compiler/runtime support for high-performance computing.
The submitted manuscript has been in part created by the University of Chicago as Operator of Argonne National Laboratory ("Argonne") under Contract No. W-31-109-ENG-38 with the U.S. Department of Energy. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.