AN EXTENSIBLE JOB SCHEDULING SYSTEM FOR MASSIVELY PARALLEL PROCESSOR ARCHITECTURES

BY DAVID ANDREW LIFKA

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the Illinois Institute of Technology

Approved ____________________________
         Advisor

         ____________________________
         Co-Advisor

Chicago, Illinois
May 1998

ACKNOWLEDGMENT

This work is the result of the guidance, support, and encouragement I have received from my wife, family, advisors, and colleagues. I would like to thank Argonne National Laboratory for supporting a majority of my graduate school work and for access to the original 128 node IBM SP system on which the original EASY scheduler was developed. As EASY matured and scalability became the focus of my research, the Cornell Theory Center provided a new, one-of-a-kind opportunity for me. Their exceptionally well-run IBM SP system, combined with their excellent working relationship with IBM, were the two key factors that made the EASY-LoadLeveler work possible. The freedom to do this research and the time to consider appropriate solutions have been invaluable.

I would also like to thank my advisors at the Illinois Institute of Technology, especially Dr. Kenevan, for helping me formulate this work and for the opportunity to do this research under his guidance, and Dr. Martha Evens for her patience and encouragement during some difficult times. I would like to thank my parents for instilling in me the importance of “aiming high” and working hard to achieve my goals. These lessons, combined with their love and support, have given me the confidence I needed to complete this degree. Finally, I would like to thank my wife, Torri, for her love, patience, and encouragement. I am truly grateful for all the sacrifices she has made so that I could complete this degree.

D. A. L.


TABLE OF CONTENTS

ACKNOWLEDGMENT
LIST OF FIGURES
ABSTRACT

CHAPTER
     I.  INTRODUCTION
          1.1  Problem Description
          1.2  Background Information
          1.3  Review of Previous Work
          1.4  Approaches to Job Scheduling
    II.  HOMOGENEOUS RESOURCE SOLUTION
          2.1  The EASY Scheduler
          2.2  EASY-LoadLeveler Solution
   III.  HETEROGENEOUS RESOURCE SOLUTION
    IV.  SCHEDULING PERFORMANCE
          4.1  Performance Simulation
          4.2  Performance Simulation Results
     V.  CONCLUSIONS
    VI.  FUTURE WORK

APPENDIX
REFERENCES

LIST OF FIGURES

1.1  Essential Components of a Scheduling System
2.1  Backfill Algorithm Pseudo-code
2.2  Backfill Example, Part 1
2.3  Backfill Example, Part 2
2.4  Backfill Example, Part 3
2.5  EASY Design
2.6  EASY-LoadLeveler Design
3.1  O(n^3) Heterogeneous Backfill Strategy
3.2  Allocate Routine Pseudo-code
3.3  CheckRequirements Routine Pseudo-code
3.4  ReserveResources Routine Pseudo-code
4.1  Homogeneous EASY Daemon Performance
4.2  Heterogeneous EASY Daemon Performance
4.3  Average Wait Time
4.4  Utilization
4.5  Run Time
6.1  Advanced Resource Management System

ABSTRACT

During the last five years scientists have discovered that modern UNIX workstations connected with Ethernet and high performance networks can provide enough computational performance to compete with the supercomputers of the day. Today supercomputer systems, like the International Business Machines (IBM) SP, can provide more CPU and networking bandwidth than is obtainable from networks of workstations (NOWs). The IBM SP is actually made up of individual workstation class processors connected by a high bandwidth switch network, so scheduler developers felt that the scheduling systems previously used on NOWs would still apply. It quickly became obvious to the many sites that purchased MPP systems that this was certainly not the case. Realizing that there was an urgent need for a job scheduling system that works well in an MPP environment, I started the development of the Extensible Argonne Scheduling sYstem (EASY). Development followed a unique approach in which users were encouraged to make suggestions or report inconsistencies with the documented behavior of EASY. As EASY became more widely used and IBM SP systems started to become much larger, several scalability problems had to be addressed. The main scalability problem was due to the fact that EASY was doing resource management as well as job scheduling. After discussions with IBM on how to address these issues, we decided on an application programming interface to their LoadLeveler product. LoadLeveler did a poor job of scheduling large parallel jobs but did provide scalable resource management. The EASY-LoadLeveler project was started to combine the best features of both scheduling systems. In order to test the scalability of EASY-LoadLeveler, a much larger system than I had access to at Argonne was necessary. The Cornell Theory Center provided not only the largest IBM SP in existence at the time, but also a very close working relationship with IBM, making it an ideal place to do this research. Within a year we had the first version up and running at the Theory Center. This work has been extended to include a new deterministic heterogeneous scheduling algorithm that can be used on systems that have nodes with different resources.

CHAPTER I
INTRODUCTION

Massively parallel processing (MPP) architectures are made up of many high performance processors connected by specialized hardware switches or backplanes. For this reason these systems often appear to users as though they are networks of workstations connected by a high speed network. For the past five years networks of workstations have been able to provide enough CPU power to be competitive with the supercomputers produced in this time frame. This makes the conversion of code that runs on networks of workstations to MPP platforms very simple for users, and in some cases no modifications are needed at all. Although MPPs possess many of the same characteristics as clusters of workstations, they have additional specialized hardware and software with special scheduling requirements. After a great deal of research and failed attempts using the various advanced scheduling systems developed for clusters of workstations, I was faced with the task of developing a scheduling system that took into account the specialized high performance switch of the IBM SP system and met the needs of a diverse research community. The resulting Extensible Argonne Scheduling sYstem (EASY) and the follow-on Advanced Resource Management System (ARMS), currently under development at the Cornell Theory Center, are the focus of this research.

In a research community all scientists have their own research agenda. When the goals of these agendas rely on the use of a shared resource such as a supercomputer, fairness is by far the most difficult issue a scheduling system must address. Each user's definition of fairness can be different and in conflict with the others. Typically it is a method that will run their jobs first! Another problem with addressing fairness is doing it in such a way that valuable resources are used efficiently. A first-in first-out approach is typically the only way to be truly fair, but it can lead to poor system utilization. Designing the EASY scheduler architecture in such a way that the scheduling policy it enforces could quickly and easily be modified was of key importance. As users began using the system, reporting bugs, and making suggestions for improvement, it was important that the development language be flexible so that making changes would not require significant downtime for recompilation and testing.

EASY is now in use at many sites internationally. As part of this research, I have spent the last two years working on the EASY-LoadLeveler project [6]* at the Cornell Theory Center. This work included modifying EASY to work with International Business Machines' LoadLeveler product [27] through an application programming interface (API) that we jointly designed and developed. It is now available in the current product release. This API allows LoadLeveler to be controlled by any external scheduler, like EASY, that can be modified at different sites so that it meets their local needs. The main benefit to IBM is that it can now support the wide range of scheduling and policy requirements of its customers without having to develop a new product release to do so. EASY-LoadLeveler provides the foundation for ARMS [38], which is an open scheduling system that can be extended to the distributed heterogeneous scheduling environment. ARMS is a layer of software that provides seamless, cohesive access to various popular scheduling systems running on different computer architectures, ranging from supercomputers to personal computers.

* Numbers in brackets refer to numbered references.

1.1 Problem Description

The fundamental problem addressed by this research is the need for a parallel job scheduler that meets the requirements of a diverse research community. The general requirements of a parallel job scheduler are fairness, simplicity, and efficient use of the available resources. Most users view fairness in a scheduling system as the way in which their application has the best chance of running over any other. This makes designing a scheduler that satisfies a majority of the users very difficult.

Whatever queuing scheme is decided upon for a particular user community will have to make efficient use of the available resources. Consider the following example. A pure first in, first out (FIFO) queue is usually considered to be very fair. But if users are allowed to submit a parallel job requiring any number of the system nodes, a FIFO queue can lead to grossly inefficient use of the machine. Assume a 128 node MPP system. If the first user wants to run a one node job for two hours and the second user wants to run a 128 way job for two hours, the first user will be allowed to run, leaving 127 nodes idle for two hours. Obviously this is an exaggerated example, but it illustrates the problem; the short sketch at the end of this section quantifies the waste.

A good parallel scheduling system should be simple to use. It should not require a significant learning curve. It has to be simple for users to submit jobs, understand the scheduling algorithm, and determine when they can expect results from their jobs. This also allows them to feel comfortable that they are being treated fairly and to make suggestions for improvements.

There are some additional requirements found in most research environments. A scheduling system should not put restrictions on the types of jobs that can be run on the system. For example, if a particular MPP supports any message-passing system, the scheduler should not impose any restrictions in this area. In a research environment users typically require exclusive access to the nodes they have been allocated for cache performance, full memory and disk space availability, and for benchmarking purposes.

From a systems administration standpoint the scheduling system should be straightforward to set up and maintain, and it should not create an unreasonable load on the system. Since the scheduler must support exclusive access to resources, it must provide programming “hooks” for specialized system accounting mechanisms. For example, wall-clock accounting may be a more desirable measure than CPU utilization in the case of inefficient or I/O bound applications. Finally, the scheduler must be very modular and written in a language that supports rapid prototyping and modification, so that user requirements can drive the scheduling policy and bug fixes can be made without requiring a significant portion of the code to be rewritten or recompiled.
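As a minimal illustration, the following Python sketch uses the numbers from the FIFO example above to quantify the idle capacity a strict FIFO ordering would leave:

    # Idle capacity under strict FIFO: a 1-node, 2-hour job runs first
    # while a 128-way job waits behind it.
    total_nodes = 128
    first_job_nodes, first_job_hours = 1, 2
    idle_node_hours = (total_nodes - first_job_nodes) * first_job_hours
    print(idle_node_hours)  # 254 idle node-hours: 127 nodes idle for 2 hours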

1.2 Background Information

The basis of this research, the EASY scheduler [39] developed at Argonne National Laboratory, was designed for use by a diverse research community with many types of computer jobs, often with conflicting requirements. Many research institutions and universities have similar user requirements for their parallel computers, which explains the broad popularity that EASY now enjoys.

For clarification, a job scheduler is different from a process scheduler. Although queuing algorithms for both are often the same, the task unit being scheduled is different. The EASY scheduler schedules parallel jobs, or programs, on available nodes of a parallel machine such as an MPP. Operating systems schedule processes and threads on a single processor, typically at a much lower level.

Due to the flexibility an MPP architecture provides, researchers have found different ways to take advantage of its computational potential. There are three general ways an MPP can be used. The first and simplest method is to use a single node of the machine to run traditional serial programs. Since MPP nodes typically have high end workstation processors with large amounts of memory and disk, they provide excellent platforms to run these serial programs. The second method is really an extension of the serial job. This method, often referred to as “task-farming” [39], is a simple way to obtain parallel performance with no or minimal modification to the serial program. The idea is to run identical copies of a serial program on many nodes in parallel, each having different input and output data sets, so that the speed at which results are obtained is proportional to the number of nodes the application can be run on. The final method, and one of particular interest in computer science, is the use of parallel algorithms implemented in a parallel programming model, like the message-passing model. In a general research environment, the actual science being done by any of these programming methods is the focus, not the method itself. For this reason the job scheduler must not favor any particular programming model over another.

A complete job scheduling system is made up of several essential components. Figure 1.1 illustrates these components. The primary pieces are resource management, scheduling, job starting mechanisms, and user processes. Resource management provides information on the status and availability of the various schedulable resources. These resources are typically computer nodes but could also be things like I/O subsystems, data objects, and graphical output devices. The scheduling component combines this resource information with the queue of waiting jobs and their requirements, and uses a queuing algorithm to decide which job will start next and on which resources.

Figure 1.1 Essential Components of a Scheduling System

There are often two levels of scheduling decisions to be made. High level decisions include matching special job requirements with resources that can meet them and policy decisions for groups of resources managed by a group or computing facility. Policy decisions affect when jobs with different resource requirements or users and groups are eligible to run. Many computing facilities wish to give priority to specific types of jobs (e.g., large parallel jobs) over other types. Another common policy use is to limit the number of jobs or amount of resources a user or group can use. Low level scheduling decisions involve matching specific computation resources to a particular job. These are made after high level scheduling decisions, so that only eligible jobs are considered at this level. Once the decision to start a particular job is made, the job ID and the set of resources it is to be started on are passed to the job start component of the system. The job start component is responsible for starting the job. This usually means granting a user authority to a resource, setting up network communication between the resources that will be used by the job, and then actually executing the parallel application. The result of the work done by the job start component is running user processes on the allocated resources.

The security and dynamic process allocation components must communicate with all other components. It is essential that all components take appropriate security measures so that the scheduling system does not introduce a security risk where jobs or resources can be sabotaged. The dynamic process allocation component must be aware of what resources are currently available as well as what job will be scheduled next. This information is required so that additional resources are not granted to a currently running job in a way that could cause the scheduling algorithm to malfunction. In the case of the EASY scheduling algorithm, which is deterministic by design, the dynamic process allocation component should not grant additional resources if doing so would change the start time of the job that is currently scheduled to run next.
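As a rough illustration of the two decision levels just described, the following Python sketch separates a high level policy filter from low level resource matching. The job, policy, and function names are hypothetical, not EASY's actual data structures:

    def eligible(job, policy, running_per_user):
        # High level decision: policy filtering, e.g., a per-user job limit.
        return running_per_user.get(job["user"], 0) < policy["max_jobs_per_user"]

    def match_resources(job, free_nodes):
        # Low level decision: bind an eligible job to specific free nodes.
        if len(free_nodes) >= job["nodes"]:
            return free_nodes[:job["nodes"]]
        return None  # not enough resources; the job keeps waiting

    policy = {"max_jobs_per_user": 2}
    job = {"user": "alice", "nodes": 4}
    if eligible(job, policy, {"alice": 1}):
        allocation = match_resources(job, ["node%d" % i for i in range(8)])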

1.3 Review of Previous Work

There is a broad range of scheduling research and experimental implementations. The literature ranges from process scheduling to queuing algorithms to parallel job scheduling strategies. Papers and reports providing overviews of scheduling problems and terminology were studied first. There were two particularly strong papers [11, 13] in this area. The paper by Feitelson and Rudolph [13] provided an excellent overview of what computer scientists consider to be the issues in job scheduling. It explained the difference between process scheduling, typically done by the operating system, and job scheduling, done at a higher level. This paper also provided an overview, with some comparisons, of four popular job scheduling strategies: global queue, variable partitioning, dynamic partitioning, and gang scheduling. In the global queue strategy a stand-alone operating system is run on each processor while data structures are shared in memory. Process threads that are waiting are placed in a shared queue that all processors can draw from. These processes are executed for some amount of time and then swapped out and returned to the queue. The variable partitioning strategy partitions the processors into disjoint sets; jobs requiring different numbers of processors are run in the different sized partitions. Dynamic partitioning also starts jobs requiring different numbers of processors in different sized partitions, but these partition sizes can change during run time to adapt to changes in the job's requirements or in the system load. The gang scheduling strategy synchronizes context switching of a job across all processors it is executing on: all of the job's process threads execute at the same time, allowing them to interact with a fine granularity.

The focus of the paper by Feitelson [11] is to provide a detailed comparison of many of the scheduling systems that were available at the time the paper was written. It also explains several of the more common queuing strategies that can be used by the various scheduling strategies: first in first out (FIFO), shortest job first (SJF), largest job first (LJF), and smallest cumulative demand first (SCDF). The first three are very common. In SCDF jobs are given priority based on the number of processors they require and the amount of time they need to run on them. This favors short running jobs that require few processors. There are many permutations of these queuing methods as well.
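These four orderings can be expressed compactly as sort keys over the waiting queue. The Python sketch below does so for a hypothetical job record (arrival time, nodes requested, requested run time); it is meant only to make the definitions concrete:

    from dataclasses import dataclass

    @dataclass
    class Job:
        arrival: float   # submission time
        nodes: int       # processors requested
        runtime: float   # requested wall-clock time

    queue = [Job(0, 64, 3600), Job(1, 1, 120), Job(2, 128, 7200)]

    fifo = sorted(queue, key=lambda j: j.arrival)            # first in first out
    sjf  = sorted(queue, key=lambda j: j.runtime)            # shortest job first
    ljf  = sorted(queue, key=lambda j: -j.nodes)             # largest job first
    scdf = sorted(queue, key=lambda j: j.nodes * j.runtime)  # smallest cumulative demand first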

An important issue in the actual application of the scheduling and queuing algorithms is the evaluation criteria used to decide which is the best solution for a particular computer resource. Finding the “best” solution must be based on the correct criteria. For example, a machine used for a wide range of research activities will most likely have different scheduling requirements than a machine used for financial transaction processing. Squillante [58] and Kleinrock and Huang [34] provide explanations of some typical evaluation methods. In some sense, the goal of all these methods is “fairness”. To most users fairness is simply, “The scheduler should run my job before anyone else’s”. Fairness can mean balancing the load across all the system processors, optimum queue throughput, or quick response time. For this reason it is difficult to quantify and provide fairness to a broad range of users with a scheduling system.

In the development of the EASY scheduler an iterative method was used. The evolving queuing algorithm was made public to the user community; users could watch the queue of jobs and when they were started, and then make suggestions for improvements and point out bugs. In a sense the users designed their own scheduling system. One of the key features the users wanted was exclusive access to the processors they were allocated. On a research machine this was necessary for network and processor performance and for availability of system RAM, cache, and local disk space. For this reason load-balancing was not necessary in the EASY scheduler. What became extremely important was queue throughput.

There are different ways of optimizing queue throughput. One is to minimize average response time, that is, to minimize the time jobs spend waiting in the queue. Another is optimizing the ratio of response time to service time. The idea here is that if you provide very fast response time to all waiting jobs in the queue, service time will suffer on a saturated system. Likewise, if you provide high service time, response time for waiting jobs will suffer. For this reason average wait time is commonly used: how long, between service cycles, a job has to wait until it is completed. Which metric to prefer is somewhat of a “religious war”.
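As a small illustration of these metrics, the following sketch computes them from hypothetical (submit, start, end) timestamps for completed jobs; the field names are assumptions for this example only:

    def avg_response_time(jobs):
        # Average time spent waiting in the queue before starting.
        return sum(j["start"] - j["submit"] for j in jobs) / len(jobs)

    def avg_response_to_service_ratio(jobs):
        # Average ratio of queue wait to actual run (service) time.
        return sum((j["start"] - j["submit"]) / (j["end"] - j["start"])
                   for j in jobs) / len(jobs)

    trace = [{"submit": 0, "start": 10, "end": 70},
             {"submit": 5, "start": 70, "end": 100}]
    print(avg_response_time(trace))  # (10 + 65) / 2 = 37.5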

1.4 Approaches to Job Scheduling

The following four job scheduling approaches were discussed in detail in many of the papers that were reviewed. Often example implementations were used as a basis for comparison. The key issues in these papers were the trade-offs observed with any particular approach and the situations in which these have significant impact.

1.4.1 Global Queue. The global queue can provide automatic load balancing and good response time. It is generally only an option with a shared memory architecture. There are several potential problems. The first is that it does not scale well: as the number of processors accessing the global queue increases, so does queue contention. Second, there is no guarantee that a particular process thread will be scheduled on the same processor every time, so all data in memory, cache, or local disk must be checkpointed whenever a thread is returned to the wait queue. Finally, since there is no coordination of when and where threads will be scheduled, parallel programs, or programs that have threads that must interact with one another, can encounter problems.
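A toy single-threaded sketch of the global queue idea follows; it only illustrates processors drawing from one shared run queue, and omits the locking whose contention causes the scaling problem noted above:

    from collections import deque

    run_queue = deque([("thread-A", 3), ("thread-B", 2)])  # (name, slices left)
    QUANTUM = 1

    while run_queue:
        name, remaining = run_queue.popleft()    # an idle processor draws a thread
        remaining -= QUANTUM                     # run it for one time slice
        if remaining > 0:
            run_queue.append((name, remaining))  # swap out; return to shared queue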

1.4.2 Variable Partitioning. Variable partitioning job schedulers are extremely popular. The idea is to partition the processors into disjoint sets and execute jobs that fit in a given partition until completion. Partitions are set according to the number of nodes requested by a job and then fused when the jobs terminate. The main advantage of variable partitioning is that it eliminates the overhead of context switching. The main disadvantage is fragmentation. This occurs when the processors of a machine have been partitioned for several different size jobs, leaving odd numbers of processors available with no queued jobs that can use them. This problem is bad in theory, but the EASY Backfill Algorithm greatly reduces it in practice, because users quickly realize that a job that can use the currently available processors without causing delay to the first job in the queue will start immediately. Another problem is that queued jobs can experience long wait times in a saturated system. This is particularly bad for users who wish to do short debugging runs. Jobs that either run or fail very quickly are forced to wait as long as jobs that can run for long periods of time. Finally, because jobs are granted exclusive access to any resources they are allocated, poor or inefficient programming can result in poor utilization of the computing resource.

There are many implementations optimized for different environments, and there is a great deal of literature based on analogies to other models that have a particular desired behavior. Many of these are economic approaches based on different bidding and auction mechanisms [9, 14, 59], but there are also some more obscure models, like a comparison to a hydrodynamic system [26]. The goal of the microeconomic approach is to provide “fairness”. When resources become available, an auction is initiated in which interested users (jobs) bid “monetary” funds that increase with time. The client that offers the highest bid is granted the resources. If competition increases, so does the relative price. These systems have been studied using a variety of different auctioning systems that are described in detail with an emphasis on how they affect the overall system efficiency. (A small sketch of this bidding rule appears at the end of this subsection.)

There are some serious problems with the microeconomic approach. First of all, it is not deterministic. That is, users cannot look at the queue and determine when their job will be started. This is complicated by the fact that if a particular job doesn't finish by its predicted time it may be allowed to continue, while being charged for the additional time, as long as it has the funds to pay for it; otherwise it is terminated. Also, as prices fluctuate while jobs are running, jobs may run out of money and not be able to pay for the computation any longer. Second, holding an auction every time resources become available can result in an extremely high scheduling overhead. Third, and most distressing, there is extremely limited discussion of the potential for job starvation, as there would be in a real economy. This is complicated when considering systems that were designed to run large parallel jobs. There was no mention of specialized pricing strategies for parallel jobs. This means that jobs requiring smaller numbers of processors have a huge advantage over massively parallel applications.

There are a few good variable partitioning scheduling systems that realized the need for different sites to implement localized usage policies. These policies can be adjusted to meet the needs of different user communities. Portable Batch System (PBS) [25] is a good example of a system that allows local system administrators to implement their own policy requirements. It was one of the first production quality systems to support an application programming interface (API) so that the scheduling algorithm could be externalized from the other key components of a scheduling system shown in Figure 1.1. The development of the EASY-LoadLeveler system [6], described in the introduction, has expanded upon this idea. An API was developed that can be used between the various scheduling components so that development advances on any of these can quickly be integrated into the overall system.
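To make the bidding rule described above concrete, here is an illustrative sketch (not modeled on any particular system from the literature): each waiting job's bid grows with its waiting time, capped by the funds it holds, and the highest bid wins the freed resources. The linear accrual rate is an assumption for illustration:

    def run_auction(waiting_jobs, now):
        # A job's bid grows the longer it has waited (hypothetical linear
        # rule), capped by the funds it actually holds.
        def bid(job):
            return min(job["funds"], job["rate"] * (now - job["submitted"]))
        return max(waiting_jobs, key=bid, default=None)

    queue = [{"funds": 100, "rate": 2.0, "submitted": 0},
             {"funds": 30,  "rate": 5.0, "submitted": 10}]
    winner = run_auction(queue, now=20)  # bids: min(100,40)=40 vs min(30,50)=30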

1.4.3 Variations of Variable Partitioning. There are several alternative methods of variable partitioning: fixed partitioning, adaptive partitioning, and dynamic partitioning. In fixed partitioning, partition sizes are preset, and all jobs on the machine are reset when the machine is repartitioned. The obvious disadvantage of this approach is fragmentation. In adaptive partitioning, partitions are automatically set by the system according to the current load when the job is submitted. As the system load increases the partition sizes typically start to shrink. This is an obvious problem for a machine that has a user community interested in massively parallel applications. An interesting implementation of adaptive partitioning uses a non-work-conserving strategy [53]. The concept of non-work conservation is to keep some number of processors free (either based on a fixed percentage or based on previously observed behavior), even when there are jobs waiting that could use them, in order to handle anticipated new job arrivals or unexpected system problems. This strategy can result in performance improvements under certain conditions, for example, when the system workload does not scale linearly, when there is a large variability in job submission times, and when the workload is made up of multiple classes with different computational requirements.

disadvantage to this approach is fragmentation. In adaptive partitioning, partitions are automatically set by the system according to the current load when the job is submitted. As the system load increases the partition sizes typically start to shrink. This is an obvious problem for a machine that has a user community interested in massively parallel applications. An interesting implementation of adaptive partitioning uses a non-work conserving strategy [53]. The concept of non-work conservation is to keep some number of processors free (either based on a fixed percentage or based on previously observed behavior) even when there are jobs waiting that could use them to handle anticipated new job arrivals or unexpected system problems. This strategy can result in performance improvements under certain conditions, for example, when the system workload does not scale linearly, when there is a large variability in job submission times, and when the workload is made up of multiple classes with different computational requirements. In dynamic partitioning, partition sizes can change at runtime to reflect changes in job

In dynamic partitioning, partition sizes can change at runtime to reflect changes in job requirements and system load. For example, if the system load increases, i.e., more jobs are started, each parallel job may be required to give back some of the processors it is using on the fly so that other jobs can use them. As system load decreases, jobs that can make use of additional processors may have them allocated on the fly. Tasks are then distributed to the new processor working set. The potential advantages of dynamic partitioning have made it the subject of a great deal of study. In a uniform memory access (UMA) shared memory system the potential task queuing overheads are greatly reduced [58]. Architectures like the Thinking Machines CM5 provide an optimal platform for this approach. Dynamic partitioning has the advantages of no system fragmentation, no overhead for context switching except for redistributing the processors when the load changes, and no CPU waste for a process in the “busy waiting for synchronization” state, due to the fact that process threads can be blocked inexpensively. There are some serious disadvantages to dynamic partitioning. First of all, it does not support popular programming styles such as single program, multiple data (SPMD), message-passing, and data parallel jobs. Further, it can be very difficult for users to implement their parallel programs so that they can dynamically change in size at the system's request. Second, in a saturated system dynamic repartitioning can lead to extensive queuing. Finally, programs developed for this environment are not portable, because each application must be closely tied to its interaction with the operating system.

1.4.4 Gang Scheduling. Gang schedulers address the problem of long wait times in the queue by swapping the processes of two or more parallel jobs in a coordinated fashion across all processors. Since all threads of a parallel job are executed at the same time, the synchronization problems that the global queue approach experiences are prevented. By time slicing several parallel jobs at a time, gang scheduling provides a better average response time and a better average ratio of response time to service time for several types of parallel jobs. This addresses the problem of long wait times as seen in variable partitioning systems.

In a heavily saturated system the debate between variable partitioning and gang scheduling schemes tends to be over which is more desirable, response time or throughput. For simple parallel jobs better response time is advantageous, especially in the case of debugging runs where the job may fail quickly; this minimizes wait times between debugging runs. The main problem with gang schedulers is that it is very time consuming for applications that use large amounts of system RAM or disk space to be swapped out of the system. The potentially high overhead of swapping such a large parallel application can lead to very poor system throughput. Another serious problem with the gang scheduling approach is the actual development of a scheduler that supports it for a wide range of applications. For parallel applications, message passing applications in particular, it is very difficult to synchronize the swapping of all associated processes and to checkpoint all necessary system resources and the communication state of these processes so that the job can be effectively restarted. This problem is particularly difficult on MPPs or cluster-based parallel machines. Without a global shared memory or other globally shared system resource, synchronizing the gang scheduling of jobs across processors is very difficult. Typically, specialized mechanisms that use the system's communication subsystem and associated interrupts must be used.
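The coordinated time slicing can be pictured as a matrix with one row per time slice and one column per processor: every thread of a job sits in the same row, so the whole job runs, or is swapped out, together. A tiny sketch follows (the layout and jobs are invented for illustration):

    # Rows are time slices, columns are processors; all threads of a job
    # occupy one row, so they are context-switched in lockstep.
    schedule = [
        ["A", "A", "A", "A"],  # slice 0: 4-way job A on all processors
        ["B", "B", "C", "C"],  # slice 1: 2-way jobs B and C side by side
    ]

    def running_now(tick):
        row = schedule[tick % len(schedule)]
        return {proc: job for proc, job in enumerate(row)}

    print(running_now(0))  # {0: 'A', 1: 'A', 2: 'A', 3: 'A'}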

CHAPTER II
HOMOGENEOUS RESOURCE SOLUTION

Many parallel systems are made up of homogeneous resources. This usually means that all nodes of the system have the same CPU, amount of RAM, amount of local disk space, and network I/O bandwidth, but it can also include things like available software libraries. Having homogeneous resources greatly simplifies job scheduling, since scheduling a job only requires finding available resources rather than matching specific features of those resources. The development of the EASY system for a homogeneous 128 node IBM SP system was the initial part of this research.

2.1 The EASY Scheduler

The EASY Scheduler was designed to ensure that large parallel jobs (e.g., those requiring most or all of the nodes) have the same chance of running as smaller parallel or serial jobs. To accomplish this a new queuing algorithm was developed. The EASY Backfill Algorithm is essentially a specialized FIFO implementation: when a job reaches the top of the queue, no job below it in the queue will be allowed to start if doing so would delay it any longer than it already is delayed by the jobs running at that time. This is particularly important in massively parallel systems where jobs requiring large numbers of nodes are common. The Backfill Algorithm requires that users specify the number of nodes needed for their job as well as the amount of wall-clock time that they will need. This is similar to the days of mainframe batch computing. The advantage is that the queue is deterministic. At any time users can look at the queue and determine when their job will start. This is a worst-case estimate; jobs on the system often complete before their reserved time is over.

2.1.1 The Backfill Algorithm. The pseudo-code for the Backfill Algorithm is shown in Figure 2.1. The first step is to determine the number of nodes in the system that are in use and available. If the number of available nodes meets the requirements for the first job in the queue, the job is started. If there are not enough available nodes to start the first job, the “shadow-time” and number of “extra-nodes” are calculated. “Shadow-time” is the amount of time that the currently available nodes can be used by other jobs in the queue without causing the first job in the queue any additional delay in starting. It is based on the time at which enough jobs will have completed so that the first job can start. If at the actual “shadow-time” more nodes will be available than are required by the first job, the “extra-nodes” variable is set to the number of these nodes that are not needed. Finally, the rest of the queue is searched to see if any of the other jobs can use the available nodes. In order to do so, a job must either complete before the “shadow-time” or use only the “extra-nodes”.

    Determine number of nodes in use and available;
    if (first job in the queue can run) {
        Start job;
    } else {
        calculate shadow-time and number of extra-nodes based on first job;
        /* attempt to backfill around it */
        for (each job in the ordered list of all other jobs) {
            calculate run-time for this job;
            if (required-nodes <= available-nodes
                && (job completes before shadow-time
                    || required-nodes <= extra-nodes)) {
                /* condition reconstructed from the description above; the
                   source text was truncated after "if (required-nodes" */
                Start job;
            }
        }
    }

Figure 2.1 Backfill Algorithm Pseudo-code
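For readers who want to experiment, here is a runnable Python sketch of the same rule under simplifying assumptions: a single homogeneous node pool and dictionaries for jobs. It is an illustration of the algorithm as described above, not EASY's actual implementation:

    def backfill_step(now, free_nodes, running, queue):
        """Pick jobs to start now without delaying the first queued job.

        running: list of {"nodes": n, "end": t} for executing jobs.
        queue:   ordered list of {"nodes": n, "runtime": t} waiting jobs.
        """
        started = []
        if not queue:
            return started
        head = queue[0]
        if head["nodes"] <= free_nodes:
            return [head]                          # head fits: start it at once
        # Shadow-time: walk running jobs by end time until the head fits.
        avail, shadow_time, extra = free_nodes, None, 0
        for job in sorted(running, key=lambda j: j["end"]):
            avail += job["nodes"]
            if avail >= head["nodes"]:
                shadow_time = job["end"]
                extra = avail - head["nodes"]      # nodes the head won't need
                break
        # Backfill: later jobs may start only if they end before the
        # shadow-time or fit within the extra-nodes.
        for job in queue[1:]:
            if job["nodes"] > free_nodes:
                continue
            if now + job["runtime"] <= shadow_time:
                started.append(job)
                free_nodes -= job["nodes"]
            elif job["nodes"] <= extra:
                started.append(job)
                free_nodes -= job["nodes"]
                extra -= job["nodes"]              # still held at shadow-time
        return started

    running = [{"nodes": 96, "end": 120}]
    queue = [{"nodes": 64, "runtime": 60},   # head: cannot start until t=120
             {"nodes": 16, "runtime": 100},  # ends by t=110 <= 120: backfills
             {"nodes": 16, "runtime": 500}]  # runs past 120 but fits extra-nodes
    print(len(backfill_step(now=10, free_nodes=32, running=running, queue=queue)))  # 2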