Extensible Resource Management For Cluster Computing

Nayeem Islam, Andreas L. Prodromidis†, Mark S. Squillante, Liana L. Fong, Ajei S. Gopal
IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598
†Computer Science Department, Columbia University, New York, NY 10027

The resource requirements and processing characteristics of parallel scientific and commercial applications are quite diverse. In this paper we present a new resource management system for allocating resources among such applications in general-purpose distributed-memory parallel computers. Our system, Octopus, consists of several fundamental components, including a hierarchical software architecture and a new technique called flexible dynamic partitioning. The hierarchical software architecture supports the co-existence of multiple independent scheduling strategies accommodated in separate partitions in a fault-tolerant manner, whereas flexible dynamic partitioning is responsible for managing the resources among the partitions. The system also includes a portable gang scheduling approach that exploits logical clocks for efficient multi-node context switching, and a practical load-sharing policy that exploits theoretical scheduling results. Based on our system design, we have implemented a prototype scheduler that incorporates space-sharing, gang-scheduling and load-sharing strategies together with a fault-tolerance strategy. Octopus evaluates incoming applications and maps them to the scheduling strategies that best fit their resource requirements and processing characteristics given the state of the system. Furthermore, Octopus is extensible, in that new scheduling strategies can be easily added, and fault-tolerant, in that it can recover from node failures. We include performance results of Octopus running on the IBM SP2 under different workloads, which show that our approach improves the performance of a variety of parallel applications within a cluster.

1 Introduction

The class of parallel scientific and commercial applications is extremely large and diverse. These applications exhibit a wide range of speedup and efficiency characteristics. The communication and synchronization traits among the constituent tasks of such parallel applications are equally diverse, with one extreme consisting of fine-grained tasks that frequently communicate and synchronize, and the other extreme consisting of coarse-grained tasks that are independent. Furthermore, current and expected workloads for large-scale scientific and commercial computing environments consist of various mixtures of parallel applications with very different resource requirements [10, 15]. Moreover, the failure characteristics of distributed computing environments place additional reliability and availability demands on the system. Parallel computer systems, such as the IBM SP2, must support this wide variety of parallel applications.

In this paper, we present a resource management system based on a hierarchical software architecture that supports and integrates multiple scheduling strategies and a fault-tolerance strategy. Our hierarchical scheduler design addresses two important problems for general-purpose, distributed-memory parallel systems: how to map incoming applications to an appropriate strategy and how to efficiently manage the resources used in each strategy. In our design, each strategy runs in its own partition and resources are dynamically reallocated among strategies using a mechanism called flexible dynamic partitioning (FDP). Our resource management system is extensible and is designed using object-oriented frameworks.


This structure facilitates the incremental modification of existing code to modify or create strategies that accommodate the demands of emerging applications. Our resource management system is based on a collection of new scheduling strategies that provide significant advantages over existing methods for managing resources at all levels of the system hierarchy in a controllable manner. A prototype implementation of our resource management system has been developed, with which we have conducted numerous experiments. Our experimental results show that the performance of the general-purpose parallel system is improved by our scheduling strategies and mechanisms, and the reliability is improved by making system services and the scheduler itself fault-tolerant.

1.1 Background and Motivation

Several hierarchical schedulers have been discussed in the literature [12, 13]. Although these systems also provide operating system infrastructures for creating scheduling algorithms, our work differs in that we consider the support of parallel and sequential applications for distributed-memory parallel systems, we consider extensions to be a fundamental aspect of our architecture, and we integrate fault tolerance into our system design. Similarly, most batch schedulers are single-strategy schedulers, they are not designed to be extensible, and they are not fault tolerant. Messiahs [5] is a system that provides a toolkit for experimenting with different global scheduling algorithms. It has a fixed set of algorithm types to choose from, it does not consider fault tolerance, and it is not geared towards parallel applications for distributed-memory parallel systems.

Although there has been little work on general frameworks for parallel scheduling in distributed-memory parallel systems, there has been considerable research on scheduling strategies for such architectures. These scheduling strategies support different types of applications, and each differs in the way nodes are shared among the jobs submitted for execution based on their characteristics. However, in most parallel systems, these scheduling strategies are deployed independently. In the remainder of this section, we present an overview of several important and general strategies for scheduling different classes of parallel applications, and we discuss how our scheduling strategies for each of these classes differ from existing schemes and how they provide good performance for a wide range of parallel applications.

Scheduling strategies for the class of fine-grained applications have received considerable attention in the research literature. There are three basic scheduling strategies for these applications, namely those which space-share the nodes by partitioning them among different parallel jobs, those which time-share the nodes by rotating them among a set of jobs, and those which combine space-sharing and time-sharing.

Several space-sharing strategies have been proposed and evaluated. The static partitioning of the nodes into disjoint sets has often been employed in a number of production systems due in part to its low system overhead and its simplicity from both the system and application viewpoints. However, this scheme can lead to relatively low system throughputs and resource utilizations under nonuniform workloads [38, 31, 35, 39]. System performance can be improved by adaptively determining the number of nodes allocated to a job based on the system state when the job arrives or when another job completes execution [24, 49, 31, 37]. The performance benefits of this adaptive partitioning approach, however, can be limited due to its inability to adjust scheduling decisions in response to subsequent workload changes. These potential problems can be alleviated under dynamic partitioning, which is similar to adaptive partitioning with the addition that the number of nodes allocated to a job can also be modified throughout its execution. In fact, dynamic partitioning can provide the best system performance for a wide variety of application workloads, especially those with lower efficiencies and/or those with less variable service time requirements [45, 49, 14, 26, 31, 27]. This is because the dynamic policy can maintain very efficient node utilizations by adjusting node allocation according to workload changes.

On the other hand, studies [31, 40, 18] have demonstrated that the overheads associated with dynamic partitioning in distributed environments can generally limit and/or eliminate its potential performance benefits (due to factors such as data/job migration, node preemption/coordination, and reconfiguration of the application), and that short jobs should not be repartitioned. Our Flexible Dynamic Partitioning mechanism addresses these performance issues in part by employing a smoothing technique, whereby a minimum interval of time is maintained between changes in node allocations, and by batching reallocations. FDP also provides triggering events and control functions, which are both programmable and extensible, to manage when and how nodes are moved among the partitions.

By time-sharing all of the resources among a set of jobs, the system can ensure that all jobs gain access to the system resources within a relatively short period of time, and thus it can be particularly suitable for tasks with small processing requirements. It can also provide the best system performance in the limiting extreme of linear-speedup workloads, as all node allocations are equally efficient and the fully parallel allocation reduces job execution time [40]. On the other hand, it is most often the case that the speedup of parallel jobs does not increase linearly with the number of processors allocated to the job [7, 23], which results in poor utilization of the resources in time-sharing systems [23, 43, 44, 47]. Moreover, the context-switching overheads associated with time sharing can generally limit and/or eliminate its potential performance benefits, particularly for application instances with large data sets. For these reasons, there appears to be general agreement that a pure time-sharing policy is not suitable for general-purpose parallel computing environments.

Several strategies that combine space-sharing together with time-sharing, in what is often called gang scheduling, have therefore received considerable attention in the research literature [11, 6, 14, 43, 44]. This class of scheduling policies, however, must address the various context-switching overheads and fragmentation issues in order to realize the potential performance benefits of gang scheduling. In this paper, we present a new and portable approach to gang scheduling that integrates flexible dynamic partitioning together with an efficient and controllable mechanism for multi-node simultaneous context switching in a fully distributed and independent manner. This particular combination of the space-sharing and time-sharing components of our resource management system provides the benefits of both of these scheduling strategies. The control features of our flexible dynamic partitioning and distributed time-sharing strategies are used to achieve the desired level of performance while reducing the overheads associated with gang scheduling in large-scale parallel environments.

The class of coarse-grained parallel applications, e.g., those found in data mining and computational chemistry, has also received considerable attention in the research literature. These applications consist of various independent processing elements; scheduling schemes for this class assign the tasks of arriving jobs to particular nodes, and each node independently and concurrently executes the tasks assigned to it [9]. Load sharing is a general scheduling strategy that addresses the requirements of these applications.
This strategy can also accommodate the non-negligible fraction of sequential applications that are part of any general-purpose computing environment [34]. Many different approaches have been employed to allocate resources under this scheduling strategy, each with varying and limited degrees of success [9]. However, an optimal scheduling strategy for this parallel environment has recently been established, and it has been shown to provide significant performance improvements over other methods for coarse-grained applications [25, 41, 19, 42]. Nevertheless, there has been no previous implementation of, nor experimentation with, a practical scheduling strategy that is based on these theoretical results. Our prototype scheduler combines our gang scheduling and load sharing strategies into a single resource management system.

In contrast to shared-memory machines where processor failures are linked, the individual nodes of a distributed system may fail independently. The failure of a single node running an essential system service, such as resource management, would be extremely detrimental to the total system availability. It is therefore imperative to exploit the loose coupling of the nodes to implement fault-tolerant system services and to exploit these services in order to make the resource scheduler itself fault-tolerant.

By fault-tolerant we mean that the scheduler can recover from faults even when the nodes that it uses fail. Since the scheduler has timely information on resource availability, it can also facilitate an application's recovery from such failures. Although several job scheduling systems exist, such as the IBM LoadLeveler, few have integrated fault-tolerance features.

1.2 Objectives and Contributions

Our objective in this paper is to present the design of a new resource management system, called Octopus, that addresses important issues for resource scheduling in large-scale and distributed parallel systems. Fundamental aspects of our approach are the combination of the space-sharing, gang-scheduling and load-sharing strategies into a coherent and efficient system, the dynamic partitioning of system resources among multiple scheduling strategies, and the integration of these scheduling strategies with a fault-tolerance strategy. The system effectively supports the diverse resource requirements of different parallel scientific and commercial applications in a controllable manner to achieve the desired levels of system and user performance. Our resource management system is based on a design that explored scheduling and fault-tolerance services for large-scale distributed and parallel computing environments [8]. This system has many diverse aspects and it makes numerous contributions in the areas of system architecture and scheduling strategies:

1. a hierarchical scheduling architecture that supports and integrates the dynamic and concurrent existence of multiple resource management strategies in large-scale and distributed parallel systems

2. an extensible design, so that a user and/or system administrator can modify or replace different aspects of the scheduling system, such as scheduling and fault-tolerance policies, control functions and triggering events

3. the integration of fault tolerance with resource scheduling, so that the system is itself fault-tolerant and also provides fault-tolerance services to applications, both of which are important requirements in large-scale and distributed environments

4. a new mechanism, FDP, for dynamically adjusting the partitioning of resources at all levels of the system hierarchy in a controllable manner

5. a novel and portable gang scheduling approach that exploits logical clocks for multi-node simultaneous context switching and that provides efficient, controllable time-sharing in a fully distributed manner, while eliminating the potential bottleneck and overheads of centralized approaches

6. a new load-sharing scheduling strategy that efficiently yields the optimal performance for coarse-grained parallel applications

Octopus is, to our knowledge, the first resource management system that includes each of these fundamental properties.

1.3 Organization

The remainder of this paper is organized as follows. The architecture of Octopus is presented in Section 2. Our general approach to managing partitions is described in Section 3, and our resource management strategies are presented in Section 4. Implementation details of Octopus are described in Sections 5 and 6. Performance measurements are presented in Section 7, including the overheads of the basic resource management strategies and the performance of applications. Our conclusions are discussed in Section 8.

2 Resource Management Architecture

In this section, we describe the system architecture of Octopus and its extensibility features. The top level of the hierarchical software architecture consists of a set of non-overlapping and independent domains into which the system administrator divides the available system resources. The computing nodes of each domain can then be divided into multiple disjoint and non-overlapping partitions. In this paper, a computing "node" constitutes the smallest unit of resource allocation. Similarly, each partition can be recursively refined into smaller, non-overlapping subpartitions. The bottom level of the hierarchy consists of single-node partitions.

Each domain is controlled by a domain-level scheduler (DLS), the "global" distributed subsystem responsible for managing and allocating the set of resources in the domain. The DLS maintains a list of all the computing nodes within its control, it knows about the scheduling strategies supported by its children partitions, and it defines the fault-tolerance characteristics for the entire domain (since different domains may support different levels of fault tolerance). Each DLS is responsible for the manner in which arriving applications are assigned to the partitions of the domain. Applications may be submitted to the system through a launcher, such as a shell. The user, via this launcher, can specify to the DLS the application's resource and scheduling requirements (e.g., memory, number of nodes, etc.), and the DLS can decide which particular scheduling strategy is best for that job. If no suitable partition is available when a job arrives (e.g., there are a sufficient number of partitions, each of which already has enough work assigned to it), the DLS may enqueue the job for later consideration.

Each DLS also defines and dynamically adjusts the resources allocated to the partitions it manages, within constraints specified by the system administrator. To facilitate the general resource management of partitions at all levels of the system hierarchy, we have introduced a new scheduling mechanism called flexible dynamic partitioning (FDP), the details of which are described in Section 3. The FDP mechanism determines when to re-assign resources based on triggers (possibly including system-administrator-defined triggers), and provides a re-allocation function to determine how nodes are dynamically re-assigned between partitions. Each DLS uses the FDP mechanism to maintain the sizes of its partitions within their specified ranges so that the reallocation function is satisfied.

Each partition is controlled by a partition-level scheduler (PLS), a distributed subsystem much like the DLS that manages all of the resources in the partition according to the particular scheduling strategy associated with the partition. The PLS for each partition (see Figure 1) provides the following functions:

1. the mapping of all the applications it receives from a higher-level scheduler to a lower-level scheduler within its partition

2. the management of its subpartitions using FDP when necessary

3. any coordination among its subpartitions required to implement the scheduling strategy associated with the partition; for example, implementations of gang scheduling require that the scheduler implement coordinated context switching

The mapping function depends upon a number of important factors, including the characteristics of the parallel application, the current state of the subpartitions, and the current workload of the partition.
A good characterization of the parallel application by the user via the launcher will result in a more effective mapping of the application to a lower-level scheduler. Each subpartition may itself be recursively refined into smaller, non-overlapping subpartitions. A special node-level scheduler (NLS) manages the single-node subpartitions at the leaves of the scheduling architecture hierarchy. Each NLS is responsible for running the processes of an application on its node by allocating the node-level system resources to the active processes assigned to it by the parent PLS. The NLS implements the scheduling policies and system attributes set by its PLSs and DLS.


Figure 1: Generic partitions and sub-partitions

When the NLS is assigned a task by its parent PLS, the NLS inserts the task into the local dispatcher queue in a location that satisfies the PLS scheduling requirements. For example, an NLS in a gang scheduling partition inserts the task in coordination with the other NLSs in the partition.

2.1 Resource Monitoring

Several components of the system, including the FDP mechanism, require the use of resource monitoring to control resource management decisions. To assist in the collection of resource-usage information, each NLS also supports a resource monitor module (RM). The RM monitors a number of important measures (including execution times of applications, memory usage and effective resource utilization for each node) and exchanges this information with its parent PLS as part of their regular communication. Resource usage information may propagate up the scheduler hierarchy. In the case of communication inactivity, this resource information is transmitted periodically at pre-set intervals, which are controlled by the system administrator and customizable per partition.
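As a concrete illustration of this exchange, the per-node report and its transmission policy might be organized along the following lines. This is only a sketch: the type and member names (NodeUsage, ResourceMonitor, the callback used to reach the parent PLS) are illustrative assumptions rather than the actual Octopus interfaces.

    #include <chrono>
    #include <functional>
    #include <utility>

    // Hypothetical per-node usage record collected by a resource monitor (RM).
    struct NodeUsage {
        double cpuUtilization;      // fraction of time the node was busy
        double memoryUsedMB;        // memory footprint of the local tasks
        double meanTaskRuntimeSec;  // mean execution time of completed tasks
        double taskRuntimeVariance; // variance of those execution times
    };

    // Sketch of an RM that piggybacks usage data on regular NLS->PLS traffic
    // and falls back to periodic reports when the link has been idle too long.
    class ResourceMonitor {
    public:
        ResourceMonitor(std::chrono::seconds reportInterval,
                        std::function<void(const NodeUsage&)> sendToParentPLS)
            : interval_(reportInterval), send_(std::move(sendToParentPLS)) {}

        // Called whenever a regular NLS->PLS message is being sent anyway.
        void piggyback(const NodeUsage& usage) {
            send_(usage);
            lastReport_ = std::chrono::steady_clock::now();
        }

        // Called from a periodic timer; reports only if the link was idle.
        void onTimer(const NodeUsage& usage) {
            auto now = std::chrono::steady_clock::now();
            if (now - lastReport_ >= interval_) piggyback(usage);
        }

    private:
        std::chrono::seconds interval_;                     // administrator-set, per partition
        std::function<void(const NodeUsage&)> send_;        // transport to the parent PLS
        std::chrono::steady_clock::time_point lastReport_{};
    };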

2.2 Parallel Application Structure

Many parallel applications are written such that the number of nodes allocated to them can only be set when they start execution. However, it is desirable for a parallel application (if it runs under a scheduler that supports dynamic partitioning) to be able to handle, at any time during its execution, fewer or more nodes than it was initially allocated. Applications that are able to react to such changes are reconfigurable or fault-tolerant applications. Reconfigurable and fault-tolerant applications differ in the sense that the former can handle the loss of nodes only if they are notified in advance, whereas the latter may survive even if the node loss precedes the notification. Note that a fault-tolerant application is also reconfigurable. Existing and standard parallel applications that cannot be reconfigured are defined here as legacy applications.

In this paper we consider two types of reconfigurable applications, the bag-of-tasks and the symmetric. Bag-of-tasks applications are centralized applications consisting of a coordinator process with a set of worker processes; the coordinator assigns to each worker a set of tasks to work on. Symmetric applications are distributed applications with only worker processes; all decisions, coordination or ordering, if needed, are distributed and based on node numbering. Each reconfigurable application consists of an application-level manager (ALM) and a set of worker processes, as shown in Figure 2. When an application starts, it spawns an ALM process and a set of worker processes.


Figure 2: Structure of a reconfigurable and fault-tolerant parallel application

The ALM is an important component of our resource management system. It serves as the point of contact between the system and the application. For example, in node allocation changes resulting from either a node fault or a reconfiguration (where the application gains or loses nodes due to changes in the subpartition size), the PLS and an application's ALM communicate and work together to handle the necessary allocation changes. For a reconfigurable application that is able to handle such node allocation changes, its ALM must coordinate and work with the application processes to appropriately respond to node changes. A fault-tolerant application will react to lost nodes by rolling back to a previous checkpoint when its ALM receives notification from the PLS that some of its allocated nodes have been declared dead. Since there is no single recovery technique that is applicable to all applications, each ALM can tailor its specific recovery strategies to a particular class of applications and orchestrate the actual recovery.

2.2.1 Application and Scheduler Interactions

Figure 3 shows the salient interactions between the application and various parts of the resource management system.

1. Launcher to DLS: Register the application and give its profile to the DLS.

2. DLS to PLS: Create a partition with appropriate attributes or modify an existing partition. Assign the application to a partition. Apply FDP to re-allocate resources between partitions.

3. PLS to NLS: Customize the appropriate NLSs, and assign processes to individual NLSs.

4. DLS to Launcher: Resources granted, and application completion result.

5. PLS to ALM: Notification of the application's current resource allocation. The same notification mechanism is used for node death and node reallocation.

6. ALM to PLS: Modify resource allocation requests. This is applicable to reconfigurable applications, which may desire to expand or reduce the resources currently allocated.

7. NLS to PLS: Process completion or death notification.

8. RM to PLS: Resource usage information sent periodically in case such information is not transmitted as part of the regular communication between NLS and PLS.

9. PLS to DLS: Transmit partition resource utilization to the DLS.


Figure 3: Application interactions with the system

2.3 Scheduler Extensibility

Our resource management system is extensible, in that it supports multiple scheduling strategies and permits the easy implementation and incorporation of new or refined scheduling strategies. Octopus is designed using object-oriented frameworks [22, 2, 16, 3, 17, 46]. This makes it possible, for example, to easily customize the DLS(s) to the needs of each computing domain. Similarly, the various aspects of PLSs can be customized to meet the needs of the different types of partitions that concurrently exist in the system. The launcher/DLS interface is also extensible, thus permitting future applications to specify requirements (to an appropriately extended DLS) that are as yet unidentified. Partitions with new scheduling policies are created by specializing two independent interactions: between a parent PLS and a child PLS, and between a PLS and its NLSs. The default classes were designed with considerable support code, and few changes are required to implement new schedulers. In fact, we were able to reuse parts of the code quite extensively.

2.3.1 Customizing parent PLS - child PLS interactions

In order to determine which partition is best suited to handle a particular application, the parent PLS invokes the function findResourceMatch(Application*) of each child PLS. In case there are multiple appropriate PLSs, the parent PLS chooses one of them. The children PLSs customize the findResourceMatch method and hence direct the parent PLS with respect to the type of applications they can support most effectively.

After choosing the appropriate PLS, the parent PLS forwards the application to the child PLS by calling the takeApplication(Application*) function of that PLS. Partition management is implemented through three methods: manage(), takeResource() and giveResource(). The manage() method removes a node from one sub-partition and gives it to another sub-partition; it determines when to invoke partition management functions, what runtime parameters to use in the decision, and which resources are taken from one child partition to be given to another. The manage() method invokes the takeResource() method of the partition that loses a node and the giveResource() method of the partition that gains that node. These actions may percolate up and down the tree. The sendResourceUsage() method is used to pass resource usage information to a parent partition. The implementation of these methods, as well as their implications, depends upon the scheduling strategy supported by the PLS. Note that the DLS-PLS interaction is a special case of a parent PLS - child PLS interaction.
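The interactions above can be summarized in an abstract partition-scheduler interface roughly as follows. This is a sketch for exposition: the return types, the placeholder Application and Node types, and the pure-virtual structure are assumptions, since only the method names and their roles are specified here.

    #include <vector>

    struct Application;  // placeholder for the application descriptor
    struct Node;         // placeholder for a compute node

    // Sketch of the interface used between a parent partition and its children.
    class PLS {
    public:
        virtual ~PLS() = default;

        // Child reports whether (or how well) it could serve this application;
        // each child customizes this so the parent can pick the best partition.
        virtual bool findResourceMatch(Application* app) = 0;

        // Parent hands the chosen child the application to schedule.
        virtual void takeApplication(Application* app) = 0;

        // Partition management: decide whether to move resources among children,
        // then take a node from one child and give it to another.
        virtual void manage() = 0;
        virtual Node* takeResource() = 0;           // release one node to the parent
        virtual void giveResource(Node* node) = 0;  // accept one node from the parent

        // Propagate aggregated usage information to the parent partition.
        virtual void sendResourceUsage() = 0;

    protected:
        std::vector<PLS*> children_;  // child partitions managed by this PLS
    };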

2.3.2 Customizing PLS-NLS interactions

For each new partition, we have created a new subclass of PLS with three methods overloaded, namely plsSchedule(), searchNodes() and launchApplication(). Method plsSchedule() determines when a particular application should be assigned to a set of nodes. Method searchNodes() is responsible for the mapping of an application to a set of nodes; it is specialized to consider different factors, e.g., load and proximity to devices. Method launchApplication() sends the application to the chosen nodes; it is specialized to filter the amount of information that is given to the NLS. For example, in load sharing, each NLS is only told about the application's processes that it must run. To implement gang scheduling, however, the method is specialized to also send to the NLS a synchronization message indicating when to run a group of applications.
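A new partition type is then obtained by overriding these three hooks. The fragment below sketches the shape of such a specialization for a load-sharing partition; the base-class signatures and the behavior indicated in the LoadSharingPLS comments are illustrative assumptions, not the prototype's actual code.

    #include <vector>

    struct Application;  // placeholder application descriptor
    struct Node;         // placeholder compute node

    // Minimal base exposing the three hooks that each new partition overrides.
    class BasicPLS {
    public:
        virtual ~BasicPLS() = default;
        virtual bool plsSchedule(Application* app) = 0;               // when to schedule
        virtual std::vector<Node*> searchNodes(Application* app) = 0; // which nodes
        virtual void launchApplication(Application* app,
                                       const std::vector<Node*>& nodes) = 0;  // dispatch
    };

    // Hypothetical load-sharing specialization: schedule immediately, pick the
    // least-loaded nodes, and tell each NLS only about its own processes.
    class LoadSharingPLS : public BasicPLS {
    public:
        bool plsSchedule(Application*) override {
            return true;  // load sharing places tasks as soon as they arrive
        }
        std::vector<Node*> searchNodes(Application* app) override {
            (void)app;
            // Would rank nodes_ by estimated backlog and return the best subset.
            return nodes_;
        }
        void launchApplication(Application* app,
                               const std::vector<Node*>& nodes) override {
            (void)app; (void)nodes;
            // Would send each chosen NLS only the processes it must run; a gang
            // scheduling PLS would additionally send a synchronization message.
        }
    private:
        std::vector<Node*> nodes_;
    };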

3 Management of Partitions using FDP

The flexible dynamic partitioning (FDP) mechanism of our resource management system is used to dynamically adjust the partitioning of resources at all levels of the hierarchical software architecture. The FDP mechanism employs triggers to determine when to invoke FDP functions. Example triggers include timers, application entry and exit events, and resource utilization changes. A trigger is said to fire if it is in a specific range. When triggers fire, the FDP mechanism invokes a re-allocation method called manage(), which evaluates the current status of the partitions under its control using the evaluate() method. The evaluate() method is a computable function that the partition designer uses to evaluate how to allocate resources between sub-partitions. If FDP decides that resources in its sub-partitions should be re-allocated, it then invokes the getResource() method to take a resource from one partition and the giveResource() method to give the resource to another partition.

This approach is a generalization of dynamic partitioning with smoothing [18, 20], an instance of FDP that uses timers as triggers. In this paper, we refer to the interval in which FDP timers fire as a smoothing interval, and we often use the term FDP with smoothing to mean that a timer trigger is being employed. For example, enforcing a minimum interval of time between repartitions limits the number of repartitions, decreases the total cost of repartitioning by batching multiple repartitions together, and prevents potential repartition thrashing.

The implementation of Octopus uses the following triggers: application entry events, application exit events, timers, resource utilization and application efficiency. Depending on the trigger type, some information is obtained through resource monitoring or is provided by an application programmer. Recall that when one of the triggers fires, the DLS/PLS invokes the manage() method. The DLS/PLS then attempts to divide the nodes into partitions according to the reallocation function, based on the corresponding information provided by the resource monitoring components of the system and on the application attributes.

We now consider a specific example of the DLS or a PLS using FDP to manage multiple partitions. The system administrator could specify a threshold $T_{i,j}$ for the load differential $\Delta_{i,j} \equiv \| U_i - U_j \|$ between the utilizations of each pair of partitions $i$ and $j$ (the same threshold can be used for all partitions, i.e., $T_{i,j} = T$); the utilization of a partition can be approximated by averaging the utilizations of the nodes in the partition. The DLS/PLS then attempts to maintain each partition size within its specified range so that the pairwise differences in partition utilizations do not exceed their specified thresholds (i.e., $\Delta_{i,j} \le T_{i,j}$). Whenever $\Delta_{i,j} > T_{i,j}$, the DLS/PLS moves resources from underutilized partitions to overutilized partitions, subject to the partition constraints, using our FDP mechanism. The repartitioning of resources among the partitions occurs dynamically as the DLS/PLS periodically re-evaluates the state of the system and the load differentials.
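A minimal sketch of this evaluate/manage cycle is shown below, assuming a single global threshold $T$, one-node moves per repartition, and a timer-style smoothing check; the Partition record and the class layout are illustrative only.

    #include <chrono>
    #include <cstddef>
    #include <vector>

    // Simplified view of a partition for the purposes of this FDP example.
    struct Partition {
        double utilization;  // averaged over the nodes of the partition (monitored)
        int    nodes;        // current partition size
        int    minNodes;     // lower bound set by the system administrator
    };

    // Sketch of flexible dynamic partitioning with one threshold T and a
    // smoothing interval: when a trigger fires, evaluate the pairwise load
    // differentials and move one node from the most underutilized partition
    // to the most overutilized one, subject to the minimum-size constraints.
    class FDP {
    public:
        FDP(double threshold, std::chrono::seconds smoothing)
            : T_(threshold), smoothing_(smoothing) {}

        void onTrigger(std::vector<Partition>& parts) {
            auto now = std::chrono::steady_clock::now();
            if (now - lastRepartition_ < smoothing_) return;  // batch reallocations
            if (evaluate(parts)) {
                parts[donor_].nodes -= 1;     // getResource() from the donor partition
                parts[receiver_].nodes += 1;  // giveResource() to the receiver partition
                lastRepartition_ = now;
            }
        }

    private:
        // Returns true if some pair (i, j) violates |U_i - U_j| <= T and a legal
        // donor/receiver pair exists; records the chosen pair.
        bool evaluate(const std::vector<Partition>& parts) {
            double worst = T_;
            bool found = false;
            for (std::size_t i = 0; i < parts.size(); ++i) {
                for (std::size_t j = 0; j < parts.size(); ++j) {
                    double diff = parts[i].utilization - parts[j].utilization;
                    if (diff > worst && parts[j].nodes > parts[j].minNodes) {
                        worst = diff;
                        receiver_ = i;  // overutilized partition gains a node
                        donor_ = j;     // underutilized partition loses a node
                        found = true;
                    }
                }
            }
            return found;
        }

        double T_;
        std::chrono::seconds smoothing_;
        std::chrono::steady_clock::time_point lastRepartition_{};
        std::size_t donor_ = 0, receiver_ = 0;
    };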

4 Resource Management Strategies

Another fundamental aspect of our resource management system is a set of gang-scheduling, load-sharing and fault-tolerance strategies. In this section, we describe these strategies.

4.1 Gang Scheduling

Gang scheduling combines the advantages of space-sharing and time-sharing. Our approach can be conceptually viewed as a generalization of the global scheduling matrix originally proposed by Ousterhout [32]. Each column represents one node and each row represents one time-slice. In our approach, the $P_{GS}$ columns (where $P_{GS}$ is the number of nodes reserved for the gang scheduling partition) are divided into $K$ disjoint subpartitions, or groups. The number of columns in the $k$th subpartition represents the number of nodes allocated to the parallel applications assigned to that subpartition, and the number of rows for the $k$th subpartition represents its degree of multiprogramming $D_k$, $1 \le k \le K$. Under gang scheduling, the resource management system schedules and executes all processes of an application at approximately the same time, which implies that all nodes within a subpartition context switch simultaneously. Each row in the global matrix represents the mapping of a set of applications to a set of subpartitions, where each application is mapped to only one subpartition. The time-slice, or quantum, length for each row of the $k$th subpartition is denoted by $T_{k,i}$, $1 \le i \le D_k$, $1 \le k \le K$. These quantum lengths can differ across subpartitions and are independent of one another, which provides additional flexibility to optimize various performance objectives. Moreover, for each quantum allocated to larger job classes, our approach supports the allocation of multiple quanta to smaller job classes (i.e., smaller jobs can be placed in multiple rows of a subpartition of the matrix) so that the overall mean response times are reduced and small jobs receive efficient response times, both in a controllable manner [44]. The overall gang scheduling matrix is shown in Figure 4.

This organization of the gang-scheduling matrix is due in part to our distributed gang-scheduling design, which is in stark contrast to previous methods that employ centralized, tightly-coupled and/or hardware-based control mechanisms. Our approach consists of a general, software-based mechanism that is used to provide coordinated context switching across the system nodes. Specifically, each node independently time-slices among the jobs allocated to it according to its local logical "time", and synchronized clocks are used to maintain consistent time across the nodes [29]. Since each node switches independently based on its local clock, the actual overhead of a simultaneous multi-node context switch in our system is no more expensive than a local context switch. By having a fixed and independent degree of time-slicing for each subpartition, node fragmentation is significantly reduced and nodes can execute their tasks independently and efficiently.


Figure 4: The gang scheduling matrix. The $P_{GS}$ columns correspond to the nodes $N_1, \ldots, N_P$ and are grouped into $K$ sub-partitions; $D_k$ denotes the number of time slices (rows) of sub-partition $k$, and the time-slice intervals $T_1, T_2, \ldots$ mark the context-switch points.

The local clocks can be kept synchronized by employing any one of the many clock synchronization methods that have already been implemented. This structure also considerably reduces the complexity of maintaining the gang-scheduling matrix as jobs enter and, especially, when jobs leave the system.

An instance of the FDP mechanism is used to maintain a proper value for $K$. The smoothing interval and the trigger firing between repartition epochs should be appropriately coarse: neither too small, since repartitioning in this case implies a reconstruction of (portions of) the gang-scheduling matrix, nor too large, in order to remain responsive to workload changes. The smoothing parameter can be dynamically adjusted as a function of the $D_k$ values. The repartition overhead can be reduced by decreasing the number of reconfigurations required for each subpartition, by taking advantage of the natural draining of the columns due to departures, and by performing all reconfigurations for each subpartition in parallel. Additional context-switching overheads (e.g., memory and cache effects) can be kept relatively low by using sufficiently coarse quantum lengths; in the case of memory overheads, block paging and prefetching mechanisms can also be used to significantly reduce the overheads incurred [48].

During each smoothing interval (in which $K$ is fixed), the system can maintain a high level of performance and respond to rising load conditions by increasing the values of $D_k$, subject to memory constraints and the current utilization of the system. When a job in the $k$th subpartition departs, the PLS assigns a waiting job to this subpartition and the value of $D_k$ remains the same. If there are no waiting jobs, however, the value of $D_k$ is decremented independently in the PLS and the corresponding NLSs. The maximum value of $D_k$ is defined in terms of the application workload efficiency, the variability of workload service demands, and the system load. The allocation of resources between space-sharing subpartitions (i.e., those with $D_k = 1$) and time-sharing subpartitions (i.e., those with $D_k > 1$) in our integrated gang-scheduling approach is dynamically adjusted in response to changes in the system state.
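One way to realize the distributed context-switch rule locally is sketched below: each NLS derives the active row of its subpartition purely from the shared notion of time, so no coordination messages are needed at switch time. The single quantum per subpartition, the agreed schedule epoch, and the use of std::chrono::steady_clock as a stand-in for the synchronized clock are simplifying assumptions.

    #include <chrono>
    #include <cstdint>

    // Sketch of the distributed gang-scheduling switch rule. All nodes of
    // subpartition k share the same multiprogramming level D_k, quantum and
    // schedule epoch; because their clocks are kept synchronized, evaluating
    // this function locally on every node yields the same row at (almost)
    // the same time.
    struct SubpartitionSchedule {
        std::chrono::milliseconds quantum;   // time-slice length (assumed positive)
        int degree;                          // D_k: number of rows
        std::chrono::steady_clock::time_point epoch;  // agreed start of the schedule

        // Row of the gang-scheduling matrix whose job should be running now.
        int activeRow(std::chrono::steady_clock::time_point now) const {
            auto elapsed =
                std::chrono::duration_cast<std::chrono::milliseconds>(now - epoch);
            std::int64_t slice = elapsed.count() / quantum.count();
            return static_cast<int>(slice % degree);
        }
    };

    // Example NLS loop fragment: switch the local dispatcher only when the
    // computed row changes, so a "multi-node" switch costs one local switch.
    void nlsTick(const SubpartitionSchedule& s, int& currentRow) {
        int row = s.activeRow(std::chrono::steady_clock::now());
        if (row != currentRow) {
            currentRow = row;
            // dispatchRowLocally(row);  // hypothetical: run the processes of this row
        }
    }

Because the row is a pure function of the clock and the fixed schedule parameters, evaluating it independently on every node of a subpartition yields the same answer at essentially the same instant, which is why a simultaneous multi-node switch costs no more than a local context switch.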

Pure Space Sharing and Pure Time Sharing

Our gang scheduling partition naturally reduces to a pure dynamic space-sharing partition under FDP when $D_k = 1$ for all values of $k$ within the gang partition. In this case, the FDP mechanism is used to maintain the proper subpartition sizes depending on the applications in the system. Similarly, our gang scheduling partition naturally reduces to a pure time-sharing partition when $K = 1$.

4.2 Load Sharing

Our resource management system also introduces a new practical load-sharing strategy that is based on previous theoretical scheduling results [25, 36, 41, 42], and that provides the optimal performance for the class of coarse-grained and (legacy) sequential applications. This approach consists of a two-phase method that first determines the best set of nodes on which to schedule the tasks of each job arrival, and then assigns the job's tasks among these nodes to achieve the best performance. Each node independently and concurrently executes the tasks assigned to it.

To help clarify the detailed behavior of our load-sharing strategy, consider the case where a job $J$, consisting of $N_J$ tasks, arrives to the system, and assume that the scheduling policy decides to allocate its tasks on the $L$ least-loaded nodes, $1 \le L \le P_{LS}$ (where $P_{LS}$ denotes the number of nodes reserved for the load-sharing partition). Define $L_\ell$ to be the number of tasks that would be assigned to the $\ell$th least-loaded node if the $N_J$ tasks of job $J$ were scheduled in a manner that tends to equalize the amount of work at the $L$ least-loaded nodes (at the time of $J$'s arrival), $\sum_{\ell=1}^{L} L_\ell = N_J$, and let $W_\ell^L$ denote the total amount of time (including queueing delays) required to complete the $L_\ell$ tasks of $J$ assigned to the $\ell$th least-loaded node, $1 \le \ell \le L$.

The response time for job $J$ under this allocation is then given by $M^L \equiv \max\{W_1^L, W_2^L, \ldots, W_L^L\}$, i.e., the time required to complete the $N_J$ tasks of $J$ on all of the $L$ nodes allocated to $J$. Our load-sharing strategy first determines the value of $L$ that minimizes the quantity $M^L$ for each job arrival $J$, which we denote by $L^*$. More formally, the scheduling policy determines the value of $L$ that satisfies

$$ L^* \;=\; \arg\min\{\, 1 \le L \le P_{LS} \;:\; \max\{ W_1^L, W_2^L, \ldots, W_L^L \} \,\}, \qquad (1) $$

where the notation $\arg\min\{ x \le y \le z : f(y) \}$ refers to the value of $y$ that minimizes the function $f(y)$ under the constraint that $y$ is between $x$ and $z$, inclusive.

Note that the optimal set of nodes on which to allocate the tasks of $J$ consists of the $L^*$ least-loaded nodes. Our load-sharing strategy then assigns $L_\ell^*$ of the $N_J$ tasks of the job arrival $J$ to the $\ell$th least-loaded node, $1 \le \ell \le L^*$. We assume that each of the tasks has similar processing demands; however, the values $L_\ell^*$ can easily accommodate variable-sized tasks when this information is made available to the scheduling system. It can be shown that balancing the load among the $L^*$ nodes chosen in the first phase of the load-sharing policy is the optimal way of allocating the job's tasks to these nodes [25, 41, 42]. A consequence of our strategy is that sequential applications are placed on the least-loaded node at the time of their arrival, as desired. Observe that the first phase of the load-sharing policy consists of determining the number of nodes that minimizes the expected time to execute the job under the load balancing method used in the second phase. The monitoring facilities of our resource management system provide estimates of the mean and variance of the work currently allocated to each node. These estimates are exploited by our load-sharing strategy to efficiently determine the allocation that tends to minimize the response time of each job arrival given the current system state.
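For concreteness, the two-phase decision can be sketched as the following routine, which assumes equal-sized tasks and a hypothetical per-node estimate workAhead (outstanding work, in seconds, for each node of the load-sharing partition); it simply evaluates $M^L$ for every feasible $L$, as in equation (1).

    #include <algorithm>
    #include <vector>

    // Result of the two-phase load-sharing decision for one arriving job.
    struct LoadSharingDecision {
        int L;                          // L*: number of least-loaded nodes to use
        std::vector<int> tasksPerNode;  // L*_l: tasks given to the l-th of those nodes
    };

    // workAhead: estimated outstanding work (seconds) per node of the partition;
    // nTasks: N_J, the number of tasks of the arriving job;
    // taskCost: estimated cost of one task of the job.
    LoadSharingDecision chooseAllocation(std::vector<double> workAhead,
                                         int nTasks, double taskCost) {
        std::sort(workAhead.begin(), workAhead.end());  // least-loaded nodes first
        const int P = static_cast<int>(workAhead.size());

        LoadSharingDecision best{1, {}};
        double bestMakespan = -1.0;

        for (int L = 1; L <= P; ++L) {
            // Phase 2 for this candidate L: assign each task to the node with the
            // least accumulated work, which balances the load across the L nodes.
            std::vector<double> finish(workAhead.begin(), workAhead.begin() + L);
            std::vector<int> counts(L, 0);
            for (int t = 0; t < nTasks; ++t) {
                int target = static_cast<int>(
                    std::min_element(finish.begin(), finish.end()) - finish.begin());
                finish[target] += taskCost;
                ++counts[target];
            }
            // M^L = max_l W_l^L over the nodes that actually received tasks of J.
            double makespan = 0.0;
            for (int l = 0; l < L; ++l)
                if (counts[l] > 0) makespan = std::max(makespan, finish[l]);

            if (bestMakespan < 0.0 || makespan < bestMakespan) {
                bestMakespan = makespan;
                best = {L, counts};   // phase 1 answer plus the balanced assignment
            }
        }
        return best;
    }

The prototype described in Section 5.4 would additionally exploit the convexity of the maximum operator to stop this loop as soon as the makespan starts increasing, and would cap $L$ by the administrator-set parameter $L_{max}$.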

4.3 Fault Tolerance Unlike shared-memory machines where the processor failures are linked, in a distributed memory machines it is possible to exploit the loose coupling of the processors to implement fault-tolerant system services. In this section, we describe how to incorporate fault-tolerance into our scheduler. By fault-tolerant we mean that the scheduler can recover from faults even when the nodes that uses for itself fail. When a resource fails, the scheduler rst ensures the integrity of the various components of the scheduler. It nds alternative resources for components that were using the failed resource, and it ensures that the failed resource is not reallocated (until recovery). Second, it noti es other users of the resource (e.g., the parallel applications allocated to the resource) of the loss, so that they can recover from the failure. To integrate the fault-tolerance feature in our scheduler design, we assume the following failure model: First, nodes may fail by crashing at any time; when a node crashes, all the processes running on the node terminate, all communication to and from the node ceases, and the non-faulty nodes detect the failure within a known bounded time [30]. Second, nodes fail independently; furthermore, there is a bound on the number of simultaneous independent node failures. Thus, in a primary-backup system that is tolerant of one node failure, the primary and the backup nodes will not fail simultaneously. Third, failures are infrequent; with this assumption, our design leans toward less complicated and low overhead schemes. We use the following principles to ensure that our scheduler is fault-tolerant. We minimize the amount of data that must be preserved in a fault-tolerant manner. The data used by the scheduling system can be categorized into: system state, system control and application control. The system state data include static hardware information (e.g., speed of a node), dynamic hardware information (e.g., node and communication loads), and software environment data (e.g., availability of software components). The system control data includes the con guration of the hierarchical The mathematical notation argminfx  y  z : f (y)g simply refers to the value of y that minimizes the function f (y) under the constraint that y is between x and z, inclusively. 2

13

scheduler. It has information about the global system queue and the local queues employed in each of the individual strategies, i.e., the queues of the gang-scheduling and load-sharing strategies. Application control data includes all the data structures representing an application in the system. This information contains the current scheduling state of the individual processes of an application, the nodes the application is running on and the strategy it has been allocated to run in. The scheduler only needs to preserve the system control data and application control across failures; system state data can be recreated during failure recovery. It is important to separate application and system fault-tolerance. The recovery of the system should not dependent on the recovery of applications using the system. The design of Octopus using these principles is detailed in later sections.

5 Prototype Implementation

In this section we describe in detail the prototype implementation of Octopus. We use this prototype to experimentally analyze various performance aspects of our resource management system.

5.1 System Structure

As shown in Figure 5, the prototype of Octopus is a multi-level hierarchical resource management system. It consists of a single DLS at the top level of the hierarchy, multiple levels of PLSs that support different scheduling strategies, and a set of NLSs for the nodes of the domain. In our implementation we have included several partition-level schedulers. Specifically, we have implemented a Load Sharing Partition Scheduler, a Gang Partition Scheduler, a Space Sharing Partition Scheduler, and a Time Sharing Partition Scheduler. Note that the space sharing partition is the portion of the gang scheduling partition with $D_k$ equal to 1, whereas the time-sharing partition is the remaining portion of the gang scheduling partition with $D_k$ greater than 1.

Initially, the DLS creates partitions for the strategies for which the system administrator has specified minimum sizes, and a single default partition with the rest of the resources. When a new application enters the system, the DLS determines which of the currently instantiated partitions is most appropriate for this new application. When no suitable partition exists, the DLS creates an appropriate partition by taking nodes from existing partitions. If no partition can be created (e.g., when there are already too many partitions), the DLS enqueues the application for later consideration. PLSs handle incoming applications in an analogous way. To accommodate fluctuations in the workloads and to support varying resource requirements as requested by reconfigurable applications, a PLS (or the DLS) may employ the FDP mechanism to monitor and evaluate the state of the system and re-allocate resources from one subpartition to another. The FDP mechanism was introduced in Section 2, and in this section we discuss some aspects of its usage in more detail.

5.1.1 Application-Scheduler Negotiations

Octopus provides an interface for an application to specify its characteristics in order to obtain the appropriate types of resources. It also provides an interface for applications to respond to changes in resource allocation. An application may specify the properties shown in Table 1. Such properties include the type of application (e.g., legacy, reconfigurable, fault-tolerant), whether it is tightly or loosely synchronized, a minimum and maximum number of nodes on which it can be executed, an estimate of its per-node memory requirements, a measure of its execution efficiency, and a classification of its execution time. In our experiments we used the following classification: very small (less than 1 second), small (1 to 10 seconds), medium (10 to 100 seconds), large (100 to 1,000 seconds), very large (1,000 to 10,000 seconds) and huge (over 10,000 seconds); a finer granularity can easily be accommodated in our system.

Figure 5: Structure of the prototype implementation. The domain-level scheduler uses FDP (flexible dynamic partitioning) to manage a gang-scheduling partition and a load-sharing partition; the gang-scheduling partition is further divided into space-sharing and time-sharing partitions containing per-application partitions, with an NLS on each node.

In our experiments, these execution times are used to compute the coefficients of variation of the workloads in Section 6.3. By execution efficiency we mean a function of the application's speedup on a specific number of nodes; more complicated definitions could employ a continuous function of the execution efficiency or its first derivative. Recall that all negotiations and the flow of information between the scheduler and an application go through the ALM. The interface the ALM provides is shown below:

    class ALM {
    public:
        /* Method invoked to react to PLS message */
        void reconfigure(ResourceList *rl);

        /* Request resource allocation change from PLS */
        void reconfigure_me(ResourceList *rl);
    };

    Property               Value
    Type of Application    Legacy, Reconfigurable, Fault Tolerant, Sequential
    Efficiency (Speedup)   Inefficient, Efficient
    Execution Time         Very Small, Small, Medium, Large, Very Large, Huge
    Synchronization Type   Tightly synchronized, Loosely synchronized
    Number of Nodes        1 - P
    Memory Per Node        0 - M megabytes

Table 1: Application properties and acceptable values of these properties
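For concreteness, the profile negotiated through the launcher could be represented as a plain record mirroring Table 1. The struct below is a hypothetical illustration; the enum and field names are not the prototype's actual declarations.

    // Hypothetical application profile mirroring Table 1; passed from the
    // launcher to the DLS so it can pick the best-suited partition.
    struct ApplicationProfile {
        enum class Kind { Legacy, Reconfigurable, FaultTolerant, Sequential };
        enum class Efficiency { Inefficient, Efficient };
        enum class RunLength { VerySmall, Small, Medium, Large, VeryLarge, Huge };
        enum class Sync { Tight, Loose };

        Kind       kind;
        Efficiency efficiency;       // coarse measure of speedup behavior
        RunLength  executionTime;    // classification of expected execution time
        Sync       synchronization;
        int        minNodes;         // in the range 1..P
        int        maxNodes;
        int        memoryPerNodeMB;  // in the range 0..M
    };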

5.2 FDP and Application Support for Reconfiguration

Recall that FDP is the component of our scheduler that is responsible for deciding and realizing the reallocation of resources. In particular, the DLS and some PLSs have their own FDP mechanism for controlling the resources granted to their partitions and for performing any reconfigurations, depending on the partition type and reallocation policy. For example, in a space-sharing partition, where all applications are assumed to be fault-tolerant or at least reconfigurable, the FDP has to reallocate the resources every time a trigger (job arrival or departure) fires, subject to any smoothing interval constraints. In addition, it has to coordinate with all the applications affected by such a reconfiguration to ensure a smooth transition. On the other hand, in other partitions (time-sharing, load sharing or gang scheduling), if the applications are not fault-tolerant or reconfigurable, the FDP mechanism waits until all the scheduled resources are released. In general, reconfiguration cannot proceed smoothly unless both the system and the applications support it. The system support is the same for all applications and is independent of the type(s) of application(s) being reconfigured. Application support for reconfiguration, on the other hand, depends upon an application's implementation of the reconfiguration phase.


System Support

The resource management system of the space sharing partition deploys the FDP mechanism to support the reconfiguration process. On a job arrival, the PLS uses the FDP to determine which of the nodes currently used by other applications can be assigned to the new partition that will accommodate the new job. It then issues a reconfiguration request with the list of revoked nodes to every affected ALM and waits for their acknowledgments indicating that the request has been handled. At this point, the PLS is free to proceed. On a job exit, if there are no waiting jobs, the PLS uses FDP to decide which partitions would benefit most from receiving the nodes released by the departing application. It then collaborates with the affected ALMs in a manner similar to the one just described. In both cases (job arrival and departure), FDP bases its decision on the efficiency characteristics provided by the applications, if any.

Bag-of-tasks Application Reconfiguration

During repartitioning, the application has to complete a series of actions:

The ALM (the recipient of the reconfiguration message) first sends a checkpoint request to all of the worker processes and waits until they complete their current phase, checkpoint, and send back an acknowledgement. Then, the ALM sends an acknowledgement back to the PLS that checkpointing has completed, starts accepting connection requests, but only from workers on nodes that belong to the partition, and assigns work to its new set of workers. Once this process completes, the application continues with its normal execution.
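In terms of the ALM interface of Section 5.1.1, a bag-of-tasks ALM might realize this handshake roughly as follows; the helper names (broadcastCheckpoint, waitForWorkerAcks, and so on) are assumed placeholders for the application's own messaging layer, not part of Octopus.

    #include <vector>

    struct ResourceList;  // opaque list of nodes granted to the application
    struct Worker;        // handle to one worker process

    // Sketch of a bag-of-tasks ALM reacting to a PLS reconfiguration message.
    class BagOfTasksALM {
    public:
        // Invoked when the PLS announces the application's new set of nodes.
        void reconfigure(ResourceList* newNodes) {
            broadcastCheckpoint();     // ask every worker to finish its phase and checkpoint
            waitForWorkerAcks();       // block until all workers have acknowledged
            ackToPLS();                // tell the PLS that checkpointing is complete
            acceptWorkersOnlyFrom(newNodes);  // connections limited to valid nodes
            redistributeTasks();       // hand remaining tasks to the new worker set
        }

    private:
        // The bodies below are placeholders for the application's messaging layer.
        void broadcastCheckpoint() {}
        void waitForWorkerAcks() {}
        void ackToPLS() {}
        void acceptWorkersOnlyFrom(ResourceList*) {}
        void redistributeTasks() {}

        std::vector<Worker*> workers_;
    };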

Symmetric Application Reconfiguration

The steps are essentially the same as in the bag-of-tasks application.

In this case, however, reconfiguration is performed in a more distributed fashion: the ALM informs all workers about the new valid set of nodes, the workers then decide collectively (based on node numbering) which tasks each worker will undertake, and they redistribute all essential data accordingly. Upon completion, they continue with their normal execution.

5.3 Gang Scheduling

According to the gang-scheduling strategy described in Section 4.1, the gang-scheduling matrix consists of $K$ subpartitions. In the Octopus prototype, we divided the $K$ subpartitions into two separate groups, one with space sharing subpartitions all managed by a single PLS, and one with time sharing subpartitions each managed individually by a different PLS. It is important to note that the space sharing PLS is a gang scheduling PLS with $D_k$ equal to 1, whereas the time-sharing PLSs are gang scheduling PLSs with $D_k$ greater than 1. A separate FDP mechanism is designated to manage and re-allocate resources between the two children PLSs handling the two groups of subpartitions. The ratio of the number of nodes allocated to each subpartition is linearly dependent upon the CPU utilization of each subpartition. When the CPU utilizations of the subpartitions change sufficiently, FDP re-evaluates the entire matrix and re-allocates nodes appropriately to level out the CPU utilizations. Such changes can take effect at most once every GS_smooth interval. The algorithm for allocating nodes between sub-partitions is the same as the general resource management scheme described in Section 3. The various parameters of the gang scheduling partition are shown in Table 2.

    Parameter    Policy                Description
    Dk           System Monitored      Multi-programming level
    K            System Monitored      Number of subpartitions
    Ksp          System Monitored      Number of subpartitions for space-sharing
    Kts          System Monitored      Number of time-sharing subpartitions
    Ti           System Administrator  Time slice duration
    SS_smooth    System Administrator  Minimum interval for invoking FDP in the space sharing partition
    GS_smooth    System Administrator  Minimum interval for invoking FDP in the gang scheduling partition
    TRM          System Administrator  Time for resource information update
    Kmin         System Administrator  Minimum size of a partition

Table 2: Summary of gang scheduling parameters

5.4 Load Sharing

We implemented a particular version of our load sharing algorithm described in Section 4.2. The monitoring facilities provide estimates for the mean ($R$) and variance ($V$) of the workload response time at each of the nodes. These estimates are used to approximate the $W_\ell^L$ variables of the optimal load sharing algorithm when it determines the allocation that tends to minimize the job's response time given the current system state. The parameter $L_{max}$ specifies the maximum number of nodes allocated to an application in the load sharing partition, which can constrain the possible range of values for $L^*$ and is at most the number of nodes in the partition, $P_{LS}$. This makes it possible for the system administrator to control and adjust this important aspect of our load sharing algorithm. These policy parameters are summarized in Table 3.

    Parameter    Control Policy        Definition
    Lmax         System Administrator  Maximum number of nodes allocated to a job
    Rj           System Monitored      Mean execution time of a job at node j
    Vj           System Monitored      Variance in the execution times of jobs at node j

Table 3: Load sharing parameters

Since the maximum operator is a convex function, the calculation of $L^*$ in equation (1) of Section 4.2 for increasing $L$ is terminated as soon as we find a value of $L$ for which $M^L < M^{L+1}$, $1 \le L \le \min\{N_J, L_{max}\}$ (recall that $N_J$ denotes the number of tasks for job arrival $J$). It also can be shown that the value of $L^*$ decreases with increasing system load [25, 41, 42], and thus a further optimization for light to moderate loads would be to start this calculation with $L = \min\{N_J, L_{max}\}$ and analogously work backwards. For very large numbers of nodes, the overhead of calculating $L^*$ via equation (1) can be further reduced by dividing the nodes (in order of increasing work) into $Z$ groups and performing the calculation over these groups, where $W_\ell^L$ is modified to reflect the response time through the $\ell$th group of nodes, $1 \le \ell \le L \le Z$. This can yield an efficient approximation to the true value of $L^*$, where the load balancing properties of the optimal load-sharing strategy (possibly together with additional optimizations) can be used to reduce or minimize any inaccuracies.

Many different approaches have been employed to allocate resources for coarse-grained applications, each with varying and limited degrees of success [9]. We have demonstrated elsewhere [41, 42] that our load-sharing approach performs significantly better than other alternatives. An additional performance comparison between our load-sharing policy and gang scheduling is also provided in Section 7.3.


5.5 DLS Management of the Gang Scheduling and Load Sharing Partitions

The DLS manages the allocation of nodes between the gang-scheduling PLS and the load-sharing PLS, as described in Section 3. Table 4 lists the various parameters that can be adjusted to control the behavior of the DLS.

Parameter           Policy Type           Definition
UGS                 System Monitored      Utilization of the GS partition
ULS                 System Monitored      Utilization of the LS partition
Δ = |UGS - ULS|     System Monitored      Load differential
T                   System Administrator  Threshold for the load differential
DLS smooth          System Administrator  FDP invocation interval
Jmax                System Administrator  Maximum number of jobs
Smin                System Administrator  Minimum size of a partition

Table 4: The scheduling parameters for the DLS.

The system administrator has full control over the minimum partition size (Smin) and the maximum number of jobs (Jmax) that can be accommodated. He or she may also configure the load-differential threshold T and the smoothing interval DLS smooth, as described in Section 3. The implementation of FDP in the DLS uses the load differential between the load sharing and gang scheduling partitions to re-allocate nodes. Consider, for example, the case where UGS > ULS and Δ exceeds T. This event causes the corresponding FDP trigger to fire, and the FDP mechanism reacts by determining whether to transfer nodes from one partition to the other to balance the load according to the values of these control parameters. Once it decides to repartition, the FDP mechanism starts moving nodes from the gang scheduling partition to the load sharing partition until Δ drops below T. Repartitioning cannot occur more often than the time specified by DLS smooth; this prevents nodes from being moved back and forth between the two partitions in a thrashing manner. In this example, nodes can be re-assigned from the gang scheduling partition to the load sharing partition only if there are no application processes running on them. First the DLS sends a request to the gang scheduling partition to give up a number of nodes. The gang scheduling partition selects a set of nodes and asks the relevant applications to give them up. When the applications release the nodes, the DLS hands the nodes to the load sharing partition. The gang scheduling partition cannot give up nodes that are executing non-reconfigurable applications; in such a case, nodes can be handed over only after the completion of the applications that use them. In the opposite case (i.e., ULS > UGS and Δ > T), for re-assigning nodes from the load sharing partition to the gang scheduling partition, FDP waits until the appropriate nodes of the load-sharing partition become available before it re-assigns them to the gang scheduling partition. The nodes become available when the applications that are already using them complete their execution; meanwhile, no newly arriving jobs are assigned to this specific set of nodes. We assume that all applications in the load sharing partition are not reconfigurable.
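The sketch below illustrates the DLS-level trigger just described, with assumed names and one node released per invocation; the transfer direction follows the text (when UGS > ULS and Δ > T, the gang-scheduling partition is asked to release idle nodes to the load-sharing partition, and vice versa).

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    #define T          0.2     /* threshold for the load differential     */
    #define DLS_SMOOTH 200     /* minimum seconds between repartitionings */
    #define S_MIN      2       /* minimum partition size                  */

    enum partition { GS, LS, NONE };

    static time_t last_repartition;

    /* Decide which partition (if any) should give up one idle node. */
    enum partition dls_fdp_trigger(double u_gs, double u_ls, int n_gs, int n_ls)
    {
        if (fabs(u_gs - u_ls) <= T)
            return NONE;                                  /* trigger does not fire      */
        if (time(NULL) - last_repartition < DLS_SMOOTH)
            return NONE;                                  /* smoothing: avoid thrashing */

        last_repartition = time(NULL);
        if (u_gs > u_ls && n_gs > S_MIN)
            return GS;     /* GS releases an idle node, which is handed to LS      */
        if (u_ls > u_gs && n_ls > S_MIN)
            return LS;     /* LS releases a node to GS once its applications drain */
        return NONE;
    }

    int main(void)
    {
        enum partition donor = dls_fdp_trigger(0.9, 0.4, 12, 4);
        printf("donor: %s\n", donor == GS ? "gang scheduling" :
                              donor == LS ? "load sharing"    : "none");
        return 0;
    }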

5.6 Scheduler Fault-Tolerance

We designed Octopus to be fault-tolerant using a primary-backup scheme, which is a comparatively low-cost approach [30]. The number of backups is customizable at the domain level, and there is a tradeoff between the number of backups and performance. In our hierarchical design, the NLS on each node is restricted to managing the state data and the application assigned to that node. Since the DLS and PLS maintain a superset of the NLS data, the NLS does not need to be fault-tolerant.

When the primary takes a scheduling event, it logs the event by sending messages to the backups, which maintain active replicas of the primary's scheduling structures. An alternative design would have been for the primary to store the events on stable storage; on primary failure, the new primary would rebuild the scheduling state from the information in stable storage. Our current implementation has the advantage of not requiring any file system support or a storage device accessible by both primary and backup nodes. Only scheduling events that modify system and application control data are logged, e.g., scheduling events that change the allocation of resources among the PLSs or assign applications to nodes. Scheduling events that are not logged include negotiations between scheduling components and applications (e.g., resource negotiations between a launcher and the DLS) and modifications to system state data (e.g., load information). Such messages need not be logged since, first, they are not critical and, second, they can be reproduced if needed. This selective logging reduces message overhead. The logged events are those that correspond to messages of type 3, 4, 5, and 7 shown in Figure 3; messages of type 1, 2 and 6 are not logged. For performance reasons, most of the logging is asynchronous; that is, the primary sends the log to the backups and proceeds with its computation without waiting for an acknowledgement from the backups. In Figure 3, messages of type 4 and 5 are logged synchronously, and messages of type 3 and 7 asynchronously. Asynchronous logging introduces substantially less latency than synchronous logging, at the cost of some uncertainty in the success of the log. When a backup takes over from a primary, there are small timing windows in which the backup cannot know whether or not the primary has sent a critical message. Our messages are timestamped with logical timestamps, so the backup can reconstruct and re-send information while the rest of the system components suppress any duplicate messages using standard methods [30]. The primary and backup DLSs form a membership group that relies on the underlying node membership (NM) mechanism for consistent and complete failure detection and notification [21]. The NM mechanism also helps to select one of the backups to become the primary when the current primary fails. In the event of a node failure, the DLS informs the applications that were using the node by sending messages to the appropriate ALMs. It is then up to the ALMs to direct the recovery process of their applications. It is important to note that applications which use this fault-tolerant feature of the scheduler do not have to provide their own fault detection mechanisms. On the other hand, Octopus immediately terminates all applications that are affected by a node crash and do not provide their own ALM process.
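A minimal sketch of the selective logging described above follows; the event categories mirror the text, but the message format, the stub transports, the logical-clock counter, and the choice of which logged category is sent synchronously in the sketch are illustrative assumptions.

    #include <stdio.h>

    enum event_type {
        EV_RESOURCE_REALLOC,    /* changes the allocation of resources among PLSs: logged */
        EV_APP_ASSIGNMENT,      /* assigns an application to nodes: logged                */
        EV_NEGOTIATION,         /* e.g. launcher/DLS negotiation: not logged              */
        EV_LOAD_UPDATE          /* system state data such as load information: not logged */
    };

    static unsigned long logical_clock;   /* logical timestamp used to suppress duplicates */

    /* Stub transports: a real primary sends these over the network to its backups. */
    static void send_sync(unsigned long ts, enum event_type t)
    { printf("sync  log ts=%lu type=%d (wait for acknowledgement)\n", ts, t); }
    static void send_async(unsigned long ts, enum event_type t)
    { printf("async log ts=%lu type=%d (continue immediately)\n", ts, t); }

    /* Called by the primary for every scheduling event it processes. */
    void log_event(enum event_type t)
    {
        unsigned long ts = ++logical_clock;

        switch (t) {
        case EV_RESOURCE_REALLOC: send_sync(ts, t);  break;  /* critical: logged synchronously here  */
        case EV_APP_ASSIGNMENT:   send_async(ts, t); break;  /* logged asynchronously in this sketch */
        default: break;          /* not critical and reproducible: not logged */
        }
    }

    int main(void)
    {
        log_event(EV_APP_ASSIGNMENT);
        log_event(EV_LOAD_UPDATE);       /* silently skipped */
        log_event(EV_RESOURCE_REALLOC);
        return 0;
    }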

6 Prototype Platform and Workloads

Given the above details of our scheduler design, we now describe the hardware and software platform on which the Octopus prototype was developed. We also provide here a description of the application workloads that were used to experimentally analyze various performance issues.

6.1 Hardware

The IBM SP2 is a distributed-memory multicomputer whose nodes are connected by a high-speed switch. We use TCP/IP to communicate over the switch to facilitate software portability. The nodes are 63 MHz RS/6000 processors. We use an SP2 with a maximum of 16 nodes, where each node has 32 megabytes of memory.


6.2 System Software

We have implemented Octopus on AIX 4.1. Currently, the DLS and the PLSs are integrated in one process, while the NLS runs as a separate process. The Launcher and the ALM also run as separate processes. Typically the DLS/PLS and the NLSs run on different nodes. Although we had prototyped a version of the NLS that runs inside the AIX kernel, we do not consider it further in this paper because of portability considerations. Below we describe some aspects of our time-keeping and alarm subsystem, the context switching mechanism, and other signals and system calls.

The schedulers at every level (DLS, PLS, NLS) need to keep track of certain events and timeout periods. The DLS, for example, sets a timeout period every time it expects a message from a peer scheduler. The PLS sets timers for the GS smooth interval and for probing whether its NLSs are alive. An NLS uses a timer to perform the context switching and to send load information to its PLS. AIX, however, does not offer independent clocks or multiple alarms. To compensate for this shortcoming (which is common to all UNIX systems), we created the alarmList, a time-ordered list in which we keep track of all pending activities. This list serializes concurrent events by placing distant-future events toward the back of the list and near-future events at the front. A timer is set only for the nearest future event. When the alarm goes off, AIX sends a SIGALRM signal to the scheduler, which then takes the appropriate action. In effect, we have multiplexed all the pending activities onto the one available alarm mechanism in such a way that no event can be missed. Inserting, deleting and updating the alarmList takes place within critical sections, and the SIGALRM signal is buffered if it occurs while the scheduler is operating within a critical section.

The AIX signal mechanism is also employed in the implementation of the context switching. Each NLS operating within the time-shared partition employs a TimeSlice array with one process per entry. At any given moment, the NLS has all but one of its application processes blocked. To block these processes, the NLS sends them a SIGSTOP signal. When the time slice expires, the NLS sends a SIGSTOP signal to the currently running process and resumes the execution of the next process in the TimeSlice array with a SIGCONT signal. Signals are used in other cases as well. For example, an NLS can detect that a child process has exited by catching the SIGCHLD signal. With the wait3() system call, the NLS obtains more information on the terminated application process, updates its database, and informs its PLS.

The co-ordination of row execution is potentially an expensive operation. A simple solution is for the PLS to send a message to the NLSs at every context switch, but this would result in high overheads for every context switch and undermine the aim of scalability. Instead, we rely on the Network Time Protocol (NTP) [28] to synchronize the clocks across the set of machines. Each node then independently sets a timer at the start of the column it has been given by the PLS; when this (local) timer expires, each node moves forward one row. The NTP protocol ensures that the context switch occurs on all nodes almost simultaneously.

In our AIX 4.1 implementation, the resource monitor is part of the NLS process; it reads /dev/kmem to get node utilization information and uses the UNIX rusage facility to get information for each process.
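The following sketch illustrates the alarmList idea: all pending timeouts are kept in one time-ordered list and multiplexed onto the single UNIX interval timer, with the SIGALRM handler dispatching due events. The data structures and the string actions are illustrative, and blocking SIGALRM around list manipulation stands in for the critical sections mentioned above.

    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>
    #include <time.h>
    #include <unistd.h>

    #define MAX_EVENTS 16

    struct alarm_event {
        time_t when;               /* absolute expiry time                   */
        const char *what;          /* e.g. "context switch", "probe the NLSs" */
    };

    static struct alarm_event alarm_list[MAX_EVENTS];  /* kept sorted by `when` */
    static volatile int n_events;

    /* Re-arm the single interval timer for the earliest pending event. */
    static void arm_timer(void)
    {
        struct itimerval it;
        memset(&it, 0, sizeof it);
        if (n_events > 0) {
            long delta = (long)(alarm_list[0].when - time(NULL));
            it.it_value.tv_sec = delta > 0 ? delta : 1;
        }
        setitimer(ITIMER_REAL, &it, NULL);
    }

    /* Insert an event in time order; re-arm if it became the nearest one. */
    static void alarm_list_add(long seconds_from_now, const char *what)
    {
        time_t when = time(NULL) + seconds_from_now;
        int i = n_events++;
        while (i > 0 && alarm_list[i - 1].when > when) {
            alarm_list[i] = alarm_list[i - 1];
            i--;
        }
        alarm_list[i].when = when;
        alarm_list[i].what = what;
        if (i == 0)
            arm_timer();
    }

    /* SIGALRM handler: dispatch every event that is due, then re-arm. */
    static void on_sigalrm(int sig)
    {
        (void)sig;
        while (n_events > 0 && alarm_list[0].when <= time(NULL)) {
            write(STDOUT_FILENO, alarm_list[0].what, strlen(alarm_list[0].what));
            n_events--;
            memmove(&alarm_list[0], &alarm_list[1],
                    (size_t)n_events * sizeof alarm_list[0]);
        }
        arm_timer();
    }

    int main(void)
    {
        struct sigaction sa;
        sigset_t block_alrm, wait_mask;

        memset(&sa, 0, sizeof sa);
        sigemptyset(&sa.sa_mask);
        sa.sa_handler = on_sigalrm;
        sigaction(SIGALRM, &sa, NULL);

        /* Block SIGALRM while manipulating the list: the "critical section"
         * during which the signal is buffered rather than handled. */
        sigemptyset(&block_alrm);
        sigaddset(&block_alrm, SIGALRM);
        sigprocmask(SIG_BLOCK, &block_alrm, &wait_mask);

        alarm_list_add(3, "GS smooth: re-evaluate the gang-scheduling matrix\n");
        alarm_list_add(1, "time slice: SIGSTOP current entry, SIGCONT next\n");
        alarm_list_add(2, "send load information to the PLS\n");

        while (n_events > 0)
            sigsuspend(&wait_mask);   /* wait for the next alarm to be handled */
        return 0;
    }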

6.3 Workloads

The Applications. We used the following set of applications in our study.

AtEarth [4] simulates the flight of neutrinos from the sun toward earth. Adaptive Quadrature [1] is an algorithm for numerical integration. Matrix Multiplication performs matrix multiplication in parallel. Fast Fourier Transform is a parallel version of the fast Fourier transform. Parallel Make is a parallel version of the DQS make program. Conjugate Gradient performs a conjugate gradient algorithm in parallel [33]. AtEarth and Adaptive Quadrature are bag-of-tasks applications. Conjugate Gradient, on the other hand, is a symmetric application. Sim is a synthetic application that consists of small parallel applications with irregular finishing times for its constituent processes.

Workloads. We studied a large number of workloads by varying the mixture of the above applications.

We also changed the sizes of the datasets used in some of the applications to yield the service time variations reported below. Speedups are calculated by running the application on a dedicated system with the appropriate number of processors. We only report the properties of the applications required to make appropriate scheduling decisions for each. The results presented in this paper are a representative sample of the trends observed in our numerous experiments. They were obtained with three different workloads. W1, a mixture of the Fast Fourier Transform, Matrix Multiplication and Conjugate Gradient applications, is an efficient and variable workload with an average speedup of 6.2 when executed on 8 nodes (relative to the execution time on 1 dedicated node) and a service time coefficient of variation of 4.5. W2 is a less efficient and less variable workload with a speedup of 4 when executed on 8 nodes and a service time coefficient of variation of 1.3; the average memory consumed by its processes is 3 megabytes. It consists of a mixture of the AtEarth, Adaptive Quadrature and Conjugate Gradient applications. Both W1 and W2 represent the characteristics of tightly-synchronized applications. W3 consists of a mixture of loosely-synchronized applications (Sim, Parallel Make and sequential applications). WN, finally, is an equal mixture of the above three workloads.

7 Performance Measurements

In this section, we use the Octopus prototype and the workloads presented in the previous sections to:

1. measure the overheads associated with FDP
2. examine the parameters of gang scheduling
3. compare load sharing with gang scheduling
4. evaluate the performance of the integrated system
5. evaluate the overheads associated with integrating fault-tolerance into the scheduler

7.1 FDP and Application Reconfiguration Overheads

We first evaluate the overheads associated with a simple form of FDP that uses application entry and exit events as triggers. Each application is placed in its own partition. When an application enters or exits, all nodes are equally re-allocated among the currently executing partitions. This instance of FDP exhibits the basic overheads of the various instances of FDP used in our system. We also compare these overheads with the basic folding strategy without rotation proposed in [27]. According to the latter strategy, the arrival of a job causes the system to split the largest sub-partition into two equal sub-partitions, one for the old application and one for the new; the completion of an application causes the system to grant the nodes that became available to the smallest sub-partition. There are two types of overheads: those experienced by the system and those experienced by each application. The system overheads are the same, independent of the type(s) of application(s) being reconfigured. Application overheads, on the other hand, depend upon each application's particular implementation of the reconfiguration phase. We define the system overhead as the time spent by the partition from the moment it determines that it needs to reconfigure the subpartitions until it can proceed to its next task. We define the application overhead as the time spent by an application from the moment it begins servicing the reconfiguration command until the moment it is able to proceed with its normal work.
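The two repartitioning rules being compared can be sketched as follows; the array-based bookkeeping and the helper names are assumptions made only so the example is self-contained.

    #include <stdio.h>

    #define TOTAL_NODES 16
    #define MAX_APPS     8

    /* Equal re-allocation used by the simple FDP instance on every entry/exit. */
    void fdp_equal(int sizes[], int napps)
    {
        for (int i = 0; i < napps; i++)
            sizes[i] = TOTAL_NODES / napps + (i < TOTAL_NODES % napps ? 1 : 0);
    }

    /* Folding: a new application takes half of the largest sub-partition. */
    void fold_on_arrival(int sizes[], int *napps)
    {
        int largest = 0;
        for (int i = 1; i < *napps; i++)
            if (sizes[i] > sizes[largest]) largest = i;
        int half = sizes[largest] / 2;
        sizes[largest] -= half;
        sizes[(*napps)++] = half;
    }

    /* Folding: a completed application's nodes go to the smallest survivor. */
    void fold_on_exit(int sizes[], int *napps, int done)
    {
        int freed = sizes[done];
        sizes[done] = sizes[--(*napps)];          /* drop the finished entry */
        int smallest = 0;
        for (int i = 1; i < *napps; i++)
            if (sizes[i] < sizes[smallest]) smallest = i;
        sizes[smallest] += freed;
    }

    int main(void)
    {
        int sizes[MAX_APPS] = { TOTAL_NODES };
        int napps = 1;
        fold_on_arrival(sizes, &napps);   /* 16 -> 8, 8       */
        fold_on_arrival(sizes, &napps);   /* 8, 8 -> 4, 8, 4  */
        fold_on_exit(sizes, &napps, 0);   /* freed nodes go to the smallest partition */
        for (int i = 0; i < napps; i++)
            printf("folding: application %d holds %d nodes\n", i, sizes[i]);
        fdp_equal(sizes, napps);          /* the simple FDP rule, for comparison */
        printf("equal split: %d and %d nodes\n", sizes[0], sizes[1]);
        return 0;
    }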

Figure 6: Total overheads associated with FDP without smoothing (overhead in milliseconds vs. number of applications; curves: communication restructuring, checkpoint/redistribute, and total overheads). We use a bag-of-tasks application. The application reconfiguration costs can be a high percentage of the overall overheads. At low arrival rates, reconfiguring a large application has very high overheads.

The application overheads for the bag-of-tasks case were obtained from the AtEarth application (Adaptive Quadrature gave similar results). Conjugate Gradient was used for the symmetric application case. System overheads were almost identical in all cases, verifying our prediction. To ensure the accuracy of these measurements, each set of experiments was repeated 50 times and we present the average of these runs. Both the application overheads for the bag-of-tasks applications and the total system overheads are shown in Figure 6. Partitions shrink in size when an application enters the system (if capacity is not exhausted) and expand when an application exits (if the system queue is empty). In this figure, we present the overheads when starting with an empty system, and then measure the overheads as more and more applications enter the system until the number of applications is equal to the maximum number of partitions (in this case 8). The middle curve presents the time that the application spends checkpointing and redistributing data during reconfiguration. The lowest curve shows how much time the application needs to re-establish network connections, while the upper curve denotes the total time the system spends in each reconfiguration. Notice that the upper curve is placed higher than where the sum of the lower two curves would be. This was an expected result: the system not only has to wait for the application to reconfigure, but is also involved with its own computations and bookkeeping.

Reconfiguration   Total Application   Data Redistribution   Communication Restructuring
(2 → 1, 1)        445                 429                   16
(4 → 2, 2)        528                 487                   41
(8 → 4, 4)        275                 213                   62
(16 → 8, 8)       230                 116                   114

Table 5: Symmetric (CG) application reconfiguration costs. The overheads are measured in milliseconds. The costs are broken down into data redistribution costs (redistribution of a matrix of size 1000 x 1000) and communication restructuring costs, together with the total application overhead.

The application overheads for the symmetric application are shown in Table 5.

The trends show that larger applications (applications with many nodes; e.g., when a partition of 16 nodes splits into two partitions of 8 nodes each, i.e., 16 → 8, 8) experience lower total redistribution overheads, although there are more interactions involved. This is in contrast with what we observed for the bag-of-tasks applications in Figure 6 (in that figure, the large applications are at the left side, since a small number of applications implies more nodes per application). There are two reasons for this behavior. When an application runs on many nodes, data is partitioned into many small chunks, so the amount of data being redistributed is small. Moreover, the exchange of messages among the nodes is done concurrently, counterbalancing the increased number of interactions. With the bag-of-tasks applications, all messages are exchanged sequentially since checkpointing is centralized and all nodes have to communicate with the ALM. The trends are different when it comes to the time spent reconfiguring the communication channels. A large application (an application that runs on many nodes, e.g., the 16 → 8, 8 repartitioning) needs more time to establish all necessary connections, hence more time is spent in this phase compared to small applications (e.g., when a small partition of 2 nodes splits into two single-node partitions, i.e., 2 → 1, 1). Table 5 presents these overheads as well. Notice that this is also consistent with the bag-of-tasks applications (observe the trend of the lower curve as it goes from right to left; larger applications are positioned at the left, where fewer applications share the nodes).

Figure 7 compares the overheads of the FDP without smoothing and folding partitioning schemes. The folding partitioning scheme exhibits lower costs for certain regions because it interrupts fewer applications. However, at heavy loads the overheads of the two schemes are essentially the same.

Figure 7: Comparison of the overheads for Folding and FDP without smoothing: shrinking partitions (overhead in milliseconds vs. number of applications).

7.2 Gang Scheduling

Recall that the gang scheduling partition consists of both a space-sharing partition and a set of time-sharing subpartitions. In this section, we describe in detail the experiments and results associated with these space-sharing and time-sharing components of gang scheduling, and we explore further the various parameters and issues associated with the complete gang scheduling matrix and the integration of the two components via the FDP mechanism.


7.2.1 Space-sharing

To quantitatively evaluate the benefits of our FDP mechanism for implementing space sharing, we compared it against two previously proposed strategies. The first is the standard Dynamic Partitioning (DP) strategy (an instance of our FDP without timer triggers), and the second is the basic folding strategy without rotation. In these experiments, our FDP uses application entries, exits and timers as triggers; a timer fires every thirty seconds. The best space sharing strategy depends on factors such as CPU utilization and workload. We have conducted a series of experiments and measured the mean job response time under a less efficient, less variable workload (see W2 in Section 6.3) at different CPU utilization levels. The results are shown in Figure 8. According to these results, FDP using application entry, application exit and timers as triggers outperforms the standard DP, which in turn gives better results than folding. At low CPU utilizations the DP strategy outperforms folding only marginally, whereas at high CPU utilizations DP does much better. (There is no contradiction between Figure 7 and Figure 8: the former presents overheads, while the latter presents mean response time.) One reason for DP outperforming folding (under the workloads considered) is that DP achieves better node allocations. FDP with smoothing, on the other hand, outperforms the rest in all cases, and its effects are pronounced at medium to high CPU utilizations. This result is not surprising: at these CPU utilization levels, the system is still below full capacity (the system queue is empty) and every job arrival and departure is likely to cause delay to both the system and the applications. By smoothing, the system gives applications time to exit without being disturbed too often; at the same time, it batches the newly arriving applications and performs all reconfigurations in one sweep. A summary of example values and the system-related parameters can be found in Table 6.

Parameter                Policy                Example Value
minimum partition size   System Administrator  1
SS smooth interval       System Administrator  30 seconds

Table 6: Space Sharing (SS) Scheduling Parameters.
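A small sketch of the smoothed, batched reconfiguration described above follows; the pending-event counters and the equal-allocation rule used in the sweep are illustrative assumptions.

    #include <stdio.h>

    #define TOTAL_NODES 16

    static int running_apps;      /* applications currently holding nodes        */
    static int pending_changes;   /* entry/exit events seen since the last sweep */

    void on_application_entry(void) { running_apps++; pending_changes++; }
    void on_application_exit(void)  { running_apps--; pending_changes++; }

    /* Invoked by the 30-second SS smooth timer: perform all accumulated
     * reconfigurations in a single sweep. */
    void on_ss_smooth_timer(void)
    {
        if (pending_changes == 0 || running_apps == 0)
            return;                              /* nothing changed this interval */
        printf("sweep: %d pending events, %d applications, %d nodes each\n",
               pending_changes, running_apps, TOTAL_NODES / running_apps);
        pending_changes = 0;
    }

    int main(void)
    {
        on_application_entry();
        on_application_entry();
        on_application_exit();
        on_application_entry();
        on_ss_smooth_timer();    /* in the prototype this is driven by the timer trigger */
        return 0;
    }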

7.2.2 Time-sharing

There are two fundamental types of overheads: context switching, and clock drift between the node clocks. From our experiments, we have measured that the NLS spends approximately 700 microseconds to context switch. This overhead is the time to handle a timer interrupt, select the next job to run, suspend the currently executing job and start the next one. The clock drift determines the variance in when the row switches actually occur across the nodes. We used NTP to synchronize the clocks and achieved a synchronization of 2 ms. We could have used hardware-synchronized clocks, but we decided not to adopt such a technique since it is not portable.

7.2.3 Setting parameters for the best time sharing partitions

In order to find the best time sharing partition we considered the following questions:

1. What are the effects of having different size subpartitions?
2. How does the value of Dk affect the performance of the kth subpartition?
3. What is the best number of subpartitions K?

Figure 8: Mean response time comparisons for workload W2 and 8 nodes (normalized mean response time vs. CPU utilization: low (0.4), medium (0.7) and high (0.85); strategies compared: FDP, FDP without timer trigger, Folding Partitioning, and Best Dynamic). This workload consists of a set of bag-of-tasks and symmetric applications. We use a 30-second smoothing interval.

Choosing the right values for K and Dk. The right pair of values is heavily dependent upon workload and CPU utilization. To demonstrate this, we took workload W1 (jobs with high efficiency and highly variable service times) and varied K and Dk at light, medium and high system loads. These results are shown in Figures 9 through 12. The results are normalized to the execution time of the best value for each CPU utilization.

CPU Utilization    K   Dk
0.0 < U < 0.4      1   4
0.4 < U < 0.7      1   8
0.7 < U < 0.9      8   1
0.9 < U < 1.0      8   4

Table 7: The best values of K and Dk for various utilizations and workload W1.

The overall behavior depends upon factors such as queuing time (when the degree of multi-programming reaches the maximum possible value Dk), resource fragmentation (when there are fewer jobs than partitions), memory demands and context switch overheads. The best values of K and Dk for this setting are summarized in Table 7. In general, efficient applications can effectively use all the given processors. Thus, when CPU utilization is low, large values of K can result in fragmentation and idle resources; hence, small values of K are better. By increasing the degree of multi-programming (large Dk), the system can reduce the average waiting times and also accommodate any bursts of arrivals. At the other extreme of high CPU utilizations, it is more important to maintain efficient node allocations (larger values of K). In this case, parallel applications run on fewer nodes with higher memory demands per node, and maintaining a low Dk can reduce both the thrashing due to memory misses and the context switch overheads. The task of the monitoring facilities is to watch for any changes in the workload so that the gang scheduling PLS can adjust K and Dk accordingly.


Figure 9: Best values of K and Dk for CPU utilization of 0.3. These results are for 8 nodes and workload W1.

Our gang-scheduling strategy attempts to maintain the best values for these parameters (see Section 4.1) based on the trends observed in and the insights gained from our experiments and analysis. In particular, for each set of workload characteristics, the system identifies a range of CPU utilizations for which each value of K provides the best performance. The triggers used by FDP in this strategy are CPU utilization and timers. At each smoothing interval (specified by the timers), our system attempts to adjust the value of K as the load crosses from one CPU utilization range to another. While K is fixed, the system increases Dk as the load rises within the specific CPU utilization range. The monitoring facilities watch for changes in the workload, and the system adjusts the range of CPU utilizations for each value of K accordingly.
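The run-time adjustment of K and Dk can be sketched as a lookup over per-workload operating points such as those in Table 7; the table entries below are the workload-W1 values, while the function names and the treatment of boundary utilizations are assumptions.

    #include <stdio.h>

    struct operating_point { double u_low, u_high; int K, Dk; };

    /* Best (K, Dk) per CPU-utilization range for workload W1 (Table 7). */
    static const struct operating_point table7[] = {
        { 0.0, 0.4, 1, 4 },
        { 0.4, 0.7, 1, 8 },
        { 0.7, 0.9, 8, 1 },
        { 0.9, 1.0, 8, 4 },
    };

    /* Called at each GS smooth interval with the measured CPU utilization. */
    void adjust_gang_matrix(double u, int *K, int *Dk)
    {
        for (unsigned i = 0; i < sizeof table7 / sizeof table7[0]; i++) {
            if (u > table7[i].u_low && u <= table7[i].u_high) {
                *K  = table7[i].K;    /* K changes only when a range boundary is crossed */
                *Dk = table7[i].Dk;
                return;
            }
        }
        *K = 1;                        /* idle system: a single subpartition */
        *Dk = 1;
    }

    int main(void)
    {
        int K, Dk;
        double samples[] = { 0.3, 0.75, 0.95 };
        for (int i = 0; i < 3; i++) {
            adjust_gang_matrix(samples[i], &K, &Dk);
            printf("utilization %.2f -> K=%d, Dk=%d\n", samples[i], K, Dk);
        }
        return 0;
    }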

7.2.4 Gang Scheduling Integration

Pursuing further the relation between the characteristics of the workloads and the individual scheduling strategies, we compared the relative performance of workloads W1 and W2 on the space-sharing and time-sharing partitions. Table 8 shows these results. As expected, the time sharing partition provides better performance for highly variable and efficient workloads, due in part to its preferential treatment of short jobs, while space sharing provides better performance for less efficient workloads.


Figure 10: Best values of K and Dk for CPU utilization of 0.5. These results are for 8 nodes and workload W1.

Workload   Space Shared   Time Shared
W1         1.55           1.0
W2         1.0            1.42

Table 8: Comparing time sharing to space sharing for an 8-node system.

The gang scheduling matrix adjusts to the workload mixture by reserving a subpartition for space-sharing (where the value of Dk is maintained at 1), a set of time-sharing subpartitions (where the value of Dk may be greater than 1), and an FDP mechanism that employs timer and CPU utilization triggers to re-allocate resources dynamically. When load imbalances occur (CPU utilization trigger), only one subpartition is allowed to move from time-sharing to space-sharing (or vice versa) per GS smooth interval. By experimentation we found that the GS smooth interval gives better results when it is approximately 3 1/3 times the SS smooth interval.



Figure 11: Best values of K and Dk for CPU utilization of 0.85. These results are for 8 nodes and workload W1.

7.2.5 Summary of Gang Scheduling Parameters

A summary of the example values we employed can be found in Table 9. Parameters Dk and K are load dependent and are adjustable by the system at run time. Both the time slice interval and the GS smooth interval are set by the system administrator; the best results were observed when the time slice was set to 5 seconds and GS smooth to 100 seconds. The system administrator may also set the minimum size of a column group.

Parameter             Policy                Value
Dk                    System Monitored      see Figures 9, 10, 11, and 12
K                     System Monitored      see Figures 9, 10, 11, and 12
time slice duration   System Administrator  5 seconds
SS smooth             System Administrator  30 seconds
GS smooth             System Administrator  100 seconds
TRM                   System Administrator  < GS smooth
minimum size of K     System Administrator  1

Table 9: Summary of Gang Scheduling Parameters.

7.3 Load Sharing vs Gang Scheduling

The load sharing partition is intended for loosely-synchronized, irregular and sequential application workloads, and is better suited to them than gang scheduling is. To demonstrate this point, we present in Table 10 measurements for W3 scheduled on the load sharing and gang scheduling partitions. The load sharing partition provides considerable performance improvements over the gang scheduling strategy, due in part to the "holes" that the irregular applications leave in the gang-scheduling matrix. W2, on the other hand, performs much worse on the load sharing partition because this partition provides no support for synchronized scheduling. The preferred partition for W2 (as well as W1) is the gang scheduling partition.

Workload   Load Shared   Gang Scheduled
W3         1.0           1.28
W2         1.17          1.0

Table 10: The relative performance of load sharing vs gang scheduling for two different workloads.


Figure 12: Best values of K and Dk for CPU utilization of 0.925. These results are for 8 nodes and workload W1.

The performance improvements of the load sharing partition over the gang scheduling partition become even more significant as the system utilization increases under a workload of only loosely-synchronized applications (excluding the sequential applications). This is shown in Figure 13, which provides the percentage improvement in mean response time under our load sharing algorithm, compared to gang scheduling, over a range of utilizations for 8- and 16-node partitions. Observe further that the magnitude of these relative performance benefits under load sharing tends to increase rapidly with the number of nodes in the partition.

7.4 Performance of the final integrated system

It is apparent that no one partition is best for all workloads. For the three workloads we considered, the preferred scheduling strategies are shown in Table 11. In Octopus, when an application enters the system, it is expected to provide its profile (application type, efficiency, execution time, synchronization type, memory and processor requirements) as described in Section 5.1.1. With this information, the DLS dispatches the sequential or loosely-synchronized parallel applications to the load-sharing scheduling partition, and the tightly-synchronized applications to the gang scheduling partition. Depending on the partitions' utilizations and on the application's efficiency and execution time information, the tightly-synchronized applications are further dispatched to either the space-sharing or time-sharing subpartitions. For example, efficient applications with short execution times would be dispatched to the time-sharing subpartition, while inefficient and relatively long-running applications would be dispatched to the space-sharing subpartition. With applications assigned to the appropriate (sub-)partitions for execution, the FDP algorithm presented in Section 5.5 monitors system and partition utilizations and re-allocates nodes to equalize the partition utilizations.

Workload   Preferred Partition
W1         Gang Scheduling (time-sharing part)
W2         Gang Scheduling (space-sharing part)
W3         Load Sharing

Table 11: The partition preferences of different workloads.

Table 12 shows the superiority of our integrated resource management system on a particular workload. WN is a workload that consists of an equal percentage of the three workloads (W1, W2, W3). The DLS smooth parameter was set to 2 times the PLS smooth parameter, i.e., 200 seconds. The number of nodes was 16. In these experiments the value of T was set to 0.2 (i.e., nodes are re-allocated when the partition utilizations differ by more than 20 percent), and the minimum size of a partition was 2.

Workload   Combined   Gang Scheduled   Load Shared
WN         1.0        1.37             1.54

Table 12: Combined approach versus gang scheduled versus load shared.
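The admission-time dispatch described above can be sketched as follows; the profile fields and the efficiency and execution-time thresholds are illustrative assumptions, not the prototype's actual values.

    #include <stdio.h>

    enum sync_type { SEQUENTIAL, LOOSELY_SYNCHRONIZED, TIGHTLY_SYNCHRONIZED };
    enum target    { LOAD_SHARING, GS_TIME_SHARING, GS_SPACE_SHARING };

    struct profile {
        enum sync_type sync;    /* synchronization type from the application profile */
        double efficiency;      /* parallel efficiency, 0..1                          */
        double exec_time;       /* estimated execution time in seconds                */
    };

    enum target dispatch(const struct profile *p)
    {
        if (p->sync != TIGHTLY_SYNCHRONIZED)
            return LOAD_SHARING;            /* sequential or loosely synchronized jobs */
        /* Tightly synchronized: efficient, short jobs to time sharing; inefficient,
         * long-running jobs to space sharing (thresholds are illustrative). */
        if (p->efficiency > 0.7 && p->exec_time < 600.0)
            return GS_TIME_SHARING;
        return GS_SPACE_SHARING;
    }

    int main(void)
    {
        struct profile fft  = { TIGHTLY_SYNCHRONIZED, 0.8, 120.0 };
        struct profile make = { LOOSELY_SYNCHRONIZED, 0.5, 300.0 };
        printf("fft -> %d, make -> %d\n", dispatch(&fft), dispatch(&make));
        return 0;
    }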

7.5 Overheads of integrating fault-tolerance

Our current implementation of the fault-tolerance components of the system focused on the functional aspects of the backup and recovery scheme. At any given moment, the system can be in one of the following states:

- normal execution and logging messages
- recovery due to DLS/PLS failure
- recovery due to NLS failure

Recall that the DLS and PLS are implemented as threads in the same process, and the NLS runs as a separate process. To log a message at the DLS/PLS backups, the primary sends either asynchronous or synchronous messages across the network, as described previously in Section 5.6. In Table 13, we list the cost of scheduling an application with 0, 1, 2 and 3 DLS/PLS backups.


Figure 13: Relative performance of load sharing vs gang scheduling as a function of utilization and partition size.

The numeric differences between the cases R ∈ {1, 2, 3} and R = 0 indicate the overheads of using the fault-tolerance feature. The measurements were collected by scheduling a parallel application on a load-sharing partition consisting of up to 8 nodes. Since most of the cost is in sending the messages, we compared the costs over the high-speed switch (300 Mbps) and ethernet (10 Mbps) adaptors of the SP2 system, and also over a token-ring (16 Mbps) network of workstations. The average length of the messages sent to the backups was around 256 bytes, well below the maximum size of one message packet. The DLS timings in the table are the costs seen from the primary DLS. These timings include the time for the DLS to process an arriving application (select and pass the application to the PLS) and to send the necessary messages to the backup DLSs. At each backup, the time to process these messages was about 4 msec for the SP2 system. The PLS/NLS measurements in the table include the time to select the nodes in the assigned partition for the application to run, and the time to send a message from the primary to the backups containing the node-application pair information.

Note that the time increases with the number of backups and nodes. In the next iteration of our prototype, we plan to investigate some obvious optimization techniques: messages from the primary can be multicast to the backups when there is more than one backup, and, instead of sending one message per node-application assignment, the pair information can be grouped into one or a few messages from the primary to the backups.

Type           DLS    DLS    DLS    DLS    NLS    NLS    NLS    NLS
               R=0    R=1    R=2    R=3    R=0    R=1    R=2    R=3
N=8,SP2/sw     43     65     76     100    87     239    421    611
N=4,SP2/sw     46     64     -      -      41     119    -      -
N=2,SP2/sw     42     63     -      -      27     57     -      -
N=8,SP2/eth    44     62     85     108    89     250    456    639
N=4,network    96     120    -      -      41     155    -      -

Table 13: Application scheduling cost on the fault-tolerant scheduler (time in msec, N = number of nodes in partition, R = replication number, sw = high-speed switch, eth = ethernet).

In our prototype, we have employed an implementation of a node membership protocol (NM) that was built separately. NM "proactively" detects node failures using a heartbeat mechanism. Our fault-tolerant scheduler relies on NM for leader election (primary/backup nodes), failure detection and group membership. Figure 14 presents the case in which the scheduler has one backup DLS. The figure also shows the underlying node membership service and the flow of information between the components.


Figure 14: Architecture of the fault-tolerant scheduler. The primary and backup DLSs exchange logging messages and acknowledgements, and their membership modules rely on the node membership service for heartbeats and leader election.

When the primary DLS/PLS fails, the underlying NM notifies the backups of the failure and elects a new primary DLS/PLS. On the SP2 system, we measured 48 msec to establish a new primary DLS/PLS. The timing was measured at the newly elected backup, from the moment the death notification was received until the time at which the new primary proclaimed itself the new leader. During this period of time, the elected leader receives its election notification from NM, establishes the appropriate alarm activities (as described in Section 6.2), and reconstructs the state information from the logged messages (see Section 5.6). NM could also be used to detect NLS node failures. Instead, for performance reasons, we used a passive mode of failure detection: NLSs send load information to their PLS at fixed intervals, and when an NLS fails to communicate with the DLS/PLS for a specific timeout period, the DLS/PLS inquires about the status of the node.

If the response is negative, we declare that the node has failed and proceed with the recovery steps: remove the failed node from the DLS/PLS, and then inform the ALMs of the affected applications or terminate the applications that have no ALM. In Table 14, we show the time to remove a failed node when a bag-of-tasks application is running on a partition of 8 nodes. The cost of terminating an application is higher than that of reconfiguring the application. This is a result of the extra logic in the DLS, the PLS and their interaction, as well as the additional communication with the backup to reflect the changes in the node-application information.

Type               DLS/NLS, R=0   DLS/NLS, R=1   ALM
APP terminated     29             175            -
APP reconfigured   21             65             67

Table 14: Recovery cost on the fault-tolerant scheduler (time in msec, R = replication number).
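A minimal sketch of the passive failure detection just described follows; the timeout value, the probe and notification stubs, and the bookkeeping arrays are assumptions used to keep the example self-contained.

    #include <stdio.h>
    #include <time.h>

    #define NODES        8
    #define NLS_TIMEOUT 30             /* seconds without a load report */

    static time_t last_report[NODES];
    static int    node_up[NODES] = { 1, 1, 1, 1, 1, 1, 1, 1 };

    static int  probe_node(int n)          { (void)n; return 0; }   /* stub: no response */
    static void notify_or_terminate(int n) { printf("recover applications on node %d\n", n); }

    /* Called whenever a load report arrives from node n. */
    void on_load_report(int n) { last_report[n] = time(NULL); }

    /* Periodic check run by the DLS/PLS. */
    void check_nodes(void)
    {
        time_t now = time(NULL);
        for (int n = 0; n < NODES; n++) {
            if (!node_up[n] || now - last_report[n] < NLS_TIMEOUT)
                continue;                 /* node already removed or heard from recently */
            if (!probe_node(n)) {         /* negative response: declare the node failed  */
                node_up[n] = 0;           /* remove it from the DLS/PLS                  */
                notify_or_terminate(n);   /* inform ALMs or terminate apps without one   */
            }
        }
    }

    int main(void)
    {
        for (int n = 0; n < NODES; n++)
            on_load_report(n);            /* every NLS has reported recently */
        last_report[5] = time(NULL) - 60; /* node 5 has been silent too long */
        check_nodes();
        return 0;
    }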

8 Conclusions

In this paper we have presented the design and implementation of Octopus, a hierarchical and extensible resource management system that allows the concurrent execution of multiple scheduling paradigms. Octopus adopts a new scheduling strategy, called flexible dynamic partitioning, for dynamically adjusting the partitioning of resources at all levels of the system hierarchy. To show the effectiveness of our resource management system, we have presented a new parallel scheduling algorithm that combines three known scheduling paradigms (load sharing, time sharing and space sharing). We have adopted a novel approach and integrated these well-known paradigms into a coherent, efficient and fault-tolerant distributed scheduler. Our results show that different workloads perform better under different paradigms, and thus the scheduler for general-purpose parallel systems must combine and manage the different paradigms required by the application workload. The purpose of Octopus is to map incoming applications to the scheduling paradigms that best fit their resource requirements and processing characteristics given the state of the system. In addition, it is fault-tolerant and provides a uniform interface for applications to react to node pre-emption and node failure. Since we cannot determine a priori all the applications that will use our scheduling system, we have made our system extensible. This means that new applications that use our system must be characterized and classified, according to the properties discussed, as belonging to a specific workload. These workloads may be different from those that we have examined in this paper and may thus require different scheduling policies.

References

[1] G. R. Andrews. Paradigms for process interaction in distributed programs. ACM Computing Surveys, 23(1):49-90, Mar. 1991.
[2] R. Campbell and N. Islam. Choices: A Parallel Object-Oriented Operating System. In G. Agha, P. Wegner, and A. Yonezawa, editors, Research Directions in Concurrent Object-Oriented Programming. MIT Press, 1993.
[3] R. Campbell, N. Islam, D. Raila, and P. Madany. Designing and Implementing Choices: an Object-Oriented System in C++. Communications of the ACM, Sept. 1993.
[4] Carriero, Freeman, Gelernter, and Kaminsky. Adaptive Parallelism in Piranha. IEEE Computer, 28(1):40-49, Jan. 1995.
[5] S. Chapin and E. Spafford. Support for Implementing Scheduling Algorithms Using Messiahs. In Scientific Programming, pages 325-340. John Wiley, 1994.
[6] M. Crovella, P. Das, C. Dubnicki, T. LeBlanc, and E. Markatos. Multiprogramming on multiprocessors. In Proceedings of the IEEE Symposium on Parallel and Distributed Processing, pages 590-597, 1991.
[7] V. D. Cung et al. Concurrent data structures and load balancing strategies for parallel branch-and-bound/A* algorithms. In The Third DIMACS International Algorithm Implementation Challenge on Parallel Algorithms, October 1994.
[8] S. A. Fakhouri, L. L. Fong, A. S. Gopal, N. Islam, J. A. Pershing, and M. S. Squillante. Milliways scheduling and membership design. Technical report, IBM Research Division, November 1994.
[9] D. G. Feitelson. A survey of scheduling in multiprogrammed parallel systems. Technical Report RC 19790, IBM Research Division, October 1994.
[10] D. G. Feitelson and B. Nitzberg. Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pages 337-360. Springer-Verlag, 1995. Lecture Notes in Computer Science Vol. 949.
[11] D. G. Feitelson and L. Rudolph. Distributed hierarchical control for parallel processing. Computer, pages 65-77, May 1990.
[12] B. Ford and S. Susarla. CPU Inheritance Scheduling. OSDI 96, October 1996.
[13] P. Goyal, X. Gao, and H. Vin. A Hierarchical CPU Scheduler for Multimedia Operating Systems. OSDI 96, October 1996.
[14] A. Gupta, A. Tucker, and S. Urushibara. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 120-132, May 1991.
[15] S. G. Hotovy. Workload evolution on the Cornell Theory Center IBM SP2. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pages 27-40. Springer-Verlag, 1996. Lecture Notes in Computer Science Vol. 1162.
[16] N. Islam. Distributed Objects: Methodologies for Customizing System Software. IEEE Computer Society Press, 1996.
[17] N. Islam and M. Devarakonda. An Essential Design Pattern for Fault-Tolerant Distributed State Sharing. In Communications of the ACM, Oct. 1996.
[18] N. Islam, A. Prodromidis, and M. S. Squillante. Dynamic partitioning in different distributed-memory environments. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pages 244-270. Springer-Verlag, 1996. Lecture Notes in Computer Science Vol. 1162.
[19] N. Islam, A. Prodromidis, M. S. Squillante, A. S. Gopal, and L. L. Fong. Extensible resource management for cluster computing. Technical Report RC 20526, IBM Research Division, May 1996.
[20] N. Islam, A. Prodromidis, M. S. Squillante, A. S. Gopal, and L. L. Fong. Extensible resource scheduling for parallel scientific applications. In Proceedings Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[21] F. Jahanian, S. Fakhouri, and R. Rajkumar. Processor group membership protocols: Specification, design and implementations. In Proceedings of the Symposium on Reliable Distributed Systems, October 1993.
[22] G. E. Krasner and S. T. Pope. A Cookbook for Using the Model-View-Controller Paradigm in Smalltalk-80. Journal of Object-Oriented Programming, pages 26-49, 1988.
[23] A. Krishnamurthy et al. Connected components on distributed memory machines. In The Third DIMACS International Algorithm Implementation Challenge on Parallel Algorithms, October 1994.
[24] S. T. Leutenegger and M. K. Vernon. The performance of multiprogrammed multiprocessor scheduling policies. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 226-236, May 1990.
[25] A. M. Makowski and R. D. Nelson. Optimal scheduling for a distributed parallel processing model. Technical Report RC 17449, IBM Research Division, February 1992.
[26] C. McCann, R. Vaswani, and J. Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Transactions on Computer Systems, 11(2):146-178, May 1993.
[27] C. McCann and J. Zahorjan. Processor allocation policies for message-passing parallel computers. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 19-32, May 1994.
[28] D. Mills. Network time protocol (v.3) specification, implementation and analysis. Technical Report RFC 1305, University of Delaware, March 1992.
[29] D. L. Mills. Improved algorithms for synchronizing computer network clocks. IEEE Transactions on Networks, pages 245-254, June 1995.
[30] S. Mullender. Distributed Systems. Addison-Wesley, 1993.
[31] V. K. Naik, S. K. Setia, and M. S. Squillante. Performance analysis of job scheduling policies in parallel supercomputing environments. In Proceedings of Supercomputing '93, pages 824-833, November 1993.
[32] J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of the Third International Conference on Distributed Computing Systems, pages 22-30, October 1982.
[33] J. D. Padhye and L. W. Dowdy. Preemptive versus non-preemptive processor allocation policies for message passing parallel computers: An empirical comparison. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pages 224-243. Springer-Verlag, 1996. Lecture Notes in Computer Science Vol. 1162.
[34] M. E. Rosenkrantz, D. J. Schneider, R. Leibensperger, M. Shore, and J. Zollweg. Requirements of the Cornell Theory Center for resource management and process scheduling. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pages 304-318. Springer-Verlag, 1995. Lecture Notes in Computer Science Vol. 949.
[35] E. Rosti, E. Smirni, L. W. Dowdy, G. Serazzi, and B. M. Carlson. Robust partitioning policies of multiprocessor systems. Performance Evaluation, 19:141-165, 1994.
[36] S. K. Setia, M. S. Squillante, and S. K. Tripathi. Analysis of processor allocation in multiprogrammed, distributed-memory parallel processing systems. IEEE Transactions on Parallel and Distributed Systems, 5(4):401-420, April 1994.
[37] S. K. Setia and S. K. Tripathi. A comparative analysis of static processor partitioning policies for parallel computers. In Proceedings of the International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 283-286, January 1993.
[38] K. C. Sevcik. Characterizations of parallelism in applications and their use in scheduling. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 171-180, May 1989.
[39] K. C. Sevcik. Application scheduling and processor allocation in multiprogrammed parallel processing systems. Performance Evaluation, 19:107-140, 1994.
[40] M. S. Squillante. On the benefits and limitations of dynamic partitioning in parallel computer systems. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pages 219-238. Springer-Verlag, 1995. Lecture Notes in Computer Science Vol. 949.
[41] M. S. Squillante and K. P. Tsoukatos. Analysis of optimal scheduling in distributed parallel queueing systems. In Proceedings of the International Conference on Computer Communication, August 1995.
[42] M. S. Squillante and K. P. Tsoukatos. Optimal scheduling of coarse-grained parallel applications. In Proceedings Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[43] M. S. Squillante, F. Wang, and M. Papaefthymiou. An analysis of gang scheduling for multiprogrammed parallel computing environments. In Proceedings of the Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 89-98, June 1996.
[44] M. S. Squillante, F. Wang, and M. Papaefthymiou. Stochastic analysis of gang scheduling in parallel and distributed systems. Performance Evaluation, 27&28:273-296, 1996.
[45] A. Tucker and A. Gupta. Process control and scheduling issues for multiprogrammed shared-memory multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pages 159-166, December 1989.
[46] J. M. Vlissides and M. Linton. Unidraw: A framework for building domain-specific graphical editors. In User Interface Software Technologies, pages 81-94. ACM SIGGRAPH/SIGCHI, 1989.
[47] F. Wang, H. Franke, M. Papaefthymiou, P. Pattnaik, L. Rudolph, and M. S. Squillante. A gang scheduling design for multiprogrammed parallel computing environments. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pages 111-125. Springer-Verlag, 1996. Lecture Notes in Computer Science Vol. 1162.
[48] F. Wang, M. Papaefthymiou, and M. S. Squillante. Performance evaluation of gang scheduling for parallel and distributed multiprogramming. In Proceedings of the 3rd Workshop on Job Scheduling Strategies for Parallel Processing, April 1997.
[49] J. Zahorjan and C. McCann. Processor scheduling in shared memory multiprocessors. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 214-225, May 1990.