Faucets: Efficient Resource Allocation on the Computational Grid

L. V. Kale

November 2, 2003

Contents

1 Computer Power as a Commodity
  1.1 Related Work
  1.2 Target Applications
2 Job Specifications and Quality of Service Contracts
  2.1 Specifying Job Requirements
  2.2 Obtaining Job Requirements
3 Resource Allocation within a Cluster
  3.1 Adaptive Run Time System
  3.2 Interactive Jobs
  3.3 Adaptive Job Scheduler
    3.3.1 A Prototype Scheduler
    3.3.2 Proposed work: Intelligent Scheduler for Adaptive Jobs to Maximize Profit
4 Adaptive Runtime Strategies for Multi-cluster Jobs
5 Faucets: Infrastructure
  5.1 Security
    5.1.1 Authorization Server
  5.2 File transfer
  5.3 Input and Output support
6 Market Economy of Compute Power
  6.1 Bidding Framework
  6.2 Description of the Prototype System
  6.3 Proposed Work: QoS Contracts, Auction and Bidding Algorithms
    6.3.1 Bid Generation
    6.3.2 Scalable Matching and Bid Evaluation
    6.3.3 Market Mechanisms
  6.4 Proposed Work: Evaluation of Strategies via Simulation
  6.5 Multicluster Bidding
  6.6 Cluster Bartering
    6.6.1 Accounting
7 Schedule and Milestones
8 Broad Impact
9 Results From Prior Support

1 Computer Power as a Commodity

The computational grid is a powerful metaphor: compute power should be available and distributed in a manner similar to water or electricity. Of course, much work is needed before this metaphor can become reality. The emerging landscape in high performance computing consists of a large number of medium-sized clusters and a few extremely large parallel machines. (For brevity, we will use the word "cluster" to mean a parallel machine.) In addition, there will be smaller clusters in individual departments (or even under desktops). With the commoditization of processors and even communication fabrics, the cost of building such clusters is low, although the cost of ownership can be high. So the "supply" of compute power can be raised at a reasonably fast rate.

On the demand side, the broad community of scientists and engineers is just starting to take advantage of parallel computers. Parallel modeling efforts in industry, for designing better products, and in science, for understanding natural processes, can be expected to become routine. Coupled with advances in parallel algorithms, and in parallel programs available to the community, this will lead to increased demand for parallel compute power. However, in its natural evolution, such a scenario will have significant systemic inefficiencies.

Some of the strategies we propose are based on the adaptive runtime systems that we developed in Adaptive MPI (AMPI) and Charm++, which help optimize individual parallel applications; with AMPI, these strategies are now applicable to all MPI programs, rather than just Charm++ programs. We will also validate our research continuously by using several production applications in our test-bed, especially NAMD, a parallel molecular dynamics program (which shared a Gordon Bell Award at SC2002) written in Charm++ and used by thousands of biophysicists. We will motivate several of our strategies with NAMD-based examples, although they are applicable more broadly.

Within a single cluster, inefficiencies arise from lower throughput due to the inability to schedule all available processors. For example, consider a parallel machine with 1,000 processors running a 600-processor job A; job B then arrives and needs at least 500 processors. It must wait until A finishes (or suspend A), leaving 400 (or 500) processors unused while a job waits. We will describe our proposal for an adaptive queuing system that uses adaptive jobs (which can change the number of processors allocated to them at runtime) to increase throughput.

Running a single job on multiple clusters will be desirable in several contexts. One such context is what we encounter at the Theoretical and Computational Biophysics Group (where Kale is a co-PI): we own several clusters (including three new clusters with 48 processors each), which are in regular use running NAMD and other programs. However, sometimes a single critical simulation needs to run with a short deadline. Using resources at national centers is one option, but the usage there will be shared and intermittent. Using multiple available clusters for a single simulation will be highly desirable if it is efficient. Further, such multi-cluster runs will be useful to carry out capability simulations that are otherwise infeasible. We propose (Section 4) strategies for runtime optimizations of multi-cluster jobs. Although significant progress has been made in this area (e.g. Cactus-G [?]), we think that processor virtualization (as in AMPI) provides our strategies with unique advantages not explored before.
The above strategies are useful, but they do not handle inefficiencies at the global scale: clusters will remain unused for periods of time while users' jobs wait for their turn on other machines that happen to be busy at the moment. We need adaptive resource allocation at the level of the computational grid to handle this. In the scenario we envisage, clusters will be deployed and operated by "profit centers" (analogous to power generators). Users, authenticated by an independent "billing" service, will submit a parallel job to the "grid" (the Faucets system), which will run it on a suitable machine that submits the best (often, cheapest) bid. A job may be moved between supercomputers during its life. Large jobs may also run on multiple supercomputers simultaneously. Users will be able to monitor and interact with their jobs via the Web, and input and output files are appropriately moved from and to the user's computer.

Adaptive resource management in this scenario requires that we match jobs to resources (clusters) in a market-efficient manner. This will not only allow efficient use of available resources at a given moment in time, but also promote long-term, market-driven selection of a mix of parallel computers (e.g. large and small, with and without large bisection bandwidth) that matches the demand mix. To enable market forces to operate in this context, we will conduct research on bidding systems that select clusters for jobs. When a job is submitted, the Faucets system will identify potential machines with the capabilities needed for the job; these machines (or their "agents") will be sent a request-for-bid. The machines then respond with a bid, which can be a dollar amount, an SU multiplier, or a bartering offer in different contexts. The Faucets system then chooses the "best" bid on behalf of the user, uploads the required files, starts the job, and baby-sits it through its lifetime. We plan to develop a software infrastructure, and a simulation framework, to facilitate deployment of Faucets systems and research comparing alternative strategies. Research issues to be explored include scalable resource selection, and effective bid-generation algorithms that aim to maximize profit for individual clusters. Note that multi-cluster jobs play an important role in this context as well: they make it possible for federations of clusters to compete for large jobs with larger supercomputers, thus increasing competition and efficiency.

Adaptive strategies in all three contexts (single cluster, multiple clusters, and grid) can benefit from an accurate characterization of parallel jobs whenever possible. We will develop QoS (Quality of Service) metrics (Section 2) that will help clusters identify the resource requirements of a job and estimate completion times with reasonable accuracy. To this end, we will also develop techniques for performance characterization of programs that will help build QoS profiles of common applications.

There is a considerable body of research in the broad areas outlined above (E.g. XXXXXXXX cite). We are planning to develop an ambitious system, and plan to build upon existing research whenever possible (e.g. on security and authentication). The novelty of the proposed research is in new classes of adaptive resource management strategies, and in integration, deployment, and experimentation with complete grid-based systems, in simulation as well as on real clusters. To this end, we have obtained support and collaboration from several researchers, listed in letters of support. (XXX CHECK THIS LINE... WE NEED TO GET LETTERS FROM KLAUS, QUINN, AND MARK TUCKERMAN + Haber, Heath, Adve.)

1.1 Related Work

• Zenturio
• Cactus
• Condor
• Globus
• others?


1.2 Target Applications

2 Job Specifications and Quality of Service Contracts

1.5 pages

To schedule a job on a parallel server effectively, the characteristics of the job must be known. In the grid environment, these characteristics are necessary for servers to decide whether they want to accept the job, or even to place a bid for it. One of our first objectives will be to identify the relevant job requirements and embody them in a quality-of-service contract.

2.1 Specifying Job Requirements

The admission control module in a cluster's scheduler should be able to estimate the completion time of a job given its QoS contract, enabling it to return a bid multiplier. Hence the QoS contract should be comprehensive enough for such performance predictions to be made. The job requirements in such QoS contracts will include the following (a sketch of a contract record covering these fields appears at the end of this section):

• Number of processors: the number of processors the job requests. This could be a single number, or the minimum and maximum number of processors the job can run on.

• Estimated CPU cycles needed by the job, and some notion of how this changes with the number of processors.

• Memory requirement: the amount of memory required by the job. If the per-node memory required by the job is not available on a cluster, then the job cannot be submitted to that cluster.

• Deadline: there may be a soft and a hard deadline, with a different payoff after the soft deadline. A parallel machine that cannot meet the deadline will not accept the job.

• Processor architecture: the preferred architecture for the job. Certain jobs may prefer a specific architecture; e.g. vector code may request a vector processor. A wild card would mean the job runs equally well on any architecture.

• Communication characteristics: certain jobs are sensitive to communication latency; others may need large bisection bandwidth. Hence the QoS contract could include the number of messages and the amount of data transmitted by the job. These characteristics would then determine the communication architectures suitable for the job.

• Disk space: the amount of disk space used at runtime, and the volume of data transferred in uploading input files and downloading output files, will also affect where the job can run.

• Software environment: this could include the executable for the job, the host operating system, and the required compilers and software libraries.

• Payoff or priority: this determines the amount the user is willing to pay if the job is finished before its deadline. The payoff does not have to be constant and can change (depreciate) with time; missing the deadline could entail a steep post-deadline drop-off in the payoff function.

3

• Adaptive jobs: adaptive jobs can change the number of processors they are running on at runtime. As we will show later, running adaptive jobs increases the overall utilization of the system, so parallel clusters would be inclined to run adaptive jobs rather than traditional jobs, and hence to offer cheaper bids for them.

One of the research issues is to decide the level of detail in the specifications. For example, the completion time as a function of the number of processors can be specified by simple or more sophisticated models. For memory, one could list the amount of memory used per processor, as well as some measure of cache utilization. For communication, one may characterize the number and volume of messages internal to the program, the communication with the outside world, and the number, frequency, and burstiness of complex collective communication operations. Disk I/O may also be characterized by initial input size, output size, and access pattern during the run itself (for example, an out-of-core solver may have extensive online disk usage).

Some applications have distinct phases or components, each with very different requirements. So job requirements could also include a directed acyclic graph of the different phases of the application, which can potentially run on several supercomputers without violating dependencies. Even when they are running on the same machine, the scheduler may benefit from knowing the shift in performance parameters when the program moves from one phase to another. The QoS contract will be able to specify such phases and components, and iterative structures around them (if any). Note that to be useful, such a phase must last for several minutes, to justify the overhead of moving the job.

To specify complex job requirements, a resource specification language is needed. We plan to investigate the language RSL developed by the Globus framework and extend it further if necessary. RSL description ??
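To make the field list above concrete, the following is a hypothetical sketch, in C++ (the language of our Charm++/AMPI infrastructure), of a contract record covering these fields. The field names and types are illustrative assumptions, not a fixed Faucets format; the eventual contract will be expressed in an RSL-like language as discussed above.

    #include <string>
    #include <vector>

    // Illustrative QoS contract record; all names are assumptions.
    struct QoSContract {
      int minProcessors;                 // smallest allocation the job accepts
      int maxProcessors;                 // largest allocation it can exploit
      double estimatedCpuHours;          // total work, summed over processors
      double memoryPerNodeMB;            // job is ineligible on nodes with less
      double softDeadlineHours;          // full payoff if finished before this
      double hardDeadlineHours;          // payoff drops off steeply after this
      std::string architecture;          // e.g. "x86", "vector", or "*" wildcard
      double messagesPerStep;            // coarse communication characterization
      double megabytesPerStep;
      double diskSpaceMB;                // scratch space plus input/output volume
      std::vector<std::string> software; // OS, compilers, required libraries
      double maxPayoff;                  // payment if the soft deadline is met
      bool adaptive;                     // can shrink/expand at runtime
    };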

2.2 Obtaining Job Requirements

How do users specify the job requirements? In the grid environment, a small number of applications will likely constitute a majority of parallel jobs, since many users run the same application program with different data. Calibration runs can be performed for such common applications. Using these runs as a basis, individual applications could provide a preprocessor to estimate the job requirements for a given input. Based on our experience with production applications in science and engineering (NAMD [23, 3], rocket simulation [2], crack propagation [12]), we will develop such performance estimators and validate their predictive power via experiments on multiple parallel machines.

For an individual parallel machine, one should be able to use the parameters supplied by the application, together with machine-specific performance parameters (such as CPU speed, cache size, communication latency and bandwidth, topology, etc.), to predict the completion time on a given number of processors. The QoS parameters will be refined and standardized based on this experimentation, and tested on other applications. One could even carry out automatic parameter calibration via program instrumentation for applications where the developers have not provided parameters.

Notice that our overall system does not depend strictly on a highly accurate predictive performance model. A conservative estimate that allows the users to reserve the resources for an adequate time is all that is needed. Further, the ability to checkpoint applications allows the parallel servers to stop an application at the end of the contracted period, without wasting the entire run in case the estimate was too short. Even when its predictive accuracy is limited, a contract is necessary in order to do any heuristic scheduling. Contracts are also useful to provide a structure by which unrelated clients and servers can interact to each other's mutual benefit.
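As an illustration of the kind of model such an estimator could produce, the sketch below predicts completion time from calibrated application parameters and published machine parameters. The model form (no compute/communication overlap, linear communication cost) and all names are our assumptions for exposition, not an existing Faucets interface.

    // Machine-specific parameters, as published by a cluster.
    struct MachineParams {
      double cpuGFlops;     // per-processor compute rate
      double latencySec;    // per-message latency
      double bandwidthMBps; // effective per-processor bandwidth
    };

    // Application parameters, as produced by calibration runs.
    struct AppParams {
      double workGFlop;     // total computation
      double msgsPerStep;   // messages per processor per step
      double mbPerStep;     // megabytes per processor per step
      int steps;
    };

    // Conservative wall-clock estimate on p processors: compute shrinks
    // with p, while per-step communication does not (no overlap assumed).
    double predictSeconds(const MachineParams& m, const AppParams& a, int p) {
      double compute = a.workGFlop / (m.cpuGFlops * p);
      double comm = a.steps * (a.msgsPerStep * m.latencySec +
                               a.mbPerStep / m.bandwidthMBps);
      return compute + comm;
    }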


3 Resource Allocation within a Cluster

1.5 pages: AMPI description (0.5); adaptive jobs and adaptive queuing system (1.25); IMD (0.25)

A simplistic definition of the utility of a cluster is "processor utilization": if the machine's processors are busy 90% of the time, it is better utilized than when they are 60% busy. However, this metric does not take into account the relative importance of jobs; running a toy program is considered on par with solving a research problem. If the machine is running as a profit center (as envisaged in our depiction of the grid environment), utility can be defined as the total fee collected from running jobs. The fee charged may vary depending on the time of completion.[2]

Even with the simple definition of utility as percent utilization, it is clear that a major problem in attaining high utility is the rigidity of existing parallel jobs: once submitted, a job must run to completion on the same number of processors it started with. This rigidity was responsible for the wastage of computing resources identified in the first example in the introduction. This wastage can be alleviated if other, smaller jobs are ready to be executed [27], but this may not always be the case, leading to under-utilization of the system.

Adaptive jobs provide a potential solution because they can change the number of processors they use at runtime, on command. (Adaptive jobs correspond to the malleable and evolving jobs of the classification in [10].) It has been shown [33, 25, 30, 16, 29, 15, 13] that a parallel machine will run more efficiently with adaptive jobs. However, most parallel jobs today are not adaptive. In limited contexts, such as master-slave or certain data-parallel applications, adaptive jobs can be created relatively easily [26], but most applications do not fit these patterns. Even though some libraries (e.g. PVM) support the addition and removal of processors, the programmer has to write code to handle such exceptional conditions, explicitly moving work away from processors leaving the system, and splitting work from other processors and moving it to newly joined processors. Our approach to this problem is based on the Charm++ system and an adaptive implementation of MPI, described next.

[2] Even when the machine is not operating as a profit center, it is important to prioritize jobs in some fashion. One can thus imagine that funding agencies such as NSF will pay the computer centers indirectly, by funding application scientists for the compute power they plan to use. This may bring about additional efficiencies in private management of such centers.

3.1 Adaptive Run Time System

Charm++ [20], a language that has been the focus of our research for the last decade, is an object-oriented parallel language that enforces encapsulation by bundling work units into objects. Objects communicate via asynchronous remote method invocation. Charm++ provides constructs for creating a collection of objects. Individual objects in this collection are accessed by their index within the collection, rather than by the processor on which they reside. This allows the Charm++ runtime system to map (and remap) individual objects to available processors to maximize processor utilization (Figure 1(a)). When objects are remapped by migrating them to different processors, in-transit method invocations directed at those objects might still arrive on their original processors. However, the run-time system transparently forwards the method invocations to the new homes of those objects.


(a) Charm++ automatically maps objects to available processors.

(b) Software components for creating adaptive jobs.

Figure 1: Creating adaptive jobs.

Adaptive MPI (AMPI): Most parallel programs today are written using MPI, not Charm++. Traditional MPI implementations assume one work unit (a heavyweight process) per processor. To bring the benefits of adaptive jobs (including dynamic load balancing) to MPI applications, we have developed Adaptive MPI [14], based on Charm++, which uses user-level threads as work units. Multiple user-level threads, each running an MPI "process", are mapped to each processor. Switching between user-level threads on a processor is much cheaper than switching between heavyweight processes, and migration of user-level threads is also considerably cheaper than checkpointing and restarting the entire set of processes in an MPI job.

AMPI and Charm++ redistribute work by migrating objects: when the number of processors changes, the work units in a parallel job need to be redistributed so as to efficiently utilize the new set of processors. For this purpose, the adaptive runtime system should be able to estimate the computational loads of all work units. Fortunately, the principle of persistence [8] implies that the future computational behavior of work units is closely related to their past behavior, so we can use actual measurements of computational loads and communication patterns. The load balancing (LB) framework in the Charm++ runtime system automatically instruments the program to collect data about computation times and communication volumes for individual objects. The mapping (assignment) of objects to processors is left to pluggable load balancing strategies. A load balancing strategy is triggered either by the application itself, when the application structure changes (as in adaptive mesh refinement), or by an external agent such as a job scheduler. We have shown in [24] that our adaptive job framework has low overhead and improves the performance of parallel clusters.
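As a minimal illustration of measurement-based remapping (not the actual Charm++ strategies, which also account for communication volumes and topology), the sketch below greedily assigns objects, heaviest first, to the least-loaded processor, using the measured per-object loads collected by the LB framework.

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Greedy remap: place each object (heaviest first) on the processor
    // with the least accumulated load. Returns objIndex -> processor.
    std::vector<int> greedyRemap(const std::vector<double>& objLoad, int nProcs) {
      using Entry = std::pair<double, int>; // (accumulated load, processor)
      std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
      for (int p = 0; p < nProcs; ++p) heap.push({0.0, p});

      std::vector<int> order(objLoad.size());
      for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return objLoad[a] > objLoad[b]; });

      std::vector<int> assignment(objLoad.size());
      for (int obj : order) {
        Entry e = heap.top(); heap.pop();
        assignment[obj] = e.second;                 // migrate object here
        heap.push({e.first + objLoad[obj], e.second});
      }
      return assignment;
    }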

3.2 Interactive Jobs

Another use of adaptive jobs is in real-time interactive jobs, which are not handled effectively by current queuing systems, most of which have been designed to run jobs in batch mode. Examples of such real-time interactive jobs are interactive molecular dynamics [17] and debugging parallel programs on large machines. Here, users will run an application for a while and then examine the results for a while, before deciding the next course of action. While the user is making this decision, the number of processors required by the job can be brought down to the minimum, allowing other interactive jobs to run. When the user decides on the next course of action, the job can expand and run at full throughput. We plan to develop strategies that will effectively handle such jobs and utilize their idle time by expanding other jobs, or running new jobs, during those intervals.

3.3 Adaptive Job Scheduler

Once adaptive jobs are feasible, with their characteristics and payoff functions specified in QoS contracts, the design of intelligent job schedulers that aim to maximize system utility becomes possible. Most current production queuing systems are incapable of exploiting the opportunities created by adaptive jobs. The next challenge is to develop adaptive job schedulers (AJS) that can take advantage of these jobs. (Note that an AJS is more than a queuing system, because it keeps control of a job even after it has started running, and may ask it to expand or shrink the number of processors it is using.) The scheduler is triggered when a new job arrives in the system, and when a running job finishes (or requests a change in the number of processors assigned to it). On job arrival, the scheduler analyzes the job's resource requirements and hard/soft deadlines to decide whether it can accept it. The scheduler should aim to maximize a system utility metric, which can be system utilization, job response time, or a profit metric based on the user's payment scheme and the timely completion of the job. Hence if a high-profit job with a tight deadline arrives, low-priority jobs can be shrunk and the freed processors allocated to the high-priority job. To demonstrate the utility of a scheduler that can take advantage of adaptive jobs, we have implemented a scheduler that uses simple QoS contracts, and only percentage utilization (as opposed to profit) as its performance metric. We describe this scheduler and its performance next, which also motivates the proposed research on this topic.

3.3.1 A Prototype Scheduler

The prototype system, called the Faucets Scheduler, has been implemented for Unix-based parallel clusters, and tested on Linux and Solaris clusters. The Faucets Scheduler accepts jobs that specify the minimum number of processors (minpe) and the maximum number of processors (maxpe) as their only QoS metrics. When a job arrives or completes, the scheduler reallocates the processors among the jobs depending on their QoS requirements. It may do any of the following: (i) expand existing jobs when one of the jobs finishes, (ii) shrink the current jobs when a new one arrives, (iii) suspend or checkpoint an existing job. The simple strategy we have implemented (details in [24]) is a variant of equipartitioning [33]; a simplified sketch appears after Figure 2.

Performance Results: We evaluated our cluster scheduler on a Linux cluster with 32 nodes connected by 100 Mbps Ethernet; each node has two 1 GHz Pentium III processors. A random job generator was used to submit jobs to the scheduler. Job arrival was Poisson distributed and service time was exponentially distributed. The experiments computed the mean response time and the mean system utilization. All experiments had 50 job arrivals. The load factor lf is given by lf = λ × (execution time on 64 processors), where λ is the mean arrival rate. Traditional jobs were submitted to a scheduler that emulates traditional schedulers like PBS [32] and DQS [9]; adaptive jobs were submitted to our Adaptive Job Scheduler. In these experiments, the minpe of the adaptive jobs is uniformly distributed from 1 to 64 and the maxpe is set to 64. For the traditional jobs, the required number of processors is uniformly distributed between 1 and 64. The results (Figures 2(a) and 2(b)) clearly demonstrate the superiority of the Adaptive Job Scheduler: at a load factor of 1.0, we effectively get 33% more compute power from the same parallel machine, as utilization rises from 75% to 100%. At a more realistic load factor of 0.6, the percentage benefit is even higher.

(a) Mean Response Time on the Linux cluster

(b) System Utilization on the Linux cluster

Figure 2: Performance Curves on Linux cluster
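The sketch below is a simplified reconstruction of an equipartitioning allocation of the kind the prototype uses: divide the machine evenly among jobs, clamp each share to the job's [minpe, maxpe] range, and hand leftovers to jobs that can still grow. The real strategy in [24] differs in details (e.g. suspension and checkpointing decisions); this only conveys the idea.

    #include <algorithm>
    #include <vector>

    struct Job { int minpe, maxpe; };

    // Returns the processors allocated to each job (0 = must wait for now).
    std::vector<int> equipartition(const std::vector<Job>& jobs, int totalProcs) {
      int n = (int)jobs.size();
      std::vector<int> alloc(n, 0);
      if (n == 0) return alloc;
      int share = totalProcs / n;
      int free_ = totalProcs;
      // First pass: give each job its fair share, clamped to [minpe, maxpe].
      for (int i = 0; i < n; ++i) {
        int a = std::min(jobs[i].maxpe, std::max(jobs[i].minpe, share));
        if (a <= free_) { alloc[i] = a; free_ -= a; } // else the job waits
      }
      // Second pass: expand running jobs into any leftover processors.
      for (int i = 0; i < n && free_ > 0; ++i) {
        if (alloc[i] == 0) continue;
        int extra = std::min(free_, jobs[i].maxpe - alloc[i]);
        alloc[i] += extra;
        free_ -= extra;
      }
      return alloc;
    }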

3.3.2 Proposed work: Intelligent Scheduler for Adaptive Jobs to Maximize Profit

The prototype scheduler has already shown improved processor utilization for adaptive jobs in the context of simple QoS contracts. The planned research is aimed at strategies for maximizing utility (profit) in the context of complex QoS contracts.

First, we plan to improve the scheduler while staying within its simpler assumptions. Alternatives to the equipartitioning strategy will be investigated, such as proportionate allocation of excess processors. Concurrently, we will investigate strategies that, after the number of processors has been decided, allocate specific processors to jobs based on the communication topology and existing assignments, so as to minimize communication overhead and migration overhead, respectively. The communication topology needs to be considered because shrunk or expanded jobs continue to have communication locality, and a nearby set of processors needs to be assigned to the new job. Jobs may also have to be checkpointed and restarted at a later point in time, possibly on another parallel server with a different architecture; machine-independent formats will be used in the new checkpointing implementations for this purpose.

We also plan to develop schedulers that take into account the more general utility metric (profit) and the hard real-time deadlines specified in the QoS contract. We associate a payoff function with each job, and the schedulers then aim at maximizing profit rather than utilization. Incidentally, making profit the single objective also merges and rationalizes two often contradictory objectives: maximizing utilization while minimizing response times.
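One plausible shape for such a payoff function, assumed here purely for illustration, pays in full before the soft deadline, depreciates linearly between the soft and hard deadlines, and turns into a penalty after the hard deadline. The actual functional forms are themselves a subject of the proposed research.

    // Payoff as a function of completion time (hours from submission).
    double payoff(double completionHours, double softDeadline,
                  double hardDeadline, double maxPayoff, double penalty) {
      if (completionHours <= softDeadline) return maxPayoff; // on time: full pay
      if (completionHours >= hardDeadline) return -penalty;  // missed: penalty
      // Linear depreciation between the two deadlines.
      double remaining = (hardDeadline - completionHours) /
                         (hardDeadline - softDeadline);
      return maxPayoff * remaining;
    }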

4 Adaptive Runtime Strategies for Multi-cluster Jobs

2.5 pages

Figure 3: Latency tolerance via processor virtualization.

When more than one parallel machine is used to run a single parallel application, several new impediments to efficient performance arise. One of the major problems associated with such jobs is that the latency for messages that travel between two different clusters is substantially higher than that between processors within a cluster. The bandwidth between clusters is another issue, though perhaps less critical for many applications in the context of teragrid-like facilities with multi-GB/s bandwidth between clusters. Asynchrony between the schedulers of different clusters presents another challenge: while one cluster is running its half of an application, and waiting for a collective operation, the other cluster may have suspended the counterpart of this job. Finally, load imbalances among clusters may leave all processors of one cluster waiting for the other. We propose to develop new classes of resource allocation strategies, many of them based on the processor-virtualization idea embodied in AMPI (Section 3.1), to overcome these impediments.

With processor virtualization, a computation is decomposed into a large number of virtual processors (VPs), which are mapped to the physical processors under the control of the RTS. This provides two key benefits to parallel jobs running on multiple clusters. First, there are multiple VPs assigned to each processor, so the program can tolerate latencies in an adaptive manner: VPs that have their messages available can use the processors while other VPs are waiting for data (say, from the remote cluster). Second, the runtime system can migrate objects (VPs), which creates opportunities for strategies to fine-tune performance at runtime by migrating objects within and across clusters.

To illustrate latency tolerance, consider a timeline plot (Figure 3) of the molecular dynamics application running on 512 processors (only 4 of the processors are shown). Each red rectangle denotes integration for the atoms in a single cell, whereas each blue one denotes force calculations for a pair of nearby cells. Note that even on 512 processors, there are many force-computation objects on each processor. In a multi-cluster run, some of these may depend on cells on remote clusters, while others depend on cells from the local cluster. With the adaptive processor-level scheduler, the RTS will automatically schedule the objects with local clients (which may be ready sooner) before those with remote clients, thus effecting latency tolerance.

AMPI/Charm++ load balancers work by assigning VPs to physical processors, and changing this assignment at runtime. We propose to develop several strategies that use this capability to migrate VPs in a multi-cluster context.

A-priori mapping strategies: Based on the QoS characterization of a job, and the known capabilities of the machines, the RTS must distribute work (i.e. VPs) to the clusters equitably at the start of a simulation. Since we want to rely on measurement-based strategies (see below) for optimum performance, we will develop simple geometric or graph-based initial partitioners, heuristically weighted by the machine and job characteristics.

Strategies to increase latency tolerance: Consider a set of VPs representing a physical simulation (such as an unstructured mesh), and the graph of VPs induced by the "communicates-with" relation. If the job is distributed over two equal-power clusters, a likely distribution of VPs will leave the VPs that communicate across clusters (the boundary VPs) on a small fraction of "boundary" processors on each cluster. These boundary processors will thus have a preponderance of boundary VPs, with few VPs (or none) that communicate solely with local-cluster VPs, so the latency tolerance effect described above is not very strong. A useful strategy will be to sprinkle (distribute) the boundary VPs among all the processors of the local cluster, so that each processor with a boundary VP has a large amount of locally-dependent work that can be carried out while boundary VPs await messages from remote processors. Of course, this creates a trade-off: distributing such VPs widely will increase communication within each cluster. We propose to research object migration strategies that use runtime instrumentation and measurements of performance to remap objects at a near-optimal point with respect to this trade-off. We will explore both centralized strategies (which consider the entire VP graph, and can migrate all VPs if needed) and incremental/distributed strategies (which migrate a few VPs at a time, using local and neighborhood information only). Additional prioritization strategies will raise the priority of computations that lead to cross-cluster messages.

Load balancing strategies across clusters: The above strategies only move work within a cluster. Runtime instrumentation can also give the RTS enough information to effect across-cluster load balancing. We will develop strategies for migrating objects across clusters that take into account differential speeds of processors, strengths of the internal communication networks of each cluster, and heterogeneity in data representations (e.g. integer or FP representations). We will also explore functional decomposition strategies: e.g. in a simulation of solid rockets, the structural dynamics of the solid fuel can be simulated by one cluster, while the fluid flow of the burning gas is simulated by another. Although such a decomposition may not always be optimal (e.g. solid-fluid communication may be larger than fluid-fluid or solid-solid communication), it may be useful in some cases, especially when cluster capabilities are well matched to different sets of modules.

Strategies for identifying tightly coupled cliques: In some computations, a set of VPs engage in tightly coupled, intense communication. E.g. in a large NAMD simulation, we may have 30,000 VPs for electrostatic force computation, and 400 VPs executing the particle-mesh Ewald algorithm (which involves 3D FFTs). The RTS should be able to identify these 400 VPs as a clique, and map them all onto a single cluster. This requires a runtime analysis of the VP graph.

Cost-benefit analysis: Sometimes, in spite of optimal selection of strategies, the performance of a multi-cluster run may be worse than if only one of the clusters were used. We believe that it is possible to evaluate this at runtime based on analysis of non-overlapped wait time and other measurable quantities. We will use such analysis to decide to withdraw the job from one of the clusters, if necessary.

New multi-cluster collective communication strategies: We have recently developed efficient strategies for collective operations such as all-to-all multicasts [21] and personalized messages [22]. We plan to extend this work to a multi-cluster context.
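As a minimal sketch of the sprinkling idea above: given each VP's current processor and a flag marking boundary VPs (those with cross-cluster edges in the VP graph), spread the boundary VPs round-robin over the local cluster's processors, so that every processor retains local work to overlap with cross-cluster latency. A real strategy would also weigh the added intra-cluster communication; this deliberately ignores that trade-off.

    #include <vector>

    // Spread boundary VPs round-robin across the local cluster's processors.
    // vpToProc[v] is VP v's assigned processor; isBoundary[v] marks VPs that
    // communicate with a remote cluster.
    void sprinkleBoundaryVPs(std::vector<int>& vpToProc,
                             const std::vector<bool>& isBoundary,
                             int localProcs) {
      int next = 0;
      for (int v = 0; v < (int)vpToProc.size(); ++v) {
        if (isBoundary[v]) {
          vpToProc[v] = next;               // migrate this boundary VP
          next = (next + 1) % localProcs;   // advance round-robin cursor
        }
      }
    }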

5 Faucets: Infrastructure

1.5 pages: Job management and Appspector (1); I/O issues (1)

5.1 Security

(TBD GREG) Clearly, a security framework is needed. We plan to use existing and evolving security standards and software for this purpose, since there is an extensive effort devoted to this in the research community. Specifically, we will use the Globus Security Infrastructure (GSI) [11, 5]. Two major security issues need to be handled by the Faucets framework: authenticating clients, and maintaining the privacy of client information.

To authenticate clients, they will be given a certificate. Our demonstration system will accept all Globus certificates; eventual deployment will require authentication by a billing service. Currently, supercomputing centers let us submit jobs only for users who have accounts on their systems. For this context, our system will use the Globus authentication model: a user's job will run only on one of the machines on which they have an account. The selection of a particular machine will still be up to the Faucets system. For demonstration purposes, the subgrid consisting of clusters operated by us will allow all users to submit jobs.

Client information privacy can be ensured by encryption of the messages exchanged between the client and the Faucets system. This could be another source of overhead, so levels of security will be provided for clients to choose from. Also, this is an active area of research, and we hope to benefit from ongoing research and standards rather than developing our own techniques.

5.1.1 Authorization Server

5.2 File transfer

In our prototype, we have implemented a simple file upload and download facility using Java stream socket I/O. In the planned system, file transfer can use the Globus GASS [1] and GSI ftp frameworks to move files from the user's desktop to the supercomputing centers.

We will develop a job monitoring system that will keep track of submitted jobs. It will be a single point of access to a user's job, even as the job moves between supercomputers. It will detect server failures and take appropriate action to recover. During the job's execution, this system will facilitate real-time visualization of, and interaction with, the job. On job completion, it will take several actions, including informing the accounting system about the usage, informing the user about completion, and conducting file transfers as needed.

5.3 Input and Output support

Active-buffering-based support for efficient data movement (QoS for I/O is covered in the QoS section); file upload and download optimizations by adaptive buffering.

6 Market Economy of Compute Power

3 pages

6.1 Bidding Framework

The computational grid will consist of thousands of clusters, some managed by an Adaptive Job Scheduler (see Section 3.3), while others are managed by traditional queuing systems. Potentially, tens of millions of jobs, each with a QoS contract, will be submitted to the grid per day. Matching jobs to available clusters efficiently (from the point of view of the client as well as the clusters), and in a scalable fashion, is a challenge. The current status of individual clusters must be monitored by "the grid". As job requests stream into the grid, a cluster (or clusters) must be selected for each job. Since the client wants to minimize its cost, and there are multiple competing clusters, a bidding system is necessary. In addition, many jobs may have requirements that are not met by all clusters. We will conduct research on a scalable (and therefore distributed) screening and bidding process. To experiment with these notions, we have already built a very simple prototype system, with a simple place-holder bidding process and simple QoS contracts. We describe it next in order to motivate the research issues.

6.2 Description of the Prototype System

Figure 4(a) shows a simplified overview of our prototype Faucets system. All components have been implemented; visit http://charm.cs.uiuc.edu to download.

(a) System Components

(b) A Candidate Distributed Bidding System

Figure 4: Current System Components.


The Central Faucets Server maintains a directory of available clusters and some information about each one, such as the number of processors, the available memory, and so on. A Faucets Daemon (FD) runs on each cluster as a server listening on a well-known port. The scheduler running on the supercomputer can be a traditional or an adaptive queuing system. The user interacts with the system using a web browser or a stand-alone client. To submit a job, the client connects to the Central Faucets Server with the QoS contract and requests a list of matching clusters. The prototype Faucets system uses a very simple QoS contract: the user only specifies the range of processors acceptable for a job. The client then connects to each FD and requests a bid for the desired job. On each cluster, after some negotiation between the FD and the scheduler, the FD accepts or declines the job; the bid from a cluster is simply a yes/no decision. The client selects the first cluster that can accept the job. Files are uploaded, and the FD takes over the job, which it will start on the supercomputer and manage via a network connection. When accepting a job, the FD gives the client a URL that can be used to monitor the job. Visiting the URL invokes an applet that communicates with the FD and allows the user to view the job output and graphical performance data while the job is running. The user can also download and delete the output files generated by the job.

6.3 Proposed Work: QoS Contracts, Auction and Bidding Algorithms

The proposed Faucets system will use more complex QoS contracts (see Section 2). This will require more advanced screening to narrow down the set of clusters (or multi-clusters) based on their capabilities. The current system is not designed to scale to the large number of clients and clusters envisioned (e.g. each cluster is contacted for a bid sequentially), so scalability will need to be addressed. The bids will be in the form of a proposed charge and a promised completion time. The charge can be monetary, a variant of the Service Units (SUs) currently used at the national supercomputer centers, or a bartering credit/debit as explained below. The client will need to choose between competing bids.

6.3.1 Bid Generation

When a cluster or its agent receives a job's QoS contract, it must submit a bid in response, or decline. If it bids low, its profit suffers; if it bids high, its chances of being selected are lowered. Although this appears similar to a standard bidding problem, there is a twist here: the computer time that the clusters are selling is a highly perishable commodity. Bidding for a utility such as electric power comes close to this situation. If the grid has a higher capacity than needed, there is a danger that a cluster may starve instead of getting its share of work. We plan to study the existing literature on auction theory and bidding strategies [6, 7] and evaluate its applicability here. These strategies, and new algorithms if needed, will be implemented to enhance the quality of the bidding process, so that a cluster generates a close-to-optimal bid.
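As a baseline against which auction-theoretic strategies can later be compared, the sketch below captures the perishable-commodity intuition: price rises with how full the cluster's near-term schedule already is, and falls toward marginal cost when capacity would otherwise go idle. The state fields and the quadratic markup are our assumptions, not a proposed algorithm.

    #include <optional>

    struct ClusterState {
      double utilizationForecast; // 0..1 over the job's requested window
      double marginalCostPerSU;   // cost of providing one service unit
      double capacitySU;          // SUs still available in the window
    };

    // Returns a price per SU, or nothing to decline the job.
    std::optional<double> generateBid(const ClusterState& c, double jobSU) {
      if (jobSU > c.capacitySU) return std::nullopt; // cannot fit: decline
      // A busy schedule commands a markup; idle time is sold near cost,
      // since unsold compute time perishes.
      double u = c.utilizationForecast;
      return c.marginalCostPerSU * (1.0 + 2.0 * u * u);
    }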

6.3.2 Scalable Matching and Bid Evaluation

We plan to develop schemes that will allow "the grid" to keep track of each cluster's status, and allow jobs to be matched to a lowest-bid cluster. A candidate scheme is shown in Figure 4(b). A request for a new job (with its QoS contract) is first seen by one of the "screening servers", which have precomputed information about the capabilities of different cluster groups. The screeners identify a subset of clusters that are eligible to run the job, and send the request on to one or more Agent Servers. These servers house objects (e.g. threads, or subroutines with state data), each of which acts as an agent for a single cluster. The agents communicate with the clusters to keep track of their future availability and capacity. Agents may also keep track of "recently awarded contracts" from a local database on the Agent Server. Using these, and their own algorithms, they decide on a bid (or decline) for each job. The bids are then collected and passed on to the client representative on one of the bid evaluation servers. It selects one of the clusters, executes a two-phase commit algorithm with the cluster to ensure it is still available, and gets its commitment (trying the second-best cluster otherwise, and so on). At this point, the client and cluster have been paired, and the bidding system is disengaged from the job. This is only a candidate architecture; one can immediately think of optimizations to this structure (e.g. some of the bid evaluation can be done on the bid generators themselves, reducing the bandwidth between them and the evaluators, and these functions can be compressed horizontally or vertically onto fewer servers in many ways).
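The client-side selection step might then look like the following sketch: sort the collected bids by price and run the commit round against the best cluster, falling through to the next-best if a cluster has since become unavailable. The tryCommit callback stands in for the two-phase commit exchange with the cluster's agent; everything here is an assumed shape for the candidate architecture, not implemented code.

    #include <algorithm>
    #include <functional>
    #include <optional>
    #include <vector>

    struct Bid { int clusterId; double price; double promisedHours; };

    // Select the cheapest bid whose cluster still commits; tryCommit
    // performs the two-phase commit with the cluster and reports success.
    std::optional<Bid> selectAndCommit(std::vector<Bid> bids,
                                       const std::function<bool(int)>& tryCommit) {
      std::sort(bids.begin(), bids.end(),
                [](const Bid& a, const Bid& b) { return a.price < b.price; });
      for (const Bid& b : bids)
        if (tryCommit(b.clusterId)) // cluster confirms it is still available
          return b;                 // client and cluster are now paired
      return std::nullopt;          // all bidders withdrew; resubmit the job
    }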

6.3.3 Market Mechanisms

At the same time, other substantially different schemes are possible; these will be developed and evaluated. For example, one may employ an auction structure, where each job announces an initial price, which is broadcast to everyone. Cluster agents may then bid, and reduce their bids in response to further announcements, until the lowest bid is reached. This scheme incurs larger overhead, and potentially a larger delay between job submission and contracting, but may reach more "fair" contracts. Another possible model is the commodities model. In this model, goods of the same type brought to market by various suppliers are regarded as interchangeable, a market price is publicly agreed upon for each commodity regarded as a whole, and all buyers and sellers decide whether (and how much) to buy and sell at this price [34].

Development of the centralized server: Since the distributed bidding system is not needed immediately, we will concurrently develop the current centralized scheduler further. The issues here include reliability (what happens when the central server goes down?) and performance (achieving the best possible scaling without using a distributed-servers design).

6.4 Proposed Work: Evaluation of Strategies via Simulation

As outlined above, several algorithms and schemes make up a functioning grid: bidding schemes, bid generation algorithms for agents, server architecture, job scheduler algorithms, and so on. To evaluate them in a realistic context, we will build a simulation framework, with models for the various components. Using this, performance studies can be conducted even before a large-scale grid is operational. A simulation system also allows one to experiment with newer strategies (for bid generation, say) before "live trials". Such a simulation system will generate two kinds of data. First, it will identify which strategies work better: for example, with two identical clusters running bidding strategies A and B, we can compare which gets more of the contracts and/or makes more profit. Second, simulations can produce timing data that will identify performance bottlenecks in distributed bidding schemes.
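The core of such a framework could be an ordinary discrete-event driver, sketched below under our own assumed design: a time-ordered event queue over which job arrivals, bid rounds, and completions are interleaved, with the strategies under study hanging off the event handlers.

    #include <functional>
    #include <queue>
    #include <vector>

    struct Event {
      double time;
      std::function<void()> fire; // arrival, bid-round, or completion handler
      bool operator>(const Event& o) const { return time > o.time; }
    };

    // Minimal discrete-event driver: handlers may schedule further events.
    class Simulator {
      std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue;
      double now = 0.0;
    public:
      void at(double t, std::function<void()> f) { queue.push({t, std::move(f)}); }
      double time() const { return now; }
      void run() {
        while (!queue.empty()) {
          Event e = queue.top();
          queue.pop();
          now = e.time;
          e.fire(); // e.g. generate bids, award a contract, free processors
        }
      }
    };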

6.5 Multicluster Bidding

Multi-cluster jobs enrich the bidding problem with one more dimension. Letting jobs run on multiple clusters makes large numbers of processors more readily available to the user. In such a scenario, the Faucets system will return bids from clusters that have the required number of processors to run the job, as well as from groups of clusters that can run the job together.

The proposed research will investigate bidding mechanisms for obtaining a cumulative bid from the bids returned by the individual clusters of a group that has agreed to run the job together. It may be advantageous to split the job among geographically nearby clusters; the maximum latency tolerated by the job would determine how far apart the clusters can be in this case. There is also a limit on the number of clusters that can together run a job. Moreover, it may be combinatorially hard to find a group of clusters of size more than 3 or 4 from among thousands of clusters; a bounded search of the kind sketched below is one way to keep this tractable.

In the above scenario, it is also likely that groups of clusters will form federations and compete with larger machines. Such federations will probably return smaller (cheaper) bids. The proposed research will also study bidding strategies in the presence of federations.
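The bounded search mentioned above could, for instance, consider single clusters and pairs (extendable to triples), prune pairs whose inter-cluster latency exceeds the job's bound, and price a group bid as the sum of the member bids plus an assumed coordination surcharge, as in this sketch. The surcharge model and all names are illustrative.

    #include <algorithm>
    #include <limits>
    #include <vector>

    struct ClusterBid { double price; int procs; };

    // Cheapest single cluster or qualifying pair covering procsNeeded.
    // latencyMs[i][j] is the measured latency between clusters i and j.
    double bestGroupBid(const std::vector<ClusterBid>& c,
                        const std::vector<std::vector<double>>& latencyMs,
                        int procsNeeded, double maxLatencyMs, double surcharge) {
      double best = std::numeric_limits<double>::infinity();
      for (size_t i = 0; i < c.size(); ++i) {
        if (c[i].procs >= procsNeeded)
          best = std::min(best, c[i].price);
        for (size_t j = i + 1; j < c.size(); ++j)
          if (c[i].procs + c[j].procs >= procsNeeded &&
              latencyMs[i][j] <= maxLatencyMs)
            best = std::min(best, c[i].price + c[j].price + surcharge);
      }
      return best; // infinity means no single cluster or pair qualifies
    }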

6.6 Cluster Bartering

6.6.1 Accounting

7 Schedule and Milestones

0.5 pages

Year  Milestone
1     Modify all Charm++ load balancers to use the usage vector
1     Create a testbed Faucets Server system with complete functionality
1     QoS: Develop performance characterization of several applications
1     Develop a preliminary QoS contract
1     AJS: Adaptive Job Scheduler to take deadlines into account
1     Study of existing bid generation and auction algorithms
1     AMPI: Scalable set-associative thread migrations
1     AJS: Preliminary support for real-time jobs: shrink/expand
2     AJS: Use QoS data on efficiency variations
2     AJS: Incorporate profit metric in scheduling
2     AJS: Implement competing scheduling strategies
2     Build grid simulation model
2     Support for applications spanning multiple supercomputers
2     Automatic checkpoint and restart on different machines
3     Develop an advanced QoS contract, incorporating refinements
3     AJS: Scheduler optimization for real-time jobs
3     Performance evaluation of scheduling strategies
3     Performance evaluation of bid generation strategies
3     Deploy and test the Faucets system on the largest available grid

8 Broad Impact

0.5 pages

The impact of the proposed research can be divided into three parts. (1) In the short term, the research on improving the utilization of an individual parallel machine (Section 3) will benefit many individual supercomputer centers, as well as the smaller clusters operated by research groups. The ability to run start-and-stop interactive jobs will help scientists doing interactive modeling, as well as parallel application developers debugging their programs on large machine configurations. (2) A simplified version of the Faucets system, with a centralized scheduler and a simple matching of jobs to machines, will also be immediately useful to researchers with accounts on multiple supercomputers. (3) The longer-term impact of the proposed large-scale grid system is speculative (high risk) to some extent. The grid-based compute-power economy is clearly one of the possible and feasible futures, in spite of obstacles such as security, trust, and portability. The proposed research may help usher in such an economy.

The PI has a strong record of distributing research software widely via the Internet. Versions of the Adaptive Job Scheduler and the Faucets job submission and monitoring system will be distributed via the web. The project will contribute directly to the training of three graduate students. The knowledge and skills they develop for this project will be directly useful in their careers in the Internet-driven industry. The software will also be useful in classroom teaching, especially in classes involving parallel computing and computational science and engineering taught by the PI and others. The PI has consistently used and trained undergraduate students in his laboratory; this project will continue that tradition via REU supplements.

9 Results From Prior Support

1 page

Report on one recent grant: Title: Advanced Computational Approaches to Biomolecular Modelling and Structure Determination (Grand Challenge Application Group), DBI 93-18159. Award amount: $3,513,265 (shared by 3 PIs at UIUC as the principal grantee, and Yale, Duke, UNC, and NYU as subgrantees; UIUC part: $1,500,000). Period covered: 9/1/1993-2/28/2001.

As a part of this research, we developed NAMD, a production-quality parallel molecular dynamics program that uses unique parallelization techniques to achieve unsurpassed performance. NAMD is implemented in C++, using the Charm++/Converse system developed by the PI.

Figure 5: The speedup of NAMD (speedup vs. number of processors, up to 2500) for a cutoff simulation of BC1 in a water environment (206,617 atoms, 12 Å cutoff), run on the Sandia National Laboratories ASCI Red supercomputer.

The exceptional parallel efficiency of NAMD results from its hybrid spatial-force work decomposition technique. Although this parallelization technique results in a large number of independent pieces of work, the pieces vary greatly in how much work they represent. Therefore, NAMD depends on a sophisticated load balancing algorithm combining load measurement with object migration. Objects are timed during the first few steps of the simulation, and these times are used to relocate objects such that the load is balanced and communication is minimized. This algorithm uses the Converse load balancing framework to monitor program performance. Using this load balancing system, NAMD is able to achieve speedups of over 1250 on 2048 processors.

References resulting from this work: [28], [18], [19], [4], [31], [23], [3].


References

[1] J. Bester, I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke. GASS: A data movement and access service for wide area computing systems. In Sixth Workshop on I/O in Parallel and Distributed Systems, 1999.

[2] Milind Bhandarkar, L. V. Kale, Eric de Sturler, and Jay Hoeflinger. Object-based adaptive load balancing for MPI programs. In Proceedings of the International Conference on Computational Science, San Francisco, CA, LNCS 2074, pages 108–117, May 2001.

[3] R. Brunner, J. Phillips, and L. V. Kalé. Scalable molecular dynamics for large biomolecular systems. In Proceedings of SuperComputing 2000, 2000.

[4] Robert Brunner, Laxmikant Kalé, and James Phillips. Flexibility and interoperability in a parallel molecular dynamics code. In Object Oriented Methods for Inter-operable Scientific and Engineering Computing, pages 80–89. SIAM, October 1998.

[5] R. Butler, D. Engert, I. Foster, C. Kesselman, S. Tuecke, J. Volmer, and V. Welch. A national-scale authentication infrastructure. IEEE Computer, volume 33, 2000.

[6] Bernard Caillaud and Jacques Robert. Implementing the optimal auction. In Maryland Auction Conference, 1998.

[7] Peter Cramton, Alfred E. Kahn, Robert H. Porter, and Richard D. Tabors. Uniform pricing or pay-as-bid pricing: A dilemma for California and beyond. Electricity Journal, pages 70–79, July 2001.

[8] Eric de Sturler, Jay Hoeflinger, L. V. Kale, and Milind Bhandarkar. A new approach to software integration frameworks for multi-physics simulation codes. In Proceedings of the IFIP TC2/WG2.5 Working Conference on Architecture of Scientific Software, Ottawa, Canada, pages 87–104, October 2000.

[9] D. W. Duke, T. P. Green, and J. L. Pasko. Research toward a heterogeneous networked computing cluster: The distributed queueing system version 3.0. Technical report, Florida State University, May 1994.

[10] Dror G. Feitelson and Larry Rudolph. Toward convergence in job schedulers for parallel supercomputers. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 1–26. Springer-Verlag, 1996.

[11] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke. A security architecture for computational grids. In 5th ACM Conference on Computer and Communications Security, 1998.

[12] P. H. Geubelle and W. G. Knauss. Crack propagation at and near bimaterial interfaces: linear analysis. ASME J. Appl. Mech., 61:560–566, 1994.

[13] Mary W. Hall and Margaret Martonosi. Adaptive parallelism in compiler-parallelized code. Concurrency: Practice and Experience, 10(14):1235–1250, 1998.

[14] Chao Huang, Orion Lawlor, and L. V. Kalé. Adaptive MPI. In The 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 03), College Station, Texas, October 2003.

[15] S. Ioannidis, U. Rencuzogullari, R. Stets, and S. Dwarkadas. CRAUL: Compiler and runtime integration for adaptation under load. Journal of Scientific Programming, August 1999. Invited paper.

[16] Jansen and Porkolab. Linear-time approximation schemes for scheduling malleable parallel tasks. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1999.

[17] John E. Stone, Justin Gullingsrud, Paul Grayson, and Klaus Schulten. A system for interactive molecular dynamics simulation. In 2001 ACM Symposium on Interactive 3D Graphics, pages 191–194, 2001.

[18] L. V. Kalé, Milind Bhandarkar, and Robert Brunner. Load balancing in parallel molecular dynamics. In Fifth International Symposium on Solving Irregularly Structured Problems in Parallel, volume 1457 of Lecture Notes in Computer Science, pages 251–??, 1998.

[19] L. V. Kalé, Milind Bhandarkar, Robert Brunner, N. Krawetz, J. Phillips, and A. Shinozaki. NAMD: A case study in multilingual parallel programming. In Proc. 10th International Workshop on Languages and Compilers for Parallel Computing, Minneapolis, Minnesota, August 1997.

[20] L. V. Kale and Sanjeev Krishnan. Charm++: Parallel programming with message-driven objects. In Gregory V. Wilson and Paul Lu, editors, Parallel Programming using C++, pages 175–213. MIT Press, 1996.

[21] L. V. Kale and Sameer Kumar. Scaling collective multicast on high performance clusters. Technical Report 03-04, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 2003.

[22] L. V. Kale, Sameer Kumar, and Krishnan Vardarajan. A framework for collective personalized communication. Technical Report 02-10, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 2002. Communicated to IPDPS 2003.

[23] Laxmikant Kalé, Robert Skeel, Milind Bhandarkar, Robert Brunner, Attila Gursoy, Neal Krawetz, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan, and Klaus Schulten. NAMD2: Greater scalability for parallel molecular dynamics. Journal of Computational Physics, 151:283–312, 1999.

[24] Laxmikant V. Kalé, Sameer Kumar, and Jayant DeSouza. A malleable-job system for timeshared parallel machines. In 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 2002.

[25] Wan Y. Lee, Sung J. Hong, and Jong Kim. On-line scheduling of scalable real-time tasks on multiprocessor systems.

[26] J. E. Moreira and V. K. Naik. Dynamic resource management on distributed systems using reconfigurable applications. IBM Journal of Research and Development, 41(3):303, 1997.

[27] Ahuva W. Mu'alem and Dror G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. on Parallel and Distributed Systems, 12(6), June 2001.

[28] Mark Nelson, William Humphrey, Attila Gursoy, Andrew Dalke, Laxmikant Kalé, Robert D. Skeel, and Klaus Schulten. NAMD: A parallel, object-oriented molecular dynamics program. J. Supercomputing App., 1996.

[29] T. D. Nguyen, R. Vaswani, and J. Zahorjan. Using runtime measured workload characteristics in parallel processor scheduling. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science. Springer-Verlag, 1996.

[30] J. Padhye and L. Dowdy. Preemptive versus non-preemptive processor allocation policies for message passing parallel computers: An empirical comparison. In Proceedings of the 2nd Workshop on Job Scheduling Strategies for Parallel Processing, April 1996.

[31] James C. Phillips, Robert Brunner, Aritomo Shinozaki, Milind Bhandarkar, Neal Krawetz, Laxmikant Kalé, Robert D. Skeel, and Klaus Schulten. Avoiding algorithmic obfuscation in a message-driven parallel MD code. In P. Deuflhard, J. Hermans, B. Leimkuhler, A. Mark, S. Reich, and R. D. Skeel, editors, Computational Molecular Dynamics: Challenges, Methods, Ideas, volume 4 of Lecture Notes in Computational Science and Engineering, pages 472–482. Springer-Verlag, November 1998.

[32] Portable Batch System. http://pbs.mrj.com/.

[33] Andrew Tucker and Anoop Gupta. Process control and scheduling issues for multiprogrammed shared-memory multiprocessors. In Proceedings of the 12th ACM SIGOPS Symposium on Operating Systems Principles, December 1989.

[34] Rich Wolski, James S. Plank, John Brevik, and Todd Bryan. Analyzing market-based resource allocation strategies for the computational grid. The International Journal of High Performance Computing Applications, 15(3):258–281, Fall 2001.
