Adaptive Data-Aware Utility-Based Scheduling in Resource-Constrained Systems

David Vengerov, Lykomidis Mastroleon, Declan Murphy, and Nick Bambos

SMLI TR-2007-164

April 2007

Abstract: This paper addresses the problem of dynamic scheduling of data-intensive multiprocessor jobs. Each job requires some number of CPUs and some amount of data that needs to be downloaded into a local storage space before starting the job. The completion of each job brings some benefit (utility) to the system, and the goal is to find the optimal scheduling policy that maximizes the average utility per unit of time obtained from all completed jobs. A co-evolutionary solution methodology is proposed, where the utility-based policies for managing local storage and for scheduling jobs onto the available CPUs mutually affect each other’s environments, with both policies being adaptively tuned using the Reinforcement Learning methodology. Our simulation results demonstrate the feasibility of this approach and show that it performs better than the best heuristic scheduling policy we could find for this domain.

Sun Labs 16 Network Circle Menlo Park, CA 94025

email addresses: [email protected] [email protected] [email protected] [email protected]

© 2007 Sun Microsystems, Inc. All rights reserved. The SML Technical Report Series is published by Sun Microsystems Laboratories, of Sun Microsystems, Inc. Printed in U.S.A. Unlimited copying without fee is permitted provided that the copies are not made nor distributed for direct commercial advantage, and credit to the source is given. Otherwise, no part of this work covered by copyright hereon may be reproduced in any form or by any means graphic, electronic, or mechanical, including photocopying, recording, taping, or storage in an information retrieval system, without the prior written permission of the copyright owner. TRADEMARKS Sun, Sun Microsystems, the Sun logo, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. UNIX is a registered trademark in the United States and other countries, exclusively licensed through X/Open Company, Ltd. For information regarding the SML Technical Report Series, contact Jeanie Treichel, Editor-in-Chief. All technical reports are available online on our website, http://research.sun.com/techrep/.

Adaptive Data-Aware Utility-Based Scheduling in Resource-Constrained Systems ⋆

David Vengerov a,∗, Lykomidis Mastroleon b, Declan Murphy a, Nick Bambos b

a Sun Microsystems Laboratories, Menlo Park, CA 94025, USA
b Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, USA

⋆ This material is based upon work supported by DARPA under Contract No. NBCH3039002.
∗ Corresponding author. Email addresses: [email protected] (David Vengerov), [email protected] (Lykomidis Mastroleon), [email protected] (Declan Murphy), [email protected] (Nick Bambos).

1 Introduction

Scheduling in High Performance Computing (HPC) systems is becoming an increasingly important and difficult task. An HPC system can have as many as $10^5$ multi-threaded processors and can cost many millions of dollars [8]. Correspondingly, it is desirable to operate such systems as efficiently as possible. Most HPC jobs need to access some data stored on local disk or remote storage, and some jobs need to access very large amounts of data, especially in applications such as high-energy physics, natural language processing, astronomy, and bioinformatics. As noted in [6], the amount of data processed by scientific applications has been increasing exponentially since 1990, at an even faster rate than predicted by Moore's law. Thus, efficient data-aware scheduling will become a critical issue in scientific computing in the very near future. If the scheduling system does not explicitly account for the amount of data each job needs to access, then some jobs will occupy the CPU resources for much longer than necessary, since they will spend significant amounts of time

idling and waiting for the data to be read from remote storage. Thread-level schedulers (in UNIX, Solaris, etc.) solve this problem by placing multiple threads on a single CPU, so that while some threads are waiting for their data to come in, the CPU can work on the other threads. However, job-level schedulers in HPC systems usually have strict security requirements, where jobs as a rule do not share the CPU resources.

Despite the potentially large losses in system productivity associated with ignoring job data requirements, the problem described above has received little attention so far. Some of the works addressing this problem include [1,2,5,6]. A simple data-aware scheduling algorithm is proposed in [5], where each job is scheduled onto a node that has the fewest jobs running on it and that has enough local storage space to fit that job. Four different storage management algorithms are proposed in [6]: first-fit (using the first-come, first-served order, a job's data is downloaded into the local storage if it can fit there and the job is skipped otherwise), best-fit (jobs are selected for downloading their data so as to maximize the utilization of the available storage space), and smallest/largest fit (jobs are selected for downloading their data starting from the one with the smallest/largest data requirement, assuming that their data can fit into the local storage). However, no experimental evaluation of these techniques is provided in [6]. A mathematical programming formulation is presented in [2] for the problem of placing jobs and data files on multiple servers so that jobs running on each server would have all the data files they need stored locally on that server. However, it considers only the costs of copying the files into a local storage, without addressing the space constraints for storing the files locally. A distributed economy-based data replication algorithm is proposed in [1], where each local storage attempts to store the files that are most likely to be accessed in the near future by extrapolating the access frequency observed during the recent past. While this approach is similar in spirit to ours in that it attempts to optimize the future performance of the system, its data scheduling decisions are not made in real time, since each new file needs to be accessed a number of times from the remote storage before it is deemed popular enough to be replicated into the local storage.

This paper addresses the real-time data-aware scheduling problem using the utility-based optimization framework, where the completion of each job is assumed to bring some benefit to the system (a decreasing function of the job response time) and the goal is to maximize the average benefit obtained per unit of time from all completed jobs. This formulation was initially proposed by Jensen in [4] and is a generalization of standard deadline-based scheduling, where the benefit received from each job is 1 if the job is completed before its deadline and 0 otherwise. An alternative objective is that of minimizing the average cost incurred by the system per unit of time, where the per-job cost is an increasing function of the job response time. For the sake of being specific, we take the utility-maximizing rather than the cost-minimizing view in this paper.

To be consistent with the latest literature on this topic, we will refer to this scheduling paradigm as UA (utility accrual). As suggested in the recent overview of the UA real-time scheduling domain [9], the only algorithm that allows jobs to share (mutually exclusively) a finite amount of a certain resource (e.g., local storage space) is presented in [14]. However, this algorithm is not forward-looking, in the sense that it simply orders the existing jobs according to their expected utility upon completion divided by the time remaining to completion (potential utility density), without accounting for what might happen in the near future (ideally, an algorithm should not schedule a large job that has a low utility upon completion if the probability of a new high-utility job arriving is large enough). We use a preemption-enabled version of this algorithm as a benchmark in our data scheduling experiments (presented in Figure 4), calling it BDP-RUU. The only UA real-time scheduling algorithm we are aware of that explicitly tries to maximize the expected future utility received by the system (by statistically estimating the job arrival probabilities) is presented in [13]. However, this algorithm does not consider any data constraints, and we are not aware of any other utility-maximizing real-time scheduling algorithms that consider multi-unit resource constraints.

As noted in [7], the problem of scheduling jobs in the UA paradigm even on a single CPU with single-unit resource constraints (there are N distinct resources in the system and each job specifies which resources it needs to use) is NP-hard. Thus, a heuristic needs to be developed to solve this problem in the static case of focusing only on the currently arrived jobs (as done in [7,14]), or an adaptive policy-tuning algorithm needs to be used in the dynamic case when jobs arrive stochastically and the scheduler needs to learn their arrival pattern and make appropriate forward-looking scheduling decisions. The complexity of the policy-tuning process increases (usually exponentially) with the number of input variables used by the policy, where the input variables describe the state of the local storage, of the CPU module executing jobs, as well as of the jobs waiting for their data to be downloaded and those waiting to be executed on the CPU module.

Instead of using a centralized learning algorithm (which would be impractically slow given the large number of input variables it needs to consider), we propose a distributed learning approach, where the CPU module and the local storage module are self-managing and self-optimizing. In particular, each module uses the Reinforcement Learning (RL) methodology [10] to adaptively tune the policy for managing its main resource (CPUs or local storage space). The CPU management policy and the corresponding storage management policy mutually affect each other's environments, making the self-optimization process co-evolutionary. To the best of our knowledge, this paper presents the first study demonstrating the feasibility of such an approach in the scheduling domain.

Fig. 1. Problem formulation.

Co-evolutionary RL has already been used successfully in some domains (e.g., [3,12]). However, all co-evolutionary RL frameworks we are aware of have used agents that directly affect the global environment and receive either individual reinforcement signals from it or a single reinforcement signal for all agents. In this paper we are faced with a sequential multi-agent learning task, where the actions of the first agent (the one managing the local storage) affect the state of the second agent (the one managing the CPU resources), and the feedback signal received from the environment (the total utility of jobs executed over some period of time) is affected directly only by the actions of the second agent. In order to resolve the difficulty of the first agent learning without a direct feedback signal, we propose a new idea of letting it use the "value function" V(s) of the second agent in order to define its feedback signal. This process is explained in greater detail in Section 4. To the best of our knowledge, no co-evolutionary feedback learning architectures have been developed by other researchers to address this difficulty.

The rest of the paper is organized as follows. Section 2 formulates the scheduling problem to be solved. Section 3 gives an overview of the solution framework used in this paper. Section 4 explains the specific scheduling algorithms that were developed. Section 5 presents numerical simulations that demonstrate the value of the proposed algorithms and gives some intuition about the observed results. Finally, Section 6 concludes the paper.

2 Problem Formulation

Consider an HPC machine with multiple CPU modules, each of which is capable of running one or more jobs. The machine has a finite local storage capacity, as depicted in Figure 1. Jobs arrive to the machine in a stochastic manner. We also assume the following:

- Each job $i$ requires some ideal number of CPUs $R^i_{max}$ to be executed in the minimum possible time $t^i_{E,min}$.

- The jobs are "moldable," meaning that a job $i$ can be executed with fewer than $R^i_{max}$ CPUs at the cost of a slower execution rate. However, the number of CPUs assigned to a job has to be greater than or equal to some minimum number $R^i_{min}$ that depends on the specific job.
- If $R^i_{alloc}$ CPUs are allocated to job $i$ ($R^i_{alloc} \le R^i_{max}$), then its actual execution time is given by $(t^i_{E,min} \times R^i_{max}) / R^i_{alloc}$. Any other dependence on the number of assigned CPUs could have been used, since our scheduling algorithms do not use this information explicitly.

- Prior to execution, a job $i$ also requires some data to be fetched from a remote location. The amount of data and the connection rate are known, and therefore the minimum download time $t^i_{D,min}$ can be inferred.
- We assume that there is sufficient bandwidth to support all data transfers at the defined rate without interfering with each other.
- The service time of a job $i$ is given by $t^i_S = t^i_D + t^i_E$, where $t^i_D$ is the data-fetching time ($t^i_{D,min}$ plus any waiting time until the downloading process completes) and $t^i_E$ is the execution time (the actual execution time plus any waiting time until the execution completes).
- The amount of local storage capacity needed for data during the fetching time and the execution time is constant.
- The scheduler can preempt jobs that are currently being executed and resume their execution later in the future. The preemption process might take a non-negligible amount of time, which is required for the running job to encapsulate and save its state.
- The scheduler can data-preempt jobs whose data is downloading or has already been downloaded into the local storage system, but which are not currently being executed.
- Upon completion of each job the machine receives some utility, which is a nonincreasing function of the elapsed time from job arrival to completion. In particular, we assume that the Time Utility Function (TUF) of any job is as depicted in Figure 2, where $U^i_{max} = t^i_{E,min} \times R^i_{max}$, $t^i_{S,min} = t^i_{E,min} + t^i_{D,min}$, and $\tau^i = 2 \times t^i_{S,min}$ (at time $\tau^i$ we assume that no further utility can be obtained by executing job $i$, which would be the case for jobs such as regional weather forecasts, predictions for stocks, etc.). The proposed architecture will work with any other forms of TUFs. However, we will use the shape described above as an example (a small illustrative sketch of these quantities is given at the end of this section).
- Our goal is to maximize the average utility per time step received by the computing facility from completed jobs.

Fig. 2. Time Utility Function (TUF) of job $i$ vs. service time $t_S$.

The complexity of this problem is much greater than that of traditional scheduling problems because of the data-fetching requirements. If one chooses to disregard the data-fetching and storage requirements and schedule jobs only based on the CPU requirements, the system's performance can noticeably degrade. For example, if a job whose data is not yet in the local storage gets scheduled for execution, then the CPUs assigned for its execution will remain idle until the data-fetching period completes. Moreover, if the local cache does not have enough free space to accommodate the job's data, then the job will have to wait until other jobs complete their execution and release enough cache space to begin the data-fetching process. So instead of scheduling a job that requires the number of CPUs equal to the available CPUs but whose data has not yet been downloaded, it might be better to schedule a job that requires fewer CPUs but already has its data downloaded. Also, when deciding in which order to download the data of submitted jobs, it is desirable to trade off the benefit per CPU-hour of each job to the CPU module against the data space this job requires in the local storage.
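To make the job model above concrete, here is a minimal Python sketch (ours, not part of the report) of the moldable execution-time rule and the TUF. The `Job` class and all names are illustrative, and the linear decay between $t^i_{S,min}$ and $\tau^i$ is our assumption about the shape shown in Figure 2; the report only fixes the endpoints.

```python
from dataclasses import dataclass

@dataclass
class Job:
    r_max: int          # ideal number of CPUs R^i_max
    r_min: int          # minimum acceptable number of CPUs R^i_min
    t_exec_min: float   # execution time t^i_E,min when given r_max CPUs
    t_down_min: float   # minimum data download time t^i_D,min
    data_pages: int     # local storage pages needed for the job's data

def execution_time(job: Job, r_alloc: int) -> float:
    """Actual execution time of a moldable job on r_alloc CPUs."""
    assert job.r_min <= r_alloc <= job.r_max
    return job.t_exec_min * job.r_max / r_alloc

def tuf(job: Job, t_service: float) -> float:
    """Time Utility Function of Figure 2. The decay between t_S,min and tau is
    assumed linear here for illustration; the report fixes only the endpoints
    U_max = t_E,min * R_max, t_S,min = t_E,min + t_D,min, and tau = 2 * t_S,min."""
    u_max = job.t_exec_min * job.r_max
    t_s_min = job.t_exec_min + job.t_down_min
    tau = 2 * t_s_min
    if t_service <= t_s_min:
        return u_max
    if t_service >= tau:
        return 0.0
    return u_max * (tau - t_service) / (tau - t_s_min)

# Example: a job that ideally needs 12 CPUs for 6 time units but is given only 8 CPUs
job = Job(r_max=12, r_min=1, t_exec_min=6.0, t_down_min=3.0, data_pages=4)
print(execution_time(job, r_alloc=8))   # (6 * 12) / 8 = 9.0 time units
print(tuf(job, t_service=12.0))         # decayed utility: t_S,min = 9, tau = 18
```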

3 The Proposed Architecture

Before describing the proposed scheduling algorithms in detail, it is necessary to define some basic quantities:

Available Jobs - Jobs whose data are in the local storage but which have not yet started their execution.

Remaining Time To Completion (RTTC) - If a job has been assigned some number of CPUs, its RTTC is defined as the expected time required for this job to complete, assuming this job continues to run uninterrupted with the current number of CPUs. If no CPUs have been assigned to the job, then its RTTC can be defined as the sum of its expected execution time (assuming it gets the maximum desired number of CPUs) plus the time required to download any remaining required data into the local storage.

Total Time To Completion (TTTC) - The TTTC of a job is defined as its RTTC plus the time already spent by this job in the system.

Expected Utility (EU) - The EU of a job is defined as the utility that will be received (based on the job's TUF) when the job completes with the number of CPUs it is currently assigned. If no CPUs are assigned to the job, then we assume the job's total time in the system will be TTTC as defined above.

Expected Unit Utility (EUU) - The EUU of a job $i$ (which can be executing, waiting, or downloading its data) is defined as $EU^i / (R^i_{max} \times t^i_{E,min})$, which gives the fraction of the maximum utility $U^i_{max}$ that the job $i$ expects to receive if its total time in the system is TTTC.

Remaining Unit Utility (RUU) - The RUU of a running job $i$ is defined as $EU^i / (R^i_{alloc} \times RTTC^i)$.
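Once EU has been evaluated from the job's TUF at its TTTC, the derived quantities above reduce to simple ratios; the following sketch (ours, with illustrative names) spells them out.

```python
def tttc(rttc: float, time_in_system: float) -> float:
    """Total Time To Completion: remaining time plus time already spent in the system."""
    return rttc + time_in_system

def euu(eu: float, r_max: int, t_exec_min: float) -> float:
    """Expected Unit Utility: fraction of U_max = R_max * t_E,min the job expects
    to receive if its total time in the system equals its TTTC."""
    return eu / (r_max * t_exec_min)

def ruu(eu: float, r_alloc: int, rttc: float) -> float:
    """Remaining Unit Utility of a running job: expected utility per allocated CPU
    per remaining time unit."""
    return eu / (r_alloc * rttc)
```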

3.1 Architecture Overview

As was mentioned in the Introduction, we propose an architecture that reduces the complexity of the scheduling problem by using two separate scheduling modules.

The CPU Scheduling Module (CPU-SM) monitors the jobs ready for execution (those whose data are already downloaded into the local storage) and the currently executing jobs. Based on this information, the module decides which jobs should execute first and how many CPUs should be assigned to them. In particular, this module first tries to schedule for execution the jobs that provide the

best fit to the available CPUs (minimizing the number of free CPUs that remain after scheduling). Then, if some jobs are still available but cannot fit into the remaining free CPUs, the CPU-SM examines whether it would be beneficial to preempt some running jobs and schedule some "more beneficial" available jobs. Finally, if each of the waiting jobs requires more CPUs than are currently available, the CPU-SM considers whether it would be worthwhile to "squeeze" any of those jobs into the currently free CPUs, thereby extending its execution time.

The Data Scheduling Module (DATA-SM) monitors the status of the local storage system and decides on the order in which to download data for the jobs arriving into the system. In particular, this module first tries to fetch data for as many jobs as possible, either in FIFO order, in the order of decreasing RUU, or in the order of increasing data requirement. It then considers whether any of the jobs whose data is being or has already been fetched should be preempted from the local storage by a newly arrived job.

The modular scheduling approach described above has a much smaller complexity than any centralized approach that strives to find optimal schedules in the combined CPU-storage space. The important issue that remains to be resolved is to decide on the utility function that each module should strive to optimize when making its scheduling decisions. Ideally, such a utility function should reflect the long-term future effects of each decision on the system's performance, rather than just its immediate impact. Such a utility function can be learned using Reinforcement Learning (RL) (e.g., [10,11]), which is a methodology for evaluating policies in unknown dynamic systems and for determining optimal policies with respect to any given performance criterion. A concept central to almost all RL algorithms is that of a value function V(s), which predicts the expected sum of future "rewards" starting from any given system state s. The value function learned with RL can serve as the desired utility function to be used by each module. In this case, one needs to decide on the state description s and on the "rewards" to be used for each module. These decisions are very nontrivial when RL is applied to any real-world problem, and the next section describes our reasoning for the choices we made.

4 Algorithms

4.1 Value Function Approximation

The standard procedure for learning a value function with RL consists first of choosing a value function approximation architecture with some tunable parameters, and then of tuning these parameters in the course of observing

the system's evolution. Since each module chooses its actions so as to maximize the current value function, the management policy it uses evolves as the value function evolves. Under certain conditions, this evolution process can be proved to converge to the optimal value function [10]. The existing convergence results assume a fixed stochastic environment for the considered system, while our domain exhibits a co-evolutionary behavior between CPU-SM and DATA-SM, and so we will focus on the statistical performance of various algorithms rather than on convergence issues.

The architectures that are linear in the tunable parameters are the easiest to adjust with RL, and in this paper we will use such architectures: $\hat{V}(s, p) = \sum_{i=1}^{M} p^i \phi^i(s)$, where $s$ is the system's state, $p$ is a vector of tunable parameters, and $\phi^i(s)$ are fixed basis functions. The basis functions are usually chosen to cover the whole state space, and in this paper we will follow [13] in using basis functions of the form $\phi^i(s) = \phi^i_1(s_1) \cdot \phi^i_2(s_2) \cdots \phi^i_n(s_n)$, where $n$ is the dimension of the vector $s$ and $\phi^i_j$ is either equal to $\psi^{fall}_j$ (a "falling" function that is equal to 1 for $s_j < s_{j,min}$, falls linearly from 1 to 0 over the range $[s_{j,min}, s_{j,max}]$, and stays at 0 for $s_j > s_{j,max}$) or $\psi^{rise}_j$ (a "rising" function that is equal to 0 for $s_j < s_{j,min}$, rises linearly from 0 to 1 over the range $[s_{j,min}, s_{j,max}]$, and stays at 1 for $s_j > s_{j,max}$). Each $\phi^i(s)$ is defined by a distinct combination of rising and falling functions for the $s_j$, so that $M = 2^n$. If the functions $\psi_j$ were to fall/rise as step functions at $s_j = (s_{j,min} + s_{j,max})/2$, then they could be visualized as "boxes" that cover the space of possible values of $s$, so that whenever $s$ belongs to box $i$, $\hat{V}(s, p) = p^i$. However, in our implementation each value of $s$ belongs to all "boxes" simultaneously to the extent specified by $\phi^i(s)$, and correspondingly the parameters $p^i$ are weighted by $\phi^i(s)$.

It is very important to select the ranges $[s_{j,min}, s_{j,max}]$ appropriately, since a very large range would imply that the values of $\phi^i(s)$ change very little in response to changes in $s$, while a very small range would imply that many values of $s$ fall completely within some "box" and $\hat{V}(s, p)$ would not be sensitive to changes of $s$ within that "box", while jumping quickly as $s$ traverses the boundary between the "boxes." The range estimation process used in this paper starts by observing the storage system for the first 5000 time steps (making about 5000 file upgrading decisions) and computing the mean $\bar{s}_j$ and the standard deviation $\sigma_j$ for each variable $s_j$. The expected range for $s_j$ is then set to $[\bar{s}_j - 2\sigma_j, \bar{s}_j + 2\sigma_j]$.

The following procedure is used for updating the parameters $p^i$:

$$p^i_{t+1} = p^i_t + \alpha_t \left[ r(s_t, a_t, s_{t+1}) - \rho_t + V(s_{t+1}, p_t) - V(s_t, p_t) \right] \phi^i(s_t), \quad (1)$$

which is executed simultaneously for all parameters, with $\alpha_t = \alpha / (\alpha + t)$ being a "learning rate" with $\alpha > 1$, $r(s_t, a_t, s_{t+1})$ being the reward received between states $s_t$ and $s_{t+1}$ after the action $a_t$ was taken in state $s_t$, and $\rho_t$ being the average reward from time 0 to time $t$.
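As an illustration of this approximation scheme (our own sketch, not the authors' code), the snippet below builds the $2^n$ products of rising/falling basis functions and applies the average-reward update of equation (1). The helper names and the running-mean estimate of $\rho_t$ are our assumptions.

```python
import itertools
import numpy as np

def rise(x, lo, hi):
    """Rising function: 0 below lo, 1 above hi, linear in between."""
    return float(np.clip((x - lo) / (hi - lo), 0.0, 1.0))

def fall(x, lo, hi):
    """Falling function: 1 below lo, 0 above hi, linear in between."""
    return 1.0 - rise(x, lo, hi)

def features(s, ranges):
    """phi^i(s): one product of rising/falling functions per combination, so M = 2**n."""
    phis = []
    for combo in itertools.product((fall, rise), repeat=len(s)):
        val = 1.0
        for f, s_j, (lo, hi) in zip(combo, s, ranges):
            val *= f(s_j, lo, hi)
        phis.append(val)
    return np.array(phis)

def value(s, p, ranges):
    """V_hat(s, p) = sum_i p^i * phi^i(s)."""
    return float(p @ features(s, ranges))

def td_update(p, rho, s, r, s_next, t, ranges, alpha=100.0):
    """One application of equation (1), with learning rate alpha_t = alpha / (alpha + t)."""
    alpha_t = alpha / (alpha + t)
    phi = features(s, ranges)
    delta = r - rho + value(s_next, p, ranges) - value(s, p, ranges)
    p = p + alpha_t * delta * phi
    rho = (t * rho + r) / (t + 1)   # running mean of rewards: one way to track rho_t
    return p, rho
```

Under the range-estimation procedure described above, each range $[s_{j,min}, s_{j,max}]$ would be set to $[\bar{s}_j - 2\sigma_j, \bar{s}_j + 2\sigma_j]$ from the first 5000 observed time steps; in the experiments of Section 5 all parameters start at zero and $\alpha = 100$.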


While the above parameter updating equation (based on linear function approximation architectures) has been extensively studied analytically (e.g., [11]), several crucial decisions need to be made every time it is applied to a practical problem:

• defining the appropriate reinforcement signal r, which should be correlated with what one ultimately wants to optimize as a result of learning. This signal can differ from the ultimate objective if that makes the reinforcement signal more regularly observable or more correlated with the agent's actions.
• defining the action space for the agent. Each action should ideally have an observable impact on the next reinforcement signal received.
• defining the appropriate state vector $s_t$ that contains enough information to predict the reinforcements. Since the complexity of learning is exponential in the size of the state vector, the dimensionality of $s_t$ should be kept as small as possible. As a result, creative ways of compacting the relevant information are often needed, and we had to resort to such creative compacting in our domain.
• deciding when the parameters should be updated. The RL theory assumes that a new reinforcement is received at every state change, and so parameters can be updated at every state change. However, in practice reinforcements are often received less regularly than that. Moreover, some state changes for the RL agent can be caused by external events (as opposed to the agent's own actions), and so it might be worthwhile to focus only on the state changes that result from the agent's own actions.

Below are the choices we made for our domain, which define the specific reinforcement learning approach we used. We first focus on the most distinguishing aspect of our RL methodology: the absence of a separately defined action space for the agent. The traditional RL problem formulation assumes that the effects of each action can only be observed at the next time step. In some practical problems, however, the effect of each action can be observed immediately. This is often the case in systems that accept stochastically arriving jobs/requests and make some internal reconfigurations in response (e.g., the job scheduling environment considered in this paper). In such cases, one can reduce the amount of noise present in the RL updates by describing the system's state in such a way that the effects of each action are reflected immediately in the system's state (we achieve that with the state descriptions given below), thus eliminating the need for a separately defined action space. That is, we specify that the state $s_n$ in equation (1) is the one observed AFTER the scheduling decision (if any) is taken. After a scheduling decision, the state $s_n$ evolves to the state $s'_{n+1}$ based on the fixed stochastic behavior of the scheduling environment (which depends on the job arrival rate, the job properties, and the server properties). The next scheduling decision constitutes the choice of the state $s_{n+1}$ used in equation (1) out of the states that are reachable from $s'_{n+1}$ by performing the allowable state reconfiguration decisions (job preemption, etc.).
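A minimal sketch of this action-free decision rule (our own formulation, with illustrative names): the module enumerates the states reachable through its allowed reconfigurations, includes "do nothing" as a candidate, and greedily moves to the state with the highest estimated value.

```python
def choose_next_state(current_state, reachable_states, value_fn):
    """Pick the post-decision state with the highest estimated value V.
    Keeping current_state among the candidates represents taking no action,
    so no separate action space or exploration mechanism is required."""
    candidates = [current_state] + list(reachable_states)
    return max(candidates, key=value_fn)
```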

So far no analytical results have been developed for the proposed sampling approach of choosing at every decision point the highest-utility state out of those that can be reached, but our positive experimental results suggest its feasibility. It is also interesting to note that this approach eliminates the need for performing action exploration (which is necessary when one learns to select different actions in every state), since the stochastic evolution of the state $s_n$ to $s'_{n+1}$ naturally explores all regions of the state space. This makes it possible to use our RL approach online without degrading the performance of the currently used policy.

The reward $r(x_t, a_t, x_{t+1})$ for the CPU-SM learning was defined as the utility received from the jobs completed between states $x_t$ and $x_{t+1}$. This definition of reward implies that the value function $V(x)$ reflects the expected utility per time step received from jobs completed in the future starting from state $x$, which was suggested in the previous section to be the ideal objective for the scheduling system. After some experimentation, we determined that the following variables work well as inputs to the value function $V(x)$ of the CPU-SM:

• $x_1$ = CPU-weighted average expected unit utility (EUU) of the currently running jobs
• $x_2$ = minimum remaining time to completion (RTTC) among all running jobs
• $x_3$ = number of free CPUs in the CPU module

If the CPU module receives jobs that on average have a high EUU, then we expect $V(x)$ to increase as $x_1$ and $x_2$ increase and as $x_3$ decreases. However, if the CPU module receives jobs that often have a low EUU, then when $x_1$ is small (less than 0.5), we expect $V(x)$ to increase as $x_2$ decreases and as $x_3$ increases. That is, when jobs with a small EUU are occupying the CPUs, it is best to have them occupy as few CPUs as possible and be as short as possible. The exact tradeoffs between the impact of these variables on $V(x)$ depend on the scheduling workload, and so RL is ideally suited for learning these tradeoffs by observing the system's behavior. The parameters of $V(x)$ were updated using equation (1) whenever any of the state variables changed and at least one job had been considered for preemption or "squeezing" since the last update.

The principally novel component in our co-evolutionary RL architecture is the reward function used for DATA-SM learning. Ideally, one should somehow reward DATA-SM for the average "quality" of the jobs that CPU-SM has selected for execution since the last DATA-SM parameter update. The parameters of the DATA-SM value function $W(y)$ are updated using equation (1) whenever the composition of jobs in the local storage is changed. If the job composition

is changed due to a new job starting to download its data, then the reward of 0 is used during the DATA-SM parameter update. However, if the composition of jobs in the local storage is changed due to the CPU-SM scheduling a new job onto its CPUs, then the DATA-SM reward is computed as the change in V (x) resulting from that scheduling decision. That is, the DATA-SM value function W (y) reflects the average future utility per time step that CPU-SM expects to receive (based on its value function V (x)) from all the scheduled jobs. This allows the DATA-SM to solve the difficult task of comparing the relative benefits of sending to CPU-SM a high-RUU job that is small and/or short vs. sending a low-RUU job that is large and/or long. Instead of inventing its own reward function for resolving this tradeoff, the DATA-SM can now most appropriately use the CPU-SM value function, which reflects the true preferences of CPU-SM. The experimental studies showed that the following variables work well for predicting the future value W (y): • y1 = average remaining unit utility (RUU) among all downloading and available jobs weighted by their CPU requirement and ideal execution time • y2 = average size of downloading and available jobs • y3 = average ideal execution time of all downloading and available jobs • y4 = number of downloading and available jobs If y1 is large, then we expect the CPU module to prefer large and long jobs, and have as many of them as possible to choose from. However, if y1 is small, then the CPU module might prefer these to be small and short jobs (so as not to occupy many valuable CPU resources for a long time), and to have fewer of them in the local storage so as to increase the probability of new jobs with a high RUU arriving and having their data downloaded into the local storage. The above four variables allow the data module to make scheduling decisions that result in the DATA-SM state that reflects the CPU module preferences described above.
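The sketch below (ours; the field names and the exact weighting used in $y_1$ are our reading of the description above) shows the DATA-SM state features and the co-evolutionary reward: zero when a storage change is caused by a new download starting, and the change in the CPU-SM value $V(x)$ when it is caused by the CPU-SM scheduling a job.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StoredJob:
    """Illustrative record for a downloading or available job."""
    ruu: float         # remaining unit utility
    data_pages: int    # local storage pages occupied
    t_exec_min: float  # ideal execution time
    r_max: int         # ideal CPU requirement

def data_sm_state(jobs: List[StoredJob]) -> Tuple[float, float, float, int]:
    """State y = (y1, y2, y3, y4) over the downloading and available jobs."""
    if not jobs:
        return (0.0, 0.0, 0.0, 0)
    weights = [j.r_max * j.t_exec_min for j in jobs]           # CPU x ideal-time weights
    y1 = sum(j.ruu * w for j, w in zip(jobs, weights)) / sum(weights)
    y2 = sum(j.data_pages for j in jobs) / len(jobs)           # average data size
    y3 = sum(j.t_exec_min for j in jobs) / len(jobs)           # average ideal execution time
    y4 = len(jobs)
    return (y1, y2, y3, y4)

def data_sm_reward(event: str, v_cpu_before: float, v_cpu_after: float) -> float:
    """Reward used in equation (1) when updating W(y)."""
    if event == "job_scheduled":       # CPU-SM pulled a job out of the local storage
        return v_cpu_after - v_cpu_before
    return 0.0                         # e.g. a new job started downloading its data
```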

4.2 CPU Scheduling Module Algorithms

The CPU-SM uses three fundamental scheduling policies:

1. The Basic No-Preemption (BNP) scheduling policy attempts to maximize the utilization of the CPU resources by finding the job combination that leaves the smallest number of unused CPUs after the scheduling action. This is the standard "best-fit" scheduling algorithm used on most multi-processor servers today.

The next two policies can be executed independently by the CPU module

whenever some jobs are waiting but cannot fit into the CPU module with the desired number of CPUs.

2. The Basic Preemption-Oversubscribing (BPO) scheduling policy attempts to perform preemptions on a given CPU module based on a simple heuristic algorithm that has proven to be very effective in our simulations. The BPO policy first finds the job with the smallest RUU among the running jobs, which we will denote as $RUU_1$. Then it searches the queue of available jobs for those with RUU greater than $RUU_1$. If such jobs do exist, the policy selects the one that requires the smallest number of CPUs and forcefully schedules it by preempting the job with $RUU_1$ and as many other jobs as necessary in the order of increasing RUU. Otherwise, no preemptions take place on this module. If some jobs are still available for scheduling and some CPUs are still free, the BPO policy "squeezes" the smallest available job into the free CPUs on the module (a simplified sketch of this preemption step is given at the end of this subsection). We assume that a module can have at most one "squeezed" job and that if more CPUs become available, the module automatically allocates them to the "squeezed" job until it receives the ideal number of CPUs it requires.

3. The Reinforcement Learning Preemption-Oversubscribing (RLPO) scheduling policy attempts to perform preemptions based on the value function learned by RL. The policy first computes the variables $x_1$, $x_2$, and $x_3$ describing the current state of the module and then uses them as inputs to compute the value function $V_0$, which corresponds to the case of no preemptions. After that, the policy finds the job with the smallest RUU among the running jobs, which we denote as $RUU_1$, and searches the available jobs for those with RUU greater than $RUU_1$. For each such job $i$, the policy computes the alternate possible module state value $V_i$ in terms of the values of $x_1$, $x_2$, and $x_3$ that would arise if job $i$ were forcefully scheduled, preempting enough jobs (in the order of increasing RUU) to fit itself. Let $V_{max} = \max_i \{V_i\}$. If $V_{max} > V_0$, then the policy preempts enough lowest-utility jobs and schedules the job with the highest $V_i$ onto the module. Otherwise, no preemptions take place in this module. If some jobs are still available for scheduling and some CPUs are still free, the RLPO policy decides whether or not to "oversubscribe" the module by finding the smallest available job $j$ and computing the alternate module state in terms of the values of $x_1$, $x_2$, and $x_3$ that would arise if job $j$ were "squeezed" into the free CPUs of this module. These variables are then used to compute the corresponding value function $\hat{V}$. If $\hat{V} > V_0$, the RLPO policy "squeezes" job $j$ into the free CPUs on this module.

The RLPO policy was evaluated in [13] for the case of multiple CPU modules, each of them independently using this policy, and was shown to result in a modest performance improvement over the BPO policy. We present experiments in Section 5 for a single CPU module that show a much more significant improvement of RLPO over BPO and explain what aspects of the problem

structure have the greatest impact on the relative performance of RLPO and BPO.
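For concreteness, here is a simplified sketch of the BPO preemption step (ours; the job fields and capacity bookkeeping are illustrative simplifications). The RLPO variant would replace the RUU comparison with an evaluation of the learned value $V(x_1, x_2, x_3)$ for each candidate post-preemption state.

```python
def bpo_preempt(running, waiting, free_cpus):
    """One BPO preemption decision on a CPU module (simplified sketch).

    running: running jobs, each with .ruu and .r_alloc (currently allocated CPUs)
    waiting: available jobs, each with .ruu and .r_max (ideal CPU requirement)
    Returns (job_to_schedule, jobs_to_preempt); (None, []) means no preemption.
    """
    if not running or not waiting:
        return None, []
    ruu_1 = min(j.ruu for j in running)
    candidates = [j for j in waiting if j.ruu > ruu_1]   # only jobs better than the worst running one
    if not candidates:
        return None, []
    chosen = min(candidates, key=lambda j: j.r_max)      # smallest CPU requirement first
    to_preempt = []
    for victim in sorted(running, key=lambda j: j.ruu):  # preempt in order of increasing RUU
        if free_cpus >= chosen.r_max:
            break
        to_preempt.append(victim)
        free_cpus += victim.r_alloc
    if free_cpus < chosen.r_max:
        return None, []                                  # cannot fit even after preempting everything
    return chosen, to_preempt
```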

4.3 Data Scheduling Module Algorithms

Fig. 3. Relative performance of the CPU-SM scheduling policies

The DATA-SM uses three fundamental scheduling policies:

1. The No Data Preemption (NDP) scheduling policy first orders all jobs in the arrival queue in the order of decreasing RUU (NDP-RUU), in the order of increasing data size (NDP-Size), or in the natural job arrival order (NDP-FIFO), and then tries to download the data of as many of these jobs as possible, starting from the first job. Naturally, the number of jobs that can be downloaded is limited by the available storage capacity. The NDP-RUU and NDP-Size policies represent the opposite ends of the spectrum in the main tradeoff that needs to be made by DATA-SM: providing as many jobs as possible for the CPU module to choose from while at the same time providing it with the highest-RUU jobs.

The next two policies assume that there are no jobs in the arrival queue whose data can fit in the local storage system.

2. The Basic Data-Preemption (BDP) scheduling policy attempts to perform data preemptions in order to improve the "quality" of jobs in the local storage. The BDP-RUU policy first executes the NDP-RUU policy. The policy then finds the job $j$ with the smallest RUU among the available and downloading jobs, denoted as $RUU_1$. It also finds the highest-RUU job $i$ among all the jobs in the arrival queue that can fit into the local storage without preempting the data of any of the currently running jobs. If such a job is found (and its RUU exceeds $RUU_1$), then its data starts downloading, preempting the data of the smallest number of the currently available and downloading jobs in the order of their increasing RUU. The BDP-Size policy first executes the NDP-Size policy and then uses the smallest-data job from the arrival queue to preempt the smallest number of the currently available and downloading jobs in the order of their increasing RUU.

3. The Reinforcement Learning Data-Preemption (RLDP) scheduling policy attempts to perform data preemptions based on the value function $W(y)$ learned using the RL algorithm. It starts by executing the NDP-RUU, NDP-Size, or NDP-FIFO policy, corresponding to the RLDP-RUU, RLDP-Size, or RLDP-FIFO policies. Then it computes the values of $y_1$, $y_2$, $y_3$, and $y_4$ describing the current state of the local storage system and uses them as inputs to compute $W_0$, describing the expected future benefit if no data preemptions take place. After that, for every job $i$ in the arrival queue that can fit into the local storage without preempting the data of the currently running jobs, the RLDP policy computes the alternate possible local storage state in terms of the new values of $y_1$, $y_2$, $y_3$, and $y_4$ that would arise if job $i$ data-preempts enough jobs (in the order of increasing RUU) in order to start downloading its own data. The new variable values are then used as inputs to compute the corresponding expected future local storage system benefit $W_i$. Let $W_{max} = \max_i \{W_i\}$. If $W_{max} > W_0$, then RLDP preempts the data of enough lowest-RUU jobs and starts downloading the data for the job with the highest $W_i$. Otherwise, no data preemptions take place.
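A matching sketch of the RLDP data-preemption step (ours; the `candidate_state` helper is hypothetical and would apply the lowest-RUU-first preemption rule described above).

```python
def rldp_preempt(current_state, arrival_queue, candidate_state, w_value):
    """One RLDP data-preemption decision (simplified sketch).

    current_state: (y1, y2, y3, y4) describing the local storage right now
    candidate_state(job): the storage state that would result if `job` preempted
        enough lowest-RUU downloading/available jobs to start its own download
    w_value(state): the learned DATA-SM value function W(y)
    Returns the job to admit via data preemption, or None for no preemption.
    """
    best_job, best_w = None, w_value(current_state)   # W0: value with no preemption
    for job in arrival_queue:
        w_i = w_value(candidate_state(job))
        if w_i > best_w:
            best_job, best_w = job, w_i
    return best_job
```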

5 Simulation

The proposed scheduling framework is evaluated using an extension of the job scheduling simulator used in [13]. As a base case, we simulated one CPU module with 24 CPUs and a local storage system capable of storing 25 "pages" of data. Jobs arrived stochastically in a Poisson manner with an arrival rate λ = 0.1. The ideal number of CPUs $R^i_{max}$ required by each job was sampled from a uniform distribution on the interval [4, 20], the corresponding ideal execution time $t^i_{E,min}$ was sampled from a uniform distribution on [5, 10], and the number of data pages required by each job was sampled from a uniform distribution on the interval [1, 5]. The minimum number of CPUs $R^i_{min}$ required by each job was assumed to be 1. The data download rate was assumed to be 1 data page per unit of time, so that the job downloading time is uniformly distributed on [1, 5]. (A small illustrative sketch of this workload generator appears at the end of this section.)

Each policy was evaluated for 20000 time steps, and 50 trials were conducted for each policy, so that the standard deviation of its performance observations was less than 1% of the performance itself. Prior to testing the RLPO or RLDP policy, the parameters of the corresponding value function were initialized to

0 and then tuned using equation (1) for 20000 time steps. The learning rate parameter α was 100, and no action exploration was used during learning (we explained in Section 4 why action exploration is not necessary in our RL architecture).

The first three bars in Figure 3 show the relative performance of the various CPU-SM policies in terms of the total utility obtained by the scheduling system when FIFO data ordering was used with no data preemption (NDP-FIFO). Note that prior to CPU-SM learning, when all parameters are initialized at 0, the RLPO policy makes no preemptions or oversubscriptions and hence behaves just as the standard best-fit policy. As one can see, RL learns to improve its initial performance by 34%, outperforming the BPO policy by 9%. We have also observed that the RLPO policy learns to make significantly fewer job preemptions than the BPO policy. This suggests that if some time is required to preempt the running jobs, then the performance improvement of RLPO over BPO should be even greater, as less time will be wasted on waiting until jobs encapsulate and save their state. The last two bars in Figure 3 correspond to the case when 0.5 units of time are required to preempt each running job. In this environment, the RLPO policy improves performance over the BPO policy by 23%, confirming our intuition. Notice that the performance of RLPO degraded by a smaller fraction than that of BPO relative to the case of instantaneous job preemptions. This demonstrates the ability of RL to adjust itself to the new environmental conditions and learn to make only those preemptions whose benefit outweighs their costs.

We have also observed that the performance improvement of RLPO over the best-fit and BPO policies increases as more jobs are available to the CPU-SM module for scheduling. This is expected, since a greater number of available jobs implies a more frequent possibility to make a preemption or oversubscribing decision, and hence a greater potential benefit of any preemption/oversubscribing policy over the best-fit policy. At the same time, we have observed that as the number of jobs available for CPU scheduling increases, the performance difference between the various DATA-SM scheduling policies decreases, since the CPU-SM module can "take care of itself" by selecting the jobs it likes the most from the available ones. As a reference, in the above experiment we observed that on average 0.22 jobs were waiting for data space, 1.2 jobs were downloading, 1.8 jobs were available for scheduling, and 1.9 jobs were executing. In this scenario, there was no statistically significant difference between the RLDP and BDP policies.

On the other hand, as the number of jobs available to CPU-SM decreases (for example due to a larger data size needed for each job), it becomes more important to manage the storage space most efficiently and make sure that it is occupied by the jobs most preferred by CPU-SM. Figure 4 shows the relative system performance under various DATA-SM policies used in conjunction with RLPO when the data requirement for each job was randomly chosen from [5, 25], decreasing the average number of jobs available to CPU-SM by a factor of 5.

Fig. 4. Relative performance of the DATA-SM scheduling policies used in conjunction with RLPO

In this scenario, on average 7.2 jobs were waiting for data space, 1.1 jobs were downloading, 0.07 jobs were available for scheduling, and 0.7 jobs were executing. There was no statistically significant difference between the RLPO and BPO policies in this scenario, but as Figure 4 shows, the difference between the various data scheduling policies is very significant.

Note that BDP-RUU and BDP-Size are just "improved" versions of NDP-RUU and NDP-Size, since they try to maintain the RUU or size ordering more consistently by using the preemption mechanism. Also note that prior to learning, the parameters of the value function for DATA-SM were all initialized to 0, and so the RLDP policy initially makes no data preemptions and behaves as the corresponding NDP policy, depending on the job ordering used. Figure 4 shows that RLDP-RUU improves its performance as a result of learning by 34% (behaving initially as NDP-RUU), while RLDP-Size improves its performance by 37% (behaving initially as NDP-Size), confirming the feasibility and effectiveness of the proposed co-evolutionary learning architecture.

We have thus shown that the presented data-aware scheduling architecture provides benefit over the full range of possible job data requirements, with RL-based CPU scheduling providing most of the benefit when jobs require little data and RL-based data scheduling providing most of the benefit when jobs require a lot of data.
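For reference, a minimal sketch (ours) of the base-case workload described at the beginning of this section; whether the CPU and page counts were drawn as integers is not stated in the report, so the integer draws below are an assumption.

```python
import random

def next_interarrival(lam: float = 0.1) -> float:
    """Poisson arrivals with rate lambda = 0.1: exponential inter-arrival times."""
    return random.expovariate(lam)

def generate_job() -> dict:
    """One job sampled from the base-case distributions of Section 5."""
    data_pages = random.randint(1, 5)             # data requirement, uniform on [1, 5] pages
    return {
        "r_max": random.randint(4, 20),           # ideal number of CPUs, uniform on [4, 20]
        "r_min": 1,                               # minimum number of CPUs
        "t_exec_min": random.uniform(5.0, 10.0),  # ideal execution time, uniform on [5, 10]
        "data_pages": data_pages,
        "t_down_min": float(data_pages),          # download rate of 1 page per time unit
    }
```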

6 Conclusion

This paper has presented a novel co-evolutionary framework for solving the joint problem of managing the local storage space and the CPU resources

of a computing system. The presented instantiation of this framework in the scheduling domain can apply (with suitable modifications to the state variables) to any computing system that uses a local cache to speed up execution of some jobs. More generally, the distributed co-evolutionary aspect of this framework makes it applicable to a greater class of job shop scheduling problems, where a scheduler is used at each stage of the process to perform some processing on the jobs, which implicitly affects the ordering in which the jobs become available for the next stage. Even more generally, the presented co-evolutionary framework can be used for solving many other multi-agent learning problems where some agents affect the environment only by changing the states of other agents and cannot obtain a direct performance feedback from the environment.

References

[1] W. Bell, D. Cameron, L. Capozza, A. Millar, K. Stockinger, F. Zini, "Evaluation of an Economy-Based File Replication Strategy for a Data Grid," In Proceedings of the 3rd IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pp. 661-668, 2003.

[2] J. Bent, D. Rotem, A. Romosan, and A. Shoshani, "Coordination of Data Movement with Computation Scheduling on a Cluster," In Proceedings of Challenges of Large Applications in Distributed Environments (CLADE 2005), North Carolina, pp. 25-34, July 2005.

[3] R. H. Crites and A. G. Barto, "Elevator group control using multiple reinforcement learning agents," Machine Learning, Vol. 33, Issue 2-3, pp. 235-262, 1998.

[4] E. Jensen, C. Locke, H. Tokuda, "A time driven scheduling model for real-time operating systems," In Proceedings of the IEEE Real-Time Systems Symposium, pp. 112-122, 1985.

[5] J. Jiang, G. Xu, X. Wei, "An Enhanced Data-aware Scheduling Algorithm for Batch-mode Data-intensive Jobs on Data Grid," In Proceedings of the International Conference on Hybrid Information Technology (ICHIT '06), Vol. 1, pp. 257-262, 2006.

[6] T. Kosar, "A new paradigm in data intensive computing: Stork and the data-aware schedulers," In Proceedings of the Challenges of Large Applications in Distributed Environments (CLADE 2006) Workshop, Paris, France, in conjunction with HPDC 2006, pp. 5-12, June 2006.

[7] P. Li, H. Wu, B. Ravindran, and E. D. Jensen, "A Utility Accrual Scheduling Algorithm for Real-Time Activities With Mutual Exclusion Resource Constraints," IEEE Transactions on Computers, Vol. 55, Issue 4, pp. 454-469, April 2006.


[8] J. Mitchell, "Sun's HPCS architecture and technologies," In Proceedings of the High Performance Computing, Networking and Storage Conference (SC2004), Pittsburgh, PA, 2004.

[9] B. Ravindran, E. D. Jensen, P. Li, "On recent advances in time/utility function real-time scheduling and resource management," In Proceedings of the Eighth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC), pp. 55-60, 2005.

[10] R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[11] J. N. Tsitsiklis and B. Van Roy, "Average cost temporal-difference learning," Automatica, Vol. 18, No. 11, pp. 1799-1808, 1999.

[12] H. R. Berenji, D. Vengerov, J. Ametha, "Co-Evolutionary Perception-based Reinforcement Learning for Sensor Allocation in Autonomous Vehicles," In Proceedings of the 12th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 125-130, 2003.

[13] D. Vengerov, "Reinforcement learning framework for utility-based scheduling in resource-constrained systems," Sun Microsystems Laboratories Technical Report TR-2005-141, February 2005.

[14] H. Wu, B. Ravindran, E. D. Jensen, and U. Balli, "Utility Accrual Scheduling under Arbitrary Time/Utility Functions and Multi-unit Resource Constraints," In Proceedings of the 10th International Conference on Real-Time and Embedded Computing Systems and Applications (RTCSA), pp. 80-98, 2004.


7 About the Authors

David Vengerov is a staff engineer at Sun Microsystems Laboratories. He is a principal investigator for the Adaptive Optimization project, developing and implementing self-managing and self-optimizing capabilities in computer systems. His primary research interests include Utility and Autonomic Computing, Reinforcement Learning Algorithms, and Multi-Agent Systems. He holds a Ph.D. in Management Science and Engineering from Stanford University, an M.S. in Engineering Economic Systems and Operations Research from Stanford University, an M.S. in Electrical Engineering and Computer Science from MIT, and a B.S. in Mathematics from MIT.

Declan Murphy is a Senior Staff Engineer at Sun Microsystems, focused on system management technology. He is currently leading the development of the next-generation Sun Connection product. Previously, he led the Administrative Environment team for Phase 2 of Sun's DARPA HPCS program, researching techniques to enhance the productivity of large high-performance computers from the system administration perspective. Prior to that he worked on Sun's N1 program for future enterprise systems, contributing to the initial vision and helping jump-start the N1 product line. Declan spent the bulk of the 1990s working on Sun's highly available clustering products, including leading the development of the Sun Cluster 3.0 product, with a focus on the high availability infrastructure. He received BA and BAI degrees in computer engineering from Trinity College, Dublin, Ireland in 1987.

Lykomidis Mastroleon is a Ph.D. student at Stanford University in the Department of Management Science and Engineering.

Nick Bambos is a Professor at Stanford University, holding a joint appointment in the Department of Electrical Engineering and the Department of Management Science & Engineering. He heads the Network Architecture and Performance Engineering research group at Stanford, conducting research in wireless network architectures, the Internet infrastructure, packet switching, network management, and information service engineering, and is engaged in various projects of his Network Architecture Laboratory (NetLab). His current technology research interests include high-performance networking, autonomic computing, and service engineering. His methodological interests are in network control, online task scheduling, queueing systems, and stochastic processing networks.
