Decay-usage Scheduling in Multiprocessors - CiteSeerX

11 downloads 8333 Views 432KB Size Report
(a) CPU. In this paper we deal with a decay-usage scheduling policy in ... part, the accumulated and decayed CPU usage, increases in a linearly proportional ...
Decay-usage Scheduling in Multiprocessors Report 97-55

D.H.J Epema

Faculteit der Technische Wiskunde en Informatica Faculty of Technical Mathematics and Informatics Technische Universiteit Delft Delft University of Technology

ISSN 0922-5641

Copyright c 1997 by the Faculty of Technical Mathematics and Informatics, Delft, The Netherlands. No part of this Journal may be reproduced in any form, by print, photoprint, microfilm, or any other means without permission from the Faculty of Technical Mathematics and Informatics, Delft University of Technology, The Netherlands. Copies of these reports may be obtained from the bureau of the Faculty of Technical Mathematics and Informatics, Julianalaan 132, 2628 BL Delft, phone +31152784568. A selection of these reports is available in PostScript form at the Faculty’s anonymous ftpsite. They are located in the directory /pub/publications/tech-reports at ftp.twi.tudelft.nl

Abstract

Decay-usage scheduling is a priority-ageing time-sharing scheduling policy capable of dealing with a workload of both interactive and batch jobs by decreasing the priority of a job when it acquires CPU time, and by increasing its priority when it does not use the (a) CPU. In this paper we deal with a decay-usage scheduling policy in multiprocessors modeled after widely used systems. The priority of a job consists of a base priority and a time-dependent component based on processor usage. Because the priorities in our model are time dependent, a queueing-theoretic analysis, for instance for the mean job response time, seems impossible. Still, it turns out that as a consequence of the scheduling policy, the shares of the available CPU time obtained by jobs converge, and a deterministic analysis for these shares is feasible: We show how for a xed set of jobs with large processing demands, the steady-state shares can be obtained given the base priorities, and conversely, how to set the base priorities given the required shares. In addition, we analyze the relation between the values of the scheduler parameters and the level of control it can exercise over the steady-state share ratios, and we deal with the rate of convergence. We validate the model by simulations and by measurements of actual systems. Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management| multiprocessing/multiprogramming, scheduling; D.4.8 [Operating Systems]: Performance| measurements, modeling and prediction, simulation

General Terms: Measurement, Multiprocessors, Performance, Scheduling, Simulation Additional Key Words and Phrases: Control, convergence, decay usage, priorities, shares A short version of this paper, which omits the proofs, Sections 4.3 and 5, and the extended treatment of the performance results in Section 6, appeared in the Proceedings of Sigmetrics '95/Performance '95, May 1995, Ottawa, Canada, pp. 74{85.

1 Introduction Time-sharing systems that support both interactive and batch jobs are steadily becoming less important in this age of workstations. Instead, for compute-intensive jobs, the use of multiprocessors and clusters of uniprocessors is increasing. Currently, most of the latter systems run some variation of the UNIX operating system, which employs a classical timesharing policy. In this paper we analyze a scheduling policy in multiprocessors modeled after scheduling in UNIX systems under a workload of compute-intensive jobs. An important issue in scheduling policies for time-sharing systems is how to recognize long, compute-bound jobs in order to give these a low priority relative to short, interactive jobs. The main mechanism to achieve this without a-priori knowledge of the processing demands of jobs is by means of multilevel feedback queues. There are two basic variations of such queues, in each of which the priority of a job is depressed whenever it acquires CPU time. The rst is the Head-of-the-Line policy, in which, on arrival, a job enters the highest-priority queue, and each time it exceeds some amount of processing time, its priority is depressed by appending it to the next-lower priority queue. This continues until a job reaches the lowest priority level, where it then stays; in particular, its priority remains depressed permanently. In the second, processor-usage ageing, also known as priority ageing and decay usage, is employed: When a job acquires CPU time, its priority is continually being depressed, and periodically, at the end of every decay cycle, the priorities of all jobs are elevated by diminishing the depressions due to the CPU time obtained (i.e., the processor usage of each job is decayed ), after which each job is appended to the queue corresponding to its new priority. As a consequence, the further in the past CPU time has been obtained, the smaller the priority depression it entails. In addition, the priorities of jobs may have xed base components, yielding the opportunity to give di erent jobs di erent levels of service. It is this type of scheduling that is used in UNIX, and also in Mach [2, 3], which latter is the basis of the operating system of the Open Software Foundation (OSF)1 . In the continuous-time multiprocessor scheduling policy we study, the priority of a job is equal to the sum of its xed base priority and a time-dependent component. The latter part, the accumulated and decayed CPU usage , increases in a linearly proportional fashion with CPU time obtained, and is periodically divided by the decay factor, which is a constant larger than one. (A low priority corresponds to a high level of precedence.) Jobs with equal base priorities belong to the same class. Jobs are selected for service in a processor-sharing fashion according to their priorities. At any point in time, there is a subdivision of jobs into three groups, any of which may be empty. The jobs in the rst group have the lowest priorities (i.e., the highest levels of precedence) and have dedicated processors; those in the second group have equal priorities, higher than any of the priorities of the jobs in the rst group, and evenly share the remaining processors; and those in the third group, with yet higher priorities, do not receive service. This subdivision of jobs may vary during a decay cycle and may change at the start of a new decay cycle, because the priority of a job that 1 Open Software Foundation, Inc., 11 Cambridge Center, Cambridge, Ma., USA.

1

receives service increases, and because of the recomputation of the priorities at the end of a decay cycle, respectively. Because the priorities are time dependent, a queueing-theoretic analysis of the policy, for instance solving for the mean response time, possibly assuming a Poisson arrival process and an exponential service-time distribution for each class, seems infeasible. However, provided that the set of jobs in the system is xed and that all jobs are present from the same time onwards, a deterministic analysis is feasible. In that case, because jobs with equal base priorities are treated in the same way, the priorities of the jobs in the same class are always equal, and the subdivision of jobs into three groups is in fact a subdivision of classes. The main ingredient of our analysis is keeping track of this subdivision, in particular at the end of every decay cycle. Our main results are proving the convergence of this scheduling policy in the sense that the fractions of CPU time obtained by jobs (the shares ) have limits (the steady-state shares ), and deriving the relation between the base priorities and the steady-state shares: For a xed set of jobs with large enough processing demands, we show how to compute the shares given the base priorities, and conversely, how to set the base priorities given the required shares. The decay-usage scheduling policy is characterized by four parameters (amongst which the decay factor and the increment in priority due to one unit of CPU time obtained). It turns out that these parameters are not independent, to the extent that in general, one parameter would suce as far as the ratios of the steady-state shares are concerned. We also deal with the transient behavior of the scheduler, i.e., with the rate of convergence to the steady state and with arrivals and departures of jobs. In actual systems, jobs receive discrete quanta of CPU time, and some numbers in the scheduler are integers. In our model, jobs receive CPU time in a continuous processorsharing fashion, and we assume all numbers to be reals. In order to validate our model and to assess the impact of the continuous-time hypothesis without any in uence of implementation details of actual systems, we simulate the UNIX scheduler. In general, the results of the simulations agree reasonably well with the model when the quantum size is small. In addition, we performed measurements on uniprocessors and multiprocessors running versions of UNIX based on 4.3 Berkeley Software Distributions (4.3BSD) UNIX and Mach. The results for the 4.3BSD uniprocessor agree extremely well with the model, but those of the 4.3BSD multiprocessor fail to do so. Detailed traces of the latter system show that its scheduling policy does not comply with the model. For the Mach systems, the measurements match the results of the model not very well; however, after adapting the model to include the somewhat di erent way in which Mach e ectuates decay, the measurements match the model very well indeed. Our analysis extends the analysis by J.L. Hellerstein [10] of scheduling in UNIX SystemV uniprocessors and the short treatment by Black [3] of Mach multiprocessors to a uni ed treatment of scheduling in uniprocessors and multiprocessors running UNIX System V (up to Release 4), 4.3BSD UNIX, 4.4BSD UNIX, or Mach. Neither of these authors proves that the decay-usage scheduling policy reaches a steady state, nor do they deal with the 2

transient behavior of the scheduler. The extension to 4.3BSD (and 4.4BSD) and Mach involves more complicated formulas for the priorities. The main diculty in the extension to multiprocessors is to prove the convergence of the scheduling policy and to determine how many processors serve each class in the steady state. We use some of the techniques of [10], in particular the subdivision of decay cycles into epochs, at the end of which the priorities of di erent classes become equal. Hellerstein [11] gives a more detailed exposition of the results in [10], and also deals with control issues relating to the range of possible ratios of the steady-state shares and the granularity within this range. In [10, 11], measurements are presented for UNIX System V on uniprocessors; they match the predictions of the model remarkably well. For a compute-intensive workload on a multiprocessor or a cluster of uniprocessors, it is a natural objective to deliver pre-speci ed shares of the total computing power to groups of jobs (e.g., all jobs belonging to a single user or to a speci ed group of users). Schedulers for uniprocessors with this objective have been treated in the literature under the name of fairshare schedulers [5, 12, 13], although the term share schedulers would be more appropriate. Our results show how to achieve share scheduling for long, compute-intensive jobs in UNIX systems, at least in theory, without scheduler modi cations, while in each of [5, 12, 13], an additional term in the priorities of jobs and a feedback loop are needed. Unfortunately, the relation between the base priorities and the share ratios is rather intricate. However, like in other classical time-sharing systems, the two main performance objectives of the UNIX scheduler are short response times of interactive jobs and a high throughput of large jobs, rather than share scheduling. Recently, there has been a renewal of interest in share scheduling of di erent types of resources, expecially of processor time and of network bandwidth. In [18], a probabilistic resource-management mechanism called lottery scheduling is presented. Jobs own a number of tickets proportional to their shares, and for each scheduling decision, a lottery is held in which the chance of a job to win is proportional to its number of tickets. Resource abstractions such as ticket transfers, currencies and in ation are supported. Ticket transfers allow a job to yield its rights to another job, for instance when the former sends a request to the latter and blocks. Currencies and in ation allow a group of jobs to be isolated from other jobs when it expands, shrinks or internally re-allocates the relative shares of its jobs. Experimental results show that lottery scheduling works well for compute-intensive jobs on a scale of seconds, and that it can also be used for certain non-compute-intensive workloads. In [19], a deterministic mechanism called stride scheduling is employed to achieve share scheduling that uses the same abstractions as lottery scheduling. Each job has a stride value that is inversely proportional to its number of tickets, and a pass value. When a scheduling decision is taken, the job with the lowest pass value is chosen, and its pass value is incremented by its stride value. It is shown by means of simulations and an implementation, that compared to lottery scheduling, stride scheduling achieves a somewhat better accuracy in throughput ratios and a much lower variability in response times (which are de ned as the 3

times between two successive quantum completions). As a matter of fact, stride scheduling is but a slight modi cation of decay-usage scheduling (cf. Section 4.1). In [7], a di erent time-slicing mechanism for achieving proportional resource usage called time-function scheduling (TFS) is analyzed in uniprocessors. For each class k of jobs, a time function Fk () is de ned. Within a class, jobs are served in a FCFS fashion. When a job has spent an amount of time t waiting in the queue corresponding to its class, its time-function value is equal to Fk (t). After a quantum has expired, the processor selects, among the jobs at the heads of the non-empty queues, the one with the highest time-function value. It is shown in [7] that with linear time functions, TFS is able to achieve share-ratio objectives on a per-class basis very well (in fact, the share ratios are the inverses of the ratios of the slopes of the time functions), with a waiting-time variance that is substantially lower than that of lottery scheduling. An important di erence between TFS on the one hand and decay-usage scheduling and stride scheduling on the other, is that in the former, priorities are depressed while a job is waiting, while in the latter this happens while a job receives service. Fair scheduling has also been investigated in the context of networks, in which di erent packet streams with quality-of-service demands compete for bandwidth. The usual aim of fair scheduling policies in networks is to approximate some form of processor sharing [9]. Two di erences with scheduling compute-intensive jobs in multiprocessors are that in networks, the time scale on which to achieve fair scheduling is much smaller, and that scheduling is nonpreemptive because the transmission of a packet cannot be interrupted and resumed. In [4], a study of three scheduling algorithms for (small) shared-memory multiprocessors by means of simulations driven by traces of UNIX systems, is reported. The emphasis is on the impact of load-balancing aspects such as migration overhead on response times. No attempt is made at simulating the decay-usage scheduling policy of UNIX.

2 The Model In this section we describe scheduling in UNIX systems and in Mach, our model of this type of scheduling, and the di erences between them. We then show how our model behaves over time.

2.1 Scheduling in UNIX Systems Scheduling in UNIX follows a general pattern, but details vary among variations such as UNIX System V Releases 2 [1] and 3, UNIX System V Release 4 [8], 4.3BSD UNIX [15] and 4.4BSD UNIX [16], and the operating system Mach [2, 3], which has a close resemblance to UNIX as far as scheduling is concerned. The general description below, which is valid for both uniprocessors and multiprocessors, applies to System V up to and including Release 3, 4.3BSD and 4.4BSD, and Mach. It does not apply to System V Release 4, which by default supports the two scheduling classes time-sharing and real-time, for the former of which it 4

uses a table-driven approach [8]. Therefore, in the remainder of this article, `System V' refers to the releases of that system up to and including Release 3. For a comprehensive treatment of scheduling in UNIX, see Chapter 5 of [17]. UNIX employs a round-robin scheduling policy with multilevel feedback and priority ageing. Time is divided into clock ticks, and CPU time is allocated in time quanta of some xed number of clock ticks. In most current UNIX and Mach systems, a clock tick is 10 ms. The set of priorities (typically 0; 1; : : :; 127) is partitioned into the kernel-mode priorities (usually 0 through 49) and the user-mode priorities (50 through 127). A job (in UNIX called a process) is represented by a job descriptor, in which the following three elds, implemented as integers, relate to processor scheduling (also for Mach, we use UNIX terminology): 1. p nice: The nice-value, which acts as a base priority and has range 0 through 19. 2. p cpu: This eld is incremented by 1 for every clock tick received, and every second, it is recomputed by dividing it by the decay factor D, and in some systems, by adding p nice to it. Although p nice enters in its periodic recomputation, we will refer to p cpu as the accumulated and decayed CPU usage, or sometimes simply as the CPU usage. It is the division by D that we call the decay of (CPU) usage, and also, because p cpu enters in the priority, priority ageing. 3. p usrpri: The priority of a job, which is the sum of a linear combination of p cpu and p nice, and of the constant PUSER serving to separate kernel-mode priorities from user-mode priorities (with the ranges of priorities mentioned above, PUSER equals 50). A high value of p usrpri corresponds to a low precedence. In the sequel, in the context of UNIX, `priority' refers to p usrpri. In principle, there is a run queue for every priority value. A waiting process is in the run queue corresponding to its priority. When a time quantum expires, the scheduler selects the process at the head of the highest non-empty run queue (lowest value of p usrpri) to run. In many implementations, there are 32 instead of 128 run queues; then, the two least-signi cant bits of p usrpri are ignored when deciding to which run queue to append a process.

2.1.1 System V In UNIX System V (excluding Release 4) [1], when a process receives a clock tick, the elds in its job descriptor are recomputed according to: ( p cpu := p cpu + 1 p usrpri := PUSER + 0:5  p cpu + p nice Every second, the following recomputation is performed for every process: ( p cpu := p cpu=2 p usrpri := PUSER + 0:5  p cpu + p nice So here, D = 2. Usually, PUSER = 50. 5

2.1.2 4.3BSD Scheduling in 4.3BSD UNIX ([15], p. 87) is identical to that in 4.4BSD UNIX ([16], p. 92). Because we performed measurements on 4.3BSD UNIX systems, we will in the sequel only refer to that system. In 4.3BSD UNIX, the priority of a running process is only recomputed| according to (2) below|when p cpu becomes a multiple of 4 (presumably because of the division by 4 of p cpu in (2)). Every second, the following recomputation is performed for every process:

8 > > < > > :

 load  p cpu + p nice p cpu := 2 2load +1 p usrpri := PUSER + p cpu 4 + 2  p nice

(1) (2)

Here, load is the total number of jobs in the run queues as sampled by the system, divided by the number of processors. From (1), we see that D = (2  load +1)=(2  load). (This value for D has been chosen to achieve that only about 10% of the value of p cpu persists after 5  load seconds.) Usually, PUSER = 50. In 4.3BSD UNIX, a clock tick is 10 ms and a time quantum is 10 clock ticks.

2.1.3 Mach In Mach [3], a quantity l called the load factor, is de ned by

l = max(1; M=P ); where M is the number of processes and P the number of processors. A basic element in Mach scheduling is the increment in priority R0 due to one second of received CPU time when l = 1 (R0 has been set to approximately 3:8). In general, the increment in priority due to one second of CPU time is lR0. The priority of a process is 0

p usrpri = PUSER + lR T  p cpu + 0:5  p nice;

(3)

where T is the number of clock ticks per second. The e ect of the load factor in the coecient of p cpu is to keep the priorities in the same range regardless of the number of processes and processors (see Section 3.2, Remark 5). Every second, the CPU usage of every process is decayed according to:

p cpu := 5  p cpu=8 (so D = 1:6), and p usrpri is recomputed according to (3). In Mach, PUSER = 12, a clock tick is 10 ms, a quantum is 10 clock ticks, and there is a direct match between the priorities, which range from 0 through 31, and the 32 run queues. In fact, the way decay is e ectuated in Mach is somewhat di erent from what is described above. Mach maintains a global 1-second counter C and a local decay counter Cp for every 6

process p. Whenever a clock interrupt occurs at a CPU, the system e ectuates decay for the process p running on that CPU by the following computation (p cpup is the eld p cpu of process p):

8 > < > :

p cpup := p cpup =DC?C Cp := C

(4) (5)

p

Subsequently, the priority of p is recomputed according to (3). In order to avoid starvation of processes that do not get a chance to run and perform this priority recomputation, but that would run if they could adjust their priorities, every 2 seconds the system performs the computation of (4), (5) and (3) for every process. Because we are not able to model this form of decay, throughout this paper, we assume in our modeling of Mach that decay is e ectuated every second (with decay factor D = 1:6) for every process, like in UNIX. Only when discussing Mach measurements in Section 6 will we analyze the implications of the actual way Mach e ectuates decay.

2.2 The Decay-Usage Scheduling Model Our model of decay-usage scheduling is de ned as follows. 1. There are P processors of equal speed. 2. There are K classes of jobs. We consider the model from time t = 0 onwards. There is a xed set of jobs, all present at t = 0; no arrivals or departures occur. (In Section 5 we will relax this assumption.) There are Mk class-k jobs, k = 1; : : :; K . We write M^ k = Pkl=1 Ml, and assume M^ K > P . Let k0 be the index such that M^ k0  P and M^ k0 +1 > P . If M1 > P , then let k0 = 0. 3. A class-k job has base priority bk , with bk real and non-negative, and with bk < bl when k < l, k; l = 1; : : :; K . The priority of a class-k job at time t is

qk (t) = bk + Rvk (t);

(6)

with and R real, positive constants. The function vk (t) will be explained in 4. below (where it will also be shown that the priority of a job only depends on its class). 4. Time is divided into intervals of length T called decay cycles , from t = 0 onwards. Let tn = nT; n = 1; 2; : : :. The n-th decay cycle [tn?1 ; tn] is denoted by Tn. The scheduling policy is a variation of a policy known as priority processor sharing [14] and also as discriminatory processor sharing [6], that is, jobs simultaneously progress at possibly di erent rates (called their processor shares), which may change over time. The functions vk are piece-wise linear and are de ned as follows: vk (0) = 0, and for every subinterval [t1; t2 ] of a decay cycle during which all jobs of class k receive a constant processor share, say f ,

vk (t) = vk (t1 ) + f  (t ? t1 ); for t1  t  t2: 7

(7)

Furthermore, at the end of every decay cycle, the following recomputation is performed:

vk (t+n ) := vk (t?n )=D + bk ;

k = 1; : : :; K;

(8)

where D, which is called the decay factor, and  are real constants, D > 1;   0, and where t?n and t+n denote the time tn immediately before and after the decay of (8) has been performed. 5. At any point in time t, the set of jobs that receive service and the processor shares of those jobs are determined as follows. Order the classes such that qk (t)  qk +1 (t), i = 1; : : :; K ? 1, with fki ji = 1; : : :; K g = f1; 2; : : :; K g. Let r be the lowest index P +1 M > P , and let s be P such that qk (t) < qk +1 (t), that ri=1 Mk  P , and that ri=1 k the index such that qk +1 (t) = qk (t) and that qk (t) < qk +1 (t). If such an r does not exist, let r = 0; if such an s does not exist, let s = K . Now at time t, each of the jobs of classes k1; : : :; kr has a dedicated processor, the jobs of classes kr+1 ; : : :; ks evenly P share the remaining P ? ri=1 Mk processors, and classes ks+1 ; : : :; kK do not receive service. i

r

i

r

r

i

i

s

s

s

i

There are a few things to note about this scheduling policy:

 The behavior of the model does not change when the base priorities bk are replaced by bk + C , k = 1; : : :; K , for some constant C such that b + C  0; the di erences bk ? b are important rather than the values bk themselves.  The dimension of the parameter T is time. We are at liberty to express time as elapsed 1

1

time, or as the number of clock ticks delivered by a processor during a decay cycle. This gives us the opportunity to compare the behavior of our model for processors of di erent speeds by choosing di erent values for T (di erent numbers of clock ticks in a decay cycle, each clock tick representing the same amount of useful work).

 The parameters R and D may depend on the other parameters (such as P and the Mk ), but should be constant in any instance of the model.  Because of (6), (7) and (8), for n  2 we have nX ? nX ? nX ? bk D?j  vk (t?n )  T D?j + bk D?j : (9) 2

1

2

j =0

j =0

j =0

The lower bound is only reached when class-k jobs do not receive service during decay cycles T1 ; : : :; Tn, while the upper bound is only reached when class-k jobs have dedicated processors during decay cycles T1; : : :; Tn.

 Because all processors are assumed to have equal speeds, any number M of jobs can be given equal shares on any number P 0 of processors by means of processor sharing, possibly also across processors, when M  P 0 , which shows the feasibility of 5. above. 8

 When during a decay cycle a job receives service, its priority increases proportionally

to its processor share; when it does not receive service, its priority remains constant. Shares only change during a decay cycle when two priorities qk (t) and ql (t) become equal. If at time t, jobs of di erent classes have equal priorities, they will receive equal shares at any time during the remainder of that decay cycle, and their priorities remain equal. Therefore, there are at most K ? 1 points in a decay cycle where shares change. The intervals between two consecutive such points are called epochs . The l-th epoch of decay cycle n is denoted by Tn(l).

The parameters of System V are R = 0:5, D = 2, = 1,  = 0. Insofar as we con ne ourselves to 4.3BSD systems in their steady states, we can assume that the load as sampled by the system is equal to M^ K =P , so because we assume that there are more processes than processors, the de nitions of load for 4.3BSD and of the load factor l for Mach coincide, and we will in the sequel denote both by l. The parameters of 4.3BSD are R = 0:25, D = (2l + 1)=2l, = 2, and  = 1, and those of Mach are R = lR0=T , D = 1:6, = 0:5, and  = 0, with l = M^ K =P . Currently, T = 100 clock ticks in almost all implementations of these operating systems the author knows of, and throughout this paper we will assume this value for T . Also, in the sequel we will use the terms base priority and nice-value interchangeably.

2.3 Discrepancies between UNIX Scheduling and the Model There are three points where our model di ers from real UNIX scheduling (except for the shift in priority by PUSER, which has no e ect), viz.: 1. In our model we use continuous time and a continuous range for all variables in the scheduler, while actual systems use discrete time and force most variables to have integer values. In our model, a job can get any fraction of a second of CPU time per second (or any real number of clock ticks less than or equal to T during a decay cycle), while in actual systems, a process can only get an integral number of quanta per second. 2. The UNIX scheduler uses priority clamping: Because of their representations in a xed number of bits, p cpu and p usrpri are set to their maximum values whenever the computations for these elds result in larger values. Thus, in 4.3BSD, p cpu cannot exceed 255 and p usrpri cannot exceed 127 ([15], p. 87). We will see in Section 4.3 that clamping is particularly prominent in 4.3BSD systems. The issue of an upper bound of p usrpri is also addressed in [11]. 3. In many UNIX systems, the two least-signi cant bits of p usrpri are ignored when determining to which run queue to append a process.

9

2.4 The Operation of the Model In this section we describe the operation of our decay-usage scheduling model. According to the description in Section 2.2, at the beginning of the rst decay cycle T1, all jobs in classes 1; 2; : : :; k0 get dedicated processors, and their priorities all increase at the same rate R. If P = M^ k0 , this operation continues until either qk0 (t) is equal to qk0 +1(t) or until T1 nishes, whichever occurs rst. If P > M^ k0 , the jobs of class k0 + 1 share the remaining processors, and their priority increases at rate R(P ? M^ k0 )=Mk0 +1 , which is smaller than R. This operation continues until one of four things happens: 1. The priority of class k0 becomes equal to the priority of class k0 + 1. Then, the jobs in classes 1; 2; : : :; k0 ? 1 continue having dedicated processors, and the jobs in classes k0 and k0 + 1 start sharing P ? M^ k0 ?1 processors. 2. The priority of class k0 + 1 becomes equal to the priority of class k0 + 2. In this case, the jobs in classes 1; 2; : : :; k0 continue having dedicated processors, and the jobs in classes k0 + 1 and k0 + 2 start sharing P ? M^ k0 processors. 3. The priorities of classes k0 and k0 + 1 become equal to the priority of class k0 + 2 at the same time. Then, the jobs in classes 1; 2; : : :; k0 ? 1 continue having dedicated processors, and the jobs in classes k0; k0 + 1 and k0 + 2 start sharing P ? M^ k0 ?1 processors. 4. Before any of 1.{3. happens, T1 nishes. Continuing in this way, it is clear that T1 consists of at most K epochs T1(1); T1(2); : : :, with service delivered as follows. During T1(l), the jobs in classes 1; 2; : : :; i1(l) have dedicated processors, jobs in classes i1 (l) + 1; : : :; j1(l) share P ? M^ i1 (l) processors, and the remaining classes do not receive service, for some i1 (l) and j1(l). Recalling the four possibilities at the end of an epoch detailed above, we have i1 (l + 1) = i1(l) ? 1 and j1 (l + 1) = j1(l), or i1(l + 1) = i1(l) and j1(l + 1) = j1(l) + 1, or i1 (l + 1) = i1(l) ? 1 and j1 (l + 1) = j1 (l) + 1, or T1(l) nishes because T1 does. That is, either the jobs with the highest base priority among those having dedicated processors catch up with the jobs that receive service but do not have dedicated processors, the latter catch up with the jobs in the class with the lowest base priority among those that are waiting, or both. Among the classes with jobs having dedicated processors, none can catch up with the class with the next-higher base priority, because the jobs of all these classes receive service at the same rate, and so, by (6), their priorities increase at the same rate, too. We conclude that there exist values i1 and j1 such that the jobs of classes 1; : : :; i1 each have a dedicated processor during the entire rst decay cycle T1, the jobs of classes i1 + 1; : : :; j1 receive processor time during T1 but do not have dedicated processors during at least part of T1, and the jobs of classes j1 + 1; : : :; K do not receive service during T1. Also, at the end of T1, we have k = 1; : : :; i1 ? 1; j1 + 1; : : :; K ? 1; (10) qk (t?1 ) < qk+1 (t?1 ); 10

qk (t?1 )  qk+1 (t?1 ); qk (t?1 ) = qi1 +1 (t?1 ); Putting

k = i 1 ; j1 ; k = i1 + 2; : : :; j1:

(11) (12)

= (DRD? 1) + ;

in general we have ?

qk (t+n ) = qk (Dtn ) + Rbk ;

(13)

k = 1; : : :; K; n = 1; 2; : : ::

(14)

It easily follows that

k = 1; 2; : : :; K ? 1:

qk (t+1 ) < qk+1 (t+1 );

As a consequence, by induction on n, the operation of the model during Tn is analogous to that during T1 , although the starting values of the priorities, the lengths of corresponding epochs, and even the number of epochs may be di erent. We now de ne:

 in as the index such that the jobs of classes 1; : : :; in have dedicated processors during Tn, and those of class in + 1 do not; if there are no such classes, we set in = 0;  jn as the highest index such that the jobs of class jn receive a non-zero amount of processor time during Tn ;  Qn = qj (t?n ) as the highest priority attained at the end of Tn by a class that receives service during Tn . Obviously, in  jn and in  k . If M^ k0 = P , then jn  k , otherwise jn > k . Because of the way of operation of the scheduling policy explained above, we have for n  1 n

0

0

qk (t?n ) qk (t?n ) qk (t?n ) qk (t+n )

< qk+1 (t?n );  qk+1 (t?n ); = qi +1 (t?n ); < qk+1 (t+n ); n

k = 1; : : :; in ? 1; jn + 1; : : :; K ? 1; k = i n ; jn ; k = in + 2; : : :; jn ; k = 1; 2; : : :; K ? 1:

0

(15) (16) (17) (18)

The operation of the model during a decay cycle is illustrated in Figure 1. The dashed lines indicate the priorities at the start of Tn (we take = 1). On the uniprocessor, class 1 catches up with class 2 at the end of Tn (1), classes 1 and 2 catch up with class 3 at the end of Tn (2), but Tn ends before the jobs of class 4 get any service, so in = 0 and jn = 3. On the multiprocessor, classes 1 and 2 have dedicated processors during Tn , and classes 3,4 and 5 receive some service, so M^ 2 < P < M^ 5 , in = 2 and jn = 5. Note that the area between the graphs of the priorities at the beginning and at the end of a decay cycle is RPT .

11

priority

priority

b4

Qn

RT

q1(t+n?1 ) b1



b3

b2 M1

- M - 2

M3

-

M4

-

Qn q2 (t?n ) q1 (t?n ) q1 (t+n?1 ) b1



b5

b4

b3 b2 M1

-M-M - M -M-M2

3

4

5

numbers of jobs numbers of jobs (a) (b) Figure 1: Examples of decay-usage scheduling on (a) a uniprocessor and (b) a multiprocessor.

3 Analysis of the Model In this section we analyze our decay-usage scheduling model. First, we derive a set of equations and inequalities indexed by the decay-cycle number, showing how to compute for all classes their shares of CPU time obtained during Tn and their priorities at the end of Tn, given the priorities at the start of Tn, for any n  1. Because the latter only depend on the priorities at the end of Tn?1 , this allows us to compute the shares in any decay cycle iteratively. Next we show that the decay-usage scheduling policy converges in the sense that for all classes the priorities at the end of Tn , the amounts of CPU time obtained by the jobs during Tn , and the decayed CPU usages at the end of Tn , have limits for n ! 1. It follows that the priorities at the start of a decay cycle also have limits. This result enables us to suppress the decay-cycle index in the set of equations and inequalities, and solve for the steady-state shares by means of an algorithm. We then discuss what happens when the workload is not purely compute bound.

3.1 Formulation of the Solution We introduce the following notation for k = 1; : : :; K and n = 1; 2; : : ::

ck (n) = (qk (t?n ) ? qk (t+n?1 ))=R; sk (n) = ck (n)=PT; vk (n) = vk (t?n ):

(19) (20) (21)

Here, ck (n) is the amount of CPU time obtained by a class-k job during Tn , sk (n) is the share of a class-k job in Tn , and vk (n) is the accumulated and decayed CPU usage due to decay cycles 1; : : :; n at the end of Tn . From (6), (8), (19), and (21), we nd v (n) = vk (n ? 1) + c (n) + b ; k = 1; : : :; K; n = 2; 3; : : :: (22) k

D

k

k

12

b6

6

In Proposition 1 we show that the ck (n) (or equivalently, the shares sk (n)) are the solution of a system of linear equations of a size that is not known beforehand, subject to four inequalities.

Proposition 1. The amounts ck (n); k = 1; : : :; K , of CPU time and the class indices in and jn are uniquely determined by the set of equations and inequalities

8 > ck (n) > > > qk (tn? ) + Rck (n) > > > ck (n) > > K X > < Mk ck (n) > k > > ci (n) > > > qi (tn? ) + RT > > cj (n) > > : +

1

= T; = qi +1 (t+n?1 ) + Rci = 0; n

= PT;

=1

<  1 > + qj (tn?1 ) + Rcj (n)  +

n

n

n

(26)

T; qi +1 (t+n?1 ) + Rci 0; qj +1(t+n?1 );

n +1

n

n

k = 1; : : :; in; (23) (24) +1 (n); k = in + 2; : : :; jn ; k = jn + 1; : : :; K; (25)

n

n +1

(n); if in > 0; if jn + 1  K:

n

(27) (28) (29) (30)

PROOF. Because of the de nition of in and jn , a solution has to satisfy (in)equalities (23), (25), (27), and (29). (In)equalities (24), (28) and (30) are rewritten from (16) and (17), and (26) states that the CPU time consumed is equal to the amount available. So we only have to prove that there is a unique solution. For xed values of in and jn , the system of linear equations (23){(26) indeed has a unique solution, because, putting k = qk (t+n?1 )=R and substituting (23){(25) into (26), one only has to solve j X n

k=in +1

Mk ( i

n +1

+ ci

n +1

(n) ? k ) = (P ? M^ i )T: n

(31)

We now show that for a joint solution to Equations (23){(26) and Inequalities (29) and (30), jn is uniquely determined when the value of in is xed. By (24), (29) and (30), we have j < i +1 + ci +1 (n)  j +1 ; which means that for xed in , the ranges of possible values for the left-hand side of (31) are mutually disjoint for di erent values of jn , proving the assertion. Now we consider solutions for di erent values of in , so assume that there are two solutions (c1; : : :; cK ; s; t) and (c01; : : :; c0K ; s0; t0) for (c1(n); : : :; cK (n); in; jn), with s < s0 . We rst show that t  t0 . Using (24), (28), the fact that cs < T because s < s0 , (15){(17), and (24), respectively, we have n

n

n

n

0

t + c0t = s +1 + c0s +1  s + T > s + cs  s+1 + cs+1 = t + ct: If t > t0 , then ct +1 > 0, so by (24) and (30) we have t + ct = t +1 + ct +1 > t +1  t + c0t ; 0

0

0

0

0

0

0

(32)

0

0

0

13

0

0

0

(33)

which contradicts (32), and so we conclude that t  t0 . Now substituting (23){(25) into (26) for the two solutions and using (32), we have (P ? M^ s )T =

t X 0

k=s+1

Mk ( t + c0t ? k ) > 0

0

Xt k=s+1

Mk ( t + ct ? k ) = (P ? M^ s)T;

(34)

which is a contradiction.

3.2 Convergence In this section we prove that the decay-usage scheduling policy converges in the sense described in the introduction of Section 3. The main step is to prove that from decay cycle T2 onwards, the set of classes with dedicated processors is non-increasing and the set of classes that receive CPU time is non-decreasing in successive decay cycles (see Proposition 2 below).

Proposition 2. For n = 2; 3; : : :, in  in and jn  jn. +1

+1

PROOF. If in = jn , then in = jn = k0, and in+1  k0  jn+1 , so the proposition holds. Now assume in < jn . We rst prove that in+1  in for n  2. Let k; l be such that in + 1  k  l  jn. Because qk (t?n ) = Qn for k = in + 1; : : :; jn, by (14) and (19) we have for n  1

ql (t+n ) ? qk (t+n ) = R(bl ? bk ); ql(t+n?1 ) ? qk (t+n?1 ) = R(ck (n) ? cl(n)): For n  2 we have by (14) and (15){(17)

(35) (36)

!

? ? ql(t+n?1 ) ? qk (t+n?1 ) = ql(tn?1 ) ?D qk (tn?1 ) + R(bl ? bk )  R(bl ? bk ):

(37)

Combining (35), (36) and (37), we have

ql (t+n ) ? qk (t+n )  R(ck (n) ? cl (n)):

(38)

Now assume that in+1 > in , so ci +1 (n + 1) = T . Because of (15){(17) and (19), we then have qk (t+n ) + Rck (n + 1)  qi +1 (t+n ) + RT; so by (38) and because ci +1 (n) < T , we nd n

n

n

ck (n + 1)  ck (n) ? ci

n +1

(n) + T > ck (n):

Now substituting (23){(25) into (26) for decay cycles n and n + 1, we have j j X X ^ Mk ck (n + 1)  (P ? M^ i )T; (P ? Mi )T = Mk ck (n) < n

n

n

k=in +1

k=in +1

14

n

which is a contradiction. We now prove that jn+1  jn for n  2. Assume jn+1 < jn , so cj (n + 1) = 0. We rst show that ck (n + 1) < ck (n); k = in+1 + 1; : : :; jn: (39) We already know that in+1  in , and if in+1 < in , then (39) is clear for k = in+1 +1; : : :; in. By (19), (15){(17), and because cj (n + 1) = 0, respectively, we have n

n

qk (t+n ) + Rck (n + 1) = qk (t?n+1 )  qj (t?n+1 ) = qj (t+n ); n

k = in + 1; : : :; jn;

n

so by (38) and because cj (n) > 0, we nd n

ck (n + 1)  ck (n) ? cj (n) < ck (n);

k = in + 1; : : :; jn;

n

which proves (39). Now we have (P ? M^ i +1 )T = n

jX n+1 k=in+1 +1

Mk ck (n + 1)
RT , and class 2 does not receive service in T2 (in fact, class 2 starves from T2 onwards). Proposition 2 holds also for n = 1 if RD  (which is the case for System V, 4.3BSD, and Mach): An easy check shows that then (38) is also true for n = 1.

Corollary. There exist N > 0; i  0; j  K , such that in = i and jn = j for all n  N . 0

0

0

0

In fact, from TN onwards, the model operates as if the P -way multiprocessor were partitioned into M^ i0 uniprocessors|one for each job of classes 1; : : :; i0|and a (P ? M^ i0 )-way multiprocessor serving classes i0 + 1; : : :; j0. Proposition 3 is only a preparation for the Theorem.

Proposition 3. (a) If during decay cycle Tn, P 0 processors only serve job classes k; k + 1; : : :; l and only these processors serve these classes, then

Xl i=k

Mi (qi(t?n ) ? qi (t+n?1 )) = RP 0 T:

15

(40)

(b) If during decay cycles Tm ; : : :; Tn, m  n, P 0 processors only serve job classes k; k + 1; : : :; l and only these processors serve these classes, then l X i=k

Mi (qi (t?n ) ? qi (0)) = RDm?n + R 

l X i=k

Xl i=k

Mivi (t+m?1 ) + RP 0 T

Mi bi 

n?X m?1 j =0

nX ?m j =0

D?j +

D?j :

(41)

If moreover, qi (t?n ) = qk (t?n ) for i = k + 1; : : :; l, then

Pl M q (0) + RDm?n Pl M v (t ) i k i i i k i i m? + ^ ^ Ml ? Mk ? P n ? m 0 ? RP T j D j + R  Pli k Mi bi  Pjn?m? D?j ; i = k; : : :; l: (42) + M^ l ? M^ k?

qi(t?n ) =

=

+

=

1

1

=0

=

=0

1

1

PROOFS. (a) This is obvious from (6) and (7). (b) For n = m, (41) coincides with (40). For n > m, we use induction on n, applying (6), (7) and (8). Equation (42) is an immediate consequence of (41).

Theorem. The decay-usage scheduling policy converges in the sense that the limits qk =

limn!1 qk (t?n ), ck = limn!1 ck (n), and vk = limn!1 vk (n) exist. For k = 1; : : :; i0 we have  RD  (T + bk )D ; qk = + D ? 1 bk + DRTD ; c = T; v = k k ?1 D?1

P for k = i0 + 1; : : :; j0 (writing b = jk0=i0 +1 Mk bk =(M^ j0 ? M^ i0 )) we have  RD  ^ ? 1 q ? b ; v = (ck + bk)D ; (43) qk = + D ? 1 b+ ^R(P ?^Mi0 )TD ; ck = DRD k k k D?1 (Mj0 ? Mi0 )(D ? 1) and for k = j0 + 1; : : :; K we have qk =

 RD

+ D ? 1 bk ;



ck = 0;

kD : vk = Db? 1

PROOF. By the Corollary, from TN onwards, the jobs of classes 1; : : :; i0 always have dedicated processors, the jobs of classes i0 + 1; : : :; j0 are jointly served by P ? M^ i0 processors and have equal priorities at the end of Tn for all n  N , and classes j0 + 1; : : :; K starve. For each of these three groups of classes, qk can be computed from (42). For k  i0 and k > j0, the value of ck is obvious and vk can be found from vk = (ck D+ ?b1k )D ; (44) which is obtained by taking the limit for n ! 1 in (22). For k = i0 + 1; : : :; j0, ck and vk can be found from qk = bk + Rvk , which is obtained from (6), and from (44). 16

Remark 2. In Proposition 5, a more explicit formula for the ck ; k = i + 1; : : :; j will be 0

given.

0

Remark 3. For n  N , qk (t?n ) = Qn for k = i + 1; : : :; j , so by (14), ql(tn ) ? qk (tn ) = R(bl ? bk ) for i +1  k < l  j , and so by Equations (23){(26), ck (n) = ck for n  N +1 0

0

0

+

+

0

and all k. We say that the model is in the steady state, when the allocation of processors to classes does not change anymore, i.e., after the rst decay cycle Tn with in = i0 and jn = j0. While the limiting priorities of the Theorem are never attained, the shares in the steady state are equal to their limits as given in the Theorem.

Remark 4. By the Theorem, we see that as far as the shares in the steady state and the

limiting priorities are concerned, a model with  > 0 is equivalent to a model with  = 0 and with the base priorities b0k given by



b0k = 1 + (RD D ? 1)



bk ;

k = 1; 2; : : :; K:

Remark 5. Assume that i = 0 and j = K , and let Q = qk ; k = 1; : : :; K . Then (with 0

0

the parameters as in Section 2), by (43) we have for the actual systems we consider: ; (45) System V: Q = b + 100  9 l l  25 4.3BSD: Q = 4 + 2 b + l + 50; (46) 0 b Mach: Q = 2b + 8R (47) 3  2 + 10: (These equations do not include PUSER.) Note that for Mach, the priority Q is invariant with respect to P and T .

3.3 Steady-State Shares Assuming the numbers of jobs Mk ; k = 1; : : :; K to be xed, we now show how to compute the steady-state shares sk = ck =PT given the base priorities bk .

Proposition 4. The amounts ck ; k = 1; : : :; K , of CPU time and the class indices i and j0 are uniquely determined by the set of equations and inequalities

17

0

8 > > > > > > > > > > < > > > > > > > > > > :

K X k=1

ck = T; k = 1; : : :; i0; ck = ci0 +1 ? (bk ? bi0+1 ); k = i0 + 2; : : :; j0; ck = 0; k = j0 + 1; : : :; K;

(48) (49) (50)

Mk ck = PT;

(51)

ci0 +1 ci0 +1 cj0 cj0

(52) (53) (54) (55)

<  > 

T; T ? (bi0 +1 ? bi0 ) 0; (bj0 +1 ? bj0 );

if i0 > 0; if j0 + 1  K:

PROOF. In order to nd the ck , we have to solve for c1; : : :; cK ; i0; j0 the set of equations and inequalities obtained by taking the limit for n ! 1 in (23){(30). Recalling that

qk (t+n?1 ) + Rck (n) = bk + Rvk (n);

k = 1; : : :; K;

and using (44) in (24), (28), and (30), we nd the set of equations and inequalities in the proposition. In a similar way as in Proposition 1, one can prove that this set has a unique solution. Because i0 and j0 are not known beforehand, it seems that there is no closed-form expression for the ck , but they can be computed by the algorithm in Figure 2. We start by assuming that the jobs in classes 1; : : :; k0 have dedicated processors throughout a decay cycle in the steady state (step s2), and that if M^ k0 = P (M^ k0 < P ), classes k0 + 1; : : :; K (k0 + 2; : : :; K ) starve (step s3). Whenever step s8 is executed, i and j indicate the highestnumbered class that is assumed to have dedicated processors, and the highest-numbered class that is assumed to receive any service at all, respectively. In step s8, we solve the linear system consisting of Equations (48){(51), which can be rewritten as in (58) and (59). The condition of step s6 is the same as (55), the condition of step s4 is the same as (53).

Proposition 5. (a) The amounts ck ; k = 1; : : :; K , of CPU time and the class indices i and j0 are correctly computed by the algorithm in Figure 2. (b) For k = i0 + 1; : : :; j0, we have

P (P ? M^ i0 )T + ji=0 i0 +1 Mi (bi ? bk ) ck = : M^ j0 ? M^ i0

0

(56)

(c) For k; l = i0 + 1; : : :; j0, we have

P sk = (P ? M^ i0 )T + ji=0 i0+1 Mi (bi ? bk ) : sl (P ? M^ i0 )T + Pji=0 i0 +1 Mi (bi ? bl) 18

(57)

input: P; T; R; D; ; ; K; Mk; bk; k = 1; 2; : : :; K output: i ; j ; ck; k = 1; 2; : : :; K s1: i := k + 1; if (M^ k0 = P ) then j := k else j := k + 1; s2: for k := 1 to k do ck := T od; s3: for k := j + 1 to K do ck := 0 od; s4: do s5: i := i ? 1; j := j ? 1 s6: do 0

0

0

0

0

0

s7: j := j + 1 s8: Solve for ck ; k = i + 1; : : :; j the following set of equations:

8 > ck = > < j X > > : k i Mk ck = until ( (j = K ) or (cj  (bj until ( (i = 0) or (ci  T ? (bi

ci+1 ? (bk ? bi+1 ); (P ? M^ i )T

= +1

+1

+1

k = i + 2; : : :; j

+1

(58) (59)

? bj )) ) ? bi)) )

s9: i0 = i; j0 = j Figure 2: Algorithm for the computation of the steady-state shares. PROOFS. (a) Clearly, the algorithm in Figure 2 computes a solution that satis es Equations (48){(51) and Inequalities (53) and (55). To prove the algorithm correct, we have to prove that (52) and (54) are also satis ed. We show this by proving that (ci+1 < T ) and (cj > 0)

(60)

is an invariant of the algorithm. Immediately after step s8 has been executed for the rst time, if M^ k0 = P , we have i = j = k0 , ck0 +1 = 0 < T , and ck0 = T > 0, and if M^ k0 < P , we have i + 1 = j = k0 + 1, and ? M^ k0 )T < T; 0 < ck0 +1 = (P M k0 +1

so (60) holds. Now suppose that for some execution of step s8, (60) holds. Let i and j be the loop indices of this execution, and let ci+1; : : :; cj be the solution of (58) and (59), so j X k=i+1

Mk (ci+1 ? (bk ? bi+1 )) =

j X k=i+1

19

Mk (cj ? (bk ? bj )) = (P ? M^ i )T:

(61)

The body of step s6 is re-executed if either the condition of step s6 or the condition of step s4 is not satis ed. In the rst case, we have By (58), we then also have

cj > (bj+1 ? bj ):

(62)

ci+1 > (bj+1 ? bi+1 ):

(63)

Denoting the solution of (58) and (59) of the re-execution by c0i+1 ; : : :; c0j +1, we have jX +1 k=i+1

Mk (c0i+1 ? (bk ? bi+1 )) =

jX +1 k=i+1

Mk (c0j+1 ? (bk ? bj+1 )) = (P ? M^ i )T:

(64)

Comparing the left-hand sides of (61) and (64) and using (63), we nd (M^ j +1 ? M^ i )c0i+1 = (M^ j ? M^ i )ci+1 + Mj +1 (bj +1 ? bi+1 ) < (M^ j +1 ? M^ i )ci+1 ; and so c0i+1 < ci+1 < T . Comparing the middle terms of (61) and (64) and using (62), we nd (M^ j +1 ? M^ i )c0j +1 = (M^ j ? M^ i )(cj ? (bj +1 ? bj )) > 0; and so c0j +1 > 0. Now assume that the body of step s6 is re-executed because the condition of step s4 is not true, so ci+1 < T ? (bi+1 ? bi ): (65) By (58), we then also have cj < T ? (bj ? bi): (66) Denoting the solution of (58) and (59) of the re-execution by c0i ; : : :; c0j , we have j X k=i

Mk (c0i ? (bk ? bi )) =

j X k=i

Mk (c0j ? (bk ? bj )) = (P ? M^ i?1 )T:

(67)

Comparing the left-hand sides of (61) and (67) and using (65), we nd j X (M^ j ? M^ i?1 )c0i = Mi T + Mk (ci+1 ? (bi ? bi+1)) < (M^ j ? M^ i?1 )T; k=i+1

so c0i < T . Comparing the middle terms of (61) and (67) and using (66), we nd (M^ j ? M^ i?1 )c0j = (M^ j ? M^ i )cj + Mi (T ? (bj ? bi )) > (M^ j ? M^ i?1 )cj ; and so c0j > cj > 0. (b) The formula for the ck can be obtained by substituting (48){(50) into (51), computing ci0 +1 , and using (49). (c) The share ratios follow directly from (b). 20

As a special case, when there are only two classes, class-1 jobs do not have dedicated processors, and class-2 jobs do not starve, then (57) can be written as

s1 = PTRD + ( (D ? 1) + RD)M2(b2 ? b1) ; s2 PTRD ? ( (D ? 1) + RD)M1(b2 ? b1)

(68)

which for the three actual systems we consider, yields s1 = 100P + M2(b2 ? b1) ; (69) System V: s2 100P ? M1 (b2 ? b1) s1 = 25P + (1=((M1 + M2 )=P + 0:5) + 0:25)M2(b2 ? b1 ) ; (70) 4.3BSD: s2 25P ? (1=((M1 + M2 )=P + 0:5) + 0:25)M1(b2 ? b1 ) s1 = 60:8(M1 + M2) + 3M2(b2 ? b1) : Mach: (71) s2 60:8(M1 + M2) ? 3M1(b2 ? b1)

3.4 Heterogeneous Workloads It is a natural question how our decay-usage policy behaves under a heterogeneous workload consisting of long and short compute-intensive jobs (for instance, real-time system functions), or when some jobs are not compute-bound (for instance, interactive work). In the former case, it seems impossible to analyze the impact of the short-running jobs on the long-running ones with only a stochastic description of the former part of the workload in terms of the distributions of the inter-arrival times and service times. Only when the base priorities of the short jobs are so low that their priorities never get as high the priorities of the long jobs, and when the short jobs jointly take a constant amount of time during each decay cycle, the solution is simple: In order to nd the shares of the long-running jobs, replace T by the amount of time T 0 remaining for these jobs. When there are also jobs that perform I/O operations, our analysis is still valid provided that the amount of time spent waiting for I/O per decay cycle and per job is not very large. During an I/O operation, a job is suspended and so its priority remains constant, assuming that the operation is completed in the same decay cycle in which it started. For disk I/O operations, this is probably very often the case, because such an operation takes on the order of tens of milliseconds and the length of a decay cycle is one second. When after the I/O operation has nished, the job becomes runnable again, its priority will fall short of the priority of the other jobs in its class, so it will be preferred for using a CPU until its priority becomes equal to that of the other jobs in its class.

4 Exercising Control In this section we deal with the control that can be exercised by the decay-usage scheduling policy over the share ratios. We show how to set the base priorities given the required shares, we trace the in uence of the scheduler parameters on the share ratios that can be attained, and we investigate the e ect on these ratios of the bounds on the CPU usage and priority elds in actual systems. 21

4.1 Achieving Share-Ratio Objectives In Section 3.3 we showed how to compute the steady-state shares from the base priorities. Conversely, one may want to set share-ratio objectives in terms of the required amounts ck of CPU time in a decay cycle or in terms of the required shares sk , and compute a set of base priorities bk (or rather the di erences bk ? b1) yielding these shares. We can assume that T > c1 >    > cK > 0 (or 1=P > s1 >    > sK > 0), so i0 = 0 and j0 = K . In addition, P we assume rst that all the available capacity is requested, that is, that Kk=1 Mk ck = PT P (or Kk=1 Mk sk = 1). Then there is always a solution, which can be found by inverting (58), after putting b1 equal to an arbitrary non-negative value:

bk = b1 + (c1 ? ck )= ;

k = 2; : : :; K:

(72)

For P = 1, (72) coincides with Equation 20 of [11]. In actual systems, the values of the base priorities are con ned to be integers. Then the integer solution which is closest to the solution of (72) has to be chosen, which may yield share ratios that deviate considerably from the objectives. In [11], the behavior of the decay-usage policy for uniprocessors is also analyzed in the P underloaded case characterized by Mk ck < PT , and in the overloaded case de ned by P M c > PT . It is shown that in either case s0 ? s = s0 ? s ; k = 2; : : :; K , where k 1 k k 1 k the s0k denote the obtained steady-state shares, provided that no starvation occurs in the overloaded case, i.e., that s0K > 0. Measurements in [11] show that the policy indeed behaves in this way in practice. This property of equal di erences between the required and the obtained shares clearly carries over to multiprocessors in those cases when no class has dedicated processors and no class starves: The underloaded and overloaded cases correspond to lengthening and shortening the decay cycle, which in general amounts to lengthening and shortening the last epoch, in which all jobs of all classes evenly share all processors. So the decay-usage scheduling policy is fair in the sense that an excess or de cit of capacity is spread equally over all jobs. It would perhaps be a more desirable, and fairer, policy if it enjoyed the property that s0k =sk = s01 =s1 ; k = 2; : : :; K . Clearly, lottery scheduling [18], stride scheduling [19] and time-function scheduling [7] do enjoy this property. One cannot easily employ the decay-usage scheduling policy to achieve Priority Processor Sharing (PPS, alternatively called discriminatory processor sharing, see [6, 14]) or Group Priority Processor Sharing (GPPS). In either of these two policies, for each class k, a P priority rk is de ned. In PPS, every job of class k has a processor share of rk = Ml rl , and so the share ratios of rk =rl are constant, independent of the numbers of jobs in the classes. P In GPPS, the jobs of class (or group) k jointly have a processor share of rk = rl , and jobs within a class have equal shares. By (57), achieving these policies with decay-usage scheduling entails complicated recomputations of the base priorities on arrivals and departures of jobs. One can however easily modify the decay-usage scheduling policy to implement PPS in a simple way. This modi cation consists in setting bk = 0; k = 1; : : :; K; D = 2; = 1, 22

and  = 0, and in replacing R by class-dependent parameters Rk , k = 1; : : :; K . Then, sk =sl = Rl=Rk , so one should put Rk = 1=rk . In fact, this modi cation of decay-usage scheduling is nothing else than stride scheduling [19]. PPS and GPPS can easily be achieved by lottery scheduling [18] and by stride scheduling [19]. For PPS, on arrival, a job simply gets the same number of tickets as the other jobs in its group, and after a departure, nothing has to be done. For GPPS, the currency of the group of an arriving or a departing job has to be in ated or de ated, respectively. GPPS is also achieved by time-function scheduling [7].

4.2 Scheduler Parameters and the Range of Share Ratios

In this section we will trace the impact of the values of the parameters in the decay-usage scheduling model on the share ratios given by (57), where we assume that i0 = 0 and j0 = K. As far as the steady-state shares are concerned, the scheduler parameters R, D, β, δ and the system parameters P, T are not independent: only the value of γ/(PT) is relevant. The larger this value is, the higher the level of control that can be exercised by the decay-usage scheduling policy, that is, the larger the range of possible share ratios. If we assume that R and D are constants, γ = β(D − 1)/(RD) + δ can assume any positive value, and so, as far as the steady-state share ratios are concerned, one parameter in the scheduler would be sufficient, instead of four. For instance, one can take D = 2, β = 1, and δ = 0 (these are the values of System V), with R the only remaining scheduler parameter.

A further consequence of (57) for constant R and D is that it is immaterial whether there are P processors of equal speed (each delivering T clock ticks per decay cycle), or one processor which is P times as fast (delivering PT clock ticks per decay cycle). In either case, an amount PT of processor time is delivered in one decay cycle, and the steady-state shares are equal. In addition, in order to achieve the same level of control for fixed numbers of jobs for different values of P and T, the range of base priorities has to be proportional to either of these two parameters. In large multiprocessors, one may have the option to logically partition the system into a set of multiprocessors with a smaller number of processors each, for instance, in order to reduce the contention for the central run queue or so as to assign parts of the machine to different sets of applications. Assuming that the ranges of the base priorities will be the same in the components of the partitioned multiprocessor and in the original system, and that the numbers of jobs will be roughly proportional to the sizes of the components, partitioning yields about the same level of control.

Using the values of the parameters given at the end of Section 2.2, we have for the three systems considered:

    System V:  γ = 1,  (73)
    4.3BSD:    γ = (2l + 9)/(2l + 1),  (74)
    Mach:      γ = 3PT/(16 R₀ M̂_K) ≈ 5/l.  (75)

For fixed values of P and T and for a fixed range of base priorities, this means that 4.3BSD has a higher level of control than System V, that Mach has a higher (lower) level of control than System V for loads lower (higher) than 5, and, finally, that 4.3BSD has a higher level of control than Mach, except for loads smaller than about 2. As an example, consider the case of two classes with P = M1 = M2 = 1. Then, because b2 − b1 ≤ 19, we find from (69), (70), and (71) that in these three systems, s1/s2 ≤ 1.47, s1/s2 ≤ 2.95, and s1/s2 ≤ 2.60, respectively. (In Mach, b2 = 19 is treated as if b2 = 18 because of the multiplication by β = 0.5.) For two cases with two classes, the levels of control for the three systems are depicted in Figure 3. In Figure 3a, the load equals 6, and 4.3BSD has the highest level of control and Mach the lowest over the whole range of base priorities of class 2. In Figure 3b, for varying load, the share ratio is constant for Mach (cf. (71)); again 4.3BSD has the highest level of control, and the graphs for System V and Mach intersect between the loads of 4 and 6.

In Mach, in which R is not a constant, because of (75) the share ratios given in (57) reduce to

    s_k/s_l = [16 R₀ M̂_K + 3 Σ_{i=1}^K M_i (b_i − b_k)] / [16 R₀ M̂_K + 3 Σ_{i=1}^K M_i (b_i − b_l)],  k, l = 1, 2, ..., K,  (76)

which is invariant with respect to P and T, so the range of base priorities does not have to be adjusted for multiprocessors or for different processor speeds. If R and D are constants (as in System V), or if they depend on the other parameters of the model in such a way that γ only depends on the load l (as in 4.3BSD), then by (57), the share ratios do not change when P and the M_k are replaced by αP and αM_k, k = 1, 2, ..., K, with α a positive integer. By (76), the share ratios in Mach do not even change when only the M_k are replaced by the same multiples. As to the differences between the steady-state shares of classes, from (49) we find

    s_k − s_l = γ(b_l − b_k)/(PT),  k, l = 1, ..., K,

so again, if R and D are constants, increasing P and/or T reduces the contrast among classes, while increasing γ increases it, as was already concluded in [11], Section IV-A, for the uniprocessor case of System V.
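To make the comparison concrete, the following sketch (ours) computes the two-class share-ratio bounds quoted above from s_k being proportional to PT + γ Σ_i M_i(b_i − b_k); the γ expressions are the reconstructions (73)-(75), and the value R₀ = 3.8 for Mach is our assumption, inferred from the constants appearing in (76) and (79).

    # Two-class share ratio from the combined parameter gamma, assuming
    # no dedicated processors and no starvation (i0 = 0, j0 = K).
    def share_ratio(gamma, M, b, P=1, T=100):
        # s_k is proportional to PT + gamma * sum_i M_i (b_i - b_k)
        raw = [P * T + gamma * sum(Mi * (bi - bk) for Mi, bi in zip(M, b))
               for bk in b]
        return raw[0] / raw[1]

    M, P, T = [1, 1], 1, 100                # two jobs on one processor, l = 2
    l = sum(M) / P
    for name, gamma, b2 in [("System V", 1.0, 19),
                            ("4.3BSD", (2 * l + 9) / (2 * l + 1), 19),
                            ("Mach", 3 * P * T / (16 * 3.8 * sum(M)), 18)]:
        print(name, round(share_ratio(gamma, M, [0, b2], P, T), 2))
    # -> about 1.47, 2.95, and 2.60, the bounds quoted above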

4.3 Bounds in the Scheduler

In actual systems, the values of p_cpu and p_usrpri are stored as non-negative integers in fields of finite size, and so these values each have a maximum. The corresponding values in the model are v_k and PUSER + q_k. We will now qualitatively indicate how our decay-usage scheduling policy behaves when there is either only a bound for the CPU usages or only a bound for the priorities; considering what happens when there are bounds for both quantities is rather complicated, and in the actual systems we consider, only one of the bounds plays a role. We refer to the model with (without) bounds as the constrained (unconstrained) model.

[Figure 3: Ratio of shares (s1/s2) versus (a) the base priority of class 2 (b2) and (b) the load (l) for different parameterizations of the decay-usage scheduler (K = 2, b1 = 0; in (a), M1 = 5P, M2 = P; in (b), M1 = M2 = MP, l = 2M, b2 = 10).]

Throughout this section, we assume that i0 = 0 and j0 = K in the unconstrained model.

Let us first assume there is a bound v̄ on the CPU usage. In the unconstrained model, v1 > v_k, k = 2, ..., K, so if some v_k attains the value v̄, v1 will certainly do so. Assume that when v1 attains v̄ during a decay cycle in the steady state, classes 1, ..., K* receive service. From then on, the priority of class 1 cannot increase anymore, but the slightest additional amount of service to any of the classes k = 2, ..., K* will increase its priority q_k beyond q1, and so during the remainder of the decay cycle, either class 1 monopolizes all CPUs and classes 2, ..., K* (and K* + 1, ..., K) do not receive service (M1 ≥ P), or the class-1 jobs will continue using dedicated CPUs, leaving the remainder (at that instant) to classes 2, ..., K* (M1 < P). Subsequently, v2 may attain the value v̄, etc. We conclude that the bound v̄ on the CPU usage favors the lower classes.

We now turn to the case of a bound, say q̄, on the priorities. We claim that such a bound in general favors the higher classes. In the unconstrained model, we denote by q_i^u(t) the priority of class i at time t, and by Q the limiting priority of all classes at the end of a decay cycle in the steady state. Let q_i'(t) = b_i + R v_i'(t) be the virtual priority of class i in the corresponding constrained model, where v_i' is defined as v_i in item 4 of Section 2.2, and let q_i' = lim_{n→∞} q_i'(t_n⁻). The (real) priority q_i^c(t) of class i in the constrained model is equal to q_i^c(t) = min(q̄, q_i'(t)). The amount c_i^c(n) of CPU time of class i in decay cycle T_n in the constrained model is given by (cf. (19))

    c_i^c(n) = [q_i'(t_n⁻) − q_i'(t_{n−1}⁺)]/R.  (77)

If Q ≤ q̄, the constrained model behaves exactly like the unconstrained model, and the steady-state shares are the same. Let us now assume that Q > q̄.

Then, in the constrained model, q_i' > q̄, i = 1, ..., K, because for at least some i this has to hold, and if it did not hold for some class, that class would have dedicated processors, and as a consequence, q_i' would even exceed Q. Now if lim_{n→∞} q_i'(t_n⁺) ≤ q̄ for i = 1, ..., K, and if all classes reach priority q̄ in the steady state simultaneously, obviously, the unconstrained and the constrained models again behave in the same way, and q_i'(t) = q_i^u(t) for t ≥ 0 and i = 1, ..., K. Now assume that in every decay cycle T_n with n ≥ n₀ for some n₀, some class l attains priority q̄ at least an amount of time t₀ > 0 before some class k does, with k < l. (It is easy to see that this is possible; class l may even have q_l^c(t_n⁺) = q̄ in the steady state.) Because all classes whose (real) priorities are equal to q̄ share processors evenly, we then have q_l'(t_n⁻) > q_k'(t_n⁻) for n ≥ n₀, and q_l' > q_k' (the difference between the latter can be bounded from below by an expression in t₀ and the parameters of the model). Similarly as in (14), we have

    q_i'(t_n⁺) = q_i'(t_n⁻)/D + γR b_i,  i = 1, ..., K,

and so, using (77) and putting c_i^c = lim_{n→∞} c_i^c(n), we have

    c_k^c − c_l^c = (D − 1)(q_k' − q_l')/(RD) + γ(b_l − b_k) < γ(b_l − b_k),

which by (49) proves our claim. In particular, in the constrained model, the classes i with q_i^c(t_n⁺) = q̄ in the steady state cannot even be distinguished anymore. Also in [11], Section IV-C, it was concluded that an upper bound on the priority reduces the level of control.

Usually in actual systems, p_cpu ≤ 255, and in System V and in 4.3BSD, p_usrpri ≤ 127, where the latter value contains the constant PUSER = 50. In general, by (9) we have

    v_k(t_n⁻) ≤ (T + δ b_k) D/(D − 1).

Because in System V, D = 2 and δ = 0, only the bound on the priority may be attained. Now by (45), the condition that the system operates in an unconstrained fashion is

    b̄ + 100/l ≤ 77,

so if l is not very low, this bound is not reached. This conclusion is in accordance with [11]. For 4.3BSD, assume b1 = 0. Then by (1) and (2), only the bound on the CPU usage p_cpu matters. So the condition that the system operates in an unconstrained fashion is v1 ≤ 255. Because q1 = Q and R = 0.25, and by (6) and (46), this is equivalent to

    (9 + 2l) b̄ + 100/l ≤ 55.  (78)

For instance, when K = 2 and M1 = M2 = P, then v1 > 255 for all values of b2 > 0. We have done measurements on 4.3BSD uniprocessor systems in cases with two classes in which, according to the model, the bound v̄ is attained by v1, and these confirm the conclusion that in such a case, the bound v̄ favors the lowest class (cf. Section 6.4). In Mach, by (47) the bound of 31 on the priority cannot be reached, because PUSER = 12 and b̄ ≤ 18. Because q1 = Q and R = lR₀/T, and by (6) and (47), the condition v1 ≤ 255 is equivalent to

    800/(3l) + 50(b̄ − b1)/(3.8l) ≤ 255,  (79)

which is always satisfied when l ≥ 2, because b̄ ≤ 18.
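The following sketch (our own, based on the reconstructed conditions above; b_avg denotes the load-weighted average base priority Σ_k M_k b_k / M̂_K, which is an assumption of this reconstruction) evaluates these bounds numerically.

    # Numeric check of the unconstrained-operation conditions, with l the
    # load and b_avg the weighted average base priority (our notation).
    def sysv_unconstrained(b_avg, l):
        return b_avg + 100.0 / l <= 77           # priority bound p_usrpri <= 127

    def bsd_unconstrained(b_avg, l):             # CPU-usage bound, cf. (78)
        return (9 + 2 * l) * b_avg + 100.0 / l <= 55

    def mach_v1_ok(b_avg, l, b1=0):              # cf. (79)
        return 800.0 / (3 * l) + 50.0 * (b_avg - b1) / (3.8 * l) <= 255

    # K = 2, M1 = M2 = P, b2 = 1: 4.3BSD exceeds the p_cpu bound (v1 > 255)
    print(bsd_unconstrained(b_avg=0.5, l=2))     # -> False
    # Mach with the largest base priority b = 18 and l = 2: still satisfied
    print(mach_v1_ok(b_avg=18, l=2))             # -> True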

5 Analysis of the Transient Behavior

In our model we have made the rather artificial assumption that all jobs are present at time t = 0, and that no jobs depart. In this section we do away with this assumption by showing that the Corollary and the Theorem still hold if, after some point in time, no arrivals or departures occur. In order to do so, we shift the time origin to the start of the first decay cycle after the last arrival or departure. Then we cannot assume anymore that the v_k(0⁺) are equal to zero. However, our treatment remains valid when this is not the case, as long as the orderings of the classes according to their base priorities and according to their priorities at t = 0⁺ coincide (i.e., as long as priority inversion (see below) is absent). This means that it is sufficient to show that when priority inversion does occur, it disappears within a finite number of decay cycles. We give upper bounds on this number, and subsequently, we show how to deduce the rate of convergence to the steady state starting from a situation without priority inversion. Putting these two results together, we have dealt with the transient behavior of our model in that we have determined a bound on the rate of convergence to the steady state starting from any situation that can come about in the model (including arrivals and departures). In general, it is not necessary to add the amounts of time needed to let priority inversion disappear and to converge from there to the steady state: the disappearance of priority inversion is itself a step towards convergence, and in fact, when there are only two classes, it is the same. Our conclusion will be that the rate of convergence can theoretically be arbitrarily low. However, in System V and Mach, convergence takes at most about 20 seconds, while in 4.3BSD, it may take some minutes.

5.1 Priority Inversion

In dealing with arrivals, we face two problems. First, even if the base priority of an arriving job is equal to that of some class already present, we cannot in general include the job in that class, because in our treatment, we assumed that the jobs in the same class have equal CPU usages. However, there is no difficulty in having simultaneously arriving jobs with equal base priorities constitute a class of their own, because while we assumed that b_k < b_l for k < l, our treatment remains valid if we had only assumed that b_k ≤ b_l when k < l. Second, priority inversion may occur. We say that priority inversion occurs for classes k, l with b_k < b_l at time t, if q_k(t) > q_l(t). Clearly, departures do not cause additional problems.

We now show that if priority inversion does occur at time t = 0 for classes k and l with k < l, then within a finite number of decay cycles, it has disappeared. By (6) and (8), we have

    q_i(t_n⁺) = β b_i + R( v_i(0⁺)/Dⁿ + Σ_{j=1}^{n} c_i(j)/D^{n+1−j} + δ b_i Σ_{j=0}^{n−1} D^{−j} ),  i = k, l.  (80)

The priority inversion has disappeared when q_k(t_n⁺) ≤ q_l(t_n⁺). Because when priority inversion occurs at the beginning of T_j, c_k(j) ≤ c_l(j), this certainly holds when

    R( v_k(0⁺) − v_l(0⁺) )/Dⁿ ≤ β(b_l − b_k) + Rδ(b_l − b_k) Σ_{j=0}^{n−1} D^{−j}.  (81)

By (8) and (9), when during a run of the model D is bounded from below by D̲ > 1 (in particular, when D is a constant, we can take D̲ = D), we have

    v_k(0⁺) < (T + δ b_k) D̲/(D̲ − 1).  (82)

Now replacing v_k(0⁺) by this bound, putting v_l(0⁺) = 0, and setting b_l − b_k to its smallest possible value, we can find the lowest value of n for which (81) holds. Clearly, in theory it can take arbitrarily many decay cycles before priority inversion has disappeared. One has to keep in mind that in actual systems, the (base) priorities and the CPU usages are stored in integers, and that the result of any operation on them is rounded down. One easily checks that in order to satisfy (81), in System V and Mach, n = 6 and n = 9 are sufficient, respectively. For 4.3BSD, we saw in Section 4.3 that v_k(0⁻) can attain its bound of 255. Then by (8), v_k(0⁺) can also attain this bound, so for 4.3BSD, we have to put v_k(0⁺) = 255 in (81). For l ≥ 127, each division of v_k(0⁺) by D = (2l + 1)/(2l), where l is the load after t = 0, lowers its value by only 1, so lifting the priority inversion may take as many as 255 decay cycles. However, for l ≤ 10, n = 64 suffices.
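A small sketch (ours; the parameter values and the function name are assumptions for illustration) finds the smallest n satisfying the sufficient condition (81) for given scheduler parameters, initial usage v0 and smallest base-priority difference db.

    # Smallest n satisfying (81), in continuous arithmetic (no rounding).
    def inversion_gone(R, D, beta, delta, v0, db=1, n_max=1000):
        for n in range(1, n_max + 1):
            lhs = R * v0 / D ** n
            geo = sum(D ** -j for j in range(n))     # sum_{j=0}^{n-1} D^-j
            if lhs <= beta * db + R * delta * db * geo:
                return n
        return None

    # System V-like parameters: R = 0.5, D = 2, beta = 1, delta = 0, and
    # v0 at its bound (82): T * D / (D - 1) = 200 for T = 100.
    print(inversion_gone(R=0.5, D=2, beta=1, delta=0, v0=200))
    # -> 7; with the integer rounding of actual systems, n = 6 already
    #    suffices, as stated above.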

5.2 The Rate of Convergence

In theory, it can take arbitrarily many decay cycles before the steady state as defined in Remark 3 is attained in the model. In order to demonstrate this, take the instance of the model with P = 1, K = 2, δ = 0, and let Q be the limiting priority of class 1 if class 2 were absent. Now choose b2 such that q2(0) = Q − ε for some small positive ε.

We now show how one can derive the rate of convergence of the decay-usage scheduling policy, i.e., the number of decay cycles before the classes receive their steady-state shares. Because of arrivals and departures as treated in Section 5.1, we allow v_k(0⁺) > 0, but we assume that there is no priority inversion. In addition, we assume that i₂ ≤ i₁ and j₂ ≥ j₁, by taking t₁ = T as the origin of time if necessary (cf. Proposition 2 and Remark 1), and that i0 = 0 and j0 = K. By Proposition 2, either class 1 or class K is the last one to join the set of classes sharing all P processors. In the former case, which is only possible when M1 < P, we can use (42) with m = 1 for k = l = 1, P′ = M1, and for k = 2, l = K, P′ = P − M1, respectively, to find the smallest value of n for which the priority of class 1 with dedicated processors would at least be equal to the priority of classes 2, ..., K. If class K is the last to join the other classes in sharing all P processors, we can use (42) with m = 1 for k = 1, l = K − 1, P′ = min(P, M̂_{K−1}) and for k = l = K, P′ = max(0, P − M̂_{K−1}) to find the smallest value of n for which the priority of classes 1, ..., K − 1 would at least be equal to the priority of class K.

Because of the rounding down to integer values, we see from (42) that the priorities have certainly reached their limiting values for n → ∞ in System V, Mach, and 4.3BSD for n = 7, n = 10, and n = 256, respectively. It can be shown that in all 4.3BSD measurements reported in Section 6.4, in which we start with all jobs at time 0 and so v_k(0⁺) = 0 for all k, the steady state must have been reached within 17 decay cycles.
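The role of this rounding can be made concrete with a small sketch of our own: counting the floor-division decay steps until the usage stops changing shows why a decay factor close to 1, as in 4.3BSD under high load, converges so much more slowly.

    # Decay cycles until the decayed usage reaches its fixed point under
    # integer (floor) arithmetic; the decay factor is given as a fraction
    # den/num, so one step is v <- floor(v * num / den).
    def cycles_to_fixpoint(v0, num, den):
        n, v = 0, v0
        while True:
            w = (v * num) // den       # rounding down, as in actual systems
            if w == v:
                return n
            v, n = w, n + 1

    print(cycles_to_fixpoint(255, 1, 2))            # System V (D = 2): 8
    l = 127
    print(cycles_to_fixpoint(255, 2 * l, 2 * l + 1))  # 4.3BSD, high load: 255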

6 Validation: Simulations and Measurements

In order to validate our model, we have done measurements on UNIX and Mach systems, and we have performed simulations of our decay-usage scheduling model. We will compare the share ratios as predicted by our model according to (57), as measured in actual systems, and as obtained from the simulations, for different experiment sets.

6.1 The Set-up of the Experiments

Measurements were taken on 1- and 4-processor Sun Sparcstations running SunOS 4.1.3, on a 4-processor Sequent running DYNIX version V3.0.17.9, and on a Sequent running Mach 3.0 with 1, 4, 8, or 16 CPUs enabled. The first two of these systems are based on 4.3BSD. Traces of SunOS and DYNIX (taken 10 times a second) show that these systems adhere to the priority calculations as in (1) and (2), including the use of a decay factor equal to (2l + 1)/(2l), with l = M̂_K/P on the uniprocessor Sun and on the Sequent under DYNIX for any number of CPUs, but with l = M̂_K on the 4-processor Sun (i.e., not divided by P = 4; this difference shows in the documentation of this system). We also found in DYNIX a deviation from the description of UNIX scheduling in Section 2.1. We will discuss the consequences of the deviations of these two systems in Section 6.4. Mach operates as described in Section 2, as far as we can gather from data reported by the system. Because β = 0.5 in Mach, we only consider even values of base priorities in that system.

For the measurements, the jobs ran for (at least) 20 minutes in each experiment, and the share ratios s_k/s_l are computed as the quotient of the average amounts of CPU time obtained during the last 10 minutes by jobs of classes k and l, respectively. The reason for having the jobs first run for 10 minutes before actually measuring is to let the system build up the load l and reach the steady state. Within each experiment set, an experiment was started at least five minutes after the end of the previous experiment, in order to eliminate any residual effects in the scheduler.

In Section 2.3, we mentioned the discrepancies between scheduling in UNIX and in our model: discrete time versus continuous time, bounds on the values of the CPU usage and the priority, and the mapping of 128 priorities to 32 run queues. These discrepancies, and possibly other details of implementations of UNIX and Mach, may cause differences between the predictions of our model and the measurements. In order to exclude any such effects, so that we might isolate the influence of discrete time versus continuous time, we have built a simulator of the UNIX scheduler that does not employ bounds on the CPU usages or the priorities. It is characterized by the parameters P, T, R, D, β, δ, K, M_k, b_k of our model, and by the number of clock ticks in a time quantum. It can run with and without division by 4 of the priority when entering a job in a run queue. We have observed in the simulation output that the effect of this division is to lower the share ratios s_k/s_l, k < l, as may be expected, because the scheduler can then distinguish less well between different priorities. However, we found this effect to be marginal, and all simulations reported below are without this division by 4. In the simulations, the simulated time in each experiment was also 20 minutes, and the ratios of shares were computed in the same way as for the measurements. In coarse-grained (cg) simulations, we used the values of β and δ given in Section 2, we put T = 100, and we let the number of clock ticks per quantum be 1 (System V) or 10 (4.3BSD and Mach). It turned out that the results of such simulations can deviate greatly from the model output. Because we suspected that the coarse discretization of time was to blame for this, in some cases we also ran fine-grained (fg) simulations, in which T = 1000, in which β and δ were replaced by 10β and 10δ, respectively, and in which the number of clock ticks per quantum was always 10. In addition, for Mach, R₀ was replaced by 10R₀. By (13) and (57), this does not change the ratios of shares, but continuous time is approximated more closely. Simulations were coarse-grained, unless otherwise stated.

UNIX reorders the run queues at the end of each second by a linear scan of the process table. As a consequence, the order in which the jobs in an experiment appear in this table potentially has a strong effect on the share ratios, favoring jobs in lower positions in the table. It turns out that this order is usually the order of the creation times of the jobs. Our simulator mimics the UNIX process table, so a similar problem occurs there. In order to exclude any bias due to the order in the process table, both in the simulations and in the measurements, jobs are created in random order with respect to their base priorities. (This is the only random element in our simulations.) However, in both the simulations and the measurements, the variability of the shares of CPU time obtained by the jobs within a single class was very small (at most 3%, usually much smaller). Also, the share ratios obtained from different simulation runs and different measurements (the latter were all performed twice) for the same experiments showed little variability, and so for both, below we always simply report a single result.

In our model, the ratios of the steady-state shares s_k/s_l are realized during every single decay cycle. Clearly, if a steady state is reached in a simulation, in the sense that at the start of successive decay cycles all priorities have the same values and the jobs have the same order in the run queues, there is only a very restricted set of possible share ratios, and we can expect large deviations from the model. For instance, if P = 1, M1 = 3, and M2 = 1, then on a 4.3BSD system, which has 10 quanta a second, only s1/s2 = 1/7, 1/2, 3, or ∞ are possible. Below we will find that there is sometimes a very good match of the model and the simulations, which almost always means that there is no steady state in the simulations and that the share ratios are only achieved across a number of decay cycles, a situation for which our model says nothing. The same holds for the measurements.
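To give an impression of such a simulator, here is a heavily simplified single-processor sketch of our own (one clock tick per quantum, no run queues or process-table reordering, continuous priorities); it is not the simulator used for the experiments, but under System V-like parameters it exhibits the same convergence of the share ratios, as we read the model.

    # Minimal tick-level decay-usage scheduler, one processor.
    # Priority q_k = beta*b_k + R*v_k; the job with the lowest priority
    # value runs for one tick; at the end of each T-tick decay cycle,
    # v <- v/D + delta*b.
    def simulate(b, T=100, cycles=60, R=0.5, D=2.0, beta=1.0, delta=0.0):
        v = [0.0] * len(b)                 # accumulated, decayed CPU usage
        got = [0] * len(b)                 # ticks obtained per job
        for _ in range(cycles):
            for _tick in range(T):
                q = [beta * bk + R * vk for bk, vk in zip(b, v)]
                i = q.index(min(q))        # serve the lowest-priority job
                v[i] += 1.0
                got[i] += 1
            v = [vk / D + delta * bk for vk, bk in zip(v, b)]
        return got

    got = simulate([0, 19])                # two jobs, base priorities 0 and 19
    print(got[0] / got[1])                 # approaches about 1.47, the System V
                                           # bound computed in Section 4.2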

[Figure 4: Ratio of shares (s1/s2) versus the base priority of class 2 (b2) (K = 2, M2 = P, b1 = 0; in (a), P = 1, 4, M1 = 3P; in (b), P = 1, 4, 8, 16, M1 = 5P).]

6.2 The Model versus the Simulations

We compare the model output with simulation results for four representative experiment sets.

I) Two classes, fixed numbers of jobs, increasing base priority of class 2. We first consider System V and 4.3BSD, see Figure 4a. It turns out that the simulation output is identical for P = 1 and P = 4 (as is of course the model output). For 4.3BSD and b2 = 16, 17, the model and the simulations give ratios of 13.95 and 36.03, and of 13.00 and 26.33, respectively; for b2 = 18, 19, in both the model and the simulations, class-2 jobs starve. The model output and the simulation results match quite well. Furthermore, 4.3BSD discriminates much more between jobs with the same difference in base priorities than System V does, as was to be expected (cf. Section 4.2). For Mach, the cg simulations deviated somewhat from the model, so we also ran fg simulations, which match the model better (see Figure 4b). For each of the values 1, 4, 8, 16 of the number P of processors, both the cg and the fg simulations gave identical results.

II) Two classes, fixed base priorities, increasing number of jobs with the lowest base priority. Again we first consider System V and 4.3BSD, see Figure 5a. The model output is correct for any value of P; the simulations have been run for P = 1, 4 with identical results. For 4.3BSD and M1 = 12, 13, the model gives ratios of 16.17 and 223.00, respectively, and for M1 ≥ 14 it indicates starvation for class 2, while the simulations give starvation for M1 ≥ 12. The deviation between the model and the cg simulations is considerable for some values of M1/P, and is caused by the discrete values in the scheduler, as we now show by examining in detail the case with System-V parameters, P = 1, M1 = 14, and M2 = 1, for which the model gives s1/s2 = 6.63, and the simulation gives s1/s2 = 3.50.

[Figure 5: Ratio of shares (s1/s2) versus the number of class-1 jobs per processor (M1/P) (K = 2, M2 = P, b1 = 0; in (a), P = 1, 4, b2 = 6; in (b), P = 1, 4, 8, 16, b2 = 18).]

It turns out that in the simulation, a steady state is reached in the sense described above, in which at the end of every decay cycle, v1 = 14 and v2 = 3, so the limiting priorities are q1 = q2 = 57 (including PUSER). After recomputation at the end of a decay cycle, the values v_i' of the CPU usage are v1' = 7 and v2' = 1, and so the priorities are 53 and 56, respectively. Now first all class-1 jobs receive one clock tick and go to run queue number 54. Then they all receive 4 clock ticks (one at a time), and are appended to queue 56 (which is then headed by the class-2 job). Subsequently, the class-2 job receives its first clock tick and goes to run queue 57, and then all class-1 jobs receive 2 clock ticks and are appended to queue 57. Now all class-1 jobs have received 7 clock ticks, and so they all have v1 = 14. The only clock tick left in the decay cycle is given to the class-2 job, which then has v2 = 3, and we have the same situation as at the end of the previous decay cycle. For the best approximation of the ratio of the model within a decay cycle, the class-2 job should have received one clock tick instead of two. For 4.3BSD, it is even more difficult to achieve the ratios predicted by the model, because it has only 10 quanta per second, so it is not strange that for M1/P = 10, the model breaks down. For M1/P = 10, the ratio 5.65 of the model can only be reasonably approximated in 6 seconds (by 5.90, with not all class-1 jobs receiving the same number of clock ticks), but the model is only correct on the basis of 1-second decay cycles. It is somewhat of a miracle that for M1/P = 9, 11, the model and the simulations are so close. Note that in the fg simulations, in 4.3BSD the number of quanta per second is 100 instead of 10, but in System V it remains 100; in the latter system, the closer approximation of the model is solely due to a finer distinction in priorities.

In Figure 5b, model output and simulation results of Mach systems are depicted. The fg simulations are close to the model, but the cg simulations are not (when no point is depicted for cg simulations, starvation of class 2 occurs).
Traces of the cg simulations show that for each of the values 6, 7, 8, 9 of M1/P, the class-2 jobs get almost exactly the same amount of CPU time: for r = M1/M2 = 7, 8, 9, the share ratios are virtually equal to 6/r times the ratio for M1/M2 = 6. The explanation is that for M1/P = 6, 7, 8, 9, the following sequence of events during a decay cycle occurs on average the same number of times. The class-1 jobs make their way to the queue with the class-2 jobs by receiving a quantum, then each class-2 job gets a quantum, and then the class-1 jobs consume the remaining quanta. For M1/P ≥ 10, in every decay cycle, each of the class-1 jobs needs a quantum before it can reach the queue with the class-2 jobs, and so the latter starve. Finally, note the qualitative difference in the behavior of System V and 4.3BSD on the one hand and Mach on the other. In the former two systems, class-2 jobs starve when the number of class-1 jobs is high enough, but in the latter, the share ratio has a limit for increasing M1/P (of about 8.94 by (76)).

III) Three classes. Table 1 shows the ratios of shares and the limiting priority Q at the end of a decay cycle for an experiment set with three classes on a uniprocessor with 4.3BSD parameters. In the simulations, Q is computed as the sum of the average priority of all jobs at the end of the last decay cycle and of PUSER = 50. For the model, Q is computed as the sum of the value given by (43) and PUSER. In an actual system, priority clamping would have occurred because Q > 127.

    M1  M2  M3   s1/s2            s1/s3            Q
                 model   sim.     model   sim.     model   sim.
     1   1   1   1.58    1.58      3.75    3.71    142.1   140.7
     1   1   2   1.58    1.53      3.78    3.83    154.1   152.5
     1   2   1   1.68    1.67      5.25    5.00    144.5   142.8
     1   2   2   1.67    1.68      5.11    5.00    156.3   154.6
     2   1   1   1.82    1.85     10.07    9.25    134.9   134.0
     2   1   2   1.78    1.78      7.98    8.01    147.7   145.8
     2   2   1   1.92    1.92     24.11   22.59    139.2   137.2
     2   2   2   1.87    1.87     14.66   14.04    151.4   149.7

Table 1: Ratios of shares and the limiting priority Q (4.3BSD, P = 1, K = 3, b1 = 0, b2 = 9, b3 = 18).

IV) Four classes. Table 2 shows the share ratios for an experiment set with four classes on a 4-processor 4.3BSD system.

    M1  M2  M3  M4   s1/s2           s1/s3           s1/s4
                     model   sim.    model   sim.    model       sim.
     3   3   3   3   1.32    1.31    1.96    1.97      3.75        3.71
     3   3   3   7   1.32    1.32    1.96    1.98      3.78        3.75
     3   3   7   3   1.35    1.34    2.09    2.12      4.61        4.33
     3   3   7   7   1.35    1.37    2.09    2.12      4.62        4.67
     3   7   3   3   1.39    1.35    2.27    2.25      6.16        6.00
     3   7   3   7   1.38    1.38    2.23    2.22      5.76        5.62
     3   7   7   3   1.41    1.42    2.40    2.36      7.98        7.70
     3   7   7   7   1.40    1.39    2.36    2.38      7.39        6.90
     7   3   3   3   1.43    1.41    2.50    2.51     10.07        9.67
     7   3   3   7   1.41    1.42    2.40    2.37      7.98        8.01
     7   3   7   3   1.45    1.44    2.63    2.60     14.05       13.33
     7   3   7   7   1.43    1.44    2.54    2.54     10.90       10.55
     7   7   3   3   1.49    1.49    2.94    2.89    101.59  starvation
     7   7   3   7   1.47    1.45    2.76    2.70     23.02       18.00
     7   7   7   3   1.51    1.51    3.07    2.83    starvation  starvation
     7   7   7   7   1.49    1.47    2.90    2.80     57.82      196.00

Table 2: Ratios of shares (4.3BSD, P = 4, K = 4, b1 = 0, b2 = 6, b3 = 12, b4 = 18).
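The model columns of these tables can be reproduced (up to rounding) from the share formula used above; a sketch of ours, assuming the reconstructed 4.3BSD value γ = (2l + 9)/(2l + 1) from (74):

    # Reproduce the 'model' columns of Table 1 from
    # s_k ~ PT + gamma * sum_i M_i (b_i - b_k).
    def ratios(M, b, P=1, T=100):
        l = sum(M) / P
        gamma = (2 * l + 9) / (2 * l + 1)
        raw = [P * T + gamma * sum(Mi * (bi - bk) for Mi, bi in zip(M, b))
               for bk in b]
        return raw[0] / raw[1], raw[0] / raw[2]

    print(ratios([1, 1, 1], [0, 9, 18]))  # about (1.58, 3.75), Table 1, row 1
    print(ratios([2, 1, 1], [0, 9, 18]))  # about (1.82, 10.07), Table 1, row 5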

6.3 Model Adaptation for Mach

In the next section we will find that, especially when there are many jobs (per processor), the Mach measurements sometimes deviate considerably from the model output. We will now argue that this phenomenon is probably due to a combination of the way in which Mach effectuates decay, described in Section 2.1.3, and of the relatively large quantum size of 100 ms. Let us for the sake of this discussion introduce the following terminology. An extended decay cycle is a 2-second interval at the end of which Mach effectuates decay for every job; this decay is called global decay. Local decay is the decay effectuated for a process when it runs and its local decay counter is found to be smaller than the global 1-second counter. Obviously, during the first half of an extended decay cycle, local decay does not occur; it is only performed at the first selection of a job for execution during the second half of an extended decay cycle, which happens on the basis of its priority to which no (local or global) decay has yet been applied during the current extended decay cycle.

We first show that the only continuous-time model of this form of decay we can think of yields the same share ratios as the original model. In this model, at the end of the first half of an extended decay cycle in the steady state, all jobs will have attained the same priority, and they can be thought of as all being in a single queue, ordered according to their class, with all class-K jobs at the head and all class-1 jobs at the tail. Then, the first job that is selected for service in the second half of the extended decay cycle, which would get an infinitesimally small time quantum according to the PS-type scheduling policy, is subjected to local decay (with decay factor D). As a consequence, its priority drops, it will receive service until its priority is again equal to that of the other jobs, and it will be appended to the queue. This happens for all jobs in the order just described, after which they will all receive the same amount of service until the end of the extended decay cycle. Then, global decay (with decay factor D) is performed, as a result of which the jobs will be at priority levels which increase according to their class numbers.

Clearly, by (14), the difference in priority increase between classes k and l is γR(b_k − b_l) in either half of an extended decay cycle, and so the share ratios are the same as in the original model.

Now if time is discrete, at the start of the second half of an extended decay cycle, following the course of things in the continuous-time model described above, the jobs of the higher classes will tend to be at the head(s) of the queue(s), and those of the lower classes at the tail(s), and so, because there are not very many time quanta, the former will be relatively favored. We have included the Mach way of decay as an option in our simulator, and refer to simulations with this form of decay as adapted simulations, which again can be coarse-grained or fine-grained. The number of clock ticks in an extended decay cycle is 200 and 2000, respectively, and in the latter, the parameters β, δ, R₀ are adapted in the same way as indicated in Section 6.1. In Figure 6, it is shown that indeed class 2 gets better service with the way Mach performs decay, but only when the time quantum is large.

We have found from the traces of the adapted simulations that, especially when their base priorities are much higher than that of class 1, or when there are many jobs per processor, jobs of class k > 1 often starve during the first, and sometimes during the second, half of an extended decay cycle. Now if the jobs of class k each get an amount c_k of CPU time during either the first or the second half of an extended decay cycle, and nothing in the other half, and if v_k is the CPU usage at the end of an extended decay cycle, we have v_k = v_k/D² + c_k, so the jobs experience a decay factor of D² and a decay-cycle length of 2T. Class-1 jobs still get service during either half of an extended decay cycle; if they get the same amount c1 in either half, then v1 = v1/D² + c1/D² + c1/D. The share ratio s1/s_k for k ≥ 2 is then given by 2c1/c_k. Because in addition jobs are first selected during the second half of a decay cycle on the basis of their priority to which no (local) decay has yet been applied, and because Mach really operates in 2-second cycles, we conclude that Mach can be modeled more closely with a decay factor of D = 2.56 (equal to the square of the original decay factor) and with a decay cycle of T = 200 clock ticks. As the increment in priority due to one second of CPU time is built into the system, we still have R = lR₀/100. Using (57), the steady-state share ratios are now given by

    s_k/s_l = [10.24 R₀ M̂_K + 1.56 Σ_{i=1}^K M_i (b_i − b_k)] / [10.24 R₀ M̂_K + 1.56 Σ_{i=1}^K M_i (b_i − b_l)],  k, l = 1, ..., K.

We will refer to the model with the new values of the parameters T and D as the adapted model. In Figure 6, we find that the output of the adapted model and the results of the adapted simulations agree reasonably well, especially when the number of class-1 jobs varies. The decreasing behavior of the share ratios in Figure 6b for M1/P = 5, 6, 7 and for M1/P = 10, 11, 12, 13 can be explained in the same way as in Section 6.2, with the class-2 jobs getting one quantum per extended decay cycle, and one quantum in every two extended decay cycles, respectively.

By (13), we have γ ≈ 8/l in the adapted model of Mach. In order to evaluate the levels of control in the original and the adapted models, we have to compare the values of γ/(PT) with T = 100 and T = 200, respectively. So, because of (75), the effect of the adaptation is a lower level of control.
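A sketch of ours comparing the original and adapted Mach share-ratio formulas; as before, R₀ = 3.8 is an assumption inferred from the constants in (76) and (79).

    # Mach share ratios in the original model, cf. (76), and in the
    # adapted model with D = 2.56 and T = 200; assuming R0 = 3.8.
    R0 = 3.8

    def mach_ratio(coef_R0, coef_b, M, b, k, l):
        Mhat = sum(M)
        num = coef_R0 * R0 * Mhat + coef_b * sum(Mi * (bi - b[k])
                                                 for Mi, bi in zip(M, b))
        den = coef_R0 * R0 * Mhat + coef_b * sum(Mi * (bi - b[l])
                                                 for Mi, bi in zip(M, b))
        return num / den

    M, b = [5, 1], [0, 18]                 # M1 = 5P, M2 = P with P = 1
    print(mach_ratio(16, 3, M, b, 0, 1))       # original model
    print(mach_ratio(10.24, 1.56, M, b, 0, 1)) # adapted model: a lower ratio,
                                               # i.e., a lower level of control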

[Figure 6: Ratio of shares (s1/s2) versus (a) the base priority of class 2 (b2) and (b) the number of class-1 jobs per processor (M1/P) in the adapted model and the adapted simulations of Mach (K = 2, P = 1, 4, 8, 16, M2 = P, b1 = 0; in (a), M1 = 5P; in (b), b2 = 18).]

There are two other possible causes for deviations between the model and the measurements. First, the load l reported by Mach is higher than the number of jobs in our experiments divided by P. In fact, for P = 1, 4, 8, the difference is usually slightly less than 1/P (which may be due to the OSF operating system, which runs as a job on top of Mach); for P = 16, the reported load is very close to M̂_K/P. Second, we find that our jobs do not get all of the available CPU time (again perhaps due to the OSF operating system); for P = 1, 4, 8, 16, they get about 75-85%, 92-95%, 95%, and 97%, respectively. By (57) and (74), the effect of a higher load is to decrease the share ratios, and the effect of not all CPU time being obtained by our jobs (which amounts to a shorter decay cycle) is to increase them. In Section 6.4, we have traced the impact of these two phenomena for one experiment set. In Figure 10, the graph labeled adapted model (measured load and percentage CPU) gives the share ratios as predicted by the adapted model, but with the load l and the decay-cycle length replaced by their measured values. As can be seen, this does not really provide an explanation for the deviations between the model and the measurements.

6.4 The Model versus the Measurements

We compare measurements of 4.3BSD-based UNIX systems and Mach systems with model output for four representative experiment sets. For System V, measurements of uniprocessors were already reported in [10, 11]. Because the load l is at least equal to 2 in all our experiments, by (79), no bounds of scheduler variables are attained in Mach. For the 4.3BSD-based systems, we use (78) to determine for each experiment set for which values of the parameters the model is valid. When the model is invalid, it is because the value p_cpu of the CPU usage of class 1 (v1 in the model) exceeds 255. In some cases below, theoretically, 255 < v1 < 256, so because of the roundings down in actual systems, the model can then still be used. As mentioned above, the 4-processor Sun uses a decay factor of D = (2M̂_K + 1)/(2M̂_K), and in the same way as in Section 4.3, it can easily be shown that then the bound 255 of p_cpu for class 1 is reached for any set of jobs. As a consequence, all our measurements on the 4-processor Sun are worthless for our purposes (but they support the conclusion of Section 4.3 that the bound for p_cpu favors the lower classes, by showing (much) larger share ratios, and often indicating starvation of the higher class(es)).

I) Two classes, fixed numbers of jobs, increasing base priority of class 2. First we consider the Sun, see Figure 7a. The model is only valid for b2 ≤ 11. The match of the measurements and the model is perfect. For the Sequent under DYNIX (see Figure 7b; the model is valid for b2 ≤ 13), the measurements show a larger share ratio than the model. Traces of the system revealed that jobs may get longer time slices than 100 ms, even up to 600 ms, while their priority does not justify this. Because high-priority processes are eligible for running earlier during a decay cycle, this favors the lower classes. We have observed the same phenomenon (higher share ratios) in some measurements of the same Sequent/DYNIX system with only one CPU enabled. Therefore, we think that the deviation is not due to a multiprocessor effect, but to some implementation detail of the operating system. In Figure 8, we show Mach measurements for P = 1, 4, 8, 16, all with the same ratio M1/M2. Note that indeed the share ratios do not depend on the number of processors P (cf. Section 4.2). On Mach, we could not let the numbers of jobs increase linearly with the number of processors because of a limitation of the number of jobs per user. A comparison of Figures 4b and 8 shows the difference between the original and adapted models for Mach.

II) Two classes, fixed base priorities, increasing number of jobs with the lowest base priority. See Figure 9 for the 4.3BSD-based systems; the model is only valid for M1 ≥ 3P. For the Sun (P = 1), for M1 = 12, the measurements twice give a ratio of 12.31 and the model gives 16.17; for M1 ≥ 13, the measurements indicate starvation, while the model gives a ratio of 223.00 for M1 = 13 and starvation for M1 ≥ 14. The cases M1 = 1, 2 confirm our conclusion of Section 4.3 that a bound for the CPU usage favors the lower classes (here class 1). For the Sequent (P = 4), again we find that the measurements indicate higher share ratios than the model does, which is again probably caused by time quanta larger than 100 ms being given to class 1. The match of the adapted model and the Mach measurements in Figure 10 is only reasonable (with one strange outlier for P = M1 = 16). Taking into account the measured load and the obtained CPU time (cf. Section 6.3) does not explain the gap. Especially for this experiment set, the influence of adapting the model is considerable (compare Figures 5b and 10).

III) Three classes, fixed numbers of jobs, increasing base priority of class 3. The match of the model and the measurements on both the Sun (see Figure 11a; the model is only valid for b3 ≤ 9) and Mach (see Figure 12) is truly remarkable.

IV) Three classes, fixed base priorities, increasing number of jobs with the lowest base priority. The results for the Sun are in Figure 11b. The model is only valid for M1 ≥ 3. Again, the measurements agree quite well with the model.
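The validity thresholds quoted above can be reproduced from the reconstructed condition (78); in this sketch of ours, b_avg = Σ_k M_k b_k / M̂_K is again our notation for the weighted average base priority.

    # Check for which parameter values a 4.3BSD experiment stays within
    # the p_cpu bound of 255, cf. the reconstructed (78).
    def bsd_model_valid(M, b, P):
        Mhat = sum(M)
        l = Mhat / P
        b_avg = sum(Mk * bk for Mk, bk in zip(M, b)) / Mhat
        return (9 + 2 * l) * b_avg + 100.0 / l <= 55

    # Experiment I on the Sun (P = 1, M1 = 5, M2 = 1):
    print(max(b2 for b2 in range(20) if bsd_model_valid([5, 1], [0, b2], 1)))
    # -> 10; b2 = 11 is the borderline case 255 < v1 < 256, which the
    #    rounding down in the actual system still permits.
    # Experiment III (M = [5, 2, 1], b2 = 2): valid up to b3 = 9
    print(max(b3 for b3 in range(20) if bsd_model_valid([5, 2, 1], [0, 2, b3], 1)))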

7 Conclusions

We have analyzed a decay-usage scheduling policy for multiprocessors modeled after different variations of UNIX and after Mach. Our main results are the convergence of the policy and the relation between the base priorities and the steady-state shares. Our simulations validate our analysis, but also show that the discretization of time may be a source of considerable deviation from the model. The measurements of the 4.3BSD uniprocessor and of Mach match the model remarkably well; those of the 4.3BSD multiprocessor show a discrepancy that we ascribe to implementation details which we have not been able to identify.

Our results show that share scheduling can be achieved in UNIX, but that unfortunately, in the decay-usage scheduling policy we have analyzed, the shares depend on the numbers of jobs in the classes in an intricate way. Therefore, the policy does not easily achieve Priority Processor Sharing (i.e., fixed ratios of the shares of jobs of different classes, regardless of their numbers) or Group Priority Processor Sharing (fixed ratios of the total amounts of CPU time obtained jointly by all jobs of the different classes), both of which may be desirable. On the other hand, the objectives of the UNIX scheduler are fast interactive response to short, I/O-bound jobs and the prevention of starvation of compute-bound jobs, not share scheduling.

We have seen that in order to have the same range of possible share ratios for the same set of jobs in systems that do not employ the Mach load-factor technique, the range of base priorities should be proportional to both the number and the speed of the processors; in other words, the leverage of decay-usage scheduling with the same range of base priorities is much larger in small and slow multiprocessors than in large or fast ones. Also, of the actual systems considered, 4.3BSD has the highest level of control over the share ratios. Finally, a scheduler with a constant decay factor and a constant increment of priority due to a clock tick of CPU time obtained is completely described by one parameter instead of four, at least as far as the steady-state behavior is concerned.


8 Acknowledgments The support of the High Performance Computing Department, managed by W.G. Pope, of the IBM T.J. Watson Research Center in Yorktown Heights, NY, USA, where part of the research reported on in this paper was performed, and of IBM The Netherlands, is gratefully acknowledged. In addition, the author owes much to stimulating discussions with J.L. Hellerstein of the IBM Research Division. Furthermore, the author thanks the OSF Research Institute in Grenoble, France, for the opportunity to perform measurements on their multiprocessor Mach system, and Andrei Danes and Philippe Bernadat of OSF for their help. Finally, the author thanks I.S. Herschberg for his careful reading of a draft version of this paper and his suggesting numerous improvements in its exposition. UNIX is a registered trademark of X/OPEN Company, Ltd.

[Figure 7: Ratio of shares (s1/s2) versus the base priority of class 2 (b2) (4.3BSD, K = 2, b1 = 0; (a) P = 1, M1 = 5, M2 = 1: measurement and model; (b) P = 4, M1 = 19, M2 = 3: measurement, model, simulation, and fg simulation).]

[Figure 8: Ratio of shares (s1/s2) versus the base priority of class 2 (b2) (Mach, K = 2, b1 = 0; measurement and adapted model for P = 1 (M1 = 5, M2 = 1), P = 4 (M1 = 20, M2 = 4), P = 8 (M1 = 40, M2 = 8), and P = 16 (M1 = 40, M2 = 8)).]

[Figure 9: Ratio of shares (s1/s2) versus the number of class-1 jobs per processor (M1/P) (4.3BSD, K = 2, b1 = 0, b2 = 6; (a) P = 1, M2 = 1: measurement and model; (b) P = 4, M2 = 4: measurement, model, simulation, and fg simulation).]

[Figure 10: Ratio of shares (s1/s2) versus the ratio of the numbers of jobs of classes 1 and 2 (M1/M2) (Mach, K = 2, b1 = 0, b2 = 18; measurement, adapted model, and adapted model (measured load and percentage CPU), for P = 1 (M2 = 1) and P = 4, 8, 16 (M2 = 4)).]

[Figure 11: Ratios of shares (s1/s_k, k = 2, 3) versus (a) the base priority of class 3 (b3) and (b) the number of class-1 jobs (M1) (4.3BSD, P = 1, K = 3, b1 = 0; in (a), M1 = 5, M2 = 2, M3 = 1, b2 = 2; in (b), M2 = 2, M3 = 1, b2 = 3, b3 = 5; measurement and model for s1/s2 and s1/s3).]

[Figure 12: Ratios of shares (s1/s_k, k = 2, 3) versus the base priority of class 3 (b3) (Mach, K = 3, b1 = 0, b2 = 4; measurement and adapted model for P = 1 (M = (5, 2, 1)), P = 4 (M = (10, 4, 2)), P = 8 (M = (20, 8, 4)), and P = 16 (M = (30, 12, 6))).]

References

[1] M.J. Bach, The Design of the UNIX Operating System, Prentice-Hall, 1986.
[2] D.L. Black, "Scheduling Support for Concurrency and Parallelism in the Mach Operating System," IEEE Computer, May, 35-43, 1990.
[3] D.L. Black, Scheduling and Resource Management Techniques for Multiprocessors, Report CMU-CS-90-152, Carnegie Mellon University, 1990.
[4] S. Curran and M. Stumm, "A Comparison of Basic CPU Scheduling Algorithms for Multiprocessor UNIX," Computing Systems, Vol. 3, 551-579, 1990.
[5] R.B. Essick, "An Event-Based Fair Share Scheduler," USENIX, Winter, 147-161, 1990.
[6] G. Fayolle, I. Mitrani, and R. Iasnogorodski, "Sharing a Processor among Many Job Classes," J. of the ACM, Vol. 27, 519-532, 1980.
[7] L.L. Fong and M.S. Squillante, Time-Function Scheduling: A General Approach to Controllable Resource Management, IBM Research Report RC 20155, IBM Research Division, New York, NY, 1995.
[8] B. Goodheart and J. Cox, The Magic Garden Explained: The Internals of UNIX System V Release 4, An Open Systems Design, Prentice-Hall, 1994.
[9] A.G. Greenberg and N. Madras, "How Fair is Fair Queuing?," J. of the ACM, Vol. 39, 568-598, 1992.
[10] J.L. Hellerstein, "Control Considerations for CPU Scheduling in UNIX Systems," USENIX, Winter, 359-374, 1992.
[11] J.L. Hellerstein, "Achieving Service Rate Objectives with Decay-Usage Scheduling," IEEE Trans. on Softw. Eng., Vol. 19, 813-825, 1993.
[12] G.J. Henry, "The Fair Share Scheduler," AT&T Bell Laboratories Technical Journal, Vol. 63, 1845-1857, 1984.
[13] J. Kay and P. Lauder, "A Fair Share Scheduler," Comm. of the ACM, Vol. 31, 44-55, 1988.
[14] L. Kleinrock, "Time-Shared Systems: A Theoretical Treatment," J. of the ACM, Vol. 14, 242-261, 1967.
[15] S.J. Leffler, M.K. McKusick, M.J. Karels, and J.S. Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, 1989.
[16] M.K. McKusick, K. Bostic, M.J. Karels, and J.S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley, 1996.
[17] U. Vahalia, UNIX Internals: The New Frontiers, Prentice-Hall, 1996.
[18] C.A. Waldspurger and W.E. Weihl, "Lottery Scheduling: Flexible Proportional-Share Resource Management," Proc. of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, 1-11, 1994.
[19] C.A. Waldspurger and W.E. Weihl, Stride Scheduling: Deterministic Proportional-Share Resource Management, Technical Memorandum MIT/LCS/TM-528, MIT Laboratory for Computer Science, Cambridge, MA, 1995.
