Energy-Efficient CPU Scheduling for Multimedia Applications

Wanghong Yuan, DoCoMo USA Labs
Klara Nahrstedt, University of Illinois at Urbana-Champaign

This paper presents the design, implementation, and evaluation of GRACE-OS, an energy-efficient real-time CPU scheduler for multimedia applications running on a mobile device. GRACE-OS seeks to minimize the total energy consumed by the device while meeting multimedia timing requirements. To achieve this goal, GRACE-OS integrates dynamic voltage scaling into traditional real-time CPU scheduling: it decides at what CPU speed to execute applications in addition to when to execute what applications. GRACE-OS makes these scheduling decisions based on the probability distribution of the cycle demand of multimedia applications and obtains their demand distribution via online profiling. We have implemented GRACE-OS in the Linux kernel and evaluated it on a laptop with a variable-speed CPU and typical multimedia codecs. Our experimental results show four findings: First, the demand distribution of our studied codecs is stable or changes slowly. This stability implies that our proposed energy-efficient scheduling can be performed with low overhead. Second, GRACE-OS delivers soft performance guarantees to these codecs by bounding their deadline miss ratio under the application-specific performance requirements. Third, GRACE-OS reduces the total energy of the laptop by 14.4% to 37.2% relative to the scheduling algorithm without voltage scaling and by 2% to 10.5% relative to voltage scaling algorithms that do not consider the demand distribution. Finally, GRACE-OS saves 2% to 5% more energy by explicitly considering the discrete CPU speeds and the corresponding total power of the whole laptop, rather than assuming continuous speeds and a cubic speed-power relationship.
Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management—Scheduling; D.4.7 [Organization and Design]: Real-time systems and embedded systems
General Terms: Algorithms, Design, Experimentation, Performance
Additional Key Words and Phrases: Power Management, Mobile Computing, Multimedia, Soft Real-time

Parts of this work appeared as a conference publication in the Nineteenth ACM Symposium on Operating Systems Principles [Yuan and Nahrstedt 2003]. This work was supported in part by NSF under CCR 02-05638 and CISE EIA99-72884. Authors' addresses: Wanghong Yuan, DoCoMo USA Labs, 181 Metro Drive, Suite 300, San Jose, CA 95110; email: [email protected]. Klara Nahrstedt, University of Illinois at Urbana-Champaign, Thomas M. Siebel Center for Computer Science, 201 N Goodwin Ave, Urbana, IL 61801; email: [email protected]. The corresponding author is Wanghong Yuan. Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2005 ACM 1529-3785/2005/0700-0001 $5.00

ACM Transactions on Computer Systems, Vol. V, No. N, June 2005, Pages 1–36.

1. INTRODUCTION

Battery-powered mobile devices, e.g., cellphones and laptops, are expected to become important platforms for multimedia applications. Compared to desktop and server systems, mobile devices need to support multimedia quality of service (QoS) and save battery energy at the same time. There is a conflict in the design goals for QoS support and energy saving. For QoS support, system resources, such as the CPU, often need to provide high performance, typically consuming more energy. For energy saving, system resources should consume low energy, thereby yielding lower performance. As a result, the operating system of mobile devices needs to manage resources in a QoS-aware and energy-efficient manner and provide the flexibility to trade off QoS and energy. The tradeoff becomes possible due to two new opportunities in mobile devices. First, hardware components are being designed to support multiple power modes, trading off performance for energy; for example, mobile processors, such as Intel Pentium-M [Intel 2004], can change the speed (frequency/voltage) and energy consumption at runtime. Second, multimedia applications present soft real-time resource demands. Unlike hard real-time applications, they require only statistical performance guarantees (e.g., meeting 95% of deadlines). Unlike best-effort applications, as long as multimedia applications complete a job (e.g., decoding a video frame) by the deadline, the actual completion time does not matter from the QoS perspective. This soft real-time nature provides the possibility for saving energy without affecting multimedia QoS. This paper exploits these opportunities to address the problem of QoS provisioning and energy saving. In particular, we focus on the management of CPU and energy resources for stand-alone mobile devices. Dynamic voltage/frequency scaling (DVS) is a common mechanism to save energy by slowing down the CPU [Flautner and Mudge 2002; Gruian 2001; Grunwald et al.
2000; Lorch and Smith 2004; Mohapatra et al. 2003; Pering et al. 2000; Pillai and Shin 2001; Weiser et al. 1994]. The major goal of DVS is to reduce energy as much as possible without degrading application performance. The effectiveness of DVS techniques is, therefore, dependent on the ability to predict application CPU demand: overestimation can waste CPU and energy, while underestimation can degrade application performance. In general, there are three prediction approaches: (1) monitoring the CPU utilization of all applications [Flautner and Mudge 2002; Grunwald et al. 2000; Pering et al. 1998; Weiser et al. 1994; Govil et al. 1995], (2) using the worst-case CPU demand of individual applications [Pering et al. 2000; Pillai and Shin 2001; Krishna and Lee 2000], and (3) monitoring the runtime CPU usage of individual applications [Gruian 2001; Lorch and Smith 2004; Pillai and Shin 2001; Krishna and Lee 2000]. The first two approaches, however, are not suitable for multimedia applications due to dynamic changes of their instantaneous CPU demand (for example, an MPEG video decoder needs different amounts of CPU cycles to decode I, P, and B frames): The first approach may violate the timing constraints of multimedia applications, while the second approach is too conservative for them. We therefore take the third approach and integrate it into soft real-time scheduling. Soft real-time scheduling is commonly used to support QoS by combining predictable CPU allocation (e.g., proportional sharing and reservation) and real-time scheduling algorithms (e.g., earliest deadline first and rate monotonic) [Chandra et al. 2000; Duda and Cheriton 1999; Goyal et al. 1996; Nieh and Lam 2003; Jones et al. 1997; Urgaonkar et al. 2002]. In our integrated approach, the DVS algorithms are implemented in the CPU scheduler. The enhanced scheduler, called GRACE-OS, decides at what CPU speed to execute applications


in addition to when to execute what applications. Our goal is to obtain the benefits of both soft real-time scheduling and DVS; i.e., we want to maximize the energy saving of DVS while preserving the performance guarantees of soft real-time scheduling. To do this, we introduce a statistical property into GRACE-OS. Specifically, the scheduler allocates CPU cycles to individual processes (or threads) based on their statistical performance requirements and probability distribution of cycle demand. For example, if an MPEG decoder requires meeting 95% of deadlines and, for a particular input video, 95% of frames demand no more than 9 × 10^6 cycles to decode, then the scheduler can allocate the decoder 9 × 10^6 cycles per frame. Compared to the worst-case-based allocation, this statistical allocation increases CPU utilization. It also saves energy at the process-set level, since the CPU can run at a minimum speed that meets the aggregate statistical demand of all concurrent processes. There is also a potential to save more energy at the per-process level. The reason is that a process may, and often does, complete a job before using up its allocated cycles. For the above MPEG decoder example, about 95% of frames need less than the allocated 9 × 10^6 cycles. Such early completion often results in CPU idle time and hence wastes energy. To realize this potential for saving energy, GRACE-OS finds a speed schedule for each process based on the probability distribution of the process's cycle demand. This speed schedule enables each job of the process to start slowly and accelerate as the job progresses. Consequently, if the job completes early, it can avoid the high-speed (and high-energy) part of the execution. This is similar to the statistical DVS approaches proposed by Lorch and Smith [Lorch and Smith 2004] and Gruian [Gruian 2001].
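To make the percentile-based allocation concrete, the following is an illustrative sketch of ours (not the paper's kernel implementation, which derives the allocation from a histogram as Section 3.1 describes): given a window of observed per-job cycle counts, pick the smallest demand that covers a fraction ρ of the jobs.

```python
import math

def statistical_allocation(job_cycles, rho=0.95):
    """Smallest cycle budget covering at least a fraction rho of observed jobs."""
    ordered = sorted(job_cycles)
    # index of the ceil(rho * n)-th smallest sample (1-based -> 0-based)
    idx = math.ceil(rho * len(ordered)) - 1
    return ordered[idx]

# 100 jobs demanding 1..100 megacycles: the 95%-allocation is 95 megacycles,
# so roughly 95% of jobs complete within the budget.
demands = list(range(1, 101))
print(statistical_allocation(demands, 0.95))  # -> 95
```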
Since the statistical scheduling and DVS both depend on the demand distribution of multimedia processes, we estimate the demand distribution online. We first use kernel-based profiling to monitor the cycle usage of each process, and then use a simple yet effective histogram technique to estimate the probability distribution of the process's cycle usage. Our estimation approach is distinguished from others (e.g., [Gruian 2001; Lorch and Smith 2004; Urgaonkar et al. 2002]) in that it can be used online with low overhead. This is important and necessary for live applications such as video conferencing. We have implemented GRACE-OS in the Linux kernel 2.6.5 and evaluated it on an HP Pavilion laptop with a variable-speed CPU and multimedia applications, including codecs for speech, audio, and video. The experimental results show five interesting findings: (1) Although the studied codecs vary their instantaneous cycle demand greatly [Bavier et al. 1998], the probability distribution of their cycle demand is stable or changes slowly and smoothly. This stability means that GRACE-OS can estimate the demand distribution from a small part of process execution (e.g., the first 100 frames) or update the demand distribution infrequently. That is, it is feasible to perform statistical scheduling and DVS based on the demand distribution with low overhead. (2) GRACE-OS delivers soft performance guarantees with statistical (as opposed to worst-case) CPU allocation. It meets almost all deadlines in a lightly loaded environment, and bounds the deadline miss ratio under the application-specific performance requirement (e.g., meeting 95% of deadlines) in a heavily loaded environment. (3) Depending on the application workload, GRACE-OS reduces the total energy of the laptop by 14.4% to 37.2% relative to the scheduling algorithm without voltage scaling and by 2% to 10.5% relative to voltage scaling algorithms that do not consider the demand distribution.


(4) Compared to other statistical DVS algorithms [Lorch and Smith 2004; Gruian 2001] that assume continuous CPU speeds and a cubic speed-power relationship, GRACE-OS further saves energy by 2% to 5% by explicitly considering the discrete CPU speed levels and the corresponding total power of the whole laptop. (5) GRACE-OS incurs acceptable overhead. The cost is (a) 26 to 38 cycles for each profiling operation that happens during each context switch, (b) 18 to 275 microseconds for each histogram-based estimation that happens every 100 frames, (c) less than 5 microseconds for real-time scheduling that happens every 500 microseconds, and (d) 17 to 28 microseconds for each speed change that happens a few times (fewer than 6 on our platform) per frame. This paper makes four major contributions: First, while the statistical DVS approach was proposed by Lorch and Smith [Lorch and Smith 2004] and Gruian [Gruian 2001], they presented simulation results only. GRACE-OS is one of the first two operating systems that implement statistical DVS. The other implementation by Lorch and Smith [Lorch and Smith 2003] does not perform DVS for individual processes. Second, GRACE-OS integrates statistical DVS with soft real-time scheduling, and hence can perform statistical voltage scaling for concurrent applications. Third, we propose and implement a statistical DVS algorithm for real CPUs with only discrete speeds. Finally, we propose a simple and effective profiling approach to automatically estimate the demand distribution of individual applications for both voltage scaling and soft real-time scheduling. The rest of the paper is organized as follows. Section 2 introduces system models for GRACE-OS. Section 3 describes the architecture of GRACE-OS and its major algorithms for online profiling, scheduling, and voltage scaling. Sections 4 and 5 present the implementation and experimental evaluation, respectively. Section 6 compares GRACE-OS with related work.
Finally, Section 7 summarizes the paper.

2. SYSTEM MODELS

Our target systems are battery-powered, stand-alone mobile devices that primarily run CPU-intensive multimedia applications. We currently focus on reducing CPU energy in such devices for two reasons: First, the CPU is one of the highest energy consumers in our targeted mobile devices. For example, we measured the energy consumption of a laptop and found that the CPU consumes 15% to 52% of the total energy, depending on the application workload. Second, mobile CPUs on the market today (e.g., Intel Pentium-M [Intel 2004] and Mobile AMD Athlon 4 [AMD 2001]) allow software, typically the operating system, to trade off their performance for energy through the Advanced Configuration and Power Interface (ACPI) standard [Compaq et al. 2000]. We next introduce models and assumptions on the CPU and multimedia applications in our target systems.

2.1 CPU Model

We consider mobile devices with a single processor, which has a single core and is single-threaded. The processor is adaptive with multiple speeds (frequencies), {s1, · · · , sK}, trading off performance for energy. The CPU power (energy consumption rate) typically consists of three major parts:

    C × s × V²  +  V × Isc  +  V × Ileak        (1)
    (dynamic power)  (short-circuit power)  (leakage power)


Table I. Speed-power relationship for the HP N5470 laptop with a single Mobile AMD Athlon 4 CPU (measured using the Agilent 54621A oscilloscope with 95% confidence intervals)

    CPU speed s (MHz)    Total power p(s) (Watt)
    300                  22.25 ± 3.14
    500                  25.84 ± 3.28
    600                  28.24 ± 3.25
    700                  31.05 ± 3.25
    800                  35.44 ± 3.81
    1000                 39.06 ± 4.31

where C is the loading capacitance, s is the speed, V is the voltage, Isc is the short-circuit current, and Ileak is the leakage current [Chandrakasan et al. 1992]. At a lower speed, the CPU can operate at a lower voltage, thus consuming less power. A lower CPU speed, however, may increase the power of other resources such as memory. Since our goal is to reduce the total energy consumed by the whole device, we are more interested in the total power consumed by the device at different CPU speeds. In general, the relationship between the speed s and the total device power p(s) can be obtained via measurements. Table I, for example, shows the relationship, measured with an Agilent oscilloscope, for the HP N5470 laptop with a single Mobile AMD Athlon 4 CPU [AMD 2001]. Without loss of generality, we assume that the total power decreases as the CPU speed decreases. If this assumption does not hold, we can use the critical power slope [Miyoshi et al. 2002] to avoid speeds that consume more power but provide lower performance than another speed.
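The critical-power-slope idea can be sketched as a simple filter over the measured speed-power table. The code below is an illustrative sketch of ours, using the Table I measurements: it discards any speed that a faster speed dominates in total power.

```python
# Measured speed-power pairs from Table I (MHz, Watt).
table = [(300, 22.25), (500, 25.84), (600, 28.24),
         (700, 31.05), (800, 35.44), (1000, 39.06)]

def useful_speeds(table):
    """Drop speeds dominated by a faster speed with equal or lower total power."""
    return [(s, p) for s, p in table
            if not any(s2 > s and p2 <= p for s2, p2 in table)]

print([s for s, _ in useful_speeds(table)])
# With Table I, total power rises monotonically with speed, so no speed is dropped.
# A hypothetical 900 MHz mode at 40.0 W, however, would be dropped, since
# 1000 MHz delivers more performance at only 39.06 W.
```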

2.2 Application Model

We consider long-lived, CPU-intensive multimedia processes (or threads) and assume that processes are independent from each other. Each process releases and executes jobs periodically, e.g., decoding 30 video frames per second. Different processes may have different periods. Individual jobs of the same process (e.g., decoding I, P, and B frames of an MPEG video) may need different amounts of cycles. Each job has a release time and a soft deadline, which is defined as its release time plus the period. By soft deadline, we mean that a job may or may not finish by the deadline. If a job finishes after the deadline, the next job is released immediately; otherwise, the next job is released at the deadline of the previous job (e.g., if a video decoder finishes a frame early, it waits for a while and then starts the next frame). Note that we use the above roughly periodic model, rather than the strictly periodic model of hard real-time systems, since multimedia applications process frames with an average frame rate. Multimedia jobs need to meet most of their deadlines (e.g., 95%), since they have to satisfy soft real-time performance requirements. We use the statistical performance requirement, ρ, to denote the probability that a process should meet job deadlines; e.g., if ρ = 0.95, then the process needs to meet 95% of deadlines. In general, application developers or users can specify the parameter ρ, based on application characteristics (e.g., audio has a higher ρ than video) or user preferences (e.g., ρ can become lower when the battery is low).
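The release rule of this roughly periodic model can be sketched as follows (an illustrative sketch of ours; the function name and time units are hypothetical, with times in milliseconds):

```python
def next_release(release, period, finish):
    """Release time of the next job under the roughly periodic model."""
    deadline = release + period
    if finish > deadline:
        # The job missed its soft deadline: release the next job immediately.
        return finish
    # The job finished early: wait until the previous job's deadline.
    return deadline

print(next_release(0, 40, 25))  # early finish -> next job at the deadline, 40
print(next_release(0, 40, 55))  # deadline miss -> next job immediately, 55
```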

3. DESIGN AND ALGORITHMS

Our goals are to save as much energy as possible and meet the statistical performance requirement of multimedia processes. There is a conflict in these two goals: For performance support, the CPU often needs to run fast, typically consuming more energy. For energy saving, the CPU should run slowly. On the other hand, simply running the CPU at the highest speed does not necessarily meet multimedia requirements [Nieh and Lam 2003; Jones et al. 1997]. We therefore need to consider the combination of predictable CPU scheduling and speed scaling. To do this, we enhance the CPU scheduler with dynamic speed scaling. That is, the scheduler decides at what CPU speed to execute processes in addition to when to execute what processes. This enhanced scheduler, called GRACE-OS, consists of three major components: a profiler, a soft real-time (SRT) scheduler, and a speed adaptor, as shown in Figure 1. The profiler monitors the cycle usage of individual processes, and automatically derives the probability distribution of their cycle demand. The SRT scheduler allocates cycles to individual processes based on their demand distribution, and enforces their allocation for soft performance guarantees. The speed adaptor adjusts the CPU speed dynamically to save energy. In particular, the speed adaptor adapts each process's execution speed based on its time allocation, which is provided by the SRT scheduler, and its demand distribution, which is provided by the profiler.

[Figure 1 appears here: GRACE-OS monitors multimedia processes and takes their performance requirement; the profiler supplies the demand distribution to the SRT scheduler and the speed adaptor; the SRT scheduler performs scheduling and supplies the time allocation to the speed adaptor, which applies speed scaling to the CPU.]
Fig. 1. GRACE-OS architecture: the enhanced scheduler performs real-time scheduling and speed scaling, based on the online-profiled demand distribution of individual processes.

Operationally, GRACE-OS achieves the energy-efficient scheduling via an integration of demand estimation, SRT scheduling, and speed scaling, which are performed by the profiler, SRT scheduler, and speed adaptor, respectively. We next describe these operations in detail.

3.1 Estimation of Demand Distribution

Predictable scheduling and speed scaling both depend on the prediction of each process's cycle demand. Hence, the first step in GRACE-OS is to estimate the probability distribution of cycle demand of individual processes. We estimate the demand distribution, rather than the instantaneous demand, for two reasons: First, the former is much more stable and predictable than the latter (as demonstrated in Section 5.1). Second, allocation based on the demand distribution of individual processes delivers a statistical performance guarantee, which is sufficient for our targeted multimedia applications. We next discuss how to estimate the demand distribution for each process. The estimation involves two steps: profiling the cycle usage and deriving its probability distribution.

3.1.1 Online Profiling. Recently, a number of measurement-based profiling mechanisms have been proposed [Anderson et al. 1997; Urgaonkar et al. 2002; Zhang et al. 1997]. Profiling can be performed online or off-line. Off-line profiling provides more accurate estimation with the whole trace of CPU usage, but is not applicable to live applications. We therefore take the online profiling approach. Specifically, we add a cycle counter into the process control block of each process. This counter measures the number of cycles elapsed between the process's switch-in and switch-out in context switches. The sum of these elapsed cycles during a job execution gives the number of cycles the job uses. Figure 2 illustrates this kernel-based online profiling technique.

[Figure 2 appears here: timestamps c1 through c10 are taken at each switch-in, switch-out, and job finish, giving cycles for the jth job = (c2 - c1) + (c4 - c3) and cycles for the (j+1)th job = (c6 - c5) + (c8 - c7) + (c10 - c9). Legend: "in" means the process is switched in for execution; "out" means the process is switched out for suspension; "finish" means the process finishes a job.]
Fig. 2. Kernel-based profiling: monitoring the number of cycles elapsed between each process's switch-in and switch-out in context switches.
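In sketch form, the per-process accounting amounts to the following (an illustrative sketch of ours; the real implementation lives in the Linux context-switch path and reads the hardware cycle counter):

```python
class CycleProfiler:
    """Accumulates the cycles a process consumes between switch-in and switch-out."""

    def __init__(self):
        self.job_cycles = 0       # cycles charged to the current job so far
        self.switched_in_at = None

    def switch_in(self, tsc):     # tsc: current cycle-counter reading
        self.switched_in_at = tsc

    def switch_out(self, tsc):
        self.job_cycles += tsc - self.switched_in_at
        self.switched_in_at = None

    def finish_job(self, tsc):
        """Called when the process reports job completion; returns the job's usage."""
        self.switch_out(tsc)
        used, self.job_cycles = self.job_cycles, 0
        return used

# Mirroring the jth job of Figure 2, which runs over [c1, c2] and [c3, c4]:
p = CycleProfiler()
p.switch_in(100); p.switch_out(150)   # c1 = 100, c2 = 150
p.switch_in(200)                      # c3 = 200
print(p.finish_job(230))              # c4 = 230 -> (150-100) + (230-200) = 80
```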

We want to clarify three issues regarding the above profiling. First, multimedia processes often use system calls to tell the kernel when they finish a job. For example, when an MPEG decoder finishes a frame decoding, it may call sleep to wait for the next frame [Flautner and Mudge 2002; Banachowski et al. 2004]. Second, although the number of cycles demanded by a job changes with different CPU speeds, primarily due to memory stall, this variation is typically small for multimedia applications. For example, we profiled several codecs of speech, audio, and video and found that the cycle usage of the same job changes less than 12% at different speeds. Further, our online profiling reduces the variation since, as discussed later, the CPU speed changes within each job execution. Finally, when used with resource containers [Banga et al. 1999], our proposed profiling technique can be more accurate by subtracting cycles consumed by the kernel (e.g., for interrupt handling and deferred system calls). We currently do not subtract these cycles, since they are typically negligible relative to the number of cycles consumed by a multimedia job.
Our profiling technique is distinguished from others [Anderson et al. 1997; Urgaonkar et al. 2002; Zhang et al. 1997] for three reasons. First, it profiles cycles during runtime, without requiring an isolated profiling environment (e.g., as in [Urgaonkar et al. 2002]). Second, it is customized for counting cycle usage, and it is simpler than general profiling systems that also assign counts to program functions [Anderson et al. 1997; Zhang et al. 1997]. Finally, it incurs a small overhead, which arises only when updating the cycle counter before a context switch. There is no additional overhead, e.g., due to sampling interrupts [Anderson et al. 1997; Zhang et al. 1997].

3.1.2 Online Estimation. We next employ a simple yet effective histogram technique to estimate the probability distribution of cycle usage of each process. Specifically, we use a profiling window to keep track of the number of cycles consumed by n jobs of the process, similar to the history-based policy in Odyssey [Flinn and Satyanarayanan 2004]. The parameter n can be either specified by the process or set to a default value (e.g., the last 100 jobs). Let Cmin and Cmax be the minimum and maximum number of cycles, respectively, for jobs in the window. We obtain a histogram from the cycle usage as follows:

(1) We use Cmin = b0 < b1 < · · · < br = Cmax to split the range [Cmin, Cmax] into r equal-sized groups, (bi−1, bi], 1 ≤ i ≤ r. We refer to the points {b0, b1, ..., br} as the group boundaries.
(2) Let ni be the number of cycle usage samples that fall into the ith group (bi−1, bi]. The ratio ni/n represents the probability that the process's cycle usage is between bi−1 and bi; the ratio (n1 + · · · + ni)/n represents the probability that the process needs no more than bi cycles.
(3) For each cycle group, we plot a rectangle in the interval (bi−1, bi] with height (n1 + · · · + ni)/n. All rectangles together form a histogram, as shown in Figure 3.

[Figure 3 appears here: the histogram rectangles rise from Cmin = b0 through b1, b2, ..., br−1, br = Cmax along the cycle demand axis, tracking the cumulative distribution function F(x) up to 1 on the cumulative probability axis.]
Fig. 3. Histogram-based estimation: the histogram approximates the cumulative distribution function of a process's cycle demand.
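A minimal sketch of this estimation, ours and with hypothetical names, might look like the following: it splits the profiling window into r equal-sized groups and accumulates the group counts into a cumulative distribution.

```python
def histogram_cdf(samples, r=10):
    """Return (group boundaries b0..br, cumulative probabilities F(b1)..F(br))."""
    cmin, cmax = min(samples), max(samples)
    width = (cmax - cmin) / r
    boundaries = [cmin + i * width for i in range(r + 1)]
    counts = [0] * r
    for c in samples:
        # clamp cmax itself into the last group
        i = min(int((c - cmin) / width), r - 1) if width else 0
        counts[i] += 1
    n, total, cdf = len(samples), 0, []
    for k in counts:
        total += k
        cdf.append(total / n)   # F(b_i) ~= (n_1 + ... + n_i) / n
    return boundaries, cdf

# A window of 100 jobs demanding 1..100 megacycles:
bounds, cdf = histogram_cdf(list(range(1, 101)), r=10)
print(cdf[4], cdf[-1])  # about half the jobs need <= b5 cycles; F(br) = 1.0
```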

From a probabilistic point of view, the above histogram of a process approximates the cumulative distribution function of the process's cycle demand, i.e.,

    F(x) = P[X ≤ x]        (2)


where X is the random variable of the process's cycle demand. In particular, the rectangle height (n1 + · · · + ni)/n of a group (bi−1, bi] approximates the cumulative distribution at bi, i.e., the probability that the process demands no more than bi cycles. In this way, we can estimate the cumulative distribution for the group boundaries of the histogram, i.e., F(x) for x ∈ {b1, · · · , br}. Unlike distribution parameters such as the mean and standard deviation, the above histogram describes the full demand distribution. This property is necessary for statistical speed scaling (see Section 3.3). On the other hand, compared to distribution functions such as normal and gamma (e.g., PACE [Lorch and Smith 2004]), the histogram-based estimation does not need to configure the parameters of distribution functions. It is also easy to update, with low overhead, when the demand distribution changes, e.g., due to a video scene change.

3.2 Statistical Real-Time Scheduling

Multimedia processes have computational requirements that must be met in soft real time (e.g., decoding a frame within a period). To support such timing requirements, the operating system needs to provide soft real-time CPU scheduling, typically in two steps: predictable cycle allocation and enforcement. The key problem in the first step is deciding the amount of cycles allocated to each process. Over-allocation (e.g., worst-case-based) may waste energy, while under-allocation (e.g., average-based) may degrade QoS. GRACE-OS instead takes a statistical approach: The scheduler decides cycle allocation based on the statistical performance requirement and demand distribution of each process. The purpose of this statistical allocation is to improve CPU and energy utilization, while delivering soft performance guarantees. Specifically, let ρ be the statistical performance requirement of a process; that is, the process needs to meet a fraction ρ of its deadlines. In other words, each job of the process should meet its deadline with probability ρ. To support this requirement, the scheduler allocates C cycles to each job of the process, such that the probability that each job requires no more than the allocated C cycles is at least ρ, i.e.,

    F(C) = P[X ≤ C] ≥ ρ        (3)

To find this parameter C for a process, we search the group boundaries, {b1, · · · , br}, of its histogram to find the smallest bm whose cumulative distribution is at least ρ, i.e., F(bm) = P[X ≤ bm] ≥ ρ. We then use this bm as the parameter C. Figure 4 illustrates the statistical allocation process. Each process first enters the system in a best-effort mode. The scheduler tries to put a process into soft real-time mode when it first determines the parameter C of the process. At this time, the scheduler performs an admission control based on the earliest deadline first (EDF) scheduling algorithm [Liu and Layland 1973; Jones et al. 1997; Leslie et al. 1996]. We use EDF, rather than the rate monotonic scheduling policy, since EDF can achieve full CPU utilization, which wastes less energy. In particular, the scheduler ensures that the total CPU utilization at the highest CPU speed is no more than one. That is,

    Σ_{i=1}^{n} (Ci / sK) / Pi ≤ 1        (4)

where there are n soft real-time processes, each with cycle demand Ci and period Pi, 1 ≤

[Figure 4 appears here: on the histogram's cumulative probability axis, the statistical performance requirement ρ selects the smallest boundary bm, between Cmin = b0 and br = Cmax on the cycle demand axis, as the cycle allocation C.]
Fig. 4. Statistical cycle allocation: allocating the smallest bm with P[X ≤ bm] ≥ ρ.

i ≤ n, and sK is the highest CPU speed. If the admission test succeeds, the new process enters soft real-time mode. When the parameter C is updated for a process due to a change of its demand distribution, the scheduler also makes the above admission test. The process's cycle allocation is updated if the admission test succeeds and does not change otherwise. To prevent the starvation of best-effort processes, we treat them as a logical soft real-time process with P = 1 second and C = 100 × 10^6 cycles. This best-effort allocation enables multimedia processes that initially enter in best-effort mode to make progress and enter real-time mode. Before a multimedia process enters real-time mode, it may miss many deadlines. One possible approach to avoid these deadline misses is to allocate a multimedia process the cycles demanded by its first job. The scheduler uses the EDF algorithm to enforce the allocation. More specifically, the scheduler allocates each process a budget of C cycles every period. It dispatches soft real-time processes based on their deadline and budget, selecting the process with the earliest deadline and a positive budget (if there is no such soft real-time process, best-effort processes are executed). As the selected process executes, its budget is decreased by the number of cycles it consumes. If a process overruns (i.e., it has used up the allocated cycles but has not finished the current job yet), the scheduler can either notify the process to abort the overrun part, or preempt it to run in best-effort mode until its budget is replenished at the beginning of the next period. This overrun protection provides isolation among different processes [Nieh and Lam 2003; Zeng et al. 2002].
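The boundary search for C and the admission test of Equation (4) can be sketched together as follows (an illustrative sketch of ours; names and units are hypothetical, with cycle counts in cycles and periods in seconds):

```python
def cycle_allocation(boundaries, cdf, rho):
    """Smallest group boundary b_m with F(b_m) >= rho, as in Figure 4."""
    for b, F in zip(boundaries[1:], cdf):
        if F >= rho:
            return b
    return boundaries[-1]   # fall back to the worst observed demand

def admit(processes, s_k):
    """EDF admission test of Eq. (4): total utilization at the highest speed <= 1."""
    return sum((C / s_k) / P for C, P in processes) <= 1.0

# An MPEG decoder (12e6 cycles / 40 ms) and an MP3 decoder (1e6 cycles / 20 ms)
# on a CPU whose highest speed is 1000 MHz:
procs = [(12e6, 0.040), (1e6, 0.020)]
print(admit(procs, 1000e6))  # utilization 0.30 + 0.05 = 0.35 -> admitted
```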

3.3 Statistical Speed Scaling

Soft real-time scheduling determines when to execute what processes. We next discuss another scheduling dimension: at what CPU speed to execute processes (i.e., CPU speed scaling). The purpose of the speed scaling is to save energy, while preserving the performance guarantees of the above real-time scheduling. To do this, GRACE-OS adapts the CPU speed during the process execution. In this subsection, we explain the motivation for the statistical speed scaling, then introduce an operating system abstraction, the speed schedule, for the speed scaling, and finally describe how to calculate the speed schedule for each individual process.


3.3.1 Motivation for Statistical Speed Scaling. The intuitive idea for speed scaling is to assign a uniform speed, one that meets the aggregate CPU demand of the current process set, to execute all processes until the process set changes. Assume there are n processes, each allocated Ci cycles per period Pi. The aggregate CPU demand of all processes is ∑_{i=1}^{n} Ci/Pi cycles per second. To meet this demand, the CPU only needs to run at speed ∑_{i=1}^{n} Ci/Pi Hz. For example, if an MPEG video decoder and an MP3 audio decoder are running concurrently and are allocated 12 × 10⁶ cycles every 40 ms and 10⁶ cycles every 20 ms, respectively, the CPU speed can be set to

(12 × 10⁶)/(40 ms) + (1 × 10⁶)/(20 ms) = 350 MHz.

If each process used exactly its allocated cycles, this uniform-speed technique would consume minimum energy due to the convex nature of the speed-power function [Aydin et al. 2001; Ishihara and Yasuura 1998]. However, the instantaneous cycle demand of multimedia processes often changes. In particular, a process may, and often does, complete a job before using up its allocated cycles. For example, if a process is allocated cycles based on its 95th percentile CPU demand (i.e., it requires meeting 95% of deadlines), then about 95% of its jobs complete early with residual cycle budget. Such early completion often results in CPU idle time. When the CPU is idle, the device still consumes energy, thereby wasting energy. To save this energy, we need to adjust the CPU speed dynamically. In general, there are two dynamic speed scaling approaches: (1) starting a job at the above uniform speed and then decelerating when it completes early, and (2) starting a job at a lower speed and then accelerating as it progresses [Lorch and Smith 2004; Gruian 2001; Xu et al. 2004]. The former is conservative, assuming that a job will use all its allocated cycles; the latter is aggressive, assuming that a job will use fewer cycles than allocated.
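The uniform-speed calculation can be sketched as a short script, using the video/audio allocations from the example above (the helper name is ours, for illustration):

```python
def uniform_speed_hz(allocs):
    """Aggregate CPU demand of a process set: sum of C_i / P_i (cycles/s),
    where allocs is a list of (cycles_per_period, period_seconds) pairs."""
    return sum(cycles / period_s for cycles, period_s in allocs)

# MPEG video: 12e6 cycles every 40 ms; MP3 audio: 1e6 cycles every 20 ms.
speed = uniform_speed_hz([(12e6, 0.040), (1e6, 0.020)])
print(round(speed / 1e6))  # prints 350 (MHz), as in the example
```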
In comparison, the second approach saves more energy for jobs that complete early, because these jobs avoid execution at the high (energy-expensive) speeds. GRACE-OS takes the second approach, since the statistical cycle allocation means that most multimedia jobs use fewer cycles than allocated. To illustrate why this approach saves energy, we next consider an intuitive, simplified example. Assume that (1) a video encoder demands 10⁶ cycles to encode 80% of frames and 2 × 10⁶ cycles to encode the other 20% of frames; in other words, for each frame, the first 10⁶ cycles are executed with probability 100% and the second 10⁶ cycles are executed with probability 20%. (2) Each frame should complete within 10 ms. (3) The total power consumption at speed s is s³ × 10⁻¹² Watt; i.e., the energy consumption is s² × 10⁻¹² Joule for executing a CPU cycle at speed s. (4) There is no energy consumption when the CPU is idle.

If we encode frames at a uniform CPU speed, this speed is 200 MHz, which finishes all frames within 10 ms. The average energy consumption per frame is

  100% × 10⁶ cycles × 200² × 10⁻¹² Joule per cycle   (for the first 10⁶ cycles)
+ 20% × 10⁶ cycles × 200² × 10⁻¹² Joule per cycle   (for the second 10⁶ cycles)
= 4.8 × 10⁻² Joule    (5)

Alternatively, we can execute the first 10⁶ cycles at speed 158 MHz and the second 10⁶ cycles at speed 272 MHz. This approach still finishes all frames within 10 ms (i.e., (1 × 10⁶ cycles)/(158 MHz) + (1 × 10⁶ cycles)/(272 MHz) ≈ 10 ms). The average energy consumption per frame becomes

  100% × 10⁶ cycles × 158² × 10⁻¹² Joule per cycle   (for the first 10⁶ cycles)
+ 20% × 10⁶ cycles × 272² × 10⁻¹² Joule per cycle   (for the second 10⁶ cycles)
= 3.98 × 10⁻² Joule    (6)
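The arithmetic of equations (5) and (6) can be checked with a short script using the example's power model, in which a cycle at speed s MHz costs s² × 10⁻¹² Joule:

```python
COEFF = 1e-12  # Joules per cycle per MHz^2, from the example's power model

def expected_energy(segments):
    """Expected energy per frame. Each segment is a tuple
    (probability it executes, cycles, speed in MHz)."""
    return sum(p * c * s**2 * COEFF for p, c, s in segments)

# Uniform 200 MHz vs. statistical 158/272 MHz schedule for the same frame.
uniform = expected_energy([(1.0, 1e6, 200), (0.2, 1e6, 200)])
stochastic = expected_energy([(1.0, 1e6, 158), (0.2, 1e6, 272)])
print(f"{uniform:.3e}")     # prints 4.800e-02 (equation 5)
print(f"{stochastic:.3e}")  # prints 3.976e-02 (equation 6)
print(round(1 - stochastic / uniform, 2))  # savings, ≈ 0.17
```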

That is, the second approach saves energy by 1 − 3.98/4.8 = 17.08% compared to the first approach. This clearly illustrates the energy benefit of speed scaling within a job. We next discuss how GRACE-OS adapts the speed during each job execution.

3.3.2 Speed Schedule. To adapt the execution speed for each job of a process, we introduce a speed schedule for each process. The speed schedule is a list of scaling points. Each point (x, y) specifies that a job accelerates to speed y when it has used x cycles. Among the points in the list, the larger the cycle number x, the higher the speed y. The list is sorted in ascending order of the cycle number x (and hence of the speed y). According to this speed schedule, a process always starts a job at the speed of the first scaling point. As the job executes, the scheduler monitors the job's cycle usage. When the usage reaches the cycle number of the next scaling point, the job's execution is accelerated to that point's speed.

Figure 5-(a) shows an example of a process's speed schedule with four scaling points. Figure 5-(b) shows the corresponding speed scaling for three jobs of the same process. Each job starts at speed 100 MHz and accelerates as it progresses. If a job needs fewer cycles, it avoids the high-speed execution. For example, the first job requires 1.6 × 10⁶ cycles and thus needs to execute at speeds 100 and 120 MHz only. This example does not consider the overhead of changing the CPU speed, which is about 17 to 28 microseconds per speed change.

Fig. 5. Example of speed schedule and corresponding speed scaling for job execution: each job starts slowly and accelerates as it progresses. (a) A speed schedule with four scaling points: (0 cycles, 100 MHz), (1 × 10⁶, 120 MHz), (2 × 10⁶, 180 MHz), and (3 × 10⁶, 300 MHz). (b) Speed scaling for three jobs of 1.6 × 10⁶, 2.5 × 10⁶, and 3.9 × 10⁶ cycles using the schedule in (a).

Each soft real-time process has its own speed schedule, which applies to all its jobs. In other words, GRACE-OS changes the CPU speed in three cases (Figure 6):

—Context switch. After a context switch, the scheduler sets the CPU speed based on the speed schedule of the switched-in process. This isolates speed scaling among different processes.

—New job. When the current process releases a new job, its execution speed is reset to the speed of the first point in its speed schedule.

—Job progress. The scheduler also monitors the progress of each job execution and changes the CPU speed when the job reaches the next scaling point.

Fig. 6. GRACE-OS changes the CPU speed during job execution and at context switch.

Best-effort processes, however, do not have their own speed schedule. When a best-effort process is switched in, the scheduler does not change the speed; that is, the best-effort process executes at the CPU speed of the previous real-time process.

Having described how the speed schedule is used, we next discuss how to construct it for each process. This construction depends on the demand distribution of the process as well as on the available CPU speeds and the corresponding power consumption. We consider two kinds of CPU hardware for the construction: We first consider a simple case by assuming an ideal processor that can change speed in a continuous manner and whose power consumption is proportional to the cube of the speed. We then consider non-ideal processors (e.g., most mobile processors in the market today) that support a discrete set of speeds and whose power consumption does not scale in a cubic manner. For both cases, we describe the problem formulation for the speed schedule and then present the solution.
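The speed-schedule lookup described in Section 3.3.2 can be sketched as follows, using the four scaling points of Figure 5-(a). The class name and interface are illustrative, not GRACE-OS's kernel implementation:

```python
import bisect

class SpeedSchedule:
    """A sorted list of scaling points (cycles_used, speed_mhz). A job starts
    at the first point's speed and accelerates as its cycle usage crosses
    later points (Section 3.3.2)."""
    def __init__(self, points):
        self.cycles = [x for x, _ in points]   # ascending cycle numbers
        self.speeds = [y for _, y in points]   # ascending speeds

    def speed_for(self, cycles_used):
        # Highest scaling point whose cycle number is <= the usage so far.
        i = bisect.bisect_right(self.cycles, cycles_used) - 1
        return self.speeds[max(i, 0)]

# The four scaling points of Figure 5-(a).
sched = SpeedSchedule([(0, 100), (1e6, 120), (2e6, 180), (3e6, 300)])
print(sched.speed_for(0))      # prints 100: a new job starts at the first point
print(sched.speed_for(1.6e6))  # prints 120: job 1 never needs the higher speeds
print(sched.speed_for(3.9e6))  # prints 300
```

In the kernel, the same lookup would run at the three adaptation points listed above: context switch (using the switched-in process's schedule), new-job release (resetting to the first point), and job progress (advancing to the next point).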


3.3.3 Calculating Speed Schedule for Ideal Processors. By an ideal CPU, we assume that (1) the CPU can change speed continuously, and the speed change does not affect the power of other resources such as memory; and (2) the CPU power is dominated by the dynamic power, which is proportional to the speed and to the square of the voltage, where the voltage is itself proportional to the speed. That is, in an ideal CPU, a lower speed results in a cubic power reduction. With these two assumptions, we want to find the speed schedule that minimizes the energy consumed during each job execution while finishing each job within a certain amount of time. The reason for bounding the execution time is to avoid missing the deadline of the job or of other jobs executed after it. To do this, we allocate a time budget to each job. Specifically, if there are n concurrent processes and each process is allocated Ci cycles per period Pi, then the ith process (1 ≤ i ≤ n) is allocated

T_i = C_i / (∑_{j=1}^{n} C_j/P_j)    (7)

time units per period Pi (i.e., for each of its jobs). That is, we distribute time among all processes in proportion to their cycle demand. Intuitively, if there is only a single process, its time budget equals its period; if multiple processes run concurrently, they must share time with each other and hence each gets a shorter time budget. By using the cycle and time allocations together, we can adapt the execution speed for each job as long as the job can use its allocated cycles within its allocated time.

Problem Formulation. For each process, we find a speed for each of its allocated cycles, such that the total energy consumption of these allocated cycles is minimized while their total execution time is no more than the allocated time. Specifically, if a cycle x executes at speed s(x), its execution time is 1/s(x) and its energy consumption is proportional to (1/s(x)) × s³(x) = s²(x). Since a process requires cycles statistically, it uses each of its allocated cycles with a certain probability. As a result, the expected energy consumption of cycle x is proportional to (1 − F(x))s²(x), where F(x) is the cumulative distribution function defined in Equation (2). In this way, constructing the speed schedule for a process is equivalent to¹:

minimize:   ∑_{x=1}^{C} (1 − F(x)) s²(x)    (expected energy for each job)    (8)

subject to: ∑_{x=1}^{C} 1/s(x) ≤ T          (execution time for each job)    (9)
where C and T are the process's allocated cycles and allocated time per period, respectively. To solve the above constrained optimization, we need to know the cumulative distribution F(x) for each allocated cycle. However, our profiler estimates the cumulative distribution only at the group boundaries; that is, we know F(x) only for x ∈ {b₁, ..., b_m}, where b_m equals the number of allocated cycles, C.

¹Recall that for an ideal CPU, the energy consumed by other resources does not change, so we only need to minimize the CPU energy. Further, if a process demands more cycles than allocated for a job execution, the overrun part (i.e., x > C) can be executed at the highest speed.
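The time-budget division of Equation (7) can be illustrated with the earlier video/audio allocations; the helper name is ours, not the paper's:

```python
def time_budgets(allocs):
    """Per-job time budget T_i = C_i / sum_j(C_j / P_j)  (Equation 7).
    allocs is a list of (cycles_per_period, period_seconds) pairs."""
    total_rate = sum(c / p for c, p in allocs)   # the uniform speed, cycles/s
    return [c / total_rate for c, _ in allocs]

# Video (12e6 cycles / 40 ms) and audio (1e6 cycles / 20 ms): the uniform
# speed is 350 MHz, so each job's budget is the time its cycles take at it.
t_video, t_audio = time_budgets([(12e6, 0.040), (1e6, 0.020)])
print(round(t_video * 1e3, 2))  # prints 34.29 (ms, under the 40 ms period)
print(round(t_audio * 1e3, 2))  # prints 2.86  (ms, under the 20 ms period)
```

Note that with a single process, `time_budgets([(12e6, 0.040)])` returns the full 40 ms period, matching the intuition stated above.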


To address this problem, we use a piece-wise approximation technique that divides the allocated cycles into groups and finds a speed for each cycle group, rather than for each individual cycle. In particular, for each group (b_{i−1}, b_i], 1 ≤ i ≤ m, we find a speed s(b_i) and use this speed for all cycles in the group. That is, we rewrite the above constrained optimization as:

minimize:   ∑_{i=1}^{m} g_i × (1 − F(b_i)) s²(b_i)    (10)

subject to: ∑_{i=1}^{m} g_i × 1/s(b_i) ≤ T

where g_i is the size of the ith cycle group, i.e., g_i = b₁ for i = 1, and g_i = b_i − b_{i−1} for 1 < i ≤ m.
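The grouped program above is a convex optimization, and its KKT conditions yield a closed form in which each group's speed is proportional to (1 − F(b_i))^(−1/3), scaled so that the time budget T is met exactly; this is the same form as PACE-style scaling [Lorch and Smith 2004]. The sketch below applies this closed form to a hypothetical three-group distribution; it assumes F(b_m) < 1, which holds because b_m is a statistical (e.g., 95th percentile) allocation:

```python
def group_speeds(boundaries, cdf, T):
    """Speeds s_i minimizing sum g_i (1 - F(b_i)) s_i^2 subject to
    sum g_i / s_i <= T (the grouped program above). The KKT conditions
    give s_i proportional to (1 - F(b_i))^(-1/3); the constant K scales
    the speeds so that the time budget is used exactly."""
    prev = [0.0] + boundaries[:-1]
    groups = [(b - p, 1.0 - F) for p, b, F in zip(prev, boundaries, cdf)]
    K = sum(g * q ** (1 / 3) for g, q in groups) / T
    return [K * q ** (-1 / 3) for _, q in groups]  # ascending speeds

# Hypothetical demand distribution: F(1e6) = 0.5, F(2e6) = 0.9,
# F(3e6) = 0.95, with a 10 ms time budget per job.
speeds = group_speeds([1e6, 2e6, 3e6], [0.5, 0.9, 0.95], T=0.010)
print([round(s / 1e6) for s in speeds])  # MHz, increasing with cycle number
```

The resulting speeds increase with the cycle number, which is exactly the monotone structure the speed schedule requires: later, less likely cycle groups run faster, while the common early cycles run slowly.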