IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 1, JANUARY 2011

Program Phase-Aware Dynamic Voltage Scaling Under Variable Computational Workload and Memory Stall Environment

Jungsoo Kim, Student Member, IEEE, Sungjoo Yoo, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE

Abstract—Most complex software programs are characterized by program phase behavior and runtime distribution. The dynamism of these two characteristics often makes design-time workload prediction difficult and inefficient. In particular, memory stall time, whose variation is significant in memory-bound applications, has mostly been neglected or handled in an overly simplistic manner in previous work. In this paper, we present a novel online dynamic voltage and frequency scaling (DVFS) method which takes into account both program phase behavior and the runtime distribution of memory stall time, as well as computational workload. The online DVFS problem is addressed in two ways: intraphase workload prediction and program phase detection. The intraphase workload prediction predicts the workload based on the runtime distribution of computational workload and memory stall time in the current program phase. The program phase detection identifies to which program phase the current instant belongs and then obtains the predicted workload corresponding to the detected program phase, which is used to set the voltage and frequency during the program phase. The proposed method considers leakage power consumption as well as dynamic power consumption through a temperature-aware combined Vdd/Vbb scaling. Compared to a conventional method, experimental results show that the proposed method provides up to 34.6% and 17.3% energy reduction for two multimedia applications, MPEG4 and H.264 decoders, respectively.

Index Terms—Dynamic voltage and frequency scaling (DVFS), energy optimization, memory stall, phase, runtime distribution.

I. Introduction

Dynamic voltage and frequency scaling (DVFS) is one of the most effective methods for lowering energy consumption. DVFS is also used to suppress leakage energy through dynamic control of the supply voltage (Vdd) and body bias voltage (Vbb). Accurate prediction of the remaining workload (hereafter, workload prediction) plays a central role in DVFS, where the frequency level of the processor is set as the ratio of the remaining workload to the time-to-deadline. The workload of a software program varies due to data dependency (e.g., loop counts), control dependency (e.g., if/else and switch/case statements), and architectural dependency [e.g., cache hits/misses, translation lookaside buffer (TLB) hits/misses, and so on]. To tackle this workload variation, extensive work has been proposed [9]–[13], [19] assuming that the workload (i.e., the number of elapsed clock cycles seen by the processor) is invariant to processor frequency scaling. However, this assumption is not appropriate for applications with significant memory accesses.

Fig. 1(a) shows the distribution of the per-frame workload of an MPEG4 decoder at two different frequency levels, 1 and 2 GHz. It was obtained by decoding 3000 frames of a 1920 × 800 movie clip (an excerpt from Dark Knight) on an LG XNOTE LW25 laptop.¹ As shown in Fig. 1(a), the workload increases as the processor frequency increases. This is due to the processor stall cycles spent waiting for data from external memory (e.g., SDRAM, SSD, and so on). For example, when the memory access time is 100 ns, each off-chip memory access takes 100 and 200 processor clock cycles at 1 GHz and 2 GHz, respectively. Since the memory access time, called the memory stall time, is invariant to the processor clock frequency, the number of processor clock cycles spent on memory accesses grows as the clock frequency increases.

To consider memory stall time in clock frequency scaling, [4]–[6] present DVFS methods which set the clock frequency of the processor based on the decomposition of the whole workload into two clock frequency-invariant workloads: the computational and memory stall workloads. The computational workload is the number of clock cycles spent executing instructions, and the memory stall workload corresponds to the memory stall time.

Manuscript received March 15, 2010; accepted July 27, 2010. Date of current version December 17, 2010. This work was supported in part by the National Research Foundation of Korea Grant funded by the Korean Government, under Grant 2010-0000823, and the Brain Korea 21 Project, the School of Information Technology, Korea Advanced Institute of Science and Technology in 2010. This paper was recommended by Associate Editor H.-H. S. Lee. J. Kim and C.-M. Kyung are with the Korea Advanced Institute of Science and Technology, Daejeon 305-701, South Korea (e-mail: [email protected]; [email protected]). S. Yoo is with the Pohang University of Science and Technology, Pohang 790-784, South Korea (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCAD.2010.2068630
Based on the decomposed workloads, previous methods set the clock frequency, f, as f = w̄comp/(td^R − t̄stall), where w̄comp and t̄stall represent the average (or worst-case) computational workload and memory stall time, respectively, and td^R is the time-to-deadline. Generally, both computational workload and memory stall time have distributions, as shown in Fig. 1(b) and (c). Fig. 1(b) shows the distribution of computational workload caused by data, control, and architectural dependency.

¹ The LG XNOTE LW25 laptop consists of a 2 GHz Intel Core2Duo T7200 processor with 128 KB L1 instruction and data caches, a 4 MB shared L2 cache, and 667 MHz 2 GB DDR2 SDRAM.
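To make the decomposition above concrete, the following toy calculation (all numbers are hypothetical, not measurements from the paper) shows why the total cycle count depends on the clock frequency while the decomposed quantities do not, and how the frequency-setting rule f = wcomp/(tdR − tstall) uses them:

```python
# Sketch of workload decomposition and decomposition-based frequency setting.
# All numeric values are illustrative placeholders.

def total_cycles(x_comp, t_stall_s, f_hz):
    """Total processor cycles = computational cycles + stall cycles.
    Stall *time* is frequency-invariant, so stall *cycles* scale with f."""
    return x_comp + f_hz * t_stall_s

def required_frequency(w_comp, t_deadline_s, t_stall_s):
    """f = w_comp / (t_deadline - t_stall): run the computational cycles
    in whatever time remains after the frequency-invariant memory stalls."""
    return w_comp / (t_deadline_s - t_stall_s)

x_comp = 20e6   # computational cycles per frame (hypothetical)
t_stall = 5e-3  # 5 ms of memory stall time per frame (hypothetical)

# The same frame costs more total cycles at a higher clock frequency.
assert total_cycles(x_comp, t_stall, 2e9) > total_cycles(x_comp, t_stall, 1e9)

# A 33 ms frame deadline leaves 28 ms for computation.
f = required_frequency(x_comp, 33e-3, t_stall)
print(f / 1e6)  # about 714 MHz
```

The key point mirrored here is that x_comp and t_stall are stable under frequency scaling, while the naive total cycle count is not.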

0278-0070/$26.00 © 2010 IEEE

KIM et al.: PROGRAM PHASE-AWARE DYNAMIC VOLTAGE SCALING UNDER VARIABLE COMPUTATIONAL WORKLOAD

Fig. 1. Per-frame profile results of the MPEG4 decoder when decoding the Dark Knight movie clip. (a) Total workload at 1 and 2 GHz. (b) Computational workload. (c) Memory stall time. (d) Phase behavior in the per-frame workload. (e) Runtime distributions of three representative phases.

The distribution of memory stall workload shown in Fig. 1(c) results mostly from L2 cache hits/misses, page hits/misses, and interference (e.g., memory access scheduling [1]) in accessing DRAM. As the distribution of memory stall workload becomes significant, previous DVFS methods based on the average (or worst-case) memory stall workload become inappropriate for reducing energy consumption.

Long-running software programs are mostly characterized by nonstationary phase behavior [14], [15]. For example, multimedia programs (e.g., MPEG4 and H.264 CODECs) have distinct time durations whose workload characteristics (e.g., mean, standard deviation, and maximum runtime) are clearly different from those of other time durations. We call each such distinct time duration a "program phase" [14], [15]. A formal definition of program phase will be given in Section VIII. Fig. 1(d) exemplifies the program phase behavior of the MPEG4 decoder when decoding the first 1000 frames of the movie clip excerpted from Dark Knight. The x-axis and the left-hand side y-axis represent the frame index and per-frame decoding cycles, respectively, while the right-hand side y-axis represents the program phase index. Note that the program phase index does not correspond to the required performance level of the corresponding program phase in this example. As shown in Fig. 1(d), the entire time for decoding 1000 frames is classified into nine program phases, and, within a program phase, the per-frame decoding cycle has a runtime distribution. Fig. 1(e) shows the runtime distributions of three representative program phases out of the nine to illustrate that there can be a wide runtime distribution within each program phase.

A. Our Approach

Our observation of the runtime characteristics of software programs suggests that, as shown in Fig. 1, program workload has two characteristics: nonstationary program phase behavior, and runtime distribution (even within a program phase) of computational workload and memory stall time. Based on these observations, this paper presents an online DVFS method that tackles both characteristics in order to minimize the average energy consumption of a software program. We address the online DVFS problem in two ways: intraphase workload prediction and program phase detection. The intraphase workload prediction predicts workloads based on the runtime distribution of computational workload and memory stall time in the current program phase. The program phase detection identifies to which program phase the current instant belongs and then obtains the intraphase workload prediction of the corresponding program phase, which is used to set the voltage and frequency during the program phase.

Leakage power consumption often dominates total power consumption, especially at high temperature. Our method tackles leakage power consumption with a temperature-aware combined Vdd/Vbb scaling. During runtime, based on temperature readings as well as the runtime distribution, the online method selects an appropriate pair of Vdd and Vbb corresponding to the frequency level from a solution table prepared at design time.

This paper is organized as follows. Section II reviews related works. Section III gives preliminaries on our energy model and profiling method. Section IV presents the problem definition and solution overview, followed by the analytical formulation of our problem in Section V. Sections VI and VII explain the proposed runtime distribution-aware DVFS. Section VIII presents the program phase detection method. Section IX reports experimental results, followed by the conclusion in Section X.

II. Related Works

There are a number of methods for workload prediction in online DVFS based on the (weighted) average, maximum, or most frequent workload, or on finding a repeated pattern among the N most recent workloads [2]. Recently, a control theory-based workload prediction method was proposed to accurately capture the transient behavior of workload [3]. To exploit memory stall time, [4] and [5] present memory stall time-aware methods for soft real-time intertask DVFS which lower


the clock frequency by an amount proportional to the average ratio of external memory accesses per instruction to clock cycles per instruction. However, these memory stall time-aware DVFS methods are based on the average memory stall time and exploit neither the workload distribution nor the nonstationary program phase behavior.

Runtime distribution of computational workload (in most cases, assuming a constant memory stall time) has been studied mostly in intratask DVFS methods, where the performance level is set dynamically during the execution of a task. Several intratask DVFS methods predict the workload based on program execution paths, e.g., the worst-case execution path [7], the average-case execution path [8], and a virtual execution path based on the probability of branch invocation [9]. [10] presents an analytic workload prediction method which minimizes the statistical average dynamic energy consumption. [11] presents a numerical solution for combined Vdd/Vbb scaling to tackle leakage energy. [12] and [13] present a DVFS method, called accelerating frequency schedules, which considers the per-task runtime distribution for a set of independent tasks. All of the works mentioned above assume a constant memory stall time and a single program phase.

The program phase concept has attracted intense research interest because it opens new opportunities for performance optimization, e.g., program phase-aware dynamic adaptation of cache architecture [14], [15]. Various methods have been proposed to characterize program phase behavior. Among them, a vector of the average execution cycles of basic blocks, called the basic block vector (BBV), is most widely used. By characterizing a program phase with a BBV, one can apply the program phase concept to DVFS, as in [16] and [17]. A new program phase is detected when two BBVs are significantly different, e.g., when the Hamming distance between two BBVs is larger than a predefined threshold value.
Because there are a large number of basic blocks in typical software applications, program phase detection using the full BBV is usually impractical. Thus, the key issue is to reduce the dimensionality of the BBV by identifying a subset of basic blocks that represents the program phase behavior. A random linear projection method is described in [14] and [15] to reduce the effort of exploring all combinations of basic blocks to identify the subset. In this paper, we present a program phase detection scheme suitable for DVFS purposes, based on vectors of predicted workloads for coarse-grained code sections (instead of BBVs), as explained in Section VIII. In addition, unlike existing phase-based DVFS methods, our method exploits the runtime distribution within each program phase to better predict the remaining workload.

Several online DVFS methods have been presented to utilize dynamic program behavior for further energy saving. [18] presents a workload prediction method utilizing the Kalman filter, which captures time-varying workload characteristics by adaptively reducing the prediction error via feedback. We previously presented an online workload prediction method which minimizes both dynamic and leakage energy consumption by exploiting the program phase behavior and the runtime distribution of computational cycles within each program phase [19]. Based on the assumption that memory stall time does not vary a

lot during runtime, the distribution of memory stall time is not considered there; instead, the memory stall time is simply accounted for as an integral (nonseparable) part of the total runtime of the software program. However, in memory-bound applications where memory stall time becomes a significant portion of the total program runtime, the distribution of memory stall time needs to be exploited to achieve further energy reduction.

Compared to the method which sets voltage and frequency based on the average computational workload and memory stall time during program runs [4], our method has three distinctive features. First, our approach exploits the runtime distribution of both computational cycles and memory stall time, while only average values are assumed in [4]. Second, we exploit program phase detection to achieve maximal reduction of energy consumption, while [4] utilizes the average workload of the whole program without the notion of program phase. Third, in our method, workload prediction is done in a temperature-adaptive manner to tackle the dependency of leakage energy on temperature, while the temperature dependence is ignored in [4].

III. Preliminary

A. Processor Energy Model

Energy consumption per cycle (e) consists of switching (es) and leakage (el) components. Additionally, in the deep submicron regime, el is further divided into subthreshold (el^sub), gate (el^gate), and junction (el^junc) leakage energy. Putting them all together, we can express the total energy consumption per cycle as follows [20], [21]:

    e = Ceff·Vdd^2 + Ng·f^-1·(Vdd·K1·exp(K2·Vdd)·exp(K3·Vbb) + Vdd·K4·exp(K5·Vdd) + |Vbb|·Ij)    (1)

where Ceff and Ng are the effective capacitance and effective number of gates of the target processor, respectively. ⟨K1, K2, K3⟩, ⟨K4, K5⟩, and Ij are process-dependent curve-fitting parameter sets for el^sub, el^gate, and el^junc, respectively. In particular, the values of ⟨K1, K2, K3⟩ are functions of the operating temperature (T), since el^sub increases exponentially as the operating temperature increases. According to the BSIM4 model and [21], the temperature dependence of the parameters (K1, K2, and K3) is modeled as follows:

    K1(T) ≈ (T/Tref)^2 · exp(K6·(1 − Tref/T)) · K1(Tref)    (2)
    K2(T) ≈ (Tref/T) · K2(Tref)                              (3)
    K3(T) ≈ (Tref/T) · K3(Tref)                              (4)

where Tref is the reference temperature and K6 is a curve-fitting parameter. Thus, ⟨K1, K2, K3⟩ at temperature T can be obtained from the values at Tref using the relationships in (2)–(4).

Since the temperature-aware energy model in (1)–(4) is too complicated to be used in our optimization, we adopt a simplified energy model of combined Vdd/Vbb scaling to approximate the energy consumption per cycle at each temperature T as follows:

    e(f, T) ≈ as(T)·f^bs(T) + al(T)·f^bl(T) + c(T)    (5)
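A short sketch of evaluating the simplified per-cycle energy model of (5). The parameter values below are placeholders chosen only to show the shape of the model, not the fitted values of the paper:

```python
# Toy evaluation of e(f, T) = a_s(T)*f^b_s(T) + a_l(T)*f^b_l(T) + c(T).
# Parameter values are illustrative placeholders.

def energy_per_cycle(f, a_s, b_s, a_l, b_l, c):
    switching = a_s * f ** b_s      # frequency-dependent switching energy
    leakage = a_l * f ** b_l        # frequency-dependent leakage energy
    return switching + leakage + c  # c: frequency-independent base energy

# With a large leakage exponent b_l, leakage is negligible at low
# (normalized) frequency but grows rapidly near the top of the range.
low = energy_per_cycle(0.5, a_s=0.12, b_s=1.3, a_l=5e-9, b_l=20.0, c=0.11)
high = energy_per_cycle(1.0, a_s=0.12, b_s=1.3, a_l=5e-9, b_l=20.0, c=0.11)
assert high > low  # e(f, T) is increasing in f
```

The three terms correspond directly to the three curve-fitting parameter groups that Table I reports per temperature.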


TABLE I
Energy Fitting Parameters for Approximating the Processor Energy Consumption to the Accurate Estimation Obtained from PTscalar with BPTM High-k/Metal Gate 32 nm HP Model for Different Temperatures, Along with the Corresponding (Maximal and Average) Errors

Temperature (°C) | as       | bs  | al       | bl   | c    | Maximum [Avg] Error (%)
25               | 1.2×10^-1 | 1.3 | 4.6×10^-9 | 20.5 | 0.11 | 2.8 [0.9]
50               | 1.2×10^-1 | 1.3 | 2.0×10^-7 | 16.6 | 0.12 | 1.4 [0.5]
75               | 1.2×10^-1 | 1.3 | 2.2×10^-6 | 14.2 | 0.14 | 1.4 [0.4]
100              | 1.2×10^-1 | 1.3 | 1.4×10^-5 | 12.4 | 0.15 | 1.7 [0.7]

Fig. 2. Memory stall time vs. number of L2 cache misses as approximated by a straight line.

where as(T), bs(T) and al(T), bl(T) are sets of curve-fitting parameters which model the frequency-dependent portions of es(T) and el(T), respectively, and c(T) is a curve-fitting parameter corresponding to the frequency-independent portion of e(f, T). Table I shows examples of fitting parameters which approximate the processor energy consumption obtained from PTscalar [21] and Cacti5.3 [22] with the energy model (1)–(4) for the Berkeley predictive technology model (BPTM) high-k/metal gate 32 nm HP model [23] at 25 °C, 50 °C, 75 °C, and 100 °C. In the modeling, we configured a target processor in PTscalar as the best-effort estimate of a Core 2-class microarchitecture using the parameters presented in [24]. As Table I shows, the simplified energy model tracks the original energy model within 2.8% maximum error over all operating temperatures. Note that the fitting parameters for switching energy, i.e., as and bs, do not change with temperature because switching energy consumption is temperature invariant.

Processor energy consumption depends on the type of instructions executed in the pipeline [25]. To account for this dependence simply, we classify processor operation into two states: the computational state, for executing instructions, and the memory stall state, mostly spent waiting for data from memory. When the processor is in the memory stall state, switching energy consumption can be suppressed using clock gating, while leakage energy consumption remains almost the same as in the computational state. The reduction ratio of switching energy, called the clock gating fraction and denoting the fraction of the clock-gated circuit, is modeled as β (0.1 in our experiments). Thus, the energy consumption per clock cycle in each processor state is calculated as follows:

    e^comp  = as·f^bs + al·f^bl + c       (6)
    e^stall = β·as·f^bs + al·f^bl + c     (7)

where e^comp and e^stall represent the energy consumption per cycle in the computational and memory stall states, respectively. Given a desired frequency level (f), one can always find a pair of Vdd and Vbb that gives the minimum energy consumption per cycle using combined Vdd/Vbb scaling [11].

B. Runtime Workload Profiling

The total number of processor execution cycles, x, can be expressed as the sum of the number of clock cycles for executing instructions in the processor, x^comp, and the number of stall cycles for
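The two per-cycle state energies (6) and (7) can be checked numerically. This sketch uses the 25 °C parameters from Table I; the GHz normalization of f is an assumption of the sketch:

```python
# Per-cycle energy in the two processor states, following (6) and (7):
# during memory stalls, clock gating scales the switching term by beta,
# while leakage is unaffected. Table I parameters at 25 C; f in GHz is
# an assumed normalization.

A_S, B_S, A_L, B_L, C = 1.2e-1, 1.3, 4.6e-9, 20.5, 0.11

def e_comp(f):
    return A_S * f ** B_S + A_L * f ** B_L + C           # (6)

def e_stall(f, beta=0.1):
    return beta * A_S * f ** B_S + A_L * f ** B_L + C    # (7)

# A stall cycle is cheaper only by the gated fraction of switching energy.
f = 1.0
saving = e_comp(f) - e_stall(f)
assert abs(saving - (1 - 0.1) * A_S * f ** B_S) < 1e-12
```

Note how the leakage and base terms cancel in the difference: clock gating helps switching energy only, which is why leakage must be attacked separately via Vdd/Vbb scaling.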

accessing an external memory, x^stall, which is expressed as a function of the memory stall time, t^stall, and frequency, f, as follows:

    x = x^comp + x^stall = x^comp + f·t^stall.    (8)

For the decomposition of processor cycles into the two clock frequency-invariant components, i.e., x^comp and t^stall, during program runs, we adopt an online profiling method which uses performance counters in the processor, as presented in [4]. We model t^stall using only the number of last-level cache misses (N^L2miss; in our experiments, L2 is the last-level cache). The rationale for modeling t^stall only with N^L2miss is twofold. First, the effect of last-level cache misses dominates the others (TLB misses, interrupts, and so on) according to our experiments. Second, the number of events simultaneously monitored in a processor is usually limited (in our experimental platform, two events). In our model, t^stall is expressed as follows:

    t^stall = ap·N^L2miss + bp    (9)

where ap and bp are fitting parameters. Fig. 2 illustrates that (9) (solid line) tracks the measured memory stall time (dots) quite well when running the H.264 decoder program in FFMPEG [29].

In a typical software program, the x^comp and t^stall obtained from running a code section are correlated with each other. This is because t^stall of a code section is proportional to the number of external memory references, which is highly correlated with the number of executed memory instructions (e.g., loads and stores) in the code section, while x^comp of a code section depends on the type and number of executed instructions, including memory instructions. To consider the correlation between computational cycles (x^comp) and memory stall time (t^stall), we model the distribution of x^comp and t^stall of a code section using a joint probability density function (PDF), as shown in Fig. 3. During runtime, the joint PDF is obtained as follows. After the execution of a code section, t^stall is obtained from (9). Then, from (8), x^comp is calculated from x and t^stall. The probability of occurrence of a pair of x^comp and t^stall is defined as the ratio of the number of occurrences of the pair to the total number of executions of the code section.
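The profiling pipeline above can be sketched end to end: recover (x^comp, t^stall) from counter readings using (9) then (8), and build the empirical joint PDF by counting quantized pairs. The fitting parameters, counter values, and bin widths below are all hypothetical:

```python
# Online profiling sketch: decompose total cycles via (9) and (8), then
# update an empirical joint PDF by occurrence counting. All numbers are
# hypothetical placeholders.
from collections import Counter

A_P, B_P = 80e-9, 1e-4   # fitting parameters of (9), hypothetical

def decompose(x_total, n_l2_miss, f_hz):
    t_stall = A_P * n_l2_miss + B_P    # (9): stall time from L2 misses
    x_comp = x_total - f_hz * t_stall  # (8): x = x_comp + f * t_stall
    return x_comp, t_stall

counts = Counter()
runs = 0

def update_joint_pdf(x_comp, t_stall, bin_x=1e6, bin_t=1e-3):
    """Quantize the pair into bins; P(pair) = occurrences / total runs."""
    global runs
    runs += 1
    counts[(round(x_comp / bin_x), round(t_stall / bin_t))] += 1

for x_total, misses in [(25e6, 10000), (26e6, 12000), (25e6, 10000)]:
    update_joint_pdf(*decompose(x_total, misses, f_hz=1e9))

pdf = {pair: n / runs for pair, n in counts.items()}
assert abs(sum(pdf.values()) - 1.0) < 1e-12  # a valid probability mass
```

Only two counters (total cycles and L2 misses) are read per code section, matching the two-event monitoring limit mentioned above.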


Fig. 3. Joint PDF with respect to computational workload (xcomp ) and memory stall time (t stall ).

Fig. 4. Solution inputs. (a) Software program (or source code) partitioned into program regions. (b) Energy model (in terms of energy-per-cycle) as a function of frequency. (c) f–v table storing the energy-optimal pairs, (Vdd , Vbb ), for N frequency levels.

IV. Problem Definition and Solution Overview

Fig. 4 illustrates the three types of input required by the proposed procedure. Fig. 4(a) shows a software program partitioned into program regions, each shown as a box. A program region is defined as a code section with an associated voltage/frequency setting. The partitioning can be performed manually by a designer or via an automatic tool [26] based on the execution cycles of code sections obtained by a priori simulation of the software program. The ith program region is denoted as ni, while the first and last program regions are called the root (nroot) and leaf (nleaf) program regions, respectively. In this paper, we focus on a software program which periodically runs from nroot to nleaf in every time interval. At the start of a program region, the voltage/frequency is set and maintained until the end of the program region. At the end of a program region, the computational cycles and memory stall time are profiled. Then, as explained in Section III-B, the joint PDF of computational cycles and memory stall time is updated, as shown in Fig. 3. Fig. 4(b) shows an energy model (more specifically, energy-per-cycle vs. frequency). Fig. 4(c) shows a pre-characterized table, called the f-v table, in which the energy-optimal pair (Vdd, Vbb) is stored for each frequency level (f). When the frequency is scaled, Vdd and Vbb are adjusted to the corresponding levels stored in the table. Note that, due to the dependency of leakage energy on temperature, the energy-optimal values of (Vdd, Vbb) corresponding to f vary with the operating temperature. Therefore, we prepare an f-v table for each of a set of quantized temperature levels.

Algorithm 1: Overall flow
 1: if (end of ni) then
 2:   Online profiling and calculation of statistics (Section III-B)
 3:   if (ni == nleaf) then
 4:     iter++
 5:     if ((iter % PHASE UNIT) == 0) then
 6:       for from nleaf to nroot do
 7:         Workload prediction for each energy component (Section VI)
 8:       end for
 9:       Program phase detection (Section VIII)
10:     end if
11:   end if
12: else if (start of ni) then
13:   Finding workload of ni based on coordination (Section VII)
14:   Voltage/frequency scaling with feasibility check
15: end if
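The control structure of Algorithm 1 can be sketched as a minimal runtime loop. The Region class and its methods are hypothetical stand-ins for the paper's components; only the invocation pattern (profile on region end, predict and detect phases every PHASE UNIT leaf completions) is shown:

```python
# Skeleton mirroring Algorithm 1's invocation structure. The Region
# interface below is a hypothetical stand-in, not the paper's API.

PHASE_UNIT = 3  # small for the demo; e.g., 20-frame units in the paper

class Region:
    """Stand-in for a program region n_i."""
    def __init__(self):
        self.samples = []
        self.predicted = False
    def update_statistics(self, profile):  # online profiling (III-B)
        self.samples.append(profile)
    def predict_workloads(self):           # per-component prediction (VI)
        self.predicted = True

class DVFSController:
    def __init__(self, regions):
        self.regions = regions             # ordered n_root .. n_leaf
        self.iter = 0
        self.phase_detections = 0
    def on_region_end(self, region, profile):   # lines 1-11
        region.update_statistics(profile)
        if region is self.regions[-1]:          # leaf completed
            self.iter += 1
            if self.iter % PHASE_UNIT == 0:
                for r in reversed(self.regions):  # leaf -> root
                    r.predict_workloads()
                self.phase_detections += 1      # phase detection (VIII)

ctrl = DVFSController([Region(), Region()])
for _ in range(6):                              # six program runs
    for r in ctrl.regions:
        ctrl.on_region_end(r, profile=(1e6, 1e-3))
assert ctrl.phase_detections == 2               # at runs 3 and 6
```

The v/f-setting branch (lines 12-15) would hook into `on_region_start`, omitted here since coordination is only defined in Section VII.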

Given the three inputs in Fig. 4, we find the energy-optimal workload prediction, wi^opt, of each program region during program execution. Algorithm 1 shows the overall flow of the proposed method. The proposed method is largely divided into a workload prediction step (lines 1–11) and a voltage/frequency (v/f) setting step (lines 12–15), which are invoked at the end and the start of every program region, respectively. In the workload prediction step, we profile runtime information, i.e., xi^comp and ti^stall, and update the statistical parameters of the runtime distributions, e.g., the mean, standard deviation, and skewness of xi^comp and ti^stall (lines 1–2). After the completion of the leaf program region, the number of program runs, iter, is incremented (line 4). At every PHASE UNIT program runs (line 5), where PHASE UNIT is a predefined number of program runs (e.g., 20-frame decoding in MPEG4), we perform workload prediction and program phase detection using the profiled runtime information and its statistical parameters (lines 5–10). The periodic workload prediction is performed in the reverse order of program flow, as presented in [10], [11], and [19], i.e., from the end (nleaf) to the beginning (nroot) of the program (lines 6–8). As will be explained in Sections V and VI, in this step we find local-optimal workload predictions of ni, each of which minimizes one energy component instead of the total energy.² Utilizing the local-optimal workload predictions, program phase detection identifies which program phase the current instant belongs to (line 9).

In the v/f setting step (lines 12–15), which is performed at the start of each program region, a process called coordination determines the energy-optimal global workload prediction, wi^opt, by combining the local-optimal workload predictions of the detected program phase (line 13). Based on wi^opt, we set the voltage/frequency while satisfying the hard real-time constraint (line 14).

V. Analytical Formulation of Memory Stall Time-Aware DVFS

Assume that a program is partitioned into two program regions, ni and ni+1, and that each program region has

² In this paper, the total energy consumption is calculated as the sum of the five independent energy components shown in (11).


a distinct computational cycle and memory stall time. The energy model presented in Section III-A is used. The total energy consumption for running the two program regions, Ei, is calculated as follows:

    Ei = Ei^comp + Ei^stall    (10)

where Ei^comp and Ei^stall represent the energy consumption for running the computational workload and the memory stall workload, respectively.

Ei^comp and Ei^stall each consist of three independent energy components: frequency-dependent switching energy (Esi^comp and Esi^stall), frequency-dependent leakage energy (Eli^comp and Eli^stall), and frequency-independent energy, called base energy (Ebi^comp and Ebi^stall, where Ebi = Ebi^comp + Ebi^stall). Thus, Ei is expressed as follows:

    Ei = (Esi^comp + Eli^comp) + (Esi^stall + Eli^stall) + Ebi.    (11)

Using (6)–(8), the five energy components in (11) are expressed as follows:

    Esi^comp = as·fi^bs·xi^comp + as·fi+1^bs·xi+1^comp                    (12)
    Eli^comp = al·fi^bl·xi^comp + al·fi+1^bl·xi+1^comp                    (13)
    Esi^stall = β·(as·fi^bs·fi·ti^stall + as·fi+1^bs·fi+1·ti+1^stall)     (14)
    Eli^stall = al·fi^bl·fi·ti^stall + al·fi+1^bl·fi+1·ti+1^stall         (15)
    Ebi = c·(xi^comp + xi+1^comp + fi·ti^stall + fi+1·ti+1^stall).        (16)

The frequency of each program region, fi and fi+1, can be expressed as the ratio of the remaining computational workload prediction (wi and wi+1) to the remaining time-to-deadline prediction for running the computational workload, i.e., the total remaining time-to-deadline (ti^R and ti+1^R) minus the remaining memory stall time prediction (si and si+1), as shown in

    fi = wi / (ti^R − si)           (17)
    fi+1 = wi+1 / (ti+1^R − si+1).  (18)

ti+1^R in (18) is expressed as follows:

    ti+1^R = ti^R − xi^comp/fi − ti^stall.    (19)

By replacing fi and ti+1^R with (17) and (19), fi+1 in (18) is rearranged as follows:

    fi+1 = wi+1 / ((ti^R − si)·γi)    (20)

where

    γi = 1 − xi^comp/wi − t̃i^stall/(ti^R − si)    (21)
    t̃i^stall = (ti^stall + si+1) − si.             (22)

When the memory stall times of ni and ni+1, i.e., ti^stall and ti+1^stall, are unit functions, the remaining memory stall time prediction is set to the sum of the memory stall times of the remaining program regions, i.e., si = ti^stall + ti+1^stall. In the same manner, si+1 is set to ti+1^stall because ni+1 is the leaf, i.e., last, program region in this case. Therefore, t̃i^stall in (22) becomes zero, and thereby γi
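A numeric sanity check of the frequency chaining in (17)-(21) for two regions in the unit-function case, where the modified stall term of (22) vanishes. All values are hypothetical:

```python
# Frequency chaining across two regions, unit-impulse stall times.
# All numeric values are illustrative placeholders.

x_comp_i, x_comp_i1 = 10e6, 8e6      # per-region computational cycles
t_stall_i, t_stall_i1 = 2e-3, 1e-3   # frequency-invariant stall times (s)
tR_i = 30e-3                         # remaining time-to-deadline at n_i

w_i = x_comp_i + x_comp_i1           # remaining computational workload
w_i1 = x_comp_i1
s_i = t_stall_i + t_stall_i1         # remaining stall-time prediction
s_i1 = t_stall_i1

f_i = w_i / (tR_i - s_i)                      # (17)
tR_i1 = tR_i - x_comp_i / f_i - t_stall_i     # (19): deadline after n_i
f_i1 = w_i1 / (tR_i1 - s_i1)                  # (18)

gamma_i = 1 - x_comp_i / w_i                  # (21) with (22) = 0
f_i1_closed = w_i1 / ((tR_i - s_i) * gamma_i) # (20)
assert abs(f_i1 - f_i1_closed) < 1.0          # both forms agree (Hz)

# With exact predictions, both regions end up at the same frequency.
assert abs(f_i - f_i1) < 1.0
```

The last assertion illustrates why the chaining is consistent: when the predictions match the executed workload, (20) reproduces the constant-speed schedule.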


is independent of tiR . Since we perform workload prediction from leaf to root program region as presented in [10], wi+1 is opt already known as wi+1 when calculating wi . With (17)–(20), (12)–(16) can be expressed as functions of wi and tiR . Since Ei is continuous and convex with respect to wi , the energy-optimal workload prediction of computational workopt load, i.e., wi , can be obtained by finding a point which satisfies the following relation: comp

∂Ei ∂Esi = ∂wi ∂wi

comp

+

∂Eli ∂wi

+

∂Esistall ∂Elistall ∂Ebi + + = 0. (23) ∂wi ∂wi ∂wi

Since total energy consumption, Ei , is a function of wi as opt well as tiR , wi satisfying (23) varies with respect to tiR . In opt other words, wi has to be found for every tiR . Because tiR has a wide range of values, performing a workload prediction for every value of tiR is unrealistic. Therefore, we proposed a solution which performs a workload prediction for a set of quantized levels of tiR [28]. However, it also requires a lot of workload predictions since more energy savings can be obtained as tiR is quantized into larger number of quantization levels. Thus, the method causes a large runtime overhead if it is applied as the online solution while maintaining its effectiveness (according to our experiment, the runtime overhead is 3.4 times larger than the pure runtime for H.264 decoder when tiR is quantized into 30 levels). To reduce the runtime overhead of finding an energyoptimal workload prediction, we propose a workload preopt diction method which finds wi in two steps: 1) workload prediction which minimizes each energy component, called local-optimal workload prediction (in Section VI), and 2) coordination of the local-optimal workload predictions to opt obtain global workload prediction wi (in Section VII). A local-optimal workload prediction is to find the workload prediction which minimizes each of the five energy components in (11) by adjusting voltage/frequency based on the workload prediction. For example, voltage/frequency scaling comp based on the local-optimal workload prediction of Esi only comp minimizes energy consumption of Esi . It can be obtained by finding the point which equates the single derivative of comp (23) to zero, i.e., ∂Esi /∂wi = 0. Note that a local-optimal workload prediction can be calculated independently of tiR , because γi in (21) is independent of tiR (∵tistall = 0). 
A coordination of the local-optimal workload predictions finds the workload prediction which minimizes $E_i$ by utilizing the five local-optimal workload predictions. When the derivative of one energy component with respect to $w_i$ dominates the others in (23), the workload prediction satisfying (23) can be obtained by finding the point where the derivative of the dominant energy component becomes zero. For instance, when $\partial Es_i^{comp}/\partial w_i$ dominates the others, $w_i^{opt}$ is simply set to $ws_i^{comp}$. When there are multiple dominant energy components, we need to coordinate them so as to find the workload prediction with lower total energy consumption. Finding the workload prediction satisfying (23) requires a numerical solution whose complexity is too high for runtime use, as presented in [28]. In this paper, we present an efficient approach to coordinate local-optimal workload predictions in a runtime-adaptive manner. We find $w_i^{opt}$ at the start of $n_i$ through the coordination of local-optimal workload predictions.

Fig. 5. Three cases. (a) Case 1: unit functions for both $x_i^{comp}$ and $t_i^{stall}$. (b) Case 2: runtime distribution for $x_i^{comp}$ and unit function of $t_i^{stall}$. (c) Case 3: runtime distributions for both $x_i^{comp}$ and $t_i^{stall}$.

VI. Workload Prediction for Minimizing Energy Consumption of Single Energy Component

In this section, we assume that a program is partitioned into two consecutive program regions, $n_i$ and $n_{i+1}$, and present a method which finds a local-optimal workload prediction while exploiting the runtime distribution of both computational workload and memory stall time. As Fig. 5 shows, we explain the local-optimal workload prediction method for three different cases of $J_i$, the joint PDF of $x_i^{comp}$ and $t_i^{stall}$. Case 1: $J_i$ is given as a unit function while $J_{i+1}$ is a general function, as shown in Fig. 5(a). Case 2: $x_i^{comp}$ alone has a runtime distribution while $t_i^{stall}$ is a unit function, as shown in Fig. 5(b). Case 3: both $x_i^{comp}$ and $t_i^{stall}$ have runtime distributions, as shown in Fig. 5(c).

A. Case 1: Both $x_i^{comp}$ and $t_i^{stall}$ Have Unit Functions

In this subsection, we explain the case where the joint PDF of $n_i$ is given as a unit function, as shown in Fig. 5(a). We define $ws_i^{comp}$, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ as the local-optimal workload predictions for minimizing $Es_i^{comp}$, $El_i^{comp}$, $Es_i^{stall}$, $El_i^{stall}$, and $Eb_i$, respectively. Given the joint PDFs $J_i$ and $J_{i+1}$, the average switching energy consumption for running the computational workload, i.e., $\overline{Es_i^{comp}}$, is calculated as the sum of $Es_i^{comp}$ with respect to $J_i$ and $J_{i+1}$ as follows:

$$\overline{Es_i^{comp}}=\int\!\cdots\!\int Es_i^{comp}\,J_i\,J_{i+1}=\frac{a_s}{(t_i^R-s_i)^{b_s}}\left[w_i^{b_s}\bar{x}_i^{comp}+\left(\frac{ws_{i+1}^{comp}}{1-\bar{x}_i^{comp}/w_i}\right)^{b_s}\bar{x}_{i+1}^{comp}\right] \tag{24}$$

where $\bar{x}_i^{comp}$ and $\bar{x}_{i+1}^{comp}$ represent the averages of $x_i^{comp}$ and $x_{i+1}^{comp}$, respectively. Note that $\bar{x}_i^{comp}$ is fixed as $x_i^{comp}$ since $J_i$ is a unit function in this case, and $w_{i+1}^{opt}$ is replaced by $ws_{i+1}^{comp}$ since we perform the local-optimal workload prediction of $Es_i^{comp}$. Since $\overline{Es_i^{comp}}$ is continuous and convex in $w_i$, $ws_i^{comp}$ can be obtained by finding a point which satisfies

$$\frac{\partial \overline{Es_i^{comp}}}{\partial w_i}=\frac{a_s b_s w_i^{b_s-1}}{(t_i^R-s_i)^{b_s}}\left[\bar{x}_i^{comp}+(ws_{i+1}^{comp})^{b_s}\bar{x}_{i+1}^{comp}\cdot\frac{-\bar{x}_i^{comp}}{(w_i-\bar{x}_i^{comp})^{b_s+1}}\right]=0. \tag{25}$$

By rearranging (25) with respect to $w_i$, we can express $ws_i^{comp}$ in a closed-form expression as follows:

$$ws_i^{comp}=\bar{x}_i^{comp}+\left[(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\right]^{\frac{1}{b_s+1}}=\bar{x}_i^{comp}+\widetilde{ws}_{i+1}^{comp}. \tag{26}$$

Equation (26) shows that $ws_i^{comp}$ consists of two components: 1) the workload of the $i$th program region, i.e., $\bar{x}_i^{comp}$, and 2) $\widetilde{ws}_{i+1}^{comp}=((ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp})^{1/(b_s+1)}$, called the effective remaining workload of $n_{i+1}$ with respect to $Es_i^{comp}$, corresponding to the portion of workload remaining after program region $n_i$. Fig. 5(a) illustrates the calculation of $ws_i^{comp}$ presented in (26), where $J_i$ and $J_{i+1}$ are replaced by their representative workloads, i.e., $\bar{x}_i^{comp}$ and $\widetilde{ws}_{i+1}^{comp}$, respectively. In the same way, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ can be expressed as follows:

$$wl_i^{comp}=\bar{x}_i^{comp}+\left[(wl_{i+1}^{comp})^{b_l}\,\bar{x}_{i+1}^{comp}\right]^{\frac{1}{b_l+1}}=\bar{x}_i^{comp}+\widetilde{wl}_{i+1}^{comp} \tag{27}$$

$$ws_i^{stall}=\bar{x}_i^{comp}+\left[\frac{(ws_{i+1}^{stall})^{b_s+1}\,\bar{t}_{i+1}^{stall}}{\bar{t}_i^{stall}}\right]^{\frac{1}{b_s+2}}=\bar{x}_i^{comp}+\widetilde{ws}_{i+1}^{stall} \tag{28}$$

$$wl_i^{stall}=\bar{x}_i^{comp}+\left[\frac{(wl_{i+1}^{stall})^{b_l+1}\,\bar{t}_{i+1}^{stall}}{\bar{t}_i^{stall}}\right]^{\frac{1}{b_l+2}}=\bar{x}_i^{comp}+\widetilde{wl}_{i+1}^{stall} \tag{29}$$

$$wb_i=\bar{x}_i^{comp}+\left[\frac{wb_{i+1}\,\bar{t}_{i+1}^{stall}}{\bar{t}_i^{stall}}\right]^{\frac{1}{2}}=\bar{x}_i^{comp}+\widetilde{wb}_{i+1} \tag{30}$$

where $\widetilde{wl}_{i+1}^{comp}$, $\widetilde{ws}_{i+1}^{stall}$, $\widetilde{wl}_{i+1}^{stall}$, and $\widetilde{wb}_{i+1}$ are the effective remaining workloads of $n_{i+1}$ with respect to $El_i^{comp}$, $Es_i^{stall}$, $El_i^{stall}$, and $Eb_i$, respectively. Since a local-optimal workload can be calculated by simply summing the effective remaining workloads of the program regions, as shown in (26)–(30), it can be obtained during program runs with negligible runtime overhead.³ If the software program consists of a cascade of program regions with conditional branches, we can still calculate the effective remaining workload of a program region in a manner similar to [10].

³The runtime overhead of the local-optimal workload prediction is presented in Table VI.
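The leaf-to-root summation of (26)–(30) can be sketched as below. This is a minimal illustration of the Case 1 closed forms, not the authors' implementation; the region statistics and the exponents $b_s$, $b_l$ are assumed inputs:

```python
def local_optimal_workloads(x_comp, t_stall, bs=3.0, bl=5.0):
    """Leaf-to-root evaluation of the Case 1 closed forms (26)-(30).
    x_comp[i]: mean computational workload of region n_i (cycles);
    t_stall[i]: mean memory stall time of region n_i;
    bs, bl: switching/leakage exponents of the energy model (assumed).
    Returns the five local-optimal predictions for every region."""
    n = len(x_comp)
    ws = [0.0] * n; wl = [0.0] * n        # ws_i^comp, wl_i^comp
    wss = [0.0] * n; wls = [0.0] * n      # ws_i^stall, wl_i^stall
    wb = [0.0] * n                        # wb_i
    for i in reversed(range(n)):
        if i == n - 1:                    # leaf region: no remaining workload
            ws[i] = wl[i] = wss[i] = wls[i] = wb[i] = x_comp[i]
            continue
        r = t_stall[i + 1] / t_stall[i]   # stall-time ratio in (28)-(30)
        # each prediction = own mean workload + effective remaining workload
        ws[i] = x_comp[i] + (ws[i + 1] ** bs * x_comp[i + 1]) ** (1 / (bs + 1))
        wl[i] = x_comp[i] + (wl[i + 1] ** bl * x_comp[i + 1]) ** (1 / (bl + 1))
        wss[i] = x_comp[i] + (wss[i + 1] ** (bs + 1) * r) ** (1 / (bs + 2))
        wls[i] = x_comp[i] + (wls[i + 1] ** (bl + 1) * r) ** (1 / (bl + 2))
        wb[i] = x_comp[i] + (wb[i + 1] * r) ** 0.5
    return ws, wl, wss, wls, wb
```

One backward pass over the regions suffices, which is why the online cost stays negligible.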

B. Case 2: $x_i^{comp}$ Has a Runtime Distribution and $t_i^{stall}$ Has a Unit Function

In this subsection, we explain the case where $x_i^{comp}$ has a runtime distribution while $t_i^{stall}$ is still assumed to be a unit function, as shown in Fig. 5(b). In this case, the average switching energy consumption of the computational workload is expressed as follows:

$$\overline{Es_i^{comp}}=\int\!\cdots\!\int Es_i^{comp}\,J_i\,J_{i+1}=\frac{a_s}{(t_i^R-s_i)^{b_s}}\left[w_i^{b_s}\bar{x}_i^{comp}+(ws_{i+1}^{comp})^{b_s}\bar{x}_{i+1}^{comp}\sum_{j=1}^{N_c}\frac{p_i^{comp}(j)}{\left(1-x_i^{comp}(j)/w_i\right)^{b_s}}\right] \tag{31}$$

where $N_c$ is the number of quantized levels of $x_i^{comp}$ in its PDF, and $p_i^{comp}(j)$ represents the probability of $x_i^{comp}$ falling into the $j$th quantized level. Note that, in this case where $t_i^{stall}$ is given as a unit function, the joint PDF $J_i$ of $x_i^{comp}$ and $t_i^{stall}$ is the same as the PDF of $x_i^{comp}$ at the given $t_i^{stall}$, i.e., $p_i^{comp}$. $ws_i^{comp}$ can be obtained by finding the $w_i$ which satisfies the following relation:

$$\frac{\partial \overline{Es_i^{comp}}}{\partial w_i}=\frac{a_s b_s w_i^{b_s-1}}{(t_i^R-s_i)^{b_s}}\left[\bar{x}_i^{comp}+(ws_{i+1}^{comp})^{b_s}\bar{x}_{i+1}^{comp}\sum_{j=1}^{N_c}\frac{-x_i^{comp}(j)\,p_i^{comp}(j)}{\left(w_i-x_i^{comp}(j)\right)^{b_s+1}}\right]=0. \tag{32}$$

Note that, in this case, $ws_i^{comp}$ can be obtained independently of $t_i^R$ without loss of quality. However, contrary to (26), no explicit closed form exists for $ws_i^{comp}$; it can be obtained only through a numerical solution approach, as presented in [11], which is too time-consuming for runtime application. Instead, inspired by (26), we model the solution $ws_i^{comp}$ as follows:

$$ws_i^{comp}=\widetilde{xs}_i^{comp}+\widetilde{ws}_{i+1}^{comp} \tag{33}$$

where $\widetilde{xs}_i^{comp}$ is the effective workload of program region $n_i$ for $Es_i^{comp}$, and $\widetilde{ws}_{i+1}^{comp}$ is obtained in the same way as presented in (26). From our observation that the energy-optimal workload prediction tends to have a value near the average and depends on the runtime distribution, we model $\widetilde{xs}_i^{comp}$ as follows:

$$\widetilde{xs}_i^{comp}=(1+\lambda s_i^{comp})\cdot\bar{x}_i^{comp} \tag{34}$$

where $\lambda s_i^{comp}$ is a parameter which represents the ratio of the distance between $\widetilde{xs}_i^{comp}$ and $\bar{x}_i^{comp}$ to $\bar{x}_i^{comp}$. We calculate $\widetilde{xs}_i^{comp}$ by exploiting a pre-characterization of solutions: we prepare a lookup table $LUT_{\lambda s^{comp}}$ for $\lambda s_i^{comp}$ at design time and perform a table lookup to obtain $\lambda s_i^{comp}$ during runtime. $\lambda s_i^{comp}$ depends on the shape of the runtime distribution. Thus, we derived the indexes of $LUT_{\lambda s^{comp}}$ as follows:

1) Index 1: $\sigma_i^{comp}/\bar{x}_i^{comp}$, the standard deviation ($\sigma_i^{comp}$) normalized with respect to the mean of $n_i$ ($\bar{x}_i^{comp}$);
2) Index 2: $g_i^{comp}$, the skewness of $x_i^{comp}$;
3) Index 3: $\bar{x}_i^{comp}/\widetilde{ws}_{i+1}^{comp}$, the ratio of the mean of $n_i$ ($\bar{x}_i^{comp}$) to the effective remaining workload of $n_{i+1}$ ($\widetilde{ws}_{i+1}^{comp}$).

The rationale for choosing the three indexes is as follows. By substituting $ws_i^{comp}$ with (33) and (34), (32) is rearranged as follows:

$$\bar{x}_i^{comp}+(\widetilde{ws}_{i+1}^{comp})^{b_s+1}\cdot\sum_{j=1}^{N_c}\frac{-x_i^{comp}(j)\,p_i^{comp}(j)}{\left((1+\lambda s_i^{comp})\cdot\bar{x}_i^{comp}+\widetilde{ws}_{i+1}^{comp}-x_i^{comp}(j)\right)^{b_s+1}}=0. \tag{35}$$

Note that the optimal $\lambda s_i^{comp}$ can be obtained by finding a point which satisfies (35). As shown in (35), $\lambda s_i^{comp}$ depends on $\bar{x}_i^{comp}$, $\widetilde{ws}_{i+1}^{comp}$ (Index 3), and the PDF of $x_i^{comp}$, i.e., $x_i^{comp}(j)$ and $p_i^{comp}(j)$, which is modeled as a skewed normal distribution in this paper, since the PDF usually is not a clean normal distribution.⁴ The skewed normal distribution is characterized by three parameters: $\bar{x}_i^{comp}$, $\sigma_i^{comp}$, and $g_i^{comp}$ (Index 1 and Index 2).

Fig. 6. $\lambda s^{comp}$ as a function of (a) Index 1: $\sigma_i^{comp}/\bar{x}_i^{comp}$ and Index 3: $\bar{x}_i^{comp}/\widetilde{ws}_{i+1}^{comp}$, and (b) Index 2: skewness ($g_i^{comp}$), at 75 °C.

Fig. 6 shows $\lambda s_i^{comp}$ as a function of the three indexes above. As Fig. 6(a) shows, $\lambda s_i^{comp}$ increases as the distribution of $x_i^{comp}$ widens, i.e., as $\sigma_i^{comp}/\bar{x}_i^{comp}$ increases, and also as the workload of $n_i$ relative to the effective remaining workload of $n_{i+1}$, i.e., $\bar{x}_i^{comp}/\widetilde{ws}_{i+1}^{comp}$, increases. It increases as well when the skewness of the PDF moves to the right ($g_i^{comp}>0$), as Fig. 6(b) shows.

In the same way, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ can also be calculated by finding $\lambda l_i^{comp}$, $\lambda s_i^{stall}$, $\lambda l_i^{stall}$, and $\lambda b_i$ from $LUT_{\lambda l^{comp}}$, $LUT_{\lambda s^{stall}}$, $LUT_{\lambda l^{stall}}$, and $LUT_{\lambda b}$, respectively. Note that $\lambda s_i^{comp}\sim\lambda b_i$ can be obtained by a table lookup with the statistical parameters (e.g., mean, standard deviation, and skewness) and the effective workload of $n_{i+1}$. Thus, a local-optimal workload prediction that exploits the runtime distribution of the computational workload can be found with negligible runtime overhead.
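At runtime, Case 2 therefore reduces to computing three sample statistics and one table lookup per energy component. A minimal sketch of (33)–(34) follows; the `lut` callable stands in for the design-time table $LUT_{\lambda s^{comp}}$, whose real contents come from the pre-characterization:

```python
def case2_prediction(samples, ws_next_eff, lut):
    """Runtime side of Case 2: ws_i^comp via (33)-(34).
    samples: profiled x_comp values of the current phase;
    ws_next_eff: effective remaining workload of n_{i+1} from (26);
    lut: callable (index1, index2, index3) -> lambda_s (placeholder
    for the pre-characterized LUT)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    std = var ** 0.5
    skew = (sum((x - mean) ** 3 for x in samples) / n) / std ** 3 if std else 0.0
    lam = lut(std / mean, skew, mean / ws_next_eff)   # Indexes 1, 2, 3
    xs_eff = (1.0 + lam) * mean                       # effective workload (34)
    return xs_eff + ws_next_eff                       # prediction (33)
```

With a zero-valued table entry the prediction degenerates to the Case 1 form, mean plus effective remaining workload.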

C. Case 3: Both $x_i^{comp}$ and $t_i^{stall}$ Have Runtime Distributions

When both $x_i^{comp}$ and $t_i^{stall}$ have runtime distributions, as shown in Fig. 5(c), the average switching energy consumption for running the computational workload, i.e., $\overline{Es_i^{comp}}$, can be calculated as the sum of $Es_i^{comp}$ with respect to the joint PDFs $J_i$ and $J_{i+1}$ as follows:

$$\overline{Es_i^{comp}}=\int\!\cdots\!\int Es_i^{comp}\,J_i\,J_{i+1}=\frac{a_s}{(t_i^R-s_i)^{b_s}}\left[w_i^{b_s}\bar{x}_i^{comp}+Zs_i^{comp}\right] \tag{36}$$

where

$$Zs_i^{comp}=(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\sum_{j=1}^{N_c}\sum_{k=1}^{N_s}\frac{J_i(j,k)}{\left(\gamma_i(j,k)\right)^{b_s}} \tag{37}$$

where $\gamma_i(j,k)$ and $J_i(j,k)$ denote $\gamma_i$ in (21) and the probability that $(x_i^{comp}, t_i^{stall})$ falls into the $(j,k)$th quantized level, respectively ($N_c$ and $N_s$ denote the numbers of quantized levels of $x_i^{comp}$ and $t_i^{stall}$). Since we set the predicted remaining memory stall time $s_i$ to the sum of the averages of $t_i^{stall}$ and $t_{i+1}^{stall}$, the term $\bar{t}_i^{stall}$ [defined in (22)] in $\gamma_i$ is no longer zero. Due to the nonzero $\bar{t}_i^{stall}$, the local-optimal workload prediction is a function of $t_i^R$. To reduce the solution complexity, we approximate the calculation of $Zs_i^{comp}$ in (37) as follows:

$$Zs_i^{comp}\approx\eta s_i^{comp}\cdot(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\cdot\sum_{j=1}^{N_c}\frac{p_i^{comp}(j)}{\left(1-x_i^{comp}(j)/w_i\right)^{b_s}} \tag{38}$$

⁴Note that a more accurate workload prediction can be performed with additional effort, as presented in [19], where the PDF of $x_i^{comp}$ is modeled as a multimodal distribution with each mode given as a skewed normal distribution. Although more energy savings can be obtained from the multimodal modeling, in this paper we simply approximate the PDF as a single-mode skewed normal distribution in order to reduce the runtime overhead. The method can, however, be easily extended to the multimodal case [19].

TABLE II
Threshold Parameters Used in Coordination

Coordination Step   Threshold Parameter   Condition
C1                  $fs^{comp}$           $\partial(a_s f^{b_s})/\partial f \ge \theta_c \cdot \partial(a_l f^{b_l})/\partial f$
C1                  $fl^{comp}$           $\partial(a_l f^{b_l})/\partial f \ge \theta_c \cdot \partial(a_s f^{b_s})/\partial f$
C2                  $fl^{stall}$          $\partial(\beta a_l f^{b_l+1})/\partial f \ge \theta_c \cdot \partial(cf)/\partial f$
C2                  $fb^{stall}$          $\partial(cf)/\partial f \ge \theta_c \cdot \partial(\beta a_l f^{b_l+1})/\partial f$
C3                  $fs^{stall}$          $\partial(\beta a_s f^{b_s+1})/\partial f \ge \theta_c \cdot \partial(a_l f^{b_l+1}+cf)/\partial f$
C3                  $fL^{stall}$          $\partial(a_l f^{b_l+1}+cf)/\partial f \ge \theta_c \cdot \partial(\beta a_s f^{b_s+1})/\partial f$

$\theta_c$: user-defined threshold value.
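To make the conditions in Table II concrete: each threshold frequency is the point where one energy derivative equals $\theta_c$ times the other. For the C1 pair this admits a closed form under the polynomial per-cycle energy model $Es \sim a_s f^{b_s}$, $El \sim a_l f^{b_l}$; the coefficients below are illustrative, not the paper's fitted values:

```python
def c1_thresholds(a_s, b_s, a_l, b_l, theta_c):
    """Closed-form C1 threshold frequencies of Table II, assuming
    b_l > b_s (leakage rises faster with f under combined Vdd/Vbb
    scaling).  Below fs_comp: d(Es)/df >= theta_c * d(El)/df
    (switching-dominant).  Above fl_comp: d(El)/df >= theta_c * d(Es)/df
    (leakage-dominant)."""
    assert b_l > b_s and theta_c >= 1.0
    # solve a_s*b_s*f**(b_s-1) = theta_c * a_l*b_l*f**(b_l-1) and vice versa
    fs = (a_s * b_s / (theta_c * a_l * b_l)) ** (1.0 / (b_l - b_s))
    fl = (theta_c * a_s * b_s / (a_l * b_l)) ** (1.0 / (b_l - b_s))
    return fs, fl
```

Note that $fl^{comp}/fs^{comp} = \theta_c^{2/(b_l-b_s)}$, so a larger $\theta_c$ widens the intermediate region in which the predictions are interpolated.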

where

$$\eta s_i^{comp}=\sum_{j=1}^{N_c}\sum_{k=1}^{N_s}\left(\frac{1-x_i^{comp}(j)/w_i}{\gamma_i(j,k)}\right)^{b_s}J_i(j,k). \tag{39}$$

As shown in (39), $\eta s_i^{comp}$ depends on $w_i$ and $s_i$ because $\gamma_i$ in (21) is a function of $w_i$ and $s_i$. Note that $(w_i, s_i)$ are calculated at the end of the current program phase using the joint PDFs ($J_i$ and $J_{i+1}$) profiled during the time period of the current program phase. To simplify the interdependence between $\eta s_i^{comp}$ and $(w_i, s_i)$, we approximate the calculation of $\eta s_i^{comp}$ by replacing $(w_i, s_i)$ with $(ws_i^{comp}, s_i)$ of the current program phase. By substituting (38) with the approximated $\eta s_i^{comp}$, we can rearrange (36) as follows:

$$\overline{Es_i^{comp}}\approx\frac{a_s}{(t_i^R-s_i)^{b_s}}\left[w_i^{b_s}\bar{x}_i^{comp}+\eta s_i^{comp}\cdot(ws_{i+1}^{comp})^{b_s}\bar{x}_{i+1}^{comp}\cdot\sum_{j=1}^{N_c}\frac{p_i^{comp}(j)}{\left(1-x_i^{comp}(j)/w_i\right)^{b_s}}\right]. \tag{40}$$
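The correction factor of (39) can be computed directly from a profiled joint histogram. A minimal sketch follows; the histogram layout and the exact form of $\gamma_i$ used here are illustrative assumptions, since (21) depends on the paper's timing model:

```python
def eta_s(joint_pdf, x_levels, t_levels, w, s, t_remaining, bs):
    """Approximate correction factor eta_s per (39).
    joint_pdf[j][k]: probability that (x_comp, t_stall) falls in the
    (j,k)th quantized level.  gamma(j,k) is assumed here to be the
    fraction of the remaining time budget left for computation,
    i.e. (1 - x/w) minus the normalized stall term."""
    eta = 0.0
    for j, x in enumerate(x_levels):
        for k, t in enumerate(t_levels):
            gamma = 1.0 - x / w - t / (t_remaining - s)   # assumed form of (21)
            eta += ((1.0 - x / w) / gamma) ** bs * joint_pdf[j][k]
    return eta
```

When every stall level is zero, each ratio is exactly 1 and the factor collapses to 1, matching footnote 5.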

Note that (40) is the same as (31) except for $\eta s_i^{comp}$. Therefore, in a similar way to (32) and (33), we can express $ws_i^{comp}$, which minimizes $\overline{Es_i^{comp}}$, as follows:

$$ws_i^{comp}=\widetilde{xs}_i^{comp}+\widetilde{ws}_{i+1}^{comp} \tag{41}$$

where

$$\widetilde{ws}_{i+1}^{comp}=\left[\eta s_i^{comp}\cdot(ws_{i+1}^{comp})^{b_s}\,\bar{x}_{i+1}^{comp}\right]^{\frac{1}{b_s+1}}. \tag{42}$$

Compared to the calculation of $ws_i^{comp}$ in Case 1 and Case 2 [Fig. 5(a) and (b)], the only difference is that $(\eta s_i^{comp})^{1/(b_s+1)}$ is multiplied into the calculation of the effective remaining workload of $n_{i+1}$, i.e., $\widetilde{ws}_{i+1}^{comp}$.⁵ $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$ can also be calculated in the same way.

⁵Note that when the memory stall has no distribution, i.e., $\bar{t}_i^{stall}=0$, $\eta s_i^{comp}$ becomes "1," so $\widetilde{ws}_{i+1}^{comp}$ becomes the same as in Case 1 and Case 2.

Note that we perform the most time-consuming work of workload prediction, i.e., finding $\lambda s_i^{comp}\sim\lambda b_i$ with respect to the runtime distribution, in a design-time step, and store the resulting parameters in LUTs. Since we only access the LUTs to find a workload prediction during runtime, we can drastically reduce the runtime overhead of workload prediction while still accurately accounting for the influence of the runtime distribution. However, this requires additional memory space to store the pre-characterized data. The runtime and area overheads are presented in Section IX-C.

Fig. 7. Hierarchical coordination to obtain the global workload prediction $w_i^{opt}$, where C1–C4 represent coordination steps.
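The design-time side fills each LUT entry by solving the first-order condition numerically. A sketch of characterizing one entry of $LUT_{\lambda s^{comp}}$ from a quantized PDF, using bisection on the bracketed term of (32); the grid, solver, and parameter names are our illustrative choices:

```python
def lambda_s(x_vals, probs, ws_next_eff, bs, iters=80):
    """Solve the first-order condition (32)/(35) for one grid point and
    return lambda = (xs_eff - mean)/mean, where xs_eff = ws - ws_next_eff.
    x_vals/probs: quantized PDF of x_comp; ws_next_eff: effective
    remaining workload of n_{i+1}."""
    mean = sum(x * p for x, p in zip(x_vals, probs))
    def g(w):  # bracketed term of (32); zero at the local optimum
        return mean - ws_next_eff ** (bs + 1) * sum(
            x * p / (w - x) ** (bs + 1) for x, p in zip(x_vals, probs))
    lo = max(x_vals) + 1e-9               # w must exceed every sample
    hi = max(x_vals) + ws_next_eff * 10 + mean
    for _ in range(iters):                # g is increasing in w
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    w = 0.5 * (lo + hi)
    return (w - ws_next_eff - mean) / mean
```

For a degenerate (unit-function) PDF the solver reproduces the Case 1 closed form, so the stored $\lambda$ is zero; widening the distribution pushes $\lambda$ above zero, consistent with Fig. 6(a).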

VII. Frequency Selection Based on Coordination

In this section, we present a method called coordination which finds the global workload prediction of $n_i$ ($w_i^{opt}$) from the local-optimal workload predictions $ws_i^{comp}$, $wl_i^{comp}$, $ws_i^{stall}$, $wl_i^{stall}$, and $wb_i$. As (23) shows, the workload prediction which minimizes the average total energy consumption at a given $t_i^R$ varies according to the sensitivity of each energy component with respect to $w_i$, i.e., $\partial Es_i^{comp}/\partial w_i$, $\partial El_i^{comp}/\partial w_i$, $\partial Es_i^{stall}/\partial w_i$, $\partial El_i^{stall}/\partial w_i$, and $\partial Eb_i/\partial w_i$ in (23).

Since the coordination of workload predictions is performed online, it needs to be done with low overhead. To achieve this goal, we present a simple hierarchical method which finds $w_i^{opt}$ from the local-optimal workload predictions (independently of $t_i^R$), as shown in Fig. 7. First, we obtain the workload prediction for each workload type: the computational workload prediction ($w_i^{comp}$) through a coordination step called C1, and the memory stall workload prediction ($w_i^{stall}$) through coordination steps called C2 and C3. Then, we find $w_i^{opt}$ from $w_i^{comp}$ and $w_i^{stall}$ through a coordination step called C4.

Fig. 8. Linear coordination of (a) C1: $ws_i^{comp}$ and $wl_i^{comp}$ to find $w_i^{comp}$. (b) C4: $w_i^{comp}$ and $w_i^{stall}$ to find $w_i^{opt}$.

1) Coordination for $w_i^{comp}$ (C1): The workload prediction for the computational workload, $w_i^{comp}$, represents the prediction which minimizes $E_i^{comp}$, i.e., the sum of $Es_i^{comp}$ and $El_i^{comp}$. Therefore, $w_i^{comp}$ depends on $ws_i^{comp}$ and $wl_i^{comp}$. In this coordination, we utilize the fact that $El_i^{comp}$ has an exponential dependency on frequency under combined $V_{dd}/V_{bb}$ scaling. The rationale is as follows. In the low-frequency region, a high reverse body bias voltage can be applied, suppressing leakage energy consumption due to the high $V_{th}$. As frequency increases, $|V_{bb}|$ is decreased to enable higher clock frequency operation by reducing $V_{th}$, which drastically increases leakage energy consumption. In combined $V_{dd}/V_{bb}$ scaling, the increase of switching energy consumption with respect to frequency, i.e., $\partial e_s/\partial f$, dominates in the lower frequency region, while the increase of leakage energy consumption, i.e., $\partial e_l/\partial f$, dominates in the relatively high frequency region [27]. Therefore, when most operation falls into the frequency range where the sensitivity of switching energy consumption is much larger than that of leakage energy consumption, i.e., $\partial e_s/\partial f \gg \partial e_l/\partial f$, $w_i^{comp}$ should approach $ws_i^{comp}$, because switching energy is the major contributor in this region. On the other hand, when the operating frequency is within the region where $\partial e_s/\partial f \ll \partial e_l/\partial f$, $w_i^{comp}$ should approach $wl_i^{comp}$.

We partition the frequency range into three regions: switching energy-dominant, leakage energy-dominant, and intermediate. The partition is made with two threshold frequencies, $fs^{comp}$ and $fl^{comp}$: the range below $fs^{comp}$ (above $fl^{comp}$) is called the switching (leakage) energy-dominant region, while the range between the two thresholds is called the intermediate region. Each energy component has two threshold frequencies, as shown in Table II.
In order to identify which frequency partition the current program region belongs to, we introduce a simple evaluation metric, $f_i^{eval}$, as the upper bound of the operating frequency in the remaining program regions from $n_i$ to $n_{leaf}$:

$$f_i^{eval}=\frac{WCEC_i^{comp(k)}}{t_i^R-WCET_i^{stall(k)}}. \tag{43}$$

In (43), $WCEC_i^{comp(k)}$ and $WCET_i^{stall(k)}$ represent the remaining worst-case execution cycles of the computational workload and the remaining worst-case memory stall time from $n_i$ to $n_{leaf}$ when the current program phase is the $k$th program phase, respectively. The solid line in Fig. 8(a) illustrates a linear coordination method to find $w_i^{comp}$ by utilizing $f_i^{eval}$. When $f_i^{eval}$ is lower than the threshold value $fs^{comp}$ (second row of Table II, where $\theta_c$ is set to 5.0 in our experiment), we set $w_i^{comp}$ to $ws_i^{comp}$, because the remaining program regions will operate within the switching energy-dominant frequency region. When $f_i^{eval}$ is higher than the threshold value $fl^{comp}$ (third row of Table II), we set $w_i^{comp}$ to $wl_i^{comp}$. In the last case, i.e., $fs^{comp}<f_i^{eval}<fl^{comp}$, we set $w_i^{comp}$ in proportion to the ratio of $(f_i^{eval}-fs^{comp})$ to $(fl^{comp}-fs^{comp})$ using a linear interpolation function $L(\cdot)$ defined as follows:

$$L(X_{lower},X_{upper},Y_{lower},Y_{upper},X_{eval})=\frac{X_{eval}-X_{lower}}{X_{upper}-X_{lower}}\cdot(Y_{upper}-Y_{lower})+Y_{lower}. \tag{44}$$

By applying $X_{lower}=fs^{comp}$, $X_{upper}=fl^{comp}$, $Y_{lower}=ws_i^{comp}$, $Y_{upper}=wl_i^{comp}$, and $X_{eval}=f_i^{eval}$, we obtain $w_i^{comp}$ as the output of $L(\cdot)$.

2) Coordination for $w_i^{stall}$ (C2 and C3): The workload prediction for memory stall, $w_i^{stall}$, represents the prediction which minimizes $E_i^{stall}$. Since $E_i^{stall}$ depends on $Eb_i$ as well as $Es_i^{stall}$ and $El_i^{stall}$, $w_i^{stall}$ is derived from $wb_i$ as well as $ws_i^{stall}$ and $wl_i^{stall}$. To obtain $w_i^{stall}$ by coordinating the three local-optimal workload predictions, we perform the coordination in two steps, as shown in Fig. 7. First, we find $wL_i^{stall}$ by coordinating $wl_i^{stall}$ and $wb_i$ (C2), both of which are related to leakage energy consumption. Then, we find $w_i^{stall}$ by coordinating $ws_i^{stall}$ and $wL_i^{stall}$ (C3). Note that the coordination for $wL_i^{stall}$ can be done in the same way as for $w_i^{comp}$, shown in Fig. 8(a), by simply substituting $(ws_i^{comp}, wl_i^{comp})$ with $(wl_i^{stall}, wb_i)$ and $(fs^{comp}, fl^{comp})$ with $(fl^{stall}, fb^{stall})$, where $fl^{stall}$ and $fb^{stall}$ are the threshold values defined in the fourth and fifth rows of Table II, respectively. In the same way, the coordination for $w_i^{stall}$ can be done by substituting the corresponding workload predictions and threshold values, i.e., $fs^{stall}$ and $fL^{stall}$ in Table II.

3) Coordination for $w_i^{opt}$ (C4): The last step of the coordination obtains $w_i^{opt}$ from $w_i^{comp}$ and $w_i^{stall}$. In CPU-bound applications, $w_i^{opt}$ approaches $w_i^{comp}$, since $E_i^{comp}$ dominates $E_i^{stall}$. On the contrary, in memory-bound applications, $w_i^{stall}$ contributes more to $w_i^{opt}$. We calculate the maximum memory-boundedness of the remaining program region from $n_i$ to $n_{leaf}$, denoted by $\Delta_i$, as the ratio of the worst-case remaining memory stall cycles from $n_i$ at $f_i^{eval}$ (43) to the worst-case computational cycles, i.e., $\Delta_i=f_i^{eval}\cdot WCET_i^{stall(k)}/WCEC_i^{comp(k)}$. Fig. 8(b) illustrates the linear coordination method to find $w_i^{opt}$ by utilizing $\Delta_i$.
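The linear steps above all reduce to the interpolation (44) clamped at two thresholds. A compact sketch of C1 and C4 (function names are ours; the memory-boundedness bounds 0.5 and 1/0.5 follow the text):

```python
def L(x_lower, x_upper, y_lower, y_upper, x_eval):
    """Linear interpolation function of (44)."""
    return (x_eval - x_lower) / (x_upper - x_lower) * (y_upper - y_lower) + y_lower

def blend(y_low, y_high, x, x_low, x_high):
    """Clamp-and-interpolate pattern shared by the coordination steps."""
    if x <= x_low:
        return y_low
    if x >= x_high:
        return y_high
    return L(x_low, x_high, y_low, y_high, x)

def coordinate_c1(ws_comp, wl_comp, f_eval, fs_comp, fl_comp):
    """C1: blend the switching- and leakage-optimal predictions by the
    frequency region that f_eval (43) falls into."""
    return blend(ws_comp, wl_comp, f_eval, fs_comp, fl_comp)

def coordinate_c4(w_comp, w_stall, boundedness, d_comp=0.5):
    """C4: blend w_comp and w_stall by the memory-boundedness metric."""
    return blend(w_comp, w_stall, boundedness, d_comp, 1.0 / d_comp)
```

C2 and C3 reuse `blend` with the corresponding prediction pairs and the threshold frequencies of Table II substituted in.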
As $\Delta_i$ becomes larger (smaller), the remaining work is characterized as more memory-bound (CPU-bound). When $\Delta_i$ is smaller than a threshold value, called $\Delta^{comp}$ (0.5 in our experiment), we regard the remaining workload as CPU-bound and therefore set $w_i^{opt}$ to $w_i^{comp}$. On the other hand, if $\Delta_i$ is larger than a threshold value, called $\Delta^{stall}$ ($=1/\Delta^{comp}$ in our experiment), we set $w_i^{opt}$ to $w_i^{stall}$, since the remaining work is memory-bound. In the intermediate case, i.e., $\Delta^{comp}<\Delta_i<\Delta^{stall}$, we set $w_i^{opt}$ in proportion to the ratio of $(\Delta_i-\Delta^{comp})$ to $(\Delta^{stall}-\Delta^{comp})$ using (44).

After $w_i^{opt}$ is obtained, the voltage/frequency is set to $f_i=w_i^{opt}/(t_i^R-s_i)$, where $t_i^R$ is measured at the start of each program region. When setting the voltage/frequency, we check whether the performance level satisfies the given deadline constraint even if the worst-case execution time occurs after the frequency is set; this is called the feasibility check. More details are explained in [10] and [27].

VIII. Program Phase Detection

A program phase, especially in terms of computational cycles and memory stall time during PHASE UNIT (as defined in Algorithm 1), is characterized by a salient difference in computational cycles and memory stall time. Conventionally, a program phase is characterized by utilizing only the average execution cycles of basic blocks, without exploiting the runtime distributions of computational cycles and memory stall time [14], [15]. To exploit the runtime distributions in characterizing a program phase, we define a new program phase vector consisting of the five local-optimal workload predictions of each program region. Note that the local-optimal workload predictions reflect the correlation as well as the runtime distributions of both computational cycles and memory stall time. Thus, the set of local-optimal workload predictions is a good indicator of the joint PDF of each program region. The program phase vector of the $k$th program phase is defined as follows:

$$W^{(k)}=\left[Ws^{comp(k)},\,Wl^{comp(k)},\,Ws^{stall(k)},\,Wl^{stall(k)},\,Wb^{(k)}\right]^T \tag{45}$$

where

$$Ws^{comp(k)}=\left[ws_{root}^{comp(k)},\ldots,ws_i^{comp(k)},\ldots,ws_{leaf}^{comp(k)}\right] \tag{46}$$
$$Wl^{comp(k)}=\left[wl_{root}^{comp(k)},\ldots,wl_i^{comp(k)},\ldots,wl_{leaf}^{comp(k)}\right] \tag{47}$$
$$Ws^{stall(k)}=\left[ws_{root}^{stall(k)},\ldots,ws_i^{stall(k)},\ldots,ws_{leaf}^{stall(k)}\right] \tag{48}$$
$$Wl^{stall(k)}=\left[wl_{root}^{stall(k)},\ldots,wl_i^{stall(k)},\ldots,wl_{leaf}^{stall(k)}\right] \tag{49}$$
$$Wb^{(k)}=\left[wb_{root}^{(k)},\ldots,wb_i^{(k)},\ldots,wb_{leaf}^{(k)}\right]. \tag{50}$$
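The periodic phase test applied to these vectors can be sketched as follows. We interpret the text's Hamming distance between real-valued phase vectors as an element-wise absolute distance, and the threshold $\theta_p$ as 10% of the current vector's magnitude; both readings are our assumptions:

```python
def detect_phase(current_vec, stored_phases, theta_ratio=0.10):
    """Return the index of a stored phase whose distance to current_vec
    is within theta_p, or -1 if the current period starts a new phase.
    Distance: sum of element-wise absolute differences (our reading of
    the 'Hamming distance' for real-valued phase vectors)."""
    magnitude = sum(abs(v) for v in current_vec)
    theta_p = theta_ratio * magnitude
    for idx, phase_vec in enumerate(stored_phases):
        dist = sum(abs(a - b) for a, b in zip(current_vec, phase_vec))
        if dist <= theta_p:
            return idx      # reuse this phase's local-optimal predictions
    return -1               # new phase: caller stores current_vec
```

On a match, the matched phase's stored local-optimal workload predictions are reused to set the voltage/frequency, so no re-profiling is needed.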

Periodically, i.e., every PHASE UNIT (set to the time for decoding 20 frames in our experiments), we check whether the program phase has changed. The check is made by calculating the Hamming distance between the program phase vector of the current period and that of the current program phase. When the distance is greater than a threshold $\theta_p$ (set to 10% of the magnitude of the current program phase vector in our experiments), we conclude that the program phase has changed and then check whether there is any previously stored program phase whose distance to the program phase vector of the current period is within $\theta_p$. If so, we reuse the local-optimal workload predictions of the matched previous phase as those of the new phase to set the voltage/frequency. If no previous phase satisfies the condition, we store the newly detected program phase and use its local-optimal workload predictions to set the voltage/frequency until the next program phase detection.

IX. Experimental Results

A. Setup

In our experiments, we used two real-life multimedia programs, the MPEG4 and H.264 decoders in FFMPEG [29]. We applied two picture sets for the decoding. First, we used, in total, 4200 frames of 1920 × 1080 video clips consisting of eight test pictures: Rush Hour (500 frames), Station2 (300 frames), Sunflower (500 frames), Tractor (690 frames), SnowMnt (570 frames), InToTree (500 frames), ControlledBurn (570 frames), and TouchdownPass (500 frames) from [30]. Second, we used 3000 frames of a 1920 × 800 movie clip (excerpted from Dark Knight). We inserted nine voltage/frequency setting points in each program: seven for macroblock decoding and two for the file write operation of the decoded image. We performed profiling with PAPI [31] running on an LG XNOTE with Linux 2.6.3. We performed experiments at 25 °C, 50 °C, 75 °C, and 100 °C. We calculated the energy consumption using the processor energy model with combined $V_{dd}/V_{bb}$ shown in Section III-A. The parameters in (1)–(4) of the processor energy model were obtained from PTscalar [21] and Cacti 5.3 with the BPTM high-k/metal gate 32 nm HP model. We used seven discrete frequency levels from 333 MHz to 2.333 GHz with a 333 MHz step size. We set 20 µs as the time overhead for switching voltage/frequency levels and calculated the energy overhead using the model presented in [7].

We compared the following four methods.
1) RT-CM-AVG [4]: runtime DVFS method based on the average ratio of memory stall time and computational cycles (baseline).
2) RT-C-DIST [19]: runtime DVFS method which exploits only the PDF of computational cycles.
3) DT-CM-DIST [28]: design-time DVFS method which exploits the joint PDF of computational cycles and memory stall time.
4) RT-CM-DIST: runtime version of DT-CM-DIST (proposed).

We modified the original RT-CM-AVG [4], which runs intertask DVFS without a real-time constraint, so that it supports intratask DVFS with a real-time constraint. In running DT-CM-DIST [28], we performed workload prediction with respect to 20 quantized levels of remaining time, i.e., bins, using the joint PDF of the first 100 frames at design time.

B. Energy Savings

Table III(a) and (b) shows the comparison of energy consumption for the MPEG4 and H.264 decoders, respectively, at 75 °C. The first column shows the names of the test pictures. Columns 2, 3, and 4 give the energy consumption of each DVFS method normalized with respect to that of RT-CM-AVG.

Compared with RT-CM-AVG [4], our method RT-CM-DIST offers 5.1–34.6% and 4.5–17.3% energy savings for the MPEG4 and H.264 decoders, respectively. Fig. 9 shows the statistics of the frequency levels used when running SnowMnt in the MPEG4 decoder. As Fig. 9 shows, RT-CM-AVG uses the lowest frequency level, i.e., 333 MHz, more frequently than the other two methods. This in turn forces frequent use of high frequency levels, i.e., levels above 2.00 GHz where energy consumption drastically increases with frequency, in order to meet the real-time constraint. However, by considering the runtime distribution in RT-CM-DIST, high frequency

levels incurring high energy overhead are used less frequently, because the distribution-aware workload prediction is more conservative than the average-based method.

TABLE III
Comparison of Energy Consumption for Test Pictures at 75 °C: (a) MPEG4 (20 Frames/s) and (b) H.264 Decoder (12 Frames/s)

(a)
Image            RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
Rush Hour        1.08             0.83              0.79
Station2         1.34             0.97              0.95
Sunflower        0.99             0.76              0.74
Tractor          1.01             0.78              0.75
SnowMnt          1.02             0.90              0.81
InToTree         0.97             0.79              0.71
ControlledBurn   0.88             0.67              0.65
TouchdownPass    1.15             0.91              0.86
Average          1.05             0.83              0.78

(b)
Image            RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
Rush Hour        1.11             0.94              0.93
Station2         1.05             0.90              0.83
Sunflower        1.09             0.97              0.88
Tractor          1.18             1.00              0.96
SnowMnt          1.03             1.03              0.84
InToTree         1.14             1.00              0.93
ControlledBurn   1.07             0.94              0.88
TouchdownPass    1.10             0.99              0.94
Average          1.10             0.97              0.90

Fig. 9. Statistics of used frequency levels in MPEG4 for decoding SnowMnt.

Table IV shows the energy savings for one of the test pictures, SnowMnt, at four temperatures: 25 °C, 50 °C, 75 °C, and 100 °C. As the table shows, more energy savings are achieved as temperature increases. This is because the energy penalty caused by the frequent use of high frequency levels becomes more pronounced as temperature increases, since leakage energy consumption grows exponentially with temperature. By considering the temperature dependency of leakage energy consumption, RT-CM-DIST sets the voltage/frequency so as to use high frequency levels less often as temperature increases, whereas RT-CM-AVG does not consider temperature at all.

TABLE IV
Comparison of Energy Consumption for SnowMnt at Four Temperature Levels

              Temp (°C)   RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
MPEG4 dec.    25          1.15             0.94              0.86
              50          1.10             0.92              0.84
              75          1.02             0.90              0.81
              100         0.96             0.89              0.79
H.264 dec.    25          1.06             1.02              0.91
              50          1.05             1.02              0.88
              75          1.03             1.03              0.84
              100         1.01             1.03              0.80

Note that, in most cases, the MPEG4 decoder yields more energy savings than the H.264 decoder. This is because, as Fig. 10 shows, the distribution of memory-boundedness (defined as the ratio of memory stall time to computational cycles) of MPEG4 is wider than that of H.264 in terms of the Max/Avg and Max/Min ratios.

Fig. 10. Distribution of memory boundedness in (a) MPEG4 and (b) H.264 decoder.

Compared with RT-C-DIST [19], which exploits only the distribution of computational cycles at runtime, RT-CM-DIST provides 20.8–28.9% and 15.1–21.0% further energy savings for the MPEG4 and H.264 decoders, respectively. The amount of further savings shows the effectiveness of considering the distribution of memory stall time as well as the correlation between computational cycles and memory stall time, i.e., their joint PDF. RT-C-DIST regards the whole number of clock cycles, profiled at the end of every program region, as computational cycles. Thus, RT-C-DIST cannot consider the joint PDF of computational cycles and memory stall time. As a consequence, it sets frequency levels higher than required, as shown in Fig. 9.

In Table III, compared with DT-CM-DIST, which exploits the runtime distributions of both computational and memory stall workload at design time, RT-CM-DIST provides 2.1–10.2% and 1.2–18.1% further energy savings for the MPEG4 and H.264

decoder, respectively. The largest energy savings are obtained for SnowMnt, which has distinctive program phase behavior, for both the MPEG4 and H.264 decoders. Since DT-CM-DIST finds the optimal workload using the first 100 frames (a design-time, fixed training input), which differ substantially from the remaining frames (runtime-varying input), it cannot provide a proper voltage and frequency setting.

To further investigate the effectiveness of considering complex program phase behavior, we performed another experiment using the 3000 frames of the movie clip, in which program phase behavior is more pronounced because its scenes change quickly. Table V shows the normalized energy consumption at 75 °C when decoding the movie clip from Dark Knight with the MPEG4 and H.264 decoders. RT-CM-DIST outperforms DT-CM-DIST by up to 26.3% and 23.3% for the MPEG4 and H.264 decoders, respectively. This is because the movie clip exhibits complex program phase behavior due to frequent scene changes, as Fig. 1(d) shows.

TABLE V
Comparison of Energy Savings for DarkKnight at 75 °C

              RT-C-DIST [19]   DT-CM-DIST [28]   RT-CM-DIST (Proposed)
MPEG4 dec.    1.26             1.20              0.89
H.264 dec.    1.16             1.16              0.89

C. Overhead

1) Runtime Overhead: We measured the runtime overhead of the proposed online method, RT-CM-DIST, using PAPI [31]. The proposed method consists of three parts: local-optimal workload prediction, coordination, and feasibility check. Table VI shows the runtime overhead of each part. The local-optimal workload prediction of a program region consumes 40 400–52 400 clock cycles when PHASE UNIT is set to 20 frames; note that the local-optimal workload prediction is performed once every PHASE UNIT. The coordination and feasibility check, which are performed at the start of every program region, take 2720–4780 and 407–3560 clock cycles, respectively. The total runtime overhead in Table VI amounts to 0.38% and 0.25% of the average execution cycles for the MPEG4 and H.264 decoders, respectively.

TABLE VI
Summary of Runtime Overhead

Source of Runtime Overhead          Amount
Local-optimal workload prediction   40 400–52 400 cycles
Coordination                        2720–4780 cycles
Feasibility check                   407–3560 cycles
2) Memory Overhead of LUTs: As explained in Section VI-B, the presented method requires three temperature-independent LUTs, i.e., LUTλscomp, LUTλsstall, and LUTλb, and two temperature-dependent LUTs, i.e., LUTλlcomp and LUTλlstall. The LUTs incur a memory overhead that largely depends on the number of steps (scales) in their indexes: the more steps are used, the more accurate the workload prediction, at the cost of a higher memory area overhead. In our implementation, each LUT is indexed by the ratio of standard deviation to mean (Index 1), ranging from 0.05 to 0.30 with a step size of 0.05; the skewness (Index 2), ranging from −1.00 to 1.00 with a step size of 0.10; and the ratio of the mean to the effective remaining workload of the remaining program regions (Index 3), ranging from 0.10 to 1.00 with a step size of 0.10. Therefore, 1140 (= 19 × 6 × 10) entries are required for each LUT, with 8 bits assigned to each entry. Thus, about 1 kB of memory is required per LUT. The area overhead can be further reduced by trimming and compressing entries. The temperature-dependent LUTs are built for the four temperatures used in our experiment, i.e., 25, 50, 75, and 100 °C. The total area overhead amounts to 11 kB [= (3 × 1 kB) + 4 × (2 × 1 kB)].
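The memory budget above follows directly from the index dimensions; the short arithmetic check below reproduces the paper's per-LUT and total figures (entry counts and the 1 kB rounding are taken from the text).

```python
# Memory budget for the LUTs described above. Entry counts follow the
# paper's figure of 1140 (= 19 x 6 x 10) entries per table at 8 bits each.
INDEX_STEPS = (6, 19, 10)       # steps for Index 1, Index 2, Index 3
ENTRY_BITS = 8

entries_per_lut = 1
for steps in INDEX_STEPS:
    entries_per_lut *= steps    # 1140 entries

bytes_per_lut = entries_per_lut * ENTRY_BITS // 8   # 1140 B, about 1 kB

# 3 temperature-independent LUTs, plus 2 temperature-dependent LUTs
# replicated for 4 temperatures, rounding each LUT to 1 kB as in the text.
total_kb = 3 * 1 + 4 * (2 * 1)
print(entries_per_lut, bytes_per_lut, total_kb)  # 1140 1140 11
```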

X. Conclusion

In this paper, we presented a novel online DVFS method which exploits the runtime distributions of both computational workload and memory stall time for combined Vdd/Vbb scaling. To reduce the complexity of our previous design-time solution [28], we presented a DVFS method consisting of two steps: local-optimal workload prediction and coordination. In the local-optimal workload prediction step, we periodically calculate five local-optimal workload predictions, each of which minimizes a single energy component under the joint PDF of computational cycles and memory stall time profiled at runtime. To further reduce the runtime overhead, we prepared lookup tables pre-characterized at design time from the analytical formulation and used them at runtime to find the local-optimal workloads. In the coordination step, the global workload prediction is obtained by coordinating the five local-optimal workload predictions. Experimental results show that the proposed method offers up to 34.6% and 17.3% energy savings for the MPEG4 and H.264 decoders, respectively, compared with an existing method [4].
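As a structural sketch only, the two-step flow summarized above can be expressed as follows. The energy model and the coordination rule here (pick the candidate minimizing a supplied total-energy function) are illustrative placeholders, not the paper's analytical formulation; the predictor and model functions are hypothetical.

```python
# Schematic of the two-step prediction: five local-optimal workload
# predictions, then coordination into one global prediction. The energy
# model and coordination rule are illustrative placeholders.

def local_optimal_predictions(predictors, joint_pdf):
    """One candidate workload per energy component, from the profiled joint PDF."""
    return [predict(joint_pdf) for predict in predictors]

def coordinate(predictions, total_energy):
    """Placeholder rule: pick the candidate minimizing modeled total energy."""
    return min(predictions, key=total_energy)

# Toy example: each predictor returns a fixed candidate workload (cycles);
# the total-energy model is a stand-in quadratic around some optimum.
predictors = [lambda pdf, w=w: w for w in (800, 900, 1000, 1100, 1200)]
candidates = local_optimal_predictions(predictors, joint_pdf=None)
global_prediction = coordinate(candidates, total_energy=lambda w: (w - 950) ** 2)
print(global_prediction)  # 900
```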

References

[1] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,” in Proc. ISCA, 2000, pp. 128–138.
[2] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-setting of a low-power CPU,” in Proc. MOBICOM, 1995, pp. 13–25.
[3] Y. Gu and S. Chakraborty, “Control theory-based DVS for interactive 3-D games,” in Proc. DAC, 2008, pp. 740–745.
[4] K. Choi, R. Soma, and M. Pedram, “Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 1, pp. 18–28, Jan. 2005.
[5] W.-Y. Liang, S.-C. Chen, Y.-L. Chang, and J.-P. Fang, “Memory-aware dynamic voltage and frequency prediction for portable devices,” in Proc. RTCSA, 2008, pp. 229–236.
[6] G. Dhiman and T. S. Rosing, “System-level power management using online learning,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 5, pp. 676–689, May 2009.
[7] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau, “Profile-based dynamic voltage scheduling using program checkpoints,” in Proc. DATE, 2002, pp. 168–175.
[8] D. Shin and J. Kim, “Optimizing intra-task voltage scheduling using data flow analysis,” in Proc. ASPDAC, 2005, pp. 703–708.
[9] J. Seo, T. Kim, and J. Lee, “Optimal intratask dynamic voltage-scaling technique and its practical extensions,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 1, pp. 47–57, Jan. 2006.


[10] S. Hong, S. Yoo, H. Jin, K.-M. Choi, J.-T. Kong, and S.-K. Eo, “Runtime distribution-aware dynamic voltage scaling,” in Proc. ICCAD, 2006, pp. 587–594.
[11] S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo, and T. Kim, “Dynamic voltage scaling of supply and body bias exploiting software runtime distribution,” in Proc. DATE, 2008, pp. 242–247.
[12] J. R. Lorch and A. J. Smith, “Improving dynamic voltage scaling algorithms with PACE,” ACM SIGMETRICS Perform. Eval. Rev., vol. 29, no. 1, pp. 50–61, Jun. 2001.
[13] C. Xian and Y.-H. Lu, “Dynamic voltage scaling for multitasking real-time systems with uncertain execution time,” in Proc. GLSVLSI, 2006, pp. 392–397.
[14] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, “Discovering and exploiting program phases,” IEEE Micro, vol. 23, no. 6, pp. 84–93, Nov. 2003.
[15] T. Sherwood, S. Sair, and B. Calder, “Phase tracking and prediction,” in Proc. ISCA, 2003, pp. 336–347.
[16] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks, “A dynamic compilation framework for controlling microprocessor energy and performance,” in Proc. MICRO, 2005, pp. 271–282.
[17] C. Isci, G. Contreras, and M. Martonosi, “Live, runtime phase monitoring and prediction on real systems with application to dynamic power management,” in Proc. MICRO, 2006, pp. 359–370.
[18] S.-Y. Bang, K. Bang, S. Yoon, and E.-Y. Chung, “Run-time adaptive workload estimation for dynamic voltage scaling,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 9, pp. 1334–1347, Sep. 2009.
[19] J. Kim, S. Yoo, and C.-M. Kyung, “Program phase and runtime distribution-aware online DVFS for combined Vdd/Vbb scaling,” in Proc. DATE, 2009, pp. 417–422.
[20] T. Mudge, K. Flautner, D. Blaauw, and S. M. Martin, “Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads,” in Proc. ICCAD, 2002, pp. 721–725.
[21] W. Liao, L. He, and K. M. Lepak, “Temperature and supply voltage aware performance and power modeling at microarchitecture level,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1042–1053, Jul. 2005.
[22] CACTI 5.3 [Online]. Available: http://www.hpl.hp.com/research/cacti
[23] BPTM High-k/Metal Gate 32 nm High Performance Model [Online]. Available: http://www.eas.asu.edu/ptm
[24] K. Puttaswamy and G. H. Loh, “Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3-D-integrated processors,” in Proc. HPCA, 2007, pp. 193–204.
[25] N. Kavvadias, P. Neofotistos, S. Nikolaidis, C. A. Kosmatopoulos, and T. Laopoulos, “Measurement analysis of the software-related power consumption in microprocessors,” IEEE Trans. Instrum. Meas., vol. 53, no. 4, pp. 1106–1112, Aug. 2004.
[26] S. Oh, J. Kim, S. Kim, and C.-M. Kyung, “Task partitioning algorithm for intra-task dynamic voltage scaling,” in Proc. ISCAS, 2008, pp. 1228–1231.
[27] J. Kim, S. Oh, S. Yoo, and C.-M. Kyung, “An analytical dynamic scaling of supply voltage and body bias based on parallelism-aware workload and runtime distribution,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 4, pp. 568–581, Apr. 2009.
[28] J. Kim, Y. Lee, S. Yoo, and C.-M. Kyung, “An analytical dynamic scaling of supply voltage and body bias exploiting memory stall time variation,” in Proc. ASPDAC, 2010, pp. 575–580.
[29] FFMPEG [Online]. Available: http://www.ffmpeg.org
[30] VQEG [Online]. Available: ftp://vqeg.its.bldrdoc.gov
[31] PAPI [Online]. Available: http://icl.cs.utk.edu/papi

Jungsoo Kim (S’06) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2005, and the M.S. and Ph.D. degrees through the unified course from the Department of Electrical Engineering and Computer Science, KAIST, in 2010. Since 2010, he has held a post-doctoral position with KAIST. His current research interests include dynamic power and thermal management, multiprocessor system-on-a-chip design, and low-power wireless surveillance system design.


Sungjoo Yoo (M’00) received the B.S., M.S., and Ph.D. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1992, 1995, and 2000, respectively. He was a Researcher with the TIMA Laboratory, Grenoble, France, from 2000 to 2004, and a Senior and Principal Engineer with Samsung Electronics, Seoul, from 2004 to 2008. Since 2008, he has been with the Pohang University of Science and Technology, Pohang, South Korea. His current research interests include dynamic power and thermal management, on-chip networks, multithreaded software and architecture, and fault tolerance of solid-state disks.

Chong-Min Kyung (S’06–M’81–SM’99–F’08) received the B.S. degree in electronics engineering from Seoul National University, Seoul, South Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 1977 and 1981, respectively. From April 1981 to January 1983, he was with Bell Telephone Laboratories, Murray Hill, NJ, in a post-doctoral position. Since joining KAIST in 1983, he has been working on system-on-a-chip design and verification methodology, and processor and graphics architectures for high-speed and/or low-power applications, including mobile video codecs. He was a Visiting Professor with the University of Karlsruhe, Karlsruhe, Germany, in 1989, as an Alexander von Humboldt Fellow, a Visiting Professor with the University of Tokyo, Tokyo, Japan, from January 1985 to February 1985, and a Visiting Professor with the Technical University of Munich, Munich, Germany, from July 1994 to August 1994, with Waseda University, Tokyo, from 2002 to 2005, with the University of Auckland, Auckland, New Zealand, from February 2004 to February 2005, and with Chuo University, Tokyo, from July 2005 to August 2005. Dr. Kyung is the Director of the Integrated Circuit Design Education Center, Daejeon, established in 1995 to promote integrated circuit (IC) design education in Korean universities through computer-aided design environment setup and chip fabrication services. He is the Director of the SoC Initiative for Ubiquity and Mobility Research Center, established to promote academia/industry collaboration in the SoC design-related area. From 1993 to 1994, he served as an Asian Representative on the International Conference on Computer-Aided Design Executive Committee. He received the Most Excellent Design Award and the Special Feature Award from the University Design Contest at ASP-DAC 1997 and 1998, respectively.
He received the Best Paper Awards at the 36th DAC, New Orleans, LA, the 10th International Conference on Signal Processing Application and Technology, Orlando, FL, in September 1999, and the 1999 International Conference on Computer Design, Austin, TX. He was the General Chair of the Asian Solid-State Circuits Conference 2007, and ASP-DAC 2008. In 2000, he received the National Medal from the Korean Government for his contribution to research and education in the IC design. He is a member of the National Academy of Engineering Korea and the Korean Academy of Science and Technology. He is a Hynix Chair Professor with KAIST.