IEICE TRANS. INF. & SYST., VOL.E91–D, NO.4 APRIL 2008

PAPER

A Dynamic Control Mechanism for Pipeline Stage Unification by Identifying Program Phases

Jun YAO†a), Shinobu MIWA††, Hajime SHIMADA†, Members, and Shinji TOMITA†, Fellow

SUMMARY  Recently, a method called pipeline stage unification (PSU) has been proposed to reduce energy consumption in mobile processors by inactivating and bypassing some of the pipeline registers, thus adopting a shallower pipeline. It is designed to be an energy-efficient method, especially for processors built with future process technologies. In this paper, we present a mechanism for the PSU controller which can dynamically predict a suitable configuration based on program phase detection. Our results show that the designed predictor can achieve a PSU degree prediction accuracy of 84.0%, averaged over the SPEC CPU2000 integer benchmarks. With this dynamic control mechanism, we obtain an 11.4% Energy-Delay Product (EDP) reduction in a processor that adopts a PSU pipeline, compared to the baseline processor, even after the application of complex clock gating.
key words: energy saving, dynamic optimization, pipeline stage unification, program phase

1. Introduction

In recent years, power consumption has become increasingly important in modern processor design, especially for portable and mobile platforms such as cellular phones and laptop computers. To reduce total energy usage, a method called dynamic voltage and frequency scaling (DVFS) is currently employed. DVFS decreases the supply voltage when the processor is under a low workload, which reduces the overall energy consumption of program execution. Shimada et al. [1] and Koppanalil et al. [2] have presented a different method, called pipeline stage unification (PSU). Its main purpose is to reduce the processor's energy consumption by inactivating and bypassing pipeline registers, thus using a shallower pipeline during program execution. PSU can save energy in the following ways:

1. Power can be saved by inactivating some of the pipeline registers. In modern processors, pipeline registers consume a large part of the total power [1].
2. Instructions Per Cycle (IPC) can be improved after pipeline stage unification. A shallow pipeline usually has a relatively better IPC than a deep one, due to the decreased branch misprediction resolution latency and functional unit latencies, as illustrated in [3] and [4].

Since both power and IPC play important roles in the final energy consumption, a processor that adopts a PSU pipeline is expected to suit a larger range of applications than processors with fixed-depth pipelines, in terms of power/performance efficiency. Our research in this paper focuses on controlling the PSU hardware to achieve better power/performance trade-offs. Currently, only one paper [5] is related to PSU control, and it described execution with a predefined throughput. That study did not consider the effect of the different periodic program behaviors that occur during execution. In this paper, we propose a mechanism that dynamically adjusts the pipeline configuration to changes in program behavior, so as to achieve a better Energy-Delay Product (EDP). With the adoption of table-structured hardware, we can cluster program runtime periods into groups from a high-level view of program characteristics. Based on this awareness of program behavior, the PSU controller can dynamically predict a suitable unification degree and thus reconfigure the pipeline structure to fit the characteristics of the coming runtime period. Our simulation results show that by adopting an additional structure to store history information, the designed PSU degree predictor achieves a misprediction rate of less than 16.0%, averaged over the workloads studied in our research. This misprediction rate is 2.01 times better than that of a predictor that only utilizes the most recent history information. The good predictability results in final power/performance savings for processors with PSU pipelines.

Manuscript received May 31, 2007. Manuscript revised October 29, 2007.
† The authors are with the Graduate School of Informatics, Kyoto University, Kyoto-shi, 606–8501 Japan.
†† The author is with the Graduate School of Engineering, Tokyo University of Agriculture and Technology, Koganei-shi, 184–8588 Japan.
a) E-mail: [email protected]
DOI: 10.1093/ietisy/e91-d.4.1010
Considering the EDP metric that is commonly used for mobile platforms, our proposed dynamic mechanism provides an 11.4% reduction compared to a baseline processor, even after the application of complex clock gating, which usually lowers the opportunities for other energy saving methods. In addition, the final EDP achieved by this on-the-fly control mechanism is only 1.6% larger than that of a processor instructed by a statically constructed ideal predictor, which works from profiling data.

The remainder of the paper is organized as follows: Section 2 describes the background techniques of this paper. Section 3 introduces the dynamic prediction mechanism for a processor with a PSU pipeline. Simulation methodology and metrics to evaluate the efficiency of different pipeline configurations can be found in Sect. 4. In Sect. 5, the experimental results are presented, together with some analyses. Section 6 concludes the paper.

Copyright © 2008 The Institute of Electronics, Information and Communication Engineers

2. Related Works

This section describes the background techniques related to our research. Section 2.1 describes Pipeline Stage Unification briefly and Sect. 2.2 introduces the working set signature method.

2.1 Pipeline Stage Unification

Shimada et al. [1], [6] proposed an energy consumption reduction method called Pipeline Stage Unification (PSU) for mobile processors. It unifies adjacent pipeline stages by bypassing and inactivating some of the pipeline registers. As described in paper [1], a pipeline of 20 stages was adopted as the baseline processor, following a scheme similar to current microprocessors. Three PSU degrees were assumed to represent the different configurations of the pipeline structure:

1. PSU degree 1 (U1): The normal mode, without bypassing any pipeline registers.
2. PSU degree 2 (U2): Merge every pair of adjacent pipeline stages by inactivating and bypassing the pipeline register between them. The new pipeline has 10 stages.
3. PSU degree 4 (U4): Based on U2, merge the adjacent stages one step further. This yields a 5-stage pipeline.

Thus, PSU is a method based on pipeline structure reconfiguration and provides a pipeline with multiple usable depths. The bypassing of pipeline registers in the shallow pipeline modes (U2 and U4) provides energy savings, and the changeable pipeline depth may improve IPC according to the workload. As paper [7] illustrates, the variability in programs usually drives the optimal pipeline depth to quite different design points when emphasis is put on both energy and performance. Considering a good power-performance balance, the PSU pipeline may fit a larger range of programs than pipelines of fixed depth.
According to its design, PSU can help processors save energy by adopting the suitable PSU degree. A simple way to apply such pipeline reconfiguration is to invoke a PSU degree switch upon detecting specific hardware events, such as L2 cache misses that cannot be overlapped by sufficient parallelism, as proposed in paper [8]. This kind of hardware-event-driven mechanism requires only an event detector and a relatively simple finite state machine to determine a suitable new configuration, and hence introduces low overhead. However, this simplicity also limits it to performing well only on specific programs. As an example, the L2 cache miss driven processor reconfiguration achieves an average 20.7% power saving on memory intensive benchmarks, while the saving decreases to 7.0% when averaged over all the SPEC2000 benchmarks, as shown in paper [8]. Moreover, a further study of the PSU mechanism shows that the latency of a PSU degree switch, which includes a pipeline flush and a pipeline frequency scaling, can be expected to finish within hundreds of cycles. Although this value is smaller than the voltage and frequency scaling latency of a DVFS system, which is usually on the order of microseconds as reference [9] describes, it is still very close to the latencies of hardware events such as the L2 cache miss penalty. Thus, adopting either PSU or DVFS reconfiguration at the point where a specific hardware event occurs, and scaling back from the low energy mode when the penalty period of the event expires, may not be beneficial, since the delay of the processor reconfiguration cannot fit within most hardware events. For this kind of processor reconfiguration, we need to find the balancing points where the granularity is large enough to conceal the hardware switch penalty while possible reconfiguration opportunities can still be sufficiently utilized.

2.2 Working Set Signature

Recently, apart from methods based on detecting specific hardware events, many researchers have studied workloads by analyzing their periodic behavior. Their research demonstrated that program execution can be divided into recurring periods. Dhodapkar [10] and Sherwood [11] have shown that such recurring periods can be classified into phases in which the program has similar behaviors, including cache miss rate, IPC, and power consumption. A phase may contain a set of instruction intervals, regardless of temporal adjacency. This theory gives us an opportunity to study pipeline reconfiguration opportunities at a high level, i.e., from the view of program behavior.
The understanding of program behavior enables this kind of reconfiguration method to fit a large range of applications, not only specific classes such as memory intensive programs. Also, the recommended granularity of such a program behavior analyzer is usually on the order of 100 k instructions or larger, in which the overhead of a PSU-like hardware switch can be easily hidden. In order to identify program behaviors, Dhodapkar designed a working set signature to serve as a compact representation of a program interval in paper [10]. The signature is an N-bit vector formed by mapping each working set element into one of N positions via a hash function (Fig. 1). The working set element was designed to be of instruction cache line granularity, and hence it can be represented by the upper m bits of the program counter. The N-bit signature vector is cleared before a program interval begins. During execution, a bit in the signature is set if the corresponding instruction cache line is touched. The interval length was set to 100 k instructions, and a signature vector of 1024 bits was adopted in paper [10].


Fig. 1  Mechanism for collecting a working set signature [10].

The distance between two working set signatures S1 and S2 was also given in paper [10], as the following equation:

δ = ones(S1 ⊕ S2) / ones(S1 + S2)    (1)

where ones() is the function that counts the number of "1" bits in a bit vector, ⊕ denotes bitwise XOR, and + denotes bitwise OR. A larger δ indicates a smaller code correlation between the corresponding intervals represented by S1 and S2. The working set signature provides a quantitative way to study program behaviors by fingerprinting program intervals. In paper [12], Dhodapkar presented a comparison between several phase detection methods. Among them, the most effective one is the Basic Block Vector (BBV), given by Sherwood [11]. However, the instruction working set signature provides 85% correlation with the BBV method and can be implemented at a smaller hardware cost, as stated in paper [12]. Hence, by employing the working set signature, a fast program behavior analyzer can be constructed while maintaining comparable accuracy.

As discussed in Sect. 2.1, the PSU method can accomplish a pipeline structure change in hundreds of cycles, which is faster than other processor reconfiguration methods such as DVFS. Paper [13], which proposed an online DVFS control system, showed that starting the reconfiguration after an interval on the order of milliseconds can cover the overhead of voltage and frequency scaling. With the PSU method, this granularity can be further shrunk, which helps discover more detailed program characteristics at a finer grain. While working at this finer level, besides the penalty of the PSU degree switch itself, the additional cost of phase detection should also be carefully measured. For these reasons, the relatively fast program fingerprint constructor, the working set signature, is adopted in our research, together with a program behavior analysis granularity of 100 k instructions, as recommended in Dhodapkar's paper [10].
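To make the mechanism concrete, the following Python sketch models a working set signature as an integer bitmask and computes the distance of Eq. (1). This is a minimal illustration: the 512-bit vector length and 32-byte cache lines match parameters used later in this paper, while the helper names and the simple modulo hash are our own assumptions.

```python
def signature_bit(pc: int, n_bits: int = 512, line_bytes: int = 32) -> int:
    # Map an instruction address to one of n_bits positions: drop the
    # cache-line offset, then fold the upper bits with a modulo hash.
    return (pc // line_bytes) % n_bits

def collect_signature(pcs, n_bits: int = 512) -> int:
    # The signature is an N-bit vector; a bit is set whenever the
    # corresponding instruction cache line is touched.
    sig = 0
    for pc in pcs:
        sig |= 1 << signature_bit(pc, n_bits)
    return sig

def distance(s1: int, s2: int) -> float:
    # Eq. (1): ones(S1 XOR S2) / ones(S1 OR S2).
    union = s1 | s2
    return bin(s1 ^ s2).count("1") / bin(union).count("1") if union else 0.0
```

Identical interval footprints give δ = 0 and fully disjoint ones give δ = 1, matching the interpretation that a larger δ means less code correlation.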
3. Signature History Table Based PSU Degree Predictor

The background in Sect. 2.2 suggests two assumptions: that energy consumption stays nearly flat within program phases that have similar behaviors, and that program behavior can be quantitatively studied through a compact representation,

Fig. 2  Flow of table based method.

like the working set signature. Based on these observations, in this section we present our design of a dynamic mechanism that instructs the PSU controller, from a high-level view of program characteristics, to achieve a better energy/performance balance.

3.1 Outline of the Dynamic Optimization Method

In our research, we design a dynamic mechanism that takes advantage of the working set signature in identifying program behaviors, and thus classifies runtime periods into phases. Moreover, with the knowledge that a phase usually recurs multiple times, we employ an additional table-structured hardware unit to cache the phase history information. The corresponding power/performance tuning information of each phase is also stored in the table for later use. As the PSU technology provides a pipeline of multiple usable depths with the three predefined PSU degrees, we can use this table based mechanism to predict a suitable PSU degree for the pending interval. If we predict that the program will enter a phase that has appeared in the past, we can directly choose a suitable pipeline reconfiguration without starting a new tuning procedure.

As Fig. 2 shows, this mechanism contains two threads. The first one is the signature collection thread. We divide the program into periods of fixed length (e.g., 100 k-instruction intervals), and assemble the working set signature for every interval, on every instruction cache line access. Considering the implementation, the working set signature constructor used in the first thread should be added as a specific hardware unit. Most probably, it will be constructed as a 1-bit wide RAM array with one read/write port, as introduced in paper [12]. To decrease the number of signature vector accesses and thus reduce the overhead of signature assembling, we define the last 10% of instructions as the sample for each interval (Fig. 2).
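The last-10% sampling above can be expressed as a tiny helper. This is a sketch: the interval length and sampling fraction are the values stated in the text, while the function itself is our own illustration.

```python
def sample_window(interval_len: int = 100_000, fraction: float = 0.10) -> range:
    # Only instructions in the final 10% of each interval update the
    # signature vector, cutting signature-vector accesses by about 90%
    # while still fingerprinting the interval.
    start = interval_len - int(interval_len * fraction)
    return range(start, interval_len)
```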
In addition, the signature collection occurs simultaneously with the program execution, introducing no interference on the critical path. The second thread handles the PSU degree prediction. After each interval, we classify the signature against the cached


signatures in a table-structured hardware unit, and predict a suitable PSU degree from the saved history information based on the classification result. After assigning the newly predicted PSU degree, the signature vector is cleared for the new interval, and the program returns to execution. This thread is the core of our dynamic PSU control scheme. In the implementation, we can realize the second thread either as a specific hardware unit or as an operating system module. The advantages of an OS module are that it can utilize system memory to store a relatively large amount of history information, and that the algorithm can be designed with more flexibility. However, transferring control signals between the OS and hardware units introduces additional delay from the interrupt handler. Also, processor resources are occupied when grouping program intervals in the OS module. This is usually feasible when the interval length is large, such as a granularity on the order of hundreds of millions of instructions. However, for considerations similar to those in Sect. 2.2, instead of an OS module implementation, a dedicated hardware unit is designed to accelerate the behavior analysis and thus exploit PSU's relatively fast degree switch to the greatest extent. The hardware cost to handle the algorithm core is studied in Sect. 5.

Fig. 3  Lookup SHT.

Fig. 4  Update SHT.

3.2 Structure of Signature History Table

The designed table mainly contains the working set signatures of past program intervals. Hence we denote it as the Signature History Table (SHT) and use this notation in the remainder of this paper. The hardware approach of this table method is shown in Figs. 3 and 4. In detail, the SHT employed in this algorithm is constructed as follows:

1. The signature field: The signature is the key of each entry and should be unique. History signatures that once occurred are stored here to support later classification. This field has the same storage size as the working set signature vector.
2. The state field: It denotes the state of the table entry. In this method, we define two states: tuned and tuning. A state of tuning means that the entry has just been added into the table and the best PSU degree has not yet been tuned. After all three PSU degrees have been attempted, the best PSU degree is selected from the measured power/performance metric results (another field in the table) and the state is set to tuned. One bit is used for this field.
3. The power/performance metric field: It occupies three fixed-point storage units per entry, holding the tuning information of the different PSU degrees in three sub-fields denoted U1, U2 and U4, respectively. They are updated when an instruction interval finishes and the actual metric result for that interval is retrieved.
4. The bestU field: It holds the best PSU degree for this entry. This field is set after tuning is complete. If this phase is observed again, we can predict the suitable

PSU degree from this field. Two bits are used for this field.
5. The T field: It records the time at which the entry was last touched. This field is referred to when replacing old entries. Several bits are used, according to the table size.

3.3 Detailed Algorithm

The constructor of the working set signature (thread 1 in Fig. 2) has been discussed in paper [10]. In this section, we mainly introduce the detailed algorithm of thread 2 in Fig. 2, which clusters program intervals into groups and assigns a suitable PSU degree for the next interval. Figures 3 and 4 demonstrate the structure of the SHT and the main processing of this PSU degree predictor, and Fig. 5 gives the detailed algorithm for predicting the proper PSU degree for the pending interval.

Assume that we are at the end of interval Ik and are required to estimate the PSU degree for the coming interval Ik+1. The signature S_Ik+1, which would be needed as the key to look up the best PSU degree field in the SHT, has not been determined at this point. To solve this problem, we design each SHT entry to contain the signature S_Ik of the current interval Ik, together with the probable power/performance tuning information of the next interval Ik+1. Also, the "bestU" field of this entry, indexed by S_Ik, suggests the best unification degree for its next interval. A specific register named prev_table_index is engaged for this purpose. As shown in Figs. 3 and 5, after the current interval Ik is complete, we look up the current signature S_Ik in the history table. If there is a hit, the corresponding entry will probably carry the best PSU degree for the next interval Ik+1. Accordingly, we can predict the best PSU degree based on this entry before the start of Ik+1. The register prev_table_index is then updated to the current table index. By this means, at the end of interval Ik+1, before the SHT lookup, prev_table_index indicates the entry indexed by the signature S_Ik of the last interval Ik. We calculate the power/performance result of the current interval Ik+1 and store it in the entry to which prev_table_index refers (Fig. 4). As an example of the algorithm described in these three figures, Fig. 6 shows the typical allocation and update procedures of one SHT entry.

After each interval Ik:
  if (prev && prev->state == tuned)
      update(prev, metric(Ik));
  if (prev && prev->state == tuning)
      prev->metric[unif_degree] = metric(Ik);
      if (unif_degree == U4)
          prev->bestU = best(prev->metric[U1, U2, U4]);
          prev->state = tuned;
  v = find_nearest_signature();
  δ = signature distance between v->sig and Ik;
  if (!v || δ > threshold)        /* miss */
      v = new_table_entry();
      v->sig = signature of Ik;
      unif_degree = U1;
      v->state = tuning;
  else if (v->state == tuned)
      unif_degree = v->bestU;
  else                            /* v->state == tuning */
      unif_degree = next unif_degree for v;
  prev = v;
  clear_signature();

Fig. 5  Algorithm of the SHT-based predictor, as thread 2 in Fig. 2. Here, "prev" denotes prev_table_index and "v" serves as a temporary table index. The syntax "prev->state" denotes the "state" field of the entry pointed to by "prev". unif_degree is the current PSU degree.
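The control flow of Fig. 5 can be transcribed into runnable form. The sketch below is hedged: it models the SHT as an unbounded Python list (as in the Sect. 5.1 experiments), advances the tuning sequence with the current global degree rather than a per-entry counter, and uses the 0.5 distance threshold; the class and method names are our own.

```python
U1, U2, U4 = 1, 2, 4
THRESHOLD = 0.5  # signature-distance threshold, as in Sect. 5

def distance(a: int, b: int) -> float:
    # Eq. (1): ones(a XOR b) / ones(a OR b), on integer bitmasks.
    union = a | b
    return bin(a ^ b).count("1") / bin(union).count("1") if union else 0.0

class Entry:
    def __init__(self, sig: int):
        self.sig = sig
        self.state = "tuning"   # "tuning" until U1/U2/U4 all measured
        self.metric = {}        # measured metric (e.g., EDP) per degree
        self.bestU = U1

class SHTPredictor:
    def __init__(self):
        self.table = []         # unlimited entries, as in Sect. 5.1
        self.prev = None        # the prev_table_index register
        self.degree = U1        # current unification degree

    def end_of_interval(self, sig: int, measured: float) -> int:
        # Store the metric of the interval that just finished into the
        # entry reached at the previous lookup (Fig. 4).
        if self.prev is not None and self.prev.state == "tuning":
            self.prev.metric[self.degree] = measured
            if self.degree == U4:
                self.prev.bestU = min(self.prev.metric,
                                      key=self.prev.metric.get)
                self.prev.state = "tuned"
        # Classify the current signature against cached ones (Fig. 3).
        v = min(self.table, key=lambda e: distance(e.sig, sig), default=None)
        if v is None or distance(v.sig, sig) > THRESHOLD:
            v = Entry(sig)      # table miss: allocate and start tuning
            self.table.append(v)
            self.degree = U1
        elif v.state == "tuned":
            self.degree = v.bestU   # predicted degree for next interval
        else:
            self.degree = {U1: U2, U2: U4}.get(self.degree, U1)
        self.prev = v
        return self.degree
```

Feeding one recurring phase through the predictor reproduces the Fig. 6 scenario: a miss starts tuning at U1, the next two recurrences run U2 and U4, and afterwards the smallest measured metric (P2 in the text's example) fixes the predicted degree at U2.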

Fig. 6  Outline of execution under the SHT-based algorithm.

As shown in Fig. 6, the SHT entry 0010 will be allocated at the end of interval Ik if no signature similar to S_Ik can be found in the table. We then arrange the three tunings of U1, U2 and U4 in the intervals Ik+1, Ik′+1 and Ik′′+1. The intervals Ik′+1 and Ik′′+1 are the intervals following the next recurrences of an S_Ik-like signature, regardless of whether k, k′ and k′′ are adjacent. The three Power/Perf. results are stored in SHT entry 0010, where S_Ik was originally saved. After Ik′′+1, this entry can be marked as tuned, and the best unification degree can also be determined. Supposing that P2 is the smallest among the three Power/Perf. results P1, P2 and P4, PSU degree 2 is now saved in the field bestU ("b." in Fig. 6) of entry 0010. All these operations rest on the assumption that if interval Ik+1 once followed interval Ik, and an Ik-like program block occurs again, the next program interval will probably be like Ik+1. It works similarly to a simple history-based branch predictor. As Fig. 6 indicates, at the end of Ij, if we detect that S_Ik is the nearest signature to S_Ij, we can predict that the next interval Ij+1 will behave similarly to the previously experienced intervals Ik+1, Ik′+1 and Ik′′+1. Therefore, the PSU degree for Ij+1 can be set to PSU degree 2 before it starts, as indicated in SHT entry 0010.

According to the above design, we can efficiently predict the best PSU degree if the interval following an Ik-like interval is always like Ik+1. In this case, the transition from Ik-like intervals to Ik+1-like intervals can be regarded as the dominant one. However, we must endure some misprediction penalty if the interval following Ik is variable. We will show the efficiency of this method in Sect. 5. In order to obtain good PSU prediction accuracy, we need to provide a sufficiently large SHT for the history signatures and the tuning information.
Basically, large signature vectors and SHTs are expected to yield better final efficiency than small ones. However, as we are to implement the SHT and its control mechanism in hardware, overly large signatures and SHTs are not beneficial, because of their extra energy and delay cost. In implementation, for an SHT with finite entries, two main actions are performed periodically:

1. Find the nearest signature. This function is used to classify the current signature against the cached ones, so as



to determine the program phase of the current interval. We simply look up the table, comparing the new signature with all cached signatures to find the smallest distance, calculated by Eq. (1). If this smallest distance is larger than the threshold, we report a table miss and insert the new signature for later tuning. Otherwise, the function reports a table hit.
2. Replace the least recently used (LRU) table entry when there is not sufficient room for the new signature, i.e., when new_table_entry() in Fig. 5 is called. In addition to the LRU algorithm, entries that remain in the tuning state are replaced in preference to tuned ones.

These two actions account for the major complexity of the SHT-based algorithm. Their performance greatly depends on the sizes of the table and the signature vectors. These two sizes are extensively studied in Sect. 5, with the purpose of lowering the extra hardware cost of the algorithm. Also, Sect. 5.4 presents a calculation of the overhead introduced by the algorithm, based on a rough implementation of the key hardware.

4. Simulation Methodology

We use a detailed cycle-accurate out-of-order execution simulator, the SimpleScalar Tool Set [14], together with the Wattch Tool Set [15], to measure processor energy and performance with the dynamically controlled PSU pipeline. The pipeline in these tool sets has been lengthened to 20 stages, as an example of a current microprocessor with a deep pipeline, following a scheme similar to Intel Pentium 4 platforms. The extra stages have been implemented in both Wattch's power model and SimpleScalar's delay analysis model. Table 1 lists the configuration of the baseline processor with a 20-stage pipeline. We constructed our PSU pipeline structure similarly to Shimada's proposal [1]. As described in Sect. 2.1, beginning with a 20-stage pipeline, the PSU pipeline presents 20 stages in U1, 10 stages in U2, and 5 stages in U4 mode, respectively.
The frequency and the delays of the processor units change according to the PSU degree, as listed in Table 2. We evaluated the dynamic mechanism on eight integer benchmarks (bzip2, gcc, gzip, mcf, parser, perlbmk, vortex, and vpr) from the SPEC CPU2000 suite with the train inputs. 1.5 billion instructions were simulated after skipping the first billion instructions. In this paper, we mainly studied the reduction of dynamic power via the Wattch Tool Set [15]. Although leakage power dissipation also plays an important role in the total energy consumption of modern processors, its pressure might be alleviated in the future by efforts in the device area, as indicated in paper [16]. Also, since our design does not require lowering the threshold voltage, which could increase leakage current, we believe that our proposed PSU control mechanism is orthogonal to mechanisms that reduce leakage power dissipation. Besides the consideration of leakage power, the glitch

Table 1  Baseline processor configuration.

Processor           8-way out-of-order issue, 128-entry RUU, 64-entry LSQ,
                    8 int ALU, 4 int mult/div, 8 fp ALU, 4 fp mult/div,
                    8 memory ports
Branch prediction   8 K-entry gshare, 6-bit history, 2 K-entry BTB, 16-entry RAS
L1 Icache           64 KB/32 B line/2-way
L1 Dcache           64 KB/32 B line/2-way
L2 unified cache    2 MB/64 B line/4-way
Memory              128 cycles first hit, 4 cycles burst interval
TLB                 16-entry I-TLB, 32-entry D-TLB, 144 cycles miss latency

Table 2  Assumptions of latencies and penalty.

PSU degree                                   U1      U2     U4
clock frequency rate                         100%    50%    25%
branch misprediction resolution latency      20      10     5
L1 Icache hit latency                        4       2      1
L1 Dcache hit latency                        4       2      1
L2 cache hit latency                         16      8      4
int Mult latency                             3       2      1
fp ALU latency                               2       1      1
fp Mult latency                              4       2      1
Memory access latency (first:burst)          128:4   64:2   32:1
TLB miss latency                             144     72     36
power dissipation within the dynamic power shows a similar increasing trend, especially for processors with large-scale combinational circuits. However, as this part is not explicitly calculated in the Wattch Tool Set, we did not include it in this paper. The glitch factor of the total dynamic power dissipation will be studied during the implementation of the PSU hardware as a future task.

4.1 Clock Gating

With the help of widely used clock gating, modern microprocessors can efficiently turn off unused units, and thus achieve a significant energy reduction. In our experiments, we used the cc3 style in Wattch [15] to provide a complex clock gating simulation. In this clock gating method, power scales linearly with port or unit usage, except that unused units dissipate 10% of their maximum power. The 10% factor exists because, in practical circuits, it is impossible to turn off a unit completely when it is not needed. The application of clock gating usually leaves little room for other power saving methods, including the widely adopted dynamic voltage frequency scaling (DVFS) and our proposed PSU method. However, our results show that, although the chance of energy reduction in our model is lowered by the application of clock gating, it is not totally eliminated. Detailed results with clock gating are presented in Sect. 5.
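The cc3 model just described amounts to a one-line power scaling rule. A sketch follows; the function name is ours, and usage is taken as the fraction of ports or cycles a unit is active.

```python
def cc3_unit_power(max_power: float, usage: float) -> float:
    # Wattch cc3 clock gating: power scales linearly with port/unit
    # usage, but an idle unit still dissipates 10% of its maximum
    # power, since real circuits cannot be gated off completely.
    if usage <= 0.0:
        return 0.10 * max_power
    return max_power * usage
```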


4.2 Power/Performance Metrics

To evaluate energy and performance together in the tuning procedure, we can use PDP, EDP, or EDDP as the metric, calculated as W/MIPS, W/MIPS^2, and W/MIPS^3, respectively [17]. Each formula puts a different emphasis on energy and performance and will show dissimilar efficiencies according to the evaluated platform. PDP is suitable for portable systems; EDP is usually used for high end systems such as workstations and laptop computers; EDDP is good for server families. For simplicity, we apply one single metric during each program execution. The experiments and analyses in Sect. 5 are based on EDP because our PSU is targeted at high-performance mobile computers. It is also very easy to change the power/performance scheme to PDP or EDDP for other platforms. In detail, the EDP of the processor with PSU adoption can be calculated as the following equation:

EDP = E × Delay = (Σ_{i=1}^{n} Ei) × (Σ_{i=1}^{n} CPIi / fi)    (2)
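Equation (2) can be evaluated directly from per-interval statistics. In this sketch, each interval is a tuple (Ei, CPIi, fi); the fixed per-interval instruction count is treated as folded into the delay term, as in Eq. (2).

```python
def edp(intervals):
    # Eq. (2): total energy times total delay, summed over the n
    # fixed-length intervals.  Each interval i contributes its energy
    # Ei (from Wattch) and its delay CPIi / fi, where fi depends on
    # the PSU degree that interval ran under.
    energy = sum(e for (e, _cpi, _f) in intervals)
    delay = sum(cpi / f for (_e, cpi, f) in intervals)
    return energy * delay
```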

Table 3  Prediction accuracy of each benchmark, together with benchmark characteristics.

Group  Benchmark  Stable durations (%)  No. of sig.  Mean stable duration  Non-tab. acc. (%)  Table acc. (%)
1      perl.      99.25                 2            425.37                99.98              99.98
1      vpr        99.72                 4            1359.82               99.79              99.97
2      mcf        30.18                 25           2.29                  34.43              93.77
2      vortex     66.48                 20           7.69                  65.32              91.01
2      gzip       68.47                 22           10.48                 74.56              81.47
2      bzip2      95.38                 19           201.51                68.13              87.61
3      gcc        41.39                 231          8.67                  54.67              66.17
3      parser     61.46                 122          10.78                 45.25              52.31

5. Results and Analyses

5.1 PSU Degree Prediction Accuracy

Since we are designing a dynamic mechanism to predict a suitable PSU degree for the next interval, the prediction accuracy is very important to the final energy saving result. Table 3 shows the PSU degree prediction accuracies of the SHT-based predictor. The prediction accuracy is obtained by comparing the predicted PSU degrees with those of a 100% optimal predictor, constructed from post-simulation trace data. Under the 100% optimal predictor, each interval is executed in the PSU degree that yields the smallest EDP. This 100% optimal predictor is denoted as "ideal" in the following. In addition, the results of the previously proposed phase detection method, which only uses the most recent history information as described in papers [10] and [18], are also listed in this table for comparison. Figure 7 shows the modified algorithm, derived from paper [18] and combined with the PSU degree predicting module. The basic processing in this algorithm is to detect phase switching by comparing the signatures of the current and last intervals. Consecutive intervals with similar signatures can share one PSU degree configuration, since they are regarded as having similar

After each interval Ik : δ = signature distance of Ik and Ik−1 ; if (state == stable) if (δ > threshold) state = unstable; unification degree = U1; else if (state == unstable) if (δ ≤ threshold) state = tuning; unification degree = U1; else if (state == tuning) if (δ > threshold) state = unstable; unification degree = U1; else if (unification degree == U4) state = stable; unification degree=best from tuning; else unification degree = next tuning unification degree;

(2)

In this model, we assume that an application is divided into intervals of fixed length. In Eq. (2), n is the total number of intervals. For a specific interval, Ei is the energy consumed in the processor with a PSU pipeline, given by Wattch. CPIi is the cycles per instruction, collected by SimpleScalar. And fi is the corresponding frequency, depending on which PSU degree that specified interval experiences.

of each benchmark, together with bench-

Fig. 7

Non-table based PSU degree predictor algorithm [18].

behaviors. Correspondingly, the program runtime state is set to stable within this period. The stable state turns into unstable at the point when the distance between the current and last signatures exceeds the predefined threshold. For simplicity, PSU degree 1 is used during the unstable period, until consecutive signatures become similar again. Tuning is then triggered for the new stable program period, in order to search for the best PSU configuration. Since this method relies only on the signatures of the two most recent adjacent intervals to detect phase switching, we refer to it as the "non-table" method in later sections. In this series of experiments, we employed a 512-bit-long working set signature vector and a table with unlimited entries for the SHT-based predictor, in order to study the efficiency of our proposed mechanism in a nearly ideal environment. The non-table method can be approximately treated as the limiting case of the SHT method, built with a one-entry table. The threshold to distinguish two signatures is set to 50%, as derived from paper [10]. Each interval contains 100 k instructions. A simple hash function based on shift and masking is used to lower the cost of assembling signatures. To help understand the prediction accuracies, some statistics of program characteristics are also listed in Table 3. Specifically, the "stable duration" is calculated as the percentage of application runtime spent in the


stable regions. We define a stable region as a duration in which the distance between the signatures of every two adjacent intervals is not larger than the predetermined threshold. The column "No. of sig." gives the number of different signatures that occur during the program execution. The "Mean stable duration" is the average length of the stable regions (in 100 k-instruction samples). These three values give a general picture of the program behaviors. Based on these statistical characteristics, we can divide the benchmarks into three groups, as shown in Table 3:

• Group 1: Benchmarks that show high stability from the viewpoint of the working set signature, namely perlbmk and vpr. These two benchmarks have the highest ratios of stable duration, together with the smallest numbers of different signatures among all the benchmarks. Their mean stable durations are on the order of 100 or 1000 interval samples. Most program-behavior-based mechanisms can be expected to perform well for such stable benchmarks.
• Group 2: Benchmarks with lower stable durations but a limited number of different signatures, namely gzip, mcf, and vortex. These benchmarks are less stable than those in Group 1. However, the number of different signatures is on the order of 10, which shows that the code diversity between intervals is not large. An exception is the benchmark bzip2, with a high stable duration ratio but a medium number of different signatures. We also place it in Group 2, since our later results show that code diversity plays the more important role in the final PSU degree prediction.
• Group 3: Benchmarks gcc and parser. These two benchmarks exhibit high instability, experiencing a large number of different working set signatures. Their stable durations are also relatively small.

Table 3 clearly indicates that the prediction accuracy varies with the different characteristics of the applications.
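The grouping above hinges on the signature distance of Eq. (1), whose numerator and denominator are the popcounts of S1 ⊕ S2 and S1 + S2, respectively, as Sect. 5.4 details. A minimal sketch in Python, with illustrative 8-bit values:

```python
def signature_distance(s1: int, s2: int) -> float:
    """Relative working set signature distance: |S1 XOR S2| / |S1 OR S2|."""
    union = s1 | s2
    if union == 0:
        return 0.0  # two empty signatures are treated as identical
    return bin(s1 ^ s2).count("1") / bin(union).count("1")

# Two 8-bit signatures sharing most of their working set:
d = signature_distance(0b00111100, 0b00111010)  # 2 differing of 5 union bits -> 0.4
# 0.4 is below the 50% threshold, so the two intervals belong to the same phase.
```

Disjoint signatures give a distance of 1.0, identical ones 0.0, so the 50% threshold splits the range symmetrically.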
For the most stable benchmarks, perlbmk and vpr, which have very high stable duration ratios and very limited numbers of signatures, the non-table and SHT-based methods perform almost equivalently. However, the advantage of the signature history table can be immediately observed for the less stable benchmarks of medium code diversity, from the accuracy results of the Group 2 benchmarks. Specifically, for benchmark mcf, the "non-table" predictor results in a misprediction rate of more than 65%, while the SHT-based predictor achieves less than 7% in comparison. For the Group 3 benchmarks, although the SHT-based method still performs slightly better than the non-table method, both methods experience degradations in prediction accuracy. This indicates that for these most variable benchmarks, the relationship between two signatures is not stable. Thus it is hard for the predictor to estimate an appropriate PSU degree for the pending interval based on the history information. To improve the accuracy for benchmarks in this group, we need

to store more history information about the jump directions between signatures. However, such a design would increase the hardware complexity significantly, so we do not include it in this paper. Overall, averaged over all the integer benchmarks we have studied, the SHT-based predictor achieves a misprediction rate of 16.0% in estimating the PSU degree for the next bulk of instructions. It performs 2.01 times better than the non-table method in terms of misprediction rate. Better prediction accuracy ultimately results in larger savings in the power/performance metric. Detailed results are studied in Sect. 5.3.

5.2 Determining the Parameters for the History Table

5.2.1 The Signature Vector Size

The previous section clearly demonstrates that the SHT-based predictor is effective in detecting program phases and predicting suitable PSU degrees for pipeline reconfiguration. However, adopting a 512-bit-long signature vector might not be practical for a real system. A large signature vector will drastically increase the latency of classifying signatures, and the extra power consumed in the signature hardware units can easily overwhelm the energy saved by the PSU technology. In Fig. 8, we show how the PSU degree prediction accuracy changes with different working set signature vector sizes. Specifically, we show the results with 512-bit, 256-bit, 128-bit, and 64-bit signature vectors. An infinite signature history table size was assumed in the whole sequence, in order to see the pure influence of different signature sizes on prediction accuracy. Here, the PSU degree prediction accuracy is again measured against the previously introduced 100% optimal method (Sect. 5.1), which statically adopts the most suitable PSU degree for each interval based on the profiling data. The x dimension of Fig.
8 lists the 8 benchmarks we have studied, and the y dimension shows the prediction accuracy of the SHT-based predictor for each benchmark with different vector sizes. As Fig. 8 illustrates, larger signature vectors tend to predict better. However, the prediction accuracy stays almost identical to that of the 512-bit-long signature

Fig. 8  Prediction accuracy vs. signature vector size.


down to a 128-bit-long signature. Observable degradations in prediction accuracy occur for benchmarks bzip2 and parser with a 64-bit vector. Furthermore, for some benchmarks like gcc, parser, and vortex, the smaller 256-bit or 128-bit signatures even slightly outperform the largest 512-bit-long signature. This is because, although a larger signature vector usually distinguishes program behaviors more accurately and helps divide a program into more groups, it potentially increases the complexity for the predictor to detect the relationships between signatures and may also require more tuning information to determine a suitable PSU degree. In general, from the results shown in Fig. 8, we consider a 128-bit-long working set signature sufficient for our SHT-based PSU degree predictor. This value is adopted in the remainder of this paper.

5.2.2 The Signature History Table Size

Similar to the consideration of the signature size, the number of entries in the history table is another important parameter for the history-table-based method. However, as Table 3 shows, the numbers of different signatures are comparatively small for most of the benchmarks, and the majority of the benchmarks are quite stable. Therefore, it is possible to set a small fixed table size without degrading the prediction accuracy. We conducted a second series of experiments, varying the table size from infinite entries down to 1 entry. A 128-bit-long signature vector was adopted in these experiments. We used an LRU mechanism to replace a table entry when there is insufficient room for a new signature. The results are shown in Fig. 9. As in Sect. 5.2.1, the prediction accuracy is used as the measure of effectiveness. Figure 9 depicts the prediction accuracies of the SHT-based predictor under different table sizes, in a similar format to Fig. 8.
For each benchmark, the columns correspond to the prediction accuracies of infinite-entry, 64-entry, 16-entry, 8-entry, 2-entry, and 1-entry signature history tables. These prediction accuracies show little degradation between the infinite-entry, 64-entry, and 16-entry results for all benchmarks. For the two most stable benchmarks in Group 1, perlbmk and vpr, the table size has no influence, since very few signatures occur in these two benchmarks, as Table 3

Fig. 9  Prediction accuracy vs. table size.

illustrates. For the other benchmarks, a larger table size tends to give better prediction accuracy, and an observable decrease occurs once the table shrinks to 8 entries. There are also some deviations, in that smaller table sizes even perform better than larger ones for some individual benchmarks like gcc and gzip. The prediction accuracy of benchmark bzip2 shows the same tendency in the small-table-size region. This indicates that we may gain some benefit by properly discarding overly old history information in some benchmarks. Nonetheless, we can still assume a relatively small table of 16 entries that fits all 8 benchmarks we have studied reasonably well on average.

5.2.3 The Interval Length

Besides these two parameters related to the SHT itself, the interval length also affects the efficiency of signature-based predictors, since it defines the granularity of program behaviors. Too coarse a granularity will conceal possible distinctions inside a program interval, while too fine a granularity will tolerate fewer glitches and also increase the overhead of the SHT-based mechanism. Although it is possible to vary the interval length dynamically according to program behaviors, doing so would increase the complexity of the phase detection algorithm. For these reasons, a suitable fixed interval length was explored in our research. We employed 10 k, 100 k, and 1 M instructions as the fixed interval length, respectively. The results showed that the prediction accuracy for 100 k was better than for the other two. However, compared with the vector and table sizes, these three interval lengths introduce relatively small differences. For reasons of space, the detailed results are not included.
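Putting the tuned parameters together (128-bit signatures, a 16-entry LRU table, 100 k-instruction intervals), the phase-tracking side of the predictor can be sketched as follows. The class and function names, and the 32-byte block granularity of the shift-and-mask hash, are illustrative assumptions; exact-match lookup stands in for the distance-threshold clustering of the real predictor:

```python
from collections import OrderedDict

SIG_BITS = 128            # signature vector width chosen in Sect. 5.2.1
TABLE_ENTRIES = 16        # SHT size chosen in Sect. 5.2.2
INTERVAL_INSTS = 100_000  # interval length chosen in Sect. 5.2.3

def update_signature(signature: int, pc: int, block_shift: int = 5) -> int:
    """Set one bit per touched instruction block (simple shift-and-mask hash)."""
    index = (pc >> block_shift) & (SIG_BITS - 1)  # drop block offset, mask to 0..127
    return signature | (1 << index)

class SignatureHistoryTable:
    """LRU-managed map from a phase signature to its tuned PSU degree."""
    def __init__(self, capacity: int = TABLE_ENTRIES):
        self.capacity = capacity
        self.entries = OrderedDict()  # signature -> PSU degree (1, 2, or 4)

    def lookup(self, signature: int):
        degree = self.entries.get(signature)
        if degree is not None:
            self.entries.move_to_end(signature)  # refresh LRU recency
        return degree

    def insert(self, signature: int, degree: int) -> None:
        self.entries[signature] = degree
        self.entries.move_to_end(signature)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

A real lookup would return the cached entry whose Eq. (1) distance to the current signature falls below the 50% threshold; exact matching is shown only to keep the sketch short.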
In summary, these three parameters (the size of the working set signature vector, the number of SHT entries, and the interval length) define the hardware complexity of phase detection in the SHT-based predictor and can be tuned to a moderate level without sacrificing accuracy. A hardware implementation based on these tuned values, together with an estimation of the additional cost, is studied in Sect. 5.4.

5.3 EDP Reduction by the SHT-Based PSU Adoption Method

In the previous sections, we studied the efficiency of our dynamic PSU degree predictor based on a history table. The final target of this dynamic processor reconfiguration framework is to achieve a better power/performance trade-off. In this section, we evaluate our dynamic SHT-based predictor from the viewpoint of EDP reduction. In the following experiments, the signature size was chosen to be 128 bits and the table size was set to 16, as determined in Sect. 5.2. Figure 10 depicts the EDP results for all 8 integer benchmarks under our deployed environments, using the SHT-based predictor. For

comparison, the EDP results of the baseline processor and of the ideal PSU degree predictor defined in Sect. 5.1 are also included. All these values are results after the application of clock gating. In Fig. 10, the y dimension lists the achieved EDP of each benchmark, normalized to the EDP of the same benchmark under the baseline execution. The columns for each benchmark correspond, from left to right, to the baseline processor, the SHT-based PSU adoption with 128-bit-long signatures and 16 SHT entries, and the ideal PSU adoption. The benchmarks are shown in decreasing EDP order under the SHT-based PSU adoption. The averages over all benchmarks are also listed.

Fig. 10  Normalized EDP for SPEC CPU2000 integer benchmarks.

As Fig. 10 demonstrates, the processor with ideal PSU adoption tends to have a better EDP than the baseline processor by switching the PSU degree properly. The only exceptions are benchmarks bzip2 and vortex. This is because the optimal method is designed to have the smallest EDP in each interval: it ideally minimizes Σ_{i=1}^{n} (E_i × D_i). Since the actual EDP in Fig. 10 is calculated as (Σ_{i=1}^{n} E_i) × (Σ_{i=1}^{n} D_i) (Eq. (2)), employing the ideal PSU degree for some program phases may slightly impede the final actual total EDP. However, for these two benchmarks the differences between the processor instructed by the ideal predictor and the baseline processor are relatively small, compared to the good EDP savings in the other benchmarks. Therefore, we still consider it a criterion against which to evaluate other dynamic PSU adoption methods. On average, the ideal PSU achieves a 12.8% EDP saving compared to the baseline processor. This indicates that even with the application of complex clock gating, which usually limits the opportunities of other energy saving methods, the chance of EDP saving is not totally eliminated for processors that adopt PSU pipelines.

With its high prediction accuracy, our SHT-based prediction can manage the PSU pipeline to achieve EDP results very similar to the ideal scenario. As Fig. 10 shows, on average, the processor with the on-the-fly SHT predictor achieves 88.6% of the baseline processor's EDP. This value is only 1.6% larger than the ideal one, which is constructed on the profiling data. Considering the benchmarks individually, the EDP deviations between the SHT-based and ideal predictors reflect a trend similar to the prediction accuracy. Specifically, for benchmarks in Group 1 (defined in Sect. 5.1), in which the SHT-based predictor achieves above 99% prediction accuracy, the corresponding EDP reductions are nearly perfect compared to the ideal ones. For benchmarks in Group 2, the deviations become slightly larger. An observable degradation from the ideal to the SHT method occurs for the Group 3 benchmarks, gcc and parser. The relatively high misprediction rate (above 40%) of the SHT-based PSU degree predictor leads to more than a 25% loss of opportunity in reducing EDP, relative to the maximum EDP reduction of the ideal method. However, such degradations are considered tolerable, since the chances provided by the ideal method are also less significant in these two benchmarks.

Another observation we can draw from Fig. 10 is that different benchmarks provide quite different EDP reduction opportunities. PSU can save EDP by executing the program in shallow pipeline configurations (U2 or U4), if the increased delay in the shallow modes does not outweigh the gains from the saved energy. The chance to execute a program in shallow modes depends entirely on the program behavior, which is a complex combination of branch misprediction rate, number of pipeline hazards, memory access intensity, and so on. In Table 4, we list the percentage of time spent at each PSU degree under the SHT-based method.

Table 4  PSU degree ratios in the dynamic PSU adoption with SHT.

  Benchmark   U1        U2        U4
  bzip2       83.00%    12.56%    4.44%
  vortex      99.93%    0.04%     0.03%
  vpr         0.01%     99.98%    0.01%
  gcc         17.86%    71.57%    10.57%
  parser      7.97%     64.96%    27.07%
  gzip        16.61%    72.80%    10.58%
  perlbmk     0.01%     99.98%    0.01%
  mcf         0.77%     1.45%     97.78%
  average     28.27%    52.92%    18.81%

On average, U2 is the most preferred of the three PSU degrees. However, the data in Table 4 also reveal that no single fixed pipeline depth can perfectly match all the benchmarks in achieving a good power-performance trade-off. Therefore, with effective dynamic PSU management (the SHT-based predictor in this paper), a pipeline with multiple usable depths can fit a larger range of benchmarks.

5.4 Hardware Implementation

As indicated in Sect. 3.3, the complexity of the SHT-based algorithm mainly comes from the phase clustering part.
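The clustering lookup, scanning the SHT for the cached signature most similar to the current one under Eq. (1), can be sketched as a linear scan; the hardware described in Sect. 5.4 computes the same distances with dedicated logic:

```python
def popcount(x: int) -> int:
    """Number of set bits in an integer signature."""
    return bin(x).count("1")

def nearest_entry(current: int, cached_signatures):
    """Scan the SHT for the entry with the smallest Eq. (1) distance:
    |S1 XOR S2| / |S1 OR S2|, a value in [0, 1]."""
    best, best_distance = None, 2.0  # start above the maximum possible distance
    for sig in cached_signatures:
        union = popcount(current | sig)
        distance = popcount(current ^ sig) / union if union else 0.0
        if distance < best_distance:
            best, best_distance = sig, distance
    return best, best_distance

# 0b1101 differs from the current signature in 1 of 3 union bits (distance 1/3),
# closer than 0b1010 (2 of 3 union bits, distance 2/3):
entry, d = nearest_entry(0b1100, [0b1010, 0b1101])
```

With at most 16 cached entries, this scan is short; the paper's point below is that even a serialized hardware version of it finishes well within one 100 k-instruction interval.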
The major task performed in this clustering step is to look up the current signature vector in the SHT to find the most similar one, which requires applying the distance calculation (Eq. (1)) to the pair of the current signature and each cached signature. In the implementation, we used the logic shown in Fig. 11 to realize the function num_of_1bit_in(S1 ⊕ S2), which is the numerator of Eq. (1). The logic acts as a layer of XOR gates followed by a parallel full adder, which outputs Σ_{i=0}^{n−1} (S1[i] ⊕ S2[i]). We replaced the first two steps of 1-bit and 2-bit full adders by single logic gates with the corresponding mapping described in Fig. 11, which slightly accelerates the whole procedure. The implementation of num_of_1bit_in(S1 + S2) is very similar to Fig. 11, except that the XOR gates of the first step are changed to OR gates. After obtaining these two sums, a simple division is required to generate the final distance, as Fig. 12 shows.

Fig. 11  Logic for num_of_1bit_in(S1 ⊕ S2).

Fig. 12  Distance calculation for each entry in SHT.

Combining this with the conclusion that a 128-bit signature is sufficient for the SPEC CPU2000 integer benchmarks, we described the above logic in Verilog HDL (Hardware Description Language) and performed behavioral and logic synthesis. The logic delay values generated by the timing analyzer showed that the critical path, with both cell and interconnect delays included, incurs a latency of about 20.6 fan-out-of-four (FO4) inverter delays from the input of the two signatures to the output of the distance result. Since current processors usually have pipeline stages of around 10 FO4 inverter delays, one 128-bit signature distance calculation can be finished in approximately 2 cycles. Thus, even in the worst case where all 16 SHT entries are occupied, the distances can be calculated in less than 50 cycles. It can be even faster if we elaborate the calculations in parallel. Compared with the granularity of invoking the signature clustering procedure, which is the predefined 100 k-instruction interval and is usually a period of more than 50 k cycles, the latency of the additional SHT algorithm can be roughly regarded as negligible. The RAM structure that holds the 16 history signatures is 256 bytes. Using Cacti 4.2 [19], we estimate that it adds approximately 0.042% to the processor area. As for the additional combinational logic required by this SHT method, the synthesis result shows that the number of transistors used to compute one distance is less than 50,000, which adds 0.037% to the total number of processor transistors. With the short active period per 100 k instructions, the cost of the SHT-based method can also be roughly regarded as negligible.

5.5 Considering Voltage Frequency Scaling

Dynamic voltage frequency scaling (DVFS) is a commonly used method in the energy saving field. In this section, we compare the efficiency of DVFS with that of our proposed PSU-based dynamic optimization. As we studied the DVFS model, we found that with recent chip technologies it can hardly help reduce EDP or EDDP, despite its great savings in PDP. Roughly, we can assume that performance degrades linearly as the frequency scales down in a DVFS system†. For a bulk of instructions selected to execute at the lowered voltage, we approximately have:

    Metric_DVFS(m) / Metric_normal(m) = (V_low / V_dd)² × (f / f_low)^(m−1)    (3)

where Metric(m) stands for the power/performance metrics defined in paper [17]: it corresponds to PDP, EDP, and EDDP for m taking the value 1, 2, or 3, respectively. Specifically, if we consider a processor like the 90 nm Pentium M [9], we can assume the following parameters for Eq. (3): V_dd = 1.34 V, f = 2 GHz, V_low = 1.1 V, and f_low = 1 GHz. When m = 1, PDP_DVFS/PDP_normal = 0.6738, which means DVFS can efficiently reduce the PDP. However, when m = 2, EDP_DVFS/EDP_normal = 1.347. This shows that DVFS technology incurs some penalty for EDP- or EDDP-like metrics; it suffers from the penalty in delay as m becomes larger. Moreover, there are many restrictions on voltage scaling in current and future process technologies (e.g., soft errors and process deviation). The ineffectiveness of DVFS for the EDP and EDDP metrics will probably become even larger in future process technologies. Recently, some researchers have begun to consider hiding the DVFS performance degradation under L2 cache misses with the help of different power-supply networks for the processor and memory, as in papers [8], [13]. Their research shows that DVFS-like technology can also provide good EDP reduction, especially for memory intensive applications. As a comparison with the efficiency of the dynamic PSU adoption mechanism in our research, we built a fast model in the employed simulation environment, following

† Strictly speaking, clock frequency reduction improves IPC because of the decrease in memory access latency (in cycles), so processor performance does not degrade exactly linearly with frequency. However, except for memory intensive benchmarks, the effect of this memory latency reduction is very limited. Detailed simulation results are given later in this section.
As a comparison to the efficiency of dynamic PSU adoption mechanism in our research, we built a fast model in the employed simulation environment, following † Strictly speaking, clock frequency degradation improves IPC because of the decrease in memory access latency (in cycles), so that the performance of the processor will not degrade as linearly as frequency. However, other than memory intensive benchmarks, the effectiveness of memory latency reduction is very limited. Detailed simulation results are given in latter part.

the voltage scaling scheme of the Pentium M processor. Eight voltage-frequency settings were assumed according to [9], as shown in Table 5.

Table 5  DVFS settings.

  Voltage (V)   Freq. (GHz)
  1.340         2.0
  1.292         1.8
  1.244         1.6
  1.196         1.4
  1.148         1.2
  1.1           1.0
  1.052         0.8
  0.988         0.6

Fig. 13  Normalized EDP in processors with ideal DVFS adoption.

Figure 13 depicts the simulated EDP results of the constructed DVFS system. For simplicity, the DVFS management was designed to be "ideal", as in the 100% optimal PSU adoption, to indicate the potential EDP reduction a DVFS-enabled system can achieve. The EDP results with and without the application of clock gating are listed in Fig. 13 for comparison. All the EDP results for each benchmark are normalized to the EDP of the baseline processor without any clock gating. We also provide the values of EDP_ideal_cc3/EDP_baseline_cc3 as the line with points in Fig. 13, to indicate the possible EDP saving chance under the clock gating model. As in the PSU control mechanism, the clock gating applied here is of style cc3 from Wattch [15]. Since we use a different platform to calculate the EDP reduction than paper [13], the final results are quite different. However, this figure illustrates a trend similar to paper [13]: EDP can be reduced by DVFS in a memory intensive benchmark like mcf, while the saving is less observable in the other benchmarks. Studying the results further, we can see that clock gating itself provides more than 70% of the EDP reduction, which mainly comes from the gated clocks in the RAM structures. The modified energy breakdown after applying clock gating leads to a decrease in the EDP saving chances for DVFS. Specifically, for benchmark mcf, the EDP reduction can reach 12.0% in a non-clock-gating DVFS system, but this value is halved after we apply clock gating style cc3 via Wattch. In general, under our deployed evaluation environments, DVFS does not perform very well in reducing EDP, compared with the proposed dynamic PSU adoption method.
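Eq. (3) with the Pentium M endpoint parameters quoted in this section reproduces the PDP and EDP ratios directly; a small sketch:

```python
def dvfs_metric_ratio(v_low: float, v_dd: float, f: float, f_low: float, m: int) -> float:
    """Eq. (3): DVFS-to-normal metric ratio; m = 1 (PDP), 2 (EDP), 3 (EDDP)."""
    return (v_low / v_dd) ** 2 * (f / f_low) ** (m - 1)

# 90 nm Pentium M endpoints: Vdd = 1.34 V at 2 GHz, Vlow = 1.1 V at 1 GHz
pdp_ratio = dvfs_metric_ratio(1.1, 1.34, 2.0, 1.0, m=1)  # ~0.674: PDP improves
edp_ratio = dvfs_metric_ratio(1.1, 1.34, 2.0, 1.0, m=2)  # ~1.348: EDP worsens
```

The voltage term is squared regardless of m, while the frequency (delay) penalty grows with m, which is why DVFS helps PDP but hurts EDP and EDDP here.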

6. Conclusions and Future Work

In this paper, we focused on power/performance savings in modern processors based on the PSU technology. We designed a signature history table based PSU degree predictor that instructs the PSU controller by estimating a suitable PSU degree for the pending program period, through the recognition of program phases. Averaged over the 8 integer benchmarks we have studied, our simulation results show that this dynamic predictor achieves a PSU degree misprediction rate within 16%. It is 2.01 times better than the non-table based predictor in terms of misprediction rate. The high prediction accuracy results in good energy efficiency for PSU pipelines. With this SHT-based dynamic mechanism, we obtain an 11.4% EDP reduction in the processor with PSU adoption, compared to the baseline processor, even after the application of complex clock gating. This EDP result is only 1.6% larger than that of the processor instructed by the ideal predictor.

Future work will focus on the implementation of PSU. By studying the hardware approach, we envision a more accurate model of energy consumption that includes the detailed overhead introduced by the dynamic prediction mechanisms. In addition, program phase detection methods other than the working set signature will be tried on the PSU system.

Acknowledgments

This research is partially supported by Grant-in-Aid for Fundamental Scientific Research (S) #16100001 from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

[1] H. Shimada, H. Ando, and T. Shimada, "Pipeline stage unification: A low-energy consumption technique for future mobile processors," Proc. 2003 International Symposium on Low Power Electronics and Design, pp.326–329, ACM Press, 2003.
[2] J. Koppanalil, P. Ramrakhyani, S. Desai, A. Vaidyanathan, and E. Rotenberg, "A case for dynamic pipeline scaling," Proc.
2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pp.1–8, ACM Press, 2002.
[3] M.S. Hrishikesh, D. Burger, N.P. Jouppi, S.W. Keckler, K.I. Farkas, and P. Shivakumar, "The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays," Proc. 29th Annual International Symposium on Computer Architecture, pp.14–24, IEEE Computer Society, 2002.
[4] V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, V. Zyuban, P.N. Strenski, and P.G. Emma, "Optimizing pipelines for power and performance," Proc. 35th Annual ACM/IEEE International Symposium on Microarchitecture, pp.333–344, IEEE Computer Society Press, 2002.
[5] H. Shimada, H. Ando, and T. Shimada, "Power consumption reduction through combining pipeline stage unification and DVS," IPSJ Trans. Advanced Computing Systems, vol.48, no.3, pp.75–87, Feb. 2007.
[6] H. Shimada, H. Ando, and T. Shimada, "Reducing processor energy consumption with pipeline stage unification," Trans. IPSJ, vol.45,


no.1, pp.18–30, 2004.
[7] A. Hartstein and T.R. Puzak, "Optimum power/performance pipeline depth," Proc. 36th Annual IEEE/ACM International Symposium on Microarchitecture, pp.117–128, IEEE Computer Society, 2003.
[8] H. Li, C.Y. Cher, K. Roy, and T.N. Vijaykumar, "Combined circuit and architectural level variable supply-voltage scaling for low power," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.13, no.5, pp.564–576, May 2005.
[9] Intel Corporation, "Intel Pentium M processor on 90 nm process with 2 MB L2 cache datasheet," 2006.
[10] A.S. Dhodapkar and J.E. Smith, "Managing multi-configuration hardware via dynamic working set analysis," Proc. 29th Annual International Symposium on Computer Architecture, pp.233–244, IEEE Computer Society, 2002.
[11] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, "Discovering and exploiting program phases," IEEE Micro, vol.23, no.6, pp.84–93, Nov./Dec. 2003.
[12] A.S. Dhodapkar and J.E. Smith, "Comparing program phase detection techniques," Proc. 36th Annual IEEE/ACM International Symposium on Microarchitecture, pp.217–227, IEEE Computer Society, 2003.
[13] C. Isci, G. Contreras, and M. Martonosi, "Live, runtime phase monitoring and prediction on real systems with application to dynamic power management," Proc. 39th Annual IEEE/ACM International Symposium on Microarchitecture, pp.359–370, 2006.
[14] D. Burger and T.M. Austin, "The SimpleScalar tool set, version 2.0," SIGARCH Computer Architecture News, vol.25, no.3, pp.13–25, 1997.
[15] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis and optimizations," Proc. 27th Annual International Symposium on Computer Architecture, pp.83–94, ACM, 2000.
[16] P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner, M. Hussein, J. Hwang, D. Ingerly, R. James, J. Jeong, C. Kenyon, E. Lee, S.H. Lee, N. Lindert, M. Liu, Z. Ma, T. Marieb, A. Murthy, R. Nagisetty, S. Natarajan, J. Neirynck, A. Ott, C. Parker, J. Sebastian, R. Shaheed, S. Sivakumar, J. Steigerwald, S. Tyagi, C. Weber, B. Woolery, A. Yeoh, K. Zhang, and M. Bohr, "A 65 nm logic technology featuring 35 nm gate lengths, enhanced channel strain, 8 Cu interconnect layers, low-k ILD and 0.57 μm² SRAM cell," 2004 IEEE International Electron Device Meeting Technical Digest, pp.657–660, Dec. 2004.
[17] R. Gonzalez and M. Horowitz, "Energy dissipation in general purpose microprocessors," IEEE J. Solid-State Circuits, vol.31, no.9, pp.1277–1284, Sept. 1996.
[18] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, "Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures," Proc. 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pp.245–257, ACM Press, 2000.
[19] D. Tarjan, S. Thoziyoor, and N.P. Jouppi, "CACTI 4.0," Technical Report 2006/86, HP Laboratories, 2006.

Jun Yao was born in 1978 and received his B.E. and M.E. degrees from Tsinghua University in 2001 and 2004, respectively. He became a researcher at the Graduate School of Informatics, Kyoto University in October 2007. His research interests are computer architecture and storage area networks. He is currently a member of IPSJ.

Shinobu Miwa was born in 1977 and received his Ph.D. degree from Kyoto University in 2008. He has been an assistant professor in the Graduate School of Engineering, Tokyo University of Agriculture and Technology since 2008. His research interests are computer architecture and neural networks. He is a member of IPSJ and JSAI.

Hajime Shimada was born in 1976 and received his B.E., M.E. and D.E. degrees from Nagoya University in 1998, 2000 and 2004, respectively. He was a research associate in the Graduate School of Informatics, Kyoto University in 2005 and is now an assistant professor in the same faculty. He currently focuses on research related to computer architecture. He is a member of IPSJ.

Shinji Tomita was born in 1945 in Japan. He received the B.E., M.E. and D.E. degrees from Kyoto University in 1968, 1970 and 1973 respectively. He is currently Dean of the Graduate School of Informatics, Kyoto University. His major research interests are on computer architecture and parallel processing. He is a member of IPSJ, ACM and IEEE.