Power Savings in Globally Asynchronous Locally Synchronous Designs

Thomas Olsson and Peter Nilsson
Dept. of Applied Electronics, Lund University, P.O. Box 118, SE-22100, Lund, Sweden

ABSTRACT
Clock frequencies will increase to a level where they are not suitable for centralized distribution over a large silicon die. Global synchronous methods will therefore not be suitable. Partitioning large high-speed globally synchronous ASICs into locally clocked blocks reduces clock skew problems and can also be utilized to reduce the power consumption.

INTRODUCTION
On the path towards large single-chip solutions, many functional blocks come together on one die. With growing die sizes and shrinking clock periods, clock skew becomes a problem. To handle it, many different approaches to building the clock network have been suggested. Traditionally, a large balanced clock tree has been the most common solution. Clock frequencies will, however, increase to a level where they are not suitable for centralized distribution over a large silicon die, so globally synchronous methods will no longer be suitable. Besides the clock skew problem, the clock line often suffers from a huge load, which causes high power consumption in the clock net.

A design concept known as GALS (Globally Asynchronous Locally Synchronous) has recently attracted renewed attention, as it promises a solution to these problems. In GALS, the design is partitioned into a number of synchronous blocks with local autonomous clocks. Data exchange between the blocks is realized by asynchronous means. GALS was introduced in [1], and experimental designs have been implemented in [2] and [3].

The GALS concept brings a number of advantages. Clock skew constraints and clock power consumption are reduced since the local clock trees are smaller in extent. The clock speed and power supply voltage of the synchronous blocks can also be adjusted individually to save power.

POWER SAVINGS IN GALS
There are several possibilities for power savings in a GALS design: 1) less need for power-consuming clock buffers, 2) locally optimized frequencies for synchronous blocks, and 3) locally optimized supply voltages for synchronous blocks.

The effect of not needing large clock buffers is discussed in [4], where an average power saving of 70% in the clock net is indicated. The power consumption in a synchronous digital design is mainly due to switching and is proportional to $f \cdot V_{dd}^2$, where $f$ is the clock frequency and $V_{dd}$ is the power supply voltage. Therefore, dividing a design into blocks and assigning locally optimized clock frequencies and supply voltages to the blocks has a potential for power reduction. In [5] and [6], power savings in the range of 25-50% are achieved for a number of design examples using dual supply voltages, compared to a single-voltage approach.

To illustrate the opportunities for power savings using locally optimized supply voltages and frequencies, a design example is chosen. Consider a large digital design that is divided into 8 locally synchronous blocks. The blocks have different critical path lengths. Some blocks handle data at a lower rate than the others and therefore have lower requirements on operating frequency. The blocks also differ in switched capacitance and activity factor. For each block, the critical path (tcrit), the lowest possible operating frequency (fmin) and the switched capacitance (αC) are shown in figure 1. The switching activity is included in the switched capacitance of each block.

tcrit [t.u]    fmin [f0]    αC [C0]
48             16           8
64             8            4
36             8            16
10             8            24
24             4            16
32             1            10
16             2            20
20             8            12

Figure 1. Design example.

If all blocks are run at one frequency, the whole design operates at the frequency 16 f0. The power consumption in a block that is idle while waiting for new data is likely to be low, due to the low switching activity in such a block. It is therefore assumed that blocks which can run at a lower frequency are paused after completing their tasks, which from an energy perspective is equivalent to using a lower clock frequency. The total power consumption then becomes:

$$P_{tot} = \sum_{i=1}^{8} f_{min,i} \, \alpha C_i \, V_{block,i}^2 \tag{1}$$

As a reference, a supply voltage of 3.0 V is chosen to match the longest critical path of 64 t.u at the frequency 16 f0. The propagation delay through a logic gate is proportional to $V_{dd}/(V_{dd}-V_t)^2$ [7]. The lowest possible supply voltage for each block is therefore derived from equation 2. A $V_t$ of 0.65 V for a standard 0.35 µm process is used in the calculations.

$$\frac{64 \times 3.0}{(3.0 - V_t)^2} = \frac{t_{crit} \times V_{block}}{(V_{block} - V_t)^2} \tag{2}$$

To show the advantages of multiple supply voltages, the relative power consumption using different numbers of supply voltages is calculated and displayed as the upper curve in figure 2. The supply voltages are in each case selected for the lowest possible power consumption.

To fully exploit the potential for power reduction by lowering the supply voltages, it is now assumed that each block has its own local clock generator running the block at the lowest possible frequency. The supply voltage for all blocks which can be run at a lower frequency than 16 f0 can now be reduced further. The new supply voltage can be calculated using equation 3.

$$\frac{64 \times 3.0}{(3.0 - V_t)^2} = \frac{t_{crit} \times V_{block}}{(V_{block} - V_t)^2} \times \frac{f_{min}}{16 \times f_0} \tag{3}$$

The lower curve in figure 2 shows the relative power consumption using different numbers of supply voltages when the supply voltages are reduced due to the local clock generators.

[Plot: relative power consumption (0 to 1) versus number of supply voltages (1 to 8)]

Figure 2. Relative power consumption.

For the calculations, no overhead was considered for the blocks. The overhead for the local clocks can be kept low if the local clocks do not need to be synchronized. The supply voltage scaling, on the other hand, always comes with a fairly large overhead. For transforming a supply voltage down to a lower voltage, a DC/DC converter is used. In [11] and [12], there are examples of highly efficient DC/DC converter solutions with an efficiency of about 90%.

Figure 3 shows the effect of non-ideal DC/DC converters. The four curves show the effect of converters with 100%, 90%, 80% and 70% efficiency. Other sources of overhead, such as local clock generation, are ignored in figure 3, since they can be made relatively small. For the chosen design example, substantial power savings can be made even if fairly inefficient DC/DC converters are used. From figures 2 and 3, it is clear that increasing the number of available supply voltages from one to two often results in large power savings, while using more than two voltages only leads to minor savings. For the designs in [5], an increase in the number of voltages from two to three results in less than 5% power savings.
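The per-block supply voltages and the resulting total power can be worked through numerically. The Python sketch below is illustrative only and is not the authors' tool. It assumes that a block's critical path, scaled by the delay factor V/(V - Vt)^2, must fit in its clock period, with the 64 t.u path at 3.0 V and 16 f0 as the reference. The block order, the function names and the constant names are choices made for this example; the block values are taken as listed in figure 1.

```python
import math

VT = 0.65      # threshold voltage [V], as stated in the text
V_REF = 3.0    # reference supply voltage [V]
T_REF = 64.0   # longest critical path [t.u], met at V_REF and F_REF
F_REF = 16.0   # global clock frequency, in units of f0

# (tcrit [t.u], fmin [f0], alpha*C [C0]) in the order listed in figure 1
BLOCKS = [(48, 16, 8), (64, 8, 4), (36, 8, 16), (10, 8, 24),
          (24, 4, 16), (32, 1, 10), (16, 2, 20), (20, 8, 12)]


def delay_factor(v):
    """Gate delay is taken as proportional to V / (V - VT)^2."""
    return v / (v - VT) ** 2


def min_voltage(t_crit, f_clk=F_REF):
    """Lowest supply at which a path of t_crit fits in one period of f_clk."""
    # Required: t_crit * g(V) = T_REF * g(V_REF) * F_REF / f_clk, i.e. g(V) = k
    k = T_REF * delay_factor(V_REF) * F_REF / (f_clk * t_crit)
    # g(V) = k  <=>  k*V^2 - (2*k*VT + 1)*V + k*VT^2 = 0; keep the root above VT
    return (2 * k * VT + 1 + math.sqrt(4 * k * VT + 1)) / (2 * k)


def total_power(voltages):
    """Equation (1): sum over blocks of fmin * alpha*C * V^2."""
    return sum(f * ac * v ** 2 for (_, f, ac), v in zip(BLOCKS, voltages))


p_ref = total_power([V_REF] * len(BLOCKS))             # one 3.0 V supply
v_global = [min_voltage(t) for t, _, _ in BLOCKS]       # global clock, equation (2)
v_local = [min_voltage(t, f) for t, f, _ in BLOCKS]     # local clocks, equation (3)

for (t, f, ac), v2, v3 in zip(BLOCKS, v_global, v_local):
    print(f"tcrit={t:2d} fmin={f:2d}: {v2:.2f} V (global clock), {v3:.2f} V (local clock)")
print(f"relative power, global clock: {total_power(v_global) / p_ref:.2f}")
print(f"relative power, local clocks: {total_power(v_local) / p_ref:.2f}")
```

This sketch gives every block its own supply voltage, which corresponds to the case with as many supply voltages as blocks. Near the threshold voltage the simple delay model is a rough approximation, so the lowest voltages should be read with care.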

[Plot: relative power consumption versus number of supply voltages, one curve per converter efficiency]

Figure 3. Relative power consumption with overhead.
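The trade-off between the number of supply voltages and the power consumption can be explored with a small exhaustive search. The Python sketch below is a hedged illustration, not the procedure used for figures 2 and 3, which the text does not describe. It assumes local clocks at fmin, the delay model V/(V - Vt)^2 with the 64 t.u path at 3.0 V and 16 f0 as reference, and that each block is connected to the lowest available rail that still meets its timing. The converter model (all rails fed through converters of one common efficiency) and all names are assumptions made for this example.

```python
import math
from itertools import combinations

VT, V_REF, T_REF, F_REF = 0.65, 3.0, 64.0, 16.0
# (tcrit [t.u], fmin [f0], alpha*C [C0]) in the order listed in figure 1
BLOCKS = [(48, 16, 8), (64, 8, 4), (36, 8, 16), (10, 8, 24),
          (24, 4, 16), (32, 1, 10), (16, 2, 20), (20, 8, 12)]


def min_voltage(t_crit, f_clk):
    """Lowest supply for a block clocked at f_clk (delay ~ V / (V - VT)^2)."""
    k = T_REF * V_REF / (V_REF - VT) ** 2 * F_REF / (f_clk * t_crit)
    return (2 * k * VT + 1 + math.sqrt(4 * k * VT + 1)) / (2 * k)


V_MIN = [min_voltage(t, f) for t, f, _ in BLOCKS]
P_REF = sum(f * ac * V_REF ** 2 for _, f, ac in BLOCKS)   # all blocks at 3.0 V


def relative_power(rails, efficiency=1.0):
    """Equation (1) with every block on the lowest rail that is fast enough."""
    total = 0.0
    for (_, f, ac), v in zip(BLOCKS, V_MIN):
        rail = min(r for r in rails if r >= v)
        total += f * ac * rail ** 2
    # Assumption: every rail is fed by a converter with the same efficiency.
    return total / efficiency / P_REF


def best_rails(n_rails, efficiency=1.0):
    """Exhaustively pick the n_rails supply voltages that minimise power."""
    candidates = sorted(set(V_MIN))          # an optimal rail is some block's minimum
    top = candidates[-1]                     # the most demanding block needs this rail
    n_rails = min(n_rails, len(candidates))  # blocks with equal minima share a rail
    options = (combo + (top,)
               for combo in combinations(candidates[:-1], n_rails - 1))
    return min(options, key=lambda rails: relative_power(rails, efficiency))


for n in range(1, 9):
    rails = best_rails(n)
    print(n, [round(r, 2) for r in rails],
          round(relative_power(rails), 3),
          round(relative_power(rails, efficiency=0.9), 3))
```

Restricting the candidate rails to the blocks' own minimum voltages loses nothing, since any rail can be lowered to the highest minimum among the blocks it feeds without violating timing. With only 8 blocks the exhaustive search is cheap; a larger design would need a smarter selection method.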

OVERHEAD COMPONENTS
The overhead components in a GALS design are the block-to-block asynchronous communication, local clock generators, DC/DC converters and signal level converters. In [4], the overhead from asynchronous communication is shown not to become significant in a GALS design unless the number of partitions becomes too large.

Local clock generators can be of varying complexity. They can be as simple as a divider of the global frequency, which causes very little circuit overhead. Traditionally, a phase-locked loop (PLL) is used for on-chip clock generation. The reason for using a PLL is primarily to get a set of well-synchronized clock sources. However, in a GALS design, clock synchronization is of less importance. Robust and easily implemented fully digital clock multipliers without phase locking are proposed in [9] and [10]. These clock multipliers produce a fixed number of cycles for each period of an external reference clock signal, followed by an idle margin.

The most frequently used DC/DC converter for power savings by lowering the supply voltage is the Buck converter [14]. The Buck converter switches the main supply voltage down to a lower voltage using large transistor switches (see figure 4). A control circuit generates a square wave with variable duty factor, which controls the switches. The switched supply is filtered using a second-order low-pass filter to minimize the ripple at Vout. Besides the power consumption in the control circuits of the Buck converter, there is always power loss in the switches due to the on-resistance of the transistors. In [11] and [12], there are examples of highly efficient DC/DC converter solutions with an efficiency of about 90%.
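The relation between duty factor, output voltage and switch losses in the Buck converter can be sketched with a first-order model. The Python example below is an assumption-laden illustration and is not taken from [11], [12] or [14]. It assumes an ideal converter in continuous conduction (Vout = D x Vin), a high-side and a low-side switch that conduct the load current for the fractions D and 1 - D of each period, negligible inductor ripple, and a constant control-circuit power. All function names, parameter names and component values are hypothetical.

```python
def buck_duty_cycle(v_in, v_out):
    """Ideal duty factor of the control square wave, from Vout = D * Vin."""
    if not 0 < v_out <= v_in:
        raise ValueError("a Buck converter can only step the voltage down")
    return v_out / v_in


def buck_efficiency(v_in, v_out, i_out, r_on_high, r_on_low, p_control):
    """First-order efficiency: conduction loss in the switches plus control power."""
    d = buck_duty_cycle(v_in, v_out)
    p_out = v_out * i_out
    # The high-side switch conducts for a fraction d of the period,
    # the low-side switch for the remaining 1 - d (ripple neglected).
    p_conduction = i_out ** 2 * (d * r_on_high + (1 - d) * r_on_low)
    return p_out / (p_out + p_conduction + p_control)


# Hypothetical operating point: 3.0 V stepped down to 2.0 V at 100 mA,
# 0.5 ohm on-resistance per switch and 1 mW for the control circuit.
eta = buck_efficiency(3.0, 2.0, 0.1, 0.5, 0.5, 1e-3)
print(f"duty factor {buck_duty_cycle(3.0, 2.0):.2f}, efficiency {eta:.1%}")
```

The model shows why large switches are used: a larger transistor has a lower on-resistance and therefore a lower conduction loss. Switching losses and filter losses are left out here and would lower the efficiency further.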

[Schematic: Buck converter with nodes VDD, Vs and Vout]

Figure 4. The Buck converter.

In a system with more than one supply voltage, signal level converters are needed to ensure correct functionality when going from a low-voltage region to a higher-voltage region. For this purpose, the level converter shown in figure 5 is suitable. This signal level converter is used in [6] and [8]. Furthermore, if each output bit is fed through an inverter supplied by a lower voltage, all buses and signals are run at the lower voltage. This leads to further power savings [13].

[Schematic: signal level converter with nodes Vhigh, Vlow, In and Out]

Figure 5. Signal level converter.

CONCLUSIONS
In a GALS design, the clock skew constraints are reduced since the clocking is no longer a global issue. GALS is therefore a promising technique for simplifying the design of large high-performance system-on-a-chip designs. The lack of global synchronization makes power savings in the clock net possible. Moreover, the inherent block structure, with more or less independent blocks, opens up great possibilities for substantial power reductions using local supply voltages together with on-chip clock generators.

REFERENCES
[1] D. M. Chapiro, "Globally-Asynchronous Locally-Synchronous Systems", Ph.D. Dissertation, Computer Science Department, Stanford University, Stanford, CA, October 1984.
[2] K. Y. Yun, R. P. Donohue, "Pausible clocking: A first Step towards Heterogeneous Systems", Proceedings of ICCD'96, pp. 118-123.
[3] J. Muttersbach, et al., "Globally Asynchronous Locally Synchronous Architectures to Simplify the Design of On-Chip Systems", Proceedings of the 12th Annual IEEE International ASIC/SOC Conference, 1999, pp. 317-321.
[4] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Öberg, P. Ellervee and D. Lindqvist, "Lowering power consumption in clock by using globally asynchronous locally synchronous design style", Proceedings of DAC'99, pp. 873-878.
[5] M. Johnson, K. Roy, "Optimal selection of supply voltages and level conversions during data path scheduling under resource constraints", Proceedings of ICCD'96, pp. 72-77.
[6] V. Sundararajan, K. Parhi, "Synthesis of Low Power CMOS VLSI Circuits using Dual Supply Voltages", Proceedings of DAC'99, pp. 72-75.
[7] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice Hall, 1996.
[8] M. Johnson, K. Roy, "Scheduling and optimal voltage selection for low power multi-voltage DSP datapaths", Proceedings of ISCAS'97, pp. 2152-2155.
[9] P. Nilsson and M. Torkelson, "A Monolithic Digital Clock-Generator for On-Chip Clocking of Custom DSP's", IEEE Journal of Solid-State Circuits, vol. 31, no. 5, pp. 700-706, May 1996.
[10] T. Olsson, T. Meincke, A. Hemani and M. Torkelson, "A Digitally Controlled Low-Power Clock Multiplier for Globally Asynchronous Locally Synchronous Designs", Proceedings of ISCAS'2000, Geneva, May 2000.
[11] G-Y. Wei and M. Horowitz, "A Fully Digital, Energy-Efficient, Adaptive Power-Supply Regulator", IEEE Journal of Solid-State Circuits, pp. 520-528, April 1999.
[12] T. Kuroda et al., "Variable Supply-Voltage Scheme for Low-Power High-Speed CMOS Digital Design", IEEE Journal of Solid-State Circuits, pp. 454-462, March 1998.
[13] Y. Nakagome et al., "Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSI's", IEEE Journal of Solid-State Circuits, pp. 414-419, April 1993.
[14] A. P. Chandrakasan and R. W. Brodersen, "Minimizing Power Consumption in Digital CMOS Circuits", Proceedings of the IEEE, vol. 83, April 1995, pp. 498-523.