Power Optimization of Sum-of-Products Design for Signal Processing Applications

Seok Won Heo

Suk Joong Huh

Miloš D. Ercegovac

Computer Science Department University of California at Los Angeles CA, USA 90095 [email protected]

Samsung Electronics Suwon, Korea [email protected]

Computer Science Department University of California at Los Angeles CA, USA 90095 [email protected]

Abstract—Power consumption is a critical concern in today's mobile environment, while high throughput remains a major design goal. To satisfy both low-power and high-throughput requirements, parallelism has been employed. In this paper we present an approach to reducing power dissipation in the design of the sum-of-products operation by utilizing parallel hardware while maintaining high throughput. In benchmark programs, the proposed design reduces execution time by about 46% with an energy penalty of about 12% compared to the ARM7TDMI-S multipliers.

Keywords—Low-power Arithmetic; High-throughput Arithmetic; Sum-of-products.

I. INTRODUCTION

There is a fundamental technological shift taking place in the electronics industry. It is moving from the wired era driven by the Personal Computer (PC) to the wireless era driven by mobile devices. With the increasing complexity of mobile VLSI systems and a growing number of signal processing applications, minimizing the power consumption of signal processing applications has become of great importance in today's mobile system design, while performance and area remain the other two major design goals.

Multiplication and related arithmetic operations are executed frequently in conventional digital signal processing applications. However, such applications may take many clock cycles on a conventional multiplier, even a high-performance parallel one. This is a critical problem for state-of-the-art signal processing applications, which require intensive numerical calculations. Moreover, studies on power dissipation in Digital Signal Processors (DSPs) and Graphics Processing Units (GPUs) indicate that the multiplier is one of the most power-demanding components on these chips [1]. Therefore, research on new arithmetic models is needed to satisfy low-power and high-throughput requirements in mobile systems.

The total power consumed by a CMOS circuit is composed of two sources: dynamic power and static power [2]. Dynamic power dissipation is the dominant factor in the total power consumption of a CMOS circuit and typically contributes over 60% of the total system power dissipation. Although the effect of static power dissipation increases significantly as VLSI manufacturing technology shrinks, the dynamic power dissipation remains dominant [3]. It can be described by

Pdynamic = 0.5 × CL × VDD² × fp × N    (1)

where CL is the load capacitance, VDD is the power supply voltage, fp is the clock frequency, and N is the switching activity. The equation indicates that the power supply voltage has the largest impact on the dynamic power dissipation because of its squared term. Unfortunately, reducing the power supply voltage causes performance degradation. A great deal of effort has been expended in recent years on techniques that exploit a low power supply voltage while minimizing the throughput degradation. Parallel architectures mitigate such throughput degradation [4]. This paper proposes a new arithmetic architecture for signal processing applications and develops a scheme to achieve power savings in the sum-of-products operation by utilizing parallel architectures.

This paper is organized as follows. Section II addresses the problem of conventional arithmetic architectures. Section III presents an in-depth view of recent research in the design of parallel multipliers and describes the proposed sum-of-products architecture. In Section IV, the paper provides power and throughput estimates for the sum-of-products design and compares them to the estimates for conventional ARM multipliers and the proposed multipliers. Section V discusses current problems. Finally, a summary is given in Section VI. The designs presented in this paper assume 32-bit integer operands, but they can easily be extended to other fixed-point operand types.

II. PROBLEM

Sum-of-products operations are found in many digital signal processing and multimedia applications, including FIR filters, high-pass filters, and inner products. This computation is a summation of two products. It can be described by

S = a × b + x × y    (2)

A variation of the sum-of-products is the inner-product. The inner-product is usually computed by repeatedly using a sum-of-products:


S[i + 1] = a[i] × b[i] + x[i] × y[i] + S[i],    (3)

where S[0] = 1.
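For illustration, the loop below is our own minimal C sketch of one way to read Eq. (3): an inner product of two vectors evaluated with one accumulated sum-of-products per iteration (two products plus the running sum). The function name, test data, and the zero-initialized accumulator are our assumptions, not part of the paper.

#include <stdio.h>

/* Minimal sketch (ours, not from the paper): an inner product of two
 * even-length vectors u and v, evaluated with one accumulated
 * sum-of-products per iteration, mirroring Eq. (3) with
 * a[i] = u[2i], b[i] = v[2i], x[i] = u[2i+1], y[i] = v[2i+1]. */
static int inner_product_sop(const int *u, const int *v, int n)
{
    int s = 0;                       /* running sum S[i]; initialized to zero here */
    for (int i = 0; i < n / 2; i++)  /* one sum-of-products step per iteration */
        s = u[2 * i] * v[2 * i] + u[2 * i + 1] * v[2 * i + 1] + s;
    return s;
}

int main(void)
{
    int u[4] = {1, 2, 3, 4};
    int v[4] = {5, 6, 7, 8};
    printf("%d\n", inner_product_sop(u, v, 4));   /* 1*5 + 2*6 + 3*7 + 4*8 = 70 */
    return 0;
}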


Previous research has mainly focused on designs for dedicated multipliers, demonstrating that parallel multipliers can be implemented with clustering/partitioning [5], pipelining [6], bypassing [7], and signal gating [8] techniques for reduced power dissipation. An improved modified Booth encoding with a multiple-level conditional sum adder [9] and a sign-select Booth encoder [10] have been proposed for high performance. However, recent studies show that conventional arithmetic designs cannot efficiently support increasing high-throughput and low-power requirements. The sum-of-products architecture offers an opportunity to satisfy these requirements.

Due to the frequent use of multiplication and related arithmetic calculations in digital signal processing applications, many processors provide multiply and/or multiply-accumulate instructions. In order to execute sum-of-products operations, processors use an existing multiplier or a multiply-accumulate (MAC) unit. Conventional processors take extra cycles when using multipliers and MAC units to perform sum-of-products. Clearly, by including a sum-of-products operation one expects that fewer cycles are needed. We want to show that the energy-delay product is also reduced. Consider a typical FIR filter:

y[n] = Σ_{k=−∞}^{∞} c[k] × x[n − k]    (4)

This equation can be implemented in a high-level language, such as C, as follows:

y[n] = 0
for (k = 0; k < N; k++) {
    y[n] = y[n] + c[k] × x[n − k]
}    (5)

The last line corresponds to a multiply-accumulate operation: x = x + y × z. It can be translated into a single multiply-accumulate instruction. The FIR filter can also be implemented in C in another way, as:

y[n] = 0
for (k = 0; k < N; k += 2) {
    y[n] = y[n] + c[k] × x[n − k] + c[k + 1] × x[n − k − 1]
}    (6)

The last line corresponds to an accumulated sum-of-products: x = x + y0 × z0 + y1 × z1. It can be translated into a single instruction using the sum-of-products design. In the best-case scenario, the loop requires only half the number of cycles using sum-of-products hardware compared to using a single multiplier.
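To make the comparison concrete, here is a self-contained C sketch (ours, not from the paper) of the two loop forms above for one output sample; the filter length N, the helper names, and the test data are illustrative assumptions, and the index n is assumed large enough that every x[n − k] access is valid.

#include <stdio.h>

#define N 4                                   /* illustrative filter length (even) */

/* Eq. (5): one multiply-accumulate per iteration. */
static int fir_mac(const int *c, const int *x, int n)
{
    int y = 0;
    for (int k = 0; k < N; k++)
        y = y + c[k] * x[n - k];
    return y;
}

/* Eq. (6): one accumulated sum-of-products per iteration, halving the loop count. */
static int fir_sop(const int *c, const int *x, int n)
{
    int y = 0;
    for (int k = 0; k < N; k += 2)
        y = y + c[k] * x[n - k] + c[k + 1] * x[n - k - 1];
    return y;
}

int main(void)
{
    int c[N] = {1, 2, 3, 4};
    int x[8] = {1, 1, 2, 3, 5, 8, 13, 21};
    /* Both forms compute the same sample; prints "91 91". */
    printf("%d %d\n", fir_mac(c, x, 7), fir_sop(c, x, 7));
    return 0;
}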

III. SUM-OF-PRODUCTS DESIGN

A. Baseline Architecture

The sum-of-products baseline model needs two multipliers and one adder. One way to design the sum-of-products is to use two Partial Product Reduction (PPR) arrays and [4:2] adders followed by a single final Carry Propagate Adder (CPA). The other way is to use two PPR arrays and two CPAs followed by a single CPA. The structure using a [4:2] adder followed by a single CPA is the better solution because it has one less carry-propagate addition, and thus its power and delay are slightly better than those of its counterpart. The inner-product unit can be designed based on the sum-of-products model. It consists of two PPR arrays, [6:2] adders and latches for accumulation, and a single CPA. The [6:2] adders accumulate four inputs with the previous partial sums and carries. Figure 1 shows the baseline models.

B. Multiplier

Multipliers consume more power and have a longer latency than adders, and thus this paper mainly describes multiplier designs. Previous studies demonstrate that array multipliers that integrate array-splitting and left-to-right techniques are better than tree multipliers in terms of power while keeping similar delay and area for operands up to 32 bits [11][12]. Therefore, in this paper we focus on developing the sum-of-products design based on left-to-right split array multipliers.

1) Left-to-Right Array Multiplier: In conventional right-to-left array multipliers, the Partial Products (PPs) are added sequentially starting from the rightmost multiplier bit. In contrast, in left-to-right array multipliers, the PPs are added in series starting from the leftmost multiplier bit [13]. Of the two designs, left-to-right array multipliers have the potential of saving power and delay, because the carry signals propagate through fewer stages, which reduces the power consumption in the Most Significant (MS) region. Left-to-right array multipliers are also superior for data with a large range, because PPs corresponding to sign bits with low switching activities are located in the upper region of the array [14][15].

Fig. 1. The baseline models: (a) Sum-of-products design (b) Inner-products design


2) Split Array Multiplier: In array multipliers, the lower rows consume much more power than the upper rows of the PPR array, because glitches cause a snowballing effect as signals propagate through the array [16]. Therefore, if the length of the array could be reduced, there would be significant power savings. The way to reduce the array is to split it into several parts. Previous architectures are the two-level even/odd [17] and upper/lower [14] split array multipliers. Each part has only half the number of rows and is added separately in parallel. The final even/odd (upper/lower) vectors from the two parts can be reduced to two vectors using [4:2] adders. The upper/lower split array architecture is shown in Figure 2.

Previous studies have mainly focused on developing two-level split array designs. However, it would be more power- and delay-efficient if each part were split further. The upper/lower structure is better than the even/odd structure if four-level splitting is used, because it allows simpler interconnection. The physical regularity of array multipliers can be maintained by interleaved placement and routing if we apply the upper/lower structure. Moreover, the two-level upper/lower split array multiplier consumes less power than its two-level even/odd counterpart. Therefore, in this paper we utilize the four-level upper/lower split array multiplier.

3) Carry Save Adder: A [4:2] adder has been widely used in parallel multipliers. As technology scales into the deep sub-micron regime, the importance of simple wire interconnections increases. Compared to two cascaded [3:2] adders, a [4:2] adder has a regular structure with simple interconnection, and thus it reduces the physical complexity. Moreover, a [4:2] adder has the same gate complexity as two [3:2] adders. However, a [4:2] adder is faster than two cascaded [3:2] adders because it has a 3 × TXOR2 delay while each single [3:2] adder has a 2 × TXOR2 delay. Thus, by using [4:2] adders, the PPR delay is reduced by about 25% without area penalties. The delay reduction also helps power savings, as less switching activity is generated when signals propagate through fewer stages. In this paper, we utilize [4:2] adders for low-power design.
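As a functional illustration (ours, not a gate-level netlist from the paper), the bit-slice below models a [4:2] adder as two cascaded [3:2] full adders: four input bits plus an incoming carry are compressed into a sum bit and two weight-2 bits, and the outgoing carry depends only on the first three inputs, so it never ripples through the incoming carry. The code models only the arithmetic, not the optimized 3 × TXOR2 gate structure.

#include <stdio.h>

/* Minimal behavioral sketch (ours): one bit-slice of a [4:2] adder built from
 * two cascaded [3:2] full adders. */
static void full_adder(int a, int b, int c, int *sum, int *carry)
{
    *sum = a ^ b ^ c;                          /* sum bit */
    *carry = (a & b) | (a & c) | (b & c);      /* majority */
}

static void compressor_4to2(int a, int b, int c, int d, int cin,
                            int *sum, int *carry, int *cout)
{
    int s1;
    full_adder(a, b, c, &s1, cout);            /* first [3:2] stage; cout is cin-independent */
    full_adder(s1, d, cin, sum, carry);        /* second [3:2] stage */
}

int main(void)
{
    /* Exhaustive check: a+b+c+d+cin == sum + 2*(carry + cout) for all 32 input patterns. */
    for (int v = 0; v < 32; v++) {
        int a = v & 1, b = (v >> 1) & 1, c = (v >> 2) & 1,
            d = (v >> 3) & 1, cin = (v >> 4) & 1;
        int sum, carry, cout;
        compressor_4to2(a, b, c, d, cin, &sum, &carry, &cout);
        if (a + b + c + d + cin != sum + 2 * (carry + cout))
            printf("mismatch at %d\n", v);
    }
    printf("check done\n");
    return 0;
}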

Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Fig. 2.

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

Upper/lower split array architecture [13].

Ǹ Ǹ Ǹ Ǹ Ǹ

Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ Ǹ

can be written as

Execution time for a program
  = Clock cycles for a program × Clock cycle time
  = Instructions for a program × Clock cycles per instruction × Clock cycle time    (7)

The instruction count for a program depends on the compiler and the Instruction Set Architecture (ISA), and the Clock cycles Per Instruction (CPI) depends on the ISA and the microarchitecture [18]. Therefore, we restrict ourselves to a specific compiler, ISA, and microarchitecture for accurate results. A good example is the ARM architecture. The ARM instruction set differs from the pure RISC definition in several ways that make it suitable for low-power embedded applications, and hence ARM cores are used to perform real-time digital signal processing in most embedded systems. In particular, digital signal processing programs are typically multiplication intensive, and the performance of the multiplication hardware is critical to meeting real-time constraints.

All ARM processors include hardware support for integer multiplication and have used two styles of multiplier [19]. Several ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate instructions. This multiplier uses the main datapath iteratively, employing the barrel shifter and Arithmetic Logic Unit (ALU) to generate a 2-bit PP in each clock cycle. The other style is found in ARM cores with an M in their name (for example, the ARM7DM) and in recent higher-performance cores; it is a high-performance multiplier that supports the 64-bit result multiply and multiply-accumulate instructions. This multiplier employs a modified Booth's algorithm to produce the 2-bit PPs. The carry-save array has four layers of adders, each handling two multiplier bits, so the array can handle 8 multiplier bits per clock cycle. The array is cycled up to four times, and the partial sum and carry are combined 32 bits at a time and written back into the register. As multiplication performance is very important, more hardware resources must be dedicated to it. The best choice for our purposes is the ARM7TDMI-S processor. The ARM7TDMI-S includes an enhanced 32 × 8 single multiplier with a radix-4 modified Booth's algorithm, and it is the synthesizable version of the ARM7TDMI core. Therefore, when measuring the cycle counts for an application executed on the ARM multiplier with a cycle-level simulator, the synthesizable core provides an efficient solution.

ARM and Thumb are two different instruction sets supported by ARM cores with a T in their name. ARM instructions are 32 bits wide, and Thumb instructions are 16 bits wide. Thumb mode allows code to be smaller and can potentially be faster if the target has slow memory. However, multiply-accumulate operations are not available in Thumb mode. Therefore, we use ARM mode in this experiment.
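For reference, the snippet below is our own behavioral C sketch of radix-4 modified Booth recoding, the scheme the ARM M-class multipliers described above use to produce one partial product per pair of multiplier bits; the function name and test values are illustrative, and the code models only the recoding arithmetic, not the carry-save array.

#include <stdio.h>
#include <stdint.h>

/* Behavioral sketch (ours): radix-4 modified Booth recoding of a 32-bit
 * multiplier into 16 signed digits in {-2,-1,0,+1,+2}; each digit selects
 * one partial product of weight 4^i. */
static int64_t booth_radix4_multiply(int32_t a, int32_t y)
{
    int64_t product = 0;
    int prev = 0;                                   /* y[-1] = 0 */
    for (int i = 0; i < 16; i++) {
        int y0 = (y >> (2 * i)) & 1;
        int y1 = (y >> (2 * i + 1)) & 1;
        int digit = y0 + prev - 2 * y1;             /* Booth digit from bits 2i+1, 2i, 2i-1 */
        product += (int64_t)digit * a * ((int64_t)1 << (2 * i));
        prev = y1;
    }
    return product;
}

int main(void)
{
    int32_t a = -12345, y = 6789;
    /* Prints the same value twice: -83810205. */
    printf("%lld %lld\n", (long long)booth_radix4_multiply(a, y),
           (long long)a * (long long)y);
    return 0;
}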


The ARM7TDMI-S core does not have sum-of-products hardware, but it includes an enhanced single multiplier, and thus it cannot use the sum-of-products instruction directly. The ARM compiler avoids generating sum-of-products instructions, and hence we cannot directly measure the total clock cycles with sum-of-products using cycle-level simulation of compiled assembly code. This means that for every sum-of-products instruction the code must be regenerated manually after analyzing the original ARM assembly code. Suppose we have a modified implementation of the ARM7TDMI-S ISA. We replace two consecutive multiplication operations with one sum-of-products operation. The sum-of-products instruction executes two multiplications simultaneously, and the two products are then combined into the final result using a CPA. An ARM7 multiplication finishes in up to 4 clock cycles, and thus a sum-of-products takes up to 5 clock cycles due to the single-cycle final addition. To regenerate the modified ARM assembly code, we use the ARM technical reference manual after compiling the original C code [20]. The reference manual lists all instructions and their cycle counts.

We can measure the clock cycles of the ARM multiplier for benchmark programs by running cycle-level simulation with compiled ARM assembly code. A hardware/software co-simulation tool such as Mentor Graphics Questa Codelink profiles clock cycles for programs. The clock cycle estimates are compared in Table I. Based on this analysis, we expect the clock cycles of sum-of-products to be 42% and 48% less than those of multiplication for the FIR filter and high-pass filter programs, respectively.

The clock cycle time is usually published as part of the specification document. However, as the ARM7TDMI-S is a synthesizable core, we can directly measure the power and latency of the ARM multiplier using Synopsys Design Compiler with the ARM7TDMI-S HDL code, and estimate those of the sum-of-products hardware. We assume the sum-of-products hardware consists of two identical ARM7TDMI-S multipliers and an ALU. Table II shows the power, delay, and area of a multiplier and of the sum-of-products hardware. The amount of energy used depends on the power and the time for which it is used, and can be written as

Energy (Joules) = Power (Watts) × Time (Seconds)    (8)

The execution time for benchmark programs can be calculated using equation (7) with the measured clock cycles for a program and the clock rate. Table III summarizes the energy and execution time. The sum-of-products unit dissipates between 23% and 24% more energy than a single multiplier while achieving a 40% to 41% decrease in execution time for the FIR filter program, and between 11% and 12% more energy with a 45% to 46% decrease in execution time for the high-pass filter. A designer often faces a trade-off between execution time and energy, so a suitable metric for energy efficiency is needed. The energy-delay product is widely used when reporting a new architecture design that addresses energy-performance effectiveness [21]. In the considered benchmarks, the sum-of-products units are better than the multipliers in terms of energy-delay product.

Shorter execution time of the sum-of-products can also reduce the energy demanded by the design. If we reduce the supply voltage, our design can save significant energy. The reason is that the clock cycles per program with sum-of-products are reduced by about half compared to those with the multiplier, while reducing the supply voltage increases the clock cycle time only slightly. For example, if we replace the ARM multiplier at 1.32 V with the sum-of-products at 1.08 V for the high-pass filter, execution time decreases by about 22% and energy by about 10%. For the FIR filter, the sum-of-products has 14% less execution time while keeping the same energy. The multiplier and the sum-of-products are characterized by execution time ratio vs. energy ratio in Figure 3. The energy ratio decreases as the execution time ratio increases; the sum-of-products unit consumes more energy as the difference in execution time between a sum-of-products and a multiplier grows. The energy ratio is expected to be less than 1 if the execution times are the same, which means the sum-of-products unit consumes less power than a multiplier-only solution when the execution time is the same.
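As a sanity check on equations (7) and (8), the short program below recomputes the FIR-filter ratios of Table III at 1.32 V from the cycle counts in Table I and the per-cycle power and delay in Table II; the variable names are ours.

#include <stdio.h>

int main(void)
{
    /* Table I: clock cycles for the FIR filter (length = 100). */
    const double cycles_mul = 1415.0, cycles_sop = 817.0;
    /* Table II at 1.32 V: power (uW) and clock cycle time (ns). */
    const double p_mul = 1678.0, t_mul = 0.99;
    const double p_sop = 3461.0, t_sop = 1.02;

    /* Eq. (7): execution time = clock cycles x clock cycle time. */
    double exec_mul = cycles_mul * t_mul;     /* 1400.85 ns */
    double exec_sop = cycles_sop * t_sop;     /*  833.34 ns */

    /* Eq. (8): energy = power x time (relative units suffice for ratios). */
    double e_mul = p_mul * exec_mul;
    double e_sop = p_sop * exec_sop;

    /* Prints roughly 0.59, 1.23 and 0.73, matching the Table III ratios. */
    printf("time ratio %.2f, energy ratio %.2f, EDP ratio %.2f\n",
           exec_sop / exec_mul, e_sop / e_mul,
           (e_sop * exec_sop) / (e_mul * exec_mul));
    return 0;
}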

TABLE I
CLOCK CYCLES FOR BENCHMARK PROGRAMS.

Clock Cycles       FIR Filter (length = 100)   High Pass Filter (length = 100)
Multiplication     1415 (1.00)                 1617 (1.00)
Sum-of-products    817 (0.58)                  845 (0.52)

TABLE II
THE POWER, DELAY AND AREA OF THE ARM7TDMI-S MULTIPLIER AND A SUM-OF-PRODUCTS HARDWARE.

Supply Voltage   Hardware   Power (µW)   Delay (ns)   Area (NAND2)
1.32 V           MUL*       1678         0.99         1384
                 SOP†       3461         1.02         2941
1.2 V            MUL*       1250         1.15         1316
                 SOP†       2578         1.19         2788
1.08 V           MUL*       940          1.42         1364
                 SOP†       1940         1.48         2896

* multiplier: measured value, † sum-of-products: estimated value

B. The Design Characteristics of the Proposed Sum-of-Products Units

We implemented the proposed sum-of-products unit in Verilog using a top-down methodology. The designs were verified using Cadence NC-Verilog and synthesized using Synopsys Design Compiler with a Samsung 65-nanometer CMOS standard-cell low-power library. The proposed designs were synthesized at the three supply voltages supported by the technology: 1.08 V, 1.20 V, and 1.32 V. To reduce the effects of changes made by the synthesis tool to the structure of the original Verilog code, we used the same technology and the same Synopsys Design Compiler constraints in all designs. Placement and routing were performed with Synopsys Astro to obtain more precise results. Delays were obtained from Synopsys PrimeTime, and power estimates were obtained from the Samsung in-house power estimation tool, CubicWare. Table IV shows power, delay, and area estimates for the sum-of-products design. The synthesis results indicate that the multipliers account for most of the power, delay, and area of the sum-of-products design.


TABLE III
THE EXECUTION TIME, ENERGY, AND ENERGY-DELAY PRODUCT OF THE ARM7TDMI-S MULTIPLIER AND A SUM-OF-PRODUCTS HARDWARE FOR BENCHMARK PROGRAMS.

Benchmark Programs               Supply Voltage   Hardware                  Execution Time (ns)   Energy (µJ)   Energy-Delay Product
FIR Filter (length = 100)        1.32 V           ARM7TDMI-S Multiplier*    1400.85 (1.00)        2.35 (1.00)   3292.00 (1.00)
                                                  Sum-of-products†          833.34 (0.59)         2.88 (1.23)   2400.02 (0.73)
                                 1.2 V            ARM7TDMI-S Multiplier*    1627.25 (1.00)        2.03 (1.00)   3303.32 (1.00)
                                                  Sum-of-products†          972.23 (0.60)         2.50 (1.23)   2430.58 (0.74)
                                 1.08 V           ARM7TDMI-S Multiplier*    2009.30 (1.00)        1.89 (1.00)   3797.58 (1.00)
                                                  Sum-of-products†          1209.16 (0.60)        2.35 (1.24)   2841.53 (0.75)
High Pass Filter (length = 100)  1.32 V           ARM7TDMI-S Multiplier*    1600.83 (1.00)        2.69 (1.00)   4306.23 (1.00)
                                                  Sum-of-products†          861.90 (0.54)         2.98 (1.11)   2568.46 (0.60)
                                 1.2 V            ARM7TDMI-S Multiplier*    1859.55 (1.00)        2.32 (1.00)   4314.16 (1.00)
                                                  Sum-of-products†          1005.55 (0.54)        2.59 (1.12)   2604.37 (0.60)
                                 1.08 V           ARM7TDMI-S Multiplier*    2296.14 (1.00)        2.16 (1.00)   4959.66 (1.00)
                                                  Sum-of-products†          1250.60 (0.55)        2.43 (1.12)   3038.96 (0.61)

*: measured value, †: estimated value

TABLE IV
POWER, DELAY, AND AREA FOR SUM-OF-PRODUCTS.

Hardware                Power (µW)               Delay (ns)     Area (NAND2)
Sum-of-products         3799.45 (1.00)           13.80 (1.00)   11870 (1.00)
LR_4ULS_42 Array* × 2   1469.90 × 2 (0.39 × 2)   12.68 (0.92)   5295 × 2 (0.45 × 2)
[4:2] adder, CPA        859.65 (0.22)            1.12 (0.08)    1280 (0.10)

* LR_4ULS_42 Array: 4-Level Upper/Lower Split Left-to-Right Array using [4:2] adder, with supply voltage 1.08 V

TABLE V
THE POWER, DELAY AND AREA OF THE PROPOSED MULTIPLIER AND SUM-OF-PRODUCTS HARDWARE.

Supply Voltage   Hardware       Power (µW)   Delay (ns)   Area (NAND2)
1.32 V           LR_4ULS_42*    2691         9.84         5864
                 SOP†           5825         10.74        12226
1.2 V            LR_4ULS_42*    2246         11.48        5736
                 SOP†           4844         11.85        12038
1.08 V           LR_4ULS_42*    1856         13.02        5722
                 SOP†           3799         13.80        11870

* LR_4ULS_42: 4-Level Upper/Lower Split Left-to-Right Array Multiplier using [4:2] adder = LR_4ULS_42 Array + CPA, † sum-of-products

In a second experiment, to compare cycle-level results, we use the proposed multipliers and sum-of-products units. For an accurate comparison, we assume the clock cycles for the benchmark programs are the same as those in the ARM7 test environment. Table V shows the power, delay, and area of the proposed multiplier and sum-of-products hardware, and Table VI summarizes the energy and execution time. The sum-of-products unit dissipates between 25% and 36% more energy than a single multiplier while achieving a 37% to 40% decrease in execution time and a 15% to 23% decrease in energy-delay product for the FIR filter program. The sum-of-products also consumes between 13% and 23% more energy with a 43% to 46% decrease in execution time and a 30% to 37% decrease in energy-delay product for the high-pass filter. The sum-of-products is better than the multiplier-only solution in terms of energy-delay product.

V. DISCUSSION

A. Static Power Dissipation

As mentioned earlier, the power of a circuit consists of static and dynamic power dissipation components. Static power is mainly determined by the silicon process technology and the total number of transistors. Unfortunately, the sum-of-products design will consume more static power than a multiplier-only design due to its larger area. One obvious technique to reduce static power is to reduce the supply voltages used in the circuit [22]. However, it is difficult to find opportunities to reduce the supply voltage, since static power dissipation decreases with the scaling of the supply voltage while delay only increases linearly. It is possible to use a high supply voltage in the critical paths of a design to achieve the required performance while the off-critical paths of the design use a lower supply voltage to achieve low static power dissipation. By partitioning the circuit into several domains operating at different supply voltages, static power savings are possible. However, level shifter circuits are required for inter-domain communication, and this comes at the cost of added circuitry.

The other approach is to use multiple threshold voltages. This technology provides transistors with multiple threshold voltages in order to optimize delay or power. In modern process technologies, multiple threshold voltages are provided for each transistor, giving the designer transistors that are either fast with high static power or slow with low static power. Therefore, a circuit can be partitioned into high- and low-threshold-voltage gates, trading off high performance against reduced static power dissipation. However, a limitation of this technique is that CAD tools need to be developed and integrated into the design flow to optimize the partitioning process.

VI. SUMMARY

In this paper, we have proposed a new sum-of-products arithmetic architecture and have discussed ways of achieving power savings with the sum-of-products. The proposed unit achieves significant power or delay savings and is comparable in power and latency to other current multiplier designs. Compared to a sum-of-products implementation using the ARM7TDMI-S multiplier, the proposed sum-of-products unit can reduce the execution time by approximately 45% with a 15% energy penalty in the benchmark applications.


TABLE VI
THE EXECUTION TIME, ENERGY, AND ENERGY-DELAY PRODUCT OF THE PROPOSED MULTIPLIER AND SUM-OF-PRODUCTS HARDWARE FOR BENCHMARK PROGRAMS.

Benchmark Programs               Supply Voltage   Hardware           Execution Time (µs)   Energy (µJ)    Energy-Delay Product
FIR Filter (length = 100)        1.32 V           LR_4ULS_42         13.92 (1.00)          37.48 (1.00)   521.85 (1.00)
                                                  Sum-of-products    8.77 (0.63)           51.11 (1.36)   448.50 (0.85)
                                 1.2 V            LR_4ULS_42         16.24 (1.00)          36.49 (1.00)   592.72 (1.00)
                                                  Sum-of-products    9.68 (0.60)           46.90 (1.29)   454.07 (0.77)
                                 1.08 V           LR_4ULS_42         18.42 (1.00)          34.20 (1.00)   630.10 (1.00)
                                                  Sum-of-products    11.27 (0.61)          42.84 (1.25)   482.97 (0.77)
High Pass Filter (length = 100)  1.32 V           LR_4ULS_42         15.91 (1.00)          42.84 (1.00)   681.48 (1.00)
                                                  Sum-of-products    9.08 (0.57)           52.86 (1.23)   479.76 (0.70)
                                 1.2 V            LR_4ULS_42         18.65 (1.00)          41.70 (1.00)   774.03 (1.00)
                                                  Sum-of-products    10.01 (0.54)          48.51 (1.16)   485.72 (0.63)
                                 1.08 V           LR_4ULS_42         21.05 (1.00)          39.08 (1.00)   822.83 (1.00)
                                                  Sum-of-products    11.66 (0.55)          44.31 (1.13)   516.65 (0.63)

Fig. 3. Comparison of energy ratio vs. execution time ratio of benchmarks: (a) ARM7TDMI-S multiplier and sum-of-products (b) LR_4ULS_42 and sum-of-products. Both panels include a trend line; when the execution time is the same, sum-of-products units consume less power.

REFERENCES

[1] W. Suntiamorntut, Energy efficient functional unit for a parallel asynchronous DSP, Ph.D. dissertation, University of Manchester, 2005.
[2] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolic, Digital integrated circuits: a design perspective, 2nd ed., Prentice Hall, 2003.
[3] J. M. Rabaey, Low power design essentials, Springer, 2009.
[4] D. E. Culler, J. P. Singh, and A. Gupta, Parallel computer architecture: a hardware/software approach, Morgan Kaufmann Publishers, 1998.
[5] A. A. Fayed and M. A. Bayoumi, "A novel architecture for low-power design on parallel multipliers," in Proc. IEEE Comput. Soc. Workshop on VLSI, Apr. 2001, pp. 149–154.
[6] J. Di and J. S. Yuan, "Power-aware pipelined multiplier design based on 2-dimensional pipeline gating," in Proc. GLSVLSI, Apr. 2003, pp. 64–67.
[7] M.-C. Wen, S.-J. Wang, and Y.-N. Lin, "Low power parallel multiplier with column bypassing," in Proc. ISCAS, May 2005, pp. 1638–1641.
[8] Z. Huang and M. D. Ercegovac, "Two-dimensional signal gating for low-power array multiplier design," in Proc. ISCAS, vol. 1, Aug. 2002, pp. 489–492.
[9] W. Yeh and C. Jen, "High-speed Booth encoded parallel multiplier design," IEEE Trans. Comput., vol. 49, no. 7, pp. 692–700, Jul. 2000.
[10] K. Choi and M. Song, "Design of a high performance 32 × 32-bit multiplier with a novel sign select Booth encoder," in Proc. ISCAS, vol. 2, pp. 701–704, May 2001.
[11] Z. Huang, High-level optimization techniques for low power multiplier design, Ph.D. dissertation, University of California at Los Angeles, 2004.
[12] Z. Huang and M. D. Ercegovac, "High-performance low-power left-to-right array multiplier design," IEEE Trans. Comput., vol. 54, no. 3, pp. 272–283, Mar. 2005.
[13] M. D. Ercegovac and T. Lang, "Fast multiplication without carry-propagate addition," IEEE Trans. Comput., vol. 39, no. 11, pp. 1385–1390, Nov. 1990.
[14] Z. Huang and M. D. Ercegovac, "Low power array multiplier design by topology optimization," in Proc. SPIE Advanced Signal Processing Algorithms, Architectures, and Implementations XII, vol. 4791, Jul. 2002, pp. 424–435.
[15] Z. Huang and M. D. Ercegovac, "Number representation optimization for low power multiplier design," in Proc. SPIE Advanced Signal Processing Algorithms, Architectures, and Implementations XII, vol. 4791, Jul. 2002, pp. 345–356.
[16] T. Sakuta, W. Lee, and P. T. Balsara, "Delay balanced multipliers for low power/low voltage DSP core," in Proc. ISLPED, Oct. 1995, pp. 36–37.
[17] S. S. Mahant-Shetti, P. T. Balsara, and C. Lemonds, "High performance low power array multiplier using temporal tiling," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 1, pp. 121–124, Mar. 1999.
[18] J. L. Hennessy and D. A. Patterson, Computer organization and design: the hardware/software interface, Morgan Kaufmann Publishers, 2005.
[19] S. Furber, ARM system architecture, Addison-Wesley, 1996.
[20] ARM, ARM7TDMI technical reference manual.
[21] R. Gonzalez and M. Horowitz, "Energy dissipation in general purpose microprocessors," IEEE J. Solid-State Circuits, vol. 31, no. 9, Sep. 1996.
[22] S. W. Heo, S. J. Huh, and M. D. Ercegovac, "Power optimization in a parallel multiplier using voltage islands," to appear in Proc. ISCAS, May 2013.
