Power Optimization of Sum-of-Products Design for Signal Processing Applications Defense Talk
Seok Won Heo Ph.D. candidate Computer Science, UCLA
[email protected]
Computer Science
Outline
• Introduction
  – Research Goal
  – Research Approach
  – Research Point
• Research Problem
• Proposed Design
  – Sum-of-products Design
• Future Work
• Conclusion
Motivation
• Signal Processing Applications
  – A growing number of arithmetic calculations → increasing demand for high-quality signal processing applications.
  – Large power dissipation → the power consumption must be minimized.
Power Dissipation
$P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}}$
where $P_{\text{total}}$ is the total power dissipation, $P_{\text{dynamic}}$ is the dynamic power dissipation, and $P_{\text{static}}$ is the static power dissipation [1][2]
• Dynamic power – The power consumed by the intended work of the circuit to switch states.
• Static power – The power consumed by leakage current.
Research Goal
$P_{\text{dynamic}} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$
where $C_L$ is the load capacitance, $V_{DD}$ is the power supply voltage, $f_{clk}$ is the clock frequency, and $\alpha_{0 \to 1}$ is the switching activity [1][2]
• Dynamic Power
  – The dominant factor in the total power dissipation of CMOS circuits.
  – Optimizing static power depends heavily on low-level techniques: circuit, device.
• Focus Point
  – Power supply voltage: the largest impact (squared term)
  – Switching activity: simple design (glitches: 30% of total energy)
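As a rough illustration of the formula above, the following Python sketch evaluates the dynamic power for one set of parameters and then lowers only the switching activity. All numeric values are assumptions chosen for the example, not measurements from this work; the 30% reduction simply reuses the glitch share quoted on the slide as a hypothetical upper bound.

```python
def dynamic_power(c_load, v_dd, f_clk, alpha):
    """P_dynamic = 0.5 * C_L * V_DD^2 * f_clk * alpha_(0->1), in watts."""
    return 0.5 * c_load * v_dd ** 2 * f_clk * alpha


# Illustrative values only (assumed, not taken from the slides).
C_LOAD = 1e-9      # farads
V_DD = 1.2         # volts
F_CLK = 100e6      # hertz
ALPHA = 0.2        # average 0->1 transitions per cycle

base = dynamic_power(C_LOAD, V_DD, F_CLK, ALPHA)
# Hypothetical case: all glitch-related activity (30%) is removed.
no_glitch = dynamic_power(C_LOAD, V_DD, F_CLK, ALPHA * 0.7)

print(f"baseline power: {base * 1e3:.2f} mW")
print(f"without glitches: {no_glitch / base:.2f} of baseline")
```

Because the formula is linear in the switching activity, the relative saving equals the fraction of activity that is removed.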
Research Approach
$P_{\text{dynamic}} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$
where $C_L$ is the load capacitance, $V_{DD}$ is the power supply voltage, $f_{clk}$ is the clock frequency, and $\alpha_{0 \to 1}$ is the switching activity [1][2]
• Research approaches
  – Algorithm level
  – Architecture level
  – Gate level
  – Circuit level
  – Device level
• High-level optimization affects all four factors → great potential for power savings
• Considerations
  – Computation features
  – Input data patterns
Research Points
• Evaluation Point
  – Dynamic power reduction
  – Power-performance trade-offs
• Consideration Point
  – How to control the arithmetic unit to match external data characteristics.
  – How to optimize the internal algorithm and architecture.
[Figure: design and evaluation flow]
• Software design for benchmarks → compile (ARM compiler) → cycle-level simulation* → cycle estimate (measured using Mentor Questa Codelink)
• Architectural design → structure-level Verilog → RTL design → RTL verification (Cadence NC-Verilog) → RTL synthesis (Synopsys Design Compiler) → placement and routing*
• Random input characteristics → signal statistics → signal activity analysis (simulation, glitch analysis)
• Delay estimate: measured using Synopsys PrimeTime
• Area estimate: measured using Synopsys IC Compiler
• Power estimate: measured using Samsung CubicWare
• Library: SAMSUNG LIBRARY (65nm CMOS standard)
* Conducted with the help of a Samsung engineer.
Proposed Design for Sum-of-products
• Form – S=a×b+x×y
• Signal processing applications – FIR filter, High pass filter, Inner-Products
Research Problems
• Problems using a single multiplier to implement the sum-of-products operation
  Arithmetic unit (multiplier) in conventional signal processing applications:
  1) Many clock cycles
  2) Separate arithmetic units for different arithmetic functions (like squarer and multiply-add)
  3) Input data with a large dynamic range
• Solution
  – Parallelism
    → The number of cycles can be reduced by hardware parallelism.
    → Power savings can be achieved by reducing the supply voltage, with some loss of performance.
FIR filter: $y[n] = \sum_{k=0}^{N-1} c[k] \times x[n-k]$
• MAC instruction (y = y + c × x)
    y[n] = 0
    for (k = 0; k < N; k++) {
        y[n] = y[n] + c[k] × x[n - k]
    }
• Sum-of-products instruction (y = y + c0 × x0 + c1 × x1)
    y[n] = 0
    for (k = 0; k < N; k += 2) {
        y[n] = y[n] + c[k] × x[n - k] + c[k + 1] × x[n - k - 1]
    }
The best-case scenario: sum-of-products operations require only half the number of cycles.
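The two loops can be compared with a small Python sketch. It is only an illustration: the function names and sample data are made up, and the `cycles` counter assumes that each MAC or sum-of-products instruction corresponds to one loop iteration, which ignores real pipeline and memory effects.

```python
def fir_mac(c, x, n):
    """One multiply-accumulate per iteration: y = y + c * x."""
    y, cycles = 0, 0
    for k in range(len(c)):
        y += c[k] * x[n - k]
        cycles += 1
    return y, cycles


def fir_sop(c, x, n):
    """Two products per iteration: y = y + c0 * x0 + c1 * x1.

    Assumes an even number of taps.
    """
    y, cycles = 0, 0
    for k in range(0, len(c), 2):
        y += c[k] * x[n - k] + c[k + 1] * x[n - k - 1]
        cycles += 1
    return y, cycles


# Made-up coefficients and samples for the example.
c = [1, 2, 3, 4]
x = [5, 6, 7, 8, 9, 10]
n = 5

print(fir_mac(c, x, n))
print(fir_sop(c, x, n))
```

Both functions return the same filter output; the sum-of-products version needs half as many iterations when the tap count is even.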
Sum-of-products [4]
[Figure: two sum-of-products structures for inputs (x, y) and (a, b).
Left: two PPR arrays → [4:2] adder → one carry-propagate adder → a × b + x × y.
Right: two PPR arrays → two carry-propagate adders (a × b and x × y) → one more carry-propagate adder → a × b + x × y.]
The left structure is better: fewer carry-propagate additions → less power and delay.
Components
• Main components
  – Partial Product (PP) reduction: determines the overall power.
  – CPA: contributes significantly to the delay.
[Figure: sum-of-products unit: two PPR arrays (a × b, x × y) → [4:2] adder → carry-propagate adder → a × b + x × y]
Partial Product Reduction
Left-to-Right Array Multiplier [5][6] (cont’d)
• PPs are added in series, starting from the leftmost multiplier bit.
• Carry signals propagate through fewer stages in the most-significant (MS) region. → Power savings
[Figure: bit matrix for an 8 × 8 left-to-right multiplication example]
Left-to-Right Array Multiplier
• The disadvantage of the array multiplier: many unbalanced delay paths, i.e. unbalanced signal arrivals at the adders [7].
• Snowballing effect: glitches increase as signals propagate through the array.
• The lower rows consume much more power than the upper rows.
[Figure: implementation of an 8 × 8 left-to-right array multiplier, with adder input arrival times annotated in multiples of T_XOR for an upper row and a lower row]
Multi-level Split Array Multiplier
• Two-level upper/lower split array multiplier [5][6]
  – Each part has half the number of rows, and the parts are added separately in parallel.
  – The final vectors from the two parts can be reduced using one [4:2] adder.
• Four-level upper/lower split array multiplier [4]
  – Each part has a quarter of the rows, and the parts are added separately in parallel.
  – The intermediate vectors from the four parts can be reduced using two [4:2] adders followed by one [4:2] adder.
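The arithmetic behind the split structure can be sketched in Python. This is a behavioral model under stated assumptions, not the proposed hardware: operands are unsigned, each part reduces its rows in carry-save form, and a [4:2] adder is modeled as two cascaded [3:2] carry-save steps. All function names are hypothetical.

```python
def csa(a, b, c):
    """[3:2] carry-save adder on whole bit vectors: a + b + c == s + cy."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def compress_4_2(a, b, c, d):
    """[4:2] adder modeled as two cascaded [3:2] adders."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)


def reduce_rows(rows):
    """Reduce the partial-product rows of one part to a (sum, carry) pair."""
    s, cy = 0, 0
    for row in rows:
        s, cy = csa(s, cy, row)
    return s, cy


def split_multiply(a, b, n=8, parts=4):
    """Unsigned n x n multiply; parts must be a power of two dividing n."""
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    size = n // parts
    vecs = [reduce_rows(rows[p * size:(p + 1) * size]) for p in range(parts)]
    # Pairs of (sum, carry) vectors are merged by [4:2] adders, level by level.
    while len(vecs) > 1:
        vecs = [compress_4_2(*vecs[i], *vecs[i + 1])
                for i in range(0, len(vecs), 2)]
    s, cy = vecs[0]
    return s + cy  # final carry-propagate addition


print(split_multiply(200, 123))
```

With `parts=4` the merge loop runs two [4:2] steps and then one more, matching the four-level description; with `parts=2` a single [4:2] step remains.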
Voltage Islands [8]
$P_{\text{dynamic}} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$
Power dissipation
  → depends on the square of the supply voltage.
  → is significantly reduced by scaling down the supply voltage.
Voltage islands: use multiple supply voltages
• The most performance-critical element of the design → requires the highest voltage level in order to maximize its performance.
• The other functional units → may not require the highest voltage; power is saved if they can run at lower voltages.
[Figure: an example utilizing voltage islands: processor core at 5 GHz, 1.2 V; the other functional units at 2 GHz, 1.0 V]
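A short Python sketch shows how the quadratic voltage term turns into savings when only the non-critical logic moves to a lower supply. The two supply levels are the ones from the slide's example; the block names, capacitances, activity, and clock frequency are assumptions made up for illustration, and level-shifter overhead is ignored.

```python
def dynamic_power(c_load, v_dd, f_clk, alpha):
    """P_dynamic = 0.5 * C_L * V_DD^2 * f_clk * alpha_(0->1)."""
    return 0.5 * c_load * v_dd ** 2 * f_clk * alpha


V_HIGH, V_LOW = 1.2, 1.0   # supply levels from the slide's example
F_CLK = 1e8                # assumed common clock frequency (Hz)

# Hypothetical blocks: (name, load capacitance in F, activity, critical?)
blocks = [
    ("critical_path", 2e-10, 0.3, True),
    ("other_logic", 6e-10, 0.3, False),
]

single_supply = sum(dynamic_power(c, V_HIGH, F_CLK, a)
                    for _, c, a, _ in blocks)
with_islands = sum(dynamic_power(c, V_HIGH if critical else V_LOW, F_CLK, a)
                   for _, c, a, critical in blocks)
saving = 1 - with_islands / single_supply

print(f"dynamic power saving with islands: {saving:.1%}")
```

The saving depends on how much of the switched capacitance sits outside the critical region, which is why the partition of the design matters.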
• Problem
  – Non-uniform arrival time
• Solution
  – Voltage islands [10]
    Middle region of the multiplier + CPA (critical path) → high power supply voltage
    The other regions (short paths) → low power supply voltage
[Figure: partition of the non-uniform arrival-time profile into MS, middle, and LS regions for voltage islands]
Carry Propagate Adder (cont’d)
• Problem
  – Uniform (but not perfectly flat) input arrival time at the final CPA
• Solution
  – A fast adder designed under the assumption of uniform arrivals
[Figures: uniform arrival time for voltage islands; uniform arrival time for the 4-level split structure]
Carry Propagate Adder
• Carry-Select Adder (CSELA) [9]
  – One of the fast adders
  – Large area and power (duplicated structure)
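The duplicated structure mentioned above can be seen in a behavioral Python sketch of a standard carry-select adder. It is a minimal model with a fixed block width chosen for the example; the function names are hypothetical and the code says nothing about the group sizes used in this work.

```python
def block_add(a, b, cin, width):
    """Add two width-bit blocks; returns (sum bits, carry out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, width=16, block=4):
    """Each block is computed twice (carry-in 0 and 1), then selected."""
    result, carry = 0, 0
    for lo in range(0, width, block):
        w = min(block, width - lo)
        mask = (1 << w) - 1
        xa, xb = (a >> lo) & mask, (b >> lo) & mask
        s0, c0 = block_add(xa, xb, 0, w)   # copy assuming carry-in = 0
        s1, c1 = block_add(xa, xb, 1, w)   # duplicate assuming carry-in = 1
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << lo
    return result, carry


print(carry_select_add(0x1234, 0x0FCD))
```

In hardware both copies of a block work in parallel and the incoming carry only drives a multiplexer, which is where the speed comes from and also why area and power grow.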
[Figures: uniform arrival time for the 4-level split structure; uniform arrival time for voltage islands]
Delay increases from the LSB to bit 7 (8). The carry-ripple adder (CRA) need not wait for incoming inputs from the reduction array.
[Figure: an example of a standard CSELA]
Optimal group size: minimizes delay
1. Fast adder (CLG): the sum and carries can be calculated earlier → high performance, high power
2. Slow adder (CRA): low performance, low power (speed increases if the supply voltage is increased)
[Figure: an example of a modified CSELA]
[Figure: standard CSELA compared with modified CSELA using an add-one circuit [11]]
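The slides do not spell out the add-one circuit, so the following Python sketch shows one common formulation as an assumption: the block sum is computed once for carry-in 0, and the carry-in 1 result is derived by inverting the bits up to and including the lowest 0 bit. Function names are hypothetical.

```python
def block_add0(a, b, width):
    """Block sum for carry-in = 0; returns (sum bits, carry out)."""
    total = a + b
    return total & ((1 << width) - 1), total >> width


def add_one(s0, c0, width):
    """Derive the carry-in = 1 result from the carry-in = 0 result."""
    mask = (1 << width) - 1
    lowest_zero = ~s0 & (s0 + 1)       # isolates the lowest 0 bit of s0
    flip = (lowest_zero << 1) - 1      # that bit and every bit below it
    s1 = (s0 ^ flip) & mask
    c1 = c0 | int(s0 == mask)          # all-ones sum overflows when incremented
    return s1, c1


s0, c0 = block_add0(0b0011, 0b0010, 4)
print(add_one(s0, c0, 4))
```

Compared with duplicating the whole block adder, this replaces the second adder with a simpler incrementer-like circuit.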
Optimized Sum-of-products
• PPR arrays: a 4-level split array structure with [4:2] adders, or an array structure with voltage islands
• [4:2] adder
• Carry-propagate adder: a variable-block-width modified carry-select adder with add-one logic
[Figure: two PPR arrays (a × b, x × y) → [4:2] adder → carry-propagate adder → a × b + x × y]
Measurement
• Architecture
  – We are restricted to a specific compiler, ISA, and microarchitecture.
• ARM7TDMI-S [12]
  – RISC processor for low-power embedded applications
  – 32 × 8 high-performance multiplier (4 cycles)
  – Synthesizable core → easy and efficient for cycle-level simulation
• Assumption
  – ARM ISA → no sum-of-products instruction
  – Sum-of-products:
    Clock cycles = ARM multiplication cycles (parallel) + ALU cycles (5 = 4 + 1)
    Area = 2 × ARM7 multiplier + ALU
Results
• 35 ~ 45% less execution time
• 15 ~ 40% more energy
• A sum-of-products unit consumes less energy than a multiplier only when the execution time is the same.
• Better energy-delay product
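The energy-delay claim can be checked with simple arithmetic on the reported ranges. The Python sketch below assumes that "delay" in the energy-delay product means execution time and, as a pessimistic simplification, that any energy increase in the range may coincide with any time reduction in the range.

```python
# Reported ranges relative to a single multiplier.
time_reduction = (0.35, 0.45)    # 35 ~ 45% less execution time
energy_increase = (0.15, 0.40)   # 15 ~ 40% more energy

# Energy-delay product relative to the baseline for each corner case.
edp_ratios = [(1 + e) * (1 - t)
              for e in energy_increase
              for t in time_reduction]

worst, best = max(edp_ratios), min(edp_ratios)
print(f"relative EDP between {best:.4f} and {worst:.4f}")
```

Even the least favorable corner (40% more energy with only 35% less time) stays below 1, which is consistent with the better energy-delay product reported.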
Research Problems
• Problems using a single multiplier to implement the sum-of-products operation
  Arithmetic unit (multiplier) in conventional signal processing applications:
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (like squarer and multiply-add)
  3) Input data with a large dynamic range
Solution
• Multifunctional Arithmetic Unit
  – Supports several arithmetic operations (multiplication, multiply-add, square, sum-of-squares, add-multiply) using essentially the same hardware with input control.
• Sum-of-products Unit
  – A general operation that is easy to transform into the desired arithmetic operation.
Proposed Arithmetic Unit
• Sum-of-Products Unit
  – Baseline modules: two PPR arrays + [4:2] adder + CPA
  – Additional modules:
    opcode decoder: determines the operation and delivers control signals
    MUX: selects the appropriate output
Heterogeneous Sum-of-Products Unit
• Problem
  – Support each unique arithmetic operation without a power increase
  – Most operations do not use all modules.
• Solution
  – Two different PPR arrays
    1) Main array: supports frequently used operations (multiplication, multiply-add); lower power, higher performance
    2) Auxiliary array: supports the other operations (square, sum-of-squares, add-multiply); extra gates
    Both arrays: sum-of-products
Operand modes

Operation       | Expression    | Condition
Sum-of-Products | a × b + x × y | −
Multiplication  | a × b         | (x = 0) or (y = 0)
Multiply-Add    | a × b + x     | y = 1
Squares         | a²            | (a = b) and ((x = 0) or (y = 0))
Sum-of-Squares  | a² + x²       | (a = b) and (x = y)
Add-Multiply    | a × (b + y)   | x = a

Control signals (input: opcode, operand; output: turn-on modules, turn-off modules, result selection)

Opcode | Operand         | Turn-on Modules      | Turn-off Modules          | Result Selection
SOP    | Sum-of-Products | Two multipliers, CPA | None                      | CPA
M      | Multiplication  | Main multiplier      | Auxiliary multiplier, CPA | Main multiplier
MA     | Multiply-Add    | Main multiplier, CPA | Auxiliary multiplier      | CPA
S      | Squares         | Auxiliary multiplier | Main multiplier, CPA      | Auxiliary multiplier
SS     | Sum-of-Squares  | Auxiliary multiplier | Main multiplier, CPA      | Auxiliary multiplier
AM     | Add-Multiply    | Auxiliary multiplier | Main multiplier, CPA      | Auxiliary multiplier
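The operand conditions can be read as input control on a single sum-of-products function. The Python sketch below is only a behavioral illustration of those conditions with hypothetical function names; it does not model which PPR array or module is switched on.

```python
def sop(a, b, x, y):
    """General operation: a * b + x * y."""
    return a * b + x * y


def multiply(a, b):
    return sop(a, b, 0, 0)      # (x = 0) or (y = 0)


def multiply_add(a, b, x):
    return sop(a, b, x, 1)      # y = 1


def square(a):
    return sop(a, a, 0, 0)      # (a = b) and ((x = 0) or (y = 0))


def sum_of_squares(a, x):
    return sop(a, a, x, x)      # (a = b) and (x = y)


def add_multiply(a, b, y):
    return sop(a, b, a, y)      # x = a, so a*b + a*y = a * (b + y)


print(multiply(6, 7), multiply_add(6, 7, 5), square(9),
      sum_of_squares(3, 4), add_multiply(5, 2, 8))
```

Each derived operation only fixes or ties together inputs of the general form, which is why one unit with input control can cover all of them.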
Sum-of-products / Multiplication
• Sum-of-products
  − Two PPR arrays + [4:2] adder + CPA
  − a × b + x × y
• Multiplication
  − Main PPR array + CPA
  − a × b + x × y → a × b, if (x = 0) or (y = 0)
Multiply-add / Square
• Multiply-Add
  − Main PPR array + [4:2] adder + CPA
  − a × b + x × y → a × b + x, if y = 1
• Square
  − Aux. PPR array + CPA
  − a × b + x × y → x², if (x = y) and ((a = 0) or (b = 0))
Square
• In the conventional left-to-right array multiplier, a [3:2] adder receives three PP bits, e.g. a4b0, a3b1, a2b2.
[Figure: conventional left-to-right array multiplier; "+" denotes a [3:2] adder]
• Rearrangement of FAs
  – The [3:2] and [4:2] adders receive the two symmetric PP bits xj·yi and xi·yj.
[Figure: 8 × 8 left-to-right modified array multiplier]
Square (cont’d)
• [3:2] adder
  – Use BYPASS.
  – If input a = input b (squaring) → Sout = Cin, Cout = a (or b)
[Figure: [3:2] adder with bypass]
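The bypass condition follows directly from the full-adder equations, as the exhaustive Python check below illustrates. The function names are hypothetical; the sketch only verifies the logic identity and does not model the bypass circuit itself.

```python
def full_adder(a, b, cin):
    """Standard [3:2] adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def full_adder_bypass(a, cin):
    """Result when both data inputs equal a: no XOR or majority logic needed."""
    return cin, a


for a in (0, 1):
    for cin in (0, 1):
        print(a, cin, full_adder(a, a, cin), full_adder_bypass(a, cin))
```

With a = b the XOR of the two equal inputs cancels, leaving Sout = Cin, and the majority function reduces to a, so the adder logic does not need to switch.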
Square (cont’d)
• [4:2] adder
  – Use BYPASS.
  – NAND-based implementation → inefficient computation
  – MUX-based implementation → efficient computation
  – If input c = input d (square) → two operations (XOR, MUX) are skipped.
[Figure: [4:2] adder]
Sum-of-Squares
(a = b) and (x = y)
• Little difference between the bit matrices for multiplication and sum-of-squares
  → CMSSU [14]: sum-of-squares can be executed by one array and additional gates.
• Input control using MUX
• Additional XOR gates
[Figures: bit matrix for multiplication; bit matrix for sum-of-squares]
Add-Multiply
• Add-multiply
  – a × b + x × y = a × (b + y) when x = a, where + is arithmetic addition
[Figures: conventional structure; modified structure]
Results
• 35 ~ 50% less power (except for sum-of-products)
• 10 ~ 20% more delay
• Better power-delay product (except for sum-of-products)
Research Problems
• Problems using a single multiplier to implement the sum-of-products operation
  Arithmetic unit (multiplier) in conventional signal processing applications:
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (like squarer and multiply-add)
     Solution) sum-of-products (multi-functional arithmetic unit)
  3) Input data with a large dynamic range
Solution
• Low-Precision Data: SIMD Operation
  – Throughput increase and power decrease.
• High-Precision Data: Approximate Operation
  – Power decrease with approximate results
  – Mobile systems with limited screen size tolerate a reasonable amount of error.
Proposed Arithmetic Unit
• Sum-of-Products Unit
  – Baseline modules: two PPR arrays + [4:2] adder + CPA
  – Additional modules:
    dynamic range detector: detects the signs and the ranges of all input data
    main controller: generates the SIMD and APPR signals
Signal Gating [14]
$P_{\text{dynamic}} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$
Power dissipation
  → depends on the switching activity.
  → is reduced by reducing the switching activity.
Signal gating → eliminates unnecessary switching activity
• enable = 1 → gated signal = signal (switching)
• enable = 0 → gated signal = 0 (no switching)
[Figure: signal gating: inputs signal and enable, output gated signal]
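The two enable cases can be mimicked with a tiny Python sketch that models the gate as a bitwise AND and counts transitions on the gated signal. The waveform values are made up for illustration.

```python
def transitions(bits):
    """Number of value changes between consecutive samples."""
    return sum(1 for p, q in zip(bits, bits[1:]) if p != q)


def gate(signal, enable):
    """gated signal = signal when enable = 1, otherwise 0."""
    return [s & e for s, e in zip(signal, enable)]


# Made-up waveforms, one value per clock cycle.
signal = [0, 1, 0, 1, 1, 0, 1, 0]
enable = [1, 1, 1, 0, 0, 0, 0, 0]
gated = gate(signal, enable)

print(gated)
print(transitions(signal), transitions(gated))
```

Once enable drops to 0 the downstream logic sees a constant value, so the activity term in the power formula falls for everything driven by the gated signal.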
SIMD operation
• Inverter for bit inversion
• MUX for selection
• Signal gating
[Figures: bit matrix for standard operation; bit matrix for SIMD operation]
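The slides do not give the details of the SIMD bit matrix, so the Python sketch below shows one common way to obtain two half-width products from a single array, as an assumption: the cross partial-product bits are forced to zero. Operands are treated as unsigned, so the inverters used for bit inversion are not modeled, and the function name is hypothetical.

```python
def array_multiply(a, b, n=32, simd=False):
    """Sum the n x n partial-product bits; in SIMD mode mask the cross terms."""
    half = n // 2
    total = 0
    for i in range(n):          # bit index of b
        for j in range(n):      # bit index of a
            cross = (i < half) != (j < half)
            if simd and cross:
                continue        # masked partial-product bit contributes 0
            total += (((a >> j) & 1) & ((b >> i) & 1)) << (i + j)
    return total


# Pack two 16-bit operands into each 32-bit input (values are made up).
a = (1234 << 16) | 567
b = (89 << 16) | 1011
packed = array_multiply(a, b, simd=True)
low, high = packed & 0xFFFFFFFF, packed >> 32

print(low, high)
```

With the cross terms removed, the low halves multiply into the lower result half and the high halves into the upper result half, so one pass yields two independent products.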
Approximate operation: Gate Level
• MS 32-bit columns: used
• LS 32-bit columns: ignored
• Signal gating
[Figure: bit matrix for approximate operation [15]]
Mean error estimates:
1. Generate random values.
2. error = correct value − approximate value
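The error-estimation procedure can be sketched in Python under the assumption that "ignored" means every partial-product bit in the dropped columns is forced to zero, with no correction term. The function names, sample count, and seed are arbitrary choices for the example, and operands are unsigned.

```python
import random


def truncated_multiply(a, b, n=32, dropped_columns=32):
    """Multiply while discarding partial-product bits in the LS columns."""
    keep = ~((1 << dropped_columns) - 1)
    total = 0
    for i in range(n):
        if (b >> i) & 1:
            total += (a << i) & keep   # row i without its low columns
    return total


def mean_error(samples=1000, n=32, dropped_columns=32, seed=0):
    """Step 1: random inputs. Step 2: error = correct - approximate."""
    rng = random.Random(seed)
    total_error = 0
    for _ in range(samples):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        total_error += a * b - truncated_multiply(a, b, n, dropped_columns)
    return total_error / samples


print(mean_error())
```

In this model the error is never negative and stays below n × 2^dropped_columns, because each of the n rows loses less than 2^dropped_columns.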
Approximate operation: Architecture Level
• Method I (one 32 × 32 multiplication): used if one result ≫ the other result
• Method II (two 16 × 16 multiplications)
Mean error estimates:
1. Generate random values.
2. error = correct value − approximate value
Modified Module
• PPG
  – Baseline module + AND2 gates for masking
• PPR Array
  – Baseline module + extra module for SIMD + AND2 gates for masking
The Modified Module
• Final CPA
  – Baseline module + AND2 gates for masking + MUX for output selection
Results (Gate Level)
• Power: SIMD 50% less, APPR 30% less
• Delay: SIMD 30% less, APPR 10% more (gates for masking)
• Better power-delay product
Results (Architecture Level)
• Power: one 32 × 32: 40% less; two 16 × 16: 75% less
• Delay: one 32 × 32: 12% more; two 16 × 16: 40% less
• Power reduction: one 32 × 32 < two 16 × 16
• Delay reduction: one 32 × 32 < two 16 × 16
• Mean error: one 32 × 32 < two 16 × 16
Research Problems
• Problems using a single multiplier to implement the sum-of-products operation
  Arithmetic unit (multiplier) in conventional signal processing applications:
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (like squarer and multiply-add)
     Solution) sum-of-products (multi-functional arithmetic unit)
  3) Input data with a large dynamic range
     Solution) sum-of-products (SIMD and approximate operation)
Future Work
1) Other composite arithmetic operations (e.g. 4-dimensional sum-of-products: a × b + c × d + e × f + g × h)
2) 64-bit floating-point arithmetic operations
Conclusion (cont’d)
• Problems using a single multiplier to implement the sum-of-products operation
  Arithmetic unit (multiplier) in conventional signal processing applications:
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (like squarer and multiply-add)
     Solution) sum-of-products (multi-functional arithmetic unit)
  3) Input data with a large dynamic range
     Solution) sum-of-products (SIMD and approximate operation)
Conclusion (cont’d)
• Proposed sum-of-products unit
  Technology: Samsung 65nm standard cell
  ISA: ARM7
  – PPR arrays: 1. 4-level UL LR array, 2. LR array w/ voltage islands
  – [4:2] adder
  – Carry-propagate adder: modified CSELA with add-one circuit (CRA and CLA)
  Compared to a single multiplier:
  – Execution time: 35 ~ 45% decrease
  – Energy: 15 ~ 40% increase
  – Less energy when the execution time is the same.
[Figure: two PPR arrays (a × b, x × y) → [4:2] adder → carry-propagate adder → a × b + x × y]
Conclusion (cont’d)
• Proposed multifunctional sum-of-products unit
  Compared to a baseline sum-of-products unit:
  – Power: 35 ~ 50% decrease (except for sum-of-products)
  – Delay: 10 ~ 20% increase
  – Better power-delay product
Conclusion
• Proposed sum-of-products unit supporting SIMD and approximate operation
  Compared to a baseline sum-of-products unit:
  – Power: SIMD 50% decrease, APPR 30% decrease
  – Delay: SIMD 30% decrease, APPR 10% increase
  – Better power-delay product
Acknowledgement
Prof. Ercegovac
Prof. Cong
Prof. Tamir
Prof. Marković
Acknowledgement
[Photos: with fiancé; with family]
Thank you.
Reference
[1] W. Suntiamorntut, Energy efficient functional unit for a parallel asynchronous DSP, Ph.D. dissertation, University of Manchester, 2005.
[2] J. M. Rabaey, Low power design essentials, Springer, 2009.
[3] G. K. Yeap, Practical low power digital VLSI design, Kluwer Academic Publishers, 1998.
[4] S. W. Heo, S. J. Huh, and M. D. Ercegovac, "Power optimization of sum-of-products design for signal processing applications," in Proc. ASAP, Jun. 2013, pp. 192–197.
[5] M. D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann, 2004.
[6] Z. Huang and M. D. Ercegovac, "Low power array multiplier design by topology optimization," in Proc. SPIE Advanced Signal Processing Algorithms, Architectures, and Implementations XII, vol. 4791, Jul. 2002, pp. 424–435.
[7] Z. Huang, High-level optimization techniques for low power multiplier design, Ph.D. dissertation, University of California at Los Angeles, 2004.
[8] U. Ko, P. T. Balsara, and W. Lee, "A self-timed method to minimize spurious transitions in low power CMOS circuits," in Proc. ISLPED, Oct. 1994, pp. 62–63.
[9] D. E. Lackey, P. S. Zuchowski, T. R. Bednar, D. W. Stout, S. W. Gould, and J. M. Cohn, "Managing power and performance for system-on-chip designs using voltage islands," in Proc. ICCAD, Nov. 2002, pp. 195–202.
[10] S. W. Heo, S. J. Huh, and M. D. Ercegovac, "Power optimization in a parallel multiplier using voltage islands," in Proc. ISCAS, May 2013, pp. 345–348.
[11] M. J. Schulte, L. P. Marquette, S. Krithivasan, E. G. Walters, and J. Glossner, "Combined multiplication and sum-of-squares units," in Proc. ASAP, Jun. 2003, pp. 204–214.
[12] R. Goering, "Low power design," Special technology report, SCD Source, Sep. 2008.
[13] M. J. Schulte, "Reduced power dissipation through truncated multiplication," in Proc. IEEE Alessandro Volta Memorial Workshop on Low-Power Design, Mar. 1999, pp. 61–69.