Power Optimization of Sum-of-Products Design for Signal Processing Applications Defense Talk

Seok Won Heo, Ph.D. candidate, Computer Science, UCLA

Outline
• Introduction
  – Research Goal
  – Research Approach
  – Research Point
• Research Problem
• Proposed Design
  – Sum-of-products Design
• Future Work
• Conclusion

Motivation
• Signal Processing Applications
  – A growing number of arithmetic calculations → increasing demand for high-quality signal processing applications.
  – Large power dissipation → the need to minimize power consumption.

Power Dissipation

$P_{total} = P_{dynamic} + P_{static}$

where $P_{total}$ is the total power dissipation, $P_{dynamic}$ is the dynamic power dissipation, and $P_{static}$ is the static power dissipation [1][2].

• Dynamic power
  – The power consumed by the intended work of the circuit to switch states.
• Static power
  – The power consumed by leakage current.

Research Goal

$P_{dynamic} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$

where $C_L$ is the load capacitance, $V_{DD}$ is the power supply voltage, $f_{clk}$ is the clock frequency, and $\alpha_{0 \to 1}$ is the switching activity [1][2].

• Dynamic Power
  – The dominant factor in the total power dissipation of CMOS circuits.
  – Optimizing static power depends heavily on low-level techniques: circuit, device.
• Focus Point
  – Power supply voltage: the largest impact (squared term)
  – Switching activity: simple design (glitches: 30% of total energy)
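To make the two focus points concrete, the dynamic power expression can be evaluated directly. The following is a minimal sketch; the capacitance, frequency, voltage and activity values are hypothetical and chosen only to show the scaling, not taken from the talk.

```python
def dynamic_power(c_load, v_dd, f_clk, alpha):
    """P_dynamic = 0.5 * C_L * V_DD^2 * f_clk * alpha_0->1, in watts."""
    return 0.5 * c_load * v_dd ** 2 * f_clk * alpha

# Hypothetical operating point (illustrative values only).
base = dynamic_power(c_load=1e-12, v_dd=1.2, f_clk=1e9, alpha=0.2)

# Lower only the supply voltage: power scales with the square of V_DD.
low_voltage = dynamic_power(c_load=1e-12, v_dd=1.0, f_clk=1e9, alpha=0.2)

# Halve only the switching activity: power scales linearly.
low_activity = dynamic_power(c_load=1e-12, v_dd=1.2, f_clk=1e9, alpha=0.1)

ratio_voltage = low_voltage / base     # equals (1.0 / 1.2) ** 2
ratio_activity = low_activity / base   # equals 0.5
```

In this sketch a 1.2 V to 1.0 V reduction alone removes roughly 30% of the dynamic power, which is why the supply voltage is listed as having the largest impact.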

Research Approach

$P_{dynamic} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$

where $C_L$ is the load capacitance, $V_{DD}$ is the power supply voltage, $f_{clk}$ is the clock frequency, and $\alpha_{0 \to 1}$ is the switching activity [1][2].

• Research approaches: algorithm level, architecture level, gate level, circuit level, device level
• High-level optimization affects all four factors → great potential for power savings
• Considerations: computation features, input data patterns

Research Points
• Evaluation Point
  – Dynamic power reduction
  – Power-performance trade-offs
• Consideration Point
  – How to control the arithmetic unit to match external data characteristics.
  – How to optimize the internal algorithm and architecture.

[Figure: design and measurement flow]
• Software side: software design for benchmarks; compile (compiled using the ARM compiler); cycle-level simulation*; cycle estimate (measured using Mentor Questa Codelink)
• Hardware side: architectural design (structure-level Verilog); RTL design; RTL verification (verified using Cadence NC-Verilog); RTL synthesis (synthesized using Synopsys Design Compiler); placement and routing*
• Signal statistics: random input characteristics; signal activity analysis (simulation, glitch analysis)
• Estimates: delay (measured using Synopsys PrimeTime), area (measured using Synopsys IC Compiler), power (measured using Samsung CubicWare)
• Library: Samsung library (65nm CMOS standard)

* Conducted with the help of a Samsung engineer.

Proposed Design for Sum-of-products

• Form – S=a×b+x×y

• Signal processing applications – FIR filter, High pass filter, Inner-Products


Research Problems
• Problems with using a single multiplier to implement the sum-of-products operation (the multiplier is the arithmetic unit in conventional signal processing applications)
  1) Many clock cycles
  2) Separate arithmetic units for different arithmetic functions (such as squarer and multiply-add)
  3) Input data with a large dynamic range

• Solution – Parallelism
  → The number of cycles can be reduced by hardware parallelism.
  → Power savings can be achieved by reducing the supply voltage, with some loss of performance.

FIR filter: $y[n] = \sum_{k=0}^{N-1} c[k] \times x[n-k]$

• MAC instruction (y = y + c × x)
    y[n] = 0
    for (k = 0; k < N; k++) {
        y[n] = y[n] + c[k] × x[n - k]
    }

• Sum-of-products instruction (y = y + c0 × x0 + c1 × x1)
    y[n] = 0
    for (k = 0; k < N; k += 2) {
        y[n] = y[n] + c[k] × x[n - k] + c[k + 1] × x[n - k - 1]
    }

The best-case scenario: sum-of-products operations require only half the number of cycles.
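The two loops can be compared with a short behavioral model. This is only a sketch: the function names, the step counters and the sample coefficients are illustrative assumptions, and one loop iteration is used as a stand-in for one instruction, which ignores the actual cycle counts of the hardware.

```python
def fir_mac(c, x, n):
    """One multiply-accumulate per tap: y = y + c * x."""
    y, steps = 0, 0
    for k in range(len(c)):
        y += c[k] * x[n - k]
        steps += 1
    return y, steps

def fir_sop(c, x, n):
    """Two taps per step: y = y + c0 * x0 + c1 * x1 (assumes an even tap count)."""
    y, steps = 0, 0
    for k in range(0, len(c), 2):
        y += c[k] * x[n - k] + c[k + 1] * x[n - k - 1]
        steps += 1
    return y, steps

# Hypothetical 4-tap filter and input samples.
c = [1, 2, 3, 4]
x = [5, 6, 7, 8, 9]
n = 4

mac_result, mac_steps = fir_mac(c, x, n)
sop_result, sop_steps = fir_sop(c, x, n)
```

Both functions return the same filter output, while the sum-of-products version needs half as many loop steps, matching the best case stated above.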

Sum-of-products [4]

[Figure: two structures computing a×b+x×y from inputs x, y and a, b.
 Left: two PPR arrays → [4:2] adder → one carry-propagate adder → a×b+x×y.
 Right: two PPR arrays, each followed by its own carry-propagate adder (producing a×b and x×y) → a further carry-propagate adder → a×b+x×y.]

The left structure is better: fewer carry-propagate additions → less power and delay.

Components
• Main components
  – Partial Product (PP) reduction: determines the overall power.
  – CPA: contributes significantly to the delay.

[Figure: x, y and a, b → two PPR arrays → [4:2] adder → carry-propagate adder → a×b+x×y]

Partial Product Reduction

Left-to-Right Array Multiplier [5][6] (cont’d)
• PPs are added in series, starting from the leftmost multiplier bit.
• Carry signals propagate through fewer stages in the most-significant (MS) region. → Power savings

[Figure: bit matrix for an 8 × 8 left-to-right multiplication example]
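The order of accumulation can be sketched behaviorally as follows. This is a functional model only, assuming unsigned operands; it shows the left-to-right ordering of partial product rows but does not model the carry-save adders, signal arrival times or glitches of the actual array. The function name is illustrative.

```python
def multiply_left_to_right(a, b, width=8):
    """Accumulate partial product rows starting from the leftmost multiplier bit."""
    acc = 0
    for i in reversed(range(width)):       # most significant bit of b first
        row = a if (b >> i) & 1 else 0     # partial product row: a AND bit i of b
        acc = (acc << 1) + row             # shift the running sum, then add the row
    return acc

example = multiply_left_to_right(13, 11)   # same value as 13 * 11
```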

Left-to-Right Array Multiplier
• The disadvantage of the array multiplier: many unbalanced delay paths, that is, unbalanced signal arrivals in the adders [7].
• Snowballing effect: glitches increase as signals propagate through the array.
• The lower rows consume much more power than the upper rows.

[Figure: the implementation of the 8 × 8 left-to-right array multiplier, annotated with signal arrival times at adders in the upper and lower rows (values shown: 0, 2, 3, 7 and 9 × T_XOR)]

Multi-level Split Array Multiplier
• Two-level upper/lower split array multiplier [5][6]
  – Each part has half the number of rows, and the parts are added separately in parallel.
  – The final vectors from the two parts can be reduced using one [4:2] adder.
• Four-level upper/lower split array multiplier [4]
  – Each part has a quarter of the number of rows, and the parts are added separately in parallel.
  – The intermediate vectors from the four parts can be reduced using two [4:2] adders followed by one [4:2] adder.
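The splitting idea can be illustrated with a functional sketch. It assumes unsigned operands and a width divisible by the number of parts; plain integer addition stands in for the [4:2] adders and the final CPA, so only the partitioning of rows is shown, not the hardware. All names are illustrative.

```python
def partial_sum(a, b, rows):
    """Sum the partial product rows a * b_i * 2**i for the given row indices."""
    return sum(a << i for i in rows if (b >> i) & 1)

def split_array_multiply(a, b, width=8, parts=2):
    """Reduce each group of rows independently, then combine the group results."""
    rows_per_part = width // parts
    groups = [range(p * rows_per_part, (p + 1) * rows_per_part) for p in range(parts)]
    part_sums = [partial_sum(a, b, g) for g in groups]  # independent, could run in parallel
    return sum(part_sums)  # stands in for the [4:2] adder stage(s) and the CPA

two_level = split_array_multiply(200, 123, width=8, parts=2)
four_level = split_array_multiply(200, 123, width=8, parts=4)
```

Because each group holds fewer rows, signals pass through fewer adder stages before the groups are merged, which is the motivation given for the split structure.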

Voltage Islands [8]

$P_{dynamic} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$

Power dissipation
  → depends on the square of the supply voltage.
  → is significantly reduced by scaling down the supply voltage.

Voltage islands: use multiple supply voltages
• The most performance-critical element of the design → requires the highest voltage level in order to maximize its performance.
• The other functional units → may not require the highest voltage; power is saved if they can run at lower voltages.

[Figure: an example utilizing voltage islands. Processor core: 5GHz, 1.2V; the other functional units: 2GMhz, 1.0V]

• Problem
  – Non-uniform arrival time
• Solution
  – Voltage islands
    Middle region of the multiplier + CPA → high supply voltage
    The other regions → low supply voltage

[Figure: partition of non-uniform arrival time for voltage islands [10]. Middle region: critical path → high power supply voltage; MS region and LS region: short path → low power supply voltage]

Carry Propagate Adder (cont’d)

• Problem
  – Uniform (but not perfectly flat) input arrival time to the final CPA
• Solution
  – Fast adder under the assumption of uniform arrivals

[Figures: uniform arrival time for voltage islands; uniform arrival time for the 4-level split structure]

Carry Propagate Adder
• Carry-Select Adder (CSELA) [9]
  – One of the fast adders
  – Large area and power (duplicate structure)

[Figures: uniform arrival time for the 4-level split structure; uniform arrival time for voltage islands]

• Delay increases from the LSB to the 7(8)-bit.
• A Carry-Ripple Adder (CRA) need not wait for the incoming input from the reduction array.

[Figure: an example of a standard CSELA]

Optimal group size: minimize delay

[Figure: an example of a modified CSELA]

1. Fast adder (CLG): the sum and carries can be calculated earlier. → High performance, high power
2. Slow adder (CRA): low performance, low power (speed increases if the supply voltage is increased)

[Figure: standard CSELA compared with modified CSELA using an add-one circuit [11]]
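A behavioral sketch of the add-one idea follows: each block is added once with carry-in 0, and the carry-in 1 result is derived by incrementing that sum instead of duplicating the block adder. This is an assumption-laden model, not the design from the talk: it uses a fixed block width rather than a variable one, unsigned operands, and ordinary integer arithmetic in place of the CLG and CRA blocks. The function name is illustrative.

```python
def carry_select_add(a, b, width=16, block=4):
    """Block-wise addition where the carry-in-1 result comes from an add-one step."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for pos in range(0, width, block):
        a_blk = (a >> pos) & mask
        b_blk = (b >> pos) & mask
        sum0 = a_blk + b_blk                 # block adder with carry-in 0
        sum1 = sum0 + 1                      # add-one circuit instead of a second adder
        chosen = sum1 if carry else sum0     # multiplexer selected by the incoming carry
        result |= (chosen & mask) << pos
        carry = chosen >> block              # carry-out of this block
    return result, carry

total, carry_out = carry_select_add(0x1234, 0x0FCD)
```

The standard CSELA would compute `sum0` and `sum1` with two separate adders per block, which is the duplicate structure responsible for its large area and power.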

Optimized Sum-of-products
• PPR array
  – A 4-level split array structure with [4:2] adder
  – An array structure with voltage islands
• Carry-propagate adder
  – A variable-block-width modified carry-select adder with add-one logic

[Figure: x, y and a, b → two PPR arrays → [4:2] adder → carry-propagate adder → a×b+x×y]

Measurement
• Architecture
  – We are restricted to a specific compiler, ISA and microarchitecture.
• ARM7TDMI-S [12]
  – RISC processor for low-power embedded applications
  – 32 × 8 high-performance multiplier (4 cycles)
  – Synthesizable core → easy and efficient for cycle-level simulation
• Assumption
  – ARM ISA → no sum-of-products instruction
  – Sum-of-products:
    Clock cycles = ARM multiplication cycles (parallel) + ALU cycles (5 = 4 + 1)
    Area = 2 × ARM7 multiplier + ALU

Results
• 35 ~ 45% less execution time
• 15 ~ 40% more energy
  – A sum-of-products unit consumes less energy than a multiplier only when the execution time is the same.
• Better energy-delay product
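The energy-delay product claim can be checked against the reported ranges. The sketch below assumes that the energy-delay product is energy multiplied by execution time and that the two reported ranges can be combined independently at their extremes; the pairing of specific energy and time figures per benchmark is not given here, so the corner values are only bounds under that assumption.

```python
def edp_ratio(time_reduction, energy_increase):
    """Energy-delay product relative to the single-multiplier baseline (baseline = 1.0)."""
    return (1.0 - time_reduction) * (1.0 + energy_increase)

# Corners of the reported ranges: 35 ~ 45% less time, 15 ~ 40% more energy.
corners = {
    (t, e): edp_ratio(t, e)
    for t in (0.35, 0.45)
    for e in (0.15, 0.40)
}
worst_case = max(corners.values())   # 0.65 * 1.40
best_case = min(corners.values())    # 0.55 * 1.15
```

Even the least favorable corner stays below 1.0, which is consistent with the better energy-delay product stated above.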

Research Problems
• Problems with using a single multiplier to implement the sum-of-products operation (the multiplier is the arithmetic unit in conventional signal processing applications)
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (such as squarer and multiply-add)
  3) Input data with a large dynamic range


Solution
• Multifunctional Arithmetic Unit
  – Supports several arithmetic operations (multiplication, multiply-add, square, sum-of-squares, add-multiply) using essentially the same hardware with input control.
• Sum-of-products Unit
  – A general operation that is easier to transform into the desired arithmetic operation.

Proposed Arithmetic Unit
• Sum-of-Products Unit
  – Baseline module: two PPR arrays + [4:2] adder + CPA
  – Additional modules:
    opcode decoder: determines the operation and delivers control signals
    MUX: selects the appropriate output

Heterogeneous Sum-of-Products Unit
• Problem
  – Support each unique arithmetic operation without a power increase.
  – Most operations do not use all modules.
• Solution
  – Two different PPR arrays
    1) Main array: supports frequently used operations (multiplication, multiply-add); lower power, higher performance
    2) Auxiliary array: supports the other operations (square, sum-of-squares, add-multiply); extra gates
    Both arrays: sum-of-products

Operand modes

Operation          Expression     Condition
Sum-of-Products    a×b + x×y      (none)
Multiplication     a×b            (x = 0) or (y = 0)
Multiply-Add       a×b + x        y = 1
Squares            a²             (a = b) and ((x = 0) or (y = 0))
Sum-of-Squares     a² + x²        (a = b) and (x = y)
Add-Multiply       a × (b + y)    x = a

Control signals (input: Opcode, Operand; output: Turn-on Modules, Turn-off Modules, Result Selection)

Opcode   Operand           Turn-on Modules         Turn-off Modules             Result Selection
SOP      Sum-of-Products   Two multipliers, CPA    None                         CPA
M        Multiplication    Main multiplier         Auxiliary multiplier, CPA    Main multiplier
MA       Multiply-Add      Main multiplier, CPA    Auxiliary multiplier         CPA
S        Squares           Auxiliary multiplier    Main multiplier, CPA         Auxiliary multiplier
SS       Sum-of-Squares    Auxiliary multiplier    Main multiplier, CPA         Auxiliary multiplier
AM       Add-Multiply      Auxiliary multiplier    Main multiplier, CPA         Auxiliary multiplier
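The operand conditions can be read as input assignments to a single sum-of-products function. The sketch below shows only that arithmetic relationship; it does not model the opcode decoder, the main and auxiliary arrays, or which modules are switched off. The function names are illustrative.

```python
def sum_of_products(a, b, x, y):
    """Baseline operation: a*b + x*y."""
    return a * b + x * y

def multiply(a, b):
    return sum_of_products(a, b, 0, 0)      # (x = 0) or (y = 0)

def multiply_add(a, b, x):
    return sum_of_products(a, b, x, 1)      # y = 1

def square(a):
    return sum_of_products(a, a, 0, 0)      # (a = b) and ((x = 0) or (y = 0))

def sum_of_squares(a, x):
    return sum_of_products(a, a, x, x)      # (a = b) and (x = y)

def add_multiply(a, b, y):
    return sum_of_products(a, b, a, y)      # x = a, so a*b + a*y = a*(b + y)

results = {
    "M": multiply(6, 7),
    "MA": multiply_add(6, 7, 5),
    "S": square(9),
    "SS": sum_of_squares(3, 4),
    "AM": add_multiply(5, 2, 8),
}
```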

Sum-of-products / Multiplication
• Sum-of-products
  – Two PPR arrays + [4:2] adder + CPA
  – a×b + x×y
• Multiplication
  – Main PPR array + CPA
  – a×b + x×y → a×b, if (x = 0) or (y = 0)

Multiply-add / Square
• Multiply-Add
  – Main PPR array + [4:2] adder + CPA
  – a×b + x×y → a×b + x, if y = 1
• Square
  – Aux. PPR array + CPA
  – a×b + x×y → x², if (x = y) and ((a = 0) or (b = 0))

Square

[Figure: conventional left-to-right array multiplier, showing a column with three PP bits (a4b0, a3b1, a2b2) reduced by [3:2] adders, marked "+"]



• Rearrangement of FAs
  – [3:2] and [4:2] adders have two parameters: xj yi and xi yj.

[Figure: 8 × 8 left-to-right modified array multiplier]

Square (cont’d)
• [3:2] adder – Use BYPASS.
  – If input a = input b (squaring) → Sout = Cin, Cout = a (or b)

[Figure: [3:2] adder with bypass]
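The bypass condition can be verified against a standard full adder truth table. This is a small check written for illustration, assuming the usual full adder equations; the function names are not from the talk.

```python
def full_adder(a, b, cin):
    """Standard [3:2] adder: returns (sum, carry-out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def full_adder_bypass(a, cin):
    """When both PP inputs are equal, the outputs need no logic: Sout = Cin, Cout = a."""
    return cin, a

# Compare the two for every combination with equal inputs a = b.
bypass_matches = [
    full_adder(a, a, cin) == full_adder_bypass(a, cin)
    for a in (0, 1)
    for cin in (0, 1)
]
```

Since the outputs follow directly from the inputs, the XOR and carry logic inside the adder does not need to switch in squaring mode.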

Square (cont’d)
• [4:2] adder – Use BYPASS.
  – NAND-based implementation → inefficient computation
  – MUX-based implementation → efficient computation
  – If input c = input d (square) → two operations (XOR, MUX) are ignored.

[Figure: [4:2] adder with bypass]

Sum-of-Squares: (a = b) and (x = y)
• There is little difference between the bit matrices for multiplication and sum-of-squares.
  → CMSSU [14]: sum-of-squares can be executed by one array and additional gates.
• Input control using MUX
• Additional XOR gates

[Figures: bit-matrix for multiplication (entries aibj); bit-matrix for sum-of-squares (entries aiaj, bibj)]

Add-Multiply
• Add-multiply
  – a×b + x×y = a × (b + y), where + is arithmetic addition (when x = a)

[Figures: conventional structure; modified structure]

Results
• 35 ~ 50% less power (except for sum-of-products)
• 10 ~ 20% more delay
• Better power-delay product (except for sum-of-products)

Research Problems
• Problems with using a single multiplier to implement the sum-of-products operation (the multiplier is the arithmetic unit in conventional signal processing applications)
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (such as squarer and multiply-add)
     Solution) sum-of-products (multi-functional arithmetic unit)
  3) Input data with a large dynamic range


Solution
• Low-Precision Data: SIMD Operation
  – Throughput increase and power decrease.
• High-Precision Data: Approximate Operation
  – Power decrease with approximate results
  – Mobile systems with limited screen size tolerate a reasonable amount of error.

Proposed Arithmetic Unit
• Sum-of-Products Unit
  – Baseline module: two PPR arrays + [4:2] adder + CPA
  – Additional modules:
    dynamic range detector: detects the signs and the ranges of all input data
    main controller: generates the SIMD and APPR signals

Signal Gating [14]

$P_{dynamic} = 0.5 \times C_L \times V_{DD}^2 \times f_{clk} \times \alpha_{0 \to 1}$

Power dissipation
  → depends on the switching activity.
  → is reduced by reducing switching activity.

Signal gating → eliminates unnecessary switching activity
• enable = 1 → gated signal = signal (switching)
• enable = 0 → gated signal = 0 (no switching)

[Figure: signal gating, with inputs enable and signal and output gated signal]
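The effect on switching activity can be illustrated with a small waveform model. The sketch assumes that gating behaves like an AND of the signal with enable, which matches the two cases listed above; the sample waveforms are hypothetical.

```python
def gate(signal, enable):
    """AND-style gating: pass the signal when enable = 1, force 0 when enable = 0."""
    return [s & e for s, e in zip(signal, enable)]

def transitions(wave):
    """Count value changes between consecutive samples."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev != cur)

# Hypothetical waveform: the signal keeps toggling after it is no longer needed.
signal = [0, 1, 0, 1, 1, 0, 1, 0]
enable = [1, 1, 1, 1, 0, 0, 0, 0]

gated = gate(signal, enable)
ungated_activity = transitions(signal)
gated_activity = transitions(gated)
```

In this example the gated signal stops toggling once enable drops to 0, so downstream logic sees fewer transitions than it would from the raw signal.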

SIMD operation
• Inverter for bit inversion
• MUX for selection
• Signal gating

[Figures: bit-matrix for standard operation; bit-matrix for SIMD operation]

Approximate operation: Gate Level
• MS 32-bit columns: used
• LS 32-bit columns: ignored
• Signal gating
• Mean error estimates:
  1. Generate random values.
  2. error = correct value − approximate value

[Figure: bit-matrix for approximate operation [15]]
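The mean error procedure can be sketched as below. This is a simplified model under stated assumptions: a single unsigned 32 × 32 multiplication rather than the full sum-of-products unit, and an approximation that simply omits every partial product bit in the least-significant 32 columns. The function names, sample count and seed are illustrative.

```python
import random

def approximate_product(a, b, width=32, dropped_columns=32):
    """Sum only the partial product bits whose column index is >= dropped_columns."""
    total = 0
    for i in range(width):                 # bit i of a
        if not (a >> i) & 1:
            continue
        for j in range(width):             # bit j of b
            if (b >> j) & 1 and i + j >= dropped_columns:
                total += 1 << (i + j)
    return total

def mean_error(samples=200, width=32, dropped_columns=32, seed=0):
    """Step 1: generate random values. Step 2: error = correct - approximate."""
    rng = random.Random(seed)
    errors = []
    for _ in range(samples):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        errors.append(a * b - approximate_product(a, b, width, dropped_columns))
    return sum(errors) / len(errors)

estimate = mean_error(samples=50)
```

In this model the error is never negative, because the approximation only leaves out non-negative terms; the largest possible error occurs when every dropped partial product bit is 1.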

Approximate operation: Architecture Level
• Method I (one 32 × 32 multiplication): if one result >> the other result
• Method II (two 16 × 16 multiplications)
• Mean error estimates:
  1. Generate random values.
  2. error = correct value − approximate value

Modified Module
• PPG
  – Baseline module + AND2 gates for masking
• PPR Array
  – Baseline module + extra module for SIMD + AND2 gates for masking

The Modified Module
• Final CPA
  – Baseline module + AND2 gates for masking + MUX for output selection

Results (Gate Level)
• Power: SIMD 50% less; APPR 30% less
• Delay: SIMD 30% less; APPR 10% more (gates for masking)
• Better power-delay product

Results (Architecture Level)
• Power: one 32 × 32: 40% less; two 16 × 16: 75% less
• Delay: one 32 × 32: 12% more; two 16 × 16: 40% less
• Power reduction: one 32 × 32 < two 16 × 16
• Delay reduction: one 32 × 32 < two 16 × 16
• Mean error: one 32 × 32 < two 16 × 16

Research Problems
• Problems with using a single multiplier to implement the sum-of-products operation (the multiplier is the arithmetic unit in conventional signal processing applications)
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (such as squarer and multiply-add)
     Solution) sum-of-products (multi-functional arithmetic unit)
  3) Input data with a large dynamic range
     Solution) sum-of-products (SIMD and approximate operation)

Future Work
1) Other composite arithmetic operations (e.g. 4-dimensional sum-of-products: a × b + c × d + e × f + g × h)
2) 64-bit floating-point arithmetic operations

Conclusion (cont’d)
• Problems with using a single multiplier to implement the sum-of-products operation (the multiplier is the arithmetic unit in conventional signal processing applications)
  1) Many clock cycles
     Solution) sum-of-products (hardware parallelism)
  2) Separate arithmetic units for different arithmetic functions (such as squarer and multiply-add)
     Solution) sum-of-products (multi-functional arithmetic unit)
  3) Input data with a large dynamic range
     Solution) sum-of-products (SIMD and approximate operation)

Conclusion (cont’d)
• Proposed sum-of-products unit
  – Technology: Samsung 65nm standard cell; ISA: ARM7
  – PPR array: 1. 4-level UL LR array, 2. LR array w/ voltage islands
  – Carry-propagate adder: modified CSELA with add-one circuit (CRA and CLA)
• Compared to a single multiplier:
  – Execution time: 35~45% decrease
  – Energy: 15~40% increase
  – Less energy when the execution time is the same.

[Figure: x, y and a, b → two PPR arrays → [4:2] adder → carry-propagate adder → a×b+x×y]

Conclusion (cont’d)
• Proposed multifunctional sum-of-products unit
• Compared to a baseline sum-of-products unit:
  – Power: 35 ~ 50% decrease (except for sum-of-products)
  – Delay: 10 ~ 20% increase
  – Better power-delay product

Conclusion
• Proposed sum-of-products unit supporting SIMD and approximate operation
• Compared to a baseline sum-of-products unit:
  – Power: SIMD 50% decrease, APPR 30% decrease
  – Delay: SIMD 30% decrease, APPR 10% increase
  – Better power-delay product

Acknowledgement
• Prof. Ercegovac
• Prof. Cong
• Prof. Tamir
• Prof. Marković

Acknowledgement
• With fiancé
• With family

Thank you.

References
[1] W. Suntiamorntut, Energy efficient functional unit for a parallel asynchronous DSP, Ph.D. dissertation, University of Manchester, 2005.
[2] J. M. Rabaey, Low power design essentials, Springer, 2009.
[3] G. K. Yeap, Practical low power digital VLSI design, Kluwer Academic Publishers, 1998.
[4] S. W. Heo, S. J. Huh, and M. D. Ercegovac, "Power optimization of sum-of-products design for signal processing applications," in Proc. ASAP, Jun. 2013, pp. 192–197.
[5] M. D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann, 2004.
[6] Z. Huang and M. D. Ercegovac, "Low power array multiplier design by topology optimization," in Proc. SPIE Advanced Signal Processing Algorithms, Architectures, and Implementations XII, vol. 4791, July 2002, pp. 424–435.
[7] Z. Huang, High-level optimization techniques for low power multiplier design, Ph.D. dissertation, University of California at Los Angeles, 2004.
[8] U. Ko, P. T. Balsara, and W. Lee, "A self-timed method to minimize spurious transitions in low power CMOS circuits," in Proc. ISLPED, Oct. 1994, pp. 62–63.
[9] D. E. Lackey, P. S. Zuchowski, T. R. Bednar, D. W. Stout, S. W. Gould, and J. M. Cohn, "Managing power and performance for system-on-chip designs using voltage islands," in Proc. ICCAD, Nov. 2002, pp. 195–202.
[10] S. W. Heo, S. J. Huh, and M. D. Ercegovac, "Power optimization in a parallel multiplier using voltage islands," in Proc. ISCAS, May 2013, pp. 345–348.
[11] M. J. Schulte, L. P. Marquette, S. Krithivasan, E. G. Walters, and J. Glossner, "Combined multiplication and sum-of-squares units," in Proc. ASAP, Jun. 2003, pp. 204–214.
[12] R. Goering, "Low power Design," Special technology report, SCD source, Sep. 2008.
[13] M. J. Schulte, "Reduced power dissipation through truncated multiplication," in Proc. IEEE Alessandro Volta Memorial Workshop on Low-Power Design, Mar. 1999, pp. 61–69.