ISSCC 2006 / SESSION 34 / SRAM / 34.2

A 4.2GHz 0.3mm² 256kb Dual-Vcc SRAM Building Block in 65nm CMOS

Muhammad Khellah, Nam Sung Kim, Jason Howard, Greg Ruhl, Murad Sunna, Yibin Ye, James Tschanz, Dinesh Somasekhar, Nitin Borkar, Fatih Hamzaoglu, Gunjan Pandya, Ali Farhang, Kevin Zhang, Vivek De

Intel, Hillsboro, OR

For conventional single-VCC processors, the minimum VCC (Vmin) allowed for dynamic VCC and frequency (DVF) during active operation is dictated primarily by the read/write margins of SRAM cells in the last-level cache (LLC). Vmin in standby modes that require fast reactivation is set mainly by the minimum voltage required for data retention in the LLC rather than by soft error rate, since error detection and correction are used in LLCs. Single-VCC processors have to use relatively large SRAM cells in the LLC to achieve the low Vmin values needed to meet energy-efficiency goals. As a result, either die area increases or the amount of on-die LLC shrinks, adversely impacting cost/performance.

A dual-VCC 512kb SRAM macro featuring active power management with autonomous compensation of PVT variation and aging impacts is designed and fabricated (Fig. 34.2.7) in a 65nm CMOS technology. This design enables a high-density SRAM of 136Mb/cm², for 64 to 256Mb LLC arrays in dual-VCC processors, while providing low active/standby Vmin of 0.7/0.6V (Fig. 34.2.1). In contrast, a conventional single-VCC design in the same technology can achieve only 95Mb/cm² LLC array density at 0.7/0.6V Vmin, or only 1.1/1V Vmin at 136Mb/cm² density.

A 256kb SRAM block, consisting of four 64kb sub-arrays, is optimally partitioned into two different voltage domains. The fixed high-VCC region (VLLC), operating at the reliability-limited maximum VCC (Vmax) of 1.2V, allows use of the smallest possible SRAM cell, whose active Vmin is close to Vmax, thus maximizing the bit density.
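The array densities quoted above translate directly into LLC array area. The following is a minimal sketch of that arithmetic, assuming area is simply capacity divided by the quoted array density; the function and constant names are illustrative and do not come from the paper.

```python
# Illustrative arithmetic only: array area implied by the quoted densities.
# Assumption: area = capacity / density, ignoring anything outside the array.

DENSITY_DUAL_VCC = 136.0    # Mb/cm^2 at 0.7/0.6V active/standby Vmin (this design)
DENSITY_SINGLE_VCC = 95.0   # Mb/cm^2 at the same Vmin (single-VCC design)


def llc_array_area_cm2(capacity_mb: float, density_mb_per_cm2: float) -> float:
    """Return the array area in cm^2 for a cache of `capacity_mb` megabits."""
    if density_mb_per_cm2 <= 0:
        raise ValueError("density must be positive")
    return capacity_mb / density_mb_per_cm2


def area_saving_fraction(high_density: float, low_density: float) -> float:
    """Fraction of array area saved by moving from low_density to high_density."""
    return 1.0 - low_density / high_density


if __name__ == "__main__":
    for capacity in (64, 256):
        dual = llc_array_area_cm2(capacity, DENSITY_DUAL_VCC)
        single = llc_array_area_cm2(capacity, DENSITY_SINGLE_VCC)
        print(f"{capacity}Mb LLC: dual-VCC {dual:.2f} cm^2, single-VCC {single:.2f} cm^2")
    saving = area_saving_fraction(DENSITY_DUAL_VCC, DENSITY_SINGLE_VCC)
    print(f"array area saving at equal Vmin: {saving:.1%}")
```

Under this assumption the saving is independent of capacity, since it depends only on the ratio of the two densities.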
The variable core-VCC region (VCORE), which shares VCC with the rest of the processor core and is designed for ultra-low-voltage operation, uses DVF across the 0.7 to 1.2V active VCC range and a 0.6V standby VCC to achieve the best energy efficiency while minimizing performance impacts. Minimal usage of explicit low-to-high VCC level shifters, enabled by the optimal dual-VCC partitioning, and use of dc-power-free embedded level converters in the WL and write drivers help minimize impacts on area, delay, and power (Fig. 34.2.2). This design is simpler than previously reported dual-VCC SRAMs [1], and it provides an optimal balance between array density and energy efficiency. In addition, it incurs lower overheads in area, power, delay, metal resources, and VCC distribution, especially compared to techniques that use row- or column-based cell VCC switching to improve SRAM read/write margins.

Dynamic sleep transistors, active virtual-ground clamps, and dynamically programmable reference (VREF) voltages (Fig. 34.2.3) minimize active and standby power of the LLC under all conditions during the lifetime of the processor. They do so by autonomously and continuously tracking and compensating for changes in transistor characteristics, leakage currents, and Vmin values induced by within-die (WID) and die-to-die (D2D) PVT variations and aging.

Virtual-ground control in this design is more accurate than passive gated-MOS diode clamps [2] or active replica-cell bias clamps [3] across extremes of PVT variations and aging. While programmable bias transistors [4] can compensate for P variation impacts to some extent, they incur significant silicon calibration overheads and are not as effective against V and T variations. In addition, because the gate bias of a portion of the sleep device itself is controlled to provide clamping, no additional bias transistors are needed, so the proposed active clamping has lower area and power overheads.
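The relation between the reference voltage and the retention voltage can be made concrete. Fig. 34.2.3 labels the reference as VREF = VLLC - standby VMIN and the clamped virtual ground as VVSSi = VREF, so a sleeping cell sees VLLC - VVSS. The sketch below only encodes that relation; the function names and the example VVSS reading are hypothetical.

```python
# Sketch of the VREF relation shown in Fig. 34.2.3. All names are illustrative.


def vref_setting(vllc: float, standby_vmin: float) -> float:
    """VREF that makes a sleeping cell see exactly standby_vmin across it."""
    if not 0.0 <= standby_vmin <= vllc:
        raise ValueError("standby Vmin must lie between 0 and VLLC")
    return vllc - standby_vmin


def retention_margin_mv(vllc: float, vvss: float, standby_vmin: float) -> float:
    """Voltage across a sleeping cell minus standby Vmin, in millivolts.

    A negative value means the virtual ground has risen too far for retention.
    """
    return ((vllc - vvss) - standby_vmin) * 1000.0


if __name__ == "__main__":
    vllc, standby_vmin = 1.2, 0.9                # values used in Fig. 34.2.5
    vref = vref_setting(vllc, standby_vmin)
    print(f"VREF = {vref:.3f} V")                # 1.2V - 0.9V
    hypothetical_vvss = 0.29                     # assumed reading, not measured data
    margin = retention_margin_mv(vllc, hypothetical_vvss, standby_vmin)
    print(f"retention margin = {margin:.1f} mV")
```

A clamp that lets VVSS drift above VREF produces a negative margin in this model, which is why the accuracy of the virtual-ground control matters for retention.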

Clock gating and programmable deactivation times, derived from benchmark access patterns or leakage currents, are used to further reduce the LLC power. Setting VLLC-VREF close to zero minimizes sub-array leakage power when data retention is not required or a faulty sub-array needs to be disabled. Interfaces between active and inactive regions are designed to eliminate dc power consumption in circuits at active-region boundaries. An early wake-up scheme is used where, based on a block request, portions of the sleep transistors in all sub-arrays in a block are activated, along with timers and decoders, one cycle before full activation of the selected sub-array. This helps minimize short-circuit currents in SRAM cells and ground-bounce noise due to sudden discharge of the virtual ground.

The chip supports (1) a programmable weak-write test mode (WWTM) to measure cell stability; (2) instruction/data control that provides a range of data storage, access patterns, and read/write/idle sequences, all programmable via scan; (3) programmable sleep-transistor width to measure power-area-noise trade-offs; and (4) on-chip circuits to inject ground noise of programmable amplitude, plus tunable supply impedances, needed to measure impacts on cell stability.

Measurements are performed for multiple dies on a 12-inch wafer for 0.6 to 1.4V VCC and 25˚C to 85˚C. The dual-VCC design operates at 2.3 to 4.2GHz for 0.7 to 1.2V variable VCORE and 1.2V fixed VLLC, and consumes 16 to 29mW active power at 85˚C per 256kb block (Fig. 34.2.4). Sleep transistors reduce leakage power of idle sub-arrays by 30 to 60% at 0.8 to 1V standby Vmin, for 2% area overhead. Leakage of sub-arrays that are disabled or do not need to retain data is reduced by 75% in a typical die. Direct measurements of the virtual-ground voltage (VVSS) across wide ranges of PVT and aging demonstrate clamping accuracy within a few mV of the VREF setting (Fig. 34.2.5).
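The programmable deactivation time mentioned earlier in this passage can be pictured with a small behavioral model. Fig. 34.2.3 shows an 8-bit count-down counter for programmable deactivation; the sketch below assumes the counter is reloaded on every access and that the sub-array goes to sleep once it reaches zero. That policy, the class name, and the one-cycle granularity are assumptions for illustration, and the early wake-up path is not modeled.

```python
# Behavioral sketch of a programmable deactivation timer (assumed policy).
# Assumption: an 8-bit down counter reloads on each access; sleep at zero.


class DeactivationTimer:
    def __init__(self, reload_value: int) -> None:
        if not 0 <= reload_value <= 255:
            raise ValueError("reload value must fit in 8 bits")
        self.reload_value = reload_value
        self.count = 0
        self.awake = False

    def tick(self, access: bool) -> bool:
        """Advance one clock cycle; return True if the sub-array is awake."""
        if access:
            # An access wakes the sub-array and restarts the countdown.
            self.count = self.reload_value
            self.awake = True
        elif self.awake:
            if self.count == 0:
                self.awake = False      # countdown expired: enter sleep
            else:
                self.count -= 1
        return self.awake


def awake_cycles(accesses, reload_value: int) -> int:
    """Count cycles spent awake for a sequence of per-cycle access flags."""
    timer = DeactivationTimer(reload_value)
    return sum(1 for access in accesses if timer.tick(access))


if __name__ == "__main__":
    pattern = [True] + [False] * 9      # one access followed by idle cycles
    print(awake_cycles(pattern, reload_value=3))
```

In this model a shorter reload value puts idle sub-arrays to sleep sooner, at the cost of more frequent wake-ups when accesses arrive in bursts.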
By comparison, other sleep-transistor and clamping schemes [2, 3, 4] report VVSS deviations of 100 to 400mV across PV corners alone. Since the difference between active VCC and standby Vmin is only around 100 to 200mV, accurate and efficient control of VVSS across changes in transistor characteristics and leakage current is critical for effective LLC power reduction by sleep transistors, especially for processors in high-volume manufacturing. The proposed active-clamping technique demonstrates 14 to 24% smaller leakage power than passive bias transistor schemes [4] across large V and T variations while guaranteeing data retention. Changes in Vmin due to WID and D2D PT variations and aging can be tracked easily by autonomous reprogramming of the VREF settings at sub-array, block, or array levels, to further boost sleep-transistor effectiveness for power reduction in the LLC.

Dynamic sleep transistors reduce the total active power, including the power overheads of turning sleep transistors ON/OFF, by 23% for 1% activity (Fig. 34.2.6). Power/area overheads of the clamping circuit and level shifters are 4%/1%. Delay impacts of level shifters are easily absorbed in timing slacks. Finally, this SRAM macro can improve the energy efficiency of a dual-VCC microprocessor containing a 64Mb (256Mb) LLC by 35% (23%) compared to a single-VCC design, with minimal impact on performance or die area.

Acknowledgments:
The authors thank D. Finan and K. Ikeda for chip implementation, and M. Haycock, C. Webb, and S. Borkar for encouragement and support.

References:
[1] K. Zhang et al., “A 3GHz 70Mb SRAM in 65nm CMOS Technology with Integrated Column-Based Dynamic Power Supply,” ISSCC Dig. Tech. Papers, pp. 474-475, Feb. 2005.
[2] A. Bhavnagarwala et al., “A Pico-Joule Class, 1GHz, 32kByte X 64b DSP SRAM with Self Reverse Bias,” Symp. VLSI Circuits, pp. 251-252, Jun. 2003.
[3] Y. Takeyama et al., “A Low Leakage SRAM Macro with Replica Cell Biasing Scheme,” Symp. VLSI Circuits, pp. 166-167, Jun. 2005.
[4] K. Zhang et al., “SRAM Design on 65-nm CMOS Technology with Dynamic Sleep Transistor for Leakage Reduction,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 895-901, Apr. 2005.

• 2006 IEEE International Solid-State Circuits Conference

1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / February 8, 2006 / 2:00 PM

Figure 34.2.1: Single-Vcc and dual-Vcc processors.
[Panels: dual-Vcc processor floorplan with the µP core (logic and L0 to L2 caches using the largest cell) on VCORE = 0.7V (0.6V in standby) and the LLC cells on VLLC = 1.2V, with TAG & logic; 256kb block of four 64kb sub-arrays with wl-dec, pch & io, timer, and sleep transistors; LLC cell stability (WWTM setting) versus Vcc (V) for the typical bit and worst bit against the minimum required stability; active Vmin and standby Vmin (V) versus bit density (Mb/cm²) at 95, 112, 123, and 136 Mb/cm², with this work marked.]

Figure 34.2.2: Comparison of explicit and embedded level shifters.
[Panels: explicit level shifter between VCORE and VLLC; single-Vcc WL driver and dual-Vcc WL driver; single-Vcc write driver and dual-Vcc write driver, with embedded level shifters for the WL and write drivers.]

Figure 34.2.3: Power management scheme featuring autonomous clamp.
[Panels: active clamp circuit for sub-array i, using a complementary folded-cascode opamp and sleep devices (portions labeled 85%, 14%, and 1%) so that VVSSi = VREF; reference generator with VREF = VLLC - standby VMIN (programmable generator, externally supplied in the chip implementation); 8-bit count-down counter for programmable deactivation; clock gating; early wake-up of the timer and decoder for block clock gating; timing diagram of CLK, RD/WR, wake, WL, load/cnt, and tc showing the deactivation delay.]

Figure 34.2.4: Measured frequency and power of 256kb SRAM block.
[Panels: frequency (GHz) and power (mW) versus VCORE (V) at VLLC = 1.2V and T = 85˚C; leakage power (mW) versus VLLC - VREF (V) at T = 25˚C and T = 85˚C, with standby VMIN = 0.8V, standby VMIN = 1.0V, and sub-array disabled/no retention marked.]

Figure 34.2.5: Measured virtual-ground voltage sensitivity to PVT and aging.
[Panels: leakage (mA) and VLLC - VVSS (V) versus VLLC (V), and versus T (˚C) at VLLC = 1.2V, for this work and for passive bias, with standby VMIN = VLLC - VREF = 0.9V; leakage annotations of -24% (bias set for worst-case VLLC = 1V) and -14% (bias set for worst-case T = 85˚C).]

Tolerance to process variation (P):
VLLC - VREF (V)    0.8      1.0
VLLC - VVSS (V)    0.799    0.997
Leakage (mA)       1.7      5.9

Aging: continuous RD/WR for 24 hours at 100MHz/1.6V/110˚C. Standby VMIN = VLLC - VREF = 1.2V - 0.3V = 0.9V.
                   Before aging    After aging    % change
Leakage (mA)       5.002           4.810          -3.8%
VLLC - VVSS (V)    0.912           0.914          0.2%

Figure 34.2.6: Measured power reduction and area overhead.
[Panel: total power (mW) versus activity (0.001 to 1) with sleep disabled and sleep enabled, at VLLC = 1.2V, standby VMIN = 0.9V, T = 85˚C, with a -23% annotation. The figure also lists VCORE (V) and VLLC (V) for each configuration.]

                                           Single-Vcc    Dual-Vcc         Dual-Vcc with
                                           processor     with no sleep    sleep (this work)
Normalized total processor power, 64Mb     1             0.74             0.65
Normalized total processor power, 256Mb    1             0.91             0.77
Normalized cache area                      1             1.002            1.032
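The energy-efficiency figures quoted in the text (35% for a 64Mb LLC, 23% for 256Mb) can be cross-checked against the normalized-power values in Fig. 34.2.6. The sketch below reads the extracted numbers as three configurations (single-Vcc, dual-Vcc without sleep, dual-Vcc with sleep), which is an interpretation of the figure data; the dictionary layout and function name are illustrative.

```python
# Cross-check of the power savings implied by Fig. 34.2.6 (normalized values).
# Assumption: each configuration lists 64Mb power, then 256Mb power.

NORMALIZED_POWER = {
    "single-Vcc": {"64Mb": 1.0, "256Mb": 1.0},
    "dual-Vcc, no sleep": {"64Mb": 0.74, "256Mb": 0.91},
    "dual-Vcc with sleep": {"64Mb": 0.65, "256Mb": 0.77},
}


def saving_percent(normalized_power: float) -> float:
    """Power saving relative to the single-Vcc baseline, in percent."""
    return round((1.0 - normalized_power) * 100.0, 1)


if __name__ == "__main__":
    for config, by_size in NORMALIZED_POWER.items():
        for size, power in by_size.items():
            print(f"{config:22s} {size:6s} saving {saving_percent(power):5.1f}%")
```

With these values the dual-Vcc design with sleep transistors comes out at 35.0% and 23.0% for the two cache sizes, matching the percentages stated in the text.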

Figure 34.2.7: Chip block diagram, process characteristics, and chip micrograph.
[Block diagram: dual-Vcc 512kb macro (2 x 256kb blocks with sleep transistors and active power management); single-Vcc 256kb block without sleep transistors; programmable instruction & data control unit (32b din, 32b dout, 14b address, RD/WR); scan control; external interface; clock circuit; programmable noise injector; comparator; tunable off-chip impedance on VCORE and VLLC; programmable WWTM; programmable sleep; direct bitline access.]

Die area             1.375mm²
SRAM cell size       0.57µm²
Process              65nm CMOS
Interconnect         1 poly, 8 metal
Frequency            4.2GHz
Temperature          85˚C
Power/256kb block    30mW
Vmax                 1.2V
Testing interface    membrane card
Pad count            30
