ISSCC 2006 / SESSION 34 / SRAM / 34.2
A 4.2GHz 0.3mm² 256kb Dual-Vcc SRAM Building Block in 65nm CMOS
Muhammad Khellah, Nam Sung Kim, Jason Howard, Greg Ruhl, Murad Sunna, Yibin Ye, James Tschanz, Dinesh Somasekhar, Nitin Borkar, Fatih Hamzaoglu, Gunjan Pandya, Ali Farhang, Kevin Zhang, Vivek De

Intel, Hillsboro, OR

For conventional single-VCC processors, the minimum VCC (Vmin) allowed for dynamic VCC and frequency (DVF) during active operation is dictated primarily by the read/write margins of SRAM cells in the last-level cache (LLC). Vmin in standby modes requiring fast reactivation is set mainly by the minimum voltage required for data retention in the LLC rather than by soft error rate, since error detection and correction are used in LLCs. Single-VCC processors have to use relatively large SRAM cells in the LLC to achieve the low Vmin values needed to meet energy-efficiency goals. As a result, die area increases or the amount of on-die LLC is reduced, adversely impacting cost/performance.

A dual-VCC 512kb SRAM macro featuring active power management with autonomous compensation of PVT variation and aging impacts is designed and fabricated (Fig. 34.2.7) in a 65nm CMOS technology. This design enables high-density SRAM (136Mb/cm²) for 64 to 256Mb LLC arrays in dual-VCC processors while providing low active/standby Vmin of 0.7/0.6V (Fig. 34.2.1). In contrast, a conventional single-VCC design in the same technology can achieve only 95Mb/cm² LLC array density at 0.7/0.6V Vmin, or only 1.1/1V Vmin at 136Mb/cm² density.

A 256kb SRAM block consisting of four 64kb sub-arrays is optimally partitioned into two different voltage domains. The fixed high-VCC region (VLLC), operating at the reliability-limited maximum VCC (Vmax) of 1.2V, allows usage of the smallest possible SRAM cell whose active Vmin is close to Vmax, thus maximizing the bit density.
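As a rough worked example of the density figures quoted earlier (136Mb/cm² for this design versus 95Mb/cm² for a single-VCC design at the same 0.7/0.6V Vmin), the sketch below estimates raw LLC array area. It assumes that area is simply capacity divided by bit density and ignores tags, logic, and routing; the function names are illustrative and do not come from the paper.

```python
# Raw-array area estimate from the bit densities quoted in the text.
# Assumption: area = capacity / density, with no tag, logic, or routing overhead.

DUAL_VCC_DENSITY = 136.0    # Mb/cm^2, this design at 0.7/0.6V Vmin
SINGLE_VCC_DENSITY = 95.0   # Mb/cm^2, single-VCC design at 0.7/0.6V Vmin


def array_area_cm2(capacity_mb, density_mb_per_cm2):
    """Raw array area in cm^2 for a given capacity (Mb) and density (Mb/cm^2)."""
    return capacity_mb / density_mb_per_cm2


def area_saving_fraction(high_density, low_density):
    """Fractional area saved by the denser array at equal capacity."""
    return 1.0 - low_density / high_density


if __name__ == "__main__":
    for capacity in (64, 256):  # LLC sizes named in the text, in Mb
        dual = array_area_cm2(capacity, DUAL_VCC_DENSITY)
        single = array_area_cm2(capacity, SINGLE_VCC_DENSITY)
        print(f"{capacity}Mb LLC: dual-VCC {dual:.3f} cm^2, single-VCC {single:.3f} cm^2")
    saving = area_saving_fraction(DUAL_VCC_DENSITY, SINGLE_VCC_DENSITY)
    print(f"raw array area saving: {saving:.1%}")
```

Under this simplification the saving depends only on the density ratio, so it is the same for both LLC sizes.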
The variable core-VCC region (VCORE), which shares VCC with the rest of the processor core and is designed for ultra-low-voltage operation, uses DVF across the 0.7 to 1.2V active VCC and 0.6V standby VCC to achieve the best energy efficiency while minimizing performance impacts. Minimal usage of explicit low-to-high VCC level shifters, enabled by the optimal dual-VCC partitioning, and use of dc-power-free embedded level converters in the WL and write drivers help minimize impacts on area, delay, and power (Fig. 34.2.2). This design is simpler than previously reported dual-VCC SRAMs [1] and provides an optimal balance between array density and energy efficiency. In addition, it incurs lower overheads in area, power, delay, metal resources, and VCC distribution, especially compared to techniques that use row- or column-based cell VCC switching to improve SRAM read/write margins.

Dynamic sleep transistors, active virtual-ground clamps, and dynamically programmable reference (VREF) voltages (Fig. 34.2.3) minimize active and standby power of the LLC under all conditions during the lifetime of the processor by autonomously and continuously tracking and compensating for changes in transistor characteristics, leakage currents, and Vmin values induced by within-die (WID) and die-to-die (D2D) PVT variations and aging. Virtual-ground control in this design is more accurate than passive gated-MOS diode clamps [2] or active replica-cell bias clamps [3] across extremes of PVT variations and aging. While programmable bias transistors [4] can compensate for P variation impacts to some extent, they incur significant silicon calibration overheads and are not as effective against V and T variations. In addition, the proposed active clamping has lower area and power overheads because no additional bias transistors are used for clamping. Instead, the gate bias of a portion of the sleep device itself is controlled to provide clamping.
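The tracking behavior described above can be illustrated with a first-order behavioral sketch. It is not the circuit of Fig. 34.2.3: it assumes the clamped portion of the sleep device acts as an adjustable conductance, that sub-array leakage is a constant current, and that the amplifier acts as a simple integrator. All numeric values (leakage currents, loop gain, starting conductance) are hypothetical and chosen only so that the loop settles.

```python
# Behavioral sketch of an active virtual-ground clamp (illustrative only).
# Assumptions: constant leakage current, clamp device modeled as a conductance,
# feedback amplifier modeled as a discrete-time integrator.


def settle_vvss(leakage_a, vref_v, gain=0.01, g_start_s=0.05, steps=500):
    """Return the virtual-ground voltage after the feedback loop settles."""
    g = g_start_s                      # clamp conductance in siemens (hypothetical start)
    for _ in range(steps):
        vvss = leakage_a / g           # virtual ground rises with leakage, falls with conductance
        g += gain * (vvss - vref_v)    # strengthen the clamp if VVSS is above VREF, weaken it if below
    return leakage_a / g


if __name__ == "__main__":
    vllc = 1.2                         # fixed LLC supply from the text
    standby_vmin = 0.9                 # example standby Vmin used in the figures
    vref = vllc - standby_vmin         # VREF = VLLC - standby Vmin
    for leakage in (2e-3, 5e-3, 8e-3): # hypothetical leakage spread across PVT and aging
        vvss = settle_vvss(leakage, vref)
        print(f"leakage {leakage * 1e3:.0f} mA -> VVSS {vvss:.4f} V, cell voltage {vllc - vvss:.4f} V")
```

In this toy model the settled VVSS equals VREF regardless of the leakage value, which is the property the text attributes to active clamping; a fixed passive bias would instead let VVSS move with leakage.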
Clock gating and programmable deactivation times, derived from benchmark access patterns or leakage currents, are used to further reduce the LLC power. Setting VLLC-VREF close to zero minimizes sub-array leakage power when data retention is not required or a faulty sub-array needs to be disabled. Interfaces between active and inactive regions are designed to eliminate dc power consumption in circuits at active-region boundaries. An early wake-up scheme is used where, based on block request, portions of sleep transistors in all sub-arrays in a block are activated, along with timers and decoders, one cycle before full activation of the selected sub-array. This helps minimize short-circuit currents in SRAM cells and ground-bounce noise due to sudden discharge of the virtual ground.

The chip supports (1) a programmable weak-write-test-mode (WWTM) to measure cell stability; (2) instruction/data control that provides a range of data storage, access patterns, and read/write/idle sequences, all programmable via scan; (3) programmable sleep-transistor width to measure power-area-noise trade-offs; and (4) on-chip circuits to inject ground noise of programmable amplitude, plus tunable supply impedances, needed to measure impacts on cell stability.

Measurements are performed for multiple dies on a 12-inch wafer for 0.6 to 1.4V VCC and 25°C to 85°C. The dual-VCC design operates at 2.3 to 4.2GHz for 0.7 to 1.2V variable VCORE and 1.2V fixed VLLC, and consumes 16 to 29mW active power at 85°C per 256kb block (Fig. 34.2.4). Sleep transistors reduce leakage power of idle sub-arrays by 30 to 60% at 0.8 to 1V standby Vmin, for 2% area overhead. Leakage of sub-arrays that are disabled or do not need to retain data is reduced by 75% in a typical die. Direct measurements of the virtual-ground voltage (VVSS) across wide ranges of PVT and aging demonstrate clamping accuracy within a few mV of the VREF setting (Fig. 34.2.5).
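To see why clamping accuracy matters, the arithmetic below shows how a worst-case VVSS error must be budgeted into the VREF setting so that the cell voltage VLLC - VVSS never drops below the standby Vmin. It is a minimal sketch: the 3mV figure is a hypothetical stand-in for "a few mV", the 100mV figure is an illustrative larger error, and the function names are not from the paper.

```python
# Guard-banding VREF against clamp error (illustrative arithmetic).
# Assumption: VVSS can sit above VREF by at most `worst_case_error_v`.


def guard_banded_vref(vllc_v, standby_vmin_v, worst_case_error_v):
    """Highest VREF that keeps VLLC - VVSS >= standby Vmin under the worst-case error."""
    return vllc_v - standby_vmin_v - worst_case_error_v


def nominal_cell_voltage(vllc_v, vref_v):
    """Voltage across the cells when VVSS sits exactly at VREF."""
    return vllc_v - vref_v


if __name__ == "__main__":
    vllc, standby_vmin = 1.2, 0.9
    for error in (0.003, 0.100):       # hypothetical clamp errors in volts
        vref = guard_banded_vref(vllc, standby_vmin, error)
        vcell = nominal_cell_voltage(vllc, vref)
        excess_mv = (vcell - standby_vmin) * 1e3
        print(f"error {error * 1e3:.0f} mV -> VREF {vref:.3f} V, "
              f"nominal cell voltage {vcell:.3f} V ({excess_mv:.0f} mV above standby Vmin)")
```

Every millivolt of guard band is standby voltage, and therefore leakage, that the sleep transistors cannot remove, so a tighter clamp lets VREF sit closer to VLLC minus the standby Vmin.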
In contrast, 100 to 400mV deviations in VVSS, across PV corners alone, have been reported for other sleep-transistor and clamping schemes [2, 3, 4]. Since the difference between active VCC and standby Vmin is around 100 to 200mV, accurate and efficient control of VVSS across changes in transistor characteristics and leakage current is critical for effective LLC power reduction by sleep transistors, especially for processors in high-volume manufacturing. The proposed active-clamping technique demonstrates 14 to 24% smaller leakage power than passive bias transistor schemes [4] across large V and T variations while guaranteeing data retention. Changes in Vmin due to WID and D2D PT variations and aging can be tracked easily by autonomous reprogramming of the VREF settings at sub-array, block, or array levels, to further boost sleep-transistor effectiveness for power reduction in the LLC.

Dynamic sleep transistors reduce the total active power, including the power overheads of turning sleep transistors ON/OFF, by 23% for 1% activity (Fig. 34.2.6). Power/area overheads of the clamping circuit and level shifters are 4%/1%. Delay impacts of level shifters are easily absorbed in timing slacks. Finally, this SRAM macro can improve the energy efficiency of a dual-VCC microprocessor containing 64Mb (256Mb) LLC by 35% (23%) compared to a single-VCC design, with minimal impact on performance or die area.

Acknowledgments:
The authors thank D. Finan and K. Ikeda for chip implementation, and M. Haycock, C. Webb, and S. Borkar for encouragement and support.

References:
[1] K. Zhang et al., “A 3GHz 70Mb SRAM in 65nm CMOS Technology with Integrated Column-Based Dynamic Power Supply,” ISSCC Dig. Tech. Papers, pp. 474-475, Feb. 2005.
[2] A. Bhavnagarwala et al., “A Pico-Joule Class, 1GHz, 32kByte X 64b DSP SRAM with Self Reverse Bias,” Symp. VLSI Circuits, pp. 251-252, Jun. 2003.
[3] Y. Takeyama et al., “A Low Leakage SRAM Macro with Replica Cell Biasing Scheme,” Symp. VLSI Circuits, pp. 166-167, Jun. 2005.
[4] K. Zhang et al., “SRAM Design on 65-nm CMOS Technology with Dynamic Sleep Transistor for Leakage Reduction,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 895-901, Apr. 2005.
• 2006 IEEE International Solid-State Circuits Conference
1-4244-0079-1/06 ©2006 IEEE
ISSCC 2006 / February 8, 2006 / 2:00 PM
[Figure: block diagrams of a single-Vcc and a dual-Vcc processor (uP core, logic, and L0 to L2 caches using the largest cell on VCORE; LLC on VLLC = 1.2V), with a 256kb block built from four 64kb sub-arrays with sleep transistors. Plots: LLC cell stability (WWTM setting) versus Vcc (V) for typical and worst bits against the minimum required stability; active and standby Vmin (V) versus bit density (Mb/cm²) at 95, 112, 123, and 136Mb/cm², with this work marked.]
Figure 34.2.1: Single-Vcc and dual-Vcc processors.

[Figure: schematics of an explicit level shifter (VCORE in, VLLC out), single-Vcc and dual-Vcc WL drivers, and single-Vcc and dual-Vcc write drivers, showing the embedded level shifters for the WL and write drivers.]
Figure 34.2.2: Comparison of explicit and embedded level shifters.
[Figure: per-sub-array power management. Sleep devices and an active clamp circuit (complementary folded-cascode opamp) hold VVSSi = VREF, where VREF = VLLC - standby VMIN comes from a programmable reference generator (externally supplied in the chip implementation). Clock gating, early wake-up, and an 8-bit down counter provide programmable deactivation. Timing diagram: CLK, RD/WR, wake, WL, load/cnt, and tc, showing the deactivation delay.]
Figure 34.2.3: Power management scheme featuring autonomous clamp.

[Figure: frequency (GHz) and power (mW) versus VCORE (V) at VLLC = 1.2V, T = 85°C; leakage power (mW) versus VLLC - VREF (V) at T = 25°C and T = 85°C, with standby VMIN = 0.8V, standby VMIN = 1.0V, and the sub-array disabled/no-retention case marked.]
Figure 34.2.4: Measured frequency and power of 256kb SRAM block.

[Figure: tolerance to process variation (VLLC - VVSS and leakage at VLLC - VREF = 0.8V and 1.0V); VLLC - VVSS (V) and leakage (mA) versus VLLC (V) and versus T (°C) at standby VMIN = VLLC - VREF = 0.9V, comparing this work with passive bias set for worst-case VLLC = 1V and worst-case T = 85°C, with leakage reductions of 24% and 14% annotated.]

Aging: continuous RD/WR for 24 hours at 100MHz/1.6V/110°C. Standby VMIN = VLLC - VREF = 1.2V - 0.3V = 0.9V.

                    Before aging    After aging    % change
Leakage (mA)        5.002           4.810          -3.8%
VLLC - VVSS (V)     0.912           0.914          0.2%

Figure 34.2.5: Measured virtual-ground voltage sensitivity to PVT and aging.

[Figure: total power (mW) versus activity (0.001 to 1) with sleep disabled and sleep enabled, at VLLC = 1.2V, standby VMIN = 0.9V, T = 85°C, with a 23% reduction annotated.]

                                              Single-Vcc processor    Dual-Vcc with no sleep    Dual-Vcc with sleep (this work)
Normalized total processor power, 64Mb LLC    1                       0.74                      0.65
Normalized total processor power, 256Mb LLC   1                       0.91                      0.77
Normalized cache area                         1                       1.002                     1.032

Figure 34.2.6: Measured power reduction and area overhead.
[Figure: chip block diagram and micrograph. Dual-Vcc 512kb macro (2 x 256kb blocks with sleep transistors and active power management) on VCORE and VLLC; single-Vcc 256kb block without sleep transistor; programmable instruction and data control unit (32b din, 32b dout, 14b address, RD/WR); programmable WWTM and programmable sleep; programmable noise injector; comparator; direct bitline access; tunable off-chip impedance; scan control; external interface; clock circuit.]

Die area             1.375mm²
SRAM cell size       0.57µm²
Process              65nm CMOS
Interconnect         1 poly, 8 metal
Frequency            4.2GHz
Temperature          85°C
Power/256kb block    30mW
Vmax                 1.2V
Testing interface    membrane card
Pad count            30

Figure 34.2.7: Chip block diagram, process characteristics, and chip micrograph.