High Performance Processors in a Power Limited World
Sam Naffziger, AMD Senior Fellow
Outline
- Today's processor design landscape
  - Trends
- Issues making designers' lives difficult
  - Power limits
  - Scaling effects
- Design opportunities
  - Circuit level
  - Architectural
- Summary
The All-Consuming Quest for Greater Performance at Lower Cost
- Increasing transistor density
- Increasing performance

Moore's Law has served us well.
Processor Frequency vs. Time
[Chart: MPU frequency (MHz, log scale) vs. time, Jan-97 through Jan-09; growth flattens as it approaches 4 GHz]

The amazing frequency increases of the past decade have leveled off. Why?
- Power limits
- Process issues
Power Consumption Background
Power has always challenged circuit integration; we've been bailed out by technology in the past:
  Bipolar → NMOS → CMOS
(Source: Holt, HotChips 17)
Scaling Background
[Chart: Dennard scaling; feature size (µm) and a "realistic power limit" line vs. time, Jan-85 through Jan-03; source: Horowitz et al., IEDM 2005]
[Chart: 2005 ITRS projections of Vt and Vdd (mV), 2000 through 2010; Vdd drops toward a Vt that barely scales]

Scaling doesn't bail us out any more.
Power Consumption Background
Ways to reduce power; the process guys have had the biggest impact on these:
- Reducing Vdd
- Reducing C_TOT
- Reducing I_LEAK, I_CO
- Reducing α

But now, not only are those improvements fading, the processor designer faces a host of new process-scaling issues:
- Variation
- Voltage droop
- Wire non-scaling

  Switching power:   P ≈ C_TOT · α · F · Vdd²
  Crossover power:   + N_TOT · α · F · Vdd · I_CO
  Leakage power:     + N_ON · I_LEAK · Vdd
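The three terms above can be turned into a quick back-of-the-envelope estimator. This is only a sketch of the slide's power model; every coefficient value in the example is made up for illustration, and a real design would add temperature and workload dependence.

```python
def processor_power(c_tot, n_tot, n_on, alpha, freq, vdd, i_co, i_leak):
    """Total power = switching + crossover + leakage, per the slide's model."""
    switching = c_tot * alpha * freq * vdd ** 2    # P ≈ C_TOT·α·F·Vdd²
    crossover = n_tot * alpha * freq * vdd * i_co  # + N_TOT·α·F·Vdd·I_CO
    leakage = n_on * i_leak * vdd                  # + N_ON·I_LEAK·Vdd
    return switching + crossover + leakage

# Illustrative (made-up) coefficients: with switching power dominant,
# dropping Vdd 15% cuts power roughly quadratically (~28%).
p_hi = processor_power(1.0, 1.0, 1.0, 0.1, 2.0e9, 1.20, 1e-10, 1e-9)
p_lo = processor_power(1.0, 1.0, 1.0, 0.1, 2.0e9, 1.02, 1e-10, 1e-9)
```

Note the quadratic Vdd dependence of the switching term versus the linear dependence of the crossover and leakage terms; that is what makes voltage the strongest power knob.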
The Silicon Age: Still on a Roll, But …
ITRS Roadmap (source: European Nanoelectronics Initiative Advisory Council, ENIAC)

  High-volume manufacturing   2004  2006  2008  2010  2012  2014  2016  2018
  Technology node (nm)          90    65    45    32    22    16    11     8
  Delay = CV/I scaling         0.7  ~0.7  >0.7   → delay scaling will slow down
  Energy/logic-op scaling    >0.35  >0.5  >0.5   → energy scaling will slow down
  Bulk planar CMOS            high probability   → low probability
  Alternate, 3G etc.          low probability    → high probability
  Variability                 medium → high → very high

But … all this scaling has some nasty side effects.
[Figure: device-technology evolution by node. 2007/65 nm: poly-SiON gate stack, bulk planar PDSOI with stressors. 2010/45 nm: metal/high-k gate stack, 3D FDSOI MuGFET/MuCFET with substrate engineering. 2013/32 nm: plus high-µ materials. Toward 2016/22 nm, with electrostatic control and RC delay as growing concerns.]
Device Variation Reverse-Scales
The problem: atoms don't scale. (Source: Pelgrom, IEEE lecture 5/11/06)
Variations subtract directly off cycle time → power efficiency drops → circuit margins degrade.

One impact of variation is leakage spread:
[Chart: S_IDD (A) vs. Fmax (a.u.) across parts; only ~1.25× spread in Fmax but >3× spread in S_IDD. Chip S_IDD is set by the "smallest" gates; Fmax is set by the slowest gates.]
Scaling Intrinsically Hurts Supply Integrity
[Chart: normalized leakage % and Vdd droop % vs. technology node, 250 nm through 90 nm; both climb steadily. Source: Bose, HotChips 17]

With power per core staying constant but area, voltage, and cycle times all dropping, we have a big challenge: requiring a higher voltage to hit frequency is a quadratic power impact.
Circuit Designer
Some Ways to Shoulder the Variation Burden: Adaptive Clocking
- Programmable delay buffers
- Empirically set the clock edge to optimize frequency
- Higher granularity → more variation tolerance
- LBIST and GA search algorithms show promise for per-part optimization
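As a toy illustration of per-part search (not any specific AMD implementation), the sketch below hill-climbs over programmable delay-buffer settings against a stand-in margin function. In hardware the objective would come from LBIST pass/fail measurements, and a GA could replace the simple hill climb for larger search spaces.

```python
import random

def timing_margin(settings, skew):
    """Stand-in for an LBIST-derived measurement: margin peaks when each
    buffer setting cancels that clock domain's (unknown) process skew."""
    return -sum((s - k) ** 2 for s, k in zip(settings, skew))

def tune_buffers(n_domains=4, steps=200, seed=0):
    rng = random.Random(seed)
    skew = [rng.randint(-3, 3) for _ in range(n_domains)]  # per-part variation
    settings = [0] * n_domains                             # nominal fused setting
    best = timing_margin(settings, skew)
    for _ in range(steps):
        trial = settings.copy()
        trial[rng.randrange(n_domains)] += rng.choice([-1, 1])  # nudge one buffer
        m = timing_margin(trial, skew)
        if m > best:                                       # keep only improvements
            settings, best = trial, m
    return best, timing_margin([0] * n_domains, skew)      # tuned vs. nominal

tuned, nominal = tune_buffers()
```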
Some Ways to Shoulder the Variation Burden: Self-Healing Designs
- Simplest example is cache ECC on memory arrays
- Next level is Intel's Pellston technology, implemented on Montecito and Tulsa
  - Disable defective lines detected by multiple ECC errors
Future directions involve self-checking with redundant logic and retry:
- Predict the result through parity, residues, or redundant logic
- On an error, replay the calculation before committing architectural state
- If the replay is correct, it was a transient error (particle strike, Vdd droop, random noise coupling, etc.)
- If incorrect, reduce frequency, increase voltage, or retry with an alternate execution path
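A minimal sketch of the predict-and-replay idea, using a mod-3 residue check on a multiplier. The `checked_multiply` and `make_flaky_alu` names are hypothetical, purely for illustration; they model the control flow, not a hardware interface.

```python
def residue(x, m=3):
    return x % m

def checked_multiply(a, b, alu, max_retries=2):
    """Run a (possibly faulty) multiplier and verify the result with a mod-3
    residue check; on a mismatch, replay before 'committing' the result."""
    for attempt in range(max_retries + 1):
        result = alu(a, b)
        # Residue arithmetic: residue(a*b) == residue(residue(a) * residue(b))
        if residue(result) == residue(residue(a) * residue(b)):
            return result, attempt       # attempt > 0 → a transient was caught
    raise RuntimeError("persistent error: reduce frequency or raise voltage")

def make_flaky_alu():
    """A multiplier that glitches once (e.g. a particle strike), then recovers."""
    state = {"glitched": False}
    def alu(a, b):
        if not state["glitched"]:
            state["glitched"] = True
            return a * b + 1             # transient single-event upset
        return a * b
    return alu

result, retries = checked_multiply(123, 456, make_flaky_alu())
```

Residue checks are cheap but not airtight: an error that happens to preserve the value mod 3 slips through, which is why real schemes combine parity, residues, and redundant logic.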
Some Ways to Shoulder the Variation Burden: Self-Healing Designs
[Figure from Fall Microprocessor Forum 2006]
Adaptive Supply Voltage
Per-part and dynamic voltage management are key. More range flexibility and finer-grain response will provide differentiation.
[Chart: energy/operation surface vs. channel length (short to long) and Vdd (low to high)]
Integrated Power and Thermal Management
"Fuse and forget" is no longer viable: too much variation in environment, manufacturing, and operating conditions. Some means of dynamic optimization is needed.
[Chart: S_IDD (A) vs. Fmax (a.u.) part-to-part distribution]

An autonomous programmable controller enables real-time optimizations. An embedded controller provides the needed flexibility:
- OS interfacing
- Multi-core management
- Per-part optimization
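What such an embedded controller does can be caricatured in a few lines: a sampling loop that steps a P-state index against a temperature limit, with hysteresis so it does not oscillate. This is a toy model under assumed thresholds; a real controller also weighs power, droop, and OS requests.

```python
def thermal_controller(temps, t_limit=95, n_pstates=6):
    """Toy control loop: one P-state step per sample, throttling above the
    limit and restoring performance once comfortably below it."""
    pstate = 0                          # P0 = fastest state
    trace = []
    for t in temps:
        if t > t_limit and pstate < n_pstates - 1:
            pstate += 1                 # throttle to a slower, lower-V state
        elif t < t_limit - 5 and pstate > 0:
            pstate -= 1                 # headroom available: speed back up
        trace.append(pstate)
    return trace

# Temperature ramps past the limit, then cools: throttle up, then back down.
trace = thermal_controller([80, 97, 99, 96, 92, 88, 85, 80])
```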
Chip Architect
Traversing the Power Contour
[Chart: power-consumption surface vs. channel length (short to long) and Vdd (low to high)]

  Switching power:   P ≈ C_TOT · α · F · Vdd²
  Crossover power:   + N_TOT · α · F · Vdd · I_CO
  Leakage power:     + N_ON · I_LEAK · Vdd
Traversing the Power Contour
[Chart: frequency surface vs. channel length (short to long) and Vdd (low to high)]
Traversing the Power Contour for a Given Implementation
[Chart: energy/operation surface vs. channel length and Vdd, marking both the max-performance point and the best-power-efficiency point]
For Comparing Architectural Efficiency, Performance³/W Is Most Effective
[Chart: normalized Performance³/W (0 to 1.2) across channel length and Vdd]
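One way to see why Performance³/W isolates the architecture from the operating point: if frequency tracks Vdd and dynamic power dominates, then perf ∝ V while power ∝ f·Vdd² ∝ V³, so the metric stays constant as a design slides along its own V/F curve. A sketch under exactly those idealized assumptions (the constants and the f ∝ V model are simplifications, not process data):

```python
def perf_cubed_per_watt(arch_ipc, vdd, k_cap=1.0, k_freq=1.0):
    """Toy model: f ∝ Vdd, perf = IPC * f, dynamic power ∝ C * f * Vdd^2."""
    freq = k_freq * vdd
    perf = arch_ipc * freq
    power = k_cap * freq * vdd ** 2
    return perf ** 3 / power

# Same architecture at three operating points: the metric is unchanged,
# so differences in it reflect the architecture, not the chosen V/F point.
samples = [perf_cubed_per_watt(1.5, v) for v in (0.9, 1.1, 1.3)]
```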
Optimal Pipeline Depth
[Chart: pipeline stages per design, ranging from roughly 30+ down to the mid-teens: Willamette (2000), Prescott (2004), Core 2 (2006), Opteron, Power6; the moderate-depth designs are ~2× more power-efficient]
- The industry has been moving from "hyperpipelining" with short pipe stages to something more moderate
A Look at Mobile Processor Power

A Look at Mobile System Power
[Chart: mobile system power breakdown (CPU, memory controller, chipset, memory, rest of system), comparing TDP against average power]

If a laptop burned TDP power all the time, battery life would be measured in minutes. How do we get mobile average power so much lower than TDP?
The Answer: Take Advantage of Typically Low CPU Utilization
[Chart: processor utilization during normal laptop usage as approximated by MobileMark 2002 (Tj 95, 1800 MHz, 1.35 V; C1, PowerNow!, and C3 enabled); core power and IO power (W, 0 to 35) vs. time (0 to 70 min) across workloads: Adobe Flash, MS Outlook, MS Word, MS PowerPoint, MS Excel, and MS Word with Netscape; core power is spiky and mostly low]
Reducing Power and Cooling Requirements with Processor Performance States

  P-state   Frequency   Voltage   Average CPU core power (measured at CPU)
  P0        2600 MHz    1.40 V    ~95 watts   (HIGH)
  P1        2400 MHz    1.35 V    ~90 watts
  P2        2200 MHz    1.30 V    ~76 watts
  P3        2000 MHz    1.25 V    ~65 watts
  P4        1800 MHz    1.20 V    ~55 watts
  P5        1000 MHz    1.10 V    ~32 watts   (LOW)

[Chart: power (W) with AMD PowerNow!™ enabled vs. disabled at three utilization points: 10,500 connections (~62% CPU utilization): -33%; 5,000 connections (~40% utilization): -62%; idle in OS: -75%]

Up to 75% power savings (at idle)!
Additionally, "C-states" reduce power further by cutting clocks completely and dropping voltage to retention levels.
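A hypothetical governor over the P-state table above shows how low utilization turns into low average power: pick the slowest state whose frequency still covers current demand plus some headroom. The 20% headroom and the policy itself are assumptions for illustration, not AMD PowerNow! internals.

```python
# (name, frequency MHz, voltage V, approx. power W), per the slide's table
PSTATES = [
    ("P0", 2600, 1.40, 95), ("P1", 2400, 1.35, 90), ("P2", 2200, 1.30, 76),
    ("P3", 2000, 1.25, 65), ("P4", 1800, 1.20, 55), ("P5", 1000, 1.10, 32),
]

def pick_pstate(utilization, cur_mhz=2600, headroom=1.2):
    """Choose the slowest P-state whose frequency covers current demand."""
    demand_mhz = utilization * cur_mhz * headroom
    for name, mhz, vdd, watts in reversed(PSTATES):  # try slowest first
        if mhz >= demand_mhz:
            return name, watts
    return PSTATES[0][0], PSTATES[0][3]              # saturated: run flat out

state, watts = pick_pstate(0.10)   # light load, like most of MobileMark
```

At MobileMark-like ~10% utilization the governor sits in P5 at ~32 W rather than P0 at ~95 W, which is exactly why average power lands so far below TDP.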
Improving Peak Performance per Watt

Adding Features to Increase Performance
[Chart: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log; source: Horowitz et al., IEDM 2005]
- Increasing execution efficiency has historically hurt power efficiency
- However, the cubic reduction of power with V/F scaling has tended to make this a good tradeoff
Adding Features to Increase Performance Works with V/F Scaling
[Chart: normalized energy/op, frequency, and performance vs. IPC (0.5 to 2.0); as IPC grows, frequency drops but performance keeps rising]

Voltage scaling has its limits:
→ More power-efficient designs have an advantage
→ High-power designs get penalized due to higher di/dt, higher temperatures, etc.
If we hit V_MIN, however, the game is over.
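The shape of that chart can be reproduced with a toy constant-power model, under loudly stated assumptions: f ∝ V, dynamic power ∝ C·f·Vdd² ∝ C·f³, and switched capacitance growing linearly with IPC (an assumption; real cost curves may be steeper or shallower). Frequency must then fall as IPC features are added, but performance = IPC·f still rises, as long as the implied voltage stays above V_MIN:

```python
def constant_power_point(ipc, power=1.0):
    """C ∝ IPC (assumed); P ∝ C·f³ with f ∝ V, so f = (P/IPC)^(1/3)."""
    freq = (power / ipc) ** (1 / 3)
    return freq, ipc * freq        # (frequency, performance), normalized

points = [constant_power_point(ipc) for ipc in (0.5, 1.0, 1.5, 2.0)]
freqs = [f for f, _ in points]     # monotonically falling
perfs = [p for _, p in points]     # monotonically rising
```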
How Hard Is Improving Existing Processors?
Current- and next-generation core comparison:
[Chart: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log; source: Horowitz et al., IEDM 2005; peak performance costs more energy/operation]
[Chart: switched capacitance (logic plus clock) at Gen1 peak vs. Gen2 peak; despite more features, the next-gen core has substantially lower cap]

- Most of the big-hitter improvements have been heavily mined already
- Next-generation AMD cores have >>50% of clocks gated off, even for high-power code
Multi-Core to the Rescue?

               One core + cache   Two cores + cache
  Voltage             1                 0.85
  Frequency           1                 0.85
  Area                1                 2
  Power               1                 1
  Perf                1                ≈1.7
  Perf/Watt           1                ≈1.7

Sounds like a great story; what's the catch?
Multi-Core to the Rescue? Some of the catches:
- What if you're already at V_MIN? Then you need to cut frequency in half to stay within the power limit
- How much parallelizable code is really out there?
- More compute capacity means more IO and memory bandwidth demands …
[Inset: the one-core vs. two-core comparison repeated; two cores at 0.85× voltage and frequency give power 1, perf ≈1.7]
Multi-Core Issues: Amdahl's Law
There is almost always a portion of an application that cannot be parallelized:
- This portion becomes a bottleneck as the number of threads is increased
- A typical value is in the range of 10%

[Chart: multi-core speedup, 1 to 8 cores, with serial code and constant power considered; curves for 0%, 5%, 10%, 15%, and 20% serial fractions against the ideal linear curve]

Just 10% serial code drops the 8-core performance improvement by 41%.
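The 41% figure follows directly from Amdahl's law:

```python
def amdahl_speedup(cores, serial_fraction):
    """Amdahl's law: the serial portion never speeds up."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# The slide's claim: 10% serial code vs. ideal scaling on 8 cores.
actual = amdahl_speedup(8, 0.10)       # ≈ 4.7x instead of 8x
shortfall = 1.0 - actual / 8.0         # ≈ 41% below ideal
```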
Multi-Core Issues: IO Power
All those extra cores need their own data …
- IO power in terms of W/Gb/s has been pretty constant, in the range of 20 mW, for years

[Chart: multi-core speedup, 1 to 8 cores, with serial code, constant power, and IO power considered; curves for 0% to 20% serial fractions against ideal]

- If we increase IO power accordingly but hold total chip power constant with V/F scaling, things get worse
- Overall performance drops by another 10% or so …
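The IO effect can be bolted onto an Amdahl model under loudly stated assumptions: per-core IO power is a fixed slice of a fixed chip power budget, and core frequency scales as the cube root of core power (f ∝ V, P ∝ f³). The 2% per-core IO share below is illustrative and is not the slide's model, so the exact derating differs; the direction of the effect is the point.

```python
def speedup_with_io(cores, serial=0.10, p_total=1.0, p_io_per_core=0.02):
    """Amdahl speedup, derated because IO power eats the fixed power budget."""
    amdahl = 1.0 / (serial + (1.0 - serial) / cores)
    budget = p_total - cores * p_io_per_core      # power left for the cores
    baseline = p_total - p_io_per_core            # single-core power budget
    freq_scale = (budget / baseline) ** (1 / 3)   # V/F scaling: f ∝ P^(1/3)
    return amdahl * freq_scale

with_io = speedup_with_io(8)
without_io = speedup_with_io(8, p_io_per_core=0.0)   # pure Amdahl
```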
The Transition to Parallel Applications

Single-threaded applications:
- Most of today's applications
- Well-understood optimization techniques
- Advanced development, analysis, and debug tools
- Conceptually easy to think about

Parallel applications:
- Small number of applications (worked by experts for 10+ yrs)
- Awkward development, analysis, and debug environments
- Parallel programming is hard! Amdahl's law is still a law…
- SW productivity is already in a crisis → this worsens things!

Establishing an appropriate balance is key for managing this important transition.
Other Architectural Directions: Integration

Not only does the integration of more system components (e.g. memory controllers, IO) improve performance; integration reduces power significantly as well:
- IO communication overhead drops
- CPU-integrated power management can dynamically optimize
- The power efficiency of special-function components (e.g. graphics accelerators, network processors) greatly exceeds that of general-purpose CPUs
[Chart: typical server power breakdown; source: Bose, HotChips 17]
System-Level Power Consumption
[Diagram: a multi-socket dual-core system built with legacy technology (CPUs plus an external memory controller hub, four XMBs, PCI-E bridges, and IO hubs) versus a multi-socket system of dual-core AMD Opteron™ processors with integrated memory controllers (SRQ, crossbar, memory controller, and HyperTransport per chip, with 8 GB/s links)]

- Dual-core packages with legacy technology: 692 watts for processors (173 W each), plus 48 watts for the external memory controller; 95% more power
- Dual-core AMD Opteron™ processors: 380 watts for processors (95 W each), with integrated memory controllers

Source: mixture of publicly available data sheets and AMD internal estimates. Actual system power measurements may vary based on configuration and components used.
System-Level Power Consumption
[The same diagram, annotated with measured power: legacy system, 692 watts for processors, 14 watts for the memory controller hub, and 8.5 watts for each of four XMBs, ~740 watts total; AMD Opteron™ system, 380 watts for processors, ~380 watts total]
Other Architectural Directions: Integration
Integrating dual designs for the processor core enables both peak performance and throughput/watt.
[Diagram: heterogeneous die with a big CPU, small CPUs, special accelerators, memory, IO, and RF]
[Chart: Watts/(Spec·Vdd²·L) vs. Spec2000·L]

Barriers?
- Integration of heterogeneous designs is non-trivial
- IP barriers
- Schedule issues with multiple converging components
Summary (1 of 2)
Silicon process technology is unlikely to be the major engine of processor performance increases in the future.

Designers face major circuit-related challenges that we've only just started to address:
- Design for variation tolerance and mitigation
- Maintaining dynamic voltage headroom within reliability- and variation-imposed limits
- Adaptive, self-healing techniques are a key direction
[Diagram: variation, leakage, Vdd droop, hot spots]
Summary (2 of 2)
Silicon process technology is unlikely to be the major engine of processor performance increases in the future, Moore's Law notwithstanding:
- CPU architectures are converging on modest pipe length, limited-issue, out-of-order designs
- Multi-core is good, but has limits in the not-too-distant future
- Heterogeneous integration is a key direction

We're up to the challenge, but it will be a joint effort across the design community …