High Performance Processors in a Power-Limited World
Sam Naffziger, AMD Senior Fellow

Outline
- Today's processor design landscape: trends
- Issues making designers' lives difficult: power limits, scaling effects
- Design opportunities: circuit level, architectural
- Summary

The All-Consuming Quest for Greater Performance at Lower Cost
- Increasing transistor density
- Increasing performance
Moore's Law has served us well.

Processor Frequency vs. Time
[Chart: MPU performance (MHz, log scale) vs. time, Jan-97 through Jan-09, leveling off near 4 GHz]
The amazing frequency increases of the past decade have leveled off. Why? Power limits and process issues.


Power Consumption Background

Source: Holt, HotChips 17

Power has always challenged circuit integration. We've been bailed out by technology in the past:

Bipolar → NMOS → CMOS

Scaling Background
[Chart: feature size (µm, log scale) vs. time, Jan-85 through Jan-03, with a realistic power limit marked; Dennard scaling delivered Power/10. Source: Horowitz et al., IEDM 2005]
[Chart: 2005 ITRS projections of Vt and Vdd (mV, 0-2500) vs. year, 2000-2010: Vdd keeps falling while Vt stays nearly flat]
Scaling doesn't bail us out any more.

Power Consumption Background
The historical levers for reducing power:
- Reducing Vdd
- Reducing C_TOT
- Reducing I_LEAK and I_CO
- Reducing α
The process guys have had the biggest impact on these. But now, not only are those improvements fading, the processor designer also faces a host of new process-scaling challenges:
- Variation
- Voltage droop
- Wire non-scaling

P ≈ C_TOT·α·F·Vdd² + N_TOT·α·F·Vdd·I_CO + N_ON·I_LEAK·Vdd
    (switching)       (crossover)           (leakage)
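The three-term power breakdown can be sketched numerically. A minimal Python model follows; every component value is invented for illustration (the talk gives no numbers), and I_CO is treated loosely as an effective crossover charge per transition.

```python
# Sketch of the slide's CPU power model (illustrative values only):
#   P ≈ C_TOT·α·F·Vdd²        switching power
#     + N_TOT·α·F·Vdd·I_CO    crossover (short-circuit) power
#     + N_ON·I_LEAK·Vdd       leakage power

def cpu_power(c_tot, alpha, freq, vdd, n_tot, i_co, n_on, i_leak):
    """Return (switching, crossover, leakage, total) power in watts."""
    switching = c_tot * alpha * freq * vdd ** 2
    crossover = n_tot * alpha * freq * vdd * i_co
    leakage = n_on * i_leak * vdd
    return switching, crossover, leakage, switching + crossover + leakage

# Hypothetical numbers: 100 nF switched cap, 10% activity, 2.5 GHz, 1.2 V,
# 200M transistors, 100M leaking at 100 nA each.
sw, xo, lk, total = cpu_power(
    c_tot=100e-9, alpha=0.1, freq=2.5e9, vdd=1.2,
    n_tot=200e6, i_co=1e-16, n_on=100e6, i_leak=100e-9)
print(f"switching {sw:.1f} W, crossover {xo:.1f} W, leakage {lk:.1f} W")
```

With these made-up inputs the split comes out to roughly 36 W switching, 6 W crossover, and 12 W leakage, a plausible shape for the era the slide describes.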


The Silicon Age Is Still on a Roll, But …

High-volume manufacturing   2004   2006   2008   2010   2012   2014   2016   2018
Technology node (nm)        90     65     45     32     22     16     11     8
Delay = CV/I scaling        0.7    ~0.7   >0.7   (delay scaling will slow down)
Energy/logic-op scaling     >0.35  >0.5   >0.5   (energy scaling will slow down)
Bulk planar CMOS            high probability → low probability
Alternate, 3G etc.          low probability → high probability
Variability                 medium → high → very high

But … all this scaling has some nasty side effects.
Source: ITRS Roadmap; European Nanoelectronics Initiative Advisory Council (ENIAC)

[Diagram: device roadmap by node: the gate stack moves from poly/SiON to metal/high-k; electrostatic control moves from bulk and planar PDSOI to 3D FDSOI and MuGFET/MuCFET with substrate engineering; stressors at 65nm (2007), high-µ materials from 45nm (2010) through 32nm (2013) and 22nm (2016); RC delay keeps worsening]

Device Variation Reverse Scales

The Problem: Atoms don’t scale

Source: Pelgrom, IEEE lecture 5/11/06

Variations subtract directly off cycle time → power efficiency drops → circuit margins degrade

One impact of variation is leakage spread.
[Chart: Sidd (A) vs. Fmax (a.u.) across parts: only ~1.25X spread in Fmax but >3X spread in SIDD]
Chip SIDD is set by the "smallest" gates; Fmax is set by the slowest gates.

Scaling Intrinsically Hurts Supply Integrity
[Chart: leakage %, normalized Vdd, and Vdd droop % by technology node, 250nm through 90nm: leakage and droop grow as Vdd shrinks. Source: Bose, HotChips 17]
With power per core staying constant but area, voltage and cycle times dropping, we have a big challenge. Requiring a higher voltage to hit frequency is a quadratic power impact.

Design Opportunities: Circuit Level

Some Ways to Shoulder the Variation Burden: Adaptive Clocking
- Programmable delay buffers: empirically set the clock edge to optimize frequency
- Higher granularity → more variation tolerance
- LBIST and GA search algorithms show promise for per-part optimization

Some Ways to Shoulder the Variation Burden: Self-Healing Designs
- The simplest example is ECC on cache memory arrays
- The next level is Intel's Pellston technology, implemented on Montecito and Tulsa: disable defective cache lines detected by multiple ECC errors

Future directions involve self-checking with redundant logic and retry:
- Predict the result through parity, residues or redundant logic
- On an error, replay the calculation before committing architectural state
- If the replay is correct, it was a transient error (particle strike, Vdd droop, random noise coupling, etc.)
- If incorrect, reduce frequency, increase voltage, or retry with an alternate execution path
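A toy sketch of the detect-and-replay idea (hypothetical code, not any shipping design): a mod-3 residue check predicts an adder's result, and a mismatch triggers a replay before the result would be committed. A single bit flip always changes the sum by ±2^k, which is never a multiple of 3, so the residue check catches it.

```python
import random

def add_with_fault(a, b, fault=False):
    """Adder model; optionally flip one result bit to mimic a transient upset."""
    result = a + b
    if fault:
        result ^= 1 << random.randrange(16)
    return result

def residue_ok(a, b, result, base=3):
    # Residue code: (a mod 3 + b mod 3) mod 3 must match result mod 3.
    return (a % base + b % base) % base == result % base

def checked_add(a, b, first_try_faulty=False):
    result = add_with_fault(a, b, fault=first_try_faulty)
    if not residue_ok(a, b, result):
        # Error detected: replay before committing architectural state.
        result = add_with_fault(a, b, fault=False)
        if not residue_ok(a, b, result):
            raise RuntimeError("persistent error: lower F, raise V, or reroute")
    return result

print(checked_add(1234, 5678, first_try_faulty=True))  # recovers: 6912
```

The replay succeeding classifies the event as transient, matching the slide's decision tree; a second failure would fall through to the frequency/voltage remedies.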

[Figure: self-healing design example, from Fall Microprocessor Forum 2006]

Adaptive Supply Voltage
Per-part and dynamic voltage management are key. More range flexibility and finer-grain response will provide differentiation.
[Chart: energy/operation vs. channel length (short-nominal-long) and Vdd (high-low)]

Integrated Power and Thermal Management
"Fuse and forget" is no longer viable: there is too much variation in environment, manufacturing and operating conditions, so some means of dynamic optimization is needed.
[Chart: Sidd (A) vs. Fmax (a.u.) part-to-part spread]
An autonomous programmable controller enables real-time optimizations. An embedded controller provides the needed flexibility:
- OS interfacing
- Multi-core management
- Per-part optimization
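A minimal sketch of the kind of control loop such an embedded controller might run. The P-state table is taken from the PowerNow! slide later in the talk; the thresholds and the temperature/utilization inputs are hypothetical.

```python
# (MHz, V) pairs from the talk's P-state slide, P0 (fastest) to P5 (slowest).
P_STATES = [(2600, 1.40), (2400, 1.35), (2200, 1.30),
            (2000, 1.25), (1800, 1.20), (1000, 1.10)]

def next_p_state(current, temp_c, util, t_limit=95.0):
    """Pick the next P-state index: back off on heat, track demand."""
    if temp_c > t_limit:                       # thermal pressure: slow down
        return min(current + 1, len(P_STATES) - 1)
    if util > 0.85:                            # demand high: speed up
        return max(current - 1, 0)
    if util < 0.30:                            # demand low: save power
        return min(current + 1, len(P_STATES) - 1)
    return current

# Walk a short made-up trace: busy, busy, hot, then idle.
state = 3
for temp, util in [(70, 0.9), (70, 0.9), (98, 0.9), (60, 0.1)]:
    state = next_p_state(state, temp, util)
print(P_STATES[state])  # → (2000, 1.25)
```

Real controllers weigh many more inputs (droop sensors, per-core activity, OS hints); the point is only that the policy is firmware, hence per-part tunable.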

Design Opportunities: Architectural

Traversing the Power Contour
[Chart: power consumption vs. channel length (short-nominal-long) and Vdd (high-low)]
P ≈ C_TOT·α·F·Vdd² + N_TOT·α·F·Vdd·I_CO + N_ON·I_LEAK·Vdd
    (switching)       (crossover)           (leakage)
[Chart: frequency vs. channel length and Vdd]

Traversing the Power Contour for a Given Implementation
[Chart: energy/operation vs. channel length and Vdd, with the "max performance" and "best power efficiency" corners marked]

For Comparing Architectural Efficiency, Performance³/W Is Most Effective
[Chart: Performance³/W (normalized 0-1.2) vs. channel length (short-nominal-long) and Vdd (high-nominal-low)]
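Why cube the performance? Along a single design's V/F curve, V scales roughly with f, so power goes as f·V² ≈ f³; Performance³/W is then roughly invariant to the operating point, and what remains measures the architecture itself. A small sketch with illustrative numbers:

```python
def perf3_per_watt(perf, watts):
    return perf ** 3 / watts

# One design at two V/F operating points: power scales cubically with
# frequency, so the metric is unchanged by where you run it.
assert abs(perf3_per_watt(1.0, 100.0)
           - perf3_per_watt(0.8, 100.0 * 0.8 ** 3)) < 1e-9

# Two different architectures at equal power: the metric separates them.
design_a = perf3_per_watt(1.0, 100.0)   # baseline
design_b = perf3_per_watt(1.1, 100.0)   # 10% faster at the same power
print(design_b / design_a)              # → ≈ 1.331
```

A plain perf/W metric would reward simply down-clocking; the cubic weighting removes that escape hatch.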

Optimal Pipeline Depth
[Chart: pipe stages (0-35) for Willamette (2000), Prescott (2004), Core2 (2006), Opteron and Power6; the more moderate pipelines are ~2X more power efficient]
- The industry has been moving from "hyperpipelining" with short pipe stages to something more moderate

A Look at Mobile Processor Power

A Look at Mobile System Power
[Chart: mobile system power broken down into CPU, memory, memory controller, chipset and rest of system, comparing TDP against average power]
If a laptop burned TDP power all the time, battery life would be measured in minutes. How do we get mobile average power so much lower than TDP?

The Answer: Take Advantage of Typically Low CPU Utilization
[Chart: processor utilization during normal laptop usage as approximated by MobileMark 2002 (Tj 95, 1800 MHz, 1.35 V; C1, PowerNow!, and C3 enabled): core power and IO power (0-35 W) over a ~70-minute trace spanning Adobe Flash, MS Outlook, Word, PowerPoint, Excel, and Word with Netscape; core power sits far below its peak most of the time]

Reducing Power and Cooling Requirements with Processor Performance States

P-state   Frequency   Voltage   Avg. CPU core power (measured at CPU)
P0        2600 MHz    1.40 V    ~95 W
P1        2400 MHz    1.35 V    ~90 W
P2        2200 MHz    1.30 V    ~76 W
P3        2000 MHz    1.25 V    ~65 W
P4        1800 MHz    1.20 V    ~55 W
P5        1000 MHz    1.10 V    ~32 W

[Chart: power (W) with AMD PowerNow!™ disabled vs. enabled at three load levels: 10500 connections (~62% CPU utilization, -33%), 5000 connections (~40% utilization, -62%), and idle in the OS (-75%)]

Up to 75% power savings (at idle)! Additionally, "C-states" reduce power further by cutting clocks completely and dropping voltage to retention levels.
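As a sanity check, the table's figures can be compared against the dynamic-power model P ∝ f·Vdd². The prediction undershoots at the low P-states, which is consistent with leakage and uncore power not scaling with f·Vdd² (an inference from the model, not a claim from the talk):

```python
p_states = {  # name: (MHz, V, ~W from the table above)
    "P0": (2600, 1.40, 95), "P1": (2400, 1.35, 90),
    "P2": (2200, 1.30, 76), "P3": (2000, 1.25, 65),
    "P4": (1800, 1.20, 55), "P5": (1000, 1.10, 32),
}
f0, v0, w0 = p_states["P0"]
for name, (f, v, w) in p_states.items():
    # Scale P0's power by the dynamic-only factor (f/f0)·(V/V0)².
    predicted = w0 * (f / f0) * (v / v0) ** 2
    print(f"{name}: predicted {predicted:5.1f} W, table ~{w} W")
```

For P5 the dynamic-only model predicts roughly 22.6 W against the table's ~32 W; the gap is the floor that C-states attack by dropping voltage to retention levels.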

Improving Peak Performance per Watt

Adding Features to Increase Performance
[Chart: power efficiency, Watts/(Spec·Vdd²·L), vs. performance, Spec2000·L, on log-log axes. Source: Horowitz et al., IEDM 2005]
- Increasing execution efficiency has historically hurt power efficiency
- However, the cubic reduction of power with V/F scaling has tended to make this a good tradeoff
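The tradeoff in the second bullet can be made concrete. Under the usual approximation P ∝ f·V² with V tracking f (so P ∝ f³), a feature that buys 30% IPC at a 30% power cost at fixed V/F (illustrative numbers) still nets a speedup once frequency is scaled back to the original power budget:

```python
ipc_gain = 1.30      # feature's IPC gain (assumed)
power_ratio = 1.30   # feature's power cost at fixed V/F (assumed)

# P ∝ f³, so scale frequency by (1/power_ratio)^(1/3) to restore power.
f_scale = (1 / power_ratio) ** (1 / 3)
net_perf = ipc_gain * f_scale
print(f"frequency scaled to {f_scale:.3f}, net speedup {net_perf:.3f}")
```

The result is about a 1.19x speedup at the original power, which is the sense in which V/F scaling has "paid for" microarchitectural features; the slide after this one shows why the trick runs out at VMIN.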

Adding Features to Increase Performance Works with V/F Scaling
[Chart: normalized energy/op, frequency and performance vs. IPC (0.5-2.0): as IPC rises, frequency can be scaled down while performance still climbs]
Voltage scaling has its limits:
→ More power-efficient designs have an advantage
→ High-power designs get penalized due to higher di/dt, higher temperatures, etc.
If we hit VMIN, however, the game is over.

How Hard Is Improving Existing Processors?
[Chart: Watts/(Spec·Vdd²·L) vs. Spec2000·L, comparing current- and next-generation cores: peak performance costs more energy/operation. Source: Horowitz et al., IEDM 2005]
[Chart: switched capacitance, split into logic and clock, for Gen1 peak vs. Gen2 peak: despite more features, the next-gen core has substantially lower capacitance]
Most of the big-hitter improvements have been heavily mined already. Next-generation AMD cores have >>50% of clocks gated off even for high-power code.

Multi-Core to the Rescue?

              Single core + cache   Dual core + cache
Voltage       1                     0.85
Frequency     1                     0.85
Area          1                     2
Power         1                     1
Perf          1                     ≈1.7
Perf/Watt     1                     ≈1.7

Sounds like a great story. What's the catch? Some of the catches:
- What if you're already at VMIN? You would need to cut frequency in half to stay within the power limit
- How much parallelizable code is really out there?
- More compute capacity means more IO and memory bandwidth demands …

Multi-Core Issues: Amdahl's Law
There is almost always a portion of an application that cannot be parallelized:
- This portion becomes a bottleneck as the number of threads is increased
- A typical value is in the range of 10%
[Chart: multi-core speedup (1-8 cores) with serial code and constant power considered, for serial fractions of 0, 5%, 10%, 15% and 20%, against the ideal linear speedup]
Just 10% serial code drops the 8-core performance improvement by 41%.
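The 41% figure follows directly from Amdahl's law, speedup(n) = 1 / (s + (1 − s)/n) for serial fraction s on n cores:

```python
def amdahl_speedup(n_cores, serial_fraction):
    """Amdahl's law: serial fraction s caps the parallel speedup."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

ideal = 8.0
actual = amdahl_speedup(8, 0.10)      # → ≈ 4.71
drop = (ideal - actual) / ideal
print(f"8-core speedup with 10% serial code: {actual:.2f} "
      f"({drop:.0%} below ideal)")    # → 41% below ideal
```

Even a 5% serial fraction caps 8 cores at about 5.9x, which is why the chart's curves peel away from the ideal line so quickly.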

Multi-Core Issues: IO Power
All those extra cores need their own data, and IO power in terms of W/Gb/s has been pretty constant, in the range of 20 mW, for years.
[Chart: multi-core speedup (1-8 cores) with serial code, constant power and IO power considered, for serial fractions of 0-20%, against the ideal]
- If we increase IO power accordingly, but hold total chip power constant with V/F scaling, things get worse
- Overall performance drops by another 10% or so …
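A back-of-envelope sketch of the squeeze: at the slide's ~20 mW per Gb/s, IO power grows linearly with core count while the chip budget stays fixed, so the cores must be V/F-scaled into a shrinking budget (the per-core bandwidth demand and the 95 W budget are assumed numbers):

```python
W_PER_GBPS = 20e-3           # ~20 mW per Gb/s of IO (slide's figure)
chip_budget_w = 95.0         # assumed fixed total chip power budget
per_core_bw_gbps = 50.0      # assumed bandwidth demand per core

for cores in (1, 2, 4, 8):
    io_w = cores * per_core_bw_gbps * W_PER_GBPS
    core_w = chip_budget_w - io_w        # what's left for the cores
    print(f"{cores} cores: IO {io_w:4.1f} W, cores get {core_w:5.1f} W")
```

Every watt diverted to IO forces a lower V/F point for the cores, which compounds with the Amdahl losses above; that combination is the extra ~10% performance drop the slide cites.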

The Transition to Parallel Applications
Single-threaded applications:
- Most of today's applications
- Well-understood optimization techniques
- Advanced development, analysis and debug tools
- Conceptually easy to think about
Parallel applications:
- Small number of applications (worked by experts for 10+ yrs)
- Awkward development, analysis and debug environments
- Parallel programming is hard! Amdahl's law is still a law…
- SW productivity is already in a crisis → this worsens things!
Establishing an appropriate balance is key for managing this important transition.

Other Architectural Directions: Integration
[Chart: typical server power breakdown. Source: Bose, HotChips 17]
Not only does the integration of more system components (e.g. memory controllers, IO) improve performance, it reduces power significantly as well:
- IO communication overhead drops
- CPU-integrated power management can dynamically optimize
- The power efficiency of special-function components (e.g. graphics accelerators, network processors) greatly exceeds that of general-purpose CPUs

System-Level Power Consumption
[Diagram: a four-socket system built from dual-core packages with legacy technology (external memory controller hub, XMB memory bridges, PCI-E bridges, IO hubs) vs. a four-socket Dual-Core AMD Opteron™ system (integrated memory controllers, SRQ/crossbar, HyperTransport links at 8 GB/s)]
- Legacy packages: 692 watts for processors (173 W each), plus 48 watts for the external memory controller
- Dual-Core AMD Opteron™ processors: 380 watts for processors (95 W each), with integrated memory controllers
- The legacy system burns 95% more power
Source: mixture of publicly available data sheets and AMD internal estimates. Actual system power measurements may vary based on configuration and components used.

[Diagram detail: the legacy system totals 740 watts (692 W for processors, 14 W for the memory controller hub, and 4 × 8.5 W for the XMBs) vs. 380 watts for the Dual-Core AMD Opteron™ system]
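The watt figures on these slides are self-consistent, which is easy to verify from the per-component numbers:

```python
legacy_cpus = 4 * 173        # four dual-core packages at 173 W each
mch = 14                     # external memory controller hub
xmbs = 4 * 8.5               # four XMB memory bridges
legacy_total = legacy_cpus + mch + xmbs      # → 740 W

opteron_total = 4 * 95       # integrated memory controllers: CPUs only

print(legacy_total, opteron_total,
      f"{legacy_total / opteron_total - 1:.0%} more power")
```

740 W vs. 380 W is a 95% gap, matching the slide's headline claim.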

Other Architectural Directions: Heterogeneous Integration
Integrating dual designs for the processor core enables both peak performance and throughput/watt.
[Diagram: a big CPU alongside small CPUs, special accelerators, IO and memory on one die]
Barriers?
- Integration of heterogeneous designs is non-trivial
- IP barriers
- Schedule issues with multiple converging components
[Chart: Watts/(Spec·Vdd²·L) vs. Spec2000·L]

Summary (1 of 2)
Silicon process technology is unlikely to be the major engine of processor performance increases in the future.
Major circuit-related challenges that we've only just started to address lie ahead for designers:
- Design for variation tolerance and mitigation
- Maintaining dynamic voltage headroom within reliability- and variation-imposed limits
- Adaptive, self-healing techniques are a key direction
[Figure: designers beset by variation, leakage, Vdd droop and hot spots]

Summary (2 of 2)
Silicon process technology is unlikely to be the major engine of processor performance increases in the future.
- CPU architectures are converging on modest pipe length, limited-issue, out-of-order designs
- Multi-core is good, but has limits in the not-too-distant future
- Heterogeneous integration is a key direction
We're up to the challenge, but it will be a joint effort across the design community …