CS4/MSc Parallel Architectures

CS4/MSc Parallel Architectures
Dr. Vijay Nagarajan
Institute for Computing Systems Architecture

Parallel Architectures ▪ How to build computers that execute tasks concurrently – Tasks can be instructions, methods, threads, programs etc.

▪ How to provide support for coordination and communication

– interconnection networks, coherence protocols, memory consistency model, synchronisation instructions, transactional memory etc.

CS4/MSc Parallel Architectures - 2016-2017

2

Parallel Architectures: Why? ▪ Be a good (systems) programmer – Most computers today are parallel (supercomputers, datacentres, even mobile phones); you need to understand them if you want to program them well! ▪ Research future computer architectures and systems

– A job at Intel or ARM, or a platform/infrastructure job at Google, Microsoft, Amazon etc. – Academic researcher ▪ Appreciate other related courses – PPLS, Extreme Computing etc.


General Information ▪ TA: Priyank Faldu ([email protected]) ▪ Pre-requisites: CS3 Computer Architecture ▪ Assignments: Assignment 1 – out 23-01-17; due 3-02-17; Assignment 2 – out 10-02-17; due 3-03-17 ▪ Recommended Books:

– Culler & Singh - Parallel Computer Architecture: A Hardware/Software Approach – Morgan Kaufmann – Hennessy & Patterson - Computer Architecture: A Quantitative Approach – Morgan Kaufmann – 5th edition

▪ Lecture slides (no lecture notes) ▪ More info: www.inf.ed.ac.uk/teaching/courses/pa/ ▪ Please interrupt with questions at any time


What is a Parallel Architecture? “A collection of processing elements that cooperate to solve large problems fast” Almasi and Gottlieb, 1989


Examples: Parallel Architectures ▪ ARM11

– 8-stage pipeline – Up to 8 instructions in-flight

▪ Intel Pentium 4

– 31-stage superscalar pipeline – Up to 124 instructions in-flight

▪ Intel Skylake

– Quad core – 2 threads per core (SMT) – GPU


Examples: Parallel Architectures ▪ Sunway Taihulight

– 40,960 SW26010 256-core CPUs – Up to 125 petaflops

▪ Tianhe-2

– 32,000 Intel Xeon 12-core CPUs – 48,000 Intel Xeon Phi 57-core CPUs – Up to 54 petaflops

▪ Google Network

– ??? Linux machines – Several connected cluster farms


Why Parallel Architectures? ▪ Performance of sequential architectures is limited – Computation/data flow through logic gates and memory devices – Each of these imposes a non-zero delay (at the very least, a speed-of-light delay) – Thus, the speed of light and minimum physical feature sizes impose a hard limit on the speed of any sequential computation

▪ Important applications that require performance (pull) – Nuclear reactor simulation; predicting future climate etc. – Cloud/big data: Google serves ~40K searches per second; Facebook has about a billion daily active users

▪ Technological reasons (push) – What to do with all those transistors? – New memory technologies: phase-change memory, memristor etc.


Technological Trends: Moore’s Law

▪ 1965 – Gordon Moore’s “Law”: densities double every year (2x/year)
▪ 1975 – Moore’s Law revised: densities double every 2 years (1.42x/year)
▪ Actually 5x every 5 years (1.35x/year)

[Figure: Growth in transistor density, 1960–2000 – transistors per device (Intel CPUs, Siroyan CPU, SSI, DRAM, best fit) and transistor size (nm) vs. year of introduction]

[Figure: Dennard Scaling – growth in microprocessor clock frequency (MHz), 1960–2000, for Intel and Alpha processors]
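The annualised growth factors quoted alongside each doubling period follow from a quick calculation (an illustrative sketch, not from the slides):

```python
# A density that doubles every n years grows by a factor of 2**(1/n) per year.
# Doubling every year:    2**(1/1) = 2.0x per year
# Doubling every 2 years: 2**(1/2) ~ 1.41x per year (~ the 1.42x on the slide)
for years in (1, 2):
    annual = 2 ** (1 / years)
    print(f"doubling every {years} year(s): {annual:.2f}x per year")
```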


Technological Trend: Memory Wall

[Figure: relative performance of CPU vs. memory, 1980–2000 (log scale); the gap between processor and memory speed grows over time (H&P Fig. 5.2)]

▪ Bottom-line: memory access is increasingly expensive and computer architects must devise new ways of hiding this cost


Tracking Technology: The Role of CA

[Figure: SPECint rating, 1984–2000, of the MIPS R2000, HP 9000, IBM Power1, DEC Alpha and Intel Pentium III; performance grows at 1.35x/year before 1986 and 1.58x/year after (H&P Fig. 1.1)]

▪ Bottom-line: architectural innovations (ILP + caches) complement technological improvements


Future Technology Predictions (2003)

▪ Moore’s Law will continue to ~2016
▪ Processors will have 2.2 billion transistors
▪ DRAM capacity to reach 128 Gbit
▪ Processor clocks should reach 40 GHz

[Figure: ITRS 2003 roadmap, 2004–2016 – poly 1/2 pitch shrinking from 90 to 22 nm, gate length from 37 to 9 nm, and desktop microprocessor transistor count growing from 138 to 2209 million]

Source: International Technology Roadmap for Semiconductors, 2003


State-of-the-art - January 2017

▪ IBM Power8
– 4.2 billion transistors
– 4, 6, 8, 10 or 12 core versions
– 8 SMT threads per core
– 22nm CMOS
– 5 GHz clock

▪ AMD Bulldozer
– 1.2 billion transistors
– 4, 8 cores
– 32nm
– 4.2 GHz
– 125 W

▪ Intel i7 ‘Broadwell’ 22-core
– 7.2 billion transistors
– 14nm CMOS silicon
– 22 cores, 44 threads
– 3.6 GHz
– 145 Watts


End of the Uniprocessor? ▪ Frequency has stopped scaling: Power Wall – End of Dennard scaling ▪ Memory Wall

– Instructions and data must be fetched – Memory becomes the bottleneck

▪ ILP Wall

– Dependencies between instructions limit ILP

End of performance scaling for uniprocessors has forced industry to turn to chip-multiprocessors (Multicores)


Multicores ▪ Use transistors for adding cores ▪ (Note: Moore’s law is projected to end soon, now for real!) ▪ (According to ITRS 2015, by 2021 it will not be viable to shrink transistors any further!) ▪ But software must be parallel! – Remember Amdahl’s law

▪ Lots of effort on making it easier to write parallel programs – e.g., transactional memory


Amdahl’s Law

▪ Let: F → fraction of problem that can be optimized
       Sopt → speedup obtained on optimized fraction

∴ Soverall = 1 / ((1 – F) + F/Sopt)

▪ e.g.: F = 0.5 (50%), Sopt = 10:
       Soverall = 1 / ((1 – 0.5) + 0.5/10) = 1.8

   With Sopt = ∞:
       Soverall = 1 / ((1 – 0.5) + 0) = 2

▪ Bottom-line: performance improvements must be balanced
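The law is easy to check numerically; here is a minimal Python sketch (the function name `amdahl_speedup` is our own):

```python
def amdahl_speedup(f, s_opt):
    """Overall speedup when a fraction f of the work is sped up by s_opt."""
    return 1.0 / ((1.0 - f) + f / s_opt)

# F = 0.5, Sopt = 10: the unoptimized half dominates
print(round(amdahl_speedup(0.5, 10), 2))   # 1.82 (~1.8)

# Even an infinite speedup on half the work caps the overall speedup at 2
print(amdahl_speedup(0.5, float("inf")))   # 2.0
```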


Amdahl’s Law and Efficiency

▪ Let: F → fraction of problem that can be parallelized
       Spar → speedup obtained on parallelized fraction
       P → number of processors

   Soverall = 1 / ((1 – F) + F/Spar)        E = Soverall / P

▪ e.g.: 16 processors (Spar = 16), F = 0.9 (90%):
       Soverall = 1 / ((1 – 0.9) + 0.9/16) = 6.4
       E = 6.4 / 16 = 0.4 (40%)

For good scalability: E > 50%; when resources are “free”, lower efficiencies are acceptable
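The speedup and efficiency figures for this example can be reproduced with a short Python sketch (function names are our own; we assume Spar = P, i.e. the parallel fraction scales perfectly across the processors):

```python
def amdahl_speedup(f, s_par):
    """Overall speedup when a fraction f of the work is sped up by s_par."""
    return 1.0 / ((1.0 - f) + f / s_par)

def efficiency(f, p):
    """Parallel efficiency on p processors, assuming Spar = P."""
    return amdahl_speedup(f, p) / p

# 16 processors, 90% parallelizable: speedup 6.4, efficiency 40%
print(round(amdahl_speedup(0.9, 16), 1))   # 6.4
print(round(efficiency(0.9, 16), 2))       # 0.4
```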


Topics ▪ Fundamental concepts – Introduction – Types of parallelism

▪ Uniprocessor parallelism – Pipelining, Superscalars

▪ Shared memory multiprocessors

– Cache coherence and memory consistency – Synchronization and Transactional Memory

▪ Hardware Multithreading ▪ Vector, SIMD Processors, GPUs ▪ Supercomputers and datacentre architecture (if time permits)
