CS4/MSc Parallel Architectures Dr. Vijay Nagarajan Institute for Computing Systems Architecture
Parallel Architectures ▪ How to build computers that execute tasks concurrently – Tasks can be instructions, methods, threads, programs etc.
▪ How to provide support for coordination and communication
– interconnection networks, coherence protocols, memory consistency model, synchronisation instructions, transactional memory etc.
CS4/MSc Parallel Architectures - 2016-2017
2
Parallel Architectures: Why? ▪ Be a good (systems) programmer – Most computers today are parallel (supercomputers, datacentres, even mobile phones); you need to understand them if you want to program them well! ▪ Research future computer architectures and systems
– A job at Intel, ARM, or a platform/infrastructure job at Google, Microsoft, Amazon, etc. – Academic researcher ▪ Appreciate other related courses – PPLS, Extreme Computing, etc.
General Information ▪ TA: Priyank Faldu (
[email protected]) ▪ Pre-requisites: CS3 Computer Architecture ▪ Assignments: Assignment 1 – out 23-01-17, due 3-02-17; Assignment 2 – out 10-02-17, due 3-03-17 ▪ Recommended Books:
– Culler & Singh - Parallel Computer Architecture: A Hardware/Software Approach – Morgan Kaufmann – Hennessy & Patterson - Computer Architecture: A Quantitative Approach – Morgan Kaufmann – 5th edition
▪ Lecture slides (no lecture notes) ▪ More info: www.inf.ed.ac.uk/teaching/courses/pa/ ▪ Please interrupt with questions at any time
What is a Parallel Architecture? “A collection of processing elements that cooperate to solve large problems fast” Almasi and Gottlieb, 1989
Examples: Parallel Architectures ▪ ARM11
– 8-stage pipeline – Up to 8 instructions in-flight
▪ Intel Pentium 4
– 31-stage superscalar pipeline – Up to 124 instructions in-flight
▪ Intel Skylake
– Quad core – 2 threads per core (SMT) – GPU
Examples: Parallel Architectures ▪ Sunway TaihuLight
– 40,960 SW26010 256-core CPUs – Up to 125 petaflops
▪ Tianhe-2 – 32,000 Intel Xeon 12-core CPUs
– 48,000 Intel Xeon Phi 57-core CPUs – Up to 54 petaflops
▪ Google Network
– ??? Linux machines – Several connected cluster farms
Why Parallel Architectures? ▪ Performance of a sequential architecture is limited – Computation/data flow through logic gates and memory devices – Each of these has a non-zero delay (at the very least, the delay imposed by the speed of light) – Thus, the speed of light and minimum physical feature sizes impose a hard limit on the speed of any sequential computation
▪ Important applications that require performance (pull) – Nuclear reactor simulation; predicting future climate, etc. – Cloud/big data: Google handles about 40K searches per second; Facebook has about a billion active users per day
▪ Technological reasons (push) – What to do with all those transistors? – New memory technologies: phase-change memory, memristors, etc.
Technological Trends: Moore’s Law
▪ 1965 – Gordon Moore’s “Law”
– Densities double every year (2x)
▪ 1975 – Moore’s Law revised
– Densities double every 2 years (1.42x/year)
– Actually 5x every 5 years (1.35x/year)
▪ Dennard Scaling
[Figure: Growth in Transistor Density – transistors per device and transistor size (nm) vs. year of introduction, 1960–2000, for Intel CPUs, Siroyan CPU, SSI, and DRAM, with best fit]
[Figure: Growth in Microprocessor Clock Frequency – clock frequency (MHz) vs. year of introduction, 1960–2000, for Intel and Alpha processors]
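The doubling periods above can be checked as annualized growth factors: k-fold growth every n years is k^(1/n) per year. A minimal sketch (the helper name is my own, not from the slides):

```python
# Annualized growth factor: k-fold growth every `years` years
# corresponds to k**(1/years) growth per year.

def annual_factor(k, years):
    """Per-year growth factor for k-fold growth every `years` years."""
    return k ** (1.0 / years)

print(annual_factor(2, 1))  # 1965 law: doubling every year -> 2.0x/year
print(annual_factor(2, 2))  # 1975 revision: doubling every 2 years -> ~1.41x/year
print(annual_factor(5, 5))  # observed: 5x every 5 years -> ~1.38x/year
```

These come out near the rounded 1.42x and 1.35x figures quoted on the slide.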
Technological Trend: Memory Wall
[Figure: CPU vs. memory performance, 1980–2000, log scale from 1 to 100,000 – H&P Fig. 5.2]
▪ Bottom-line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost
Tracking Technology: The Role of CA
[Figure: SPECint rating vs. year, 1984–2000, for MIPS R2000, HP 9000, IBM Power1, DEC Alpha, and Intel Pentium III, with trend lines at 1.35x/year and 1.58x/year – H&P Fig. 1.1]
▪ Bottom-line: architectural innovations complement technological improvements (ILP + caches)
Future Technology Predictions (2003)
▪ Moore’s Law will continue to ~2016
▪ Procs. will have 2.2 billion transistors
▪ DRAM capacity to reach 128 Gbit
▪ Procs.’ clocks should reach 40 GHz

Desktop microprocessor roadmap (Source: International Technology Roadmap for Semiconductors, 2003):

Year | Poly 1/2 pitch (nm) | Gate length (nm) | Transistors (millions)
2004 |                  90 |               37 |                    138
2007 |                  65 |               25 |                    276
2010 |                  45 |               18 |                    552
2013 |                  32 |               13 |                   1104
2016 |                  22 |                9 |                   2209
State-of-the-art - January 2017
▪ IBM Power8
– 4.2 billion transistors – 4, 6, 8, 10 or 12 core versions – 8 SMT threads per core – 22nm CMOS – 5 GHz clock
▪ AMD Bulldozer
– 1.2 billion transistors – 4, 8 cores – 32nm – 4.2 GHz – 125 W
▪ Intel i7 ‘Broadwell’ 22-core
– 7.2 billion transistors – 14nm CMOS silicon – 22 cores, 44 threads – 3.6 GHz – 145 W
End of the Uniprocessor? ▪ Frequency has stopped scaling: Power Wall – End of Dennard scaling ▪ Memory Wall
– Instructions and data must be fetched from memory – Memory becomes the bottleneck
▪ ILP Wall
– Dependencies between instructions limit ILP
The end of performance scaling for uniprocessors has forced industry to turn to chip multiprocessors (multicores)
Multicores ▪ Use the extra transistors for adding cores ▪ (Note: Moore’s law has long been projected to end soon – now for real!) ▪ (According to ITRS 2015, by 2021 it will no longer be viable to shrink transistors any further!) ▪ But software must be parallel! – Remember Amdahl’s law
▪ Lots of effort on making it easier to write parallel programs – e.g., transactional memory
Amdahl’s Law ▪ Let: F → fraction of problem that can be optimized; Sopt → speedup obtained on the optimized fraction

∴ Soverall = 1 / ((1 – F) + F / Sopt)

▪ e.g.: F = 0.5 (50%), Sopt = 10

Soverall = 1 / ((1 – 0.5) + 0.5 / 10) = 1.8

Sopt = ∞

Soverall = 1 / ((1 – 0.5) + 0) = 2

▪ Bottom-line: performance improvements must be balanced
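Amdahl’s law is a one-line computation; a minimal sketch of the slide’s two examples (the function name is my own):

```python
# Amdahl's law: overall speedup when a fraction F of the work
# is sped up by a factor Sopt (illustrative sketch).

def amdahl_speedup(f, s_opt):
    """Soverall = 1 / ((1 - F) + F / Sopt)."""
    return 1.0 / ((1.0 - f) + f / s_opt)

print(amdahl_speedup(0.5, 10))            # ~1.82, the slide's ~1.8
print(amdahl_speedup(0.5, float("inf")))  # 2.0: the (1 - F) term caps the speedup
```

Passing `float("inf")` for Sopt makes the optimized term vanish, giving the 1 / (1 – F) upper bound directly.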
Amdahl’s Law and Efficiency ▪ Let: F → fraction of problem that can be parallelized; Spar → speedup obtained on the parallelized fraction; P → number of processors

Soverall = 1 / ((1 – F) + F / Spar)        E = Soverall / P

▪ e.g.: 16 processors (Spar = 16), F = 0.9 (90%)

Soverall = 1 / ((1 – 0.9) + 0.9 / 16) = 6.4        E = 6.4 / 16 = 0.4 (40%)

For good scalability: E > 50%; when resources are “free”, lower efficiencies are acceptable
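The efficiency calculation above can be sketched the same way (an illustrative helper, assuming the parallel fraction scales perfectly across the P processors):

```python
# Parallel efficiency under Amdahl's law (illustrative sketch;
# by default the parallel fraction gets a full P-fold speedup).

def speedup_and_efficiency(f, p, s_par=None):
    """Return (Soverall, E) for parallel fraction f on p processors."""
    if s_par is None:
        s_par = p  # perfect scaling of the parallel fraction
    s_overall = 1.0 / ((1.0 - f) + f / s_par)
    return s_overall, s_overall / p

s, e = speedup_and_efficiency(0.9, 16)
print(s, e)  # ~6.4 and ~0.4, matching the slide's example
```

Even with 90% of the problem parallelized, 16 processors reach only 40% efficiency, which is why the serial fraction dominates scalability.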
Topics ▪ Fundamental concepts – Introduction – Types of parallelism
▪ Uniprocessor parallelism – Pipelining, Superscalars
▪ Shared memory multiprocessors
– Cache coherence and memory consistency – Synchronization and Transactional Memory
▪ Hardware Multithreading ▪ Vector, SIMD Processors, GPUs ▪ Supercomputers and datacenter architecture (if time permits)