Parallel Computing

University of Tartu Institute of Computer Science

Parallel Computing MTAT.08.020 PARALLEELARVUTUSED

[email protected] Liivi 2 - 324

2012/13 Fall term

2 Practical information about the course

Lectures: Thursdays 16:15, Liivi 2 - 612 – Eero Vainikko
Computer classes: Mondays 16:15, Liivi 2 - 205 – Oleg Batrašev

• 6 EAP • Exam

3 Practical information about the course

In theory, there is no difference between theory and practice. But, in practice, there is. Jan L. A. van de Snepscheut

4 Practical information about the course

NB! To be allowed to take the exam, all computer class exercises must be submitted and the deadlines must be met! The final grade consists of:
1. Computer class exercises 50%
2. Oral exam at the blackboard 40%
3. Active participation at lectures 10%

Course homepage (http://www.ut.ee/~eero/PC/)

5 Course syllabus

Course content (preliminary, subject to change along the way)
1. Motivation for parallel computing
2. Classifications of parallel systems
3. Clos-networks, Beowulf clusters, Grid, Peer-to-peer networks
4. Benchmarks
5. Speedup, efficiency
6. Amdahl's law
7. Scaled efficiency, ways for increasing the efficiency
8. Fortran95, HPF
9. OpenMP
10. MPI
11. Assessment of parallel algorithms, definition of optimality

6 Course syllabus

12. Main techniques in parallel algorithms:
13. Balanced trees
14. Pointer jumping
15. Divide and conquer
16. Partitioning
17. Pipelining
18. Accelerated cascading
19. Symmetry breaking

7 Course syllabus Computer class activities:

1. Fortran 90/95 2. OpenMP 3. Message Passing Interface 4. Solving various parallelisation problems

Literature:
1. A. Grama, A. Gupta, G. Karypis and V. Kumar. "Introduction to Parallel Computing", Second Edition, Addison-Wesley, 2003.
2. L. Ridgway Scott, Terry Clark and Babak Bagheri. "Scientific Parallel Computing", Princeton University Press, 2005.

8 Course syllabus

3. Jack Dongarra, Ian Foster, Geoffrey Fox, William Gropp, Ken Kennedy, Linda Torczon and Andy White. "Sourcebook of Parallel Computing", Morgan Kaufmann Publishers, 2003.
4. I. Foster. "Designing and Building Parallel Programs", Reading, MA: Addison-Wesley, 1995; http://www.mcs.anl.gov/dbpp.
5. L. Ridgway Scott, Terry Clark and Babak Bagheri. "Scientific Parallel Computing", Princeton University Press, 2005.
6. E. Vainikko. "Fortran95 ja MPI", TÜ Kirjastus, 2004.
7. Ed Akin. "Object-Oriented Programming via Fortran90/95", Cambridge University Press, 2003.
8. A. C. Marshall. Fortran 90 1 Day Course (http://www.liv.ac.uk/HPC/F90Course1DayHomePage.html), 1997.
9. C.-K. Shene. "Fortran 90 Tutorial" (http://www.cs.mtu.edu/~shene/COURSES/cs201/NOTES/fortran.html), Department of Computer Science, Michigan Technological University, 2009.

9 Course syllabus

10. P. S. Pacheco. "Parallel Programming with MPI", Morgan Kaufmann Publ., 1997.
11. Joseph JaJa. "An Introduction to Parallel Algorithms", Addison-Wesley Publ., 1992.
12. Ralf Steinmetz, Klaus Wehrle (Eds.). Peer-to-Peer Systems and Applications, LNCS 3485, Springer, 2005.

10 Introduction

Why do we need parallel computing? First, form groups of 2! Then write down as many reasons as you can during 5 minutes!

11 Introduction

Introduction Parallel Computing is a part of Computer Science and Computational Sciences (hardware, software, applications, programming technologies, algorithms, theory and practice) with special emphasis on parallel computing or supercomputing

1 Parallel Computing – motivation

The main questions in parallel computing:

• How is interprocessor communication organised?
• How is memory organised?
• Who takes care of parallelisation?
  ◦ the CPU?
  ◦ the compiler?
  ◦ ... or the programmer?

12 Introduction

1.1 History of computing

Pre-history: parallel computing existed even before electronic computers (pyramid building?)
1929 – parallelisation of weather forecasts
≈1940 – parallel computations in the war industry using Felix-like mechanical calculators
1950 – emergence of the first electronic computers, based on vacuum tubes; ENIAC and others

13 Introduction

1960 - Mainframe era. IBM

1.1 History of computing

14 Introduction

1970 - Era of Minis 1980 - Era of PCs 1990 - Era of parallel computers 2000 - Clusters / Grids 2010 - Clouds

1.1 History of computing

15 Introduction

1.2 Expert's predictions

Much-cited legends:
1947 – computer engineer Howard Aiken: the USA will need at most 6 computers in the future
1950 – Thomas J. Watson likewise: 6 computers will be needed in the world
1977 – Seymour Cray: the Cray-1 computer will attract only 100 customers worldwide
1980 – a study by IBM: at most about 50 Cray-class computers could be sold per year worldwide
Reality: many of today's homes have more processing power than a Cray.

Gordon Moore's (co-founder of Intel) law (1965): the number of switches doubles every second year.
1975 – refinement of the above: the number of switches on a CPU doubles every 18 months.
By 2020 or 2030 we would in this way reach the atomic level, or the quantum computer! But why? The reason is the speed-of-light limit (see Example 1.2 below!)

16 Introduction

1.2 Expert’s predictions

Flops:

first computers           10^2    100 Flops           (hundred op/s)
desktop computers today   10^9    Gigaflops (GFlops)  (billion op/s)
supercomputers nowadays   10^12   Teraflops (TFlops)  (trillion op/s)
the aim of today          10^15   Petaflops (PFlops)  (quadrillion op/s)
next step                 10^18   Exaflops (EFlops)   (quintillion op/s)

http://en.wikipedia.org/wiki/Orders_of_magnitude_ (numbers)

17 Introduction

1.3 Usage areas of a petaflop computer

Engineering applications

• Wind tunnels
• Combustion engine optimisation
• High-frequency circuits
• Buildings and mechanical structures (optimisation, cost, shape, safety etc.)
• Effective and safe nuclear power station design
• Simulation of nuclear explosions

18 Introduction

1.3 Usage areas of a petaflop computer

Scientific applications

• Bioinformatics • Computational physics and chemistry (quantum mechanics, macromolecules, composite materials, chemical reactions etc).

• Astrophysics (development of galaxies, thermonuclear reactions, postprocessing of large datasets produced by telescopes)

• Weather modelling • Climate and environment modelling • Looking for minerals • Flood predictions

19 Introduction

1.3 Usage areas of a petaflop computer

Commercial applications

• Oil industry
• Market and monetary predictions
• Global economy modelling
• Car industry (e.g. crash tests)
• Aviation (design of modern aircraft and space shuttles)

Applications in computer systems

• Computer security (discovery of network attacks) • Cryptography

20 Introduction

• Embedded systems (for example, car) • etc

1.3 Usage areas of a petaflop computer

21 Introduction

1.4 Example 1.1

Why are speeds of the order of 10^12 flops not enough?

Weather forecast for Europe for the next 48 hours, from sea level up to 20 km: a PDE in x, y, z and t has to be solved.

The volume of the region: 5000 × 4000 × 20 km³. Step size in the xy-plane: 1 km (Cartesian mesh); in the z-direction: 0.1 km. Time step: 1 min. Around 1000 flop per time step. As a result, ca

    5000 × 4000 × 20 km³ × 10 cells per km³ = 4·10^9 mesh points

and

    4·10^9 mesh points × 48·60 time steps × 10^3 flop ≈ 1.15·10^15 flop.

If a computer can do 10^9 flops (today's PCs), it will take around

    1.15·10^15 flop / 10^9 flops = 1.15·10^6 seconds ≈ 13 days!!

22 Introduction

1.4 Example 1.1

But with 10^12 flops it takes only

    1.15·10^3 seconds < 20 min.

It is not hard to imagine "small" changes in the given scheme such that 10^12 flops is not enough:

• Reduce the mesh step size to 0.1 km in each direction and the time step to 10 seconds: the total time would grow from 20 min to 8 days.
• We could replace Europe with a whole-World model (area: 2·10^7 km² → 5·10^8 km²) and combine the model with an ocean model.

23 Introduction

1.4 Example 1.1

Therefore we must say: the needs of science and technology grow faster than the available capabilities – one only needs to change ε and h to become unsatisfied again!

But again, why do we need a parallel computer to achieve this goal?

24 Introduction

1.5 Example 1.2

Have a look at the following piece of code:

    do i = 1, 1000000000000
       z(i) = x(i) + y(i)    ! i.e. 3*10^12 memory accesses
    end do

Assuming that data travels at the speed of light, 3·10^8 m/s, finishing the operation in 1 second requires the average distance of a memory cell to be

    r = (3·10^8 m/s × 1 s) / (3·10^12) = 10^-4 m.

Typical computer memory is laid out on a rectangular mesh. The edge length of each memory cell would then be

    2·10^-4 m / sqrt(3·10^12) ≈ 10^-10 m

– the size of a small atom!

25 Introduction

1.5 Example 1.2

But why is parallel computing still not predominant? Three main reasons: hardware, algorithms and software.

Hardware: the speed of
• networking
• peripheral devices
• memory access
does not keep up with the growth in processor capacity.

Algorithms: an enormous number of different algorithms exist, but
• problems start with their implementation in real-life applications.

26 Introduction

Software: development in progress;
• no good enough automatic parallelisers
• everything is done by hand
• no easily portable libraries for different architectures
• does the parallelisation effort pay off at all?

BUT (as explained above) parallel computing will take over.

1.5 Example 1.2

27 Introduction

1.6 Example 1.3

Solving a sparse system of linear equations using MUMPS 4.3 (MUltifrontal Massively Parallel Solver) on the processors of the SUN computer class, for a medium-size problem (262144 unknowns, 1308672 nonzeros):

#procs   time (s)   speedup
1        84.3       –
2        63.0       1.34
4        53.6       1.57
8        27.0       3.12
12       59.3       1.47
16       29.5       2.86
20       57.5       1.47
23       74.0       1.14

28 Introduction

1.7 Example 1.4

Approximation of π using the Monte Carlo method

#procs   time (s)   speedup
1        107.8      –
2        53.5       2.01
4        26.9       4.00
8        13.5       7.98
12       9.3        11.59
16       6.8        15.85
20       5.5        19.59
21       9.0        11.97
23       29.2       3.69

29 Computer architectures and //

2 Computer architectures and parallelism

Various ways to classify computers:

• Architecture
  ◦ single processor computer
  ◦ multicore processor
  ◦ distributed system
  ◦ shared memory system
• operating system
  ◦ UNIX
  ◦ LINUX
  ◦ (OpenMosix)
  ◦ Plan 9

30 Computer architectures and //

  ◦ WIN*
  ◦ etc.
• usage area
  ◦ supercomputing
  ◦ distributed computing
  ◦ real-time systems
  ◦ mobile systems
  ◦ neural networks
  ◦ etc.

But it is impossible to ignore (implicit or explicit) parallelism in a computer or a set of computers.

31 // Architectures

2.1 Processor development

Part I: Parallel Architectures

2.1 Processor development

Instruction pipelining (similar to a car production line) – performing different sub-operations on moving objects; it increases the instruction throughput (the number of instructions that can be executed in a unit of time).

32 // Architectures

2.1 Processor development

Operation overlap. RISC processor pipeline stages:

• Instruction fetch • Instruction decode and register fetch

• Execute • Memory access • Register write back

Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in MEM stage, and the latest instruction has not yet entered the pipeline.

33 // Architectures

2.1 Processor development

Pipeline length? Limitations:

• the pipeline speed is defined by the slowest stage
• usually every 5th or 6th instruction is a branch
• the cost of a false branch prediction grows with the length of the pipeline (a larger number of sub-operations gets wasted)

One possibility: multiple pipelines (super-pipelined processors, superscalar execution) – in effect, execution of multiple instructions in a single cycle.

34 // Architectures Example 2.1: Two pipelines for adding 4 numbers:

2.1 Processor development

35 // Architectures

2.1 Processor development

True Data Dependency – the result of one operation is an input to another
Resource Dependency – two operations compete for the same resource (e.g. the floating point unit)
Branch Dependency (or Procedural Dependency) – scheduling past an if-statement is impossible before the branch is resolved

Instruction Level Parallelism (ILP)

• possibility for out-of-order instruction execution ◦ factor: size of instruction window • ILP is limited by: ◦ parallelism present in the instruction window ◦ hardware parameters – existence of different functional units

36 // Architectures

     – number of pipelines
     – pipeline lengths
     – pipeline properties
     – etc.
◦ it is not always possible to utilise all Functional Units (FUs)
  – if at a certain time step none of the FUs is utilised – vertical waste
  – if only some FUs are utilised – horizontal waste

• Typically, processors support 4-level superscalar execution

37 // Architectures

2.1 Processor development

Very Long Instruction Word (VLIW) Processors

• Main design concern with superscalar processors – complexity and high price • VLIW – based on compile-time analysis - which commands to bind together for ILP

• These commands get packed together into one (long) instruction (giving the name)

• First used in Multiflow Trace machine (ca 1984) • Intel IA64 - part of the concept being implemented

38 // Architectures

2.1 Processor development

VLIW properties

• Simpler hardware side • Compiler has more context to find ILP • But compiler lacks all the run-time information (e.g. data-misses in cache), therefore, only quite conservative approach possible (just-in-time compilers might have benefit here!)

• More difficult prediction of branching and memory usage • VLIW performance depends highly on compiler abilities, like ◦ loop unrolling ◦ speculative execution ◦ branch prediction etc. • 4-way to 8-way parallelism in VLIW processors

39 // Architectures

2.2 Memory performance effect

• The memory system, not processor speed, often turns out to be the bottleneck
• 2 main parameters:

latency – the time it takes from a request until the delivery from memory
bandwidth – the amount of data delivered per unit of time (between the processor and memory)

40 // Architectures

2.2 Memory performance effect

Example 2.2: Memory latency

• Simplified architecture: 1 GHz processor
• memory fetch time 100 ns (no cache), block size 1 word
• 2 multiply-add units, i.e. 4 multiplication or addition operations per clock cycle => peak performance 4 Gflops
• Suppose we need to perform a dot product. Each memory access takes 100 clock cycles ⇒ 1 flop per 100 ns ⇒ actual performance only 10 Mflops!
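A minimal sketch (my addition, not from the slides) of why the dot product above is memory bound: every element of a and b is loaded from memory exactly once and used in exactly one multiply-add, so there is no reuse a cache could exploit.

    program dot_latency
      implicit none
      integer, parameter :: n = 1000000
      real :: a(n), b(n), s
      integer :: i
      a = 1.0; b = 2.0
      s = 0.0
      do i = 1, n
         s = s + a(i)*b(i)    ! each operand comes straight from memory
      end do
      print *, 's =', s
    end program dot_latency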

41 // Architectures

2.2 Memory performance effect

Memory latency hiding with the help of cache

• Cache – a small but fast memory between the processor and the main memory (DRAM)
• works as low-latency, high-throughput storage
• helps to decrease the real memory latency only if there is enough reuse of data
• the fraction of accesses served by the cache without a main-memory access – the cache hit ratio
• often the hit ratio is a major factor for performance

42 // Architectures

2.2 Memory performance effect

Example 2.3: Cache effect

• Cache size: 32 KB, with latency 1 ns (main-memory latency 100 ns as before)
• operation C = AB, where A and B are 32 × 32 matrices (i.e. the matrices A, B and C all fit simultaneously into the cache)
• We observe that:
  ◦ reading the 2 input matrices into the cache (2K words) takes ca 200 µs
  ◦ multiplying two n × n matrices takes 2n^3 operations ⇒ 64K operations, which can be performed in 16K cycles (4 op/cycle), i.e. 16 µs
  ◦ ⇒ total time (data access + calculations) = 200 µs + 16 µs
  ◦ this gives 64K operations / 216 µs, or about 303 Mflops – a big improvement over the 10 Mflops without cache, though still well below the 4 Gflops peak
• Repeated access to the same data is called temporal locality

43 // Architectures

2.2 Memory performance effect

• In the given example there are O(n^2) data accesses and O(n^3) computations. Such an asymptotic difference is very favourable for a cache.
• Data reuse is of critical importance for performance (see the sketch below)!
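A minimal sketch (my addition) of exploiting that O(n^3)/O(n^2) ratio for matrices larger than the cache: multiply in bs × bs blocks so that each block of A, B and C is reused many times while it resides in the cache. The block size bs is an assumed tuning parameter.

    program blocked_matmul
      implicit none
      integer, parameter :: n = 512, bs = 32   ! bs chosen so three blocks fit in cache
      real :: a(n,n), b(n,n), c(n,n)
      integer :: i, j, k, ii, jj, kk
      call random_number(a); call random_number(b); c = 0.0
      do jj = 1, n, bs
         do kk = 1, n, bs
            do ii = 1, n, bs
               ! one bs x bs block: O(bs^3) flops on O(bs^2) data
               do j = jj, jj+bs-1
                  do k = kk, kk+bs-1
                     do i = ii, ii+bs-1
                        c(i,j) = c(i,j) + a(i,k)*b(k,j)
                     end do
                  end do
               end do
            end do
         end do
      end do
      print *, c(1,1)
    end program blocked_matmul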

44 // Architectures

2.3 Memory Bandwidth (MB) and performance

• MB is governed by the properties of the memory bus and the memory
• MB can be increased by enlarging the memory block accessed per request

Let the system spend l time units to move b data units, where

• l – system latency
• b – memory block size (also called the cache line)

Example 2.4: effect of memory block size – dot product of two vectors

• In Example 2.2 we had b = 1 word, giving 10 Mflops

45 // Architectures

2.3 Memory Bandwidth (MB) and performance

• Increasing b to 4 and assuming that the data is stored linearly in memory ⇒ 8 flops (4 multiplications and 4 additions) per 200 clock cycles, corresponding to 1 flop per 25 ns, i.e. a peak performance of 40 Mflops.

Observations:

• the block size does not affect the system latency
• a wide memory bus (like 4 words in the example) is expensive to implement in hardware
• usually memory access is pipelined
• therefore, with an MB increase the system peak performance increases
• locality of memory addresses IS important!
• it is good to reorder computations according to data location!

46 // Architectures

2.3 Memory Bandwidth (MB) and performance

Example 2.5: Effect of memory ordering

    real :: A(1000,1000), rowsum(1000)

Version A:

    do i = 1, 1000
       rowsum(i) = 0.0
       do j = 1, 1000
          rowsum(i) = rowsum(i) + A(i,j)
       enddo
    enddo

Version B:

    do i = 1, 1000
       rowsum(i) = 0.0
    enddo
    do j = 1, 1000
       do i = 1, 1000
          rowsum(i) = rowsum(i) + A(i,j)
       enddo
    enddo

• Which one is faster: A or B?

47 // Architectures

2.3 Memory Bandwidth (MB) and performance

48 // Architectures

2.4 Other approaches for hiding memory latency

• FORTRAN rule: two-dimensional arrays (matrices) are stored column-wise!
• For example, the elements of matrix B(1:3,1:2) are ordered in memory as: B(1,1), B(2,1), B(3,1), B(1,2), B(2,2), B(3,2) (the leftmost index changes fastest).
To avoid spending time on optimising for memory ordering, use Fortran 95 array syntax (leaving the optimisation to the compiler), as sketched below.
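A minimal sketch (my addition) of the same row-sum computation written with Fortran 95 array syntax; the intrinsic SUM with DIM is standard Fortran 90/95, and the loop ordering is left to the compiler.

    program rowsum_syntax
      implicit none
      real :: A(1000,1000), rowsum(1000)
      call random_number(A)
      rowsum = SUM(A, DIM=2)    ! sum over the second index: one value per row
      print *, rowsum(1)
    end program rowsum_syntax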

2.4 Other approaches for hiding memory latency

• Multithreading • Prefetching

49 // Architectures

2.4 Other approaches for hiding memory latency

Example 2.6: Multithreading

Preliminary code (C):

    for (i = 0; i < n; i++) {
        c[i] = dot_product(get_row(a, i), b);
    }

• each dot product appears to be independent, therefore it is possible to recode:

    for (i = 0; i < n; i++) {
        c[i] = create_thread(dot_product, get_row(a, i), b);
    }



50 // Architectures

2.4 Other approaches for hiding memory latency

• Possibility of performing a computation in every clock cycle
  ◦ prerequisites:
    – a memory system supporting multiple outstanding requests
    – a processor capable of switching from one thread to another in each clock cycle (e.g. machines like HEP and Tera)
  ◦ the program should explicitly specify the independence in the form of threads

51 // Architectures

2.4 Other approaches for hiding memory latency

Prefetching
• cache misses make programs wait
• why not take care in advance that the data is in the cache?
• drawbacks:
  ◦ need for larger caches
  ◦ if prefetched data happens to get overwritten before use, the situation is worse than before!

Drawbacks of latency-hiding techniques
In both cases, a better memory bandwidth between main memory and cache is needed!!! Each thread has a proportionally smaller part of the cache to use. =>
• while recovering from the latency problem, a bandwidth problem might occur!
• higher demands on hardware resources (cache size)

52 // Architectures

2.5 Classification of Parallel Computers

2.5.1 Granularity

In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronisation events.

• fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing

53 // Architectures

2.5 Classification of Parallel Computers

– Implies high communication overhead and less opportunity for performance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.

• operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently

54 // Architectures

2.5.2 Hardware

Pipelining (was used in supercomputers, e.g. the Cray-1). With N elements in the pipeline and L clock cycles per element, the calculation takes about L + N cycles; without pipelining it takes L·N cycles.



Example of good code for pipelining:

    do i = 1, k
       z(i) = x(i) + y(i)
    end do





55 // Architectures

2.5 Classification of Parallel Computers

Vector processors – fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but e.g. recursion is hard to optimise for vector processors. Example: Intel MMX – a simple vector unit.

Processor arrays – most often 2-dimensional arrays. For example MasPar MP-2, a massively parallel computer with 128x128 = 16384 processors, each with 64 KB of memory: each processor is connected to its neighbours, and the processors on the edges to the corresponding nodes on the opposite edge. The processors shared a common clock. Programming such a computer is quite specific; a special language, MPL, was used, and one had to think about communication between neighbours, or to all processors at once (which was slower).

56 // Architectures

2.5 Classification of Parallel Computers

Shared memory computers

Distributed systems (e.g. clusters) – the most widespread today.

One of the main questions about parallel hardware: do the processors share a common clock or not?

57 // Architectures

2.5 Classification of Parallel Computers

58 // Architectures

2.6 Flynn's classification

                      Instruction
                  Single     Multiple
Data   Single      SISD      (MISD)
       Multiple    SIMD       MIMD

Abbreviations: S - Single, M - Multiple, I - Instruction, D - Data; for example SIMD = Single Instruction, Multiple Data stream.

SISD - single instruction, single data stream (e.g. an ordinary PC)
SIMD - the same instructions applied to multiple data (example: MasPar)
MISD - the same data used to perform multiple operations; vector processors have sometimes been placed here, but the class is most often said to be empty
MIMD - separate data and separate instructions (example: a computer cluster)

59 // Architectures

2.6.1 SISD

• A serial (non-parallel) computer • Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle

• Single Data: Only one data stream is being used as input during any one clock cycle

• Deterministic execution • This is the oldest and even today, the most common type of computer • Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.

60 // Architectures

2.6.2 SIMD

• A type of parallel computer • Single Instruction: All processing units execute the same instruction at any given clock cycle

• Multiple Data: Each processing unit can operate on a different data element • Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.

• Synchronous (lockstep) and deterministic execution • Two varieties: Processor Arrays and Vector Pipelines • Examples:some early computers of this type: ◦ Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV

61 // Architectures

2.6 Flynn’s classification

◦ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10

• graphics cards • Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD instructions and execution units.

• possibility to switch off some processors (with mask arrays)

62 // Architectures

2.6.3 MISD

• A type of parallel computer • Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.

• Single Data: A single data stream is fed into multiple processing units. • Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).

• Some conceivable uses might be: ◦ multiple frequency filters operating on a single signal stream ◦ multiple cryptography algorithms attempting to crack a single coded message.

63 // Architectures

2.6.4 MIMD

• A type of parallel computer • Multiple Instruction: Every processor may be executing a different instruction stream

• Multiple Data: Every processor may be working with a different data stream • Execution can be synchronous or asynchronous, deterministic or nondeterministic

• Currently, the most common type of parallel computer - most modern supercomputers fall into this category.

• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

• Note: many MIMD architectures also include SIMD execution sub-components

64 // Architectures

2.6.5 Comparing SIMD with MIMD

• SIMD computers have fewer hardware units than MIMD (a single instruction unit)
• Nevertheless, as SIMD computers are specially designed, they tend to be expensive and time-consuming to develop
• not all applications are suitable for SIMD
• platforms supporting SPMD can be built from cheaper components

65 // Architectures

2.6.6 Flynn-Johnson classification

(picture by: Behrooz Parhami)

66 // Architectures

2.7 Type of memory access

2.7.1 Shared memory

• common shared memory
• a problem occurs when more than one process wants to write to (or read from) the same memory address
• shared memory programming models deal with these situations

67 // Architectures

2.7.2 Distributed memory

• Networked processors with their private memory

2.7.3 Hybrid memory models

• E.g. distributed shared memory, SGI Origin 2000

68 // Architectures

2.8 Communication model of parallel computers

2.8.1 Communication through a shared memory address space

• UMA (uniform memory access) • NUMA (non-uniform memory access) ◦ SGI Origin 2000 ◦ Sun Ultra HPC

69 // Architectures

2.8 Communication model of parallel computers

Comparing UMA and NUMA:

C – cache, P – processor, M – memory. Which of (a), (b) and (c) are UMA and which NUMA? ((a) and (b) are UMA, (c) is NUMA.)

70 // Architectures

2.8.2 Communication through messages

• using messaging libraries like MPI or PVM
• all processors
  ◦ are independent
  ◦ have their own private memory
  ◦ have a unique ID
• communication is performed by exchanging messages

71 // Architectures

2.9 Other classifications

2.9.1 Algorithm realisation

• using only hardware modules
• mixed modules (hardware and software)

2.9.2 Control type

1. synchronous
2. dataflow-driven
3. asynchronous

2.9 Other classifications

72 // Architectures

2.9.3 Network connection speed

• network bandwidth
  ◦ can be increased e.g. through channel bonding
• network latency
  ◦ the time from the source sending a message to the destination receiving it

Which one is easier to increase: bandwidth or latency? Bandwidth is easier to increase!

73 // Architectures

2.9.4 Network topology

• Bus-based networks
• Ring – one of the simplest topologies
• Array topology
  ◦ example: a cube (in case of a 3D array)
• Hypercube
  ◦ the longest route between 2 processors is P/2 in a ring, but only log P in a hypercube

Ring:

74 // Architectures

Hypercube:

2.9 Other classifications

75 // Architectures

2.9 Other classifications

Figure. How to construct a hypercube: duplicate the existing structure and connect the corresponding nodes (adding one bit to the node index). Problem: large number of connections per node. A hypercube is easy to emulate e.g. on a MasPar array in log n time (see the sketch below):
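A minimal sketch (my addition) of the bit-flip structure: in a d-dimensional hypercube with P = 2^d nodes, the neighbours of node r are obtained by flipping one bit of r, which is also why the longest route is d = log2 P hops.

    program hypercube_neighbours
      implicit none
      integer, parameter :: d = 3     ! 3-dimensional hypercube, P = 2**d = 8 nodes
      integer :: r, k
      do r = 0, 2**d - 1
         ! each neighbour differs from r in exactly one bit
         print '(a,i2,a,8i3)', 'node', r, ' neighbours:', (ieor(r, 2**k), k = 0, d-1)
      end do
    end program hypercube_neighbours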

76 // Architectures

2.9 Other classifications

77 // Architectures

2.9 Other classifications

• Star topology
  ◦ the speed depends very much on the switch properties (e.g. latency, bandwidth, backplane frequency) and its ability to cope with arbitrary communication patterns
• Clos-network – for example Myrinet (quite popular around 2005)
• cheaper but with higher latency: Gbit Ethernet (10 Gbit Ethernet)
• nowadays the most popular low-latency network type – Infiniband

78 // Architectures

2.10 Clos-networks

The theory of Clos networks originates from telephone networks – the paper: Charles Clos, "A Study of Non-Blocking Switching Networks", 1952.

2.10.1 Myrinet

Myrinet (http://www.myri.com) – a standardised network architecture. Aims:

• maximal throughput with scalability to a large number of processors
• low latency
• eliminate the need to allocate processors according to the network topology
  ◦ (historical remark: for the same reasons, RAM replaced sequential-access memory (disk, drum))

In general, a good network makes up about half of the cluster cost!

79 // Architectures Myrinet card and switch

128-port Myrinet Clos network switch

2.10 Clos-networks

80 // Architectures

2.10.2 Topological concepts

• Minimum bisection (minimaalne poolitus)
  Definition. The minimal number of connections between two arbitrary equally sized sets of nodes in the network.
• Describes the minimal throughput, independently of the communication pattern
• The upper bound for an N-node network is N/2 connections
• A network topology achieving this upper bound is said to have full bisection (täispoolitus)

81 // Architectures

Example. 8 nodes, the network switch has 5 ports:

What is the minimal bisection in this case? It is 1 – i.e. very bad.

82 // Architectures

But if we set the network up as follows:

Question: Does there exist full bisection? If yes, how?

2.10 Clos-networks

83 // Architectures

2.10 Clos-networks

Answer: the minimal bisection is 4. But not for every possible communication pattern! (find one!) => not a rearrangeable network

2.10.3 Rearrangeable networks (ümberjärjestatavad võrgud)

Definition. A network is called rearrangeable if it can route an arbitrary permutation of communication patterns (i.e. whichever node wants to talk to whichever other node, there is always a way for all of them to do it simultaneously).

Theorem. Every rearrangeable network has full bisection. (The proof is based on Hall's theorem.)

Is the opposite also true? No: a network can have full bisection without being rearrangeable.

84 // Architectures

2.10 Clos-networks

Example: an 8-node Clos network built of 4-port switches: "leaf" switches and "spine" switches.

Main rule: a leaf switch has as many connections to the nodes as it has to the spine switches (this enables rearrangeability). Therefore, Clos networks are:

• scalable up to a large number of nodes
• rearrangeable (it has been proved that for each permutation there exists a routing table)
• there exist multiple routes

85 // Architectures

2.10 Clos-networks

In case of too few ports on a switch, it is possible to compose a larger one:

86 // Architectures

2.10 Clos-networks

87 // Architectures

2.11 Beowulf clusters

• History
  ◦ 1993-1994 Donald Becker (NASA) – 16 486DX4 66 MHz processors, Ethernet channel bonding, COTS (Commodity Off The Shelf) components – created at CESDIS (The Center of Excellence in Space Data and Information Sciences); 74 MFlops (4.6 MFlops/node)
  ◦ Beowulf was originally the name of the project, not of the system. Dr Thomas Sterling: "What the hell, call it 'Beowulf.' No one will ever hear of it anyway."
  ◦ Beowulf – the first epic poem in English, based on a Scandinavian legend about a hero who defeated the monster Grendel with his courage and strength

88 // Architectures

2.11 Beowulf clusters

• Network choice: Ethernet 100 Mbps, Gigabit Ethernet, Infiniband, Myrinet (Scali, Pathscale, NUMAlink, etc.)
• OS – most often Linux
• Typical Beowulf cluster: Internet – Frontend – Switch – Nodes

89 // Architectures

• Building a cluster (rack/shelf)?

2.11 Beowulf clusters

90 // Architectures

2.11 Beowulf clusters

• console switch?
• which software?
• Security – the part of the system needing security measures (figure: Internet – Frontend – Switch – Nodes)
• Cluster administration
  ◦ ClusterMonkey
  ◦ Rocks
  ◦ Clustermagic
  ◦ etc.

91 // Performance

3 Parallel Performance

How to measure parallel performance?

3.1 Speedup

S(N, P) := t_seq(N) / t_par(N, P),

where t_seq(N) is the time for solving the given problem of size N with the best known sequential algorithm.

• In general, this is a different algorithm than the parallel one
  ◦ if the same algorithm is used (and a faster sequential one exists): Relative Speedup
• 0 < S(N, P) ≤ P
• If S(N, P) = P – linear speedup or optimal speedup (example: embarrassingly parallel algorithms)

92 // Performance

3.2 Efficiency

• It may happen that S(N, P) > P (swapping; cache effects) – superlinear speedup
• Often S(N, P) < 1 – is it still speedup? (It is a slowdown.)

3.2 Efficiency

E(N, P) := t_seq(N) / (P · t_par(N, P)).

Presumably, 0 < E(N, P) ≤ 1.

93 // Performance

3.3 Amdahl's law

In every algorithm there exist parts that cannot be parallelised.

• Let σ (0 < σ ≤ 1) be the sequential fraction
• Assume that the remaining part 1 − σ is parallelised optimally

Then, in the best case:

S(N, P) = t_seq / ((σ + (1 − σ)/P) · t_seq) = 1 / (σ + (1 − σ)/P) ≤ 1/σ.

94 // Performance

3.3 Amdahl’s law

Example 1. Assume 5% of the algorithm is not parallelisable (i.e. σ = 0.05). Then:

P      max S(N, P)
2      1.9
4      3.5
10     6.9
20     10.3
100    16.8
∞      20

Therefore, there is not much to gain at all with a huge number of processors!

Example 2. If σ = 0.67 (i.e. 33% is parallelisable), then with P = 10:

S(N, 10) = 1 / (0.67 + 0.33/10) ≈ 1.42

95 // Performance

3.3 Amdahl’s law

96 // Performance

3.4 Gustafson-Barsis' law

John Gustafson & Ed Barsis (Sandia Laboratories), 1988:

• with a 1024-processor nCube/10 they claimed: we beat Amdahl's law!
• their σ ≈ 0.004...0.008
• but they got S ≈ 1000
• (according to Amdahl's law, S could be at most 125...250)

How was it possible? Does Amdahl's law hold? Mathematically – yes. But in practice it is not a very good idea to solve a problem of fixed size N on an arbitrarily large number of processors! In general σ = σ(N) ≠ const; usually σ decreases as N grows. An algorithm is said to be effectively parallel if σ → 0 as N → ∞. To avoid misunderstandings, scaled efficiency is used:

97 // Performance

3.5 Scaled efficiency

E_S(N, P) := t_seq(N) / t_par(P · N, P)

• The problem size is increased proportionally when adding new processors – does the time remain the same?
• 0 < E_S(N, P) ≤ 1
• If E_S(N, P) = 1 – linear speedup

98 // Performance

3.6 Methods to increase efficiency

Factors influencing efficiency:

• communication time
• waiting time
• additional computations
• changing/improving the algorithm

Profiling parallel programs:
• MPE – jumpshot, LMPI, MpiP
• Totalview, Vampir, Allinea OPT
• Linux – gprof (compiler switch -pg)
• SUN – prof, gprof, prism
• many other commercial applications

3.6 Methods to increase efficiency

99 // Performance

3.6.1 Overlapping communication and computations

Example: parallel Ax operation for sparse matrices, par_sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/par_sparse_mat.f90.html) ("Fortran95 ja MPI", pp. 119-120). The matrix is partitioned and divided between the processors. Nonblocking communication is started first; while it proceeds, calculations are done on the interior part of the region => savings in waiting time.
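A minimal sketch (my own, not the course code referenced above) of this pattern for a 1-D partition: post nonblocking receives/sends for the boundary values, compute on the interior while the messages are in flight, then wait and finish the boundary points. The neighbour ranks, ghost-cell layout and the interior update are assumptions for illustration.

    subroutine overlap_sketch(u_local, n, left, right, comm)
      use mpi
      implicit none
      integer, intent(in) :: n, left, right, comm
      real(8), intent(inout) :: u_local(0:n+1)   ! includes ghost cells 0 and n+1
      integer :: req(4), ierr, i

      ! start nonblocking exchange of boundary values with the neighbours
      call MPI_Irecv(u_local(0),   1, MPI_DOUBLE_PRECISION, left,  0, comm, req(1), ierr)
      call MPI_Irecv(u_local(n+1), 1, MPI_DOUBLE_PRECISION, right, 1, comm, req(2), ierr)
      call MPI_Isend(u_local(1),   1, MPI_DOUBLE_PRECISION, left,  1, comm, req(3), ierr)
      call MPI_Isend(u_local(n),   1, MPI_DOUBLE_PRECISION, right, 0, comm, req(4), ierr)

      ! compute on the interior points 2..n-1 that need no ghost data
      do i = 2, n-1
         u_local(i) = 0.5d0*u_local(i)            ! placeholder for the real interior update
      end do

      ! wait for the boundary data, then update points 1 and n
      call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
      u_local(1) = 0.5d0*(u_local(0)   + u_local(2))
      u_local(n) = 0.5d0*(u_local(n-1) + u_local(n+1))
    end subroutine overlap_sketch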

3.6.2 Extra computations instead of communication

• Computations done locally instead of importing the results over the network
• Sometimes it pays off! Example: random number generation – broadcast only the seed and generate the numbers in parallel (a deterministic algorithm).
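A minimal sketch (my own) of that idea: rank 0 picks a seed, MPI_Bcast sends just that one integer, and every process then generates its numbers locally instead of receiving them over the network. The rank offset of the seed is an assumption to make the streams differ.

    program bcast_seed
      use mpi
      implicit none
      integer :: rank, ierr, seed_val, nseed
      integer, allocatable :: seed(:)
      real(8) :: r(1000)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      if (rank == 0) seed_val = 20121113            ! chosen by the root only
      call MPI_Bcast(seed_val, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

      ! seed the local generator (offset by rank so the streams differ)
      call random_seed(size=nseed)
      allocate(seed(nseed))
      seed = seed_val + rank
      call random_seed(put=seed)

      call random_number(r)                         ! generated locally, no communication
      print *, 'rank', rank, ' first number', r(1)
      call MPI_Finalize(ierr)
    end program bcast_seed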

100 // Performance

3.7 Benchmarks

The main HPC (High Performance Computing) benchmarks:

• NPB
• Linpack
• Perf
• IOzone
• Graph 500
• HINT

3.7.1 Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB)

MPI, 8 programs (IS, FT, MG, CG, LU, SP, BT, EP) derived from Computational Fluid Dynamics (CFD) codes

• Integer Sort (IS) – integer operations and communication speed; latency is of critical importance
• Fast Fourier Transform (FT) – remote communication performance; solution of a 3D PDE

101 // Performance

3.7 Benchmarks

• Multigrid (MG) – test of a well-structured communication pattern; 3D Poisson problem with constant coefficients
• Conjugate Gradient (CG) – irregular communication on an unstructured discretisation grid; finding the smallest eigenvalue of a large positive definite matrix
• Lower-Upper diagonal (LU) – blocking communication operations with small granularity, using SSOR (Symmetric Successive Over-Relaxation) to solve sparse lower and upper triangular systems with 5x5 blocks; the message length varies a lot
• Scalar pentadiagonal (SP) and block tridiagonal (BT) – balance test between computations and communication; needs a square number of processes; solving sets of diagonally non-dominant SP and BT systems (although similar, they differ in the ratio of communication to computation – SP has more communication than BT)
• Embarrassingly Parallel (EP) – Gaussian random-number generation with some specifics; shows the performance of the slowest processor
• + some new tests in version 3, good for networked multicore processors

102 // Performance

3.7.2 Linpack

Jack Dongarra. HPL – High Performance Linpack, using MPI and BLAS. Solving systems of linear equations with dense matrices. The aim is to fit a problem of maximal size (advisably, utilising 80% of memory).

• Used for http://www.top500.org
  ◦ Also, http://www.bbc.co.uk/news/10187248
• Rpeak – peak performance in Gflops
• Rmax – maximal achieved performance in Gflops
• N – size of the matrix giving the peak performance (usually ...)
• NB – block size

...

• in general, various locking mechanisms are needed
• a bit similar to the previous one (small granularity)
• benefit: no data ownership exists

111 //-programming models

4.4 Implicit Parallelism

• no process interaction is visible to the programmer • instead the compiler and/or runtime is responsible for performing it • most common with domain-specific languages where the concurrency within a problem can be more prescribed

112 //-programming models

4.4 Implicit Parallelism

Problem decomposition • Any parallel program is composed of simultaneously executing processes • problem decomposition relates to the way in which these processes are formulated.

• This classification may also be referred to as: ◦ algorithmic skeletons or ◦ parallel programming paradigms

113 //-programming models

4.1 Task parallel model

• focuses on processes, or threads of execution. These processes will often be behaviourally distinct, which emphasises the need for communication. Task parallelism is a natural way to express message-passing communication. It is usually classified as MIMD/MPMD (or MISD).

4.2 Data parallel model

• performing operations on a data set ◦ usually regularly structured in an array ◦ set of tasks will operate on this data – independently on separate partitions

• Data accessibility

114 //-programming models

4.2 Data parallel model

  ◦ in a shared memory system the data is accessible to all
  ◦ in a distributed-memory system the data is divided between the memories and worked on locally
• Data parallelism is usually classified as SIMD/SPMD
• Generally, the granularity is small
• the same operation on multiple computing devices
• usually needs some hints from the programmer on how the data is distributed between the processors
• communication/synchronisation is created automatically by the compiler. Example: HPF (High Performance Fortran)

115 Shared Memory

5 Shared memory programming model

The Dining Philosophers problem: 5 plates, 5 forks; one can eat only with 2 forks. Eat and think! Avoid starvation (dying of hunger – but not getting too fat either?). No deadlocks, no fights; fairness? ⇒ some locking mechanism is needed!

116 Shared Memory

5.1 PRAM (Parallel Random Access Machine) model

• consisting of P deterministic RAM processors (von Neumann machines) • processors work in parallel • communication and synchronisation through shared memory Variations

• EREW PRAM (Exclusive Read Exclusive Write) • CREW PRAM (Concurrent Read Exclusive Write) • CRCW PRAM (Concurrent Read Concurrent Write): ◦ common CRCW PRAM - write succeeds only if all have the same value ◦ arbitrary CRCW PRAM - arbitrary process succeeds ◦ priority CRCW PRAM - with predefined priorities

117 Shared Memory

5.2 OpenMP

An OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openMP/

Programming model:

• Thread-based parallelism
• A tool for the programmer (not an automatic paralleliser) – explicit parallelism
• Fork-Join model: execution starts with one process – the master thread; Fork, Join

118 Shared Memory

• Based on compiler directives; C/C++, Fortran*
• Nested parallelism is supported (although not all implementations may support it)
• Dynamic threads: the number of threads can be changed during the run of the program

5.2 OpenMP

119 Shared Memory

5.2 OpenMP

• Parallelising an existing sequential program often requires only small changes to the source code
• OpenMP has become a de facto standard

120 Shared Memory

OpenMP history:

• Standardisation started in 1994, restarted in 1997
• Oct 1997: Fortran version 1.0
• End of 1998: C/C++ version 1.0
• June 2000: Fortran version 2.0
• April 2002: C/C++ version 2.0
• May 2008: version 3.0 standard
• July 2011: version 3.1

5.2 OpenMP

121 Shared Memory

5.2 OpenMP

Example, Fortran:

      PROGRAM HELLO
      INTEGER VAR1, VAR2, VAR3
!     Serial part of the program
      ...
!     Parallel part of the program: new threads are created;
!     variable scopes are defined:
!$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)
!     The parallel region, executed by all threads:
      ...
!     All threads join the master thread; end of the parallel region:
!$OMP END PARALLEL
!     The serial part of the program resumes and finishes:
      ...
      END





122 Shared Memory

5.2 OpenMP

Näide: C/C++: 

# include main ( ) { i n t var1 , var2 , var3 ; Järjestikosa . . . Programmi p a r a l l e e l n e osa , uute lõimede tegemine ; muutujate skoopide määramine : #pragma omp p a r a l l e l p r i v a t e ( var1 , var2 ) shared ( var3 ) { P a r a l l e e l n e osa mis jookseb k õ i g i l lõimedel : . . . Kõik lõimed ühinevad pealõimega , p a r a l l e e l s e osa lõpp : } Järjestikprogrammi lõpetamine . . . }





123 Shared Memory

5.2 OpenMP

General rules:

• Comments cannot be put on the same line with OpenMP directives
• When compiling Fortran, a compiler switch usually has to be added so that the OpenMP comments are interpreted
• Many OpenMP commands are paired directives:

!$OMP directive
    [ program block ]
!$OMP end directive





124 Shared Memory

5.3 OpenMP directives: PARALLEL

!$OMP PARALLEL [clause ...]
      IF (scalar_logical_expression)
      PRIVATE (list)
      SHARED (list)
      DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)
      FIRSTPRIVATE (list)
      REDUCTION (operator : list)
      COPYIN (list)
      NUM_THREADS (scalar-integer-expression)

   block

!$OMP END PARALLEL



125 Shared Memory

5.3 OpenMP directives: PARALLEL

The number of threads can be set:

• with the NUM_THREADS clause
• with the library routine omp_set_num_threads()
• with the environment variable OMP_NUM_THREADS

Restrictions:

• A parallel region is a block in the program that does not span subprogram or file boundaries
• Unsynchronised I/O has unpredictable results
• Branching into or out of a parallel region is forbidden
• Only a single IF clause is allowed

126 Shared Memory

5.3 OpenMP directives: PARALLEL

Example: Hello World (F77)

      PROGRAM HELLO
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM
C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(TID)
C     Obtain and print thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
C     Only master thread does this
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads = ', NTHREADS
      END IF
C     All threads join master thread and disband
!$OMP END PARALLEL
      END





127 Shared Memory

The OpenMP work-sharing directives are: DO, SECTIONS and SINGLE

5.4 OpenMP work-sharing directives: DO

!$OMP DO [clause ...]
      SCHEDULE (type [, chunk])
      ORDERED
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      SHARED (list)
      REDUCTION (operator | intrinsic : list)
      COLLAPSE (n)

   do_loop

!$OMP END DO [ NOWAIT ]

128 Shared Memory

5.4 OpenMP work-sharing directives: DO

Clauses of the DO directive:

• SCHEDULE describes how the iterations are divided between the threads:
  ◦ STATIC – chunks of size chunk, or evenly divided if chunk is not given
  ◦ DYNAMIC – chunks of size chunk (default 1) are handed out dynamically; after finishing one chunk a thread gets a new one
  ◦ GUIDED – the chunk size is decreased exponentially
  ◦ RUNTIME – the schedule is decided at run time according to the environment variable OMP_SCHEDULE
  ◦ AUTO – the schedule is decided by the compiler and/or at run time
• ORDERED – the loop is executed in the same order as a sequential loop
• NOWAIT – after the end of the loop the threads do not synchronise but continue immediately with the code that follows
• COLLAPSE – how many nested loops are collapsed into a single iteration space

129 Shared Memory

5.4 OpenMP work-sharing directives: DO

Restrictions:

• The DO loop cannot be a DO WHILE loop; the iteration variable must be an integer
• The correctness of the program must not depend on which thread executes which part of the loop
• Branching into or out of the loop is not allowed
• The ORDERED and SCHEDULE clauses may appear only once

130 Shared Memory

5.4 OpenMP work-sharing directives: DO

Example of the DO directive:

      PROGRAM VEC_ADD_DO
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=1000)
      PARAMETER (CHUNKSIZE=100)
      REAL A(N), B(N), C(N)
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL SHARED(A,B,C) PRIVATE(I)
!     Arrays A, B, C will be shared by all threads
!     I -- private for each thread (each has a unique copy)
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
!     The iterations of the loop will be distributed
!     dynamically in CHUNK sized pieces
      DO I = 1, N
        C(I) = A(I) + B(I)
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
      END



131 Shared Memory

5.5 OpenMP work-sharing directives: SECTIONS

SECTIONS allows defining sections, each of which is executed by a separate thread. The PRIVATE, FIRSTPRIVATE, LASTPRIVATE and REDUCTION clauses can be used to define how the variables behave in the parallel part of the program. If the NOWAIT clause is used at the end, then after finishing its section a thread does not wait for the other threads to finish, but continues working immediately.

!$OMP SECTIONS [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      REDUCTION (operator | intrinsic : list)
!$OMP SECTION
   block
!$OMP SECTION
   block
!$OMP END SECTIONS [ NOWAIT ]

132 Shared Memory

5.5 OpenMP work-sharing directives: SECTIONS

Example: SECTIONS:

      PROGRAM VEC_ADD_SECTIONS
      INTEGER N, I
      PARAMETER (N=1000)
      REAL A(N), B(N), C(N)
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)
!$OMP SECTIONS
!$OMP SECTION
      DO I = 1, N/2
        C(I) = A(I) + B(I)
      ENDDO
!$OMP SECTION
      DO I = 1+N/2, N
        C(I) = A(I) + B(I)
      ENDDO
!$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL
      END



134 Shared Memory

5.6 OpenMP work-sharing directives: SINGLE

Only one thread executes the given block; needed e.g. for I/O and the like.

!$OMP SINGLE [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)
   block
!$OMP END SINGLE [ NOWAIT ]

Restrictions: branching into or out of a SINGLE block is not allowed.
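A minimal sketch (my addition) of a typical SINGLE use: one thread does the I/O while the others wait at the implicit barrier at END SINGLE, after which every thread uses the value.

      PROGRAM SINGLE_DEMO
      INTEGER N, TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL SHARED(N) PRIVATE(TID)
!$OMP SINGLE
C     Only one (arbitrary) thread performs the I/O
      PRINT *, 'Enter N:'
      READ  *, N
!$OMP END SINGLE
C     Implicit barrier: now every thread sees N
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'thread', TID, ' works with N =', N
!$OMP END PARALLEL
      END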



135 Shared Memory

5.7 Combining OpenMP work-sharing directives

PARALLEL SECTIONS, PARALLEL DO

NB! If an OpenMP directive becomes too long for one line, it can be split and continued:

!$OMP PARALLEL DO
!$OMP& SHARED(a)

Example: PARALLEL DO

136 Shared Memory

5.7 Combining OpenMP work-sharing directives

      PROGRAM VECTOR_ADD
      INTEGER N, I, CHUNKSIZE, CHUNK
      PARAMETER (N=1000)
      PARAMETER (CHUNKSIZE=100)
      REAL A(N), B(N), C(N)
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO
!$OMP& SHARED(A,B,C,CHUNK) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
      ENDDO
!$OMP END PARALLEL DO
      END





137 Shared Memory

5.8 OpenMP: synchronisation directives

Only the master thread executes the block:

!$OMP MASTER
   (block)
!$OMP END MASTER

In the same order (only one thread at a time) as in the non-parallelised program:

!$OMP ORDERED
   (block)
!$OMP END ORDERED

(usable only inside a DO or PARALLEL DO block)



138 Shared Memory

Only one thread at a time (one after another):

!$OMP CRITICAL [name]
   block
!$OMP END CRITICAL

Example CRITICAL: a program that counts how many threads are running. The value of X must not get corrupted, i.e. X may only be incremented by one thread at a time (this way the result is "protected"):

      PROGRAM CRITICAL
      INTEGER X
      X = 0
!$OMP PARALLEL SHARED(X)
!$OMP CRITICAL
      X = X + 1
!$OMP END CRITICAL
!$OMP END PARALLEL
      END

139 Shared Memory

5.8 OpenMP: synchronisation directives

Barrier:

!$OMP BARRIER

The ATOMIC directive (a mini critical section that applies only to the single following statement; it has its own restrictions, e.g. the statement cannot be a call):

!$OMP ATOMIC

The FLUSH directive – memory synchronisation:

!$OMP FLUSH (list)

(all threads write their local copies of the listed variables back to shared memory)

140 Shared Memory

5.9 OpenMP scoping clauses

PRIVATE – the variables are private to each thread
SHARED – the variables are shared by all threads
DEFAULT (PRIVATE | SHARED | NONE) – sets the default scope
FIRSTPRIVATE – private variables that are initialised with the value from before the region
LASTPRIVATE – after the loop, the shared variable gets the value from the last section or the last loop iteration
COPYIN – on entering the parallel part, the corresponding variables of all threads get the value of the master thread's variable
REDUCTION – the given operation is performed over all private copies of the given variable

141 Shared Memory

5.9 OpenMP scoping clauses

Example REDUCTION – the dot product of two large vectors. As we know, one needs to multiply the corresponding components of the vectors and then add up the results. We use a DO loop that multiplies and adds the product to the result. Such a program is easy to parallelise; the results of the different parallel chunks have to be added together at the end, and for that REDUCTION(operation : variable) is used:

      PROGRAM DOT_PRODUCT
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = I * 2.0
      ENDDO
      RESULT = 0.0
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)
      DO I = 1, N
        RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, 'Final Result = ', RESULT
      END



143 Shared Memory

5.9 OpenMP scoping clauses

Clauses accepted by the OpenMP directives:

Clause         PARALLEL  DO  SECTIONS  SINGLE  PARALLEL DO  PARALLEL SECTIONS
IF                x                                  x              x
PRIVATE           x      x       x        x          x              x
SHARED            x      x                           x              x
DEFAULT           x                                  x              x
FIRSTPRIVATE      x      x       x        x          x              x
LASTPRIVATE              x       x                   x              x
REDUCTION         x      x       x                   x              x
COPYIN            x                                  x              x
SCHEDULE                 x                           x
ORDERED                  x                           x
NOWAIT                   x       x        x

The following directives have no clauses:

MASTER, CRITICAL, BARRIER, ATOMIC, FLUSH, ORDERED, THREADPRIVATE

144 Shared Memory

5.10 OpenMP library routines

SUBROUTINE OMP_SET_NUM_THREADS(scalar_integer_expr)
INTEGER FUNCTION OMP_GET_NUM_THREADS()
INTEGER FUNCTION OMP_GET_MAX_THREADS()
INTEGER FUNCTION OMP_GET_THREAD_NUM()
INTEGER FUNCTION OMP_GET_NUM_PROCS()
LOGICAL FUNCTION OMP_IN_PARALLEL()
SUBROUTINE OMP_SET_DYNAMIC(scalar_logical_expression)
LOGICAL FUNCTION OMP_GET_DYNAMIC()
SUBROUTINE OMP_SET_NESTED(scalar_logical_expression)
LOGICAL FUNCTION OMP_GET_NESTED()
SUBROUTINE OMP_INIT_LOCK(var)
SUBROUTINE OMP_INIT_NEST_LOCK(var)
SUBROUTINE OMP_DESTROY_LOCK(var)
SUBROUTINE OMP_DESTROY_NEST_LOCK(var)
SUBROUTINE OMP_SET_LOCK(var)
SUBROUTINE OMP_SET_NEST_LOCK(var)
SUBROUTINE OMP_UNSET_LOCK(var)
SUBROUTINE OMP_UNSET_NEST_LOCK(var)
LOGICAL FUNCTION OMP_TEST_LOCK(var)
INTEGER FUNCTION OMP_TEST_NEST_LOCK(var)
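A minimal sketch (my addition) of using the lock routines listed above to protect a shared counter – an alternative to the CRITICAL section shown earlier; omp_lib.h and OMP_LOCK_KIND are part of the standard OpenMP Fortran API.

      PROGRAM LOCK_DEMO
      INCLUDE 'omp_lib.h'
      INTEGER (KIND=OMP_LOCK_KIND) LCK
      INTEGER X
      X = 0
      CALL OMP_INIT_LOCK(LCK)
!$OMP PARALLEL SHARED(X, LCK)
      CALL OMP_SET_LOCK(LCK)
C     only the current lock holder updates the shared counter
      X = X + 1
      CALL OMP_UNSET_LOCK(LCK)
!$OMP END PARALLEL
      CALL OMP_DESTROY_LOCK(LCK)
      PRINT *, 'Number of threads counted: ', X
      END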

5.10 OpenMP library routines

147 Shared Memory

5.11 OpenMP example programs

• mxmult.c (http://www.ut.ee/~eero/PC/naiteid/mxmult.c.html)
• omp_mm.f (http://www.ut.ee/~eero/PC/naiteid/omp_mm.f.html)
• omp_orphan.f (http://www.ut.ee/~eero/PC/naiteid/omp_orphan.f.html)
• omp_reduction.f (http://www.ut.ee/~eero/PC/naiteid/omp_reduction.f.html)
• omp_workshare1.f (http://www.ut.ee/~eero/PC/naiteid/omp_workshare1.f.html)
• omp_workshare2.f (http://www.ut.ee/~eero/PC/naiteid/omp_workshare2.f.html)

148 Fortran and HPF

6 Fortran and HPF

6.1 Fortran Evolution

• FORmula TRANslation
• first compiler 1957
• 1st official standard 1966 – Fortran66
• 2nd standard 1978 – Fortran77
• 1991 – Fortran90
• Fortran95
• Fortran2003
• Fortran2008

149 Fortran and HPF

High Performance Fortran

6.2 Concept

A Fortran90 extension:

• SPMD (Single Program Multiple Data) model
• each process operates on its own part of the data
• HPF commands specify which processor gets which part of the data
• concurrency is defined by HPF commands based on Fortran90
• HPF directives appear as comments: !HPF$

Most of the commands are declarative, concerning the distribution of data between the processes. The command INDEPENDENT (and its attribute NEW) is an exception, as it applies to executable statements.

150 Fortran and HPF

6.3 PROCESSORS declaration

determines a conceptual processor array (which need not reflect the actual hardware structure). Example:

!HPF$ PROCESSORS, DIMENSION(4)     :: P1
!HPF$ PROCESSORS, DIMENSION(2,2)   :: P2
!HPF$ PROCESSORS, DIMENSION(2,1,2) :: P3

• The number of processors needs to remain the same throughout a program
• If 2 different processor arrays have a similar shape, one can assume that the corresponding processors map to the same physical process
• !HPF$ PROCESSORS :: P – defines a scalar processor

• !HPF$ PROCESSORS :: P – defines a scalar processor

151 Fortran and HPF

6.4 DISTRIBUTE-directive

6.4 DISTRIBUTE-directive 

says how to deliver data between processors. Example:

REAL, REAL, ! HPF$ ! HPF$ ! HPF$



DIMENSION( 5 0 ) : : A DIMENSION( 1 0 , 1 0 ) : : B, C, D DISTRIBUTE (BLOCK) ONTO P1 : : A !1 −D DISTRIBUTE ( CYCLIC , CYCLIC ) ONTO P2 : : B , C !2 −D DISTRIBUTE D(BLOCK, ∗ ) ONTO P1 ! a l t e r n a t i v e s y n t a x

• array and processor array rank need to confirm • A(1:50) : P1(1:4) – each processor gets d50/4e = 13 elements except P1(4) which gets the rest (11 elements) • * – dimension to be ignored => in this example D distributed row-wise • CYCLIC – elements delivered one-by-one (as cards from pile between players)



152 Fortran and HPF

6.4 DISTRIBUTE-directive

• BLOCK - array elements are delivered block-wise (each block having elements from close range) Example (9 -element array delivered to 3 processors) : CYCLIC: 123123123 BLOCK: 111222333

153 Fortran and HPF

6.4 DISTRIBUTE-directive

Example: DISTRIBUTE block-wise:

      PROGRAM Chunks
      REAL, DIMENSION(20) :: A
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE (BLOCK) ONTO P :: A

Example: DISTRIBUTE cyclically:

      PROGRAM Round_Robin
      REAL, DIMENSION(20) :: A
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE (CYCLIC) ONTO P :: A

154 Fortran and HPF

6.4 DISTRIBUTE-directive

Example: DISTRIBUTE 2-dimensional layout: 



PROGRAM Skwiffy
  IMPLICIT NONE
  REAL, DIMENSION(4,4) :: A, B, C
!HPF$ PROCESSORS, DIMENSION(2,2) :: P
!HPF$ DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: A, B, C
  B = 1; C = 1; A = B + C
END PROGRAM Skwiffy



cyclic in one dimension; block-wise in other:

(11)(12)(11)(12) (11)(12)(11)(12) (21)(22)(21)(22) (21)(22)(21)(22)

155 Fortran and HPF

6.4 DISTRIBUTE-directive

Example: DISTRIBUTE *: 



PROGRAM Skwiffy
  IMPLICIT NONE
  REAL, DIMENSION(4,4) :: A, B, C
!HPF$ PROCESSORS, DIMENSION(4) :: Q
!HPF$ DISTRIBUTE (*,BLOCK) ONTO Q :: A, B, C
  B = 1; C = 1; A = B + C; PRINT *, A
END PROGRAM Skwiffy



Remarks about DISTRIBUTE

• Without ONTO default structure (given with program arguments, e.g.) • BLOCK – better if algorithm uses a lot of neighbouring elements in the array => less communication, faster

• CYCLIC – good for an even load distribution
• ignoring a dimension (with *) is good if calculations are done with a whole row or column

156 Fortran and HPF

6.5 Distribution of allocatable arrays

• all scalar variables are by default replicated; updating – compiler task

6.5

Distribution of allocatable arrays

is similar, except that delivery happens right after the memory allocation 



REAL, ALLOCATABLE, DIMENSION(:,:) :: A
INTEGER :: ierr
!HPF$ PROCESSORS, DIMENSION(10,10) :: P
!HPF$ DISTRIBUTE (BLOCK,CYCLIC) :: A
...
ALLOCATE(A(100,20), stat=ierr)
! A automatically distributed here
! block size in dim=1 is 10 elements
...
DEALLOCATE(A)
END



blocksize is determined right after ALLOCATE command

157 Fortran and HPF

6.6

6.6 HPF rule: Owner Calculates

HPF rule: Owner Calculates

Processor on the left of assignment performs the computations Example: 



DO i = 1, n
  a(i-1) = b(i*6) / c(i+j) - a(i**i)
END DO



• the calculation is performed by the process owning a(i-1). NOTE that the rule is not obligatory but only advisory to the compiler!
• the compiler may (in order to reduce communication in the program as a whole) leave only the final assignment to the processor owning a(i-1)

158 Fortran and HPF

6.7

6.7 Scalar variables

Scalar variables



REAL, DIMENSION(100,100) :: X
REAL :: Scal
!HPF$ DISTRIBUTE (BLOCK,BLOCK) :: X
....
Scal = X(i,j)
....



owner of X(i,j) assigns Scal value and sends to other processes (replication)



159 Fortran and HPF

6.8

6.8 Examples of good DISTRIBUTE subdivision

Examples of good DISTRIBUTE subdivision

Example: 



A(2:99)   = (A(:98) + A(3:)) / 2   ! neighbour calculations
B(22:56)  = 4.0*ATAN(1.0)          ! section of B calculated
C(:)      = SUM(D, DIM=1)          ! sum down a column



From the owner calculates rule we get: 

!HPF$ DISTRIBUTE (BLOCK)   ONTO P :: A
!HPF$ DISTRIBUTE (CYCLIC)  ONTO P :: B
!HPF$ DISTRIBUTE (BLOCK)   ONTO P :: C   ! or (CYCLIC)
!HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: D   ! or (*,CYCLIC)

160 Fortran and HPF

6.8 Examples of good DISTRIBUTE subdivision

Example (SSOR): 



DO j = 2, n-1
  DO i = 2, n-1
    a(i,j) = (omega/4)*(a(i,j-1) + a(i,j+1) + &
             a(i-1,j) + a(i+1,j)) + (1-omega)*a(i,j)
  END DO
END DO



Best is BLOCK-distribution in both dimensions
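A minimal sketch of the corresponding directives (the 2×2 processor shape and the value of n are assumptions chosen for illustration):

      INTEGER, PARAMETER :: n = 1000
      REAL, DIMENSION(n,n) :: a
!HPF$ PROCESSORS, DIMENSION(2,2) :: P
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO P :: a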

161 Fortran and HPF

6.9

6.9 HPF programming methodology

HPF programming methodology

• Need to find a balance between concurrency and communication
• the more processes, the more communication
• aim for
  ◦ a balanced load based on the owner-calculates rule
  ◦ data locality
• use array syntax and intrinsic functions on arrays
• avoid deprecated Fortran features (like assumed size and memory association)
It is easy to write a program in HPF but difficult to gain good efficiency. Programming in HPF goes more or less like this:
1. Write a correctly working serial program, test and debug it
2. Add distribution directives, introducing as little communication as possible

162 Fortran and HPF

6.9 HPF programming methodology

It is advisable to add INDEPENDENT directives (where semantically correct) and to perform data alignment (with the ALIGN directive). The first thing, however, is to choose a good parallel algorithm! Issues reducing efficiency:

• difficult indexing (may confuse the compiler about where particular elements are situated)
• array syntax is very powerful
  ◦ but complex constructions may make the compiler fail to optimise
• sequential loops should be left sequential (or replicated)
• redistribution of objects is time consuming

163 Fortran and HPF

6.9 HPF programming methodology

Additional advice:
• Use array syntax instead of loops where possible
• Use intrinsic functions where possible (for potentially better optimisation)
• Before parallelisation, think about whether the algorithm is parallelisable at all
• Use INDEPENDENT and ALIGN directives
• Block sizes can be specified in BLOCK(m) and CYCLIC(m) if you are sure the algorithm needs them for efficiency; otherwise it is better to leave the decision to the compiler

164 Fortran and HPF

6.10 BLOCK(m) and CYCLIC(m)

6.10 BLOCK(m) and CYCLIC(m) predefine the block size; in general they make the code less efficient due to more complex bookkeeping of ownership



REAL, DIMENSION(20) :: A, B
!HPF$ PROCESSORS, DIMENSION(4) :: P
!HPF$ DISTRIBUTE A(BLOCK(9)) ONTO P
!HPF$ DISTRIBUTE B(CYCLIC(2)) ONTO P

2D example: 

REAL, DIMENSION(4,9) :: A
!HPF$ PROCESSORS, DIMENSION(2) :: P
!HPF$ DISTRIBUTE (BLOCK(3),CYCLIC(2)) ONTO P :: A

165 Fortran and HPF

6.11

6.11 Array alignment

Array alignment

• improves data locality
• minimises communication
• distributes workload
Simplest example: A = B + C – with correct ALIGNment it needs no communication. Two ways:



!HPF$ ALIGN (:,:) WITH T(:,:) :: A, B, C

equivalent to: 

!HPF$ ALIGN A(:,:) WITH T(:,:)
!HPF$ ALIGN B(:,:) WITH T(:,:)
!HPF$ ALIGN C(:,:) WITH T(:,:)





166 Fortran and HPF

6.11 Array alignment

Example: 



REAL, DIMENSION(10) :: A, B, C
!HPF$ ALIGN (:) WITH C(:) :: A, B



– need only C be in DISTRIBUTE-command. Example, symbol instead of “:” : 

!HPF$ ALIGN (j) WITH C(j) :: A, B



means: for ∀j align arrays A and B with array C. (“:” instead of j – stronger requirement)



167 Fortran and HPF



6.11 Array alignment



!HPF$ ALIGN A(i,j) WITH B(i,j)



Example 2 (2-dimensional case): 



REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(:,:) WITH B(:,:)



is a stronger requirement than the index form above, since the ':' form assumes arrays of the same size. Good for performing operations like A = B + C + B*C (everything local!). Transposed alignment:



REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(i,:) WITH B(:,i)



(second dimension of A and first dimension of B with same length!)

168 Fortran and HPF 

Similarly:

6.11 Array alignment



REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(:,j) WITH B(j,:)



or: 



REAL, DIMENSION(10,10) :: A, B
!HPF$ ALIGN A(i,j) WITH B(j,i)



good to perform operation: 

A = A + TRANSPOSE(B)*A   ! everything local!





169 Fortran and HPF

6.12

6.12 Strided Alignment

Strided Alignment

Example: Align elements of matrix D with every second element in E: 



REAL, DIMENSION(5)  :: D
REAL, DIMENSION(10) :: E
!HPF$ ALIGN D(:) WITH E(1::2)



could be written also: 



!HPF$ ALIGN D(i) WITH E(i*2-1)

 

Operation:

D = D + E(::2) ! local





170 Fortran and HPF 

Example: reverse strided alignment:

6.12 Strided Alignment



REAL, DIMENSION(5)  :: D
REAL, DIMENSION(10) :: E
!HPF$ ALIGN D(:) WITH E(UBOUND(E)::-2)



could be written also: 

!HPF$ ALIGN D(i) WITH E(2+UBOUND(E)-i*2)





171 Fortran and HPF

6.13

6.13 Example on Alignment

Example on Alignment





PROGRAM Warty
  IMPLICIT NONE
  REAL, DIMENSION(4) :: C
  REAL, DIMENSION(8) :: D
  REAL, DIMENSION(2) :: E
  C = 1; D = 2
  E = D(::4) + C(::2)
END PROGRAM Warty



minimal (0) communication is achieved with: 

!HPF$ ALIGN C(:) WITH D(::2)
!HPF$ ALIGN E(:) WITH D(::4)
!HPF$ DISTRIBUTE (BLOCK) :: D





172 Fortran and HPF

6.14

6.14 Alignment with Allocatable Arrays

Alignment with Allocatable Arrays

• alignment is performed together with memory allocation
• an existing object cannot be aligned with a not-yet-allocated object
Example:



REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(:) WITH B(:)



then 



ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(100), stat=ierr)



is OK 



ALLOCATE(B(100), A(100), stat=ierr)



also OK (allocation starts from left),

173 Fortran and HPF

6.14 Alignment with Allocatable Arrays

but, 



ALLOCATE(A(100), stat=ierr)
ALLOCATE(B(100), stat=ierr)



or 



ALLOCATE(A(100), B(100), stat=ierr)



give error! Simple array cannot be aligned with allocatable: 

REAL, DIMENSION(:) :: X
REAL, DIMENSION(:), ALLOCATABLE :: A
!HPF$ ALIGN X(:) WITH A(:)   ! ERROR





174 Fortran and HPF 

6.14 Alignment with Allocatable Arrays

One more problem:



REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(:) WITH B(:)
ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(50), stat=ierr)



”:” says that A and B should be with same length (but they are not!) But this is OK:



REAL, DIMENSION(:), ALLOCATABLE :: A, B
!HPF$ ALIGN A(i) WITH B(i)
ALLOCATE(B(100), stat=ierr)
ALLOCATE(A(50), stat=ierr)



– still: A cannot be larger than B.



175 Fortran and HPF

6.15 Dimension collapsing

With one element of the alignment target it is possible to align one or several dimensions:

!HPF$ ALIGN (*,:) WITH Y(:) :: X

– each element from Y aligned to a column from X (First dimension of matrix X being collapsed)

6.16

Dimension replication





!HPF$ ALIGN Y(:) WITH X(*,:)



– each processor getting arbitrary row X(:,i) gets also a copy of Y(i)

176 Fortran and HPF

6.16 Dimension replication

Example: 2D Gauss elimination kernel of the program: 



...
DO j = i+1, n
  A(j,i) = A(j,i) / Swap(i)
  A(j,i+1:n) = A(j,i+1:n) - A(j,i)*Swap(i+1:n)
  Y(j) = Y(j) - A(j,i)*Temp
END DO



Y(k) together with A(k,i) => 



!HPF$ ALIGN Y(:) WITH A(:,*)



Swap(k) together with A(i,k) => 



!HPF$ ALIGN Swap(:) WITH A(*,:)

No neighbouring elements of matrix A appear in the same expression => CYCLIC:

177 Fortran and HPF 

!HPF$ DISTRIBUTE A(CYCLIC,CYCLIC)



6.16 Dimension replication



178 Fortran and HPF 

6.16 Dimension replication

Example: matrix multiplication

PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION(N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!     replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!     replicate copies of column B(*,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE (*,*) C
END





179 Fortran and HPF

6.17

6.17 HPF Intrinsic Functions

HPF Intrinsic Functions

NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE – information about physical hardware needed for portability: 



!HPF$ PROCESSORS P1(NUMBER_OF_PROCESSORS())
!HPF$ PROCESSORS P2(4, 4, NUMBER_OF_PROCESSORS()/16)
!HPF$ PROCESSORS P3(0:NUMBER_OF_PROCESSORS(1)-1, &
!HPF$               0:NUMBER_OF_PROCESSORS(2)-1)



2048-processor hypercube:



PRINT *, PROCESSORS_SHAPE()



would return:

2 2 2 2 2 2 2 2 2 2 2

180 Fortran and HPF

6.18

6.18 HPF Template Syntax

HPF Template Syntax

TEMPLATE – a conceptual object that does not use any RAM, defined statically (like a zero-sized array that is never assigned to). Templates
• are declared
• distributed
• can be used to align arrays
Example:



REAL, DIMENSION(10) :: A, B
!HPF$ TEMPLATE, DIMENSION(10) :: T
!HPF$ DISTRIBUTE (BLOCK) :: T
!HPF$ ALIGN (:) WITH T(:) :: A, B

(here only T may be argument to DISTRIBUTE) Combined TEMPLATE directive:



181 Fortran and HPF

6.18 HPF Template Syntax





!HPF$ TEMPLATE, DIMENSION(100,100), &
!HPF$   DISTRIBUTE (BLOCK,CYCLIC) ONTO P :: T
!HPF$ ALIGN A(:,:) WITH T(:,:)



which is equivalent to: 

!HPF$ TEMPLATE, DIMENSION(100,100) :: T
!HPF$ ALIGN A(:,:) WITH T(:,:)
!HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO P





182 Fortran and HPF

6.18 HPF Template Syntax

Example: 



PROGRAM Warty
  IMPLICIT NONE
  REAL, DIMENSION(4) :: C
  REAL, DIMENSION(8) :: D
  REAL, DIMENSION(2) :: E
!HPF$ TEMPLATE, DIMENSION(8) :: T
!HPF$ ALIGN D(:) WITH T(:)
!HPF$ ALIGN C(:) WITH T(::2)
!HPF$ ALIGN E(:) WITH T(::4)
!HPF$ DISTRIBUTE (BLOCK) :: T
  C = 1; D = 2
  E = D(::4) + C(::2)
END PROGRAM Warty



(similar to the example of alignment with stride)

183 Fortran and HPF

6.18 HPF Template Syntax

More examples on using Templates:

• ALIGN A(:) WITH T1(:,*) – ∀i, element A(i) is replicated along row T1(i,:)
• ALIGN C(i,j) WITH T2(j,i) – the transpose of C is aligned with T2
• ALIGN B(:,*) WITH T3(:) – ∀i, row B(i,:) is aligned with template element T3(i)
• DISTRIBUTE (BLOCK,CYCLIC) :: T1, T2
• DISTRIBUTE T1(CYCLIC,*) ONTO P – rows of T1 are distributed cyclically

184 Fortran and HPF

6.19 FORALL

6.19 FORALL

Syntax:

FORALL (forall-triplet-spec-list [, scalar-mask-expr]) assignment

Example:

FORALL (i=1:n, j=1:m, A(i,j) .NE. 0) A(i,j) = 1/A(i,j)





185 Fortran and HPF

6.19 FORALL

Circumstances, where Fortran90 syntax is not enough, but FORALL makes it simple:

• index expressions: 



FORALL (i=1:n, j=1:n, i /= j) A(i,j) = REAL(i+j)



• intrinsic or PURE-functions (which have no side-effects): 



FORALL (i=1:n:3, j=1:n:5) A(i,j) = SIN(A(j,i))

• subindexing: 

FORALL (i=1:n, j=1:n) A(VS(i),j) = i + VS(j)





186 Fortran and HPF

6.19 FORALL

• unusual parts can be accessed: 



FORALL (i=1:n) A(i,i) = B(i)    ! diagonal
! ...
DO j = 1, n
  FORALL (i=1:j) A(i,j) = B(i)  ! triangular
END DO

 

To parallelise add before DO also:



!HPF$ INDEPENDENT, NEW(i)



or write nested FORALL commands: 

FORALL (j=1:n)
  FORALL (i=1:j) A(i,j) = B(i)  ! triangular
END FORALL





187 Fortran and HPF

FORALL execution:
1. evaluation of the triplet list
2. evaluation of the scalar mask
3. for each element where the mask is .TRUE., evaluation of the right-hand side
4. assignment of the right-hand side values to the left-hand side
In HPF there is synchronisation between the steps!
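A small illustration of why this ordering matters (assuming a real array A(1:n)): in a FORALL all right-hand sides are evaluated before any assignment, so

      FORALL (i = 2:n) A(i) = A(i-1)   ! clean shift: every A(i-1) is the old value

whereas the corresponding DO loop

      DO i = 2, n
        A(i) = A(i-1)                  ! NOT equivalent: A(1) propagates into all elements
      END DO

keeps overwriting values it is about to read.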

6.19 FORALL

188 Fortran and HPF

6.20 PURE-procedures

6.20 PURE-procedures 



PURE REAL FUNCTION F(x, y)
PURE SUBROUTINE G(x, y, z)



without side effects, i.e.:

• no external I/O and no ALLOCATE (communication is still allowed)
• does not change the global state of the program
• intrinsic functions are PURE!
• can be used in FORALL and in other PURE procedures
• no PAUSE or STOP
• FUNCTION formal parameters must have the INTENT(IN) attribute

189 Fortran and HPF

6.20 PURE-procedures

Example (function): 



PURE REAL FUNCTION F(x, y)
  IMPLICIT NONE
  REAL, INTENT(IN) :: x, y
  F = x*x + y*y + 2*x*y + ASIN(MIN(x/y, y/x))
END FUNCTION F



example of usage: 

FORALL (i=1:n, j=1:n) &
  A(i,j) = b(i) + F(1.0*i, 1.0*j)





190 Fortran and HPF

6.20 PURE-procedures

Example (subroutine): 

PURE SUBROUTINE G(x, y, z)
  IMPLICIT NONE
  REAL, INTENT(OUT), DIMENSION(:) :: z
  REAL, INTENT(IN),  DIMENSION(:) :: x, y
  INTEGER :: i
  INTERFACE
    PURE REAL FUNCTION F(x, y)   ! F must be declared PURE to be referenced here
      REAL, INTENT(IN) :: x, y
    END FUNCTION F
  END INTERFACE
  ! ...
  FORALL (i=1:SIZE(z)) z(i) = F(x(i), y(i))
END SUBROUTINE G





191 Fortran and HPF

6.20 PURE-procedures

“MIMD” example: 

REAL FUNCTION F(x, i)   ! PURE
  IMPLICIT NONE
  REAL, INTENT(IN)    :: x   ! element
  INTEGER, INTENT(IN) :: i   ! index
  IF (x > 0.0) THEN
    F = x*x
  ELSEIF (i == 1 .OR. i == n) THEN
    F = 0.0
  ELSE
    F = x
  END IF
END FUNCTION F





192 Fortran and HPF

6.21 INDEPENDENT

6.21 INDEPENDENT Directly in front of DO or FORALL 



!HPF$ INDEPENDENT
DO i = 1, n
  x(i) = i**2
END DO



In front of FORALL: no synchronisation is needed between the evaluation of the right-hand side expressions and the assignment.
If an INDEPENDENT loop...
• ...assigns more than once to the same element, parallelism is lost!
• ...includes an EXIT, STOP or PAUSE command, the iterations need to execute sequentially to be sure of ending at the right iteration
• ...has jumps out of the loop or I/O => sequential execution

193 Fortran and HPF 

Is independent:

6.21 INDEPENDENT



!HPF$ INDEPENDENT
DO i = 1, n
  b(i) = b(i) + b(i)
END DO

 

Not independent:



DO i = 1, n
  b(i) = b(i+1) + b(i)
END DO



Not independent: 

DO i = 1, n
  b(i) = b(i-1) + b(i)
END DO





194 Fortran and HPF

6.21 INDEPENDENT

This is independent loop: 



!HPF$ INDEPENDENT
DO i = 1, n
  a(i) = b(i-1) + b(i)
END DO



Question to ask: does a later iteration depend on a previous one?

195 Fortran and HPF

6.22 INDEPENDENT NEW command

6.22 INDEPENDENT NEW command – create an independent variable on each process! 

!HPF$ INDEPENDENT, NEW(s1, s2)
DO i = 1, n
  s1 = SIN(a(i))
  s2 = COS(a(i))
  a(i) = s1*s1 - s2*s2
END DO





196 Fortran and HPF

6.22 INDEPENDENT NEW command

Rules for NEW-variable:

• cannot be used outside the loop without redefinition
• cannot be used in FORALL
• not pointers or formal parameters
• cannot have the SAVE attribute
Disallowed:

!HPF$ INDEPENDENT, NEW(s1, s2)
DO i = 1, n
  s1 = SIN(a(i))
  s2 = COS(a(i))
  a(i) = s1*s1 - s2*s2
END DO
k = s1 + s2   ! not allowed!





197 Fortran and HPF

6.23 EXTRINSIC

Example (only outer cycles can be executed independently): 

!HPF$ INDEPENDENT, NEW(i2)
DO i1 = 1, n1
!HPF$ INDEPENDENT, NEW(i3)
  DO i2 = 1, n2
!HPF$ INDEPENDENT, NEW(i4)
    DO i3 = 1, n3
      DO i4 = 1, n4
        a(i1,i2,i3) = a(i1,i2,i3) &
                    + b(i1,i2,i4) * c(i2,i3,i4)
      END DO
    END DO
  END DO
END DO



6.23 EXTRINSIC Used in INTERFACE-commands to declare that a routine does not belong to HPF



198 Fortran and HPF

7

MPI

See separate slides on course homepage!

6.23 EXTRINSIC

199 // Alg.Design

Part III

Parallel Algorithms

8 Parallel Algorithm Design Principles

• Identifying portions of the work that can be performed concurrently
• Mapping the concurrent pieces of work onto multiple processes running in parallel
• Distributing the input, output and also intermediate data associated with the program
• Access management of data shared by multiple processes
• Process synchronisation in various stages in parallel program execution

200 // Alg.Design

8.1

8.1 Decomposition, Tasks and Dependency Graphs

Decomposition, Tasks and Dependency Graphs

• Subdividing calculations into smaller components is called decomposition. Example: dense matrix-vector multiplication y = Ab,

  y[i] = ∑_{j=1}^{n} A[i,j] b[j]

([1])

• Computational Problem × Decomposition →Tasks

201 // Alg.Design

8.1 Decomposition, Tasks and Dependency Graphs

◦ in case of y = Ab tasks – calculation of each row – independent

([1]) Tasks and their relative order abstraction:

• Task Dependency Graph ◦ directed acyclic graph – nodes – tasks

202 // Alg.Design

8.1 Decomposition, Tasks and Dependency Graphs

– directed edges – dependences

◦ (can be disconnected) ◦ (edge-set can be empty) • Mapping tasks onto processors • Processors vs. processes

203 // Alg.Design

8.2

Decomposition Techniques

• Recursive decomposition

Example: Quicksort

8.2 Decomposition Techniques

204 // Alg.Design

• Data decomposition ◦ Input data ◦ Output data ◦ intermediate results ◦ Owner calculates rule • Exploratory decomposition Example: 15-square game

• Speculative decomposition • Hybrid decomposition

8.2 Decomposition Techniques

205 // Alg.Design

8.3

Tasks and Interactions

• Task generation ◦ static ◦ dynamic • Task sizes ◦ uniform ◦ non-uniform ◦ knowledge of task sizes ◦ size of data associated with tasks

8.3 Tasks and Interactions

206 // Alg.Design

• Characteristics of Inter-Task Interactions ◦ static versus dynamic ◦ regular versus irregular ◦ read-only versus read-write ◦ one-way versus two-way

8.3 Tasks and Interactions

207 // Alg.Design

8.4

8.4 Mapping Techniques for Load balancing

Mapping Techniques for Load balancing

• Static mapping ◦ Mappings based on data partitioning – array distribution schemes

· block distribution · cyclic and block-cyclic distribution · randomised block distribution – Graph partitioning

208 // Alg.Design

8.4 Mapping Techniques for Load balancing

• Dynamic mapping schemes ◦ centralised schemes – master-slave – self scheduling

· single task scheduling · chunk scheduling ◦ distributed schemes – each process can send/receive work from whichever process

· · · ·

how are sending receiving processes paired together? initiator – sender or receiver? how much work exchanged? when the work transfer is performed?

209 // Alg.Design

8.4 Mapping Techniques for Load balancing

Methods for reducing interaction overheads

• maximising data locality ◦ minimise volume of data exchange ◦ minimise frequency of interactions • overlapping computations with interactions • replicating data or computation • overlapping interactions with other interactions

210 // Alg.Design

8.5

8.5 Parallel Algorithm Models

Parallel Algorithm Models

• data-parallel model • task graph model ◦ typically amount of data relatively large to the amount of computation • work pool model ◦ or task pool model ◦ pool can be – centralised – distributed – statically/dynamically created,

• master-slave model

211 // Alg.Design

8.5 Parallel Algorithm Models

• pipeline (or producer-consumer) model
  ◦ stream parallelism
  ◦ a stream of data triggers computations
  ◦ a pipeline is a chain of producer-consumer processes
  ◦ the shape of the pipeline can be:
    – a linear or multidimensional array
    – trees
    – general graphs with or without cycles
• Bulk-synchronous parallel (BSP) model
  ◦ synchronisation steps needed in a regular or irregular pattern
    – p2p synchronisation
    – and/or global synchronisation

212 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

An example of BSP-model-based parallelisation:

9 Solving systems of linear equations with sparse matrices

9.1

EXAMPLE: 2D FEM for Poisson equation

Consider as an example Poisson equation

  −∆u(x) = f(x), ∀x ∈ Ω,
   u(x) = g(x), ∀x ∈ Γ,

where the Laplacian ∆ is defined by

  (∆u)(x) = (∂²u/∂x² + ∂²u/∂y²)(x),   x = (x, y)



213 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

u can be e.g.: • temperature

• displacement of an elastic membrane fixed at the boundary under a transversal load of intensity

• electro-magnetic potential

Divergence theorem:

  ∫_Ω div F dx = ∫_Γ F · n ds,

where F = (F₁, F₂) is a vector-valued function defined on Ω,

  div F = ∂F₁/∂x₁ + ∂F₂/∂x₂,

and n = (n₁, n₂) is the outward unit normal to Γ. Here dx is the element of area in R²; ds is the arc length along Γ. Applying the divergence theorem to F = (vw, 0) and F = (0, vw) (details not shown here) gives Green's first identity

  ∫_Ω ∇u · ∇v dx = ∫_Γ v (∂u/∂n) ds − ∫_Ω v ∆u dx    (1)



214 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

we come to a variational formulation of the Poisson problem on V:

  V = {v : v is continuous on Ω, ∂v/∂x₁ and ∂v/∂x₂ are piecewise continuous on Ω, and v = 0 on Γ}

Poisson problem in Variational Formulation: Find u ∈ V such that

  a(u, v) = (f, v),   ∀v ∈ V    (2)

where in the case of the Poisson equation

  a(u, v) = ∫_Ω ∇u · ∇v dx

and (u, v) = ∫_Ω u(x)v(x) dx. The gradient ∇ of a scalar function f(x, y) is defined by:

  ∇f = (∂f/∂x, ∂f/∂y)



215 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

Again, the variational formulation of the Poisson equation is:

  ∫_Ω (∂u/∂x ∂v/∂x + ∂u/∂y ∂v/∂y) dx = ∫_Ω f v dx

With discretisation, we replace the space V with Vh – space of piecewise linear functions. Each function in Vh can be written as

  v_h(x, y) = ∑_i η_i ϕ_i(x, y),   where ϕ_i(x, y) are the basis functions (or 'hat' functions)

[Figure: a piecewise linear function v_h and a hat basis function ϕ_i at node N_i of the triangulation T of Ω]

216 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

With 2D FEM we demand that the equation in the variational formulation is satisfied for the M basis functions ϕ_i ∈ V_h, i.e.

  ∫_Ω ∇u_h · ∇ϕ_i dx = ∫_Ω f ϕ_i dx,   i = 1, ..., M

But we have:

  u_h = ∑_{j=1}^{M} ξ_j ϕ_j   =⇒   ∇u_h = ∑_{j=1}^{M} ξ_j ∇ϕ_j

giving M linear equations with respect to the unknowns ξ_j:

  ∫_Ω ( ∑_{j=1}^{M} ξ_j ∇ϕ_j ) · ∇ϕ_i dx = ∫_Ω f ϕ_i dx,   i = 1, ..., M

217 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

The stiffness matrix A = (a_ij) elements and the right-hand side b = (b_i) are calculated as:

  a_ij = a(ϕ_i, ϕ_j) = ∫_Ω ∇ϕ_i · ∇ϕ_j dx

  b_i = (f, ϕ_i) = ∫_Ω f ϕ_i dx

The integrals need to be computed only where the pairs ∇ϕ_i, ∇ϕ_j touch (have mutual support). Example: two basis functions ϕ_i and ϕ_j for nodes N_i and N_j. Their common support is τ ∪ τ′, so that

  a_ij = ∫_Ω ∇ϕ_i · ∇ϕ_j dx = ∫_τ ∇ϕ_i · ∇ϕ_j dx + ∫_{τ′} ∇ϕ_i · ∇ϕ_j dx

218 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

[Figure: basis functions ϕ_i and ϕ_j on neighbouring triangles τ and τ′ with vertices N_i, N_j, N_k, and the element contribution a^τ_{k,j}]

Element matrices. Consider a single element τ. Pick two basis functions ϕ_i and ϕ_j (out of three). ϕ_k is piecewise linear, so denoting p_k(x, y) = ϕ_k|_τ:

  p_i(x, y) = α_i x + β_i y + γ_i,   p_j(x, y) = α_j x + β_j y + γ_j

=⇒ ∇ϕ_i|_τ = (∂p_i/∂x, ∂p_i/∂y) = (α_i, β_i) =⇒ their dot product

  ∫_τ ∇ϕ_i · ∇ϕ_j dx = ∫_τ (α_i α_j + β_i β_j) dx.

219 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

Finding the coefficients α and β: insert the three points (x_i, y_i, 1), (x_j, y_j, 0), (x_k, y_k, 0) into the plane equation and solve the system

  D (α_i, β_i, γ_i)ᵀ = [ x_i y_i 1 ; x_j y_j 1 ; x_k y_k 1 ] (α_i, β_i, γ_i)ᵀ = (1, 0, 0)ᵀ

The integral is obtained by multiplying with the triangle area ½|det D|:

  a^τ_ij = ∫_τ ∇ϕ_i · ∇ϕ_j dx = (α_i α_j + β_i β_j) · ½ |det D|

The element matrix for τ is

  A^τ = [ a^τ_ii  a^τ_ij  a^τ_ik ;  a^τ_ji  a^τ_jj  a^τ_jk ;  a^τ_ki  a^τ_kj  a^τ_kk ]

220 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

Assembled stiffness matrix

• created by adding appropriately all the element matrices together • Different types of boundary values used in the problem setup result in slightly different stiffness matrices

• Most typical boundary conditions: ◦ Dirichlet’ ◦ Neumann • but also: ◦ free boundary condition ◦ Robin boundary condition

221 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

Right-hand side b: b_i = ∫_Ω f ϕ_i dx. The integral is calculated approximately through the value of f in the middle of the triangle τ:

  f^τ = f( (x_i + x_j + x_k)/3, (y_i + y_j + y_k)/3 )

• the support is the area of the triangle
• ϕ_i is a pyramid (integrating ϕ_i over τ gives the pyramid volume)

  b^τ_i = ∫_τ f^τ ϕ_i dx = (1/6) f^τ |det D^τ|

• Assembling procedure for the RHS b values from bτi (somewhat similarly to matrix assembly, but more simple)

• Introduction of Dirichlet boundary condition just eliminates the rows.

222 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

EXAMPLE

Consider the unit square. [Figure: uniform triangulation of the unit square, x and y axes]

V_h – the space of piecewise linear functions on Ω with zero on Γ and values defined on the inner nodes.

Discretised problem in Variational Formulation: Find u_h ∈ V_h such that

  a(u_h, v) = (f, v),   ∀v ∈ V_h    (3)

where in the case of the Poisson equation

  a(u, v) = ∫_Ω ∇u · ∇v dx

223 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

• In FEM the equation (3) needs to be satisfied for a set of test functions ϕ_i = ϕ_i(x),
  ◦ which are defined such that ϕ_i(x_i) = 1 and ϕ_i(x_j) = 0 for j ≠ i
• and it is demanded that (3) is satisfied with each ϕ_i (i = 1, ..., N)
• As a result, a matrix of the linear equations is obtained

In the general case the n² × n² Laplace matrix has the form:

        [  B  −I   0  ···  0 ]
        [ −I   B  −I       ⋮ ]
  A  =  [  0  −I   B   ⋱   0 ] ,    (4)
        [  ⋮       ⋱   ⋱  −I ]
        [  0  ···  0  −I   B ]

224 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

where the n × n matrix B is of the form:

        [  4  −1   0  ···  0 ]
        [ −1   4  −1       ⋮ ]
  B  =  [  0  −1   4   ⋱   0 ] .
        [  ⋮       ⋱   ⋱  −1 ]
        [  0  ···  0  −1   4 ]

This means that most of the elements of matrix A are zero.

Poisson equation with a varying coefficient (different materials, varying conductivity, elasticity etc.):

  −∇ · (λ(x, y)∇u) = f

• assume that λ is constant across each element τ: λ(x, y) = λ^τ for (x, y) ∈ τ
• =⇒ the element matrices A^τ get multiplied by λ^τ.

225 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

9.1.1

Sparse matrix storage schemes

9.1.2

Triple storage format

• For an n × m matrix A, each nonzero is represented by 3 values: the integers i and j and (in most applications) the real matrix element a_ij =⇒ three arrays

  indi(1:nz), indj(1:nz), vals(1:nz)

of length nz – the number of nonzeroes of matrix A.
Advantages of the scheme:

• Easy to refer to a particular element • Freedom to choose the order of the elements Disadvantages :

• Nontrivial to find, for example, all nonzeroes of a particular row or column and their positions
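Despite that, a matrix-vector product y = Ax is simple in this format. A minimal sketch (the routine name and the assumption of a square n×n matrix are illustrative):

      SUBROUTINE coo_matvec(n, nz, indi, indj, vals, x, y)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: n, nz
        INTEGER, INTENT(IN)  :: indi(nz), indj(nz)
        REAL,    INTENT(IN)  :: vals(nz), x(n)
        REAL,    INTENT(OUT) :: y(n)
        INTEGER :: k
        y = 0.0
        DO k = 1, nz                      ! order of the nonzeroes does not matter
           y(indi(k)) = y(indi(k)) + vals(k) * x(indj(k))
        END DO
      END SUBROUTINE coo_matvec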

226 // Alg.Design 9.1.3

9.1 EXAMPLE: 2D FEM for Poisson equation

Column-major storage format

For each column j of matrix A, a vector col(j) gives the row numbers i for which a_ij ≠ 0.

• To store the whole matrix, each col(j) is
  ◦ appended to a 1-dimensional array cols(1:nz)
  ◦ and cptr(1:M) is introduced, referring to the start of each column in cols

  cols(1:nz), cptr(1:M), vals(1:nz)



Advantages:

• Easy to find matrix column nonzeroes together with their positions

227 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

Disadvantages:

• Algorithms become more difficult to read
• Difficult to find the nonzeroes of a particular row

9.1.4

Row-major storage format

For each row i of matrix A, a vector row(i) gives the column numbers j for which a_ij ≠ 0.
• To store the whole matrix, each row(i) is
  ◦ appended to a 1-dimensional array rows(1:nz)
  ◦ and rptr(1:N) is introduced, referring to the start of each row in rows

  rows(1:nz), rptr(1:N), vals(1:nz)
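A sketch of the matrix-vector product y = Ax in this row-major format (assuming, as is common but not stated on the slide, that rptr(N+1) = nz + 1 and that the matrix is square):

      SUBROUTINE csr_matvec(n, nz, rptr, rows, vals, x, y)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: n, nz
        INTEGER, INTENT(IN)  :: rptr(n+1), rows(nz)
        REAL,    INTENT(IN)  :: vals(nz), x(n)
        REAL,    INTENT(OUT) :: y(n)
        INTEGER :: i, k
        DO i = 1, n
           y(i) = 0.0
           DO k = rptr(i), rptr(i+1) - 1   ! nonzeroes of row i
              y(i) = y(i) + vals(k) * x(rows(k))
           END DO
        END DO
      END SUBROUTINE csr_matvec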



Advantages:

228 // Alg.Design

9.1 EXAMPLE: 2D FEM for Poisson equation

• Easy to find matrix row nonzeroes together with their positions Disadvantages:

• Algorithms become more difficult to read. • Difficult to find nonzeroes in a particular column. 9.1.5

Combined schemes

Triple format enhanced with cols(1:nz), cptr(1:M), rows(1:nz), rptr(1:N). Here cols and rows refer to the values of matrix A stored in the triple format. Advantages:

• All operations easy to perform Disadvantages:

• More memory needed. • Reference through indexing in all cases

229 // Alg.Design

9.2 9.2.1

9.2 Solution methods for sparse matrix systems

Solution methods for sparse matrix systems Direct methods

LU-factorisation with minimal fill-in
• analysis – 100
• factorisation – 10
• back-substitution – 1

Discretisations of PDEs in the case of 2D regions – still OK; for 3D problems – not so efficient.
Implementations: MUMPS, UMFPACK, ...

230 // Alg.Design 9.2.2

9.2 Solution methods for sparse matrix systems

Krylov subspace method: Preconditioned Conjugate Gradient Method (PCG)





r(0) = b − A x(0)   with given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)
    ρ(i−1) = r(i−1)ᵀ z(i−1)
    if i == 1
        p(1) = z(0)
    else
        β(i−1) = ρ(i−1)/ρ(i−2)
        p(i) = z(i−1) + β(i−1) p(i−1)
    endif
    q(i) = A p(i)
    α(i) = ρ(i−1) / (p(i)ᵀ q(i))
    x(i) = x(i−1) + α(i) p(i)
    r(i) = r(i−1) − α(i) q(i)
    check convergence; continue if needed
end

231 // Alg.Design

9.3

9.3 Preconditioning

Preconditioning

Idea: Replace Ax = b with the system M⁻¹Ax = M⁻¹b. Apply CG to

  Bx = c, where B = M⁻¹A and c = M⁻¹b.

But how to choose M? The preconditioner M = Mᵀ is to be chosen such that
(i) the problem Mz = r is easy to solve,
(ii) the matrix B is better conditioned than A, meaning that κ₂(B) < κ₂(A)    (5)

232 // Alg.Design Then

  #IT(5) = O(√κ₂(B)) < O(√κ₂(A)) = #IT(solving Ax = b)

• (In extreme cases M = I or M = A)
• The Preconditioned Conjugate Gradient (PCG) method
  ◦ is obtained by taking M ≠ I in the previous algorithm

9.3 Preconditioning

233 // Alg.Design

9.4

9.4 Domain Decomposition Method

Domain Decomposition Method

Preconditioning and parallelisation simultaneously Solution domain Ω divided into P overlapping subdomains

• Substructuring methods
• Schwarz methods
  ◦ Additive Schwarz (a sketch of applying M⁻¹ is given after this list)

    M⁻¹ = ∑_{i=1}^{P} Rᵢᵀ Aᵢ⁻¹ Rᵢ

    – Rᵢ – restriction operator
    – Aᵢ⁻¹ – i-th subdomain solve
      · subdomain solvers – exact or inexact
      · in the case of CG, symmetry of M⁻¹ is a prerequisite!
  ◦ Multiplicative Schwarz
• Two-level methods
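A serial sketch of applying the one-level additive Schwarz preconditioner z = M⁻¹r; the index lists idx and the routine solve_local are assumptions standing in for the restriction operators Rᵢ and the subdomain solves Aᵢ⁻¹:

      SUBROUTINE add_schwarz_apply(nsub, n, nmax, nloc, idx, r, z)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: nsub, n, nmax
        INTEGER, INTENT(IN)  :: nloc(nsub)          ! local subdomain sizes
        INTEGER, INTENT(IN)  :: idx(nmax, nsub)     ! idx(1:nloc(i),i) = global nodes of subdomain i
        REAL,    INTENT(IN)  :: r(n)
        REAL,    INTENT(OUT) :: z(n)
        REAL    :: rloc(nmax), zloc(nmax)
        INTEGER :: i, k
        z = 0.0
        DO i = 1, nsub                              ! subdomain solves are independent -> parallel
           rloc(1:nloc(i)) = r(idx(1:nloc(i), i))   ! R_i r  (restriction)
           CALL solve_local(i, nloc(i), rloc, zloc) ! zloc = A_i^{-1} rloc  (assumed subdomain solver)
           DO k = 1, nloc(i)                        ! z = z + R_i^T zloc  (prolongation and sum)
              z(idx(k, i)) = z(idx(k, i)) + zloc(k)
           END DO
        END DO
      END SUBROUTINE add_schwarz_apply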

234 // Alg.Design

9.5

9.5 Domain Decomposition on Unstructured Grids

Domain Decomposition on Unstructured Grids

Domain Decomposition on Unstructured Grids DOUG (University of Bath, University of Tartu) 1997 - 2012 I.G.Graham, M.Haggers, R. Scheichl, L.Stals, E.Vainikko, M.Tehver, K. Skaburskas, O. Batrašev, C. Pöcher Parallel implementation based on:

• MPI

• 2D & 3D problems

• UMFPACK

• 2-level Additive Schwarz method

• METIS • 2-level partitioning • BLAS • Automatic parallelisation and loadbalancing

• Systems of PDEs

• Automatic Coarse Grid generation • Adaptive refinement of the coarse grid

235 // Alg.Design

9.5 Domain Decomposition on Unstructured Grids

DOUG Strategies • Iterative solver based on Krylov subspace methods PCG, MINRES, BICGSTAB, 2-layered FPGMRES with left or right preconditioning

• Non-blocking communication where at all possible

:-) Dot-product: (x, y) – :-( Ax-operation: y := Ax –

• Preconditioner based on Domain Decomposition with 2-level solvers Applying the preconditioner M −1 : solve for z : Mz = r . • Subproblems are solved with a direct, sparse multifrontal solver (UMFPACK)

236 // Alg.Design

9.5 Domain Decomposition on Unstructured Grids

Aggregation-based Domain Decomposition methods • Had been analysed only upto some extent

• We are:

making use of strong connections

• Second aggregation for creating subdomains, or

• using rough aggregation before graph partitioner

• Major progress - development of the theory: sharper bounds

• Implementation based on Fortran95 http://dougdevel. org

1.

R. Scheichl and E. Vainikko, Robust Aggregation-Based Coarsening for Additive Schwarz in the Case of Highly Variable Coefficients, Proceddings of the European Conference on Computational Fluid Dynamics, ECCOMAS CFD 2006 (P. Wesseling, E. ONate, J. Periaux, Eds.), TU Delft, 2006.

2.

R. Scheichl and E. Vainikko, Additive Schwarz and Aggregation-Based Coarsening for Elliptic Problems with Highly Variable Coefficients, Computing, 2007, 80(4), pp 319-343.

237 // Alg.Design

9.5 Domain Decomposition on Unstructured Grids

Aggregation – forming groups of neighbouring graph nodes. Key issues:
• how to find good aggregates?
• smoothing step(s) for the restriction and interpolation operators
Four (often conflicting) aims:
1. follow adequately the underlying physical properties of the domain
2. try to retain optimal aggregate size
3. keep the shape of the aggregates regular
4. reduce communication => develop aggregates with smooth boundaries

238 // Alg.Design

9.5 Domain Decomposition on Unstructured Grids

Coefficient jump resolving property

239 // Alg.Design

9.5 Domain Decomposition on Unstructured Grids

Coefficient jump resolving property

240 // Alg.Design

9.5 Domain Decomposition on Unstructured Grids

Algorithm: Shape-preserving aggregation
Input: matrix A, aggregation radius r, strong connection threshold α.
Output: aggregate number for each node in the domain.
1. Scale A to a unit-diagonal matrix (all ones on the diagonal)
2. Find the set S of strong connections of matrix A: S = ∪_{i=1}^{n} S_i, where S_i ≡ { j ≠ i : |a_ij| ≥ α max_{k≠i} |a_ik| }; unscale A; aggr_num := 0
3. aggr_num := aggr_num + 1; choose a seed node x from G or, if G = ∅, choose the first nonaggregated node x; level := 0
4. If (level ...

Asymptotic notation:
• T(n) = O(f(n)) – ∃ constants c > 0 and n₀ > 0 such that T(n) ≤ c f(n) for each n ≥ n₀
• T(n) = Ω(f(n)) – ∃ constants c > 0 and n₀ > 0 such that T(n) ≥ c f(n) for each n ≥ n₀
• T(n) = Θ(f(n)) – T(n) = O(f(n)) and T(n) = Ω(f(n)).

245 // Alg.Optimality

10.2

10.2 Definitions of Optimality

Definitions of Optimality

Consider a computational problem Q. Denote by T*(n) the sequential time complexity of Q, i.e. there exists an algorithm solving Q in time O(T*(n)) and it can be shown that this time cannot be improved. A sequential algorithm with running time O(T*(n)) is called time optimal. There are 2 types of parallel algorithm optimality definitions (the 1st weaker than the 2nd):

1. Parallel algorithm for solution Q is called optimal or optimal in weak sense or weakly optimal or cost optimal, if W (n) = Θ(T ∗ (n)). 2. Parallel algorithm is called work-time (W T ) optimal or optimal in the strong sense if it can be shown that the program parallel working time T (n) cannot be improved by any other optimal parallel algorithm

246 // Alg.Optimality 

10.2 Definitions of Optimality 

EXAMPLE: Algorithm Sum
Input: n = 2^k numbers in array A
Output: sum S = ∑_{i=1}^{n} A(i)
begin
  1. for i = 1:n pardo
       B(i) := A(i)
  2. for h = 1:log n do
       for i = 1:n/2^h pardo
         B(i) := B(2i−1) + B(2i)
  3. S := B(1)
end



Number of time units: T(n) = log n + 2 = O(log n).
Step 1 takes n operations; the j-th step (h = j−1 in part 2 of the algorithm) takes n/2^(j−1) operations, j = 2 : log n + 1
=> W(n) = n + ∑_{j=1}^{log n} (n/2^j) + 1 = O(n).
Since W(n) = O(n) and T*(n) = n, the algorithm is optimal. It can be shown that optimal speedup is obtained with p = O(n/log n) processors (see the book). (In Ja'Ja' Section 10 it is shown that finding the sum takes Ω(log n) time independently of the number of processors => the algorithm is also optimal in the strong sense.)
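A direct Fortran/OpenMP transcription of Algorithm Sum, as a sketch (assumes n = 2^k; the routine name is illustrative):

      SUBROUTINE tree_sum(n, A, S)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: n
        REAL,    INTENT(IN)  :: A(n)
        REAL,    INTENT(OUT) :: S
        REAL    :: B(n)
        INTEGER :: i, m
        !$OMP PARALLEL DO
        DO i = 1, n               ! step 1: copy
           B(i) = A(i)
        END DO
        m = n
        DO WHILE (m > 1)          ! step 2: log n levels of pairwise sums
           m = m / 2
           !$OMP PARALLEL DO
           DO i = 1, m
              B(i) = B(2*i-1) + B(2*i)
           END DO
        END DO
        S = B(1)                  ! step 3
      END SUBROUTINE tree_sum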

247 // Algorithm Basic Techniques

11 11.1

11.1 Balanced Trees

Basic Techniques in Parallel Algorithms Balanced Trees

Prefix Sums. Find s_i = x_1 ∘ x_2 ∘ ... ∘ x_i, 1 ≤ i ≤ n, where ∘ is some associative operation, like *, +, ...
Easiest way: s_i = s_{i−1} ∘ x_i – sequential, T*(n) = O(n)

Figure: sequential and parallel

248 // Algorithm Basic Techniques

11.1 Balanced Trees  ALGORITHM P r e f i x Sums Input : array of n = 2k elements (x1 , x2 , ..., xn ) , where k ≥ 0 i n t e g e r Output : P r e f i x sums si , 1 ≤ i ≤ n begin 1 . i f n = 1 then {sets1 := x1 ; e x i t } 2 . f o r 1 ≤ i ≤ n/2 pardo set yi := x2i−1  x2i 3 . Recursively , f i n d P r e f i x sums f o r {y1 , y2 , ..., yn/2 } and store i n z1 , z2 , ..., zn/2 . 4 . f o r 1 ≤ i ≤ n pardo {i even: set si := zi/2 i=1: set s1 := x1 i odd > 0 : set si := z(i−1)/2  xi } end





249 // Algorithm Basic Techniques

11.1 Balanced Trees

Program work time T (n) and work W (n) estimation:

  T(n) = T(n/2) + a,   W(n) = W(n/2) + bn,

where a and b are constants. Solution of the recursion:

  T(n) = O(log n),   W(n) = O(n).

The algorithm is WT-optimal, i.e. optimal in the strong sense (proof in Ja'Ja' Sec. 10), on EREW and CREW PRAM
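A recursive Fortran sketch of the algorithm for the operation ∘ = + (assumes n = 2^k; names are illustrative):

      RECURSIVE SUBROUTINE prefix_sums(n, x, s)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: n
        REAL,    INTENT(IN)  :: x(n)
        REAL,    INTENT(OUT) :: s(n)
        REAL    :: y(n/2), z(n/2)
        INTEGER :: i
        IF (n == 1) THEN
           s(1) = x(1)
           RETURN
        END IF
        DO i = 1, n/2                      ! pardo in the PRAM formulation
           y(i) = x(2*i-1) + x(2*i)
        END DO
        CALL prefix_sums(n/2, y, z)        ! recursion on the half-size problem
        DO i = 1, n                        ! pardo
           IF (i == 1) THEN
              s(i) = x(1)
           ELSE IF (MOD(i, 2) == 0) THEN
              s(i) = z(i/2)
           ELSE
              s(i) = z((i-1)/2) + x(i)
           END IF
        END DO
      END SUBROUTINE prefix_sums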

250 // Algorithm Basic Techniques

11.2

11.2 Pointer Jumping Method

Pointer Jumping Method

Rooted Directed tree T is a directed graph with a special node r such that 1. for each v ∈ V − {r} outdegree is 1, outdegree of r is 0; 2. for each v ∈ V − {r} there exists a directed path from node v to the node r. The special node r is called root of tree T With pointer jumping it is possible to quickly manage data stored in rooted directed trees

251 // Algorithm Basic Techniques

11.2 Pointer Jumping Method

Task: Find all roots of forest Forest F :

• consists of rooted directed trees
• given by an array P(1:n): P(i) = j => there is an edge from node i to j
• if P(i) = i => i is a root
The task is to find the root S(j) for each j, j = 1, ..., n. A simple algorithm (for example, reversing all arrows and then finding the roots) takes linear time.
At first take as S(i) – the successor of the i-th node – its P(i).
In pointer jumping, each node's successor is repeatedly overwritten with its successor's successor => after k iterations the distance between i and S(i) is 2^k, if S(i) is not a root.

252 // Algorithm Basic Techniques

11.2 Pointer Jumping Method

ALGORITHM Pointer Jumping
Input: forest of rooted directed trees with roots pointing to themselves, edges given as pairs (i, P(i)), where i = 1, ..., n
Output: root S(i) of the tree that node i belongs to
begin
  for i = 1, n pardo
    S(i) := P(i)
    while (S(i) ≠ S(S(i))) do
      S(i) := S(S(i))
end

CREW PRAM
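A Fortran sketch of the same iteration (written as an ordinary serial loop; on a PRAM the inner loop is a pardo):

      SUBROUTINE pointer_jump(n, P, S)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: n, P(n)
        INTEGER, INTENT(OUT) :: S(n)
        INTEGER :: i
        LOGICAL :: changed
        S = P                              ! roots satisfy P(i) = i
        changed = .TRUE.
        DO WHILE (changed)                 ! O(log h) rounds, h = max tree height
           changed = .FALSE.
           DO i = 1, n                     ! pardo on a PRAM
              IF (S(i) /= S(S(i))) THEN
                 S(i) = S(S(i))            ! jump to the successor's successor
                 changed = .TRUE.
              END IF
           END DO
        END DO
      END SUBROUTINE pointer_jump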

253 // Algorithm Basic Techniques Example

11.2 Pointer Jumping Method

254 // Algorithm Basic Techniques

11.2 Pointer Jumping Method

Time and work estimate for the given algorithm:

h – the maximal tree height in the forest
• Time: the step size between i and S(i) doubles in each iteration until S(S(i)) is the root of the tree containing i
  ◦ => number of iterations = O(log h)
  ◦ each iteration can be performed in O(1) parallel time and O(n) operations
  ◦ => T(n) = O(log h)
• W(n) = O(n log h) (W(n) is non-optimal due to the existence of an O(n) sequential algorithm)

255 // Algorithm Basic Techniques

11.3

11.3 Divide and Conquer

Divide and Conquer

Divide and conquer strategy

• consists of 3 main steps: 1. Dividing input into multiple relatively equal parts 2. Solving recursively subproblems on given parts 3. Merging the found solutions of different subproblems into a solution of the overall problem

256 // Algorithm Basic Techniques

11.3 Divide and Conquer

Finding convex hull of a given set

• Let a set of points S = {p1 , p2 , ..., pn }, be given with coordinates pi = (xi , yi ). • Planar convex hull of set S is the smallest convex polygon, containing all n points of S • Polygon Q is called convex if for arbitrary 2 polygon Q points p and q the line segment (p, q) lies entirely in polygon Q Convex hull problem: Determine the ordered (say, clockwise) list CH(S) of the points of S, determining the boundary of the convex hull of S

257 // Algorithm Basic Techniques

11.3 Divide and Conquer

258 // Algorithm Basic Techniques

11.3 Divide and Conquer

• Can be shown that in case of sequential algorithm T ∗ (n) = Θ(n log n) (sorting can be reduced to the problem of solving convex-hull problem) Let p and q be points from S with the smallest and largest x-coordinate

• => p and q ∈ CH(S)
• => the CH(S) points from p to q form the upper hull UH(S)
• those from q to p form the lower hull LH(S)
Remark. Sorting n numbers can be done in O(log n) time on EREW PRAM using a total of O(n log n) operations (Ja'Ja' Chapter 4).
Assume for simplicity that no points have the same x or y coordinates and that n is a power of 2. Sort the points according to the x-coordinate (O(log n) time and O(n log n) operations):

x(p_1) < x(p_2) < ··· < x(p_n). Let S_1 = (p_1, p_2, ···, p_{n/2}) and S_2 = (p_{n/2+1}, ···, p_n), and assume UH(S_1) and UH(S_2) have already been found. The upper common tangent for the sets S_i (i = 1, 2) needs to be found.

259 // Algorithm Basic Techniques

11.3 Divide and Conquer

260 // Algorithm Basic Techniques

11.3 Divide and Conquer

The upper common tangent for UH(S_1) and UH(S_2) can be found in O(log n) sequential time using binary search (Ja'Ja' Ch. 6).

Let UH(S_1) = (q_1, ..., q_s) and UH(S_2) = (q′_1, ..., q′_t)
• (q_1 = p_1 and q′_t = p_n)
• the upper common tangent is defined by a pair of points (q_i, q′_j); UH(S) consists of the first i points of UH(S_1) and the last t − j + 1 points of UH(S_2): UH(S) = (q_1, ..., q_i, q′_j, ..., q′_t) – this can be determined in O(1) time and O(n) operations

261 // Algorithm Basic Techniques

11.3 Divide and Conquer

ALGORITHM (Simple Upper Hull)
Input: set S of points in the plane, no two of which have the same x or y coordinates, with x(p_1) < x(p_2) < ··· < x(p_n), where n = 2^k
Output: the upper hull of S
begin
  1. If n ≤ 4, use the brute-force method to find UH(S) and exit
  2. Let S_1 = (p_1, p_2, ···, p_{n/2}) and S_2 = (p_{n/2+1}, ···, p_n); recursively compute UH(S_1) and UH(S_2) in parallel
  3. Find the upper common tangent between UH(S_1) and UH(S_2) and use it to combine them into UH(S)
end

262 // Algorithm Basic Techniques

11.3 Divide and Conquer

Theorem. The algorithm correctly computes the upper hull of n points in the plane, taking O(log² n) time and using a total of O(n log n) operations.
Proof. Correctness follows by induction over n. T(n) and W(n)?
Step 1 – O(1) time
Step 2 – T(n/2) time using 2W(n/2) operations
Step 3 – common tangent: O(log n) time; combining: O(1) time and O(n) operations
=> recursions:

• T(n) ≤ T(n/2) + a log n,
• W(n) ≤ 2W(n/2) + bn,

with a and b positive constants. Solving the recursions gives T(n) = O(log² n) and W(n) = O(n log n). It follows that the convex hull of a set of n points can be found in O(log² n) time using a total of O(n log n) operations => the algorithm is optimal.

263 // Algorithm Basic Techniques

11.4

11.4 Partitioning

Partitioning

Partitioning strategy: 1. breaking up the given problem into p independent subproblems of almost equal sizes 2. solving p problems concurrently

• In contrast to the divide-and-conquer strategy, the main task is the partitioning, not combining the solutions

264 // Algorithm Basic Techniques

11.4 Partitioning

Merging linearly ordered sets Given set S and partial order relation ≤

• Set S is linearly (or totally) ordered if for every pair a, b ∈ S either a ≤ b or b ≤ a. Consider 2 nondecreasing sequences A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_n) with elements from a linearly ordered set S
• Task: merge the sequences A and B into a sorted sequence C = (c_1, c_2, ..., c_2n). Let X = (x_1, x_2, ..., x_t) be a set with elements from S.
• The rank of an element x ∈ S in set X, denoted rank(x : X), is the number of elements of X less than or equal to x. Let Y = (y_1, y_2, ..., y_s) be an arbitrary array of elements of S.
• The ranking of set Y in set X is the array rank(Y : X) = (r_1, r_2, ..., r_s), where r_i = rank(y_i : X). Example:

X = (26, −12, 27, 30, 53, 7); Y = (12, 28, −28). Then rank(Y : X) = (2, 4, 0).

265 // Algorithm Basic Techniques

11.4 Partitioning

For simplicity: all elements of both sorted sequences A and B to merge are distinct => merging problem solvable finding for each element x in A and B its rank in set A ∪ B.

• If rank(x : A ∪ B) = i, then c_i = x.
• As rank(x : A ∪ B) = rank(x : A) + rank(x : B), the merging problem can be solved by finding two integer arrays rank(A : B) and rank(B : A).
So, consider the problem rank(B : A). Let b_i be an arbitrary element of B. As A is sorted, finding rank(b_i : A) can be performed with binary search: compare b_i with the middle element and continue with the lower or upper half, etc., until we have found a_{j(i)} < b_i < a_{j(i)+1}, and then rank(b_i : A) = j(i). Sequential time is O(log n) => perform this in parallel for each element of B. Work – O(n log n), which is not optimal since there exists a linear sequential algorithm.
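A sketch of the binary-search ranking used here (A sorted increasing, elements distinct; the function name is illustrative):

      INTEGER FUNCTION rank_in(b, n, A)
        IMPLICIT NONE
        REAL,    INTENT(IN) :: b
        INTEGER, INTENT(IN) :: n
        REAL,    INTENT(IN) :: A(n)
        INTEGER :: lo, hi, mid
        lo = 0; hi = n                     ! invariant: A(1:lo) <= b < A(hi+1:n)
        DO WHILE (lo < hi)
           mid = (lo + hi + 1) / 2
           IF (A(mid) <= b) THEN
              lo = mid
           ELSE
              hi = mid - 1
           END IF
        END DO
        rank_in = lo                       ! number of elements of A that are <= b
      END FUNCTION rank_in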

266 // Algorithm Basic Techniques

11.4 Partitioning

Optimal merging algorithm Split sequences A and B into log n blocks => ca n/ log n elements into each approximately equal partition

267 // Algorithm Basic Techniques

11.4 Partitioning

Algorithm 2.7 (Partitioning) Input: Two arrays A = (a1 , ..., an ) and B = (b1 , ..., bm ) with elements in increasing order with log m and k(m) = m/ log m being integers Output: k(m) pairs (Ai , Bi ) of sequences A and B subsequences such that: (1) |Bi | = log m; (2) ∑i |Ai | = n and (3) each subsequence Ai and Bi element larger than all elements in subsequences Ai−1 and Bi−1 ∀ 1 ≤ i ≤ k(m) − 1. begin 1. j(0) := 0; j(k(m)) := n 2. for 1 ≤ i ≤ k(m) − 1 pardo 2.1 rank bi log m in sequence A using binary search and j(i) := rank(bi log m : A) 3. for 0 ≤ i ≤ k(m) − 1 pardo 3.1 Bi :=(bi log m+1 , ..., b(i+1) log m ) 3.2 Ai :=(a j(i)+1 , ..., a j(i+1) ) (Ai is empty if j(i) = j(i + 1))

268 // Algorithm Basic Techniques

end

Example:

A = (4, 6, 7, 10, 12, 15, 18, 20) B = (3, 9, 16, 21) Here m = 4, k(m) = 2. As rank(9 : A) = 3 => A0 = (4, 6, 7), B0 = (3, 9) A1 = (10, 12, 15, 18, 20), B1 = (16, 21)

11.4 Partitioning

269 // Algorithm Basic Techniques

11.4 Partitioning

Lemma 2.1: Let C be a sorted sequence obtained with merging two sorted sequences A and B (with lengths n and m). Then Algorithm 2.7 partitions sequences A and B into pairs (Ai , Bi ) such that |Bi | = O(log m), ∑i |Ai | = n and C = (C0 ,C1 , ...), where Ci is sorted sequence obtained through merging subsequences Ai and Bi . In addition, the algorithm works with O(log n) time using O(n + m) operations. Proof. Ensure that each element from subsequences Ai and Bi larger than each element in Ai−1 and Bi−1 . Step 1: O(1) time Step 2: O(log n) time (binary search for each one separately) Operation count: O((log n) × (m/ log m)) = O(m + n), as (m log n/ log m) < (m log(n + m)/ log m) ≤ n + m, if n, m ≥ 4. Step 3: Time: O(1), linear number of opeartions

270 // Algorithm Basic Techniques

11.4 Partitioning

Theorem 2.4 Let A and B be two sorted sequences, each of length n. Merging sequences A and B can be performed in O(log n) time with O(n) operations. Proof. 1. Using algorithm 2.7 divide A and B into pairs of subsequences (Ai , Bi ) (Time:O(log n), Work O(n)) 2. Could use a sequential algorithm for merging (Ai , Bi ) but Ti = O(|Bi | + |Ai |) =

O(log n + |A_i|). Therefore:
• if |A_i| > O(log n) – 2a) split A_i into pieces of length log n and apply Algorithm 2.7, with B replaced by A_i and A replaced by B_i – O(log log n) time using O(|A_i|) operations
• if |A_i| ≤ O(log n) – 2b) use the sequential algorithm for merging the subsequences

271 // Algorithm Basic Techniques

11.5

11.5 Pipelining

Pipelining

In pipelining, a process is divided into subprocesses in such a way that as soon as the previous subprocess has finished a given phase of the algorithm, the next subprocess can start working on it

2-3 Trees 2-3 tree is:

• rooted tree with each node having 2 or 3 children • path length from root to all leaves is the same Properties:

• the number of leaves in a 2-3 tree is between 2^h and 3^h (h – tree height) => the height of an n-leaf tree is Θ(log n)
• used for sorted lists (element search, adding, deleting an element)

272 // Algorithm Basic Techniques Sorted list A = (a1 , a2 , ..., an ), where a1 < a2 < · · · < an as 2-3 tree leaves – elements ai with direction from left to right Inner nodes: L(ν) : M(ν) : R(ν) L(ν) - largest element of leften subtree M(ν) - middle subtree largest element R(ν) - largest element of right subtree (if it exists) Example

11.5 Pipelining

273 // Algorithm Basic Techniques

11.5 Pipelining

• the 2-3 tree corresponding to a sorted list A is not defined uniquely
• after elementary operations (adding, deleting a leaf) the structure must remain a 2-3 tree

Searching: With given b find the leaf of T for which ai ≤ b < ai+1 . Work is proportional to the tree height (O(log n))

274 // Algorithm Basic Techniques Adding a leaf: First find the place for addition with previous algorithm

• (If b = ai – nothing needs to be done) • add the leaf and repair the tree

11.5 Pipelining

275 // Algorithm Basic Techniques

• Add root on new level, if needed

11.5 Pipelining

276 // Algorithm Basic Techniques

11.5 Pipelining

Adding k sorted elements b1 < b2 < · · · < bk to 2-3 tree Sequential algorithm: Time O(k log n), Work: O(k log n)

• With sequential algorithm add b1 and bk to the tree (Time: O(log n)) • In parallel, search places for each element bi • A. Simple case – between each pair ai and ai+1 at most 1 element b j : ◦ adding nodes in parallel (for repairing the tree) moving from bottom to top (never more than 6 subnodes!)

◦ Time: O(log n), ◦ Work: O(k log n). • B. More difficult case – if for example between ai and ai+1 a sequence bl , bl+1 , ..., b f , repair the tree with middle element, then subsequence middle elements etc. in parallel

• Total time: O(log k log n)

277 // Algorithm Basic Techniques

11.5 Pipelining

Pipelining can be used – the operations proceed in parallel in waves (each level has exactly one tree repair in progress)!

• Time: O(log k + log n) = O(log n), Work: O(k log n)

278 // Algorithm Basic Techniques

11.6

11.6 Accelerated Cascading

Accelerated Cascading

Consider:

• a fast but nonoptimal algorithm and
• a slower but optimal algorithm
Question: could these be combined into a fast and optimal algorithm?

279 // Algorithm Basic Techniques 11.6.1

11.6 Accelerated Cascading

Constant time maximum algorithm

Algorithm 2.8 (maximum in O(1) time)
Input: array A, with all p elements being different
Output: array M of logical values such that M(i) = 1 exactly iff A(i) is the maximal element of A
begin
  1. for 1 ≤ i, j ≤ p pardo
       if A(i) ≥ A(j) then B(i,j) := 1 else B(i,j) := 0
  2. for 1 ≤ i ≤ p pardo
       M(i) := B(i,1) ∧ B(i,2) ∧ ··· ∧ B(i,p)
end

CRCW PRAM model, O(1) time, O(p2 ) operations

280 // Algorithm Basic Techniques 11.6.2

11.6 Accelerated Cascading

Doubly logarithmic-time algorithm k

Assume n = 2^(2^k) for some integer k.
Doubly logarithmic-depth n-leaf tree (DLT): the root has 2^(2^(k−1)) (= √n) children, each of which has 2^(2^(k−2)) children, and in general each node on the i-th level has 2^(2^(k−i−1)) children, 0 ≤ i ≤ k − 1. Each level-k node has 2 children.
Example: 16-node doubly logarithmic-depth tree (k = 2)

281 // Algorithm Basic Techniques

11.6 Accelerated Cascading

• by induction it is easy to see that the number of nodes on the i-th level is 2^(2^k − 2^(k−i)),
• the number of nodes on level k is 2^(2^k − 2^0) = 2^(2^k − 1) = n/2.

282 // Algorithm Basic Techniques 11.6.3

11.6 Accelerated Cascading

Making fast algorithm optimal

A logarithmic-depth binary tree – optimal, time O(log n).
Accelerated cascading:
1. Start with the optimal algorithm until the size of the problem is below a certain level
2. Switch over to the fast but nonoptimal algorithm
Problem of finding the maximum:
1. Start with the binary tree algorithm from the leaves, going ⌈log log log n⌉ levels up. As the number of nodes reduces by a factor of two on each level, we are left with the maximum among n′ = O(n/log log n) nodes. Number of operations so far: O(n); time: O(log log log n).
2. Run the DLT algorithm on the remaining n′ elements. Time: O(log log n′) = O(log log n); W(n) = O(n′ log log n′) = O(n) operations.
=> the hybrid algorithm is optimal and works in O(log log n) time.

283 // Algorithm Basic Techniques

11.7

11.7 Symmetry breaking

Symmetry breaking

Let G = (V, E) be a directed cycle (in-degree and out-degree of every node are 1, and ∀u, v ∈ V there is a directed path from node u to v).

Graph k-coloring is a mapping c : V → {0, 1, ..., k − 1} such that c(i) ≠ c(j) if ⟨i, j⟩ ∈ E.

284 // Algorithm Basic Techniques

11.7 Symmetry breaking

Consider the 3-coloring problem. A sequential algorithm is simple – alternate the colors 0 and 1 along the cycle and, at the end, pick a color from {1, 2} for the last node.
Let the edges be given by an array S(1:n) (there is an edge ⟨i, S(i)⟩ ∈ E); the previous edge: P(S(i)) = i.
Suppose there is a coloring c(1:n) (one can always be found by taking, for example, c(i) = i).
Consider the binary representation i = i_{t−1} ··· i_k ··· i_1 i_0; the k-th least significant bit is i_k.

Algorithm 2.9 (Basic coloring)
Input: directed cycle with edges given by array S(1:n) and a coloring c(1:n) of the nodes
Output: new coloring c′(1:n)
begin
  for 1 ≤ i ≤ n pardo
    1. Find the least significant bit position k in which c(i) and c(S(i)) differ
    2. c′(i) := 2k + c(i)_k
end
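A Fortran sketch of one application of this step, using standard bit intrinsics (the pardo loop is written as an ordinary DO; the routine name is illustrative):

      SUBROUTINE basic_coloring(n, S, c, cnew)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: n, S(n), c(n)
        INTEGER, INTENT(OUT) :: cnew(n)
        INTEGER :: i, k, d
        DO i = 1, n                          ! pardo on a PRAM
           d = IEOR(c(i), c(S(i)))           ! bits in which the two colours differ
           k = 0
           DO WHILE (.NOT. BTEST(d, k))      ! least significant differing bit
              k = k + 1
           END DO
           cnew(i) = 2*k + IBITS(c(i), k, 1) ! 2k + k-th bit of c(i)
        END DO
      END SUBROUTINE basic_coloring

For node v = 1 in the table below (c = 0001, successor colour 0011) this gives k = 1 and cnew = 2, matching the c′ column.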

285 // Algorithm Basic Techniques

11.7 Symmetry breaking

v 1 3 7 14 2 15 4 5 6 8 10 11 12 9 13

c 0001 0011 0111 1110 0010 1111 0100 0101 0110 1000 1010 1011 1100 1001 1101

k 1 2 0 2 0 0 0 0 1 1 0 0 0 2 2

c’ 2 4 1 5 0 1 0 1 3 2 0 1 0 4 5

286 // Algorithm Basic Techniques

11.7 Symmetry breaking

Lemma 2.3: The output c′(1:n) is a valid coloring whenever the input c(1:n) is a valid coloring. T(n) = O(1), W(n) = O(n).
Proof:
• k can always be found if c is a valid coloring
• Suppose c′(i) = c′(j) while there is an edge i → j. Then c′(i) = 2k + c(i)_k and c′(j) = 2l + c(j)_l; since c′(i) = c′(j) => k = l => c(i)_k = c(j)_k, which contradicts the definition of k. => c′(i) ≠ c′(j) whenever there is an edge i → j.

287 // Algorithm Basic Techniques 11.7.1

11.7 Symmetry breaking

Superfast 3-coloring algorithm

Let t > 3 (the number of bits in c) => c′(1:n) can be described with ⌈log t⌉ + 1 bits. => if the number of colors in c(1:n) is q, where 2^(t−1) < q ≤ 2^t, then the coloring c′ uses at most 2^(⌈log t⌉+1) = O(t) = O(log q) colors. => the number of colors reduces exponentially. A reduction takes place whenever t > ⌈log t⌉ + 1, i.e. t > 3. Applying the basic coloring algorithm with t = 3 gives at most 6 colors, as c′(i) = 2k + c(i)_k => 0 ≤ c′(i) ≤ 5 (as 0 ≤ k ≤ 2).

288 // Algorithm Basic Techniques

11.7 Symmetry breaking

Number of iterations? Denote log(i) x: log(1) x = log x; log(i) x = log(log(i−1) x) Let log∗ x = min{i| log(i) x ≤ 1} Function log∗ x is growing extremely slowly being < 5 for each x ≤ 265536 ! Theorem 2.6: 3-coloring directed cycles can be performed with O(log∗ n) time and O(n log∗ n) operations. Proof: Applying base coloring algorithm, we reach 6 colors at most in O(log∗ n) iterations. Changing colors 3, 4 and 5 with 3 iterations – change each of the colors to the smallest possibility from set {0, 1, 2} (depending on neighbours). It works correctly because no neighbouring node colored at the same time.

289 // Algorithm Basic Techniques

11.7.2 Optimal 3-coloring algorithm

Remark (sorting integers). Given n integers in the interval [0, O(log n)], they can be sorted in O(log n) time using a linear number of operations (this can be achieved through a combination of radix sort and prefix-sum algorithms).

Algorithm 2.10 (3-coloring of a cycle)
Input: Directed cycle of length n with edges defined by array S(1 : n)
Output: 3-coloring of the cycle nodes
begin
  1. for 1 ≤ i ≤ n pardo C(i) := i
  2. Apply Algorithm 2.9 once
  3. Sort the nodes by their colors
  4. for i := 3 to 2⌈log n⌉ do
       for each node of color i pardo
         color it with the smallest color from {0, 1, 2} that differs from the colors of its neighbours
end
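As a stand-in for the radix-sort/prefix-sum combination mentioned in the remark, step 3 can be simulated with a simple bucket/counting sort over the small color range (a sequential sketch only, not the parallel algorithm itself):

def sort_by_color(nodes, c, num_colors):
    # distribute the nodes into one bucket per color and concatenate the buckets;
    # the work is linear because num_colors = O(log n)
    buckets = [[] for _ in range(num_colors)]
    for v in nodes:
        buckets[c[v]].append(v)
    return [v for bucket in buckets for v in bucket]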

290 Scalability

12 Parallel Algorithm Scalability (based on [1])

12.1 Isoefficiency Metric of Scalability

• For a given problem size, the overall efficiency of the parallel system with a given algorithm goes down as we increase the number of processing elements.

• For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.

291 Scalability

12.1 Isoefficiency Metric of Scalability

Variation of efficiency: (a) as the number of processing elements is increased for a given problem size; and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems

• What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed?

292 Scalability

12.1 Isoefficiency Metric of Scalability

• This rate determines the scalability of the system. The slower this rate, the better

• Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem

• We can write the parallel runtime as:

  TP = (W + To(W, p)) / p

• The resulting expression for speedup is:

  S = W / TP = W·p / (W + To(W, p))

293 Scalability

12.1 Isoefficiency Metric of Scalability

• Finally, we write the expression for efficiency as

  E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p)/W)

• For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To/W is maintained at a constant value.

• For a desired value E of efficiency,

  E = 1 / (1 + To(W, p)/W)
  To(W, p)/W = (1 − E)/E
  W = (E/(1 − E)) · To(W, p)

294 Scalability

12.1 Isoefficiency Metric of Scalability

• If K = E/(1–E) is a constant depending on the efficiency to be maintained, since To is a function of W and p, we have

W = KTo (W, p)

• The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant.

• This function is called the isoefficiency function. • This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements

295 Scalability

12.1 Isoefficiency Metric of Scalability

Example: Adding n numbers on p processors

• The overhead function for the problem of adding n numbers on p processing elements is approximately 2p log p:
  ◦ The first stage of the algorithm (summing n/p numbers locally) takes n/p time.
  ◦ The second phase involves log p steps, with one communication and one addition at each step. Assuming a single communication takes unit time, this phase takes 2 log p time.

• Hence

  TP = n/p + 2 log p
  S = n / (n/p + 2 log p)
  E = 1 / (1 + (2p log p)/n)

296 Scalability

12.1 Isoefficiency Metric of Scalability

• Doing as before, but substituting To by 2p log p , we get W = K2p log p

• Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).

• If the number of processing elements is increased from p to p′, the problem size (in this case, n) must be increased by a factor of (p′ log p′)/(p log p) to get the same efficiency as on p processing elements.

Efficiencies for different n and p:

n     p = 1   p = 4   p = 8   p = 16   p = 32
64    1.0     0.80    0.57    0.33     0.17
192   1.0     0.92    0.80    0.60     0.38
320   1.0     0.95    0.87    0.71     0.50
512   1.0     0.97    0.91    0.80     0.62
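The table can be recomputed directly from E = 1/(1 + 2p log p / n) (base-2 logarithms, as in the derivation above); a short Python check:

import math

def efficiency(n, p):
    # p = 1 has no overhead, so E = 1 by definition
    return 1.0 if p == 1 else 1.0 / (1.0 + 2.0 * p * math.log2(p) / n)

for n in (64, 192, 320, 512):
    print(n, [f"{efficiency(n, p):.2f}" for p in (1, 4, 8, 16, 32)])
# 64 -> ['1.00', '0.80', '0.57', '0.33', '0.17'], and so on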

297 Scalability

12.1 Isoefficiency Metric of Scalability

Example 2: Let To = p^(3/2) + p^(3/4)·W^(3/4)

• Using only the first term of To in W = K·To(W, p), we get

  W = K·p^(3/2)

• Using only the second term yields the following relation between W and p:

  W = K·p^(3/4)·W^(3/4)
  W^(1/4) = K·p^(3/4)
  W = K⁴·p³

• The larger of these two asymptotic rates determines the isoefficiency, which is therefore Θ(p³).
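A quick numerical sanity check of the second manipulation (purely illustrative; the values of K and p are arbitrary):

K, p = 2.0, 16.0
W = K**4 * p**3                      # the claimed isoefficiency term
print(W, K * p**0.75 * W**0.75)      # both print 65536.0, so W = K^4 p^3 satisfies W = K p^(3/4) W^(3/4)
# for large p this term grows faster than K * p**1.5, so the overall isoefficiency is Theta(p^3)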

298 Scalability

12.2 Cost-Optimality and the Isoefficiency Function

• A parallel system is cost-optimal if and only if

pTP = Θ(W )

• From this, we have:

  W + To(W, p) = Θ(W)
  To(W, p) = O(W)
  W = Ω(To(W, p))

=> cost-optimal if and only if its overhead function does not asymptotically exceed the problem size

• If we have an isoefficiency function f (p), then it follows that the relation W = Ω( f (p)) must be satisfied to ensure the cost-optimality of a parallel system as it is scaled up.

299 Scalability

12.2 Cost-Optimality and the Isoefficiency Function

Lower Bound on the Isoefficiency Function

• For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally.

• The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function. Degree of Concurrency and the Isoefficiency Function

• The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.

• If C(W ) is the degree of concurrency of a parallel algorithm, then for a problem of size W , no more than C(W ) processing elements can be employed effectively.

300 Scalability

12.2 Cost-Optimality and the Isoefficiency Function

Degree of Concurrency and the Isoefficiency Function: Example

• Consider solving a system of n equations in n variables using Gaussian elimination (W = Θ(n³)).
• The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n²) computations.
• At most Θ(n²) processing elements can be kept busy at any time.
• Since W = Θ(n³) for this problem, the degree of concurrency C(W) is Θ(W^(2/3)).
• Given p processing elements, the problem size should be at least Ω(p^(3/2)) to use them all.

301 Graph Algorithms

Scalability of Parallel Graph Algorithms

12.1 Definitions and Representation
12.2 Minimum Spanning Tree: Prim's Algorithm
12.3 Single-Source Shortest Paths: Dijkstra's Algorithm
12.4 All-Pairs Shortest Paths
12.5 Connected Components
Algorithms for Sparse Graphs

302 Graph Algorithms

12.1 Definitions and Representation

• An undirected graph G is a pair (V, E), where V is a finite set of points called vertices and E is a finite set of edges
• An edge e ∈ E is an unordered pair (u, v), where u, v ∈ V
• In a directed graph, the edge e is an ordered pair (u, v). An edge (u, v) is incident from vertex u and is incident to vertex v
• A path from a vertex v to a vertex u is a sequence ⟨v0, v1, v2, ..., vk⟩ of vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1, ..., k − 1
• The length of a path is defined as the number of edges in the path
• An undirected graph is connected if every pair of vertices is connected by a path

• A forest is an acyclic graph, and a tree is a connected acyclic graph • A graph that has weights associated with each edge is called a weighted graph

303 Graph Algorithms

12.1 Definitions and Representation

• Graphs can be represented by their adjacency matrix or an edge (or vertex) list
• Adjacency matrices have a value a_{i,j} = 1 if nodes i and j share an edge, and 0 otherwise. In case of a weighted graph, a_{i,j} = w_{i,j}, the weight of the edge
• The adjacency list representation of a graph G = (V, E) consists of an array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices adjacent to v
• For a graph with n nodes, an adjacency matrix takes Θ(n²) space and an adjacency list takes Θ(|E|) space

304 Graph Algorithms

12.1 Definitions and Representation

• A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G • In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph

• A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight

305 Graph Algorithms

12.2 Minimum Spanning Tree: Prim's Algorithm

• Prim’s algorithm for finding an MST is a greedy algorithm • Start by selecting an arbitrary vertex, include it into the current MST • Grow the current MST by inserting into it the vertex closest to one of the vertices already in current MST

306 Graph Algorithms

12.2 Minimum Spanning Tree: Prim’s Algorithm

procedure PRIM_MST(V, E, w, r)
begin
  VT := {r};
  d[r] := 0;
  for all v ∈ (V − VT) do
    if edge (r, v) exists set d[v] := w(r, v);
    else set d[v] := ∞;
    endif
  while VT ≠ V do
  begin
    find a vertex u such that d[u] := min{ d[v] | v ∈ (V − VT) };
    VT := VT ∪ {u};
    for all v ∈ (V − VT) do
      d[v] := min{ d[v], w(u, v) };
  endwhile
end PRIM_MST

307 Graph Algorithms

12.2 Minimum Spanning Tree: Prim’s Algorithm

Parallel formulation

• The algorithm works in n outer iterations - it is hard to execute these iterations concurrently.
• The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices.
• The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly.
• In each step, a processor selects the locally closest node, followed by a global reduction to select the globally closest node.
• This node is inserted into the MST, and the choice is broadcast to all processors.
• Each processor updates its part of the d vector locally.
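A rough mpi4py sketch of one such iteration (an assumed setting: each rank holds the n × (n/p) column block w_local of the adjacency matrix for its vertices, its slice d_local of the distance vector starting at global index offset, and a replicated list in_tree; the allgather of (distance, vertex) pairs stands in for the global min-reduction described above):

from mpi4py import MPI

def prim_step(comm, w_local, d_local, in_tree, offset):
    # 1. locally closest vertex not yet in the tree
    local_best = min(((d_local[j], offset + j)
                      for j in range(len(d_local)) if not in_tree[offset + j]),
                     default=(float("inf"), -1))
    # 2. global reduction: every rank learns the globally closest vertex u
    dist_u, u = min(comm.allgather(local_best))
    in_tree[u] = True
    # 3. each rank updates its part of the d vector using row u of its column block
    for j in range(len(d_local)):
        if not in_tree[offset + j]:
            d_local[j] = min(d_local[j], w_local[u][j])
    return u

# typical use inside an MPI program: u = prim_step(MPI.COMM_WORLD, w_local, d_local, in_tree, offset)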

308 Graph Algorithms

12.2 Minimum Spanning Tree: Prim’s Algorithm

• The cost to select the minimum entry is O(n/p + log p).
• The cost of a broadcast is O(log p).
• The cost of locally updating the d vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time is O(n²/p + n log p):

  Tp = Θ(n²/p) + Θ(n log p)
       (computation)  (communication)

• Since the sequential run time is W = Θ(n²):

  S = Θ(n²) / (Θ(n²/p) + Θ(n log p))
  E = 1 / (1 + Θ((p log p)/n))

309 Graph Algorithms

12.2 Minimum Spanning Tree: Prim’s Algorithm

• => for a cost-optimal parallel formulation, (p log p)/n = O(1)
• => Prim's algorithm can use only p = O(n/log n) processes in this formulation
• The corresponding isoefficiency is O(p² log² p).

310 Graph Algorithms

12.3 Single-Source Shortest Paths: Dijkstra's Algorithm

• For a weighted graph G = (V, E, w), the single-source shortest paths problem is to find the shortest paths from a vertex v ∈ V to all other vertices in V • Dijkstra’s algorithm is similar to Prim’s algorithm. It maintains a set of nodes for which the shortest paths are known

• It grows this set based on the node closest to source using one of the nodes in the current shortest path set

311 Graph Algorithms

12.3 Single-Source Shortest Paths: Dijkstra’s Algorithm

procedure DIJKSTRA_SINGLE_SOURCE_SP(V, E, w, s)
begin
  VT := {s};
  for all v ∈ (V − VT) do
    if edge (s, v) exists set l[v] := w(s, v);
    else set l[v] := ∞;
    endif
  while VT ≠ V do
  begin
    find a vertex u such that l[u] := min{ l[v] | v ∈ (V − VT) };
    VT := VT ∪ {u};
    for all v ∈ (V − VT) do
      l[v] := min{ l[v], l[u] + w(u, v) };
  endwhile
end DIJKSTRA_SINGLE_SOURCE_SP

The run time of Dijkstra's algorithm is Θ(n²).

312 Graph Algorithms

12.4 All-Pairs Shortest Paths

Parallel formulation • Very similar to the parallel formulation of Prim’s algorithm for minimum spanning trees

• The weighted adjacency matrix is partitioned using the 1-D block mapping • Each process selects, locally, the node closest to the source, followed by a global reduction to select next node

• The node is broadcast to all processors and the l -vector updated. • The parallel performance of Dijkstra’s algorithm is identical to that of Prim’s algorithm

12.4 All-Pairs Shortest Paths

• Given a weighted graph G(V, E, w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices vi , v j ∈ V . • A number of algorithms are known for solving this problem.

313 Graph Algorithms

12.4 All-Pairs Shortest Paths

Matrix-Multiplication Based Algorithm • Consider the multiplication of the weighted adjacency matrix with itself - except, in this case, we replace the multiplication operation in matrix multiplication by addition, and the addition operation by minimization

• Notice that the product of the weighted adjacency matrix with itself returns a matrix that contains, for any pair of nodes, the shortest path using at most two edges
• It follows from this argument that Aⁿ contains all shortest paths
• Aⁿ is computed by doubling powers - i.e., as A, A², A⁴, A⁸, ...


315 Graph Algorithms

12.4 All-Pairs Shortest Paths

• We need log n matrix multiplications, each taking time O(n³).
• The serial complexity of this procedure is O(n³ log n).
• This algorithm is not optimal, since the best known algorithms have complexity O(n³).

Parallel formulation

• Each of the log n matrix multiplications can be performed in parallel.
• We can use n³/log n processors to compute each matrix-matrix product in time log n.
• The entire process takes O(log² n) time.
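A minimal sequential Python sketch of the (min, +) "multiplication" and the repeated squaring (math.inf marks a missing edge; the tiny matrix only illustrates the idea):

import math

INF = math.inf

def min_plus(A, B):
    # C[i][j] = min_k (A[i][k] + B[k][j]): multiplication replaced by addition, addition by minimization
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def all_pairs_by_squaring(W):
    # square the weighted adjacency matrix repeatedly: A, A^2, A^4, A^8, ...
    n = len(W)
    D, length = [row[:] for row in W], 1
    while length < n - 1:
        D = min_plus(D, D)
        length *= 2
    return D

W = [[0, 1, INF],
     [INF, 0, 1],
     [INF, INF, 0]]
print(all_pairs_by_squaring(W))   # D[0][2] == 2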

Dijkstra’s Algorithm

316 Graph Algorithms

12.4 All-Pairs Shortest Paths

• Execute n instances of the single-source shortest path problem, one for each of the n source vertices. • Complexity is O(n3 ). Parallel formulation Two parallelization strategies - execute each of the n shortest path problems on a different processor (source partitioned), or use a parallel formulation of the shortest path problem to increase concurrency (source parallel). Dijkstra’s Algorithm: Source Partitioned Formulation

• Use n processors, each processor Pi finds the shortest paths from vertex vi to all other vertices by executing Dijkstra’s sequential single-source shortest paths algorithm.

• It requires no interprocess communication (provided that the adjacency matrix is replicated at all processes).

317 Graph Algorithms

12.4 All-Pairs Shortest Paths

• The parallel run time of this formulation is Θ(n²).
• While the algorithm is cost-optimal, it can only use n processors. Therefore, the isoefficiency due to concurrency is Θ(p³).

Dijkstra's Algorithm: Source Parallel Formulation

• In this case, each of the shortest path problems is further executed in parallel. We can therefore use up to n² processors.
• Given p processors (p > n), each single-source shortest path problem is executed by p/n processors.
• Using previous results, this takes time

  Tp = Θ(n³/p) + Θ(n log p)
       (computation)  (communication)

318 Graph Algorithms

12.4 All-Pairs Shortest Paths

• For cost optimality, we have p = O(n²/log n) and the isoefficiency is Θ((p log p)^1.5).

Floyd's Algorithm

• Let G = (V, E, w) be the weighted graph with vertices V = {v1, v2, ..., vn}.
• For any pair of vertices vi, vj ∈ V, consider all paths from vi to vj whose intermediate vertices belong to the subset {v1, v2, ..., vk} (k ≤ n). Let p^(k)_{i,j} (of weight d^(k)_{i,j}) be the minimum-weight path among them.
• If vertex vk is not in the shortest path from vi to vj, then p^(k)_{i,j} is the same as p^(k−1)_{i,j}.
• If vk is in p^(k)_{i,j}, then we can break p^(k)_{i,j} into two paths - one from vi to vk and one from vk to vj. Each of these paths uses vertices from {v1, v2, ..., v_{k−1}}.

319 Graph Algorithms

12.4 All-Pairs Shortest Paths

From our observations, the following recurrence relation follows:

  d^(k)_{i,j} = w(v_i, v_j)                                           if k = 0
  d^(k)_{i,j} = min( d^(k−1)_{i,j}, d^(k−1)_{i,k} + d^(k−1)_{k,j} )   if k ≥ 1

This equation must be computed for each pair of nodes and for k = 1, ..., n. The serial complexity is O(n³).

procedure FLOYD_ALL_PAIRS_SP(A)
begin
  D^(0) := A;
  for k := 1 to n do
    for i := 1 to n do
      for j := 1 to n do
        d^(k)_{i,j} := min( d^(k−1)_{i,j}, d^(k−1)_{i,k} + d^(k−1)_{k,j} );
end FLOYD_ALL_PAIRS_SP
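The same procedure written out as a small sequential Python function (math.inf marks a missing edge; the matrix can safely be updated in place, since row and column k do not change during iteration k):

import math

def floyd_all_pairs(a):
    n = len(a)
    d = [row[:] for row in a]            # work on a copy of the adjacency matrix
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

INF = math.inf
A = [[0, 3, INF],
     [INF, 0, 4],
     [2, INF, 0]]
print(floyd_all_pairs(A))                # shortest 0 -> 2 distance is 7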

320 Graph Algorithms

12.4 All-Pairs Shortest Paths

Parallel formulation: 2D Block Mapping

• Matrix D^(k) is divided into p blocks of size (n/√p) × (n/√p).
• Each processor updates its part of the matrix during each iteration.
• To compute d^(k)_{l,r}, processor P_{i,j} must get d^(k−1)_{l,k} and d^(k−1)_{k,r}.
• In general, during the kth iteration, each of the √p processes containing part of the kth row sends it to the √p − 1 processes in the same column.
• Similarly, each of the √p processes containing part of the kth column sends it to the √p − 1 processes in the same row.

321 Graph Algorithms

12.4 All-Pairs Shortest Paths


procedure FLOYD_2DBLOCK(D^(0))
begin
  for k := 1 to n do
  begin
    each process P_{i,j} that has a segment of the kth row of D^(k−1) broadcasts it to the P_{*,j} processes;
    each process P_{i,j} that has a segment of the kth column of D^(k−1) broadcasts it to the P_{i,*} processes;
    each process waits to receive the needed segments;
    each process P_{i,j} computes its part of the D^(k) matrix;
  end
end FLOYD_2DBLOCK

• During each iteration of the algorithm, the kth row and kth column of processors perform a one-to-all broadcast along their rows/columns.
• The size of this broadcast is n/√p elements, taking time Θ((n log p)/√p).
• The synchronization step takes time Θ(log p).

324 Graph Algorithms

12.4 All-Pairs Shortest Paths

• The computation time is Θ(n²/p).
• The parallel run time of the 2-D block mapping formulation of Floyd's algorithm is

  Tp = Θ(n³/p) + Θ((n²/√p) log p)
       (computation)  (communication)

• The above formulation can use O(n²/log² n) processors cost-optimally.
• The isoefficiency of this formulation is Θ(p^1.5 log³ p).
• This algorithm can be further improved by relaxing the strict synchronization after each iteration.

Speeding things up by pipelining

325 Graph Algorithms

12.4 All-Pairs Shortest Paths

• The synchronization step in parallel Floyd’s algorithm can be removed without affecting the correctness of the algorithm.

• A process starts working on the kth iteration as soon as it has computed the (k − 1)th iteration and has the relevant parts of the D^(k−1) matrix.

Communication protocol followed in the pipelined 2-D block mapping formulation of Floyd's algorithm: assume that process 4 at time t has just computed a segment of the kth column of the D^(k−1) matrix. It sends the segment to processes 3 and 5. These processes receive the segment at time t + 1 (where the time unit is the time it takes for a matrix segment to travel over the communication link between adjacent processes). Similarly, processes farther away from process 4 receive the segment later. Process 1 (at the boundary) does not forward the segment after receiving it.

• In each step, n/√p elements of the first row are sent from process P_{i,j} to P_{i+1,j}.

326 Graph Algorithms

12.4 All-Pairs Shortest Paths

• Similarly, elements of the first column are sent from process P_{i,j} to process P_{i,j+1}.
• Each such step takes time Θ(n/√p).
• After Θ(√p) steps, process P_{√p,√p} gets the relevant elements of the first row and first column in time Θ(n).
• The values of successive rows and columns follow after time Θ(n²/p) in a pipelined mode.
• Process P_{√p,√p} finishes its share of the shortest path computation in time Θ(n³/p) + Θ(n).
• When process P_{√p,√p} has finished the (n − 1)th iteration, it sends the relevant values of the nth row and column to the other processes.
• The overall parallel run time of this formulation is

  Tp = Θ(n³/p) + Θ(n)
       (computation)  (communication)

327 Graph Algorithms

12.4 All-Pairs Shortest Paths

• The pipelined formulation of Floyd's algorithm uses up to O(n²) processes efficiently.
• The corresponding isoefficiency is Θ(p^1.5).

All-pairs Shortest Path: Comparison

328 Graph Algorithms

12.5 Connected Components

• The connected components of an undirected graph are the equivalence classes of vertices under the “is reachable from” relation

• A graph with three connected components: {1,2,3,4}, {5,6,7}, and {8,9}:

Depth-First Search (DFS) Based Algorithm • Perform DFS on the graph to get a forest - each tree in the forest corresponds to a separate connected component

• Part (b) is a depth-first forest obtained from depth-first traversal of the graph in part (a). Each of these trees is a connected component of the graph in part (a):

329 Graph Algorithms

12.5 Connected Components

Parallel Formulation

• Partition the graph across processors and run independent connected component algorithms on each processor. At this point, we have p spanning forests. • In the second step, spanning forests are merged pairwise until only one spanning forest remains.

330 Graph Algorithms

12.5 Connected Components Computing connected components in parallel: The adjacency matrix of the graph G in (a) is partitioned into two parts (b).

Each process gets a subgraph of G ((c) and (e)).

Each process then computes the spanning forest of the subgraph ((d) and (f)). Finally, the two spanning trees are merged to form the solution.

331 Graph Algorithms

12.5 Connected Components

• To merge pairs of spanning forests efficiently, the algorithm uses disjoint sets of edges.

• We define the following operations on the disjoint sets: • find(x) ◦ returns a pointer to the representative element of the set containing x . Each set has its own unique representative.

• union(x, y) ◦ unites the sets containing the elements x and y. The two sets are assumed to be disjoint prior to the operation.

• For merging forest A into forest B, for each edge (u, v) of A, a find operation is performed to determine if the vertices are in the same tree of B. • If not, then the two trees (sets) of B containing u and v are united by a union operation.

332 Graph Algorithms

12.5 Connected Components

• Otherwise, no union operation is necessary.
• Hence, merging A and B requires at most 2(n − 1) find operations and (n − 1) union operations (a minimal sketch of these operations is given below).
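A small Python sketch of the disjoint-set operations and of merging one forest into another (hypothetical helper names; path compression is added only for efficiency):

class DisjointSets:
    def __init__(self, vertices):
        self.parent = {v: v for v in vertices}

    def find(self, x):
        # return the representative of the set containing x (with path compression)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        # unite the (disjoint) sets containing x and y
        self.parent[self.find(x)] = self.find(y)

def merge_forest(forest_a_edges, sets_b):
    # merge forest A into forest B: two finds per edge of A, one union when the
    # endpoints lie in different trees of B
    for u, v in forest_a_edges:
        if sets_b.find(u) != sets_b.find(v):
            sets_b.union(u, v)

Parallel 1-D Block Mapping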

• The n × n adjacency matrix is partitioned into p blocks. • Each processor can compute its local spanning forest in time Θ(n2 /p). • Merging is done by embedding a logical tree into the topology. There are log p merging stages, and each takes time Θ(n). Thus, the cost due to merging is Θ(n log p). • During each merging stage, spanning forests are sent between nearest neighbors. Recall that Θ(n) edges of the spanning forest are transmitted.

333 Graph Algorithms

12.5 Connected Components

• The parallel run time of the connected-component algorithm is

  Tp = Θ(n²/p) + Θ(n log p)
       (local computation)  (forest merging)

• For a cost-optimal formulation, p = O(n/log n). The corresponding isoefficiency is Θ(p² log² p).

334 Summary

Contents

1 Parallel Computing – motivation ... 11
  1.1 History of computing ... 12
  1.2 Expert's predictions ... 15
  1.3 Usage areas of a petaflop computer ... 17
  1.4 Example 1.1 ... 21
  1.5 Example 1.2 ... 24
  1.6 Example 1.3 ... 27
  1.7 Example 1.4 ... 28

2 Computer architectures and parallelism ... 29

I Parallel Architectures ... 31
  2.1 Processor development ... 31
  2.2 Memory performance effect ... 39
  2.3 Memory Bandwidth (MB) and performance ... 44
  2.4 Other approaches for hiding memory latency ... 48
  2.5 Classification of Parallel Computers ... 52
  2.6 Flynn's classification ... 58
  2.7 Type of memory access ... 66
  2.8 Communication model of parallel computers ... 68
  2.9 Other classifications ... 71
  2.10 Clos-networks ... 78
  2.11 Beowulf clusters ... 87

3 Parallel Performance ... 91
  3.1 Speedup ... 91
  3.2 Efficiency ... 92
  3.3 Amdahl's law ... 93
  3.4 Gustafson-Barsis' law ... 96
  3.5 Scaled efficiency ... 97
  3.6 Methods to increase efficiency ... 98
  3.7 Benchmarks ... 100

II Parallel Programming ... 105

4 Parallel Programming Models – an Overview ... 105
  4.1 Model of Tasks and Channels ... 106
  4.2 Message passing model ... 109
  4.3 Shared Memory model ... 110
  4.4 Implicit Parallelism ... 111
  4.1 Task parallel model ... 113
  4.2 Data parallel model ... 113

5 Shared memory programming model ... 115
  5.1 PRAM (Parallel Random Access Machine) model ... 116
  5.2 OpenMP ... 117
  5.3 OpenMP directives: PARALLEL ... 124
  5.4 OpenMP work-sharing directives: DO ... 127
  5.5 OpenMP work-sharing directives: SECTIONS ... 131
  5.6 OpenMP work-sharing directives: SINGLE ... 134
  5.7 Combining OpenMP work-sharing directives ... 135
  5.8 OpenMP: synchronization constructs ... 137
  5.9 OpenMP scoping directives ... 140
  5.10 OpenMP library routines ... 144
  5.11 OpenMP example programs ... 147

6 Fortran and HPF ... 148
  6.1 Fortran Evolution ... 148
  6.2 Concept ... 149
  6.3 PROCESSORS declaration ... 150
  6.4 DISTRIBUTE-directive ... 151
  6.5 Distribution of allocatable arrays ... 156
  6.6 HPF rule: Owner Calculates ... 157
  6.7 Scalar variables ... 158
  6.8 Examples of good DISTRIBUTE subdivision ... 159
  6.9 HPF programming methodology ... 161
  6.10 BLOCK(m) and CYCLIC(m) ... 164
  6.11 Array alignment ... 165
  6.12 Strided Alignment ... 169
  6.13 Example on Alignment ... 171
  6.14 Alignment with Allocatable Arrays ... 172
  6.15 Dimension collapsing ... 175
  6.16 Dimension replication ... 175
  6.17 HPF Intrinsic Functions ... 179
  6.18 HPF Template Syntax ... 180
  6.19 FORALL ... 184
  6.20 PURE-procedures ... 188
  6.21 INDEPENDENT ... 192
  6.22 INDEPENDENT NEW command ... 195
  6.23 EXTRINSIC ... 197

7 MPI ... 198

III Parallel Algorithms ... 199

8 Parallel Algorithm Design Principles ... 199
  8.1 Decomposition, Tasks and Dependency Graphs ... 200
  8.2 Decomposition Techniques ... 203
  8.3 Tasks and Interactions ... 205
  8.4 Mapping Techniques for Load balancing ... 207
  8.5 Parallel Algorithm Models ... 210

9 Solving systems of linear equations with sparse matrices ... 212
  9.1 EXAMPLE: 2D FEM for Poisson equation ... 212
  9.2 Solution methods for sparse matrix systems ... 229
  9.3 Preconditioning ... 231
  9.4 Domain Decomposition Method ... 233
  9.5 Domain Decomposition on Unstructured Grids ... 234

10 Parallel Algorithm Optimality Definitions ... 243
  10.1 Assessing Parallel Algorithms ... 244
  10.2 Definitions of Optimality ... 245

11 Basic Techniques in Parallel Algorithms ... 247
  11.1 Balanced Trees ... 247
  11.2 Pointer Jumping Method ... 250
  11.3 Divide and Conquer ... 255
  11.4 Partitioning ... 263
  11.5 Pipelining ... 271
  11.6 Accelerated Cascading ... 278
  11.7 Symmetry breaking ... 283

12 Parallel Algorithm Scalability ... 290
  12.1 Isoefficiency Metric of Scalability ... 290
  12.2 Cost-Optimality and the Isoefficiency Function ... 298

Scalability of Parallel Graph Algorithms
  12.1 Definitions and Representation ... 302
  12.2 Minimum Spanning Tree: Prim's Algorithm ... 305
  12.3 Single-Source Shortest Paths: Dijkstra's Algorithm ... 310
  12.4 All-Pairs Shortest Paths ... 312
  12.5 Connected Components ... 328