Parallel Programming Methodology

Parallel and Distributed Computing

Department of Computer Science and Engineering (DEI), Instituto Superior Técnico

October 3, 2013

CPD (DEI / IST)

Parallel and Distributed Computing – 6

2013-10-03

1 / 26

Outline

Parallel programming

Dependency graphs

Influence of overheads on the programming of shared- vs distributed-memory systems

Foster’s design methodology

Parallel Programming

Steps:

Identify work that can be done in parallel

Partition work and perhaps data among tasks

Manage data access, communication and synchronization

Dependency Graphs

Programs can be modeled as directed graphs:

Nodes: at the finest granularity level, nodes are individual instructions ⇒ to reduce complexity, a node may instead be an arbitrary sequence of statements

Edges: data dependency constraints among instructions in the nodes

Data Dependency Graphs

Dependency Graphs

    read(A, B);
    x = initX(A, B);
    y = initY(A, B);
    z = initZ(A, B);
    for(i = 0; i < N_ENTRIES; i++)
        x[i] = compX(x[i], y[i], z[i]);
    for(i = 1; i < N_ENTRIES; i++){
        x[i] = solveX(x[i-1]);
        z[i] = x[i] + y[i];
    }

    . . .

    . . .

    finalize1(&x, &y, &z);
    finalize2(&x, &y, &z);
    finalize3(&x, &y, &z);
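The two loops in the fragment above have different dependency structure, which the following C sketch tries to make concrete. It assumes that x, y and z are arrays of double, and the bodies of compX and solveX are hypothetical stand-ins (the slide does not define them). In the first loop, iteration i touches only index i, so any sub-range can be given to a different task; in the second loop, x[i] needs x[i-1], so the iterations form a chain.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the slide's compX and solveX. */
static double compX(double x, double y, double z) { return x + y * z; }
static double solveX(double prev) { return 0.5 * prev + 1.0; }

/* First loop: iteration i reads and writes only index i, so the
 * sub-range [lo, hi) can be processed by any task, in any order. */
void loop1_range(double *x, const double *y, const double *z,
                 size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++)
        x[i] = compX(x[i], y[i], z[i]);
}

/* Second loop: x[i] depends on x[i-1] (a loop-carried dependency),
 * so the iterations must run in increasing order of i. */
void loop2(double *x, const double *y, double *z, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        x[i] = solveX(x[i - 1]);
        z[i] = x[i] + y[i];
    }
}
```

Calling loop1_range on the two halves of the index range, in either order, gives the same x as one call over the whole range; no such split is valid for loop2 as written.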

Types of Parallelism

[Figure: three task graphs over nodes labeled A to E, one for each type of parallelism: Data Parallelism, Functional Parallelism, Pipeline Parallelism]

Overheads

Task creation/finish

Data transfer

Communication (synchronization)

Load balancing

Shared vs Distributed Memory Systems

Overheads very different depending on type of architecture!

              Start/Finish   Data   Load   Comm
Shared        H              H      =      N
Distributed   N              N      =      H

Shared vs Distributed Memory Systems

Tasks
    SM (shared memory): tasks are created more dynamically, hence they can be more fine-grained.
    DM (distributed memory): typically all tasks are active until the end, hence tasks need to be more coarse-grained.

Data
    SM: data partitioning is not an issue when defining tasks; however, take care when accessing shared data: avoid races by using mutual-exclusion regions.
    DM: data partitioning is critical for the performance of the application.

In both SM and DM:
    minimize synchronization points
    be careful about load balancing

Shared Memory Systems

Typical diagram of a parallel application under shared memory:

[Figure: Fork / Join Parallelism. Along the time axis, the master thread repeatedly forks other threads and later joins them.]

Shared Memory Systems

Application is typically a single program, with directives to handle parallelism:
    fork / join
    parallel loops
    private vs shared variables
    critical sections
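As a hedged illustration of the four kinds of directive listed above, the sketch below uses OpenMP pragmas in C. OpenMP is only one possible directive set (the deck names it as the topic of the next class), and the function count_even is a hypothetical example, not part of the slides. When compiled without OpenMP support the pragmas are ignored and the function runs serially with the same result.

```c
#include <stddef.h>

/* Counts the even values in v[0..n). Hypothetical example function. */
long count_even(const int *v, size_t n)
{
    long count = 0;              /* shared: one copy seen by all threads */

    /* fork: a team of threads starts executing this block */
    #pragma omp parallel
    {
        long local = 0;          /* private: each thread has its own copy */

        /* parallel loop: the iterations are divided among the threads */
        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            if (v[i] % 2 == 0)
                local++;

        /* critical section: one thread at a time updates the shared total */
        #pragma omp critical
        count += local;
    }   /* join: the threads synchronize and only the master continues */

    return count;
}
```

Keeping the per-thread count in a private variable and entering the critical section once per thread, rather than once per element, is one way to follow the advice to minimize synchronization points.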

Distributed Memory Systems

Cannot use fine granularity!

Each processor gets assigned a (large) task:
    static scheduling: all tasks start at the beginning of the computation
    dynamic scheduling: tasks start as needed

Application is typically also a single program!
    ⇒ the identification number of each task indicates what its job is.
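A minimal sketch of how a task's identification number can select its job when every task runs the same program. The names block_low, block_high and task_job are hypothetical, and the block formula is just one common choice; in a real distributed-memory program the id would come from the runtime rather than from a function argument.

```c
#include <stddef.h>

/* Hypothetical helpers: task `id` (0..p-1) owns indices [low, high)
 * of an n-element data set, split into p nearly equal blocks. */
size_t block_low(size_t id, size_t p, size_t n)  { return id * n / p; }
size_t block_high(size_t id, size_t p, size_t n) { return (id + 1) * n / p; }

/* The same code runs in every task; only `id` differs, and it alone
 * decides which part of the data this task works on. */
double task_job(size_t id, size_t p, const double *data, size_t n)
{
    double partial = 0.0;
    for (size_t i = block_low(id, p, n); i < block_high(id, p, n); i++)
        partial += data[i];
    return partial;
}
```

With n = 10 and p = 3, the three tasks own the index ranges [0, 3), [3, 6) and [6, 10), so block sizes differ by at most one element.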

Task / Channel Model

Parallel programming for distributed memory systems uses the Task / Channel Model: parallel computation is represented as a set of tasks that may interact with each other by sending messages through channels.

Task: program + local memory + I/O ports
Channel: message queue that connects one task’s output port with another task’s input port

All tasks start simultaneously, and the finishing time is determined by the time the last task stops its execution.

Messages in the Task / Channel Model

    ordering of data in the channel is maintained
    receiving task blocks until a value is available at the receiver
    sender never blocks, independently of previous messages not yet delivered

In the task / channel model:
    receiving is a synchronous operation
    sending is an asynchronous operation
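The channel semantics above can be sketched as an unbounded FIFO queue in C. This is a single-threaded illustration under stated assumptions, not a real message-passing implementation: the names struct channel, channel_send and channel_try_receive are hypothetical, messages are plain int values, and a receive that would block in the model is reported here by returning false.

```c
#include <stdbool.h>
#include <stdlib.h>

/* One queued message. */
struct msg { int value; struct msg *next; };

/* Channel: FIFO queue from one task's output port to another task's
 * input port. Initialize with both pointers set to NULL. */
struct channel { struct msg *head, *tail; };

/* Sending is asynchronous: the value is queued and the sender continues
 * immediately, however many earlier messages are still undelivered.
 * Returns false only if memory allocation fails. */
bool channel_send(struct channel *c, int value)
{
    struct msg *m = malloc(sizeof *m);
    if (m == NULL)
        return false;
    m->value = value;
    m->next = NULL;
    if (c->tail != NULL)
        c->tail->next = m;
    else
        c->head = m;
    c->tail = m;
    return true;
}

/* Receiving is synchronous in the model: the receiver waits for a value.
 * This sketch returns false where a real receiver would block. */
bool channel_try_receive(struct channel *c, int *out)
{
    struct msg *m = c->head;
    if (m == NULL)
        return false;
    *out = m->value;          /* oldest message first: order is preserved */
    c->head = m->next;
    if (c->head == NULL)
        c->tail = NULL;
    free(m);
    return true;
}
```

The queue is unbounded so that a send never has to wait for space, which is what lets the sender stay non-blocking in this sketch.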

Foster’s Design Methodology

Development of scalable parallel algorithms by delaying machine-dependent decisions to later stages.

Four steps:
    partitioning
    communication
    agglomeration
    mapping

Foster’s Design Methodology

[Figure: the four steps applied in sequence to the Problem: Partitioning (yielding the Primitive Tasks), Communication, Agglomeration, Mapping]

Foster’s Design Methodology: Partitioning

Partitioning: process of dividing the computation and data into many small primitive tasks.

Strategies (no single universal recipe...):
    data decomposition
    functional decomposition
    recursive decomposition

Checklist:
    more than 10 × P primitive tasks, for P processors
    minimize redundant computations and redundant data storage
    primitive tasks are roughly the same size
    number of tasks grows naturally with the problem size

Recursive Decomposition

Suitable for problems solvable using divide-and-conquer

Steps:
    decompose a problem into a set of sub-problems
    recursively decompose each sub-problem
    stop decomposition when the minimum desired granularity is reached
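The three steps can be sketched in C with a divide-and-conquer sum. The function name sum_recursive and the cutoff MIN_GRAIN are assumptions chosen for illustration; the sub-problems are solved one after the other here, but since they share no data each could be handed to a separate task.

```c
#include <stddef.h>

/* Assumed minimum granularity: below this size, stop decomposing. */
#define MIN_GRAIN 4

/* Sums v[0..n) by recursive decomposition. */
long sum_recursive(const int *v, size_t n)
{
    /* Stop when the minimum desired granularity is reached. */
    if (n <= MIN_GRAIN) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];
        return s;
    }

    /* Decompose into two sub-problems, each decomposed recursively.
     * The two calls are independent of each other. */
    size_t half = n / 2;
    long left  = sum_recursive(v, half);
    long right = sum_recursive(v + half, n - half);

    return left + right;    /* combine the sub-results */
}
```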

Data Decomposition

Appropriate data partitioning is critical to parallel performance

Steps:
    identify the data on which computations are performed
    partition the data across various tasks

Decomposition can be based on:
    input data
    output data
    input + output data
    intermediate data

Input Data Decomposition

Applicable if each output is computed as a function of the input

May be the only natural decomposition if the output is unknown
    e.g. the problem of finding the minimum in a set, or other reductions

Associate a task with each input data partition
    task performs computation on its part of the data
    subsequent processing combines partial results from earlier tasks
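A minimal C sketch of input data decomposition for the minimum-of-a-set example. The names local_min and min_by_input_decomposition are hypothetical, and the loop over t stands in for tasks that would run concurrently in a real program; each task scans only its own partition, and a final step combines the partial results.

```c
#include <limits.h>
#include <stddef.h>

/* Work of one task: minimum of its own input partition v[lo..hi).
 * Returns INT_MAX for an empty partition. */
int local_min(const int *v, size_t lo, size_t hi)
{
    int m = INT_MAX;
    for (size_t i = lo; i < hi; i++)
        if (v[i] < m)
            m = v[i];
    return m;
}

/* Splits the input into `tasks` nearly equal partitions, computes one
 * partial minimum per partition, then combines the partial results. */
int min_by_input_decomposition(const int *v, size_t n, size_t tasks)
{
    int result = INT_MAX;
    for (size_t t = 0; t < tasks; t++) {
        int partial = local_min(v, t * n / tasks, (t + 1) * n / tasks);
        if (partial < result)     /* combining step */
            result = partial;
    }
    return result;
}
```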

Output Data Decomposition

Applicable if each element of the output can be computed independently
    algorithm is based on one-to-one or many-to-one functions

Partition the output data across tasks

Have each task perform the computation for its outputs

Example: Matrix-vector multiplication
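For the matrix-vector example, one possible output decomposition assigns each task a block of rows of the result vector. The sketch below assumes a row-major matrix of double; the function name matvec_rows and its parameters are hypothetical.

```c
#include <stddef.h>

/* Computes the output elements y[lo..hi) of y = A x, where A is stored
 * row-major with n_cols columns. Each y[i] depends only on row i of A
 * and on x, so different tasks can own disjoint output ranges. */
void matvec_rows(const double *a, const double *x, double *y,
                 size_t n_cols, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n_cols; j++)
            acc += a[i * n_cols + j] * x[j];
        y[i] = acc;
    }
}
```

Every task reads all of x but writes only its own part of y, so no two tasks ever write the same location.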

Foster’s Design Methodology: Communication

Communication: identification of the communication pattern among primitive tasks.

    local communication: values shared by a small number of tasks
        draw a channel from producing task to consumer tasks
    global communication: values are required by a significant number of tasks
        while important, not useful to represent in the task/channel model

Checklist:
    communication balanced among tasks
    each task communicates with a small number of tasks
    tasks can perform their communication concurrently
    tasks can perform their computations concurrently

Foster’s Design Methodology: Agglomeration

Agglomeration: process of grouping primitive tasks into larger tasks.

Strategies:
    group tasks that have high communication with each other
    group sender tasks and group receiving tasks
    group tasks to allow re-use of sequential code

Checklist:
    locality has been maximized
    replicated computations take less time than the communications they replace
    amount of replicated data is small enough to allow algorithm to scale
    tasks are balanced in terms of computation and communication
    number of tasks grows naturally with problem size
    number of tasks is small, but at least as great as P
    cost of modifications to sequential code is minimized

Foster’s Design Methodology: Mapping

Mapping: process of assigning tasks to processors.

Strategies:
    maximize processor utilization (average % of time processors are active)
        ⇒ even load distribution
    minimize interprocessor communication
        ⇒ map tasks with channels among them to the same processor
        ⇒ take into account network topology
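To make the utilization criterion concrete, the C sketch below scores a given mapping. It rests on simplifying assumptions that are not in the slides: communication and waiting are ignored, each processor runs its tasks back to back, and the finishing time is the busy time of the most loaded processor. The function name and parameters are hypothetical.

```c
#include <stddef.h>

/* task_time[i] is the run time of task i and owner[i] the processor
 * (0..p-1) it is mapped to. Returns the average fraction of time the
 * processors are active, a value between 0 and 1. */
double utilization(const double *task_time, const size_t *owner,
                   size_t n_tasks, size_t p)
{
    double total = 0.0, longest = 0.0;

    for (size_t proc = 0; proc < p; proc++) {
        double busy = 0.0;                 /* load of this processor */
        for (size_t i = 0; i < n_tasks; i++)
            if (owner[i] == proc)
                busy += task_time[i];
        total += busy;
        if (busy > longest)
            longest = busy;                /* assumed finishing time */
    }

    if (p == 0 || longest == 0.0)
        return 0.0;
    return total / ((double)p * longest);
}
```

For three tasks with times 4, 2 and 2 on two processors, mapping the two short tasks to the same processor gives loads 4 and 4 and a utilization of 1, whereas mapping the tasks of time 4 and 2 together gives loads 6 and 2 and a utilization of 8 / 12.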

Review

Parallel programming

Dependency graphs

Influence of overheads on the programming of shared- vs distributed-memory systems

Foster’s design methodology

Next Class

OpenMP
